---
title: "Audit Trail Walkthrough"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Audit Trail Walkthrough}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(tidyaudit)
library(dplyr)
```

## What is an audit trail?

When building data pipelines, it's easy to lose track of what happens at each step. How many rows were dropped by that filter? Did the join introduce duplicates? Which columns have missing values now?

tidyaudit's audit trail captures **metadata-only snapshots** at each step of a pipe (row counts, column counts, NA totals, numeric summaries) without storing the data itself. This gives you a lightweight, structured record of your pipeline's behavior. The trail object also allows for custom functions to increase flexibility and capture domain-specific diagnostics.

## Building a basic trail

Start by creating a trail object and inserting `audit_tap()` calls into your pipeline. Each tap records a snapshot and passes the data through unchanged.

```{r basic-trail}
# Sample data
orders <- data.frame(
  id = 1:20,
  customer = rep(c("Alice", "Bob", "Carol", "Dan", "Eve"), 4),
  amount = c(150, 200, 50, 300, 75, 120, 400, 90, 250, 60,
             180, 210, 45, 320, 85, 130, 380, 95, 270, 55),
  status = rep(c("complete", "pending", "complete", "cancelled", "complete"), 4)
)

trail <- audit_trail("order_pipeline")

result <- orders |>
  audit_tap(trail, "raw") |>
  filter(status == "complete") |>
  audit_tap(trail, "complete_only") |>
  mutate(tax = amount * 0.1) |>
  audit_tap(trail, "with_tax")
```

Now print the trail to see the snapshot timeline:

```{r print-trail}
print(trail)
```

The timeline shows row counts, column counts, NA totals, and change summaries between consecutive steps. You can see exactly how many rows each filter removed and when columns were added.
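
To make "metadata-only snapshot" concrete, here is a minimal base R sketch of the kind of summary a tap could record. The helper `snapshot_meta()` and the `toy` data are hypothetical and written only for illustration; this is not tidyaudit's implementation, and the fields tidyaudit actually stores may differ.

```r
# Hypothetical helper, NOT part of tidyaudit: summarise a data frame
# without keeping a copy of the data itself.
snapshot_meta <- function(data) {
  num_cols <- names(data)[vapply(data, is.numeric, logical(1))]
  list(
    n_rows = nrow(data),
    n_cols = ncol(data),
    n_na = sum(is.na(data)),
    numeric_means = vapply(data[num_cols], mean, numeric(1), na.rm = TRUE)
  )
}

# Toy data for illustration only
toy <- data.frame(
  id = 1:4,
  amount = c(10, NA, 30, 40),
  status = c("complete", "pending", "complete", "cancelled")
)

before <- snapshot_meta(toy)
after <- snapshot_meta(toy[toy$status == "complete", ])

# Change between consecutive snapshots, computed from metadata alone
row_delta <- after$n_rows - before$n_rows
na_delta <- after$n_na - before$n_na
```

Because each snapshot holds only a handful of numbers, differences between steps can be derived later without re-running the pipeline or retaining intermediate tables.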

## Operation-aware taps

Plain `audit_tap()` records what the data looks like, but it can't tell you *why* it changed. Operation-aware taps (`left_join_tap()`, `filter_tap()`, etc.) perform the operation AND record enriched diagnostics.

### Join taps

Replace `dplyr::left_join()` + `audit_tap()` with `left_join_tap()` to capture match rates, relationship type, and duplicate key information:

```{r join-tap}
customers <- data.frame(
  customer = c("Alice", "Bob", "Carol", "Dan"),
  region = c("East", "West", "East", "North")
)

trail2 <- audit_trail("join_pipeline")

result2 <- orders |>
  audit_tap(trail2, "raw") |>
  left_join_tap(customers, by = "customer", .trail = trail2, .label = "with_region")

print(trail2)
```

The `Type` column now shows the join type, relationship, and match rate, all without leaving the pipe.

All six dplyr join types are supported: `left_join_tap()`, `right_join_tap()`, `inner_join_tap()`, `full_join_tap()`, `anti_join_tap()`, `semi_join_tap()`.

### Filter taps

`filter_tap()` keeps matching rows (like `dplyr::filter()`) while recording how many rows were dropped:

```{r filter-tap}
trail3 <- audit_trail("filter_pipeline")

result3 <- orders |>
  audit_tap(trail3, "raw") |>
  filter_tap(status == "complete", .trail = trail3, .label = "complete_only") |>
  filter_tap(amount > 100, .trail = trail3, .label = "high_value", .stat = amount)

print(trail3)
```

The `.stat` argument tracks a numeric column through the filter, reporting how much of the total was dropped, which is useful for financial pipelines where you want to know the monetary impact of each filter.

`filter_out_tap()` works the same way but drops matching rows (the inverse).

## Comparing snapshots

`audit_diff()` gives you a detailed before/after comparison between any two snapshots in the trail:

```{r audit-diff}
audit_diff(trail3, "raw", "high_value")
```

This shows row/column/NA deltas, columns added or removed, and numeric distribution shifts.
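
The "how much of the total was dropped" idea behind `.stat` in the filter taps can be checked by hand. The sketch below does the arithmetic in base R on a small made-up table (`toy` is not the vignette's `orders` data) and does not reproduce how `filter_tap()` computes or formats its report.

```r
# Toy data for illustration only
toy <- data.frame(
  id = 1:5,
  amount = c(150, 50, 300, 75, 425)
)

# The filter condition: keep rows with amount > 100
keep <- toy$amount > 100

total_before <- sum(toy$amount)             # 1000
total_kept <- sum(toy$amount[keep])         # 875
total_dropped <- total_before - total_kept  # 125

rows_dropped <- sum(!keep)                      # 2 of 5 rows
share_dropped <- total_dropped / total_before   # 0.125
```

Here 40% of the rows are removed but only 12.5% of the amount, which is the sort of gap between row impact and monetary impact that tracking a statistic through a filter is meant to expose.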

## Full audit report

`audit_report()` prints the complete trail summary plus all consecutive diffs in one call:

```{r audit-report}
audit_report(trail3)
```

## Tips

### Custom diagnostics

Pass a named list of functions via `.fns` to compute custom diagnostics at any tap:

```{r custom-fns}
trail4 <- audit_trail("custom_example")

result4 <- orders |>
  audit_tap(trail4, "raw", .fns = list(
    mean_amount = ~mean(.x$amount),
    n_customers = ~length(unique(.x$customer))
  ))

audit_report(trail4)
```

### NULL-trail mode

All tap functions work without a trail. When `.trail = NULL` (the default):

- **No diagnostic args**: behaves like the plain dplyr function
- **With `.stat` or `.warn_threshold`**: runs diagnostics and prints results without recording to a trail

```{r null-trail}
# Plain filter: no diagnostics
orders |> filter_tap(amount > 100) |> nrow()

# Diagnostics without a trail
orders |> filter_tap(amount > 100, .stat = amount) |> invisible()
```

This makes it easy to add quick diagnostics to any pipeline without setting up a full trail.
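
Returning to the custom diagnostics tip: it can help to test diagnostic functions on their own before wiring them into a tap. The sketch below writes the two diagnostics as ordinary R functions and evaluates them by hand with `lapply()` on made-up data. It does not call tidyaudit, and whether `.fns` accepts plain functions as well as the formula shorthand is an assumption this vignette does not confirm.

```r
# Plain-function versions of the two diagnostics (illustrative only)
diagnostics <- list(
  mean_amount = function(data) mean(data$amount),
  n_customers = function(data) length(unique(data$customer))
)

# Toy data for illustration only
toy <- data.frame(
  customer = c("Alice", "Bob", "Alice", "Carol"),
  amount = c(100, 200, 300, 400)
)

# Apply each diagnostic to the data; list names carry through to the results
results <- lapply(diagnostics, function(f) f(toy))
```

Each function takes the data frame and returns a single value, so a mistake in a diagnostic shows up here rather than in the middle of a pipeline.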