---
title: "Getting Started with autoFlagR"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with autoFlagR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## Introduction

`autoFlagR` is an R package for automated data quality auditing using unsupervised machine learning. It provides AI-driven anomaly detection for data quality assessment. It is designed primarily for Electronic Health Records (EHR) data and includes benchmarking capabilities for validation and publication.

## Installation

Install the package from CRAN:

```{r eval=FALSE}
install.packages("autoFlagR")
```

## Basic Workflow

The core workflow consists of three main stages:

1. **Preprocess** your data
2. **Score** anomalies using AI algorithms
3. **Flag** top anomalies for review

The steps below walk through these stages, preceded by loading the package and followed by optional report generation.

### Step 1: Load the Package

```{r}
library(autoFlagR)
library(dplyr)
```

### Step 2: Prepare Your Data

The `prep_for_anomaly()` function automatically handles:

- Identifier columns (patient_id, encounter_id, etc.)
- Missing value imputation
- Numerical feature scaling (MAD or min-max)
- Categorical variable encoding (one-hot)

```{r}
# Make the simulated example reproducible
set.seed(42)

# Example healthcare data
data <- data.frame(
  patient_id = 1:200,
  age = rnorm(200, 50, 15),
  cost = rnorm(200, 10000, 5000),
  length_of_stay = rpois(200, 5),
  gender = sample(c("M", "F"), 200, replace = TRUE),
  diagnosis = sample(c("A", "B", "C"), 200, replace = TRUE)
)

# Introduce some anomalies
data$cost[1:5] <- data$cost[1:5] * 20  # Unusually high costs
data$age[6:8] <- c(200, 180, 190)      # Impossible ages

# Prepare data for anomaly detection
prepared <- prep_for_anomaly(data, id_cols = "patient_id")
```

### Step 3: Score Anomalies

Use either Isolation Forest (default) or Local Outlier Factor (LOF):

```{r}
# Score anomalies using Isolation Forest
scored_data <- score_anomaly(
  data,
  method = "iforest",
  contamination = 0.05
)

# View anomaly scores
head(scored_data[, c("patient_id", "anomaly_score")], 10)
```

### Step 4: Flag Top Anomalies

Flag records as anomalous based on a score threshold or a contamination rate:

```{r}
# Flag top anomalies
flagged_data <- flag_top_anomalies(
  scored_data,
  contamination = 0.05
)

# View flagged anomalies
anomalies <- flagged_data[flagged_data$is_anomaly, ]
head(anomalies[, c("patient_id", "anomaly_score", "is_anomaly")], 10)
```

### Step 5: Generate Audit Report

Generate comprehensive PDF, HTML, or DOCX reports:

```{r eval=FALSE}
# Generate PDF report (saves to tempdir() by default)
generate_audit_report(
  data,
  filename = "my_audit_report",
  output_dir = tempdir(),
  output_format = "pdf",
  method = "iforest",
  contamination = 0.05
)
```

## Key Features

- **Automated Preprocessing**: Handles identifiers, scales numerical features, and encodes categorical variables
- **Multiple AI Algorithms**: Supports Isolation Forest and Local Outlier Factor (LOF) methods
- **Benchmarking Metrics**: Calculates AUC-ROC, AUC-PR, and Top-K Recall when ground truth labels are available
- **Professional Reports**: Generates PDF/HTML/DOCX reports with
  visualizations and prioritized audit listings
- **Tidy Interface**: Designed to work seamlessly with the tidyverse

## Next Steps

- See the [Healthcare Example](healthcare-example.html) vignette for a detailed walkthrough
- Learn about [Benchmarking](benchmarking.html) with ground truth labels
- Explore the [Function Reference](https://vikrant31.github.io/autoFlagR/reference/index.html) for detailed documentation
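
For readers curious about what the preprocessing described in Step 2 involves conceptually, the chunk below is a minimal base-R sketch of MAD scaling and one-hot encoding on a toy data frame. It is an illustration under stated assumptions, not the package's internal implementation: the `toy` data, the `mad_scale()` helper, and the column names are hypothetical and exist only for this example.

```{r}
# Illustrative only: NOT how prep_for_anomaly() is implemented internally.
# `toy` and `mad_scale()` are hypothetical names used for this sketch.
toy <- data.frame(
  cost = c(100, 120, 110, 5000),
  gender = c("M", "F", "F", "M")
)

# Robust (MAD) scaling: centre on the median and divide by the
# median absolute deviation, so one extreme value does not
# dominate the scale the way it would with mean/sd scaling
mad_scale <- function(x) (x - median(x)) / mad(x)
toy$cost_scaled <- mad_scale(toy$cost)

# One-hot encoding: one indicator column per category level
one_hot <- model.matrix(~ gender - 1, data = toy)

# Combine the scaled numeric feature with the indicator columns
cbind(toy["cost_scaled"], one_hot)
```

In this toy example the extreme `cost` value remains far from the other rows after scaling, which is the property that makes robust scaling a common choice ahead of anomaly scoring.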