---
title: "Getting Started with autoFlagR"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with autoFlagR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## Introduction

`autoFlagR` is an R package for automated data quality auditing with unsupervised machine learning. It scores and flags anomalous records, is designed primarily for Electronic Health Records (EHR) data, and includes benchmarking capabilities for validation and publication.

## Installation

Install the package from CRAN:

```{r eval=FALSE}
install.packages("autoFlagR")
```

## Basic Workflow

The core workflow consists of three main steps; the walkthrough below also covers loading the package and generating a report:

1. **Preprocess** your data
2. **Score** anomalies using AI algorithms
3. **Flag** top anomalies for review

### Step 1: Load the Package

```{r}
library(autoFlagR)
library(dplyr)
```

### Step 2: Prepare Your Data

The `prep_for_anomaly()` function automatically handles:

- Identifier columns (patient_id, encounter_id, etc.)
- Missing value imputation
- Numerical feature scaling (MAD or min-max)
- Categorical variable encoding (one-hot)

```{r}
# Example healthcare data (seed set so the simulated values are reproducible)
set.seed(123)
data <- data.frame(
  patient_id = 1:200,
  age = rnorm(200, 50, 15),
  cost = rnorm(200, 10000, 5000),
  length_of_stay = rpois(200, 5),
  gender = sample(c("M", "F"), 200, replace = TRUE),
  diagnosis = sample(c("A", "B", "C"), 200, replace = TRUE)
)

# Introduce some anomalies
data$cost[1:5] <- data$cost[1:5] * 20  # Unusually high costs
data$age[6:8] <- c(200, 180, 190)  # Impossible ages

# Prepare data for anomaly detection
prepared <- prep_for_anomaly(data, id_cols = "patient_id")
```
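
To give a sense of what the scaling and encoding steps listed above involve, here is a minimal base R sketch on a toy data frame. The names `toy`, `mad_scale()`, and `minmax_scale()` are hypothetical and introduced only for illustration; this is not the package's internal implementation, and `prep_for_anomaly()` may differ in its details.

```{r}
# Toy data with hypothetical names; not part of autoFlagR
toy <- data.frame(
  cost = c(100, 120, 110, 130, 5000),
  gender = factor(c("M", "F", "F", "M", "F"))
)

# Robust scaling: centre on the median and divide by the MAD
# (assumes the MAD is non-zero, otherwise this divides by zero)
mad_scale <- function(x) (x - median(x)) / mad(x)

# Min-max scaling: map values onto the interval [0, 1]
minmax_scale <- function(x) (x - min(x)) / (max(x) - min(x))

# One-hot encoding: one indicator column per factor level (no intercept)
one_hot <- model.matrix(~ gender - 1, data = toy)

toy_prepared <- cbind(
  cost_mad = mad_scale(toy$cost),
  cost_minmax = minmax_scale(toy$cost),
  one_hot
)
toy_prepared
```

The extreme cost value has little influence on the median and MAD, so the MAD-scaled column keeps the typical values on a comparable scale, whereas min-max scaling compresses them towards zero.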

### Step 3: Score Anomalies

Use either Isolation Forest (default) or Local Outlier Factor (LOF):

```{r}
# Score anomalies using Isolation Forest
scored_data <- score_anomaly(
  data,
  method = "iforest", 
  contamination = 0.05
)

# View anomaly scores
head(scored_data[, c("patient_id", "anomaly_score")], 10)
```

### Step 4: Flag Top Anomalies

Flag records as anomalous based on either a score threshold or a contamination rate (the expected proportion of anomalous records):

```{r}
# Flag top anomalies
flagged_data <- flag_top_anomalies(
  scored_data,
  contamination = 0.05
)

# View flagged anomalies
anomalies <- flagged_data[flagged_data$is_anomaly, ]
head(anomalies[, c("patient_id", "anomaly_score", "is_anomaly")], 10)
```
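
One common way to turn a contamination rate into a cutoff is to take the matching upper quantile of the scores. The sketch below illustrates that idea in base R with made-up scores; the `example_*` names are hypothetical, and `flag_top_anomalies()` is not guaranteed to compute its threshold this way.

```{r}
# Hypothetical anomaly scores; higher means more anomalous
example_scores <- c(0.41, 0.39, 0.44, 0.92, 0.40, 0.43, 0.38, 0.88, 0.42, 0.45)
example_contamination <- 0.2

# Threshold at the (1 - contamination) quantile of the scores
example_threshold <- quantile(
  example_scores,
  probs = 1 - example_contamination,
  names = FALSE
)

# Records scoring above the threshold are flagged
example_flags <- example_scores > example_threshold
which(example_flags)
```

With ten records and a contamination rate of 0.2, two records end up flagged, which is the behaviour the rate is meant to express.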

### Step 5: Generate Audit Report

Generate comprehensive PDF, HTML, or DOCX reports:

```{r eval=FALSE}
# Generate PDF report (saves to tempdir() by default)
generate_audit_report(
  data,
  filename = "my_audit_report",
  output_dir = tempdir(),
  output_format = "pdf",
  method = "iforest",
  contamination = 0.05
)
```

## Key Features

- **Automated Preprocessing**: Handles identifiers, scales numerical features, and encodes categorical variables
- **Multiple AI Algorithms**: Supports Isolation Forest and Local Outlier Factor (LOF) methods
- **Benchmarking Metrics**: Calculates AUC-ROC, AUC-PR, and Top-K Recall when ground truth labels are available
- **Professional Reports**: Generates PDF/HTML/DOCX reports with visualizations and prioritized audit listings
- **Tidy Interface**: Designed to work seamlessly with the tidyverse
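
To make the benchmarking metrics listed above concrete, the following base R sketch computes Top-K Recall and AUC-ROC by hand on made-up labels and scores. The helpers `top_k_recall()` and `auc_roc()` are hypothetical names written for this illustration (assuming untied scores and binary 0/1 labels); they are not the package's own functions.

```{r}
# Hypothetical ground truth (1 = known error) and anomaly scores
truth <- c(0, 0, 1, 0, 1, 0, 0, 1, 0, 0)
bench_scores <- c(0.20, 0.35, 0.90, 0.30, 0.40, 0.25, 0.80, 0.85, 0.15, 0.10)

# Top-K recall: share of true anomalies found among the K highest scores
top_k_recall <- function(scores, truth, k) {
  top_k <- order(scores, decreasing = TRUE)[seq_len(k)]
  sum(truth[top_k]) / sum(truth)
}

# AUC-ROC via the rank-sum (Mann-Whitney) identity: the probability that
# a randomly chosen anomaly scores higher than a randomly chosen normal record
auc_roc <- function(scores, truth) {
  r <- rank(scores)
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

top_k_recall(bench_scores, truth, k = 3)
auc_roc(bench_scores, truth)
```

Top-K Recall reflects how many known errors a reviewer would catch when auditing only the K highest-ranked records, while AUC-ROC summarises ranking quality across all possible thresholds.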

## Next Steps

- See the [Healthcare Example](healthcare-example.html) vignette for a detailed walkthrough
- Learn about [Benchmarking](benchmarking.html) with ground truth labels
- Explore the [Function Reference](https://vikrant31.github.io/autoFlagR/reference/index.html) for detailed documentation