leakR

Welcome to leakR, an R package designed to help researchers, data scientists, and machine learning practitioners rigorously detect and diagnose data leakage in their workflows.

Data leakage is a pervasive yet often overlooked issue. It undermines the integrity and reproducibility of predictive models by allowing unintended information to “leak” between the training and testing phases. leakR provides a modular, extensible toolkit for detecting the most common and impactful forms of leakage. It starts with tabular data contamination, target leakage, and temporal misalignment, and lays the foundation for a universal leakage detection framework across diverse data domains.
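To make the three leakage types concrete, here is a minimal base-R sketch of what each one can look like in a small dataset. It does not use leakR and is not a description of how leakR detects these problems. The data frames, column names (`id`, `date`, `x`, `target`, `leaky`), and checks are all hypothetical and chosen only for illustration.

```r
# Illustrative base-R checks only; this is NOT leakR's implementation.
# All data and column names below are hypothetical.

train <- data.frame(
  id     = 1:5,
  date   = as.Date("2024-01-01") + c(0, 10, 20, 30, 40),
  x      = c(1.2, 3.4, 5.6, 7.8, 9.0),
  target = c(0, 1, 0, 1, 1)
)
test <- data.frame(
  id     = c(6, 7, 3),
  date   = as.Date("2024-01-01") + c(35, 50, 20),
  x      = c(2.1, 4.3, 5.6),
  target = c(0, 1, 0)
)

# 1. Train/test contamination: rows whose feature values occur in both sets.
#    merge() performs an inner join on the shared columns.
features <- c("date", "x")
shared   <- merge(train[features], test[features])
n_shared <- nrow(shared)

# 2. Target leakage: a feature that is derived from the target itself
#    correlates (almost) perfectly with it.
train$leaky <- train$target * 10
leak_cor    <- abs(cor(train$leaky, train$target))

# 3. Temporal misalignment: test observations dated no later than the
#    most recent training observation.
n_early <- sum(test$date <= max(train$date))

cat("Rows shared by train and test:", n_shared, "\n")
cat("Absolute correlation of 'leaky' with target:", leak_cor, "\n")
cat("Test rows not after the last training date:", n_early, "\n")
```

Checks like these are deliberately simplistic. Exact-match joins miss near-duplicates, and a correlation threshold misses non-linear relationships, which is part of the motivation for a dedicated auditing tool.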

Installation

From CRAN

install.packages("leakr")

From GitHub (Development Version)

For the latest features and bug fixes:

# Install devtools only if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install leakR from GitHub
devtools::install_github("cherylisabella/leakR")

Quick Start

library(leakr)

# Basic audit of your dataset
report <- leakr_audit(iris, target = "Species")

# View summary of issues found
leakr_summarise(report)

# Generate diagnostic visualizations
leakr_plot(report)

# Access detailed results
print(report)

Main Functions

Function                  Purpose
leakr_audit()             Main auditing function - detects leakage across your dataset
leakr_summarise()         Generate human-readable summaries of detected issues
leakr_plot()              Create diagnostic visualizations highlighting problems
leakr_from_caret()        Import and audit caret workflow objects
leakr_from_tidymodels()   Import and audit tidymodels workflow objects
leakr_from_mlr3()         Import and audit mlr3 workflow objects

Learn More

Get started with the comprehensive vignettes:

# Getting started guide
vignette("getting-started", package = "leakr")

# Advanced detection techniques
vignette("advanced-detection", package = "leakr") 

# Framework integration examples
vignette("framework-integration", package = "leakr")

Why leakR?

What leakR Detects

Key Features

Development Roadmap

Citation

If you use leakR in your research, please cite:

@Manual{leakr2025,
  title = {leakR: Data Leakage Detection Tools for Machine Learning},
  author = {Cheryl Isabella Lim},
  year = {2025},
  note = {R package version 0.1.0},
  url = {https://github.com/cherylisabella/leakR},
}

License

This project is licensed under the MIT License; see the LICENSE file for details.

leakR is currently under development. Feedback and contributions from the community are welcome!