---
title: "Getting Started with fuzzystring"
author: "Paul Efren Santos Andrade"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with fuzzystring}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(fuzzystring)
```

## Introduction

**fuzzystring** provides fast, flexible fuzzy string joins for data frames using approximate string matching. Built on top of `data.table` and `stringdist`, it's designed for efficiently merging datasets where exact matches aren't possible due to misspellings, inconsistent formatting, or slight variations in text.

## Installation

You can install the development version of **fuzzystring** from GitHub:

```r
# Using pak (recommended)
# pak::pak("PaulESantos/fuzzystring")

# Or using remotes
# remotes::install_github("PaulESantos/fuzzystring")
```

## Quick Start

Here's a simple example matching diamond cuts with slight misspellings:

```{r quick-start}
# Your messy data
x <- data.frame(
  name = c("Idea", "Premiom", "Very Good"),
  id = 1:3
)

# Reference data
y <- data.frame(
  approx_name = c("Ideal", "Premium", "VeryGood"),
  grp = c("A", "B", "C")
)

# Fuzzy join with max distance of 2 edits
fuzzystring_inner_join(
  x, y,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
```

## Key Features

### All Join Types Supported

**fuzzystring** supports all standard join types. Below is a small, reusable example dataset so you can compare the behavior of each join family.

```{r join-datasets}
x_join <- data.frame(
  name = c("Idea", "Premiom", "Very Good", "Gooood"),
  id = 1:4
)

y_join <- data.frame(
  approx_name = c("Ideal", "Premium", "VeryGood", "Good"),
  grp = c("A", "B", "C", "D")
)
```

- `fuzzystring_inner_join()`: Only matching rows.
- `fuzzystring_left_join()`: All rows from `x`, matching rows from `y`.
- `fuzzystring_right_join()`: All rows from `y`, matching rows from `x`.
- `fuzzystring_full_join()`: All rows from both tables.
- `fuzzystring_semi_join()`: Rows from `x` that have a match in `y`.
- `fuzzystring_anti_join()`: Rows from `x` that don't have a match in `y`.

#### Inner join

```{r join-inner, eval = TRUE}
fuzzystring_inner_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
```

#### Left join

```{r join-left, eval = TRUE}
fuzzystring_left_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
```

#### Right join

```{r join-right, eval = TRUE}
fuzzystring_right_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
```

#### Full join

```{r join-full, eval = TRUE}
fuzzystring_full_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
```

#### Semi join (rows from `x` with a match in `y`)

```{r join-semi, eval = TRUE}
fuzzystring_semi_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2
)
```

#### Anti join (rows from `x` without a match in `y`)

```{r join-anti, eval = TRUE}
fuzzystring_anti_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2
)
```

#### Using the generic `fuzzystring_join()`

If you prefer a single entry point, you can use `fuzzystring_join()` directly by specifying `mode`.

```{r join-generic, eval = TRUE}
fuzzystring_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2,
  mode = "left",
  distance_col = "distance"
)
```

### Multiple Distance Methods

You can choose from various distance metrics provided by the `stringdist` package. Edit-based metrics such as `"osa"` and `"dl"` count edits, while `"jw"` returns values between 0 and 1 and `"soundex"` returns 0 or 1, so choose `max_dist` on the scale of the metric you use:

```{r distance-methods, eval = FALSE}
# Optimal String Alignment (default)
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "osa")

# Damerau-Levenshtein
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "dl")

# Jaro-Winkler (good for names); distances lie in [0, 1],
# so use a fractional max_dist (0.2 here is only illustrative)
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "jw", max_dist = 0.2)

# Soundex (phonetic matching); distance is 0 for equal codes, 1 otherwise
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "soundex", max_dist = 0)
```

### Case-Insensitive Matching

Use `ignore_case = TRUE` to ignore capitalization:

```{r ignore-case, eval = FALSE}
fuzzystring_inner_join(
  x, y,
  by = c(name = "approx_name"),
  ignore_case = TRUE,
  max_dist = 1
)
```

## Advanced Usage

### Multiple Column Joins

You can match on multiple columns using a different matching function for each. The example below assumes that `x` also has a numeric `value` column and `y` a numeric `approx_value` column, which the Quick Start data does not include:

```{r multi-column, eval = FALSE}
fuzzystring_inner_join(
  x, y,
  by = c(name = "approx_name", value = "approx_value"),
  match_fun = list(
    # Names match when they are at most one edit apart
    name = function(x, y) stringdist::stringdist(x, y) <= 1,
    # Values match when they differ by less than 0.5
    value = function(x, y) abs(x - y) < 0.5
  )
)
```

## Performance

**fuzzystring** uses a C++ implementation for row binding combined with a `data.table` backend for fast performance on large datasets. It is optimized for memory efficiency and type safety.
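
To see why an efficient backend matters, it can help to look at what a fuzzy join does conceptually: every value in `x` is compared with every value in `y`, so the work grows with `nrow(x) * nrow(y)`. The following is a minimal base R sketch of that naive approach, not the package's implementation. The helper name `naive_fuzzy_inner_join()` is hypothetical, and it uses `utils::adist()` (generalized Levenshtein distance) rather than the OSA default mentioned above, so distances can differ when transpositions are involved.

```r
# Naive fuzzy inner join in base R (illustrative only, not fuzzystring's code).
x <- data.frame(name = c("Idea", "Premiom", "Very Good"), id = 1:3)
y <- data.frame(approx_name = c("Ideal", "Premium", "VeryGood"),
                grp = c("A", "B", "C"))

# Hypothetical helper: compares all pairs and keeps those within max_dist.
naive_fuzzy_inner_join <- function(x, y, x_col, y_col, max_dist = 2) {
  # Full nrow(x) by nrow(y) matrix of generalized Levenshtein distances
  d <- utils::adist(x[[x_col]], y[[y_col]])

  # (row, col) index pairs whose distance is within the threshold
  hits <- which(d <= max_dist, arr.ind = TRUE)
  hits <- hits[order(hits[, "row"], hits[, "col"]), , drop = FALSE]

  # Bind the matching rows of x and y side by side
  out <- cbind(
    x[hits[, "row"], , drop = FALSE],
    y[hits[, "col"], , drop = FALSE]
  )
  out$distance <- d[hits]
  rownames(out) <- NULL
  out
}

result <- naive_fuzzy_inner_join(x, y, "name", "approx_name", max_dist = 2)
print(result)
```

Because the distance matrix has one cell per pair of rows, this approach becomes expensive quickly as either table grows, which is the cost an optimized implementation aims to reduce.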