| Title: | Fast Fuzzy String Joins for Data Frames |
| Version: | 0.0.1 |
| Description: | Perform fuzzy joins on data frames using approximate string matching. Implements all standard join types (inner, left, right, full, semi, anti) with support for multiple string distance metrics from the 'stringdist' package including Levenshtein, Damerau-Levenshtein, Jaro-Winkler, and Soundex. Features a high-performance 'data.table' backend with 'C++' row binding for efficient processing of large datasets. Ideal for matching misspellings, inconsistent labels, messy user input, or reconciling datasets with slight variations in identifiers. Optionally returns distance metrics alongside matched records. |
| License: | MIT + file LICENSE |
| Depends: | R (≥ 4.1) |
| Imports: | data.table, Rcpp, stringdist |
| LinkingTo: | Rcpp |
| Suggests: | dplyr, ggplot2, knitr, qdapDictionaries, readr, rmarkdown, rvest, stringr, testthat (≥ 3.0.0), tidyr |
| Config/testthat/edition: | 3 |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.3.3 |
| URL: | https://github.com/PaulESantos/fuzzystring, https://paulesantos.github.io/fuzzystring/ |
| BugReports: | https://github.com/PaulESantos/fuzzystring/issues |
| VignetteBuilder: | knitr |
| NeedsCompilation: | yes |
| Packaged: | 2026-02-05 01:33:47 UTC; PC |
| Author: | Paul E. Santos Andrade |
| Maintainer: | Paul E. Santos Andrade <paulefrens@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-02-08 17:00:15 UTC |
fuzzystring: Fast Fuzzy String Joins for Data Frames
Description
Perform fuzzy joins on data frames using approximate string matching. Implements all standard join types (inner, left, right, full, semi, anti) with support for multiple string distance metrics from the 'stringdist' package including Levenshtein, Damerau-Levenshtein, Jaro-Winkler, and Soundex. Features a high-performance 'data.table' backend with 'C++' row binding for efficient processing of large datasets. Ideal for matching misspellings, inconsistent labels, messy user input, or reconciling datasets with slight variations in identifiers. Optionally returns distance metrics alongside matched records.
Author(s)
Maintainer: Paul E. Santos Andrade paulefrens@gmail.com (ORCID)
Other contributors:
David Robinson admiral.david@gmail.com (aut of fuzzyjoin) [contributor]
See Also
Useful links:
Report bugs at https://github.com/PaulESantos/fuzzystring/issues
Fuzzy anti join
Description
Convenience wrapper for fuzzystring_join_backend(mode = "anti").
Usage
fstring_anti_join(x, y, by = NULL, match_fun, ...)
Arguments
x |
A |
y |
A |
by |
Columns by which to join the two tables. See
|
match_fun |
A function used to match values. It must return a logical vector (or a data.frame/data.table whose first column is logical) indicating which pairs match. For multi-column joins, you may pass a list of functions (one per column). |
... |
Additional arguments passed to the matching function(s). |
Value
Fuzzy full join
Description
Convenience wrapper for fuzzystring_join_backend(mode = "full").
Usage
fstring_full_join(x, y, by = NULL, match_fun, ...)
Arguments
x |
A |
y |
A |
by |
Columns by which to join the two tables. See
|
match_fun |
A function used to match values. It must return a logical vector (or a data.frame/data.table whose first column is logical) indicating which pairs match. For multi-column joins, you may pass a list of functions (one per column). |
... |
Additional arguments passed to the matching function(s). |
Value
Fuzzy inner join
Description
Convenience wrapper for fuzzystring_join_backend(mode = "inner").
Usage
fstring_inner_join(x, y, by = NULL, match_fun, ...)
Arguments
x |
A |
y |
A |
by |
Columns by which to join the two tables. See
|
match_fun |
A function used to match values. It must return a logical vector (or a data.frame/data.table whose first column is logical) indicating which pairs match. For multi-column joins, you may pass a list of functions (one per column). |
... |
Additional arguments passed to the matching function(s). |
Value
Fuzzy left join
Description
Convenience wrapper for fuzzystring_join_backend(mode = "left").
Usage
fstring_left_join(x, y, by = NULL, match_fun, ...)
Arguments
x |
A |
y |
A |
by |
Columns by which to join the two tables. See
|
match_fun |
A function used to match values. It must return a logical vector (or a data.frame/data.table whose first column is logical) indicating which pairs match. For multi-column joins, you may pass a list of functions (one per column). |
... |
Additional arguments passed to the matching function(s). |
Value
Fuzzy right join
Description
Convenience wrapper for fuzzystring_join_backend(mode = "right").
Usage
fstring_right_join(x, y, by = NULL, match_fun, ...)
Arguments
x |
A |
y |
A |
by |
Columns by which to join the two tables. See
|
match_fun |
A function used to match values. It must return a logical vector (or a data.frame/data.table whose first column is logical) indicating which pairs match. For multi-column joins, you may pass a list of functions (one per column). |
... |
Additional arguments passed to the matching function(s). |
Value
Fuzzy semi join
Description
Convenience wrapper for fuzzystring_join_backend(mode = "semi").
Usage
fstring_semi_join(x, y, by = NULL, match_fun, ...)
Arguments
x |
A |
y |
A |
by |
Columns by which to join the two tables. See
|
match_fun |
A function used to match values. It must return a logical vector (or a data.frame/data.table whose first column is logical) indicating which pairs match. For multi-column joins, you may pass a list of functions (one per column). |
... |
Additional arguments passed to the matching function(s). |
Value
Join two tables based on fuzzy string matching
Description
Uses stringdist::stringdist() to compute distances and a data.table-based
backend to assemble the final result. This is the main user-facing entry point
for fuzzy joins on strings.
Usage
fuzzystring_join(
  x,
  y,
  by = NULL,
  max_dist = 2,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  mode = "inner",
  ignore_case = FALSE,
  distance_col = NULL,
  ...
)
fuzzystring_inner_join(x, y, by = NULL, distance_col = NULL, ...)
fuzzystring_left_join(x, y, by = NULL, distance_col = NULL, ...)
fuzzystring_right_join(x, y, by = NULL, distance_col = NULL, ...)
fuzzystring_full_join(x, y, by = NULL, distance_col = NULL, ...)
fuzzystring_semi_join(x, y, by = NULL, distance_col = NULL, ...)
fuzzystring_anti_join(x, y, by = NULL, distance_col = NULL, ...)
Arguments
x |
A |
y |
A |
by |
Columns by which to join the two tables. You can supply a character
vector of common names (e.g. |
max_dist |
Maximum distance to use for joining. Smaller values are stricter. |
method |
Method for computing string distance, see
|
mode |
One of |
ignore_case |
Logical; if |
distance_col |
If not |
... |
Additional arguments passed to |
Details
If method = "soundex", max_dist is automatically set to 0.5,
since Soundex distance is 0 (match) or 1 (no match).
For Levenshtein-like methods ("osa", "lv", "dl"), a fast
prefilter is applied: if abs(nchar(v1) - nchar(v2)) > max_dist, the pair
cannot match, so distance is not computed for that pair.
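The two rules above can be illustrated with a small base R sketch. It is not the package's implementation: `lev_match_prefiltered` is a hypothetical helper, and `utils::adist()` stands in for `stringdist::stringdist()` so that the example runs without extra packages. The Soundex distances at the end are made-up 0/1 values used only to show the effect of the 0.5 threshold.

```r
# Hypothetical helper (not part of fuzzystring): Levenshtein matching
# with a length prefilter, using base R's adist() as the distance.
lev_match_prefiltered <- function(v1, v2, max_dist = 2) {
  stopifnot(length(v1) == length(v2))
  # Pairs whose lengths differ by more than max_dist can never match,
  # so they are excluded before any distance is computed.
  candidate <- abs(nchar(v1) - nchar(v2)) <= max_dist
  matched <- logical(length(v1))
  if (any(candidate)) {
    # Compute the edit distance only for the surviving pairs.
    d <- mapply(
      function(a, b) utils::adist(a, b)[1, 1],
      v1[candidate], v2[candidate],
      USE.NAMES = FALSE
    )
    matched[candidate] <- d <= max_dist
  }
  matched
}

v1 <- c("Ideal", "Premium", "Good")
v2 <- c("Idea", "Premiom", "Very Good")
res <- lev_match_prefiltered(v1, v2, max_dist = 1)

# Soundex-style distances are binary, so any threshold strictly between
# 0 and 1 (such as 0.5) keeps exactly the pairs with distance 0.
soundex_dist <- c(0, 1, 0)  # illustrative values, not computed here
soundex_match <- soundex_dist <= 0.5
```

In this sketch the third pair is rejected by the length check alone (4 versus 9 characters with `max_dist = 1`), so no distance is computed for it.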
Value
A joined table (same container type as x). See
fuzzystring_join_backend for details on output structure.
Examples
if (requireNamespace("ggplot2", quietly = TRUE)) {
  d <- data.table::data.table(approximate_name = c("Idea", "Premiom"))
  # Match diamonds$cut to d$approximate_name
  res <- fuzzystring_inner_join(ggplot2::diamonds, d,
    by = c(cut = "approximate_name"),
    max_dist = 1
  )
  head(res)
}
Fuzzy join backend using 'data.table' + 'C++' row binding
Description
Low-level engine used by fuzzystring_join and the 'C++'-optimized
fuzzy join helpers. It builds the match index with R 'data.table' and then
assembles the result using a compiled 'C++' binder for speed.
Usage
fuzzystring_join_backend(
  x,
  y,
  by = NULL,
  match_fun = NULL,
  multi_by = NULL,
  multi_match_fun = NULL,
  index_match_fun = NULL,
  mode = "inner",
  ...
)
Arguments
x |
A |
y |
A |
by |
Columns by which to join the two tables. See
|
match_fun |
A function used to match values. It must return a logical vector (or a data.frame/data.table whose first column is logical) indicating which pairs match. For multi-column joins, you may pass a list of functions (one per column). |
multi_by |
A character vector of column names used for multi-column
matching when |
multi_match_fun |
A function that receives matrices of unique values for
|
index_match_fun |
A function that receives the joined columns from
|
mode |
One of |
... |
Additional arguments passed to the matching function(s). |
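As a rough illustration of the match_fun contract, the sketch below defines a custom matcher and checks it on plain vectors. The name `match_ci` is hypothetical, and the two-vector signature `(v1, v2)` is an assumption borrowed from the fuzzyjoin convention; the only requirement stated here is that the function returns a logical vector marking which pairs match.

```r
# Hypothetical match function: case-insensitive equality after trimming
# surrounding whitespace. Assumes it is called with two equal-length
# character vectors holding the candidate pairs.
match_ci <- function(v1, v2) {
  tolower(trimws(v1)) == tolower(trimws(v2))
}

left  <- c("Apple ", "banana", "Cherry")
right <- c("apple", "BANANA", "cherries")
hits <- match_ci(left, right)

# A call along these lines would then use it in a join (not run here;
# the column names are placeholders):
# fuzzystring_join_backend(x, y, by = c(name = "label"),
#                          match_fun = match_ci, mode = "inner")
```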
Details
This function works like fuzzystring_join, but replaces the
R-based row binding with a 'C++' implementation. This provides better performance,
especially for large joins with many matches. It is intended as a backend and
does not compute distances itself; use fuzzystring_join for
string-distance based matching.
The C++ implementation handles:
- Efficient subsetting by row indices
- Proper handling of NA values in outer joins
- Type-safe column operations for all common R types
- Preservation of factor levels and attributes
- Column name conflicts with .x/.y suffixes
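To make the row-binding step concrete, here is a minimal base R sketch of what assembling a left join from a match index involves. It illustrates the idea only and is not the compiled binder: the tables, the index `idx`, and all variable names are invented for the example.

```r
x <- data.frame(id = 1:3, name = c("Ideal", "Premium", "Fair"),
                stringsAsFactors = FALSE)
y <- data.frame(name = c("Idea", "Premiom"), score = c(10, 20),
                stringsAsFactors = FALSE)

# Match index: each row pairs a row of x with a matching row of y.
idx <- data.frame(x = c(1L, 2L), y = c(1L, 2L))

# Left join: unmatched rows of x are kept and paired with an NA index.
unmatched <- setdiff(seq_len(nrow(x)), idx$x)
idx_left <- rbind(
  idx,
  data.frame(x = unmatched, y = rep(NA_integer_, length(unmatched)))
)
idx_left <- idx_left[order(idx_left$x), ]

# Subset both tables by row index; an NA index yields a row of NAs.
left  <- x[idx_left$x, , drop = FALSE]
right <- y[idx_left$y, , drop = FALSE]
rownames(left)  <- NULL
rownames(right) <- NULL

# Resolve column name conflicts with .x / .y suffixes.
common <- intersect(names(left), names(right))
names(left)[names(left) %in% common] <-
  paste0(names(left)[names(left) %in% common], ".x")
names(right)[names(right) %in% common] <-
  paste0(names(right)[names(right) %in% common], ".y")

out <- cbind(left, right)
```

The unmatched row of `x` ("Fair") survives with NA in the columns that come from `y`, which is the behavior an outer join needs.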
Value
A joined table (same container type as x). See
fuzzystring_join.
A corpus of common misspellings, for examples and practice
Description
This is a tbl_df mapping common misspellings to their correct words, compiled by
Wikipedia, where it is licensed under the CC BY-SA license. (Three words with
non-ASCII characters were filtered out.) To reproduce this
dataset from Wikipedia, see the example code below.
Usage
misspellings
Format
An object of class tbl_df (inherits from tbl, data.frame) with 4505 rows and 2 columns.
Source
https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines
Examples
library(rvest)
library(readr)
library(dplyr)
library(stringr)
library(tidyr)

u <- "https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines"
h <- read_html(u)

misspellings <- h %>%
  html_nodes("pre") %>%
  html_text() %>%
  # Entries have the form "misspelling->correct"; split on ">"
  read_delim(col_names = c("misspelling", "correct"),
             delim = ">",
             skip = 1) %>%
  # Drop the trailing "-" left over from the "->" separator
  mutate(misspelling = str_sub(misspelling, 1, -2)) %>%
  # One row per correction when several are listed
  separate_rows(correct, sep = ", ") %>%
  # Drop corrections marked as UTF-8 (non-ASCII)
  filter(Encoding(correct) != "UTF-8")