--- title: "Using corrr with databases" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Using corrr with databases} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} library(dplyr) library(dbplyr) library(corrr) knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` To calculate correlations with data inside databases, it is very common to import the data into R and then run the analysis. This is not a desirable path, because of the overhead created by copying the data into memory. Taking advantage of the latest features offered by `dbplyr` and `rlang`, it is now possible to run the correlation calculation **inside the database**, thus avoiding importing the data. ## An example A simple SQLite database will be used to this example. A temporary database is created, and the `mtcars` data set is copied to it. The `db_mtcars` variable is only a pointer to the new table inside the database, it does not hold any data. ```{r} con <- DBI::dbConnect(RSQLite::SQLite(), path = ":dbname:") db_mtcars <- copy_to(con, mtcars) ``` Even though it is not a formal `data.frame` object, `db_mtcars` can be use as if it was a `data.frame` and simply pipe it into the `correlate()` function. The `correlate()` function will check the type of object passed, if it is a database-backed table, meaning a `tbl_sql()` object class, then it will use the new `tidyeval` code to calculate the correlations inside the database. The function will automatically retrieve only the results of the operation. ```{r} library(dplyr) library(corrr) db_mtcars %>% correlate(quiet = TRUE) ``` The `tidyeval`-based function ensures that a `cor_df` object is returned, so then it can be used with other functions in the `corrr` package. ```{r} db_mtcars %>% correlate(quiet = TRUE) %>% rplot() ``` ## `sparklyr` For connections using `sparklyr`, `corrr` will use that package function called `ml_corr()` to run all of the correlations at the same time. That is all done under the hood. The user just needs to pass a `tbl_spark` object to the `correlate()` function, and `corrr` will automatically select the right function to run. ## Limitations When using `correlate()` with a database-backed table, please make sure to keep the following items in mind: - Only the **pearson** method is supported. It is the default method, so it is not necessary to pass it. But if a different method is chosen, then the operation will return an error. - Only pairwise complete observations are used. Meaning that the `use` argument has to be set to `pairwise.complete.obs`. - The `y` argument is not supported. This means that if 1-to-1 comparisons are needed, then pre-select the two variables prior passing it to the `correlate()` function. - The `diagonal` argument only accepts NA's.