---
date: "`r Sys.Date()`"
title: "Exploring Missingness with mde"
output: html_document
vignette: >
  %\VignetteIndexEntry{mde-missingness}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
resource_files:
  - man/figures/mde_icon_2.png
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The goal of `mde` is to ease exploration of missingness.

**Loading the package**

```{r}
library(mde)
```

To get a simple missingness report, use `na_summary`:

```{r}
na_summary(airquality)
```

To sort this summary by a given column:

```{r}
na_summary(airquality, sort_by = "percent_complete")
```

To reset (drop) row names, set `reset_rownames` to `TRUE`. This is especially useful when the row names are simply numeric and carry no additional information.

```{r reset_rownames}
na_summary(airquality, sort_by = "percent_complete", reset_rownames = TRUE)
```

To sort by `percent_missing` instead:

```{r}
na_summary(airquality, sort_by = "percent_missing")
```

To sort the above in descending order:

```{r}
na_summary(airquality, sort_by = "percent_missing", descending = TRUE)
```

To exclude certain columns from the analysis:

```{r}
na_summary(airquality, exclude_cols = c("Day", "Wind"))
```

To include or exclude columns via a regex match:

```{r}
na_summary(airquality, regex_kind = "inclusion", pattern_type = "starts_with", pattern = "O|S")
```

```{r}
na_summary(airquality, regex_kind = "exclusion", pattern_type = "regex", pattern = "^[O|S]")
```

To get this summary by group:

```{r}
test2 <- data.frame(ID = c("A", "A", "B", "A", "B"),
                    Vals = c(rep(NA, 4), "No"),
                    ID2 = c("E", "E", "D", "E", "D"))
na_summary(test2, grouping_cols = c("ID", "ID2"))
```

```{r}
na_summary(test2, grouping_cols = "ID")
```

* `get_na_counts`

This provides a convenient way to show the number of missing values column-wise. It is relatively fast (in tests on about 400,000 rows, it took a few microseconds).

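
For comparison, column-wise missing counts can also be sketched in base R. This is only an illustration of the idea, not how `mde` computes its results, and the names `na_flags`, `na_counts`, and `na_percent` are placeholders chosen for this example.

```{r}
# Base R sketch for comparison; this is not mde's implementation.
na_flags <- is.na(airquality)                      # logical matrix: TRUE where a value is missing
na_counts <- colSums(na_flags)                     # number of missing values per column
na_percent <- round(100 * colMeans(na_flags), 2)   # percentage of missing values per column
na_counts
na_percent
```

The `mde` functions described here add conveniences on top of this idea, such as grouping and column exclusion.
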
To get the number of missing values in each column of `airquality`, we can use the function as follows:

```{r}
get_na_counts(airquality)
```

The above is less useful if one would like to get the results by group. In that case, one can provide a character vector of column names in `grouping_cols`.

```{r}
test <- structure(list(Subject = structure(c(1L, 1L, 2L, 2L),
                                           .Label = c("A", "B"),
                                           class = "factor"),
                       res = c(NA, 1, 2, 3),
                       ID = structure(c(1L, 1L, 2L, 2L),
                                      .Label = c("1", "2"),
                                      class = "factor")),
                  class = "data.frame",
                  row.names = c(NA, -4L))

get_na_counts(test, grouping_cols = "ID")
```

* `percent_missing`

This is a simple and quick way to look at the percentage of data that is missing column-wise.

```{r}
percent_missing(airquality)
```

We can get the results by group by providing an optional `grouping_cols` character vector.

```{r}
percent_missing(test, grouping_cols = "Subject")
```

To exclude some columns from the above exploration, one can provide an optional character vector in `exclude_cols`.

```{r}
percent_missing(airquality, exclude_cols = c("Day", "Temp"))
```

* `sort_by_missingness`

This provides a simple but relatively fast way to sort variables by missingness. Unless otherwise stated, it does not currently support arranging grouped percents.

Usage:

```{r}
sort_by_missingness(airquality, sort_by = "counts")
```

To sort in descending order:

```{r}
sort_by_missingness(airquality, sort_by = "counts", descend = TRUE)
```

To use percentages instead:

```{r}
sort_by_missingness(airquality, sort_by = "percents")
```

Please note that the `mde` project is released with a [Contributor Code of Conduct](https://github.com/Nelson-Gon/mde/blob/master/.github/CODE_OF_CONDUCT.md). By contributing to this project, you agree to abide by its terms.

For further exploration, please run `browseVignettes("mde")`.

To raise an issue, please do so [here](https://github.com/Nelson-Gon/mde/issues).

Thank you, feedback is always welcome :)