---
title: "Introduction to cytominer"
author: "Allen Goodman and Shantanu Singh"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to cytominer}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE, warning=FALSE, message=FALSE}
library(magrittr)
futile.logger::flog.threshold(futile.logger::WARN)
```


Typical morphological profiling datasets have millions of cells
and hundreds of features per cells. When working with this data, you must

- clean the data

- normalize the features so that they are comparable across experiments

- transform the features so that their distributions are well-behaved (
i.e., bring them in line with assumptions we want to make about their
disributions)

- select features based on their quality

- aggregate the single-cell data, if needed

The cytominer package makes these steps fast and easy.

## Load data
First, load the data, which is stored in a database backend created using
https://github.com/cytomining/cytominer-database.

```{r}
fixture <-
  system.file("extdata", "fixture_intensities_shapes.sqlite",
              package = "cytominer")

db <- DBI::dbConnect(RSQLite::SQLite(), fixture)
  
```

Then load associated metadata if it exists, and copy it to the backend so that
we can use it later.

```{r}
ext_metadata <-
  readr::read_csv(system.file("extdata", "metadata.csv",
                              package = "cytominer")) %>%
  dplyr::rename(g_well = Well)

ext_metadata <- dplyr::copy_to(db, ext_metadata)

```

Next, select a measurement table that you want to work. Here, we will pick
`intensities` but we can easily extend to using multiple or all measurement
classes if needed by creating new views.

```{r}
intensities <-
  dplyr::tbl(src = db, "view_intensities") %>%
  dplyr::compute()

```

For this example, lets filter the data down to a few wells.

```{r}
measurements <-
  intensities %>%
  dplyr::filter(g_well %in% c("A01", "A02", "A10", "A11"))
```

How many rows does this table have?

```{r}
measurements %>%
  dplyr::tally() %>%
  knitr::kable()
```

That actually 9 times the number of cells in this experiment (n = 40): each
compartment and each channel gets its own row. Here, we have 3 compartments
(cell, nucleus, cytoplasm) and 3 channels (CellMask, Hoechst, Alexa568).
So that's 3 x 3 x 40 = 360.

Next, do some setup stuff that we will need later

```{r}
qualities <- c("q_debris")

groupings <-
  c("g_plate",
    "g_well",
    "g_image",
    "g_pattern",
    "g_channel")

variables <-
  colnames(measurements) %>%
  stringr::str_subset("^m_")

measurements %<>%
  dplyr::select(dplyr::one_of(c(groupings, qualities, variables)))
```

## Clean

Let's remove cells that come from images that were marked as having debris

```{r}
debris_removed <-
  measurements %>% dplyr::filter(q_debris == 0)
```

Then, remove cells where all the measurements are NA's - this may happen
if the identified cell mask was too small to measure any of the features.

```{r}
na_rows_removed <-
  cytominer::drop_na_rows(
    population = debris_removed,
    variables = variables
  ) %>%
  dplyr::compute()
```

## Normalize

We need to normalize the data so that

- features are on the same scale

- plate-to-plate variation is reduced

The default for doing this is `standardization`. Here, we take all the cells
from control wells in the experiment (this is where the external metadata gets
used) and compute normalizations parameters from that (in this case, just the
mean and s.d.) and then apply it to the whole dataset (i.e. the population)

```{r}
normalized <-
  cytominer::normalize(
    population = na_rows_removed %>% 
      dplyr::collect(),
    variables = variables,
    strata =  c("g_plate", "g_pattern", "g_channel"),
    sample =
      na_rows_removed %>%
      dplyr::inner_join(
        ext_metadata %>% 
          dplyr::filter(Type == "ctrl") %>% 
          dplyr::select(g_well) 
      ) %>% dplyr::collect()
  )

normalized %<>% dplyr::collect()
```

In some cases, we may have features that have no variance at all (e.g. Euler
number). If these features have not already been removed by this stage, the
standardization step will results in all values for that feature being NA (
because s.d. = 0). Lets remove them:

First, count how many cells have NA values per feature:

```{r}
na_frequency <-
  cytominer::count_na_rows(
    population = normalized,
    variables = variables)

na_frequency %>%
  tidyr::gather(feature, na_count) %>%
  knitr::kable()
```

As it turns out, no feature has NA in this example.
But lets run this cleaning operation  anyway (no features will be dropped)

```{r}
cleaned <-
  cytominer::variable_select(
    population = normalized,
    variables = variables,
    operation = "drop_na_columns"
)

variables <-
  colnames(cleaned) %>%
  stringr::str_subset("^m_")
```


## Transform

Tranform the data so that assumptions we may later make about the data
distribution are satisfied (e.g. Gaussianity). The default here is
`generalized_log`.

```{r}
transformed <-
  cytominer::transform(
    population = cleaned,
    variables = variables
  )
```

## Select features

Finally, we typically perform feature selection on the data. Feature selection is 
an expensive operation, so we usually want to train the feature selection model on 
a sample of the dataset. Here, we choose to aggregate the data instead of sampling 
it (i.e. collapse it to per-well aggregates)

```{r}
aggregated <-
  cytominer::aggregate(
    population = transformed,
    variables = variables,
    strata = groupings
  ) %>%
  dplyr::collect()

variables <-
  colnames(aggregated) %>%
  stringr::str_subset("^m_")

```

... and then apply feature selection on the per-cell data. Here 
`correlation_threshold` - a method that reduces the redundancy of features - 
is used.
```{r}
selected <-
  cytominer::variable_select(
    population = transformed,
    variables = variables,
    sample = aggregated,
    operation = "correlation_threshold"
  ) %>%
  dplyr::collect()
```

And now lets take a glimpse at the data!
```{r}
selected %>%
  dplyr::glimpse()
```


```{r}
  DBI::dbDisconnect(db)
```