Simulating data with dependentsimr

Setup

Install the required dependencies if not already installed:

# The simulator
  install.packages("remotes")
  remotes::install_github("tgbrooks/dependentsimr")

# DESeq2 - optional dependency only needed if doing RNA-seq data sets
# Needed for type = "DESeq2"
install.packages("BiocManager")
BiocManager::install("DESeq2")

# Two more dependencies that are used in some simulation modes
# Needed for method = "corpcor"
install.packages("corpcor")

# Needed for method = "spiked Wishart"
install.packages("sparsesvd")

# Extras needed for this vignette
install.packages("tidyverse")

Load the libraries:

library(dependentsimr)
suppressWarnings(suppressMessages(library(tidyverse, DESeq2)))

Set a seed for reproducibility.

set.seed(0)

Load data

Simulations with dependentsimr start with a real data set that should be mimicked. This data can be almost any shape, including continuous, or discrete. Here, we will simulate RNA-seq, which has discrete counts that we handle with a special “DESeq2” model. Other data types can use normal, Poisson, or empirical (which uses only the values present in the data set) types, or even a combination of these.

We will use the following data set, which is a trimmed-down RNA-seq data of 1000 genes on 12 samples, originating from GSE151923.

head(read_counts)
#> # A tibble: 6 × 13
#>   gene_id     Arpp19_10_3 Arpp19_4_1 Arpp19_5_1 Arpp19_5_3 Arpp19_5_4 Arpp19_6_1
#>   <chr>             <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#> 1 ENSMUSG000…        2377       2162       2169       2650       2217       2552
#> 2 ENSMUSG000…         908        799        730        941        761        780
#> 3 ENSMUSG000…         316        342        373        373        325        405
#> 4 ENSMUSG000…         584        650        466        687        551        606
#> 5 ENSMUSG000…        2739       2531       2594       2988       2614       3081
#> 6 ENSMUSG000…         807        753        746        894        806        819
#> # ℹ 6 more variables: Arpp19_7_2 <dbl>, Arpp19_9_1 <dbl>, Dach1_11_1 <dbl>,
#> #   Dach1_2_2 <dbl>, Dach1_3_1 <dbl>, Dach1_5_3 <dbl>

We first just format it as a matrix for use with dependentsimr instead of as a tibble.

count_matrix <- as.matrix(read_counts[,2:ncol(read_counts)])
rownames(count_matrix) <- read_counts$gene_id

Get Dependence Structure

The first step is to compute a ‘random structure’ for the data, which encodes how the variables (genes, in this case) in the data set are dependent upon each other. This computes a covariance matrix which captures the dependence between genes, however, that matrix is not the same as the usual sample covariance matrix. It also computes the shape (such as mean and variance) of each variable individually.

There are three implemented methods (“pca”, “spiked Wishart”, and “corpcor”) with different characteristics. We demonstrate here each.

First, the “pca” method takes a rank and includes only gene-gene dependence up to that many dimensions. Typically small values (2 or 3, for example) would be used for rank, similar to when performing PCA, unless a large number of input samples are available. Note that we put our read_counts data into a list(). This is done since get_random_structure also supports multiple data sets of different ‘modes’ (for example, proteomics and transcriptomics done on the same samples). We only have one ‘mode’ that we give the arbitrary name of data.

rs_pca <- get_random_structure(list(data=count_matrix), method="pca", rank=2, types="DESeq2")
#> converting counts to integer mode
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing

Second, the spiked Wishart method also takes a rank value but uses it differently. Typically, this will be larger, up to the total number of samples minus one, which is what we use here. For very large sample counts, rank might need to be capped well below the number of samples for computational efficiency reasons.

rs_wishart <- get_random_structure(list(data=count_matrix), rank=11, method="spiked Wishart", types="DESeq2")
#> converting counts to integer mode
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing

Third, a method based off the corpcor package is implemented. It takes no parameters but often underestimates the amount of gene-gene dependence in the data.

rs_corpcor <- get_random_structure(list(data=count_matrix), method="corpcor", types="DESeq2")
#> converting counts to integer mode
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
#> Estimating optimal shrinkage intensity lambda (correlation matrix): 0.7601 
#> Estimating optimal shrinkage intensity lambda.var (variance vector): 1

Lastly, we could generate data without gene-gene dependence for comparison by removing the dependence from any of our random structures.

rs_indep <- remove_dependence(rs_pca)

Generate Data

Real RNA-seq data sets typically have variation between samples in the total number of reads per sample. Here, we just copy the total number of reads from the real data set to get realistic values. A library_size of 1 corresponds to the average read depth of the data used to generate the random structure. This step is only needed when generating RNA-seq data with type = "DESeq2".

actual_library_sizes <- count_matrix |> apply(2, sum)

N_SAMPLES <- 6
library_sizes <- sample(actual_library_sizes / mean(actual_library_sizes), size=N_SAMPLES, replace=TRUE)

Simulate Control Data

Now we generate some ‘control’ samples, which have the same expression levels (as well as gene-gene dependence) as was estimated from the original data set. We’ll use rs_pca that we generated above for example, but any of the methods could be used. Each column is a separate simulated sample. Samples are independent of each other in this model, however genes within a sample are dependent upon each other.

control_sim_data <- draw_from_multivariate_corr(rs_pca, n_samples=N_SAMPLES, size_factors=library_sizes)$data

Note that we used $data here because that was the name we provided to get_random_structure and it was the only mode of data that we had. If multiple modes were used to generate the random structure, then we draw_from_multivariate_corr returns a list with each of them.

Simulate Differentially Expressed Data

To generate a set of ‘treatment’ samples which differ from the ‘control’ samples, we will have to modify the random structure to reflect the desired expression values of the treatment group. We do this by just choosing some genes at random and picking a fold-change for each at random.

# fraction of all genes to modify in treatment
FRAC_DE <- 0.1
# Min and max value for the log fold change (LFC) for each DE gene chosen
# non-DE genes will have a LFC of 0
LFC_MIN <- 0.5
LFC_MAX <- 4.0
N_DE <- floor(FRAC_DE * nrow(control_sim_data))

# Pick which genes are DE
de_genes <- sample(nrow(control_sim_data), N_DE)

# Pick the LFCs of each of those genes
de_lfc <- runif(N_DE, LFC_MIN, LFC_MAX) * sample(c(-1, 1), size=N_DE, replace=TRUE)

Now, we create a new random structure object and modify it to have the desired expression values. Information about the marginal distribution of each gene (e.g., mean and variance but not dependence on other genes) is stored in rs$marginals for a random structure rs. The specifics depend upon the type used in get_random_structure, but for DESeq2, the q value determines the mean expression, so we change that here.

rs_treatment <- rs_pca
rs_treatment$marginals$data$q[de_genes] <- 2^(de_lfc) * rs_treatment$marginals$data$q[de_genes]

Lastly, we generate the data for the ‘treatment’ samples.

treatment_sim_data <- draw_from_multivariate_corr(rs_treatment, n_samples=N_SAMPLES, size_factors=library_sizes)$data

Plot the generated data

To demonstrate the data, we show a PCA plot of the control and treatment groups.

sim_data <- cbind(control_sim_data, treatment_sim_data)
pca <- prcomp(t(sim_data), center=TRUE, scale.=TRUE, rank.=2)
pca_points <- predict(pca, t(sim_data))
pca_data <- tibble(x=pca_points[,1], y=pca_points[,2]) |>
    mutate(group = c(rep("control", N_SAMPLES), rep("treatment", N_SAMPLES)))
ggplot(pca_data, aes(x, y, color=group)) +
  geom_point() +
  labs(x = "PC1", y = "PC2")

As expected, the control and treatment group separate nicely, but there is also a substantial component of variation (PC1) which both groups have. This arises from the dependence between genes, which is the same in both groups in our simulation.