--- title: "Empirical Bayes Metrics with openEBGM" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Empirical Bayes Metrics with openEBGM} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` ## Background In Bayesian statistics, the gamma distribution is the conjugate prior distribution for a Poisson likelihood. '*Conjugate*' means that the posterior distribution will follow the same general form as the prior distribution. DuMouchel (1999) used a model with a Poisson($\mu_{ij}$) likelihood for the counts (for row *i* and column *j* of the contingency table). We are interested in the ratio $\lambda_{ij}=\frac{\mu_{ij}}{E_{ij}}$, where $E_{ij}$ are the expected counts. The $\lambda_{ij}$s are considered random draws from a mixture of two gamma distributions (our prior) with hyperparameter $\theta=(\alpha_1,\beta_1,\alpha_2,\beta_2,P)$, where $P$ is the prior probability that $\lambda$ came from the first component of the prior mixture (i.e., the mixture fraction). The prior is a single distribution that models all the cells in the table; however, there is a separate posterior distribution for each cell in the table. The posterior distribution of $\lambda$, given count $N=n$, is a mixture of two gamma distributions with parameters $\theta=(\alpha_1+n,\beta_1+E,\alpha_2+n,\beta_2+E,Q_n)$ (subscripts suppressed for clarity), where $Q_n$ is the probability that $\lambda$ came from the first component of the posterior, given $N=n$ (i.e., the mixture fraction). The posterior distribution, in a sense, is a Bayesian representation of the relative reporting ratio, $RR$ (note the similarity in the equations $RR_{ij}=\frac{N_{ij}}{E_{ij}}$ and $\lambda_{ij}=\frac{\mu_{ij}}{E_{ij}}$). The Empirical Bayes (EB) metrics are taken from the posterior distribution. The Empirical Bayes Geometric Mean $(EBGM)$ is the antilog of the mean of the log~2~-transformed posterior distribution. The $EBGM$ is therefore a measure of central tendency of the posterior distribution. The 5% and 95% quantiles of the posterior distributions can be used to create two-sided 90% credibility intervals for $\lambda_{ij}$, given $N_{ij}$ (i.e, our "sort of" RR). Alternatively, since we are most interested in the lower bound, we could ignore the upper bound and create a one-sided 95% credibility interval. Due to Bayesian shrinkage (please see the **Background** section of the *Introduction to openEBGM* vignette), the EB scores are much more stable than $RR$ for small counts. ----- ## Calculating the EB-Scores Once the product/event combinations have been counted and the hyperparameters have been estimated, we can calculate the EB scores: ```{r hyperparameter estimation, warning = FALSE} library(openEBGM) data(caers) #subset of publicly available CAERS data processed <- processRaw(caers, stratify = FALSE, zeroes = FALSE) squashed <- squashData(processed) squashed2 <- squashData(squashed, count = 2, bin_size = 10, keep_pts = 50) theta_init <- data.frame(alpha1 = c(0.2, 0.1, 0.3, 0.5, 0.2), beta1 = c(0.1, 0.1, 0.5, 0.3, 0.2), alpha2 = c(2, 10, 6, 12, 5), beta2 = c(4, 10, 6, 12, 5), p = c(1/3, 0.2, 0.5, 0.8, 0.4) ) hyper_estimates <- autoHyper(squashed2, theta_init = theta_init) (theta_hat <- hyper_estimates$estimates) ``` ### `Qn()` The `Qn()` function calculates the mixture fractions for the posterior distributions. 
### `Qn()`

The `Qn()` function calculates the mixture fractions for the posterior distributions. The values returned by `Qn()` correspond to the probability that $\lambda$ came from the first component of the posterior mixture distribution, given $N=n$ (recall there is a $\lambda|N=n$ for each cell in the table, but each $\lambda$ comes from a common prior distribution). Thus, `Qn()` returns a numeric vector whose length is the total number of product-symptom combinations, which is also the number of rows in the data frame returned by `processRaw()`. When calculating the $Q_n$s, be sure to use the full data set from `processRaw()` -- not the squashed data set or the raw data.

```{r}
qn <- Qn(theta_hat, N = processed$N, E = processed$E)
head(qn)
identical(length(qn), nrow(processed))
summary(qn)
```

### `ebgm()`

The `ebgm()` function calculates the Empirical Bayes Geometric Mean $(EBGM)$ scores. $EBGM$ is a measure of central tendency of the posterior distributions, $\lambda_{ij}|N=n$. Scores much larger than one indicate product/adverse event pairs that are reported at an unusually high rate.

```{r}
processed$ebgm <- ebgm(theta_hat, N = processed$N, E = processed$E, qn = qn)
head(processed)
```

### `quantBisect()`

The `quantBisect()` function calculates quantiles of the posterior distributions using the bisection method. It can calculate any quantile between the 1^st^ and 99^th^ percentiles, and these quantiles can be used as limits for credibility intervals. Below, *QUANT_05* is the 5^th^ percentile and *QUANT_95* is the 95^th^ percentile; together they form the lower and upper bounds of two-sided 90% credibility intervals for the Empirical Bayes (EB) scores.

```{r}
processed$QUANT_05 <- quantBisect(5, theta_hat = theta_hat, N = processed$N,
                                  E = processed$E, qn = qn)
processed$QUANT_95 <- quantBisect(95, theta_hat = theta_hat, N = processed$N,
                                  E = processed$E, qn = qn)
head(processed)
```

--------

## Analysis of EB-Scores

The EB-scores ($EBGM$ and the quantile scores) can be used to look for "signals" in the data. As stated in the **Background** section of the *Introduction to openEBGM* vignette, Bayesian shrinkage makes the EB-scores far more stable than their $RR$ counterparts, which allows for better separation between signal and noise.

One could, for example, look at all product-symptom combinations where *QUANT_05* (the lower bound of the two-sided 90% credibility interval) is 2 or greater. This is often used as a conservative alternative to $EBGM$, since *QUANT_05* scores are naturally smaller than $EBGM$ scores. For a combination above this threshold, we can say with high confidence that its "true relative reporting ratio" is at least 2, so it is truly reported more often than expected. The threshold of 2 is arbitrary and depends on the context. Below is an example of how one might identify product-symptom combinations that require further investigation based on the EB-scores:

```{r}
suspicious <- processed[processed$QUANT_05 >= 2, ]
nrow(suspicious); nrow(processed); nrow(suspicious)/nrow(processed)
```

From the output above we see that less than 1% of product-symptom pairs are flagged as suspect by the *QUANT_05* threshold. One may look more closely at these combinations to ascertain which products may need further investigation. Subject matter knowledge is required to determine which signals might reflect a possible causal relationship: the EB-scores find statistical associations -- not necessarily causal relationships.
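Before ranking the flagged pairs, the stability claim above can be checked directly: for rare combinations (e.g., a count of 1), $RR$ can be very large while $EBGM$ is shrunk toward 1. A minimal check of this, assuming the *RR* column from `processRaw()` is still present in `processed`:

```{r shrinkage check}
#Sketch only: compare disproportionality scores for the rarest combinations.
#RR can be extreme when N is small, while EBGM is shrunk toward 1.
rare <- processed[processed$N == 1, ]
summary(rare[, c("RR", "ebgm")])
```

The chunk below returns to the flagged combinations, sorting them by *QUANT_05* and tabulating which products appear most often.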
```{r}
suspicious <- suspicious[order(suspicious$QUANT_05, decreasing = TRUE),
                         c("var1", "var2", "N", "E", "QUANT_05", "ebgm", "QUANT_95")]
head(suspicious, 5)

tabbed <- table(suspicious$var1)
head(tabbed[order(tabbed, decreasing = TRUE)])
```

The output above suggests some products which may require further investigation. Next, the *openEBGM Objects and Class Functions* vignette will demonstrate the object-oriented features of the *openEBGM* package.