--- title: "Empirical Bayes Metrics with openEBGM" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Empirical Bayes Metrics with openEBGM} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` ## Background In Bayesian statistics, the gamma distribution is the conjugate prior distribution for a Poisson likelihood. '*Conjugate*' means that the posterior distribution will follow the same general form as the prior distribution. DuMouchel (1999) used a model with a Poisson($\mu_{ij}$) likelihood for the counts (for row *i* and column *j* of the contingency table). We are interested in the ratio $\lambda_{ij}=\frac{\mu_{ij}}{E_{ij}}$, where $E_{ij}$ are the expected counts. The $\lambda_{ij}$s are considered random draws from a mixture of two gamma distributions (our prior) with hyperparameter $\theta=(\alpha_1,\beta_1,\alpha_2,\beta_2,P)$, where $P$ is the prior probability that $\lambda$ came from the first component of the prior mixture (i.e., the mixture fraction). The prior is a single distribution that models all the cells in the table; however, there is a separate posterior distribution for each cell in the table. The posterior distribution of $\lambda$, given count $N=n$, is a mixture of two gamma distributions with parameters $\theta=(\alpha_1+n,\beta_1+E,\alpha_2+n,\beta_2+E,Q_n)$ (subscripts suppressed for clarity), where $Q_n$ is the probability that $\lambda$ came from the first component of the posterior, given $N=n$ (i.e., the mixture fraction). The posterior distribution, in a sense, is a Bayesian representation of the relative reporting ratio, $RR$ (note the similarity in the equations $RR_{ij}=\frac{N_{ij}}{E_{ij}}$ and $\lambda_{ij}=\frac{\mu_{ij}}{E_{ij}}$). The Empirical Bayes (EB) metrics are taken from the posterior distribution. The Empirical Bayes Geometric Mean $(EBGM)$ is the antilog of the mean of the log~2~-transformed posterior distribution. The $EBGM$ is therefore a measure of central tendency of the posterior distribution. The 5% and 95% quantiles of the posterior distributions can be used to create two-sided 90% credibility intervals for $\lambda_{ij}$, given $N_{ij}$ (i.e, our "sort of" RR). Alternatively, since we are most interested in the lower bound, we could ignore the upper bound and create a one-sided 95% credibility interval. Due to Bayesian shrinkage (please see the **Background** section of the *Introduction to openEBGM* vignette), the EB scores are much more stable than $RR$ for small counts. ----- ## Calculating the EB-Scores Once the product/event combinations have been counted and the hyperparameters have been estimated, we can calculate the EB scores: ```{r hyperparameter estimation, warning = FALSE} library(openEBGM) data(caers) #subset of publicly available CAERS data processed <- processRaw(caers, stratify = FALSE, zeroes = FALSE) squashed <- squashData(processed) squashed2 <- squashData(squashed, count = 2, bin_size = 10, keep_pts = 50) theta_init <- data.frame(alpha1 = c(0.2, 0.1, 0.3, 0.5, 0.2), beta1 = c(0.1, 0.1, 0.5, 0.3, 0.2), alpha2 = c(2, 10, 6, 12, 5), beta2 = c(4, 10, 6, 12, 5), p = c(1/3, 0.2, 0.5, 0.8, 0.4) ) hyper_estimates <- autoHyper(squashed2, theta_init = theta_init) (theta_hat <- hyper_estimates$estimates) ``` ### `Qn()` The `Qn()` function calculates the mixture fractions for the posterior distributions. 
### `Qn()`

The `Qn()` function calculates the mixture fractions for the posterior distributions. The values returned by `Qn()` correspond to the probability that $\lambda$ came from the first component of the posterior mixture distribution, given $N=n$ (recall there is a $\lambda|N=n$ for each cell in the table, but each $\lambda$ comes from a common prior distribution). Thus, `Qn()` returns a numeric vector whose length is the total number of product-symptom combinations, which is also the number of rows in the data frame returned by `processRaw()`. When calculating the $Q_n$s, be sure to use the full data set from `processRaw()` -- not the squashed data set or the raw data.

```{r}
qn <- Qn(theta_hat, N = processed$N, E = processed$E)
head(qn)
identical(length(qn), nrow(processed))
summary(qn)
```

### `ebgm()`

The `ebgm()` function calculates the Empirical Bayes Geometric Mean $(EBGM)$ scores. $EBGM$ is a measure of central tendency of the posterior distributions, $\lambda_{ij}|N=n$. Scores much larger than one indicate product/adverse event pairs that are reported at an unusually high rate.

```{r}
processed$ebgm <- ebgm(theta_hat, N = processed$N, E = processed$E, qn = qn)
head(processed)
```

### `quantBisect()`

The `quantBisect()` function calculates quantiles of the posterior distributions using the bisection method. It can calculate any quantile between the 1^st^ and 99^th^ percentiles, and these quantiles can be used as limits for credibility intervals. Below, *QUANT_05* is the 5^th^ percentile and *QUANT_95* is the 95^th^ percentile; together they form the lower and upper bounds of two-sided 90% credibility intervals for the Empirical Bayes (EB) scores.

```{r}
processed$QUANT_05 <- quantBisect(5, theta_hat = theta_hat, N = processed$N,
                                  E = processed$E, qn = qn)
processed$QUANT_95 <- quantBisect(95, theta_hat = theta_hat, N = processed$N,
                                  E = processed$E, qn = qn)
head(processed)
```

--------

## Analysis of EB-Scores

The EB-scores ($EBGM$ and the quantile scores) can be used to look for "signals" in the data. As stated in the **Background** section of the *Introduction to openEBGM* vignette, Bayesian shrinkage makes the EB-scores far more stable than their $RR$ counterparts, which allows for better separation between signal and noise.

One could, for example, look at all product-symptom combinations where *QUANT_05* (the lower bound of the two-sided 90% credibility interval) is 2 or greater. This is often used as a conservative alternative to $EBGM$, since *QUANT_05* scores are naturally smaller than $EBGM$ scores. For a combination above this threshold, we can say with high confidence that its "true relative reporting ratio" is at least 2, so it is truly reported more often than expected. The threshold of 2 is arbitrary and depends on the context. Below is an example of how one might identify product-symptom combinations that require further investigation based on the EB-scores:

```{r}
suspicious <- processed[processed$QUANT_05 >= 2, ]
nrow(suspicious); nrow(processed); nrow(suspicious)/nrow(processed)
```

From the output above we see that less than 1% of product-symptom pairs are flagged as suspect by the *QUANT_05* threshold. One may look more closely at these combinations to ascertain which products may need further investigation. Subject matter knowledge is required to determine which signals might reflect a possible causal relationship: the EB-scores find statistical associations -- not necessarily causal relationships.
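Before ranking the flagged pairs, the stability claim above can be checked directly: for rare combinations (e.g., a count of 1), $RR$ can be very large while $EBGM$ is shrunk toward 1. A minimal check of this, assuming the *RR* column from `processRaw()` is still present in `processed`:

```{r shrinkage check}
#Sketch only: compare disproportionality scores for the rarest combinations.
#RR can be extreme when N is small, while EBGM is shrunk toward 1.
rare <- processed[processed$N == 1, ]
summary(rare[, c("RR", "ebgm")])
```

The chunk below returns to the flagged combinations, sorting them by *QUANT_05* and tabulating which products appear most often.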
```{r}
suspicious <- suspicious[order(suspicious$QUANT_05, decreasing = TRUE),
                         c("var1", "var2", "N", "E", "QUANT_05", "ebgm", "QUANT_95")]
head(suspicious, 5)

tabbed <- table(suspicious$var1)
head(tabbed[order(tabbed, decreasing = TRUE)])
```

The output above suggests some products which may require further investigation. Next, the *openEBGM Objects and Class Functions* vignette will demonstrate the object-oriented features of the *openEBGM* package.