---
title: 'netgsa: Network-based Gene Set Analysis'
author: "Jing Ma, Michael Hellstern"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to netgsa}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
bibliography: bibli.bib

---

```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction

In this vignette, we demonstrate the NetGSA workflow using a subset of the breast cancer data example downloaded from the Cancer Genome Atlas [@cancer2012comprehensive]. In particular, we illustrate how to incorporate existing network information (e.g. curated edges from KEGG) to improve the power of pathway enrichment analysis with NetGSA. Details of the method are avaialble in @ma2016network.

`netgsa` provides simple functions that automatically search known gene databases using `graphite` for network information and integrate seamlessly with `NetGSA` pathway enrichment and`igraph` and Cytoscape plotting. In this vignette, we demonstrate the NetGSA workflow explaining proper data types and usage of the `netgsa` package.

# Data: Breast cancer study (2012)
Our example data set comes from a breast cancer study [@cancer2012comprehensive], which consists of gene expression data from 520 subjects including 117 estrogen-receptor-negative (ER-) and 403 estrogen-receptor-positive (ER+). The goal is to generate diagnostic pathway signatures that could distinguish patients with different ER statuses by comparing gene expression data from the two classes. These signatures could provide additional clinical benefit in diagnosing breast cancer.


## Step 1: Data set-up

NetGSA works directly with the expression data matrix. When loading the data, it is important to check the distribution of raw sequencing reads and perform log transformation if necessary. Data in this example were already log transformed. Rows of the data matrix correspond to genes and columns to subjects. Genes in this data matrix were labeled with Entrez IDs, same as those used in KEGG pathways. 


```{r data, warning = FALSE, message = FALSE}
library(netgsa)
library(graphite)
library(data.table)
data("breastcancer2012_subset")
ls()
```


The objects loaded which are necessary for this vignette are:

* `x`, the data matrix
* `edgelist`, a data.frame containing user-specified edges
* `nonedgelist`, a data.frame containing user-specified non-edges
* `group`, a vector mapping columns of `x` to conditions
* `pathways`, a list of KEGG pathways,
* `pathways_mat`, an indicator matrix of KEGG pathways to be used directly in `NetGSA`

### Data matrix, `x`

The data matrix, `x` must have rownames that are named as  `"GENE_ID:GENE_VALUE"`. Since  `netgsa`  allows the user to search for edges in multiple databases in `graphite`, each of which may use different identifiers (e.g. KEGG, Reactome, Biocarta, etc.), `netgsa` needs to know the identifier for a given gene so we can convert it properly. For example, if you have two genes "ENTREZID: 7186" and "ENTREZID: 329" and want to search for potential edges between these two within Reactome,  `netgsa` must first convert these genes to their "UNIPROT" IDs and use those to search for edges.

`netgsa` uses the  `AnnotationDbi`  package to convert genes. The valid list of  `"GENE_ID"`'s is:
```{r}
AnnotationDbi::keytypes(org.Hs.eg.db::org.Hs.eg.db)
```

<!-- For metabolites, `netgsa` uses `graphite`'s built in metabolite mapping. These are: -->
<!-- ```{r} -->
<!-- colnames(graphite:::metabolites()) -->
<!-- ``` -->

An example of what the rownames for the data matrix, `x`, should look like is:
```{r}
head(rownames(x))
```

The columns of the data matrix do not need to be named, but it is useful for keeping track of groups. 

### Group vector

The `group` object is a vector (numeric or character) and must be the same length as `ncol(x)`. Each element of the `group` tells us which group each column of `x` corresponds to. For example, if `group[1] = "Positive"` this says that the first column of `x` corresponds to the `"Positive"` condition.

We can find out the ER status by looking at the group labels.
```{r ER status}
table(group)
```

### Edgelist / Non-edgelist

The edgelist and non-edgelist are strings representing file locations and are read in using `data.table`'s `fread()` command. These are where users can specify known edges/non-edges of their own. Each observation is assumed to be a directed edge (for edgelist) or a directed non-edge (for non-edgelist). They both must have 4 columns in the following order. The columns do not necessarily need to be named properly, they simply must be in this specific order:

* 1st column - Source gene (base_gene_src), e.g. "7534""
* 2nd column - Gene identifier of the source gene (base_id_src), e.g. "ENTREZID"
* 3rd column - Destination gene (base_gene_dest), e.g. "8607"
* 4th column - Gene identifier of the destination gene (base_id_dest) e.g. "UNIPROT"
 
Note it is assumed that each edge/non-edge is directed so if you want an undirected edge/non-edge you should put in two observations as in:
```{r, echo = FALSE}
sample_edges <- data.table(base_gene_src = c("7534", "8607"), base_id_src = c("ENTREZID", "ENTREZID"), base_gene_dest = c("8607", "7534"), base_id_dest = c("ENTREZID", "ENTREZID"))
sample_edges
```

### Pathway matrix

Lastly, as documented in `?NetGSA` we need an indicator matrix of pathways. The rows of this matrix should correspond to pathways and the columns to the genes included in the analysis. The dimension is then *npath* x *p*. Values in this matrix should be 1 for genes that are in a given pathway and 0 otherwise. 

We consider pathways from KEGG [@kanehisa2000kegg]. KEGG pathways can be accessed in R using the **graphite** package. 
```{r pathways}
paths <- graphite::pathways('hsapiens','kegg')
paths[[1]]
head(nodes(paths[[1]]))
```

For this vignette, we'll use `pathways_mat`:

```{r}
pathways_mat[1:5,7, drop = FALSE]
```

The 1 shown indicates that `"ENTREZID: 18"` belongs to the *Alanine, aspartate and glutamate metabolism* pathway.

In summary, the data that we'll need to run our pathway enrichment analysis is:

* the data matrix (`x`), with rows for genes and columns for samples, 
* the group labels (`group`) for ER status, 
* the path to the file consisting of known edges (`edgelist`) to be read in with `fread()`, 
* the path to the file consisting of (a subset of) the non-edges (`nonedgelist`) to be read in with `fread()`, 
* pathway identifiers in matrix form (`pathways_mat`). 
 

## Step 2: Pathway enrichment analysis

Now that we have our data set-up properly we can begin with the pathway enrichment analysis. `netgsa` has three main functions. The first is `prepareAdjMat` which searches for edges in public databases and uses this information to estimate the adjacency matrices needed for `NetGSA`. The second is `NetGSA` which performs network-based gene set analysis. We will go into further details about the usage of these functions below.


### prepareAdjMat

After having the data set-up, the first step in pathway enrichment analysis with `netgsa` is to estimate the adjacency matrices. Remember, the rownames of the data matrix `X` must be named as `"GENE_ID:GENE_VALUE"` as in `"ENTREZID:7534"`. The `group` vector is defined as above.

The `databases` argument can be either (1) the result of `obtainEdgeList` or (2) a character vector defining the databases to search. These are essentially the same thing, the only difference is that for the character vector method, `obtainEdgeList` is called inside `prepareAdjMat` and cannot be saved. Since `prepareAdjMat` queries the `graphite` databases when it is called and `graphite` databases can change overtime, this may not be desirable for reproducibility. Using the `obtainEdgeList` method, one can save the edgelist to ensure the same network information is used across iterations or in the future. In both methods, one must specify the databases to search. The options are the databases for homo spaiens available in `graphite` or NDEx (only for development version on Github). The graphite databases are:

```{r, echo = FALSE}
as.character(graphite::pathwayDatabases()[graphite::pathwayDatabases()$species == "hsapiens","database"])
```

Note the `databases` argument is case sensitive so make sure to pass `"reactome"` and not `"Reactome"`.

The `cluster` argument controls whether or not clustering is used when estimating the adjacency matrix. The default behavior is to use clustering if `p > 2,500`. However, the user can override this behavior by setting the `cluster` argument. `prepareAdjMat` chooses the best clustering method from 6 possible methods in the `igraph` package. More details are provided in `?prepareAdjMat`.

User specified edge and non-edge files are specified with the `file_e` and `file_ne` arguments respectively. For more information on the assumptions and how network information is incorporated from the database edgelists, the user edges, and the user non-edges see the Details section and `file_e` and `file_ne` parameters in  `?prepareAdjMat`.

It is also important to note that `prepareAdjMat` will automatically choose the correct network estimation technique based on whether or not the graph is directed so no additional work is needed to determine undirected vs directed graphs. This is documented in `?prepareAdjMat`.

Now let's get to some example code. Suppose we wanted to estimate the network for our example data using our known edges/non-edges and searching for edges in Reactome, KEGG, and BioCarta. Let's also use clustering to speed up computation. We could do this simply by:

```{r}
database_search <- obtainEdgeList(rownames(x), c("kegg", "reactome"))
network_info <- prepareAdjMat(x, group, database_search,
                                         cluster = TRUE, file_e = "edgelist.txt", 
                                         file_ne = "nonedgelist.txt")
```

The main value of interest is `network_info[["Adj"]]`, a list of lists. The top level list indicates the condition while the next level is a list of adjacency matrices (one for each cluster). If `cluster = FALSE` there is an element for each connected component in the network. Also note that in this example, the object `database_search` was created. If desired, this can be saved and used later to ensure the same network information was used. However, the results may not be exactly the same. For example, some of the clustering algorithms are not deterministic so while the network information may be the same the clusters may be different resulting in different adjacency matrices. The adjacency matrix for the first condition and first cluster can be accessed with:

```{r}
network_info[["Adj"]][[1]][[1]][7:9,7:9]
```

The total number of clusters can be identified with:

```{r}
length(network_info[["Adj"]][[1]])
```

Note we cannot control cluster size which is why we try several methods and implement rules on which to choose. For more information see `?prepareAdjMat`.

### NetGSA

Now that the adjacency matrices are assembled we can perform pathway inference using `NetGSA`. The returned value from `prepareAdjMat` will always be in the correct format regardless of whether or not clustering is used, so one should always be able to pass `network_info[["Adj"]]` directly to `NetGSA`. The default is to use restricted Haseman-Elston regression (REHE) with sampling to estimate the variance parameters. This allows for significant computational speed up and little to no loss in power. One can also use REML for variance parameter estimation but this can be quite slow for problems with moderate or large dimension. For explanatory purposes, we explicitly write out the function:

```{r}
pathway_tests_rehe <- NetGSA(network_info[["Adj"]], x, group, pathways_mat, 
                             lklMethod = "REHE", sampling = TRUE, 
                             sample_n = 0.25, sample_p = 0.25)
```

The `sample_n` and `sample_p` parameters control the ratio for subsampling observations (columns of data matrix `x`) and variables (rows of data matrix `x`) respectively. More information on REHE and sampling can be found in the `?NetGSA` Details section.

Note that inference using REML can take several hours depending on the complexity of the problem, while inference using REHE can take minutes. Roughly, REML becomes quite slow for p > 2,000. In this example, with  `sample_n = 0.25`,  `sample_p = 0.25` only took about 2 minutes on a 2 CPU personal computer with 16 GB of RAM.

The main inference results are stored in `pathway_tests[["results"]]` and contain the pathway names as well as their significance (p-values and q-values are reported) and test statistics.

## Step 3: Visualization

`netgsa` provides several options for visualizing the results from `NetGSA`. Cytoscape or igraph is used to display the main results depending on whether the user has the Cytoscape app open.

### Cytoscape

If the user has Cytoscape open on their computer, calling `plot.NetGSA` will create several plots:

1. **Cytoscape plots** - the first place `plot.NetGSA` generates plots is within Cytoscape. A nested network is created where the main network (Pathway Network) displays pathways as nodes. Edges in this network represent edges between genes contained within those pathways. In this network, the `results` object from `NetGSA()` is loaded as the node data and the number of edges between genes of separate pathways are loaded as edge data in a variable called "weight". Node color is mapped to the test statistic. Negative values are orange, values around 0 are white, and positive values are blue. Node border color is mapped to q-values going from red (0) to white (1). 

    In addition to the main network, a network is generated for each pathway. These are the gene networks underlying each pathway and are referred to as the within pathway networks. The within pathway networks are linked to the nodes in the Pathway Network using Cytoscape's nested network format. An example will make this more clear. Calling:

    ```{r, eval = FALSE}
    plot.NetGSA(pathway_tests_rehe)
    ```

    The Pathway Network might look something like:
    ```{r, fig.retina=NULL, out.width=600, echo=FALSE}
        knitr::include_graphics("cyto_pathway_network_default.png")
    ```
    
    The most interesting part of the Cytoscape nested network framework is the ability to view the within pathway networks. Suppose you were looking at the Pathway Network and noticed the *ErbB signaling network* was highly significant and you wanted to investigate it by looking at the underlying gene network. To do this, one could simply right click on the *ErbB signaling pathway* node in the Pathway Network, and go to Nested Networks -> Go To Nested Network. Alternatively, the nested networks have the same name as the pathway they represent so one could simply navigate to the *ErbB signaling pathway* using the *Network* tab in the *Control Panel* on the left side of the Cytoscape GUI. To save time, the nested networks are not formatted. However, we can use the `formatPathways` function to format one or multiple using the default style.

    ```{r, eval = FALSE}
    formatPathways(pathway_tests_rehe, "ErbB signaling pathway")
    ```
    
    ```{r, fig.retina=NULL, out.width=600, echo=FALSE}
    knitr::include_graphics("cyto_erbb.png")
    ```
    
    The `formatPathways` function colors gene nodes based on FDR adjusted q-value. For more information see `?formatPathways`.

    To export images from Cytoscape one can either use the GUI or the `RCy3::exportImage()` function.

2. **Cytoscape legend generated in R** - Legends are difficult to incorporate on the plot within Cytoscape so we generate a legend in R. Alternatively it is also possible to generate an image directly in Cytoscape. The Cytoscape manual page has more information on that: http://manual.cytoscape.org/en/stable/Styles.html.

    ```{r, fig.retina=NULL, out.width=600, echo=FALSE}
    knitr::include_graphics("cyto_legend.png")
    ```
    
3. **Igraph plot** - Lastly, for users that prefer R, `plot.NetGSA` will also generate a plot using `igraph` with the same layout and node color/node border color mapping as in Cytoscape.
    ```{r, fig.retina=NULL, out.width=600, echo=FALSE}
    knitr::include_graphics("igraph_cyto_pathway_network_default.png")
    ```
    
    The Cytoscape plotting section is wrapped up with a few notes on customization. While `plot.NetGSA` has a default layout, a custom layout can be provided to both `plot.NetGSA` and `formatPathways` using the `graph_layout` parameter. For Cytoscape, this parameter is passed as a string to `RCy3::layoutNetwork` so it may be helpful to read that documentation. Alternatively, one does not even need to use this parameter. They can set up a custom layout using the Cytoscape GUI or can even use the `RCy3::layoutNetwork` function directly. For example, calling `layoutNetwork("degree-circle")` will change the layout of the current network selected in Cytoscape.

    ```{r, eval = FALSE}
    # Format the "Neurotrophin signaling pathway" using the "degree-circle" layout
    formatPathways(pathway_tests_rehe, "Neurotrophin signaling pathway",
                 graph_layout = "degree-circle")
    ```

    Lastly, for users who want even more customization, the `results` object returned from `NetGSA()` and the edge weight between pathways are loaded into Cytoscape. So to size edges based on edge weight one could:

    ```{r, eval = FALSE}
    RCy3::setCurrentNetwork("Pathway Network")
    edge_weights <- RCy3::getTableColumns(table = "edge", columns = "weight")
    RCy3::setEdgeLineWidthMapping("weight", c(min(edge_weights), max(edge_weights)), c(1,5))
    ```

### Igraph

If the user does not have Cytoscape open, a 3-D igraph plot is created using `igraph::rglplot`. Due to the nature of the `igraph::rglplot` function we only map test statistics to node color. We are not able to map colors to both the node and node border. 

```{r, eval = FALSE}
plot(pathway_tests_rehe)
```
```{r, fig.retina=NULL, out.width=600, echo=FALSE}
knitr::include_graphics("igraph_pathway_network_default.png")
```
```{r, fig.retina=NULL, out.width=600, echo=FALSE}
knitr::include_graphics("igraph_legend.png")
```

Again, a custom layout can be specified with the `graph_layout` parameter. As documented in `?plot.NetGSA` this must be a function that takes one parameter which is passed to the `igraph::rglplot` function. For example to layout a graph randomly we can do:

```{r, eval = FALSE}
layout_fun <- function(ig) igraph::layout_randomly(ig)
plot(pathway_tests_rehe, graph_layout = layout_fun)
```

While igraph does not provide a similar nested network format as Cytoscape, we can use the `zoomPathway` function to look at the gene interactions within a pathway:

```{r, eval = FALSE}
zoomPathway(pathway_tests_rehe, "ErbB signaling pathway")
```
```{r, fig.retina=NULL, out.width=600, echo=FALSE}
knitr::include_graphics("igraph_erbb.png")
```


## References