In this vignette, we demonstrate the NetGSA workflow using a subset of the breast cancer data example downloaded from the Cancer Genome Atlas (TCGA 2012). In particular, we illustrate how to incorporate existing network information (e.g. curated edges from KEGG) to improve the power of pathway enrichment analysis with NetGSA. Details of the method are avaialble in Ma, Shojaie, and Michailidis (2016).
netgsa provides simple functions that automatically search known gene databases using
graphite for network information and integrate seamlessly with
NetGSA pathway enrichment and
igraph and Cytoscape plotting. In this vignette, we demonstrate the NetGSA workflow explaining proper data types and usage of the
Our example data set comes from a breast cancer study (TCGA 2012), which consists of gene expression data from 520 subjects including 117 estrogen-receptor-negative (ER-) and 403 estrogen-receptor-positive (ER+). The goal is to generate diagnostic pathway signatures that could distinguish patients with different ER statuses by comparing gene expression data from the two classes. These signatures could provide additional clinical benefit in diagnosing breast cancer.
NetGSA works directly with the expression data matrix. When loading the data, it is important to check the distribution of raw sequencing reads and perform log transformation if necessary. Data in this example were already log transformed. Rows of the data matrix correspond to genes and columns to subjects. Genes in this data matrix were labeled with Entrez IDs, same as those used in KEGG pathways.
library(netgsa) library(graphite) library(data.table) data("breastcancer2012_subset") ls()
##  "edgelist" "group" "nonedgelist" "pathways" "pathways_mat" ##  "x"
The objects loaded which are necessary for this vignette are:
x, the data matrix
edgelist, a data.frame containing user-specified edges
nonedgelist, a data.frame containing user-specified non-edges
group, a vector mapping columns of
pathways, a list of KEGG pathways,
pathways_mat, an indicator matrix of KEGG pathways to be used directly in
The data matrix,
x must have rownames that are named as
netgsa allows the user to search for edges in multiple databases in
graphite, each of which may use different identifiers (e.g. KEGG, Reactome, Biocarta, etc.),
netgsa needs to know the identifier for a given gene so we can convert it properly. For example, if you have two genes “ENTREZID: 7186” and “ENTREZID: 329” and want to search for potential edges between these two within Reactome,
netgsa must first convert these genes to their “UNIPROT” IDs and use those to search for edges.
netgsa uses the
AnnotationDbi package to convert genes. The valid list of
##  "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" ##  "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME" ##  "GO" "GOALL" "IPI" "MAP" "OMIM" ##  "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID" ##  "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" "UNIGENE" ##  "UNIPROT"
An example of what the rownames for the data matrix,
x, should look like is:
##  "ENTREZID:127550" "ENTREZID:53947" "ENTREZID:65985" "ENTREZID:51166" ##  "ENTREZID:15" "ENTREZID:60496"
The columns of the data matrix do not need to be named, but it is useful for keeping track of groups.
group object is a vector (numeric or character) and must be the same length as
ncol(x). Each element of the
group tells us which group each column of
x corresponds to. For example, if
group = "Positive" this says that the first column of
x corresponds to the
We can find out the ER status by looking at the group labels.
## group ## 1 2 ## 403 117
The edgelist and non-edgelist are strings representing file locations and are read in using
fread() command. These are where users can specify known edges/non-edges of their own. Each observation is assumed to be a directed edge (for edgelist) or a directed non-edge (for non-edgelist). They both must have 4 columns in the following order. The columns do not necessarily need to be named properly, they simply must be in this specific order:
Note it is assumed that each edge/non-edge is directed so if you want an undirected edge/non-edge you should put in two observations as in:
## base_gene_src base_id_src base_gene_dest base_id_dest ## 1: 7534 ENTREZID 8607 ENTREZID ## 2: 8607 ENTREZID 7534 ENTREZID
Lastly, as documented in
?NetGSA we need an indicator matrix of pathways. The rows of this matrix should correspond to pathways and the columns to the genes included in the analysis. The dimension is then npath x p. Values in this matrix should be 1 for genes that are in a given pathway and 0 otherwise.
We consider pathways from KEGG (Kanehisa and Goto 2000). KEGG pathways can be accessed in R using the graphite package.
graphite::pathways('hsapiens','kegg') paths <-1]]paths[[
## "Glycolysis / Gluconeogenesis" pathway ## Native ID = hsa:00010 ## Database = KEGG ## Species = hsapiens ## Number of nodes = 94 ## Number of edges = 1191 ## Retrieved on = 22-04-2020 ## URL = http://www.kegg.jp/kegg-bin/show_pathway?org_name=hsa&mapno=00010
##  "ENTREZID:10327" "ENTREZID:124" "ENTREZID:125" "ENTREZID:126" ##  "ENTREZID:127" "ENTREZID:128"
For this vignette, we’ll use
1:5,7, drop = FALSE]pathways_mat[
## ENTREZID:18 ## Adipocytokine signaling pathway 0 ## Adrenergic signaling in cardiomyocytes 0 ## AGE-RAGE signaling pathway in diabetic complications 0 ## Alanine, aspartate and glutamate metabolism 1 ## AMPK signaling pathway 0
The 1 shown indicates that
"ENTREZID: 18" belongs to the Alanine, aspartate and glutamate metabolism pathway.
In summary, the data that we’ll need to run our pathway enrichment analysis is:
x), with rows for genes and columns for samples,
group) for ER status,
edgelist) to be read in with
nonedgelist) to be read in with
Now that we have our data set-up properly we can begin with the pathway enrichment analysis.
netgsa has three main functions. The first is
prepareAdjMat which searches for edges in public databases and uses this information to estimate the adjacency matrices needed for
NetGSA. The second is
NetGSA which performs network-based gene set analysis. We will go into further details about the usage of these functions below.
After having the data set-up, the first step in pathway enrichment analysis with
netgsa is to estimate the adjacency matrices. Remember, the rownames of the data matrix
X must be named as
"GENE_ID:GENE_VALUE" as in
group vector is defined as above.
databases argument can be either (1) the result of
obtainEdgeList or (2) a character vector defining the databases to search. These are essentially the same thing, the only difference is that for the character vector method,
obtainEdgeList is called inside
prepareAdjMat and cannot be saved. Since
prepareAdjMat queries the
graphite databases when it is called and
graphite databases can change overtime, this may not be desirable for reproducibility. Using the
obtainEdgeList method, one can save the edgelist to ensure the same network information is used across iterations or in the future. In both methods, one must specify the databases to search. The options are the databases for homo spaiens available in
graphite or NDEx (only for development version on Github). The graphite databases are:
##  "biocarta" "kegg" "nci" "panther" "pathbank" "pharmgkb" "reactome" ##  "smpdb"
databases argument is case sensitive so make sure to pass
"reactome" and not
cluster argument controls whether or not clustering is used when estimating the adjacency matrix. The default behavior is to use clustering if
p > 2,500. However, the user can override this behavior by setting the
prepareAdjMat chooses the best clustering method from 6 possible methods in the
igraph package. More details are provided in
User specified edge and non-edge files are specified with the
file_ne arguments respectively. For more information on the assumptions and how network information is incorporated from the database edgelists, the user edges, and the user non-edges see the Details section and
file_ne parameters in
It is also important to note that
prepareAdjMat will automatically choose the correct network estimation technique based on whether or not the graph is directed so no additional work is needed to determine undirected vs directed graphs. This is documented in
Now let’s get to some example code. Suppose we wanted to estimate the network for our example data using our known edges/non-edges and searching for edges in Reactome, KEGG, and BioCarta. Let’s also use clustering to speed up computation. We could do this simply by:
obtainEdgeList(rownames(x), c("kegg", "reactome")) database_search <- prepareAdjMat(x, group, database_search, network_info <-cluster = TRUE, file_e = "edgelist.txt", file_ne = "nonedgelist.txt")
The main value of interest is
network_info[["Adj"]], a list of lists. The top level list indicates the condition while the next level is a list of adjacency matrices (one for each cluster). If
cluster = FALSE there is an element for each connected component in the network. Also note that in this example, the object
database_search was created. If desired, this can be saved and used later to ensure the same network information was used. However, the results may not be exactly the same. For example, some of the clustering algorithms are not deterministic so while the network information may be the same the clusters may be different resulting in different adjacency matrices. The adjacency matrix for the first condition and first cluster can be accessed with:
## ENTREZID:18 ENTREZID:27 ENTREZID:28 ## ENTREZID:18 0 0 0 ## ENTREZID:27 0 0 0 ## ENTREZID:28 0 0 0
The total number of clusters can be identified with:
##  16
Note we cannot control cluster size which is why we try several methods and implement rules on which to choose. For more information see
Now that the adjacency matrices are assembled we can perform pathway inference using
NetGSA. The returned value from
prepareAdjMat will always be in the correct format regardless of whether or not clustering is used, so one should always be able to pass
network_info[["Adj"]] directly to
NetGSA. The default is to use restricted Haseman-Elston regression (REHE) with sampling to estimate the variance parameters. This allows for significant computational speed up and little to no loss in power. One can also use REML for variance parameter estimation but this can be quite slow for problems with moderate or large dimension. For explanatory purposes, we explicitly write out the function:
NetGSA(network_info[["Adj"]], x, group, pathways_mat, pathway_tests_rehe <-lklMethod = "REHE", sampling = TRUE, sample_n = 0.25, sample_p = 0.25)
sample_p parameters control the ratio for subsampling observations (columns of data matrix
x) and variables (rows of data matrix
x) respectively. More information on REHE and sampling can be found in the
?NetGSA Details section.
Note that inference using REML can take several hours depending on the complexity of the problem, while inference using REHE can take minutes. Roughly, REML becomes quite slow for p > 2,000. In this example, with
sample_n = 0.25,
sample_p = 0.25 only took about 2 minutes on a 2 CPU personal computer with 16 GB of RAM.
The main inference results are stored in
pathway_tests[["results"]] and contain the pathway names as well as their significance (p-values and q-values are reported) and test statistics.
netgsa provides several options for visualizing the results from
NetGSA. Cytoscape or igraph is used to display the main results depending on whether the user has the Cytoscape app open.
If the user has Cytoscape open on their computer, calling
plot.NetGSA will create several plots:
Cytoscape plots - the first place
plot.NetGSA generates plots is within Cytoscape. A nested network is created where the main network (Pathway Network) displays pathways as nodes. Edges in this network represent edges between genes contained within those pathways. In this network, the
results object from
NetGSA() is loaded as the node data and the number of edges between genes of separate pathways are loaded as edge data in a variable called “weight”. Node color is mapped to the test statistic. Negative values are orange, values around 0 are white, and positive values are blue. Node border color is mapped to q-values going from red (0) to white (1).
In addition to the main network, a network is generated for each pathway. These are the gene networks underlying each pathway and are referred to as the within pathway networks. The within pathway networks are linked to the nodes in the Pathway Network using Cytoscape’s nested network format. An example will make this more clear. Calling:
The Pathway Network might look something like:
The most interesting part of the Cytoscape nested network framework is the ability to view the within pathway networks. Suppose you were looking at the Pathway Network and noticed the ErbB signaling network was highly significant and you wanted to investigate it by looking at the underlying gene network. To do this, one could simply right click on the ErbB signaling pathway node in the Pathway Network, and go to Nested Networks -> Go To Nested Network. Alternatively, the nested networks have the same name as the pathway they represent so one could simply navigate to the ErbB signaling pathway using the Network tab in the Control Panel on the left side of the Cytoscape GUI. To save time, the nested networks are not formatted. However, we can use the
formatPathways function to format one or multiple using the default style.
formatPathways(pathway_tests_rehe, "ErbB signaling pathway")
formatPathways function colors gene nodes based on FDR adjusted q-value. For more information see
To export images from Cytoscape one can either use the GUI or the
Cytoscape legend generated in R - Legends are difficult to incorporate on the plot within Cytoscape so we generate a legend in R. Alternatively it is also possible to generate an image directly in Cytoscape. The Cytoscape manual page has more information on that: http://manual.cytoscape.org/en/stable/Styles.html.
Igraph plot - Lastly, for users that prefer R,
plot.NetGSA will also generate a plot using
igraph with the same layout and node color/node border color mapping as in Cytoscape.