Package: MetaGSCA 0.1.0

1 Introduction

This tutorial provides an overview and a tutorial of the R package MetaGSCA for Meta Gene Set Co-Expression Analysis.

MetaGSCA (Meta Gene Set Co-expression Analysis) is a computational tool to systematically assess the co-expression disturbance of a specified gene set with rigorous meta-analysis justification. A nonparametric approach that accounts for the correlation structure between genes is used to test whether the gene set is differentially co-expresssed between the two conditions, and a permutation test p-value is computed. Bootstrapping is used to construct confidence intervals of point estimates. A meta-analysis of single proportions is then performed with two options: random-intercept logistic regression model and the inverse variance method, to yield conclusive results over a number of individual studies. The meta-analysis empowered gene set differential co-expression tool MetaGSCA is implemented in the form of both a web application and an R package.

Packages data.table, meta(metafor), grid, sjstats, igraph are required.

The analysis will start by loading package MetaGSCA.

library(MetaGSCA)


2 How to use the MetaGSCA package

This Section illustrates the typical procedure for applying the methods available in package MetaGSCA.

2.1 Set Working Directory

Create a temporary directory “tutorial” under the user’s current working directory. All output generated in this tutorial will be saved in “tutorial”.

## set your workdir
if (!dir.exists('tutorial')) dir.create('tutorial')
setwd('tutorial')


2.2 Example Data

Load the example data:
1. lists of genesets (3 pathways)
2. lists of datasets (11 TCGA datasets)
The two aspects of datasets are necessary for running the examples and case studies in this tutorial.

###########################
data(meta)
datasets <- vector('list',length(Dsetnames))
names(datasets) <- Dsetnames
for (ca in Dsetnames) {
  cmd=paste0('data(',ca,')')
  eval(parse(text=cmd))
  cmd=paste0(ca,'<-data.frame(V1=rownames(',ca,'),',ca,')') # Create first col to meet MetaGSCA requirement.
  eval(parse(text=cmd))
  cmd=paste0('datasets[[ca]] <- ',ca)
  eval(parse(text=cmd))
}
datasetnames <- Dsetnames
genesetnames <- names(genesets)

Note:
1. lists of genesets (3 pathways):

## [1] "Aminosugars metabolism"     "p38 MAPK pathway"           "p38 MAPK signaling pathway"

2. lists of datasets (11 TCGA datasets):

##          BRCA COAD HNSC KIRC KIRP LIHC LUAD LUSC PRAD STAD THCA
## #Genes   3799 3757 3819 3811 3818 3686 3803 3844 3801 3931 3760
## #Samples  227   81   87  145   65  101  117  103  105   65  119



2.3 Case Study

2.3.1 Example 1: MetaGSCA

2.3.1.1 Code

MetaGSCA is used to perform meta & gene set co-expression analysis of one geneset.

Arguments
* list.geneset: a pre-specified gene list from a gene set or pathway.
* list.dataset: a list of datasets, first column is gene name.
* list.group: a list of samples/patients subgroup or condition (e.g. (1,1,1,2,2,2)).
* name.geneset: the name of the gene set, used for output file name.
* name.dataset: a list of dataset names corresponding to list.dataset, used for forest plot.
* nperm: number of permutations used to estimate the null distribution of the test statistic. If not given, a default value 500 is used.
* nboot: number of bootstraps used to estimate the point and interval estimate. If not given, a default value 200 is used.
* method.Inverse: logical. Indicating whether “Inverse” method is to be used for pooling of studies.
* method.GLMM: logical. Indicating whether “GLMM” method is to be used for pooling of studies.
* fixed.effect: logical. Indicating whether a fixed effect meta-analysis should be conducted.
* random.effect: logical. Indicating whether a random effects meta-analysis should be conducted.

testSingle <- MetaGSCA(list.geneset = genesets[[2]],
         list.dataset = datasets,
         list.group = groups,
         name.geneset = genesetnames[[2]],
         name.dataset = datasetnames,
         nperm = 100,
         nboot = 100,
         method.Inverse = FALSE,
         method.GLMM = TRUE,
         fixed.effect = FALSE,
         random.effect = TRUE)
## 2020-10-29 15:38:41 Processing dataset: BRCA...
## 2020-10-29 15:39:02 Processing dataset: COAD...
## 2020-10-29 15:39:20 Processing dataset: HNSC...
## 2020-10-29 15:39:39 Processing dataset: KIRC...
## 2020-10-29 15:39:58 Processing dataset: KIRP...
## 2020-10-29 15:40:17 Processing dataset: LIHC...
## 2020-10-29 15:40:36 Processing dataset: LUAD...
## 2020-10-29 15:40:55 Processing dataset: LUSC...
## 2020-10-29 15:41:14 Processing dataset: PRAD...
## 2020-10-29 15:41:32 Processing dataset: STAD...
## 2020-10-29 15:41:51 Processing dataset: THCA...


2.3.1.2 Parameters

genesetnames[2]
## [1] "p38 MAPK pathway"
genesets[2]
## $`p38 MAPK pathway`
##  [1] "ATF1"     "DUSP1"    "DUSP10"   "EEF2K"    "EIF4E"    "ELK1"     "GADD45A"  "HSPB1"    "IL1R1"    "MAP2K4"  
## [11] "MAP2K6"   "MAP3K10"  "MAP3K4"   "MAP3K5"   "MAP3K7"   "MAPK11"   "MAPK12"   "MAPK13"   "MAPK14"   "MAPKAPK2"
## [21] "MAPKAPK3" "MEF2B"    "MEF2C"    "MEF2D"    "MKNK1"    "MKNK2"    "RPS6KA4"  "RPS6KA5"  "SRF"      "TAB1"    
## [31] "TAB2"     "TRAF6"


2.3.1.3 Result

There’ll be 2 output file after MetaGSCA. 1 forest plot (.pdf) and 1 output data information file (.csv).

## [1] "p38 MAPK pathway_GLMM_Meta Analysis for bootstrap p-value_Forest plot.pdf"
## [2] "p38 MAPK pathway_Individual Dataset GSAR Results with Bootstrapping.csv"


The PDF file depicts the forest plot for one gene set across multiple datasets, with random effects model result. Attached with bootstrap result at the bottom side.
Forest plot for one gene set across multiple datasets

Forest plot for one gene set across multiple datasets


The CSV file includes the output data information across multiple datasets. Here we only show information for the first two datasets.
p38 MAPK pathway
##                         [,1]                            [,2]                           
## X                       "BRCA"                          "COAD"                         
## Original.TS             "6.719747"                      "9.629239"                     
## Original.Pvalue         "0.019802"                      "0.009901"                     
## Boot.TS.Mean            "6.916047"                      "8.544807"                     
## Boot.TS.Bias            " 0.196300"                     "-1.084432"                    
## Boot.TS.Std             "0.901800"                      "1.321158"                     
## Boot.TS.Norm.CI         "95% CI: [5.148552, 8.683543]"  "95% CI: [5.955386, 11.134228]"
## Boot.TS.PT.CI           "95% CI: [5.40299, 8.765926]"   "95% CI: [6.249312, 10.977783]"
## Boot.Pvalue.Mean        "0.024752"                      "0.095842"                     
## Boot.Pvalue.Bias        " 0.004950"                     " 0.085941"                    
## Boot.Pvalue.Std         "0.027845"                      "0.152952"                     
## Boot.Pvalue.Norm.CI     "95% CI: [-0.029822, 0.079327]" "95% CI: [-0.203938, 0.395621]"
## Boot.Pvalue.PT.CI       "95% CI: [0.009901, 0.118812]"  "95% CI: [0.009901, 0.510644]" 
## Genes.not.found         "-"                             "-"                            
## Genes.removed.sd.0.001. "-"                             "-"


2.3.2 Example 2: MetaGSCA_Multi_Geneset

2.3.2.1 Code

MetaGSCA_Multi_Geneset is used to perform meta & gene set co-expression analysis of multiple geneset.

Arguments
* list.geneset: several pre-specified gene lists from a gene set or pathway.
* name.geneset: the names of the gene sets, used for output file name.

testMultiple <- MetaGSCA_Multi_Geneset(list.geneset = genesets[1:2],
                       list.dataset = datasets,
                       list.group = groups,
                       name.geneset = genesetnames[1:2],
                       name.dataset = datasetnames,
                       nperm = 100,
                       nboot = 100,
                       method.Inverse = FALSE,
                       method.GLMM = TRUE,
                       fixed.effect = FALSE,
                       random.effect = TRUE)


2.3.2.2 Parameters

genesetnames[1:2]
## [1] "Aminosugars metabolism" "p38 MAPK pathway"
genesets[1:2]
## $`Aminosugars metabolism`
##  [1] "CMAS"    "GCK"     "GFPT1"   "GFPT2"   "GNE"     "GNPDA1"  "GNPNAT1" "HK1"     "HK2"     "HK3"     "MPI"    
## [12] "NAGK"    "NANS"    "PGM3"    "RENBP"   "UAP1"   
## 
## $`p38 MAPK pathway`
##  [1] "ATF1"     "DUSP1"    "DUSP10"   "EEF2K"    "EIF4E"    "ELK1"     "GADD45A"  "HSPB1"    "IL1R1"    "MAP2K4"  
## [11] "MAP2K6"   "MAP3K10"  "MAP3K4"   "MAP3K5"   "MAP3K7"   "MAPK11"   "MAPK12"   "MAPK13"   "MAPK14"   "MAPKAPK2"
## [21] "MAPKAPK3" "MEF2B"    "MEF2C"    "MEF2D"    "MKNK1"    "MKNK2"    "RPS6KA4"  "RPS6KA5"  "SRF"      "TAB1"    
## [31] "TAB2"     "TRAF6"


2.3.2.3 Result

Four output files will be generated after MetaGSCA_Multi_Geneset, two for each gene set. That is, 2 forest plot (.pdf) and 2 output data information file (.csv).

## [1] "_Meta Analysis bootstrap result.csv"                                            
## [2] "Aminosugars metabolism_GLMM_Meta Analysis for bootstrap p-value_Forest plot.pdf"
## [3] "Aminosugars metabolism_Individual Dataset GSAR Results with Bootstrapping.csv"  
## [4] "p38 MAPK pathway_GLMM_Meta Analysis for bootstrap p-value_Forest plot.pdf"      
## [5] "p38 MAPK pathway_Individual Dataset GSAR Results with Bootstrapping.csv"


In each forest plot file, we’re able to get the forest plot across multiple datasets, with random effects model result. Attached with bootstrap result at the bottom side.
Forest plot for one gene set across multiple datasets

Forest plot for one gene set across multiple datasets


Forest plot for another gene set across multiple datasets

Forest plot for another gene set across multiple datasets


The 2 output data information files each include the output data information for one specific gene set across multiple datasets. Here we only show information for the first two datasets.
Aminosugars metabolism
##                         [,1]                           [,2]                           
## X                       "BRCA"                         "COAD"                         
## Original.TS             "3.094016"                     "3.630329"                     
## Original.Pvalue         "0.099010"                     "0.029703"                     
## Boot.TS.Mean            "2.955360"                     "3.710180"                     
## Boot.TS.Bias            "-0.138656"                    " 0.079851"                    
## Boot.TS.Std             "0.640507"                     "0.673168"                     
## Boot.TS.Norm.CI         "95% CI: [1.69999, 4.21073]"   "95% CI: [2.390795, 5.029566]" 
## Boot.TS.PT.CI           "95% CI: [1.969187, 4.249253]" "95% CI: [2.53947, 5.220344]"  
## Boot.Pvalue.Mean        "0.202475"                     "0.069604"                     
## Boot.Pvalue.Bias        " 0.103465"                    " 0.039901"                    
## Boot.Pvalue.Std         "0.180077"                     "0.105605"                     
## Boot.Pvalue.Norm.CI     "95% CI: [-0.150469, 0.55542]" "95% CI: [-0.137379, 0.276587]"
## Boot.Pvalue.PT.CI       "95% CI: [0.009901, 0.678465]" "95% CI: [0.009901, 0.356188]" 
## Genes.not.found         "-"                            "-"                            
## Genes.removed.sd.0.001. "-"                            "-"

p38 MAPK pathway
##                         [,1]                            [,2]                           
## X                       "BRCA"                          "COAD"                         
## Original.TS             "6.719747"                      "9.629239"                     
## Original.Pvalue         "0.019802"                      "0.009901"                     
## Boot.TS.Mean            "6.916047"                      "8.544807"                     
## Boot.TS.Bias            " 0.196300"                     "-1.084432"                    
## Boot.TS.Std             "0.901800"                      "1.321158"                     
## Boot.TS.Norm.CI         "95% CI: [5.148552, 8.683543]"  "95% CI: [5.955386, 11.134228]"
## Boot.TS.PT.CI           "95% CI: [5.40299, 8.765926]"   "95% CI: [6.249312, 10.977783]"
## Boot.Pvalue.Mean        "0.024752"                      "0.095842"                     
## Boot.Pvalue.Bias        " 0.004950"                     " 0.085941"                    
## Boot.Pvalue.Std         "0.027845"                      "0.152952"                     
## Boot.Pvalue.Norm.CI     "95% CI: [-0.029822, 0.079327]" "95% CI: [-0.203938, 0.395621]"
## Boot.Pvalue.PT.CI       "95% CI: [0.009901, 0.118812]"  "95% CI: [0.009901, 0.510644]" 
## Genes.not.found         "-"                             "-"                            
## Genes.removed.sd.0.001. "-"                             "-"


At last, the CSV file _Meta Analysis bootstrap result.csv assembles the statistics results of all gene sets altogether, including information for all individual datasets.

** _Meta Analysis bootstrap result.csv **
* Geneset.Name: Gene set (pathway) name.
* Number.of.Genes: Number of genes included in the pathway.
* meta.p: Meta-analysis summarized permutation p-value across multiple datasets.
* bootstrap.p: Median of meta-analysis summarized bootstrap p-values across 11 datasets.
* BRCA: Permutation p-value generated within one individual dataset (BRCA here, but can be other dataset name).

##                    Aminosugars metabolism p38 MAPK pathway
## Number.of.Genes                   16.0000          32.0000
## meta.p.random                      0.0508           0.0582
## bootstrap.p.random                 0.0488           0.0806
## BRCA                               0.0990           0.0099
## COAD                               0.0297           0.0099
## HNSC                               0.3366           0.2673
## KIRC                               0.0099           0.0693
## KIRP                               0.0099           0.2574
## LIHC                               0.1188           0.0099
## LUAD                               0.0297           0.0396
## LUSC                               0.0099           0.0297
## PRAD                               0.0099           0.0198
## STAD                               0.0693           0.1188
## THCA                               0.5545           0.4356



2.3.3 Example 3: Pathway Crosstalk Analysis

For pathway crosstalk analysis, one call PWcTalk function in one line, or alternatively, execute several codelines combining PWcTalkpre and PWcTalkNW to enable greater modulation of network layout. One must execute MetaGSCA_Multi_Geneset function on sufficiently many gene sets (say >100) before attempting pathway crosstalk analysis. To enable swift testing of pathway cross talk analysis, we provide one example input as if it was generated from the prior workflow.

Note: If you are connecting from a Windows desktop to a remote Linux server and run these commands in Linux terminal, you’ll need an X11 display assistance (such as Xming) configured and started before you run pathway crosstalk analysis.

data(input2PWcTalk)
## step 4.1, one code line to execute pathway crosstalk analysis. Simple, but no flexibility for layout tuning. 
PWcTalk(input2PWcTalk,test='binary',
  pTh.dataset=0.01,pTh.pwPair=0.01,pTh.pw=0.01,figname='PWcTalk',
  pdfW=10,pdfH=10,asp=0.7,vbase=15,ebase=2,vlbase=1,power=1/2)
## step 4.2, one code block to execute pathway crosstalk analysis. More code lines, but enabling interactive layout tuning.
preNW <- PWcTalkNWpre(input2PWcTalk,test='binary',
  pTh.dataset=0.01,pTh.pwPair=0.01,pTh.pw=0.01)
g_tkid <- PWcTalkNW(preNW$PW.pair,preNW$PW.p)
##### PAUSE here: adjust the network layout on the pop-out window to reach a satisfaction #####
coords <- tk_coords(g_tkid$tkid)
g_tkid <- PWcTalkNW(preNW$PW.pair,preNW$PW.p,layout=coords,pdfW=14,pdfH=10,figname='PWcTalk',asp=0.5)       
A demonstration pathway crosstalk network

A demonstration pathway crosstalk network


3 Citing MetaGSCA

If you use MetaGSCA in your publication, please cite: XXXXXX