What it does
MAGeCK pathway can also invoke robust ranking aggregation (RRA) to test whether a pathway is enriched in a given gene ranking; see More Information below.
Inputs
Gene Ranking files
A gene ranking file is required as input and can be produced using mageck test. An example of the gene ranking file (gene summary file) is as follows:
id | num | neg|score | neg|p-value | neg|fdr | neg|rank | neg|goodsgrna | neg|lfc | pos|score | pos|p-value | pos|fdr | pos|rank | pos|goodsgrna | pos|lfc |
ESPL1 | 12 | 6.4327e-10 | 7.558e-06 | 7.9e-05 | 1 | 11 | -2.35 | 0.99725 | 0.99981 | 0.999992 | 615 | 0 | -0.07 |
RPL18 | 12 | 6.4671e-10 | 7.558e-06 | 7.9e-05 | 2 | 11 | -2.12 | 0.99799 | 0.99989 | 0.999992 | 620 | 0 | -0.32 |
CDK1 | 12 | 2.6439e-09 | 7.558e-06 | 7.9e-05 | 3 | 12 | -1.93 | 1.0 | 0.99999 | 0.999992 | 655 | 0 | -0.12 |
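As a quick check before running the pathway analysis, the gene summary file can be inspected with a few lines of Python. The sketch below assumes the file on disk is tab-delimited with the column names shown above. The function names (read_gene_summary, top_negative) and the 0.05 FDR cutoff are illustrative choices and are not part of MAGeCK. The embedded example is trimmed to the columns the code uses.

```python
import csv
import io

FLOAT_COLUMNS = ("neg|score", "neg|p-value", "neg|fdr",
                 "pos|score", "pos|p-value", "pos|fdr")
INT_COLUMNS = ("num", "neg|rank", "pos|rank")


def read_gene_summary(handle):
    """Parse a tab-delimited gene summary into a list of dicts keyed by column name."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in FLOAT_COLUMNS:
            row[key] = float(row[key])
        for key in INT_COLUMNS:
            row[key] = int(row[key])
        rows.append(row)
    return rows


def top_negative(rows, fdr_cutoff=0.05):
    """Return gene ids passing the negative-selection FDR cutoff, best rank first."""
    hits = [r for r in rows if r["neg|fdr"] <= fdr_cutoff]
    return [r["id"] for r in sorted(hits, key=lambda r: r["neg|rank"])]


# Values copied from the example rows above, reduced to the columns parsed here.
HEADER = ["id", "num", "neg|score", "neg|p-value", "neg|fdr", "neg|rank",
          "pos|score", "pos|p-value", "pos|fdr", "pos|rank"]
EXAMPLE_ROWS = [
    ["ESPL1", "12", "6.4327e-10", "7.558e-06", "7.9e-05", "1",
     "0.99725", "0.99981", "0.999992", "615"],
    ["RPL18", "12", "6.4671e-10", "7.558e-06", "7.9e-05", "2",
     "0.99799", "0.99989", "0.999992", "620"],
    ["CDK1", "12", "2.6439e-09", "7.558e-06", "7.9e-05", "3",
     "1.0", "0.99999", "0.999992", "655"],
]
EXAMPLE = "\n".join("\t".join(fields) for fields in [HEADER] + EXAMPLE_ROWS) + "\n"

if __name__ == "__main__":
    genes = read_gene_summary(io.StringIO(EXAMPLE))
    print(top_negative(genes))
```

To read a real file, replace io.StringIO(EXAMPLE) with an open file handle, for example open(path, newline="").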
Pathway file
MAGeCK pathway also requires a pathway file in GMT format. The GMT (Gene Matrix Transposed) format is a tab-delimited file format that describes gene sets and is consistent with the GMT files used in Gene Set Enrichment Analysis (GSEA). Each row represents one gene set: the first column contains the gene set name, the second column contains a description of the gene set, and the remaining columns contain the names or IDs of the genes in the set. You can download GMT pathway files directly from the GSEA MSigDB database. An example of the GMT format is as follows:
Gene Set Name | Description | Genes |
KEGG_RIBOSOME | http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_RIBOSOME | RPL35 RPL23 RPL3... |
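The row layout described above (name, description, then one gene per tab-separated field) can be read with a short parser. This is a minimal sketch and not code from MAGeCK or GSEA. The function name parse_gmt is hypothetical, and the example line contains only the three genes visible in the truncated example above.

```python
def parse_gmt(lines):
    """Parse GMT-formatted lines into {gene_set_name: {"description", "genes"}}."""
    gene_sets = {}
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(
                f"line {line_number}: expected name, description and at least one gene"
            )
        name, description = fields[0], fields[1]
        genes = [gene for gene in fields[2:] if gene]  # drop empty trailing fields
        gene_sets[name] = {"description": description, "genes": genes}
    return gene_sets


# Illustrative line; a real KEGG_RIBOSOME entry lists many more genes.
EXAMPLE_GMT = (
    "KEGG_RIBOSOME\t"
    "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_RIBOSOME\t"
    "RPL35\tRPL23\tRPL3\n"
)

if __name__ == "__main__":
    sets = parse_gmt(EXAMPLE_GMT.splitlines())
    for name, entry in sets.items():
        print(name, len(entry["genes"]), entry["genes"])
```

Running a parser like this over a downloaded GMT file is a cheap way to confirm that gene identifiers in the pathway file match the id column of the gene ranking file.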
Outputs
Pathway summary file
An example of the pathway summary output file is as follows:
id | num | neg|score | neg|rra | neg|p-value | neg|fdr | neg|rank | neg|goodgene | neg|lfc | pos|score | pos|rra | pos|p-value | pos|fdr | pos|rank | pos|goodgene | pos|lfc |
KEGG_RIBOSOME | 88 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 00 |
The contents of each column are as follows:
By default, genes are ranked by the p.neg field. To rank by the p.pos field instead, use the --sort-criteria option.
More Information
Overview of the MAGeCK algorithm
Briefly, read counts from different samples are first median-normalized to adjust for the effect of library sizes and read count distributions. Then the variance of read counts is estimated by sharing information across features, and a negative binomial (NB) model is used to test whether sgRNA abundance differs significantly between treatments and controls. This approach is similar to those used for differential RNA-Seq analysis. We rank sgRNAs based on P-values calculated from the NB model, and use a modified robust ranking aggregation (RRA) algorithm named α-RRA to identify positively or negatively selected genes. More specifically, α-RRA assumes that if a gene has no effect on selection, then sgRNAs targeting this gene should be uniformly distributed across the ranked list of all the sgRNAs. α-RRA ranks genes by comparing the skew in rankings to the uniform null model, and prioritizes genes whose sgRNA rankings are consistently higher than expected. α-RRA calculates the statistical significance of the skew by permutation, and a detailed description of the algorithm is presented in the Materials and methods section of the MAGeCK paper. Finally, MAGeCK reports positively and negatively selected pathways by applying α-RRA to the rankings of genes in a pathway.
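The uniform null model described above can be illustrated with a small Python sketch. It is not MAGeCK's implementation. It assumes the general RRA idea that each item's rank is converted to a percentile in (0, 1], that the k-th smallest percentile is compared with the k-th order statistic of uniform draws, and that only percentiles at or below a threshold alpha contribute to the score. The function names, the default alpha of 0.25, and the permutation settings are illustrative assumptions. The same calculation applies at the pathway level if the percentiles are those of the genes in a pathway.

```python
import math
import random


def order_stat_pvalue(k, n, r):
    """P(k-th smallest of n uniform(0,1) draws <= r), computed as a binomial upper tail."""
    return sum(math.comb(n, j) * r**j * (1 - r) ** (n - j) for j in range(k, n + 1))


def rho_score(percentiles, alpha=0.25):
    """Smallest order-statistic p-value among percentiles at or below alpha.

    Returns 1.0 when no percentile passes the alpha threshold.
    """
    ranks = sorted(percentiles)
    n = len(ranks)
    top = [r for r in ranks if r <= alpha]
    if not top:
        return 1.0
    return min(order_stat_pvalue(k, n, r) for k, r in enumerate(top, start=1))


def permutation_pvalue(percentiles, alpha=0.25, n_perm=1000, seed=0):
    """Estimate significance by comparing the observed score with uniform random draws."""
    rng = random.Random(seed)
    observed = rho_score(percentiles, alpha)
    n = len(percentiles)
    hits = sum(
        rho_score([rng.random() for _ in range(n)], alpha) <= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    skewed = [0.01, 0.02, 0.03, 0.5]   # most items near the top of the ranking
    spread = [0.2, 0.4, 0.6, 0.8]      # items spread evenly across the ranking
    for label, values in (("skewed", skewed), ("spread", spread)):
        print(label, rho_score(values), permutation_pvalue(values))
```

A set whose members cluster near the top of the ranked list gets a much smaller score than one whose members are spread evenly, which is the skew the text refers to.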
For more information on using MAGeCK, see the MAGeCK website.