What it does
Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout (MAGeCK) is a computational tool for identifying important genes from genome-scale CRISPR-Cas9 knockout (GeCKO) screens. It can be used to prioritize single-guide RNAs (sgRNAs), genes and pathways in these screens. MAGeCK identifies positively and negatively selected genes simultaneously and reports robust results across different experimental conditions. MAGeCK is developed and maintained by Wei Li and Han Xu from Prof. Xiaole Shirley Liu's lab at the Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute and Harvard School of Public Health. MAGeCK has also been used to identify functional lncRNAs from screens, with a validation rate close to 100%.
Inputs
Read files
MAGeCK count accepts one or more FASTQ.GZ, FASTQ or BAM files as input.
Since version 0.5.5, the MAGeCK count module can collect read counts from BAM files. This lets you use a third-party aligner to map reads to the library with mismatches, which provides more usable reads for the analysis. However, using the FASTQ files directly in the count module (which does not allow any mismatches) is still recommended.
It is also possible to input a Count Table to normalize counts and get statistics.
sgRNA library file
When starting from FASTQ, FASTQ.GZ or BAM files, MAGeCK needs to know the sgRNA sequences and the genes they target. This information is provided in the sgRNA library file, which can be specified in the tool form above. The sgRNA library file can be provided in .tsv or .csv format. It has three columns: the sgRNA ID, the sgRNA sequence, and the gene the sgRNA targets.
Example:
sgRNA ID  Sequence              Gene
s_10007   TGTTCACAGTATAGTTTGCC  CCNA1
s_10008   TTCTCCCTAATTGCTTGCTG  CCNA1
s_10027   ACATGTTGCTTCCCCTTGCA  CCNC
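To make the three-column layout concrete, here is a minimal Python sketch that reads a tab-separated library file into a lookup keyed by sgRNA sequence. It is an illustration only, not MAGeCK's own parser; the function name `read_library` and the rule for skipping a header row are assumptions.

```python
import csv
import io


def read_library(handle, delimiter="\t"):
    """Return {sequence: (sgrna_id, gene)} from a three-column library file.

    Hypothetical helper for illustration; MAGeCK's own parsing may differ.
    """
    library = {}
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 3:
            continue  # skip blank or incomplete lines
        sgrna_id, sequence, gene = (field.strip() for field in row[:3])
        sequence = sequence.upper()
        if not sequence or set(sequence) - set("ACGTN"):
            continue  # skip a header row or any non-nucleotide entry
        library[sequence] = (sgrna_id, gene)
    return library


# The example rows from above, with a header line, as tab-separated text.
example = io.StringIO(
    "sgRNA ID\tSequence\tGene\n"
    "s_10007\tTGTTCACAGTATAGTTTGCC\tCCNA1\n"
    "s_10008\tTTCTCCCTAATTGCTTGCTG\tCCNA1\n"
    "s_10027\tACATGTTGCTTCCCCTTGCA\tCCNC\n"
)
library = read_library(example)
```

For a .csv library, the same sketch would be called with `delimiter=","`.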
Control sgRNA file
The optional Control sgRNAs file is used to generate the null distribution when calculating p values. If this option is not specified, MAGeCK generates the null distribution of RRA scores by assuming that all genes in the library are non-essential (see More Information below). This approach is sometimes over-conservative, and you can improve on it if you know that some genes are not essential. If you provide the IDs of the sgRNAs targeting those genes, MAGeCK can estimate p values more accurately. To use this option, prepare a text file listing the IDs of the control sgRNAs, one sgRNA ID per line.
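One way to prepare such a file is to select, from the library, the IDs of all sgRNAs that target genes you consider non-essential. The Python sketch below does this under stated assumptions: the helper name `control_ids` is hypothetical, and the gene used here (CCNC) is an arbitrary placeholder taken from the library example, not a statement about its essentiality.

```python
def control_ids(library_rows, nonessential_genes):
    """Return the sgRNA IDs whose target gene is in nonessential_genes.

    library_rows: iterable of (sgrna_id, sequence, gene) tuples.
    Hypothetical helper for illustration only.
    """
    wanted = set(nonessential_genes)
    return [sgrna_id for sgrna_id, _sequence, gene in library_rows if gene in wanted]


rows = [
    ("s_10007", "TGTTCACAGTATAGTTTGCC", "CCNA1"),
    ("s_10008", "TTCTCCCTAATTGCTTGCTG", "CCNA1"),
    ("s_10027", "ACATGTTGCTTCCCCTTGCA", "CCNC"),
]

# Placeholder gene set; replace with genes known to be non-essential in your screen.
ids = control_ids(rows, {"CCNC"})

# One sgRNA ID per line, as the Control sgRNAs file expects.
control_text = "\n".join(ids) + "\n"
```

Writing `control_text` to a plain text file gives a file in the one-ID-per-line layout described above.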
Outputs
This tool outputs
- an sgRNA Counts table
Optionally, under Output Options you can choose to output
- a Count Summary file
- a PDF report
- a Normalized Counts table
- an Unmapped reads file
- the .R and .Rnw files used to generate the plots and PDF
- a Log file of the analysis
sgRNA Count file
An example of the sgRNA count output file is shown below. This file can be used with MAGeCK test.
Example:
sgRNA           Gene  Sample1  Sample2
A1CF_m52595977  A1CF  213      199
A1CF_m52596017  A1CF  294      164
A1CF_m52596056  A1CF  421      378
A1CF_m52603842  A1CF  274      281
A1CF_m52603847  A1CF  0        0
Count Summary
MAGeCK can produce a Count Summary file containing statistics of the input files (the statistics of fastq files are also in the PDF report). An example count summary file is shown below.
Example:
File        Label  Reads  Mapped  Percentage  TotalsgRNAs  Zerocounts  GiniIndex  NegSelQC  NegSelQCPval  NegSelQCPvalPermutation  NegSelQCPvalPermutationFDR  NegSelQCGene
InputFile1  L1     2500   1453    0.5812      2550         1276        0.5267     0         1             1                        1                           0.0
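In the example row, Percentage is Mapped divided by Reads (1453 / 2500 = 0.5812). The Python sketch below shows how several of these columns could be derived from the per-sgRNA counts of one sample. It is an approximation for illustration: the function names are hypothetical, the Gini index uses the textbook formula on raw counts (MAGeCK may transform the counts first), and the total read number of 2000 is made up.

```python
def gini(values):
    """Textbook Gini index of non-negative values (0 means perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n < 2 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


def summarize(total_reads, counts):
    """Derive summary columns from one sample's sgRNA counts (illustrative only)."""
    mapped = sum(counts)
    return {
        "Reads": total_reads,
        "Mapped": mapped,
        "Percentage": mapped / total_reads if total_reads else 0.0,
        "TotalsgRNAs": len(counts),
        "Zerocounts": sum(1 for c in counts if c == 0),
        "GiniIndex": gini(counts),
    }


# Sample1 counts from the sgRNA count example; 2000 total reads is an assumed value.
sample1_counts = [213, 294, 421, 274, 0]
stats = summarize(2000, sample1_counts)
```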
More Information
Overview of the MAGeCK algorithm
Briefly, read counts from different samples are first median-normalized to adjust for the effect of library sizes and read count distributions. Then the variance of read counts is estimated by sharing information across features, and a negative binomial (NB) model is used to test whether sgRNA abundance differs significantly between treatments and controls. This approach is similar to those used for differential RNA-Seq analysis. We rank sgRNAs based on P-values calculated from the NB model, and use a modified robust ranking aggregation (RRA) algorithm named α-RRA to identify positively or negatively selected genes. More specifically, α-RRA assumes that if a gene has no effect on selection, then sgRNAs targeting this gene should be uniformly distributed across the ranked list of all the sgRNAs. α-RRA ranks genes by comparing the skew in rankings to the uniform null model, and prioritizes genes whose sgRNA rankings are consistently higher than expected. α-RRA calculates the statistical significance of the skew by permutation, and a detailed description of the algorithm is presented in the Materials and methods section of the MAGeCK paper. Finally, MAGeCK reports positively and negatively selected pathways by applying α-RRA to the rankings of genes in a pathway.
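The median normalization step can be sketched in Python as follows. This is a minimal illustration of a median-of-ratios scheme, assuming each sample's size factor is the median, over sgRNAs, of the count divided by that sgRNA's geometric mean across samples; the function names are hypothetical and MAGeCK's implementation may handle zero counts and other details differently.

```python
import math
from statistics import median


def median_ratio_size_factors(counts):
    """counts: {sgrna_id: [count per sample]}; returns one size factor per sample."""
    n_samples = len(next(iter(counts.values())))
    ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if any(c <= 0 for c in row):
            continue  # the geometric mean would be zero, so skip this sgRNA
        geo_mean = math.exp(sum(math.log(c) for c in row) / n_samples)
        for j, c in enumerate(row):
            ratios[j].append(c / geo_mean)
    # Fall back to 1.0 if no sgRNA had non-zero counts in every sample.
    return [median(r) if r else 1.0 for r in ratios]


def normalize(counts):
    """Divide every count by its sample's size factor."""
    factors = median_ratio_size_factors(counts)
    return {k: [c / f for c, f in zip(row, factors)] for k, row in counts.items()}


# Toy data: the second sample was sequenced twice as deeply as the first.
toy = {"sg1": [10, 20], "sg2": [100, 200], "sg3": [5, 10]}
normalized = normalize(toy)
```

After normalization, the two samples in the toy data have matching counts for every sgRNA, which is the library-size adjustment the paragraph above describes.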
MAGeCK FAQs
The 5' trim length option can only trim a fixed number of nucleotides before the sgRNA, but what if the trimming length differs between reads?
MAGeCK can determine the trimming length automatically, even when it differs between reads in the same FASTQ file. Alternatively, you can use cutadapt to trim adaptor sequences of variable length before running MAGeCK.
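To illustrate why a variable trimming length is not a fundamental obstacle, the Python sketch below scans each read for the first exact match to a library sequence and records the offset at which it was found. It is not MAGeCK's detection procedure; the function name, the fixed sgRNA length of 20, and the flanking sequences in the example reads are all assumptions made for illustration.

```python
from collections import Counter


def count_with_variable_offset(reads, library, sgrna_len=20):
    """Count exact sgRNA matches at any position in each read.

    library: {sequence: sgrna_id}. Returns (counts per sgRNA ID, offsets seen).
    Hypothetical helper for illustration only.
    """
    counts = Counter()
    offsets = Counter()
    for read in reads:
        read = read.upper()
        for start in range(len(read) - sgrna_len + 1):
            hit = library.get(read[start:start + sgrna_len])
            if hit is not None:
                counts[hit] += 1
                offsets[start] += 1
                break  # take the first match in the read
    return counts, offsets


library = {
    "TGTTCACAGTATAGTTTGCC": "s_10007",
    "ACATGTTGCTTCCCCTTGCA": "s_10027",
}

# Made-up reads: arbitrary 5' prefixes of different lengths and an arbitrary 3' flank.
reads = [
    "ACGT" + "TGTTCACAGTATAGTTTGCC" + "GTTTTAGA",
    "AC" + "TGTTCACAGTATAGTTTGCC" + "GTTTTAGA",
    "ACGTAC" + "ACATGTTGCTTCCCCTTGCA" + "GTTTTAGA",
    "ACGTACGTACGTACGTACGTACGTACGT",  # contains no library sequence
]
counts, offsets = count_with_variable_offset(reads, library)
```

Here `offsets` shows which 5' lengths occur in the data, which is also a quick way to check whether a single fixed trim length would have been sufficient.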
How do I get simple statistics for my input files?
MAGeCK produces a Count Summary file containing the statistics of the input files. The same statistics are included in the PDF report and in the log file for MAGeCK count.
How do I know the quality of my samples?
For basic QC, look at the sample statistics in the Count Summary file.
For more information on using MAGeCK, see the MAGeCK website.