
What it does

Given a counts matrix, or a set of counts files, for example from featureCounts, and optional information about the genes, this tool produces plots and tables useful in the analysis of differential gene expression.

This tool uses the edgeR quasi-likelihood pipeline (edgeR-quasi) for differential expression analysis. This statistical methodology uses negative binomial generalized linear models, but with F-tests instead of likelihood ratio tests. This method provides stricter error rate control than other negative binomial based pipelines, including the traditional edgeR pipelines or DESeq2. While the limma pipelines are recommended for large-scale datasets because of their speed and flexibility, the edgeR-quasi pipeline gives better performance in low-count situations. For the data analyzed in the edgeR workflow article, the edgeR-quasi, limma-voom and limma-trend pipelines are all equally suitable and give similar results.

Inputs

Counts Data:

The counts data can either be input as separate counts files (one sample per file) or a single count matrix (one sample per column). The rows correspond to genes, and columns correspond to the counts for the samples. Values must be tab separated, with the first row containing the sample/column labels and the first column containing the row/gene labels. The sample labels must start with a letter. Gene identifiers can be of any type but must be unique and not repeated within a counts file.

Example - Separate Count Files:

GeneID	WT1
11287	1699
11298	1905
11302	6
11303	2099
11304	356
11305	2528

Example - Single Count Matrix:

GeneID	WT1	WT2	WT3	Mut1	Mut2	Mut3
11287	1699	1528	1601	1463	1441	1495
11298	1905	1744	1834	1345	1291	1346
11302	6	8	7	5	6	5
11303	2099	1974	2100	1574	1519	1654
11304	356	312	337	361	397	346
11305	2528	2438	2493	1762	1942	2027
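The format rules above (tab-separated values, a header row of sample labels that start with a letter, unique gene IDs in the first column) can be checked before uploading. The following is a minimal sketch using only the Python standard library; the function name read_count_matrix is hypothetical and is not part of the tool.

```python
import csv
import io
import re


def read_count_matrix(text):
    """Parse a tab-separated count matrix and check the format rules.

    Hypothetical helper for illustration only; it is not part of the tool.
    Returns (sample_labels, {gene_id: [counts...]}).
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    samples = header[1:]
    for name in samples:
        # Sample labels must start with a letter.
        if not re.match(r"[A-Za-z]", name):
            raise ValueError(f"sample label must start with a letter: {name!r}")
    counts = {}
    for row in body:
        gene_id, values = row[0], row[1:]
        # Gene IDs must not be repeated within a file.
        if gene_id in counts:
            raise ValueError(f"duplicate gene ID: {gene_id}")
        if len(values) != len(samples):
            raise ValueError(f"wrong number of columns for gene {gene_id}")
        counts[gene_id] = [int(v) for v in values]
    return samples, counts


example = (
    "GeneID\tWT1\tWT2\tWT3\tMut1\tMut2\tMut3\n"
    "11287\t1699\t1528\t1601\t1463\t1441\t1495\n"
    "11302\t6\t8\t7\t5\t6\t5\n"
)
samples, counts = read_count_matrix(example)
print(samples)          # ['WT1', 'WT2', 'WT3', 'Mut1', 'Mut2', 'Mut3']
print(counts["11302"])  # [6, 8, 7, 5, 6, 5]
```

A file with one sample per file has the same shape with a single count column, so the same check applies to it.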

Gene Annotations: Optional input for gene annotations, which can contain more information about the genes than just an ID number. The annotations will be available in the differential expression results table and the optional normalised counts table. The file must contain a header row and have the gene IDs in the first column. The number of rows should match that of the counts files; add NA for any gene IDs with no annotation. The Galaxy tool annotateMyIDs can be used to obtain annotations for human, mouse, fly and zebrafish.

Example:

GeneID	Symbol	GeneName
11287	Pzp	pregnancy zone protein
11298	Aanat	arylalkylamine N-acetyltransferase
11302	Aatk	apoptosis-associated tyrosine kinase
11303	Abca1	ATP-binding cassette, sub-family A (ABC1), member 1
11304	Abca4	ATP-binding cassette, sub-family A (ABC1), member 4
11305	Abca2	ATP-binding cassette, sub-family A (ABC1), member 2
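One way to satisfy the requirement that the annotation rows match the counts rows is to pad missing genes with NA before uploading. The sketch below assumes the annotations are already held in a dictionary keyed by gene ID; the helper name align_annotations and the gene ID 99999 are made up for illustration.

```python
def align_annotations(gene_ids, annotations, n_fields):
    """Return one annotation row per gene ID, in the order of the counts file.

    Genes without an annotation get "NA" in every annotation field.
    Hypothetical helper for illustration only.
    """
    return [[gid] + annotations.get(gid, ["NA"] * n_fields) for gid in gene_ids]


# Gene IDs in the order they appear in the counts file.
# "99999" is an invented ID that has no annotation.
gene_ids = ["11287", "11298", "99999"]
annotations = {
    "11287": ["Pzp", "pregnancy zone protein"],
    "11298": ["Aanat", "arylalkylamine N-acetyltransferase"],
}

aligned = align_annotations(gene_ids, annotations, n_fields=2)
print("GeneID\tSymbol\tGeneName")
for row in aligned:
    print("\t".join(row))
```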

Factor Information: Enter factor names and groups in the tool form, or provide a tab-separated file that has the names of the samples in the first column and one header row. The sample names must be the same as the names in the columns of the count matrix. The second column should contain the primary factor levels (e.g. WT, Mut) with optional additional columns for any secondary factors.

Example:

Sample	Genotype	Batch
WT1	WT	b1
WT2	WT	b2
WT3	WT	b3
Mut1	Mut	b1
Mut2	Mut	b2
Mut3	Mut	b3
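Because the sample names in the factor file must be the same as the column names of the count matrix, a quick comparison of the two can catch mismatches early. This is a minimal sketch; the function name check_factor_samples is hypothetical.

```python
def check_factor_samples(count_samples, factor_rows):
    """Compare count-matrix column names with the first column of the factor file.

    Returns (missing, extra): samples absent from the factor file, and
    factor-file samples that are not columns of the count matrix.
    Hypothetical helper for illustration only.
    """
    factor_samples = [row[0] for row in factor_rows]
    missing = [s for s in count_samples if s not in factor_samples]
    extra = [s for s in factor_samples if s not in count_samples]
    return missing, extra


count_samples = ["WT1", "WT2", "WT3", "Mut1", "Mut2", "Mut3"]
factor_rows = [
    ["WT1", "WT", "b1"],
    ["WT2", "WT", "b2"],
    ["WT3", "WT", "b3"],
    ["Mut1", "Mut", "b1"],
    ["Mut2", "Mut", "b2"],
    ["Mut3", "Mut", "b3"],
]
missing, extra = check_factor_samples(count_samples, factor_rows)
print(missing, extra)  # [] []
```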

Factor Name: The name of the experimental factor being investigated, e.g. Genotype, Treatment. One factor must be entered; its name should start with a letter and must not contain spaces. Optionally, additional factors can be included. These are variables that might influence your experiment, e.g. Batch, Gender, Subject. If additional factors are entered, an additive linear model will be used.
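In an additive linear model, each factor contributes its own columns to the design matrix and no interaction terms are included. The sketch below builds such a matrix with an intercept and indicator columns, taking the first level seen as the reference. This coding is an assumption chosen for illustration; the parameterisation the tool actually uses may differ, and the function name additive_design is hypothetical.

```python
def additive_design(factors):
    """Build an additive design matrix from {factor_name: [level per sample]}.

    Illustrative coding only: an intercept column, then one 0/1 indicator
    column for every non-reference level of each factor. The first level
    encountered is treated as the reference level.
    Returns (column_names, rows) with one row per sample.
    """
    n_samples = len(next(iter(factors.values())))
    names = ["Intercept"]
    columns = [[1] * n_samples]
    for factor_name, values in factors.items():
        levels = list(dict.fromkeys(values))  # unique levels, first-seen order
        for level in levels[1:]:
            names.append(f"{factor_name}{level}")
            columns.append([1 if v == level else 0 for v in values])
    rows = [list(row) for row in zip(*columns)]
    return names, rows


factors = {
    "Genotype": ["WT", "WT", "WT", "Mut", "Mut", "Mut"],
    "Batch": ["b1", "b2", "b3", "b1", "b2", "b3"],
}
names, rows = additive_design(factors)
print(names)    # ['Intercept', 'GenotypeMut', 'Batchb2', 'Batchb3']
print(rows[4])  # sample Mut2 in batch b2: [1, 1, 1, 0]
```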

Groups: The names of the groups for the factor. The names should start with a letter and contain only letters, numbers and underscores; other characters such as spaces and hyphens must not be used. If entered into the tool form above, the group names must be separated by commas and listed in the same order as the corresponding samples appear in the columns of the counts matrix.
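The naming and ordering rules for groups can be expressed as a short check. This sketch assumes the groups are given as a single comma-separated string, as in the tool form; the function name parse_groups is hypothetical.

```python
import re

# A name starts with a letter and contains only letters, numbers and underscores.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def parse_groups(groups_text, samples):
    """Map each sample to its group, given comma-separated group names.

    The groups must be listed in the same order as the sample columns.
    Hypothetical helper for illustration only.
    """
    groups = groups_text.split(",")
    if len(groups) != len(samples):
        raise ValueError("need exactly one group name per sample")
    for group in groups:
        if not NAME_RE.fullmatch(group):
            raise ValueError(f"invalid group name: {group!r}")
    return dict(zip(samples, groups))


samples = ["WT1", "WT2", "WT3", "Mut1", "Mut2", "Mut3"]
sample_groups = parse_groups("WT,WT,WT,Mut,Mut,Mut", samples)
print(sample_groups["Mut1"])  # Mut
```

A name such as wild-type would be rejected here because of the hyphen; wild_type would be accepted.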

Formula: By default, the tool constructs a formula for modelling counts based on the contents of the factors file or the factors given. This can be overridden by directly providing the edgeR formula in the section named Formula.

Contrasts of Interest: The contrasts you wish to make between levels. A common contrast would be a simple difference between two levels: "Mut-WT" represents the difference between the mutant and wild type genotypes. Multiple contrasts must be entered separately using the Insert Contrast button. Spaces must not be used.

Alternatively, you can specify a file with contrasts. The file must contain a header (its value is irrelevant) and one contrast per line in the first column (other columns are ignored). If using this option, make sure to remove any contrast section from the manual part, or the tool will fail.
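To make the contrast notation concrete, the sketch below reads a contrasts file in the layout described above and turns a simple two-level difference such as "Mut-WT" into a vector of +1/-1 weights over the group levels. It handles only this simple case and is not how the tool itself parses contrasts; both function names are hypothetical.

```python
def read_contrasts_file(text):
    """Return the contrasts from the first column, skipping the header line."""
    lines = [line for line in text.splitlines() if line.strip()]
    return [line.split("\t")[0] for line in lines[1:]]


def simple_contrast(contrast, levels):
    """Turn 'A-B' into weights over levels: +1 for A, -1 for B, 0 otherwise.

    Illustrative only: supports a plain difference between two levels.
    """
    if " " in contrast:
        raise ValueError("contrasts must not contain spaces")
    parts = contrast.split("-")
    if len(parts) != 2 or parts[0] not in levels or parts[1] not in levels:
        raise ValueError(f"not a simple difference of two known levels: {contrast!r}")
    left, right = parts
    return [1 if lv == left else -1 if lv == right else 0 for lv in levels]


file_text = "Contrasts\nMut-WT\tthis column is ignored\n"
contrasts = read_contrasts_file(file_text)
print(contrasts)                                     # ['Mut-WT']
print(simple_contrast(contrasts[0], ["WT", "Mut"]))  # [-1, 1]
```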

Filter Low Counts: Genes with very low counts across all libraries provide little evidence for differential expression. From a biological point of view, a gene must be expressed at some minimal level before it is likely to be translated into a protein or to be biologically important. In addition, the pronounced discreteness of these counts interferes with some of the statistical approximations that are used later in the pipeline. These genes should be filtered out prior to further analysis. As a rule of thumb, genes are dropped if they can’t possibly be expressed in all the samples for any of the conditions. Users can set their own definition of genes being expressed. Usually a gene is required to have a count of 5-10 in a library to be considered expressed in that library. Users should also filter on counts-per-million (CPM) rather than on the counts directly, as the latter does not account for differences in library sizes between samples.

Option to ignore genes that do not show significant levels of expression. This filtering depends on two criteria: CPM/count and number of samples. You can filter on either CPM (Minimum CPM) or count (Minimum Count) values:

Minimum CPM: This is the minimum count per million that a gene must have in at least the number of samples specified under Minimum Samples.

Minimum Count: This is the minimum count that a gene must have. It can be combined with either Filter on Total Count or Minimum Samples.

Filter on Total Count: This can be used with the Minimum Count filter to keep genes with a minimum total read count.

Minimum Samples: This is the number of samples in which the Minimum CPM/Count requirement must be met in order for that gene to be kept.

If the Minimum Samples filter is applied, only genes that exhibit a CPM/count greater than the required amount in at least the number of samples specified will be used for analysis. Care should be taken to ensure that the sample requirement is appropriate. For example, in an experiment with two experimental groups of two samples each, a gene that goes from an insignificant CPM/count in one group to a significant CPM/count in the other meets the requirement in only two samples, so setting the sample requirement to 3 will cause that gene to fail the criteria. When in doubt, simply do not filter, or consult the edgeR workflow article for filtering recommendations.
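The interaction between Minimum CPM and Minimum Samples can be illustrated with a small calculation. The sketch below computes CPM as count divided by library size times one million and keeps a gene when enough samples reach the minimum. The counts and library sizes are invented, the comparison is assumed to be "at least", and edgeR's own implementation may differ in detail.

```python
def cpm(counts, lib_sizes):
    """Counts per million for one gene across samples."""
    return [count / size * 1e6 for count, size in zip(counts, lib_sizes)]


def keep_gene(counts, lib_sizes, min_cpm, min_samples):
    """Keep the gene if its CPM reaches min_cpm in at least min_samples samples.

    Illustrative only; the ">=" comparison is an assumption.
    """
    passing = sum(value >= min_cpm for value in cpm(counts, lib_sizes))
    return passing >= min_samples


# Invented data: two groups of two samples, 2 million reads per library.
# The gene is silent in the first group and expressed in the second.
lib_sizes = [2_000_000] * 4
gene_counts = [0, 0, 20, 30]  # CPM of roughly 0, 0, 10, 15

print(keep_gene(gene_counts, lib_sizes, min_cpm=1, min_samples=2))  # True
print(keep_gene(gene_counts, lib_sizes, min_cpm=1, min_samples=3))  # False
```

With Minimum Samples set to 3, this gene is discarded even though it changes clearly between the groups, which is the situation the warning above describes.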

Advanced Options:

By default, the error rate for multiple testing is controlled using Benjamini and Hochberg's false discovery rate control at a threshold value of 0.05. However, there are options to change this to custom values.

Minimum log2-fold-change Required: In addition to meeting the requirement for the adjusted statistic for multiple testing, the observation must have an absolute log2-fold-change greater than this threshold to be considered significant and thus highlighted in the MD plot.

Adjusted Threshold: Set the threshold for the resulting value of the multiple testing control method. Only observations whose statistic falls below this value are considered significant and thus highlighted in the MD plot.

P-Value Adjustment Method: Change the multiple testing control method. The options are BH (1995) and BY (2001), which are both false discovery rate controls, and Holm (1979), which is a method for family-wise error rate control.
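As an illustration of how the adjusted threshold and the minimum log2-fold-change combine, the sketch below implements the Benjamini-Hochberg adjustment in plain Python and then flags genes that pass both criteria. The p-values and fold changes are invented, and the function names are hypothetical; the tool performs these steps in R.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_significant(adj_p, log_fc, threshold=0.05, min_lfc=0.0):
    """Adjusted p-value below the threshold and |log2FC| above the minimum."""
    return adj_p < threshold and abs(log_fc) > min_lfc


# Invented example values for four genes.
pvalues = [0.01, 0.04, 0.03, 0.005]
log_fcs = [1.5, -0.2, 2.0, -1.1]

adjusted = bh_adjust(pvalues)  # approximately [0.02, 0.04, 0.04, 0.02]
flags = [is_significant(p, fc, threshold=0.05, min_lfc=1.0)
         for p, fc in zip(adjusted, log_fcs)]
print(flags)  # [True, False, True, True]
```

The second gene passes the adjusted threshold but is not flagged, because its absolute log2-fold-change of 0.2 is below the minimum of 1.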

Normalisation Method: The most obvious technical factor that affects the read counts, other than gene expression levels, is the sequencing depth of each RNA sample. edgeR adjusts any differential expression analysis for varying sequencing depths as represented by differing library sizes. This is part of the basic modeling procedure and flows automatically into fold-change or p-value calculations. It is always present, and doesn’t require any user intervention.

The second most important technical influence on differential expression is one that is less obvious. RNA-seq provides a measure of the relative abundance of each gene in each RNA sample, but does not provide any measure of the total RNA output on a per-cell basis. This commonly becomes important when a small number of genes are very highly expressed in one sample, but not in another. The highly expressed genes can consume a substantial proportion of the total library size, causing the remaining genes to be under-sampled in that sample. Unless this RNA composition effect is adjusted for, the remaining genes may falsely appear to be down-regulated in that sample.

The edgeR calcNormFactors function normalizes for RNA composition by finding a set of scaling factors for the library sizes that minimize the log-fold changes between the samples for most genes. The default method for computing these scale factors uses a trimmed mean of M values (TMM) between each pair of samples. We call the product of the original library size and the scaling factor the effective library size. The effective library size replaces the original library size in all downstream analyses. TMM is the recommended method for most RNA-Seq data where the majority (more than half) of the genes are believed not differentially expressed between any pair of the samples. You can change the normalisation method under Advanced Options above. For more information, see the calcNormFactors section in the edgeR User's Guide.
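The RNA composition effect and the role of the effective library size can be shown with a toy calculation. In the sketch below, four genes are equally expressed in two samples, but a fifth gene takes up half of the reads in sample B. All counts are invented, and the scaling factors are hand-picked for illustration (chosen so that they multiply to 1); they are not computed by TMM.

```python
import math


def cpm(counts, lib_size):
    """Counts per million for one sample, given a (possibly effective) library size."""
    return [count / lib_size * 1e6 for count in counts]


# Invented counts. Genes 1-4 have the same true expression in both samples;
# gene 5 is highly expressed only in sample B and consumes half of its reads.
sample_a = [100, 100, 100, 100, 0]
sample_b = [50, 50, 50, 50, 200]
lib_a, lib_b = sum(sample_a), sum(sample_b)  # both 400

# Using raw library sizes, gene 1 falsely appears halved in sample B.
raw_ratio = cpm(sample_b, lib_b)[0] / cpm(sample_a, lib_a)[0]

# Hand-picked scaling factors (an assumption, not TMM output).
factor_a, factor_b = math.sqrt(2), 1 / math.sqrt(2)

# Effective library size = original library size x scaling factor.
eff_a, eff_b = lib_a * factor_a, lib_b * factor_b
adjusted_ratio = cpm(sample_b, eff_b)[0] / cpm(sample_a, eff_a)[0]

print(round(raw_ratio, 3))       # 0.5
print(round(adjusted_ratio, 3))  # 1.0
```

With the effective library sizes, genes 1 to 4 no longer appear down-regulated in sample B, while gene 5 remains clearly higher there.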

Robust Settings: Option to use robust settings. Using robust settings (robust=TRUE) with the edgeR estimateDisp and glmQLFit functions is usually recommended to protect against outlier genes. This is turned on by default. Note that it is only used with the quasi-likelihood F test method. For more information, see the edgeR workflow article.

Test Method: Option to use the likelihood ratio test instead of the quasi-likelihood F test. For more information, see the edgeR workflow article.

Outputs

This tool outputs:

a table of differentially expressed genes for each contrast of interest

an HTML report with plots and additional information

Optionally, under Output Options you can choose to output:

a normalised counts table

the R script used by this tool

an RData file

Citations

Please try to cite the appropriate articles when you publish results obtained using software, as such citation is the main means by which the authors receive credit for their work. For the edgeR method itself, please cite Robinson et al., 2010, and for this tool (which was developed from the Galaxy limma-voom tool) please cite Liu et al., 2015.
