
What it does

Given a matrix of counts and optional information about the genes, this tool produces plots and tables useful in the analysis of differential gene expression.

This tool depends on the R packages limma and edgeR, both part of the Bioconductor project. Please ensure that these packages are installed on the server running this tool.

Counts Data: A matrix of expression levels, with rows corresponding to genes and columns corresponding to samples; each entry is the feature count for that gene in that sample. Values must be tab-separated, and there must be a header row containing the sample (column) labels and a first column containing the gene (row) labels.

Example:

"GeneID"  "Smpl1"  "Smpl2"  "Smpl3"  "Smpl4"  "Smpl5"
"27395"   1699     1528     1463     1441     1495
"18777"   1905     1744     1345     1291     1346
"15037"   6        8        4        5        5
"21399"   2099     1974     1574     1519     1654
"58175"   356      312      347      361      346
"10866"   2528     2438     1762     1942     2027
"12421"   2182     2005     1786     1799     1858
"24069"   3        4        2        3        3
"31926"   1337     1380     1004     1102     1000
"71096"   0        0        2        1        6
"59014"   1466     1426     1296     1097     1175
...
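
To check a counts file before uploading it, the layout described above can be validated with a short script. This is a minimal sketch in Python, not part of the tool itself; the function name read_counts and the shortened three-sample example are illustrative assumptions.

```python
import csv
import io


def read_counts(text):
    """Parse a tab-separated counts matrix.

    Returns (sample_names, counts) where counts maps each gene label
    to a list of integer counts, one per sample.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t", quotechar='"')
    header = next(reader)
    samples = header[1:]  # the first column holds the gene labels
    counts = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        if len(row) != len(header):
            raise ValueError(
                f"row {row[0]!r} has {len(row)} fields, expected {len(header)}"
            )
        counts[row[0]] = [int(value) for value in row[1:]]
    return samples, counts


# Shortened, illustrative version of the example above (three samples only).
example = (
    '"GeneID"\t"Smpl1"\t"Smpl2"\t"Smpl3"\n'
    '"27395"\t1699\t1528\t1463\n'
    '"15037"\t6\t8\t4\n'
)
samples, counts = read_counts(example)
print(samples)
print(counts["15037"])
```

A row with a missing or extra field raises a ValueError, which catches the most common formatting mistake (values separated by spaces rather than tabs).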

Gene Annotations: Optional input for gene annotations; this can contain more information about the genes than just an ID number. The annotations will be available in the top differential expression table.

Example:

"GeneID"        "Length"        "EntrezID"      "Symbols"       "GeneName"      "Chr"
"11287" "11287" 4681    "11287" "Pzp"   "pregnancy zone protein"        "6"
"11298" "11298" 1455    "11298" "Aanat" "arylalkylamine N-acetyltransferase"    "11"
"11302" "11302" 5743    "11302" "Aatk"  "apoptosis-associated tyrosine kinase"  "11"
"11303" "11303" 10260   "11303" "Abca1" "ATP-binding cassette, sub-family A (ABC1), member 1"   "4"
"11304" "11304" 7248    "11304" "Abca4" "ATP-binding cassette, sub-family A (ABC1), member 4"   "3"
"11305" "11305" 8061    "11305" "Abca2" "ATP-binding cassette, sub-family A (ABC1), member 2"   "2"
...

Factor Name: The name of the factor being investigated. This tool currently assumes that only one factor is of interest.

Factor Levels: The levels of the factor of interest. These must be entered in the same order as the samples to which they correspond, as listed in the columns of the counts matrix.

The values should be separated by commas, and spaces must not be used.
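
The rules for this field (comma-separated, no spaces, one level per sample, same order as the count columns) can be sketched as a small check. This is an illustrative Python sketch only; parse_factor_levels is a hypothetical helper, and the tool's own validation may differ.

```python
def parse_factor_levels(levels_text, sample_names):
    """Map each sample to its factor level, enforcing the input rules."""
    if " " in levels_text:
        raise ValueError("factor levels must not contain spaces")
    levels = levels_text.split(",")
    if "" in levels:
        raise ValueError("empty factor level found (check for stray commas)")
    if len(levels) != len(sample_names):
        raise ValueError(
            f"{len(levels)} levels given for {len(sample_names)} samples"
        )
    # Levels are paired with samples purely by position.
    return dict(zip(sample_names, levels))


samples = ["Smpl1", "Smpl2", "Smpl3", "Smpl4"]
assignment = parse_factor_levels("WT,WT,Mut,Mut", samples)
print(assignment)
```

Because the pairing is positional, entering the levels in the wrong order assigns samples to the wrong groups without raising any error, so the order is worth checking against the column headers.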

Contrasts of Interest: The contrasts you wish to make between levels.

A common contrast is a simple difference between two levels: for example, "Mut-WT" represents the difference between the mutant and wild-type genotypes.

The values should be separated by commas, and spaces must not be used.
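
As an illustration of the format, the sketch below splits a contrast string into (numerator, denominator) level pairs and checks that each level was actually declared. It is a Python sketch under stated assumptions: it handles only simple two-level differences such as "Mut-WT", parse_contrasts is a hypothetical helper, and the level "KO" is invented for the example. The tool itself may accept more general contrast expressions.

```python
def parse_contrasts(contrasts_text, known_levels):
    """Split 'A-B,C-D' into [('A', 'B'), ('C', 'D')], validating each level."""
    if " " in contrasts_text:
        raise ValueError("contrasts must not contain spaces")
    pairs = []
    for contrast in contrasts_text.split(","):
        parts = contrast.split("-")
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"not a simple two-level difference: {contrast!r}")
        for level in parts:
            if level not in known_levels:
                raise ValueError(f"unknown factor level: {level!r}")
        pairs.append((parts[0], parts[1]))
    return pairs


# "KO" is a made-up third level, used only to show a second contrast.
pairs = parse_contrasts("Mut-WT,KO-WT", {"WT", "Mut", "KO"})
print(pairs)
```

In "Mut-WT", a positive log2-fold-change means higher expression in Mut than in WT.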

Filter Low CPM: Option to ignore genes that do not show an appreciable level of expression. The filter depends on two criteria:

Minimum CPM: The counts per million (CPM) that a gene must exceed in at least the specified number of samples.

Minimum Samples: The number of samples in which the CPM requirement must be met for the gene to be included in the analysis.

Only genes with a CPM greater than the minimum in at least the specified number of samples are used for analysis. Take care that the sample requirement suits the experimental design. For example, in an experiment with two groups of two samples each, a gene that is effectively unexpressed in one group and clearly expressed in the other exceeds the minimum CPM in only two samples; if the sample requirement is set to 3, that gene is discarded even though it is a strong candidate for differential expression. When in doubt, do not filter.
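
The filtering rule and the two-groups-of-two pitfall can be made concrete with a small sketch. This is illustrative Python, not the tool's code: it assumes plain CPM (count divided by the sample's total count, times one million) with no normalisation factors, and the gene names and counts are invented so that every library size is exactly one million.

```python
def cpm(counts):
    """Convert {gene: [count per sample]} to counts per million per sample."""
    n_samples = len(next(iter(counts.values())))
    # Library size of a sample = sum of its counts over all genes.
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [1e6 * c / size for c, size in zip(row, lib_sizes)]
        for gene, row in counts.items()
    }


def filter_low_cpm(counts, min_cpm, min_samples):
    """Keep genes whose CPM exceeds min_cpm in at least min_samples samples."""
    kept = {}
    for gene, values in cpm(counts).items():
        n_pass = sum(value > min_cpm for value in values)
        if n_pass >= min_samples:
            kept[gene] = counts[gene]
    return kept


# Two groups of two samples; geneA is switched on only in the second group.
counts = {
    "geneA": [0, 0, 50, 60],
    "geneB": [100, 120, 110, 90],
    "other": [999900, 999880, 999840, 999850],  # pads each library to 1e6
}
print(sorted(filter_low_cpm(counts, min_cpm=10, min_samples=2)))
print(sorted(filter_low_cpm(counts, min_cpm=10, min_samples=3)))
```

With a sample requirement of 2, geneA is kept; with a requirement of 3, it is dropped, although it differs clearly between the groups.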

Normalisation Method: Option for using different methods to rescale the raw library sizes. For more information, see the calcNormFactors section in the edgeR User's Guide.

Apply Sample Weights: Option to downweight outlier samples so that their information is still used in the statistical analysis but their impact is reduced. Use this whenever significant outliers are present. The MDS plotting tool in this package is useful for identifying outliers.

Use Advanced Testing Options?: By default, the error rate for multiple testing is controlled using Benjamini and Hochberg's false discovery rate method at a threshold of 0.05. These options allow custom values to be used instead.

P-Value Adjustment Method: Change the multiple testing control method. The options are BH (1995) and BY (2001), which both control the false discovery rate, and Holm (1979), which controls the family-wise error rate.
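
To show how a false discovery rate method differs from a family-wise method, the sketch below implements the standard BH and Holm adjustments in plain Python. It is an illustration of the published procedures, not the tool's code (an R-based tool would normally rely on R's built-in p.adjust function), and the four p-values are invented.

```python
def adjust_bh(pvalues):
    """Benjamini-Hochberg adjusted p-values (false discovery rate control)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def adjust_holm(pvalues):
    """Holm adjusted p-values (family-wise error rate control)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_max = 0.0
    # Walk from the smallest p-value up, enforcing monotonicity.
    for rank, i in enumerate(order, start=1):
        running_max = max(running_max, min(1.0, pvalues[i] * (n - rank + 1)))
        adjusted[i] = running_max
    return adjusted


pvalues = [0.01, 0.04, 0.03, 0.005]  # invented values for illustration
print([round(p, 4) for p in adjust_bh(pvalues)])
print([round(p, 4) for p in adjust_holm(pvalues)])
```

Holm never gives a smaller adjusted value than BH for the same input, so it flags fewer genes at a given threshold.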

Adjusted Threshold: Set the threshold for the adjusted p-value produced by the multiple testing control method. Only observations whose adjusted p-value falls below this threshold are considered significant and are highlighted in the MA plot.

Minimum log2-fold-change Required: In addition to meeting the adjusted p-value threshold, an observation must have an absolute log2-fold-change greater than this value to be considered significant and highlighted in the MA plot.
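
The two requirements combine as a logical AND: the adjusted value must fall below the threshold, and the absolute log2-fold-change must exceed the minimum. A minimal Python sketch of that rule follows; the function name, its default values, and the gene results are all invented for illustration and are not taken from the tool.

```python
def is_significant(adj_p, log2_fc, threshold=0.05, min_lfc=0.0):
    """Apply both criteria: small adjusted p-value AND large enough change."""
    return adj_p < threshold and abs(log2_fc) > min_lfc


# Hypothetical results: gene -> (adjusted p-value, log2-fold-change).
results = {
    "geneA": (0.01, 1.8),   # passes both criteria
    "geneB": (0.01, 0.3),   # significant p-value, but change too small
    "geneC": (0.20, -2.5),  # large change, but p-value too high
    "geneD": (0.03, -1.2),  # passes both; the sign of the change is ignored
}
highlighted = [
    gene
    for gene, (adj_p, lfc) in results.items()
    if is_significant(adj_p, lfc, threshold=0.05, min_lfc=1.0)
]
print(highlighted)
```

Because the absolute value is used, up-regulated and down-regulated genes are treated alike.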

Citations:

limma

Please cite the paper below for the limma software itself. Please also try to cite the appropriate methodology articles that describe the statistical methods implemented in limma, depending on which limma functions you are using. The methodology articles are listed in Section 2.1 of the limma User's Guide.

Smyth, GK (2005). Limma: linear models for microarray data. In: 'Bioinformatics and Computational Biology Solutions using R and Bioconductor'. R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds), Springer, New York, pages 397-420.

Law, CW, Chen, Y, Shi, W, and Smyth, GK (2014). Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology 15, R29.

Ritchie, ME, Diyagama, D, Neilson, J, van Laar, R, Dobrovic, A, Holloway, A, and Smyth, GK (2006). Empirical array quality weights for microarray data. BMC Bioinformatics 7, Article 261.

edgeR

Please cite the first paper for the software itself and the other papers for the various original statistical methods implemented in edgeR. See Section 1.2 in the User's Guide for more detail.

Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140.

Robinson MD and Smyth GK (2007). Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 23, 2881-2887.

Robinson MD and Smyth GK (2008). Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9, 321-332.

McCarthy DJ, Chen Y and Smyth GK (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research 40, 4288-4297.

Please report problems or suggestions to: su.s@wehi.edu.au