
What it does

Gene Ontology (GO) analysis is widely used to reduce complexity and highlight biological processes in genome-wide expression studies, but standard methods give biased results on RNA-seq data because differential expression is over-detected for long and highly expressed transcripts. This tool performs GO analysis of RNA-seq data while taking length bias into account. The methods and software used by goseq are equally applicable to other category-based tests of RNA-seq data, such as KEGG pathway analysis.

Options map closely to the excellent goseq manual.

Inputs

Differentially expressed genes file

goseq needs a tabular file containing information on differentially expressed genes. This file should contain all genes assayed in the RNA-seq experiment. It should have two columns with an optional header row. The first column should contain the Gene IDs, each of which must appear only once in the file. The second column should contain True or False: True means the gene counts as differentially expressed, and False means it does not. You can use the "Compute on rows" tool to create a True / False column for your dataset.

Example:

ENSG00000236824	False
ENSG00000162526	False
ENSG00000090402	True
ENSG00000169188	False
ENSG00000124103	False
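
If the True / False column is built outside Galaxy, the step can be sketched in Python as below. This is only an illustration: the input layout (gene ID plus adjusted p-value), the 0.05 cutoff, the p-values themselves, and the names `to_de_flags` and `write_tabular` are assumptions of the sketch, not part of goseq or this tool.

```python
import csv
import io

def to_de_flags(rows, alpha=0.05):
    """Turn (gene_id, adjusted_p) pairs into (gene_id, 'True'/'False') rows.

    Raises ValueError on a repeated gene ID, since IDs must be unique.
    """
    seen = set()
    flagged = []
    for gene_id, padj in rows:
        if gene_id in seen:
            raise ValueError(f"duplicate gene ID: {gene_id}")
        seen.add(gene_id)
        # Assumption: a missing p-value (None) counts as not differentially expressed.
        is_de = padj is not None and padj < alpha
        flagged.append((gene_id, str(is_de)))
    return flagged

def write_tabular(rows):
    """Serialise rows as tab-separated text, one gene per line."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()

# Hypothetical adjusted p-values, for illustration only.
example = [
    ("ENSG00000236824", 0.61),
    ("ENSG00000162526", None),
    ("ENSG00000090402", 0.003),
]
de_table = write_tabular(to_de_flags(example))
```

Keeping every assayed gene in the output, including those flagged False, matters because the non-DE genes form the background for the enrichment test.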

Gene lengths file

goseq needs information about the length of each gene to correct for potential length bias in differentially expressed genes using a Probability Weighting Function (PWF). The PWF can be thought of as a function that gives the probability that a gene will be differentially expressed, based on its length alone. The gene length file should have two columns with an optional header row. The first column should contain the Gene IDs, and the second column should contain the gene length in bp. If length data is unavailable for a gene, its length entry should be set to NA. The goseq authors recommend using the gene lengths obtained from upstream summarization programs, such as featureCounts, if provided. Alternatively, the Gene length and GC content tool can produce such a file.

Example:

ENSG00000236824	13458
ENSG00000162526	2191
ENSG00000090402	6138
ENSG00000169188	3245
ENSG00000124103	1137
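
To make the idea of a length-dependent DE probability concrete, the following Python sketch bins genes by length and reports the fraction that are differentially expressed in each bin. It is a deliberately crude stand-in: goseq fits a smooth curve rather than using bins, and the function names, the two-bin default, and the `HYPOTHETICAL_GENE` row with an NA length are assumptions made for illustration.

```python
def parse_lengths(text):
    """Parse 'gene<TAB>length' lines (no header row); an 'NA' length becomes None."""
    lengths = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        gene_id, value = line.split("\t")[:2]
        lengths[gene_id] = None if value.strip() == "NA" else int(value)
    return lengths

def binned_de_proportion(lengths, de_flags, n_bins=2):
    """Return (mean length, proportion DE) for equal-sized bins of genes sorted by length."""
    # Genes with no usable length are left out of the bins.
    usable = sorted(
        (length, gene)
        for gene, length in lengths.items()
        if length is not None and gene in de_flags
    )
    summary = []
    for i in range(n_bins):
        chunk = usable[i * len(usable) // n_bins:(i + 1) * len(usable) // n_bins]
        if not chunk:
            continue
        mean_length = sum(length for length, _ in chunk) / len(chunk)
        prop_de = sum(de_flags[gene] for _, gene in chunk) / len(chunk)
        summary.append((mean_length, prop_de))
    return summary

lengths = parse_lengths(
    "ENSG00000236824\t13458\nENSG00000162526\t2191\nENSG00000090402\t6138\n"
    "ENSG00000169188\t3245\nENSG00000124103\t1137\nHYPOTHETICAL_GENE\tNA\n"
)
de_flags = {
    "ENSG00000236824": False,
    "ENSG00000162526": False,
    "ENSG00000090402": True,
    "ENSG00000169188": False,
    "ENSG00000124103": False,
}
summary = binned_de_proportion(lengths, de_flags)
```

With real data, a proportion that rises from the short-gene bins to the long-gene bins is the length bias that the PWF is meant to capture.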

Gene categories file

This tool can get GO and KEGG categories for some genomes. The three GO categories are GO:MF (Molecular Function - molecular activities of gene products), GO:CC (Cellular Component - where gene products are active), and GO:BP (Biological Process - pathways and larger processes made up of the activities of multiple gene products). If your genome is not available, you will also need a file describing the membership of genes in categories. The category file should have two columns with an optional header row, with the Gene ID in the first column and the category identifier in the second column. As the mapping between categories and genes is usually many-to-many, a Gene ID will usually appear in multiple rows, as will a category identifier.

Example:

ENSG00000162526	GO:0000003
ENSG00000198648	GO:0000278
ENSG00000112312	GO:0000278
ENSG00000174442	GO:0000278
ENSG00000108953	GO:0000278
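
The many-to-many structure of this file is easiest to see once it is loaded into two lookups, one per direction. The Python sketch below does that for the example rows; the function name `read_categories` and its `has_header` flag are assumptions of the sketch, not options of this tool.

```python
from collections import defaultdict

def read_categories(text, has_header=False):
    """Build gene -> categories and category -> genes lookups from tab-separated text."""
    gene_to_cats = defaultdict(set)
    cat_to_genes = defaultdict(set)
    lines = [line for line in text.splitlines() if line.strip()]
    if has_header:
        lines = lines[1:]
    for line in lines:
        gene_id, category = line.split("\t")[:2]
        gene_to_cats[gene_id].add(category)
        cat_to_genes[category].add(gene_id)
    return gene_to_cats, cat_to_genes

category_text = (
    "ENSG00000162526\tGO:0000003\n"
    "ENSG00000198648\tGO:0000278\n"
    "ENSG00000112312\tGO:0000278\n"
    "ENSG00000174442\tGO:0000278\n"
    "ENSG00000108953\tGO:0000278\n"
)
gene_to_cats, cat_to_genes = read_categories(category_text)
```

Using sets means a gene/category pair listed twice is counted once.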

Outputs

This tool outputs a tabular file containing a ranked list of gene categories, similar to the example below. The default output is the Wallenius method table. If the Sampling and/or Hypergeometric methods are also selected, additional tables are produced.

Example:

category	over_rep_pval	under_rep_pval	numDEInCat	numInCat	term	ontology	p_adjust_over_rep	p_adjust_under_rep
GO:0005576	0.000054	0.999975	56	142	extracellular region	CC	0.394825	1
GO:0005840	0.000143	0.999988	9	12	ribosome	CC	0.394825	1
GO:0044763	0.000252	0.999858	148	473	single-organism cellular process	BP	0.394825	1
GO:0044699	0.000279	0.999844	158	513	single-organism process	BP	0.394825	1
GO:0065010	0.000428	0.999808	43	108	extracellular membrane-bounded organelle	CC	0.394825	1
GO:0070062	0.000428	0.999808	43	108	extracellular exosome	CC	0.394825	1
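
A table like this can be filtered downstream on the adjusted p-value column. The Python sketch below uses the first two rows of the example; the function name `over_represented` and the default cutoff are assumptions made for illustration.

```python
import csv
import io

HEADER = ["category", "over_rep_pval", "under_rep_pval", "numDEInCat", "numInCat",
          "term", "ontology", "p_adjust_over_rep", "p_adjust_under_rep"]
ROWS = [
    ["GO:0005576", "0.000054", "0.999975", "56", "142", "extracellular region", "CC", "0.394825", "1"],
    ["GO:0005840", "0.000143", "0.999988", "9", "12", "ribosome", "CC", "0.394825", "1"],
]
table_text = "\n".join("\t".join(row) for row in [HEADER] + ROWS) + "\n"

def over_represented(text, cutoff=0.05):
    """Return (category, term) pairs whose adjusted over-representation p-value is below cutoff."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    hits = [row for row in reader if float(row["p_adjust_over_rep"]) < cutoff]
    # Keep the ranking by raw over-representation p-value.
    hits.sort(key=lambda row: float(row["over_rep_pval"]))
    return [(row["category"], row["term"]) for row in hits]

significant = over_represented(table_text)
```

In this example no category passes a 0.05 cutoff: the raw p-values are small, but every adjusted value is 0.394825.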

Optionally, this tool can also output:

a plot of the top 10 over-represented GO categories
some diagnostic plots
a tabular file with the differentially expressed genes in categories (GO/KEGG terms)
an RData file

Method options

Three methods, Wallenius, Sampling and Hypergeometric, can be used to calculate the p-values, as follows.

Wallenius

approximates the true distribution of numbers of members of a category amongst DE genes by the Wallenius non-central hypergeometric distribution. This distribution assumes that within a category all genes have the same probability of being chosen. Therefore, this approximation works best when the range in probabilities obtained by the probability weighting function is small. This is the method used by default.

Sampling

uses random sampling to approximate the true distribution and uses it to calculate the p-values for over (and under) representation of categories. Although this is the most accurate method given a high enough number of samples, its use quickly becomes computationally prohibitive. It may sometimes be desirable to use random sampling to generate the null distribution for category membership, for example to check consistency against results from the Wallenius approximation. This is done by additionally selecting Sampling under the method options and specifying the number of samples to generate.
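
The sampling idea can be sketched in Python as follows. This is not goseq's code: the gene weights stand in for PWF values and are hypothetical, the weighted draw without replacement uses the Efraimidis-Spirakis key trick, and the add-one p-value estimate is a convention chosen for the sketch.

```python
import heapq
import random

def weighted_sample(weights, k, rng):
    """Draw k distinct genes with probability proportional to weight (all weights > 0)."""
    # Each gene gets the key u ** (1 / w); the k largest keys form the sample.
    keys = ((rng.random() ** (1.0 / w), gene) for gene, w in weights.items())
    return [gene for _, gene in heapq.nlargest(k, keys)]

def sampling_pvalues(weights, category, de_genes, n_samples=2000, seed=1):
    """Estimate (over, under) representation p-values for one category by resampling."""
    rng = random.Random(seed)
    observed = len(category & de_genes)
    n_de = len(de_genes)
    at_least = at_most = 0
    for _ in range(n_samples):
        drawn = weighted_sample(weights, n_de, rng)
        hits = sum(1 for gene in drawn if gene in category)
        at_least += hits >= observed
        at_most += hits <= observed
    # Add-one estimate so that a p-value is never exactly zero.
    return (at_least + 1) / (n_samples + 1), (at_most + 1) / (n_samples + 1)

# Hypothetical weights: longer genes are given a higher chance of being called DE.
weights = {"g1": 0.1, "g2": 0.2, "g3": 0.4, "g4": 0.8, "g5": 0.8, "g6": 0.1}
p_over, p_under = sampling_pvalues(weights, {"g4", "g5"}, {"g4", "g5", "g1"})
```

The smallest p-value this estimate can return is 1 / (n_samples + 1), which is why a small number of samples cannot resolve strongly enriched categories once many categories are tested.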

Hypergeometric

assumes there is no bias in power to detect differential expression at all and calculates the p-values using a standard hypergeometric distribution (no length bias correction is performed). Useful if you wish to test the effect of length bias on your results. Caution: Hypergeometric should NEVER be used for producing results for biological interpretation of RNA-seq data. If length bias is truly not present in your data, goseq will produce a nearly flat PWF plot, no length bias correction will be applied to your data, and all methods will produce the same results.
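
For reference, the uncorrected hypergeometric calculation can be written with the Python standard library alone. The function name and the small counts in the example are hypothetical; the sketch shows only the standard test, with no length bias correction.

```python
from math import comb

def hypergeom_pvalues(n_genes, n_de, n_in_cat, n_de_in_cat):
    """Return (over, under) representation p-values from the standard hypergeometric distribution."""
    total = comb(n_genes, n_de)
    lowest = max(0, n_de - (n_genes - n_in_cat))
    highest = min(n_de, n_in_cat)

    def pmf(k):
        # Probability that exactly k of the n_de DE genes fall in the category.
        return comb(n_in_cat, k) * comb(n_genes - n_in_cat, n_de - k) / total

    over = sum(pmf(k) for k in range(n_de_in_cat, highest + 1))
    under = sum(pmf(k) for k in range(lowest, n_de_in_cat + 1))
    return over, under

# Hypothetical counts: 10 genes assayed, 3 DE, 4 in the category, 2 of those DE.
over_p, under_p = hypergeom_pvalues(10, 3, 4, 2)
```

Every gene is treated as equally likely to be called DE here, which is exactly the assumption that length bias violates.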

More Information

To account for the length bias inherent to RNA-seq data when performing a GO analysis (or other category-based tests), one cannot simply use the hypergeometric distribution as the null distribution for category membership; that distribution is appropriate only for data without DE length bias, such as microarray data. GO analysis of RNA-seq data requires random sampling to generate a suitable null distribution for GO category membership and to calculate each category's significance for over-representation amongst DE genes.

However, this random sampling is computationally expensive. In most cases, the Wallenius distribution can be used to approximate the true null distribution without any significant loss in accuracy. The goseq package implements this approximation as its default option. Generating the null distribution by random sampling is also available, but users should be aware that the default number of samples will not be enough to accurately call enrichment when there are a large number of GO terms.

Once a null distribution is established, each category is tested for over- and under-representation amongst the set of differentially expressed genes, and the null is used to calculate a p-value for each.

Having performed a GO analysis, you may now wish to interpret the results. If you wish to identify categories that are significantly over- or under-represented below some p-value cutoff, it is necessary to first apply some kind of multiple hypothesis testing correction. For example, you can identify over-represented GO categories using a 0.05 FDR (p.adjust) cutoff [Benjamini and Hochberg, 1995].
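
The Benjamini-Hochberg adjustment itself is short enough to sketch in Python. The function name and the four input p-values are hypothetical; the calculation follows the standard step-up procedure, which is also what R's p.adjust computes with method "BH".

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four categories.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [i for i, p in enumerate(adjusted) if p < 0.05]
```

Because each adjusted value depends on the number of tests, the same raw p-value becomes less significant as more categories are tested.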

Unless you are a machine, GO and KEGG category identifiers are probably not very meaningful to you. Information about each identifier can be obtained from the Gene Ontology and KEGG websites.
