view rank_terms.xml @ 24:248b06e86022

Added gd_genotype datatype. Modified tools to support new datatype.
author Richard Burhans <burhans@bx.psu.edu>
date Tue, 28 May 2013 16:24:19 -0400
parents 95a05c1ef5d5
children 8997f2ca8c7a
line wrap: on
line source

<tool id="gd_rank_terms" name="Rank Terms" version="1.0.0">
  <description>: Assess the enrichment/depletion of a gene set for GO terms</description>

  <command interpreter="python">
    #set $t_col1_0 = int(str($t_col1)) - 1
    #set $t_col2_0 = int(str($t_col2)) - 1
    #set $g_col2_0 = int(str($g_col2)) - 1
    rank_terms.py --input "$input1" --columnENSEMBLT $t_col1_0 --inExtnddfile "$input2" --columnENSEMBLTExtndd $t_col2_0 --columnGOExtndd $g_col2_0 --output "$output"
  </command>

  <inputs>
    <param name="input1" type="data" format="tabular" label="Query dataset" />
    <param name="t_col1" type="data_column" data_ref="input1" label="Column with ENSEMBL transcript codes" />

    <param name="input2" type="data" format="tabular" label="Background dataset" />
    <param name="t_col2" type="data_column" data_ref="input2" label="Column with ENSEMBL transcript codes" />
    <param name="g_col2" type="data_column" data_ref="input2" label="Column with GO terms" />
  </inputs>

  <outputs>
    <data name="output" format="tabular" />
  </outputs>

  <help>

**Dataset formats**

All of the input and output datasets are in tabular_ format.
The query dataset has a column containing ENSEMBL transcript codes for
the gene set of interest, while the background dataset has one column
with ENSEMBL transcript codes and another with GO terms, for some
larger universe of genes.
The output dataset is described below.
(`Dataset missing?`_)

.. _tabular: ./static/formatHelp.html#tab
.. _Dataset missing?: ./static/formatHelp.html

-----

**What it does**

Given a query set of genes from a larger background dataset, this tool
evaluates the statistical over- or under-representation of Gene Ontology
terms in the query set, using a two-tailed Fisher's exact test.

The output contains a row for each GO term, with the following columns:

1. count: the number of genes in the query set that are in this GO category
2. representation: the percentage of this category's genes (from the background dataset) that appear in the query set
3. ranking of this term, based on its representation ("1" is highest)
4. Fisher probability of enrichment/depletion of this GO category in the query dataset
5. GO term

  </help>
</tool>