<?xml version="1.0"?>
<tool id="transit_hmm" name="HMM" version="@VERSION@+galaxy2">
    <description>- determine essentiality of a genome</description>
    <macros>
        <import>macros.xml</import>
    </macros>
    <expand macro="requirements" />
    <command detect_errors="exit_code"><![CDATA[
        @LINK_INPUTS@
        transit hmm $input_files annotation.dat transit_out.txt
        @STANDARD_OPTIONS@
        $loess
        ]]>
    </command>
    <inputs>
        <expand macro="standard_inputs">
            <expand macro="handle_replicates" />
            <param name="loess" argument="-l" type="boolean" truevalue="-l" falsevalue="" label="Perform LOESS Correction" help="Helps remove possible genomic position bias." />
        </expand>
    </inputs>
    <outputs>
        <expand macro="outputs">
            <data name="genes" from_work_dir="transit_out_genes.txt" format="tabular" label="${tool.name} on ${on_string} Genes" />
        </expand>
    </outputs>
    <tests>
        <test>
            <param name="inputs" ftype="wig" value="transit-in1-rep1.wig,transit-in1-rep2.wig" />
            <param name="annotation" ftype="tabular" value="transit-in1.prot" />
            <param name="replicates" value="Replicates" />
            <output name="sites" file="hmm-sites1.txt" ftype="tabular" compare="sim_size" />
            <output name="genes" file="hmm-genes1.txt" ftype="tabular" compare="sim_size" />
        </test>
    </tests>
    <help>
<![CDATA[.. class:: infomark

**What it does**

-------------------

The HMM method can be used to determine the essentiality of the entire genome, as opposed to the gene-level analysis performed by the other methods. It is capable of identifying regions that have unusually high or unusually low read counts (i.e. growth-advantage or growth-defect regions), in addition to the more common categories of essential and non-essential.

Note: Intended only for Himar1 datasets.

-------------------

**Inputs**

-------------------

-   .wig files : Tabulated files containing one column with the TA site coordinate and one column with the read count at this site.

-   annotation .prot_table : Annotation file generated by the `Convert Gff3 to prot_table for TRANSIT` tool.
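
As an illustration of the two-column .wig layout described above, the following Python sketch reads coordinate and read-count pairs. It assumes that header lines begin with `#`, `track` or `variableStep`; the function name `read_wig` and the sample data are hypothetical and are not part of TRANSIT.

```python
def read_wig(lines):
    """Parse wig-style lines into (coordinates, counts) lists."""
    coords, counts = [], []
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and header/comment lines carry no site data.
        if not line or line.startswith(("#", "track", "variableStep")):
            continue
        fields = line.split()
        coords.append(int(fields[0]))    # TA site coordinate
        counts.append(float(fields[1]))  # read count at that site
    return coords, counts


# Hypothetical sample data, for illustration only.
sample = [
    "# example dataset",
    "variableStep chrom=example",
    "60 0",
    "72 12",
    "102 0",
    "188 5",
]
coords, counts = read_wig(sample)
print(coords)
print(counts)
```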

-------------------

**Parameters**

-------------------


Optional Arguments:

|    -r <string>     :=  How to handle replicates. Sum, Mean. Default: -r Mean
|    -l              :=  Perform LOESS Correction; Helps remove possible genomic position bias. Default: Off.
|    -iN <float>     :=  Ignore TAs occurring at given fraction of the N terminus. Default: -iN 0.0
|    -iC <float>     :=  Ignore TAs occurring at given fraction of the C terminus. Default: -iC 0.0
|    -n <string>     :=  Determines which normalization method to use. Default: -n TTR

The HMM method automatically estimates the necessary statistical parameters from the datasets. You can change how the method handles replicate datasets and normalization:

-   Replicates: Determines how the HMM deals with replicate datasets, either by averaging the read-counts or by summing the read-counts across datasets. For regular datasets (i.e. mean read-count > 100) the recommended setting is to average the read-counts. For sparse datasets, summing the read-counts may produce more accurate results.
-   Normalization Method: Determines which normalization method to use when comparing datasets. Proper normalization is important as it ensures that other sources of variability are not mistakenly treated as real differences. See the Normalization section of the TRANSIT documentation for a description of the available normalization methods.

    -   TTR (Default): Trimmed Total Reads (TTR), normalized by the total read-counts (like totreads), but trims the top and bottom 5% of read-counts. This is the recommended normalization method for most cases as it has the benefit of normalizing for differences in saturation in the context of resampling.
    -   nzmean: Normalizes datasets to have the same mean over the non-zero sites.
    -   totreads: Normalizes datasets by total read-counts, and scales them to have the same mean over all counts.
    -   zinfnb: Fits a zero-inflated negative binomial model, and then divides read-counts by the mean. The zero-inflated negative binomial model will treat some empty sites as belonging to the “true” negative binomial distribution responsible for read-counts while treating the others as “essential” (and thus not influencing its parameters).
    -   quantile: Normalizes datasets using the quantile normalization method described by Bolstad et al. (2003). In this normalization procedure, datasets are sorted, an empirical distribution is estimated as the mean across the sorted datasets at each site, and then the original (unsorted) datasets are assigned values from the empirical distribution based on their quantiles.
    -   betageom: Normalizes the datasets to fit an “ideal” Geometric distribution with a variable probability parameter p. Especially useful for datasets that contain a large skew. See Beta-Geometric Correction in the TRANSIT documentation.
    -   nonorm: No normalization is performed.
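
To make the replicate and normalization options above more concrete, here is a minimal Python sketch of nzmean-style scaling followed by Sum or Mean merging. It is not TRANSIT's implementation: the function names, the choice of the average non-zero mean as the shared target, and the order of the two steps are assumptions made for illustration.

```python
def combine_replicates(replicates, how="Mean"):
    """Combine per-site read counts across replicates by 'Sum' or 'Mean'."""
    totals = [sum(site) for site in zip(*replicates)]
    if how == "Sum":
        return totals
    if how == "Mean":
        return [t / len(replicates) for t in totals]
    raise ValueError("how must be 'Sum' or 'Mean'")


def nzmean_factors(replicates):
    """Scale factors giving every dataset the same mean over non-zero sites."""
    nz_means = []
    for counts in replicates:
        # Assumption: every dataset has at least one non-zero site.
        nonzero = [c for c in counts if c > 0]
        nz_means.append(sum(nonzero) / len(nonzero))
    # Assumption: the shared target is the average of the non-zero means.
    target = sum(nz_means) / len(nz_means)
    return [target / m for m in nz_means]


# Hypothetical read counts at four TA sites in two replicates.
rep1 = [0, 10, 0, 30]
rep2 = [0, 40, 20, 60]
factors = nzmean_factors([rep1, rep2])
normalized = [[c * f for c in rep] for rep, f in zip([rep1, rep2], factors)]
merged = combine_replicates(normalized, how="Mean")
print(factors)
print(merged)
```

Passing `how="Sum"` instead adds the normalized counts site by site, which corresponds to the option suggested above for sparse datasets.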


-----------

**Outputs**

-----------

The HMM method outputs two files. The first file provides the most likely assignment of states for all the TA sites in the genome. Sites can belong to one of the following states: “ES” (Essential), “GD” (Growth-Defect), “NE” (Non-Essential), or “GA” (Growth-Advantage). In addition, the output includes the probability of each site belonging to each state. The columns of this file are defined as follows:

=============================================   ========================================================================================================================
**Column**                                      **Column Definition**
---------------------------------------------   ------------------------------------------------------------------------------------------------------------------------
1                                               Coordinate of TA site
2                                               Observed Read Counts
3                                               Probability for ES state
4                                               Probability for GD state
5                                               Probability for NE state
6                                               Probability for GA state
7                                               State Classification (ES = Essential, GD = Growth-Defect, NE = Non-Essential, GA = Growth-Advantage)
8                                               Gene(s) that share(s) the TA site.
=============================================   ========================================================================================================================
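
The column layout above can be read with a few lines of Python. This sketch tallies the state call in column 7; it assumes the file is tab-separated and that header or comment lines start with `#`, and both the function name and the sample rows are hypothetical.

```python
from collections import Counter


def count_site_states(lines):
    """Tally the state call (column 7) over all TA sites."""
    states = Counter()
    for line in lines:
        # Assumption: comment/header lines start with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        states[fields[6]] += 1
    return states


# Hypothetical rows: coordinate, count, P(ES), P(GD), P(NE), P(GA), state, gene
sample = [
    "# example header",
    "60\t0\t0.95\t0.03\t0.02\t0.00\tES\tgeneA",
    "72\t0\t0.90\t0.06\t0.04\t0.00\tES\tgeneA",
    "102\t35\t0.00\t0.05\t0.90\t0.05\tNE\tgeneB",
]
states = count_site_states(sample)
print(dict(states))
```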

The second file provides a gene-level classification for all the genes in the genome. Genes are classified as “ES” (Essential), “GD” (Growth-Defect), “NE” (Non-Essential), or “GA” (Growth-Advantage) depending on the number of sites within the gene that belong to those states. The columns of this file are defined as follows:

=============================================   ========================================================================================================================
**Column Header**                               **Column Definition**
---------------------------------------------   ------------------------------------------------------------------------------------------------------------------------
Orf                                             Gene ID
Name                                            Gene Name
Desc                                            Gene Description
N                                               Number of TA sites
n0                                              Number of sites labeled ES (Essential)
n1                                              Number of sites labeled GD (Growth-Defect)
n2                                              Number of sites labeled NE (Non-Essential)
n3                                              Number of sites labeled GA (Growth-Advantage)
Avg. Insertions                                 Mean insertion rate within the gene
Avg. Reads                                      Mean read count within the gene
State Call                                      State Classification (ES = Essential, GD = Growth-Defect, NE = Non-Essential, GA = Growth-Advantage)
=============================================   ========================================================================================================================
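
One simple way to think about a call derived from the n0 to n3 columns is a plurality vote, sketched below in Python. This is an illustrative assumption only: the rule TRANSIT actually applies may use additional criteria, and the function name is hypothetical.

```python
def plurality_state(n_es, n_gd, n_ne, n_ga):
    """Return the state with the most sites; ties resolve in ES, GD, NE, GA order."""
    counts = {"ES": n_es, "GD": n_gd, "NE": n_ne, "GA": n_ga}
    # max() returns the first key with the highest count, so dictionary
    # insertion order decides ties.
    return max(counts, key=counts.get)


# Hypothetical site counts (n0, n1, n2, n3) for two genes.
call_a = plurality_state(8, 1, 1, 0)
call_b = plurality_state(0, 2, 10, 1)
print(call_a, call_b)
```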

Note: Libraries that are too sparse (e.g. < 30%) or which contain very low read-counts may be problematic for the HMM method, causing it to label too many genes as Growth-Defect.
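
A quick pre-check along these lines can be sketched in Python. It assumes that "sparse" refers to the fraction of TA sites with at least one insertion (saturation) and reuses the 30% figure from the note above; the function name and the sample counts are hypothetical.

```python
def library_summary(counts):
    """Return (saturation, mean non-zero read count) for per-site counts."""
    nonzero = [c for c in counts if c > 0]
    # Assumption: saturation is the fraction of TA sites with any insertion.
    saturation = len(nonzero) / len(counts)
    nz_mean = sum(nonzero) / len(nonzero) if nonzero else 0.0
    return saturation, nz_mean


# Hypothetical read counts at ten TA sites.
counts = [0, 12, 0, 0, 0, 0, 0, 0, 4, 0]
saturation, nz_mean = library_summary(counts)
if saturation < 0.30:
    print("Library may be too sparse for the HMM method")
print(saturation, nz_mean)
```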
]]></help>

    <expand macro="citations" />
</tool>
author	iuc
date	Wed, 16 Oct 2019 04:32:33 -0400
parents	32da07a53d3b
children	532a84f0de1e