<?xml version="1.0"?>
<tool id="transit_tn5gaps" name="Tn5Gaps" version="@VERSION@+galaxy1">
    <description>- determine essential genes</description>
    <macros>
        <import>macros.xml</import>
    </macros>
    <expand macro="requirements" />
    <command detect_errors="exit_code"><![CDATA[
        @LINK_INPUTS@
        transit tn5gaps $input_files annotation.dat transit_out.txt
        @STANDARD_OPTIONS@
        -m $smallest
        ]]>
    </command>
    <inputs>
        <expand macro="standard_inputs">
            <param name="smallest" argument="-m" type="integer" value="1" label="Smallest read-count to consider" />
        </expand>
    </inputs>
    <outputs>
        <expand macro="outputs" />
    </outputs>
    <tests>
        <test>
            <param name="inputs" ftype="wig" value="transit-in-tn5.wig,transit-in2-tn5.wig" />
            <param name="annotation" ftype="tabular" value="transit_tn5.prot" />
            <param name="replicates" value="Replicates" />
            <output name="sites" file="tn5gaps-sites1.txt" ftype="tabular" compare="sim_size" />
        </test>
    </tests>

<help><![CDATA[
.. class:: infomark

**What it does**

-------------------

This method is loosely based on the original **Gumbel** analysis method. The **Gumbel** method determines which genes are essential in a single condition. It performs a gene-by-gene analysis of the insertions at TA sites within each gene, makes a call based on the longest consecutive sequence of TA sites without insertions in the gene, and calculates the probability of this run using a Bayesian model.
The Tn5Gaps method modifies the original method to work on Tn5 datasets, which have significantly lower saturation of insertion sites than Himar1 datasets. The main difference is that runs of non-insertion (or “gaps”) are analyzed throughout the whole genome, including non-coding regions, instead of within single genes. In doing so, the expected maximum run length is calculated and a p-value can be derived for every run. A gene is then classified using the p-value of the run with the largest number of nucleotides overlapping the gene.
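
The gap-based idea can be illustrated with a minimal Python sketch. It is not TRANSIT's implementation: the function names ``find_gaps`` and ``best_overlap`` and the toy coordinates are hypothetical, and the p-value calculation is left out. It only shows how runs of non-insertion can be collected genome-wide and how the run with the largest overlap could be picked for a gene.

```python
def find_gaps(counts, min_read=1):
    """Return (start, end) pairs (1-based, inclusive) of consecutive
    positions whose read count is below min_read."""
    gaps = []
    start = None
    for pos, count in enumerate(counts, start=1):
        if count < min_read:
            if start is None:
                start = pos  # a new run of non-insertion begins here
        elif start is not None:
            gaps.append((start, pos - 1))  # an insertion closes the run
            start = None
    if start is not None:
        gaps.append((start, len(counts)))  # run reaching the genome end
    return gaps


def best_overlap(gaps, gene_start, gene_end):
    """Return (overlap, gap) for the gap sharing the most nucleotides
    with the gene, or (0, None) if no gap touches the gene."""
    best = (0, None)
    for gap_start, gap_end in gaps:
        overlap = min(gap_end, gene_end) - max(gap_start, gene_start) + 1
        if overlap > best[0]:
            best = (overlap, (gap_start, gap_end))
    return best


# Toy genome of 12 positions (assumed data); non-zero values are insertions.
counts = [0, 0, 3, 0, 0, 0, 0, 5, 0, 0, 1, 0]
gaps = find_gaps(counts)
overlap, gap = best_overlap(gaps, gene_start=5, gene_end=10)
```

In this toy case the hypothetical gene spanning positions 5 to 10 is assigned the gap from 4 to 7, which overlaps it by 3 nucleotides, even though that gap starts outside the gene.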

-------------------

**Inputs**

-------------------

Input files for Tn5Gaps need to be:

- .wig files: Tabulated files containing one column with the TA site coordinate and one column with the read count at this site.
- annotation .prot_table: Annotation file generated by the `Convert Gff3 to prot_table for TRANSIT` tool.
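
As a rough illustration of the two-column layout described above, the following Python sketch reads coordinates and read counts from a .wig stream. The function name ``read_wig``, the header lines, and the toy values are assumptions for the example, not part of TRANSIT.

```python
import io


def read_wig(handle):
    """Parse a two-column .wig stream into (coordinates, counts) lists."""
    coords, counts = [], []
    for line in handle:
        line = line.strip()
        # Skip blank lines, comments, and track/step declaration lines.
        if not line or line.startswith(("#", "track", "variableStep")):
            continue
        coord, count = line.split()[:2]
        coords.append(int(coord))
        counts.append(float(count))
    return coords, counts


# Toy content standing in for a real file (assumed data).
example = io.StringIO("# toy dataset\nvariableStep chrom=toy\n1 0\n2 12\n3 0\n")
coords, counts = read_wig(example)
```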


-------------------

**Parameters**

-------------------

Optional Arguments::

       -m <integer>    :=  Smallest read-count to consider. Default: -m 1
       -r <string>     :=  How to handle replicates. Sum or Mean. Default: -r Sum
       -iN <float>     :=  Ignore TAs occurring at given fraction of the N terminus. Default: -iN 0.0
       -iC <float>     :=  Ignore TAs occurring at given fraction of the C terminus. Default: -iC 0.0
       -n <string>     :=  Determines which normalization method to use. Default: -n TTR

- Minimum Read: The minimum read count that is considered a true read. Because the Gumbel method depends on determining gaps of TA sites lacking insertions, it may be susceptible to spurious reads (e.g. errors). The default value of 1 will consider all reads as true reads. A value of 2, for example, will ignore read counts of 1.
- Replicates: Determines how to handle replicates: either averaging or summing the read-counts across datasets. This should not affect the Gumbel method, aside from potentially affecting spurious reads.
- Normalization:
    - TTR (Default) : Trimmed Total Reads (TTR), normalized by the total read-counts (like totreads), but trims top and bottom 5% of read-counts. This is the recommended normalization method for most cases as it has the benefit of normalizing for differences in saturation in the context of resampling.
    - nzmean : Normalizes datasets to have the same mean over the non-zero sites.
    - totreads : Normalizes datasets by total read-counts, and scales them to have the same mean over all counts.
    - zinfnb : Fits a zero-inflated negative binomial model, and then divides read-counts by the mean. The zero-inflated negative binomial model will treat some empty sites as belonging to the “true” negative binomial distribution responsible for read-counts while treating the others as “essential” (and thus not influencing its parameters).
    - quantile : Normalizes datasets using the quantile normalization method described by Bolstad et al. (2003). In this normalization procedure, datasets are sorted, an empirical distribution is estimated as the mean across the sorted datasets at each site, and then the original (unsorted) datasets are assigned values from the empirical distribution based on their quantiles.
    - betageom : Normalizes the datasets to fit an “ideal” Geometric distribution with a variable probability parameter p. Especially useful for datasets that contain a large skew. See Beta-Geometric Correction in the TRANSIT documentation.
    - nonorm : No normalization is performed.
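
The effect of the minimum read, replicate, and nzmean options described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not TRANSIT's code: the function names and toy replicates are hypothetical, and the scaling target used for nzmean (the average of the per-dataset non-zero means) is an assumption.

```python
def apply_min_read(counts, min_read=1):
    """Treat counts below min_read as no insertion."""
    return [c if c >= min_read else 0 for c in counts]


def combine_replicates(datasets, how="Sum"):
    """Combine per-site counts across replicates by summing or averaging."""
    totals = [sum(site) for site in zip(*datasets)]
    if how == "Mean":
        return [t / len(datasets) for t in totals]
    return totals


def nzmean_factors(datasets):
    """Scaling factors giving every dataset the same mean over its
    non-zero sites (assumes each dataset has at least one such site)."""
    means = [sum(d) / sum(1 for c in d if c > 0) for d in datasets]
    target = sum(means) / len(means)  # assumed common target
    return [target / m for m in means]


# Two toy replicates over the same five sites (assumed data).
rep1 = [0, 1, 4, 0, 10]
rep2 = [0, 0, 2, 0, 6]

filtered = apply_min_read(rep1, min_read=2)
summed = combine_replicates([rep1, rep2], how="Sum")
averaged = combine_replicates([rep1, rep2], how="Mean")
factors = nzmean_factors([rep1, rep2])
```

With a minimum read of 2, the single read at the second site of ``rep1`` is ignored, so that site would count toward a gap.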


-------------------

**Outputs**

-------------------

=============================================   ========================================================================================================================
**Column Header**                               **Column Definition**
=============================================   ========================================================================================================================
Orf                                             Gene ID
Name                                            Gene Name
Desc                                            Gene Description
k                                               Number of Transposon Insertions Observed within the ORF.
n                                               Total Number of TA dinucleotides within the ORF.
r                                               Span of nucleotides for the Maximum Run of Non-Insertions.
ovr                                             The number of nucleotides in the overlap with the longest run partially covering the gene.
lenovr                                          The length of the above run with the largest overlap with the gene.
pval                                            P-value calculated by the permutation test.
padj                                            Adjusted p-value controlling for the FDR (Benjamini-Hochberg).
State Call                                      Essentiality call for the gene. Depends on FDR corrected thresholds. E=Essential, U=Uncertain, NE=Non-Essential, S=too short
=============================================   ========================================================================================================================
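
The ``padj`` column refers to the Benjamini-Hochberg procedure. A minimal Python sketch of that standard adjustment is shown here for orientation. The function name ``benjamini_hochberg`` and the example p-values are hypothetical, and TRANSIT's own implementation may differ in detail.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Example p-values for four genes (assumed data).
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```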


Note: Technically, Bayesian models are used to calculate posterior probabilities, not p-values (which is a concept associated with the frequentist framework). However, we have implemented a method for computing the approximate false-discovery rate (FDR) that serves a similar purpose. This determines a threshold for significance on the posterior probabilities that is corrected for multiple tests. The actual thresholds used are reported in the headers of the output file (and are near 1 for essentials and near 0 for non-essentials). There can be many genes that score between the two thresholds (t1 < zbar < t2). This reflects intrinsic uncertainty associated with either low read counts, sparse insertion density, or small genes. If the insertion_density is too low (< ~30%), the method may not work as well, and might indicate an unusually large number of Uncertain or Essential genes.

-------------------

**More Information**

-------------------

See the TRANSIT documentation:

- TRANSIT: https://transit.readthedocs.io/en/latest/index.html
- TRANSIT Tn5Gaps: https://transit.readthedocs.io/en/latest/transit_methods.html#tn5gaps
    ]]></help>

    <expand macro="citations" />

</tool>
author	iuc
date	Tue, 08 Oct 2019 08:24:46 -0400
parents
children	c68753eedf72