view GenotypingSTR.xml @ 2:d5ed5c2e25c3 draft

Uploaded
author arkarachai-fungtammasan
date Wed, 22 Apr 2015 12:48:40 -0400
parents 07588b899c13
children
line wrap: on
line source

<tool id="GenotypeSTR" name="Correct genotype for STR errors" version="2.0.0">
  <description> that occur during sequencing and library prep </description>
  <command interpreter="python2.7">GenotypeTRcorrection.py  $microsat_raw $microsat_error_profile $microsat_corrected  $expectedminorallele </command>

  <inputs>
    <param name="microsat_raw" type="data" label="Select microsatellite length profile that need to refine genotyping" />
    <param name="microsat_error_profile" type="data" label="Select microsatellite error profile that correspond to this dataset" />
	<param name="expectedminorallele" type="float" value="0.5" label="Expected contribution of minor allele when present (0.5 for genotyping)" />

  </inputs>
  <outputs>
    <data name="microsat_corrected" format="tabular" />
  </outputs>
  <tests>
    <!-- Test data with valid values -->
    <test>
      <param name="microsat_raw" value="sampleTRprofile_C.txt"/>
      <param name="microsat_error_profile" value="PCRinclude.allrate.bymajorallele"/>
      <param name="expectedminorallele" value="0.5"/>
      <output name="microsat_corrected" file="sampleTRgenotypingcorrection"/>
    </test>
    
  </tests>
  <help>


.. class:: infomark

**What it does**

- This tool will correct for STR sequencing and library preparation errors using error rates estimated from hemizygous male X chromosome (https://usegalaxy.org/u/guru%40psu.edu/h/error-rates-files) or rates provided by user. The STR length profile for each locus will be processed independently. 
- First, this tool will find three most common STR lengths from input STR length profile. If the STR length profile has only one length of STR, the length of one motif longer than the observed length will be used as the second most common STR length. 
- Second, it will calculate probability of three forms of homozygotes and use the form with the highest probability. The same goes for heterozygotes. 
- Third, this tools will calculate log10 of the ratio of the probability of homozygote to the probability of heterozygote. If this value is more than 0, it will predict this locus to be homozygote. If this value is less than 0, it will predict this locus to be heterozygote. If this value is 0, read profile at this locus will be discarded. 

**Citation**

When you use this tool, please cite **Fungtammasan A, Ananda G, Hile SE, Su MS, Sun C, Harris R, Medvedev P, Eckert K, Makova KD. 2015. Accurate Typing of Short Tandem Repeats from Genome-wide Sequencing Data and its Applications, Genome Research**
 
**Input**

- The input files need to contain at least three columns. 
- Column 1 = location of STR locus. 
- Column 2 = length profile (length of STR in each read that mapped to this location in comma separated format). 
- Column 3 = motif of STR in this locus. The input file can contain more than three columns. 

**Output**

The output will be contain original three (or more) columns as the input. However, it will also have these following columns. 

- Additional column 1 = homozygote/heterozygote label.
- Additional column 2 = log based 10 of (the probability of homozygote/the probability of heterozygote)
- Additional column 3 = Allele for most probable homozygote.
- Additional column 4 = Allele 1 for most probable heterozygote.
- Additional column 5 = Allele 2 for most probable heterozygote.

**Example**

- Suppose that we sequence a locus of STR with NGS. This locus has **A** motif and the following STR length (bp) profile. ::

	chr1_100_106	5, 6, 6, 6, 6, 7, 7, 8, 8	A
	
- We want to figure out if this locus is a homoozygote or heterozygote and the corresponding allele(s). Therefore, we use this tool to refine genotype.
- This tool will calculate the probability of homozygote A6A6, A7A7, and A8A8 to generate the observed STR length profile. Among this A7A7 has the highest probability. Therefore, we use this form as the representative for homozygote.
- Then, this tool will calculate the probability of heterozygote A6A7, A7A8, and A6A8 to generate the observed STR length profile. Among this A6A8 has the highest probability. Therefore, we use this form as the representative for heterozygote.    
- Finally, it will compare the representative homozygous and heterozygous forms. The A6A8 has higher probability than A7A7. Therefore, the program will report that this locus as a heterozygous locus of form A6A8. ::

	chr1	5,6,6,6,6,7,7,8,8	A	hetero	-14.8744881854	7	6	8


</help>
</tool>