<tool id="tandem_repeats_finder" name="Tandem Repeats Finder" version="1.0.0">
	<description>locates and displays tandem repeats in DNA sequences</description>
	<requirements>
		<requirement type="package" version="4.0">tandem_repeats_finder</requirement>
	</requirements>
	<version_command>trf | grep Version</version_command>
	<command interpreter="python">tandem_repeats_finder_wrapper.py --file "$file" --match $match --mismatch $mismatch --delta $delta --pm $pm --pi $pi --minscore $minscore --maxperiod $maxperiod
#if $nohtml
	--txt "$output_txt"
#else
	--html "$output_html" --dirhtml "$output_html.files_path" --txt "$output_txt"
#end if
#if $flanking
	--flanking
#end if
#if $mask
	--mask "$output_mask"
#end if

	</command>
	<inputs>
		<param name="file" type="data" format="fasta" label="DNA sequences in Fasta format"/>
		<param name="match" type="integer" value="2" label="Matching weight" help="default value 2">
			<validator type="in_range" min="1" />
		</param>
		<param name="mismatch" type="integer" value="7" label="Mismatching penalty" help="default value 7">
			<validator type="in_range" min="0" />
		</param>
		<param name="delta" type="integer" value="7" label="Indel penalty" help="default value 7">
			<validator type="in_range" min="0" />
		</param>
		<param name="pm" type="integer" value="80" label="Matching probability" help="default value 80">
			<validator type="in_range" min="1" />
		</param>
		<param name="pi" type="integer" value="10" label="Indel probability" help="default value 10">
			<validator type="in_range" min="1" />
		</param>
		<param name="minscore" type="integer" value="50" label="Minimum alignment score to report" help="default value 50">
			<validator type="in_range" min="30" />
		</param>
		<param name="maxperiod" type="integer" value="500" label="Maximum period size to report" help="default value 500">
			<validator type="in_range" min="1" />
		</param>
		<param name="nohtml" type="boolean" checked="false" label="No HTML output" help="Export the data file only" />
		<param name="flanking" type="boolean" checked="false" label="Flanking sequence" help="Flanking sequence consists of the 500 nucleotides on each side of a repeat. Flanking sequence is recorded in the alignment file. This may be useful for PCR primer determination." />
		<param name="mask" type="boolean" checked="false" label="Masked sequence file" help="The masked sequence file is a FASTA format file containing a copy of the sequence with every character that occurred in a tandem repeat changed to the letter 'N'. The word 'masked' is added to the sequence description line just after the '>' character." />
	</inputs>
	<outputs>
		<data format="html" name="output_html" label="TRF_summary_${match}_${mismatch}_${delta}_${pm}_${pi}_${minscore}_${maxperiod}.html">
			<filter>(nohtml == False)</filter>
		</data>
		<data format="txt" name="output_mask" label="TRF_summary_${match}_${mismatch}_${delta}_${pm}_${pi}_${minscore}_${maxperiod}.mask">
			<filter>(mask == True)</filter>
		</data>
		<data format="txt" name="output_txt" label="TRF_summary_${match}_${mismatch}_${delta}_${pm}_${pi}_${minscore}_${maxperiod}.txt"/>
	</outputs>
	<tests>
		<test>
			<param name="file" value="sequence_trf_test.fasta" />
			<param name="nohtml" value="True" />
			<output name="output_txt" file="TRF_summary_2_7_80_10_50_500.txt" ftype="txt" />
		</test>
	</tests>
	<help>


**What it does**

A tandem repeat in DNA is two or more adjacent, approximate copies of a pattern of nucleotides. Tandem Repeats Finder is a program to locate and display tandem repeats in DNA sequences. To use the program, the user submits a sequence in FASTA format. There is no need to specify the pattern, the size of the pattern or any other parameter. The output consists of two files: a repeat table file and an alignment file. The repeat table contains information about each repeat, including its location, size, number of copies and nucleotide content. Clicking on the location indices for one of the table entries opens a second browser window that shows an alignment of the copies against a consensus pattern. The program is very fast, analyzing sequences on the order of 0.5 Mb in just a few seconds. Submitted sequences may be of arbitrary length. Repeats with a pattern size in the range from 1 to 2000 bases are detected.

-------

**Input format**

The FASTA format is a plain text format which looks something like this::

 >myseq
 AGTCGTCGCT AGCTAGCTAG CATCGAGTCT TTTCGATCGA GGACTAGACT TCTAGCTAGC TAGCATAGCA TACGAGCATA TCGGTCATGA GACTGATTGG GCTTTAGCTA GCTAGCATAG CATACGAGCA TATCGGTAGA CTGATTGGGT TTAGGTTACC

The first line starts with a greater-than sign ">" and contains a name or other identifier for the sequence. This is the sequence header and must be on a single line. The remaining lines contain the sequence data. The sequence can be in upper or lower case letters. Anything other than letters (numbers, for example) is ignored. Multiple sequences can be present in the same file as long as each sequence has its own header.
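
As a rough illustration of these rules, the following Python sketch reads FASTA records. The function name ``read_fasta`` is hypothetical and is not part of Tandem Repeats Finder or of this wrapper. It only assumes the conventions described above: a header line starting with ">", letters kept in either case, and everything else ignored.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper for illustration only; not part of TRF.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Keep letters only; digits, spaces and punctuation are dropped.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    # Emit the last record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)
```

Called on the example above, the sketch would yield a single pair whose header is ``myseq`` and whose sequence has the spaces between blocks removed.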

-------

**Output format**

Table Explanation:

The summary table includes the following information::

 1 Indices of the repeat relative to the start of the sequence.
 2 Period size of the repeat.
 3 Number of copies aligned with the consensus pattern.
 4 Size of consensus pattern (may differ slightly from the period size).
 5 Percent of matches between adjacent copies overall.
 6 Percent of indels between adjacent copies overall.
 7 Alignment score.
 8 Percent composition for each of the four nucleotides.
 9 Entropy measure based on percent composition.

If the output contains more than 120 repeats, multiple linked tables are produced. The links to the other tables appear at the top and bottom of each table.

Note: If you save multiple linked summary table files, use the default names supplied by your browser to preserve the automatic linking.

Alignment Explanation:

The alignment is presented as follows::

 1 In each pair of lines, the actual sequence is on the top and a consensus sequence for all the copies is on the bottom.
 2 Each pair of lines is one period except for very small patterns.
 3 The 10 sequence characters before and after a repeat are shown.
 4 Symbol * indicates a mismatch.
 5 Symbol - indicates an insertion or deletion.
 6 Statistics refers to the matches, mismatches and indels overall between adjacent copies in the sequence, not between the sequence and the consensus pattern.
 7 Distances between matching characters at corresponding positions are listed as distance, number at that distance, percentage of all matches.
 8 ACGTcount is percentage of each nucleotide in the repeat sequence.
 9 Consensus sequence is shown by itself.
 10 If chosen as an option, 500 characters of flanking sequence on each side of the repeat are shown.

Note: If you save the alignment file, use the default name supplied by your browser to preserve the automatic cross-referencing with the summary table.

The data file is a text file which contains the same information, in the same order, as the repeat table file, plus consensus and repeat sequences. This file contains no labeling and is suitable for additional processing, for example with a Perl script, outside of the program.
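
A minimal sketch of such additional processing, written in Python rather than Perl, is given here. It rests on assumptions that should be checked against your own output: each data line is taken to hold whitespace-separated columns in the table order (start and end indices, period size, copy number, consensus size, percent matches, percent indels, score, the four composition values, entropy), followed by the consensus and repeat sequences, and any other line (sequence name, parameters) is taken not to start with an integer. The names ``Repeat`` and ``parse_dat`` are hypothetical.

```python
from collections import namedtuple

# Hypothetical record type; the field names are illustrative, not defined by TRF.
Repeat = namedtuple("Repeat", [
    "start", "end", "period", "copies", "consensus_size",
    "percent_matches", "percent_indels", "score",
    "a", "c", "g", "t", "entropy", "consensus", "sequence",
])


def parse_dat(lines):
    """Yield one Repeat per data line and skip every other line."""
    for line in lines:
        fields = line.split()
        # Assumed layout: 13 numeric columns, then consensus and repeat sequence.
        # Extra trailing columns, if any, are ignored.
        if len(fields) >= 15 and fields[0].isdigit():
            numbers = [float(x) if "." in x else int(x) for x in fields[:13]]
            yield Repeat(*numbers, fields[13], fields[14])
```

For example, ``[r for r in parse_dat(open("output.txt")) if r.period == 2]`` would keep only dinucleotide repeats, where ``output.txt`` stands for whatever name the data file was saved under.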


-------

**References**

If you use this Galaxy tool in work leading to a scientific publication please
cite the following paper:

G. Benson,
"Tandem repeats finder: a program to analyze DNA sequences"
Nucleic Acids Research (1999)
Vol. 27, No. 2, pp. 573-580.
	</help>
</tool>
author	urgi-team
date	Thu, 10 Jul 2014 09:32:30 -0400