
What it does

A tandem repeat in DNA is two or more adjacent, approximate copies of a pattern of nucleotides. Tandem Repeats Finder is a program that locates and displays tandem repeats in DNA sequences. To use the program, submit a sequence in FASTA format. There is no need to specify the pattern, the size of the pattern, or any other parameter. The output consists of two files: a repeat table file and an alignment file. The repeat table contains information about each repeat, including its location, size, number of copies, and nucleotide content. Clicking the location indices of a table entry opens a second browser window that shows an alignment of the copies against a consensus pattern. The program is very fast, analyzing sequences on the order of 0.5 Mb in just a few seconds. Submitted sequences may be of arbitrary length. Repeats with a pattern size from 1 to 2000 bases are detected.

Input format

The FASTA format is a plain text format which looks something like this:

>myseq
AGTCGTCGCT AGCTAGCTAG CATCGAGTCT TTTCGATCGA
GGACTAGACT TCTAGCTAGC TAGCATAGCA TACGAGCATA
TCGGTCATGA GACTGATTGG GCTTTAGCTA GCTAGCATAG
CATACGAGCA TATCGGTAGA CTGATTGGGT TTAGGTTACC

The first line starts with a greater-than sign ">" and contains a name or other identifier for the sequence. This is the sequence header and must be on a single line. The remaining lines contain the sequence data. The sequence can be in upper or lower case letters. Anything other than letters (numbers, for example) is ignored. Multiple sequences can be present in the same file as long as each sequence has its own header.
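The rules above can be sketched in a few lines of Python. This is only an illustration of the format as described here, not the parser used by Tandem Repeats Finder; the function name `parse_fasta` and the sample records are made up for the example.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text.

    Hypothetical helper: it follows the rules described in this section,
    not the actual Tandem Repeats Finder input code.
    """
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Keep letters only; digits, spaces and punctuation are dropped.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up input with two sequences, mixed case, and a stray number.
example = """>myseq
AGTCGTCGCT AGCTAGCTAG
10 catcgagtct
>second
ACGT ACGT
"""

for name, seq in parse_fasta(example):
    print(name, len(seq), seq)
```

Text that appears before the first header is skipped in this sketch, since the format requires every sequence to have its own header.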

Output format

Table Explanation:

The summary table includes the following information:

1 Indices of the repeat relative to the start of the sequence.
2 Period size of the repeat.
3 Number of copies aligned with the consensus pattern.
4 Size of consensus pattern (may differ slightly from the period size).
5 Percent of matches between adjacent copies overall.
6 Percent of indels between adjacent copies overall.
7 Alignment score.
8 Percent composition for each of the four nucleotides.
9 Entropy measure based on percent composition.
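As a rough illustration of items 8 and 9, the sketch below computes the percent composition of a repeat and an entropy value from it. It assumes the entropy measure is the Shannon entropy, in bits, of the four nucleotide frequencies; this page does not state the exact formula, so treat that as an assumption. The function name `composition_and_entropy` is hypothetical.

```python
import math
from collections import Counter


def composition_and_entropy(seq):
    """Return (percent composition dict, entropy in bits) for a DNA string.

    Assumption: entropy is Shannon entropy over the A/C/G/T frequencies,
    which ranges from 0 (one nucleotide only) to 2 (all four equally common).
    """
    counts = Counter(ch for ch in seq.upper() if ch in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    percent = {base: 100.0 * counts[base] / total for base in "ACGT"}
    # Nucleotides that do not occur contribute nothing to the sum.
    entropy = -sum(
        (p / 100.0) * math.log2(p / 100.0) for p in percent.values() if p > 0
    )
    return percent, entropy


percent, entropy = composition_and_entropy("ATATATATAT")
print(percent, round(entropy, 2))
```

Under this assumption, a low entropy value flags repeats dominated by one or two nucleotides.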

If the output contains more than 120 repeats, multiple linked tables are produced. The links to the other tables appear at the top and bottom of each table.

Note: If you save multiple linked summary table files, use the default names supplied by your browser to preserve the automatic linking.

Alignment Explanation:

The alignment is presented as follows:

1 In each pair of lines, the actual sequence is on the top and a consensus sequence for all the copies is on the bottom.
2 Each pair of lines is one period except for very small patterns.
3 The 10 sequence characters before and after a repeat are shown.
4 Symbol * indicates a mismatch.
5 Symbol - indicates an insertion or deletion.
6 Statistics refers to the matches, mismatches and indels overall between adjacent copies in the sequence, not between the sequence and the consensus pattern.
7 Distances between matching characters at corresponding positions are listed as distance, number at that distance, percentage of all matches.
8 ACGTcount is percentage of each nucleotide in the repeat sequence.
9 Consensus sequence is shown by itself.
10 If chosen as an option, 500 characters of flanking sequence on each side of the repeat are shown.

Note: If you save the alignment file, use the default name supplied by your browser to preserve the automatic cross-referencing with the summary table.

The data file is a text file which contains the same information, in the same order, as the repeat table file, plus consensus and repeat sequences. This file contains no labeling and is suitable for additional processing outside of the program, for example with a Perl script.
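A minimal sketch of such post-processing, in Python rather than Perl, is shown below. It assumes each repeat occupies one line of whitespace-separated fields in the table order: the two indices, period size, copy number, consensus size, percent matches, percent indels, score, the four nucleotide percentages, entropy, then the consensus and repeat sequences. That layout, the `Repeat` type, and the sample line are assumptions made for illustration and should be checked against a real data file.

```python
from typing import NamedTuple, Optional


class Repeat(NamedTuple):
    # Field names are hypothetical; the order follows the summary table.
    start: int
    end: int
    period: int
    copies: float
    consensus_size: int
    percent_matches: float
    percent_indels: float
    score: int
    a: float
    c: float
    g: float
    t: float
    entropy: float
    consensus: str
    sequence: str


def parse_data_line(line: str) -> Optional[Repeat]:
    """Parse one data-file line, or return None if it does not fit the layout."""
    parts = line.split()
    if len(parts) != 15:  # assumed field count, see the note above
        return None
    try:
        return Repeat(
            int(parts[0]), int(parts[1]), int(parts[2]), float(parts[3]),
            int(parts[4]), float(parts[5]), float(parts[6]), int(parts[7]),
            float(parts[8]), float(parts[9]), float(parts[10]), float(parts[11]),
            float(parts[12]), parts[13], parts[14],
        )
    except ValueError:
        return None


# Made-up line for illustration only, not real program output.
sample = "13 38 2 13.0 2 100 0 52 50 0 0 50 1.00 AT " + "AT" * 13
repeat = parse_data_line(sample)
if repeat is not None:
    print(repeat.start, repeat.end, repeat.period, repeat.consensus)
```

Lines that do not match the assumed layout are returned as `None` so that a caller can skip them instead of failing.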

References

If you use this Galaxy tool in work leading to a scientific publication, please cite the following paper:

G. Benson, "Tandem repeats finder: a program to analyze DNA sequences" Nucleic Acids Research (1999) Vol. 27, No. 2, pp. 573-580.