Galaxy |

What it does

This tool converts FASTA formatted sequences to TAB-delimited format.

Many tools consider the first word of the FASTA ">" title line to be an identifier, and any remaining text to be a free form description. It is therefore useful to split this text into two columns in Galaxy (identifier and any description) by setting How many columns to divide title string into? to 2. In some cases the description can be usefully broken up into more columns -- see the examples .

The option How many characters to keep? allows to select a specified number of letters from the beginning of each FASTA entry. With the introduction of the How many columns to divide title string into? option this setting is of limited use, but does still allow you to truncate the identifier.

Example

Suppose you have the following FASTA formatted sequences from a Roche (454) FLX sequencing run:

>EYKX4VC02EQLO5 length=108 xy=1826_0455 region=2 run=R_2007_11_07_16_15_57_
TCCGCGCCGAGCATGCCCATCTTGGATTCCGGCGCGATGACCATCGCCCGCTCCACCACG
TTCGGCCGGCCCTTCTCGTCGAGGAATGACACCAGCGCTTCGCCCACG
>EYKX4VC02D4GS2 length=60 xy=1573_3972 region=2 run=R_2007_11_07_16_15_57_
AATAAAACTAAATCAGCAAAGACTGGCAAATACTCACAGGCTTATACAATACAAATGTAA

Running this tool with the default settings will produce this (2 column output):

EYKX4VC02EQLO5 length=108 xy=1826_0455 region=2 run=R_2007_11_07_16_15_57_	TCCGCGCCGAGCATGCCCATCTTGGATTCCGGC...ACG
EYKX4VC02D4GS2 length=60 xy=1573_3972 region=2 run=R_2007_11_07_16_15_57_	AATAAAACTAAATCAGCAAAGACTGGCAAATAC...TAA

Having the full title line (the FASTA ">" line text) as a column is not always ideal.

The How many characters to keep? option is useful if your identifiers are all the same length. In this example the identifier is 14 characters, so setting How many characters to keep? to 14 (and leaving How many columns to divide title string into? as the default, 1) will produce this (2 column output):

EYKX4VC02EQLO5	TCCGCGCCGAGCATGCCCATCTTGGATTCCGGC...ACG
EYKX4VC02D4GS2	AATAAAACTAAATCAGCAAAGACTGGCAAATAC...TAA

If however your FASTA file has identifiers of variable length, it is better to split the text into at least two columns. Running this tool with How many columns to divide title string into? to 2 will produce this (3 column output):

EYKX4VC02EQLO5	length=108 xy=1826_0455 region=2 run=R_2007_11_07_16_15_57_	TCCGCGCCGAGCATGCCCATCTTGGATTCCGGC...ACG
EYKX4VC02D4GS2	length=60 xy=1573_3972 region=2 run=R_2007_11_07_16_15_57_	AATAAAACTAAATCAGCAAAGACTGGCAAATAC...TAA

Running this tool with How many columns to divide title string into? to 5 will produce this (5 column output):

EYKX4VC02EQLO5	length=108	xy=1826_0455	region=2	run=R_2007_11_07_16_15_57_	TCCGCGCCGAGCATGCCCATCTTGGATTCCGGC...ACG
EYKX4VC02D4GS2	length=60	xy=1573_3972	region=2	run=R_2007_11_07_16_15_57_	AATAAAACTAAATCAGCAAAGACTGGCAAATAC...TAA

Running this tool with How many columns to divide title string into? to 5 and How many characters to keep? to 10 will produce this (5 column output). Notice that only the first column is truncated to 10 characters -- and be careful not to trim your sequence names too much (generally they should be unique):

EYKX4VC02E	length=108	xy=1826_0455	region=2	run=R_2007_11_07_16_15_57_	TCCGCGCCGAGCATGCCCATCTTGGATTCCGGC...ACG
EYKX4VC02D	length=60	xy=1573_3972	region=2	run=R_2007_11_07_16_15_57_	AATAAAACTAAATCAGCAAAGACTGGCAAATAC...TAA

Note the sequences have been truncated for display purposes in the above tables.