What it does
This tool converts FASTA-formatted sequences to tab-delimited format.
Many tools consider the first word of the FASTA ">" title line to be an identifier, and any remaining text to be a free-form description. It is therefore useful to split this text into two columns in Galaxy (identifier and any description) by setting How many columns to divide title string into? to 2. In some cases the description can be usefully broken up into more columns; see the examples below.
The option How many characters to keep? lets you keep a specified number of characters from the beginning of each FASTA title line. With the introduction of the How many columns to divide title string into? option this setting is of limited use, but it does still allow you to truncate the identifier.
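The way the two options interact can be sketched in a few lines of Python. This is only an illustration, not the tool's actual implementation: the function name fasta_to_tabular, the parameter names title_columns and keep_first, and the convention that keep_first=0 means "no truncation" are all assumptions made for this sketch.

```python
def fasta_to_tabular(lines, title_columns=1, keep_first=0):
    """Yield one tab-delimited row per FASTA record.

    title_columns: number of columns to split the title line into.
    keep_first: number of leading characters to keep in the first column;
                0 means keep everything (an assumption of this sketch).
    """
    def make_row(title, seq_parts):
        # Split on whitespace at most title_columns - 1 times, so the last
        # title column receives all of the remaining text.
        fields = title.split(None, title_columns - 1)
        # Pad short titles so every row has the same number of columns.
        fields += [""] * (title_columns - len(fields))
        if keep_first > 0:
            # Only the first column is truncated.
            fields[0] = fields[0][:keep_first]
        return "\t".join(fields + ["".join(seq_parts)])

    title, seq_parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                yield make_row(title, seq_parts)
            title, seq_parts = line[1:], []
        elif title is not None:
            # Sequence lines of one record are joined into a single string.
            seq_parts.append(line)
    if title is not None:
        yield make_row(title, seq_parts)


sample = [
    ">EYKX4VC02EQLO5 length=108 xy=1826_0455 region=2 run=R_2007_11_07_16_15_57_",
    "TCCGCGCC",  # sequences shortened for this illustration
    "GAGCATGC",
    ">EYKX4VC02D4GS2 length=60 xy=1573_3972 region=2 run=R_2007_11_07_16_15_57_",
    "AATAAAAC",
]
for row in fasta_to_tabular(sample, title_columns=2):
    print(row)
```

Splitting with a maximum number of splits is what keeps the free-form description intact in the last title column, however many words it contains.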
Example
Suppose you have the following FASTA formatted sequences from a Roche (454) FLX sequencing run:
>EYKX4VC02EQLO5 length=108 xy=1826_0455 region=2 run=R_2007_11_07_16_15_57_
TCCGCGCCGAGCATGCCCATCTTGGATTCCGGCGCGATGACCATCGCCCGCTCCACCACG
TTCGGCCGGCCCTTCTCGTCGAGGAATGACACCAGCGCTTCGCCCACG
>EYKX4VC02D4GS2 length=60 xy=1573_3972 region=2 run=R_2007_11_07_16_15_57_
AATAAAACTAAATCAGCAAAGACTGGCAAATACTCACAGGCTTATACAATACAAATGTAA
Running this tool with the default settings will produce this (2 column output):
EYKX4VC02EQLO5 length=108 xy=1826_0455 region=2 run=R_2007_11_07_16_15_57_ | TCCGCGCCGAGCATGCCCATCTTGGATTCCGGC...ACG |
EYKX4VC02D4GS2 length=60 xy=1573_3972 region=2 run=R_2007_11_07_16_15_57_ | AATAAAACTAAATCAGCAAAGACTGGCAAATAC...TAA |
Having the full title line (the FASTA ">" line text) as a column is not always ideal.
The How many characters to keep? option is useful if your identifiers are all the same length. In this example the identifier is 14 characters, so setting How many characters to keep? to 14 (and leaving How many columns to divide title string into? as the default, 1) will produce this (2 column output):
EYKX4VC02EQLO5 | TCCGCGCCGAGCATGCCCATCTTGGATTCCGGC...ACG |
EYKX4VC02D4GS2 | AATAAAACTAAATCAGCAAAGACTGGCAAATAC...TAA |
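Before relying on How many characters to keep?, it can help to confirm that the identifiers really are all the same length. A minimal sketch of such a check follows; the helper name identifier_lengths is hypothetical and not part of the tool.

```python
def identifier_lengths(lines):
    """Return the set of identifier lengths found in FASTA title lines."""
    lengths = set()
    for line in lines:
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited word of the title.
            words = line[1:].split(None, 1)
            identifier = words[0] if words else ""
            lengths.add(len(identifier))
    return lengths


titles = [
    ">EYKX4VC02EQLO5 length=108 xy=1826_0455 region=2 run=R_2007_11_07_16_15_57_",
    ">EYKX4VC02D4GS2 length=60 xy=1573_3972 region=2 run=R_2007_11_07_16_15_57_",
]
print(identifier_lengths(titles))  # prints {14}
```

A result containing a single value means a fixed character count is safe to use; more than one value means the identifiers vary in length.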
If, however, your FASTA file has identifiers of variable length, it is better to split the text into at least two columns. Running this tool with How many columns to divide title string into? set to 2 will produce this (3 column output):
EYKX4VC02EQLO5 | length=108 xy=1826_0455 region=2 run=R_2007_11_07_16_15_57_ | TCCGCGCCGAGCATGCCCATCTTGGATTCCGGC...ACG |
EYKX4VC02D4GS2 | length=60 xy=1573_3972 region=2 run=R_2007_11_07_16_15_57_ | AATAAAACTAAATCAGCAAAGACTGGCAAATAC...TAA |
Running this tool with How many columns to divide title string into? set to 5 will produce this (6 column output: five title columns plus the sequence):
EYKX4VC02EQLO5 | length=108 | xy=1826_0455 | region=2 | run=R_2007_11_07_16_15_57_ | TCCGCGCCGAGCATGCCCATCTTGGATTCCGGC...ACG |
EYKX4VC02D4GS2 | length=60 | xy=1573_3972 | region=2 | run=R_2007_11_07_16_15_57_ | AATAAAACTAAATCAGCAAAGACTGGCAAATAC...TAA |
Running this tool with How many columns to divide title string into? set to 5 and How many characters to keep? set to 10 will produce this (6 column output: five title columns plus the sequence). Notice that only the first column is truncated to 10 characters. Be careful not to trim your sequence names too much, since they should generally remain unique:
EYKX4VC02E | length=108 | xy=1826_0455 | region=2 | run=R_2007_11_07_16_15_57_ | TCCGCGCCGAGCATGCCCATCTTGGATTCCGGC...ACG |
EYKX4VC02D | length=60 | xy=1573_3972 | region=2 | run=R_2007_11_07_16_15_57_ | AATAAAACTAAATCAGCAAAGACTGGCAAATAC...TAA |
Note that the sequences have been truncated for display purposes in the above tables.
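The warning about trimming sequence names too much can be checked before running the tool. The sketch below reports truncated names that would be shared by more than one sequence; the function name truncation_collisions is hypothetical and the check is not something the tool performs itself.

```python
from collections import Counter


def truncation_collisions(identifiers, keep_first):
    """Return the truncated identifiers that occur more than once."""
    counts = Counter(identifier[:keep_first] for identifier in identifiers)
    return sorted(prefix for prefix, n in counts.items() if n > 1)


ids = ["EYKX4VC02EQLO5", "EYKX4VC02D4GS2"]
print(truncation_collisions(ids, 10))  # prints []
print(truncation_collisions(ids, 9))   # prints ['EYKX4VC02']
```

For the two example reads, keeping 10 characters leaves the names distinct, while keeping only 9 would make them identical.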