Filter sequences by ID (version 0.2.9)
Accepts sequence files in FASTA, FASTQ, or SFF format.

What it does

By default this tool divides a FASTA, FASTQ, or Standard Flowgram Format (SFF) file in two: the sequences whose ID appears in the specified column(s) of the tabular file, and the sequences whose ID does not. Alternatively, you can opt for a single output file containing only the matching records or only the non-matching ones.
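
The default two-way split can be sketched in plain Python as follows. This is a minimal illustration for FASTA input only, not the tool's actual implementation: the function names (load_ids, iter_fasta, split_fasta) are hypothetical, columns are assumed to be zero-based, and the identifier is assumed to be the first whitespace-delimited word of each header line.

```python
import io


def load_ids(tabular_handle, columns=(0,)):
    """Collect identifiers from the given zero-based columns of a tab-separated file."""
    ids = set()
    for line in tabular_handle:
        # Skip blank lines and comment lines (an assumption for this sketch).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        for col in columns:
            if col < len(fields) and fields[col].strip():
                ids.add(fields[col].strip())
    return ids


def iter_fasta(handle):
    """Yield (identifier, record_text) pairs from a FASTA handle."""
    identifier, lines = None, []
    for line in handle:
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(lines)
            words = line[1:].split()
            # Assumed convention: the ID is the first word after ">".
            identifier = words[0] if words else ""
            lines = [line]
        elif identifier is not None:
            lines.append(line)
    if identifier is not None:
        yield identifier, "".join(lines)


def split_fasta(fasta_handle, ids):
    """Return (matching_text, non_matching_text), keeping the input order."""
    matched, unmatched = [], []
    for identifier, text in iter_fasta(fasta_handle):
        (matched if identifier in ids else unmatched).append(text)
    return "".join(matched), "".join(unmatched)


# Small in-memory demonstration with made-up data.
example_fasta = ">a desc\nACGT\n>b\nGG\nTT\n>c\nAAAA\n"
example_table = "a\t1\nc\t2\n"
example_ids = load_ids(io.StringIO(example_table))
positive, negative = split_fasta(io.StringIO(example_fasta), example_ids)
```

Here positive holds the records for a and c, and negative holds the record for b. Writing only one of the two strings to disk corresponds to the single-output options.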

Instead of providing the identifiers in a tabular file, you can enter them directly as a parameter (type or paste them into the text box). This is a useful shortcut for extracting a few sequences of interest without first having to prepare a tabular file.

Note that the order of sequences in the original sequence file is preserved, as is any Roche XML Manifest in an SFF file. Also, if any sequences share an identifier (which would be very unusual in SFF files), duplicates are not removed.
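
The order-preserving, duplicate-keeping behaviour can be illustrated with a streaming filter. The sketch below is an assumption-laden example rather than the tool's code: it handles only simple four-line FASTQ records (no line wrapping), and the names iter_fastq and keep_matching are hypothetical.

```python
import io
from itertools import islice


def iter_fastq(handle):
    """Yield (identifier, record_text) for four-line FASTQ records."""
    while True:
        block = list(islice(handle, 4))
        if not block:
            return
        if len(block) < 4 or not block[0].startswith("@"):
            raise ValueError("Truncated or malformed FASTQ record")
        words = block[0][1:].split()
        # Assumed convention: the ID is the first word after "@".
        yield (words[0] if words else ""), "".join(block)


def keep_matching(handle, ids):
    """Stream records whose ID is in ids, in input order, duplicates included."""
    for identifier, text in iter_fastq(handle):
        if identifier in ids:
            yield text


# Made-up data in which the identifier r1 occurs twice.
example_fastq = "@r1\nAC\n+\nII\n@r2\nGG\n+\nII\n@r1\nTT\n+\nII\n"
kept = "".join(keep_matching(io.StringIO(example_fastq), {"r1"}))
```

Because records are emitted as they are read and nothing is deduplicated, both r1 records appear in kept, in their original order.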

Example Usage

You may have performed some kind of contamination search, for example running BLASTN against a database of cloning vectors or bacteria, giving you a tabular file containing read identifiers. You could use this tool to extract only the reads without BLAST matches (i.e. those which do not match your contaminant database).
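
As a rough sketch of this contamination scenario, the following Python collects the query identifiers from BLAST tabular output and keeps only the reads without a hit. It assumes the standard BLAST+ tabular layout, in which the query ID is the first column. The function names blast_query_ids and drop_hits and the sample data are hypothetical, and the code is not taken from the tool itself.

```python
import io


def blast_query_ids(blast_handle):
    """Return the set of query identifiers (first column) from BLAST tabular output."""
    return {
        line.split("\t", 1)[0].strip()
        for line in blast_handle
        if line.strip() and not line.startswith("#")
    }


def drop_hits(fasta_handle, hit_ids):
    """Return FASTA text for records whose ID is not in hit_ids."""
    kept, keep = [], False
    for line in fasta_handle:
        if line.startswith(">"):
            words = line[1:].split()
            # Assumed convention: the ID is the first word after ">".
            keep = (words[0] if words else "") not in hit_ids
        if keep:
            kept.append(line)
    return "".join(kept)


# Made-up example: read2 matched a vector, so it is removed.
example_blast = "read2\tvector1\t99.0\t50\t0\t0\t1\t50\t10\t59\t1e-20\t95.0\n"
example_reads = ">read1\nACGT\n>read2\nGGCC\n>read3\nTTAA\n"
hits = blast_query_ids(io.StringIO(example_blast))
clean = drop_hits(io.StringIO(example_reads), hits)
```

In the tool itself, the equivalent is to select the query ID column of the BLAST output and request only the non-matching records.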

You may have a file of FASTA sequences which has been used with some analysis tool giving tabular output, which has then been filtered on some criteria. You can then use this tool to divide the original FASTA file into those entries matching or not matching your criteria (those with or without their identifier in the filtered tabular file).

References

If you use this Galaxy tool in work leading to a scientific publication, please cite the following paper:

Peter J.A. Cock, Björn A. Grüning, Konrad Paszkiewicz and Leighton Pritchard (2013). Galaxy tools and workflows for sequence analysis with applications in molecular plant pathology. PeerJ 1:e167 https://doi.org/10.7717/peerj.167

This tool uses Biopython to read and write SFF files, so you may also wish to cite the Biopython application note (and Galaxy too of course):

Cock et al. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3. https://doi.org/10.1093/bioinformatics/btp163 pmid:19304878.

This tool is available to install into other Galaxy Instances via the Galaxy Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/seq_filter_by_id