Sequence lengths (version 0.0.5)
Input file in FASTA, QUAL, FASTQ, or SFF format.

What it does

Takes a FASTA, QUAL, FASTQ or Standard Flowgram Format (SFF) file and produces a two-column tabular file containing one line per sequence giving the sequence identifier and the associated sequence's length.
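As a rough illustration of the tabular output described above, the following sketch reads FASTA text and writes one tab-separated line per record. It is not the tool's own code: the tool reads sequences through Biopython's SeqIO, whereas this example uses a small hand-written FASTA reader so that it runs without extra dependencies. The function names `iter_fasta_lengths` and `write_length_table` are hypothetical, and only FASTA input is handled here.

```python
import io


def iter_fasta_lengths(handle):
    """Yield (identifier, length) for each record in a FASTA handle.

    Hypothetical helper for illustration; the real tool uses Biopython.
    """
    identifier = None
    length = 0
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, length
            # The identifier is the first whitespace-delimited word.
            fields = line[1:].split(None, 1)
            identifier = fields[0] if fields else ""
            length = 0
        elif identifier is not None:
            # Sequence lines may be wrapped, so lengths accumulate.
            length += len(line.strip())
    if identifier is not None:
        yield identifier, length


def write_length_table(in_handle, out_handle):
    """Write 'identifier<TAB>length' lines; return the record count."""
    count = 0
    for identifier, length in iter_fasta_lengths(in_handle):
        out_handle.write(f"{identifier}\t{length}\n")
        count += 1
    return count


# Small in-memory example with one wrapped record and one single-line record.
example = io.StringIO(">seq1 first read\nACGT\nAC\n>seq2\nGGGTTTAAA\n")
table = io.StringIO()
n_records = write_length_table(example, table)
```

Because records are written as they are read, duplicate identifiers are not detected in this sketch and would each produce their own line.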

Additionally, the tool reports some basic statistics about the sequences (visible via the output file's metadata, or the stdout log for the job), namely the number of sequences, total length, mean length, minimum length and maximum length.
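The five basic statistics can all be computed in a single pass with a few running totals, without keeping every length in memory. The sketch below shows one way to do that; the function name `basic_stats` and the dictionary layout are assumptions made for this example, not the tool's actual interface or log format.

```python
def basic_stats(lengths):
    """Return count, total, mean, min and max for an iterable of lengths.

    Hypothetical helper: only running totals are kept, so memory use
    does not grow with the number of sequences.
    """
    count = 0
    total = 0
    shortest = None
    longest = None
    for n in lengths:
        count += 1
        total += n
        if shortest is None or n < shortest:
            shortest = n
        if longest is None or n > longest:
            longest = n
    # The mean is undefined for empty input, so report None in that case.
    mean = total / count if count else None
    return {
        "count": count,
        "total": total,
        "mean": mean,
        "min": shortest,
        "max": longest,
    }


stats = basic_stats([6, 9, 3])
```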

You can optionally request two additional statistics, the median length and the N50. Computing these uses more RAM and takes fractionally longer.
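Unlike the basic statistics, the median and N50 both need the full list of lengths, which is a plausible reason for the extra memory use. The sketch below uses a common definition of N50: the largest length L such that sequences of length L or more contain at least half of the total bases. The function name `n50` is hypothetical, and the tool's own handling of ties or of an even number of sequences for the median may differ from what is shown here.

```python
from statistics import median


def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Assumes the common definition: walk the lengths from longest to
    shortest and return the length at which the running total first
    reaches half of the overall total.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("N50 is undefined for an empty set of lengths")
    half_total = sum(ordered) / 2
    running = 0
    for n in ordered:
        running += n
        if running >= half_total:
            return n


lengths = [2, 3, 4, 5, 6]
median_length = median(lengths)
n50_length = n50(lengths)
```

Note that `statistics.median` averages the two middle values when the number of lengths is even, so it can return a non-integer value.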

WARNING: Duplicate sequence identifiers are not filtered out. If any are present, each occurrence will appear as its own line in the tabular output.

For SFF files, the tool reports the trimmed lengths of the reads.

References

This tool uses Biopython's SeqIO library to read sequences, so please cite the Biopython application note (and Galaxy too of course):

Cock et al. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25(11), 1422-1423. https://doi.org/10.1093/bioinformatics/btp163 pmid:19304878.

This tool can be installed into other Galaxy instances via the Galaxy Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/seq_length