What it does
Takes an input file of sequences (typically FASTA or FASTQ, although Standard Flowgram Format (SFF) is also supported) and returns a new sequence file in the same format, sub-sampled uniformly from the input. The input order is preserved, and sequences are selected evenly through the input file.
Several sampling modes are supported, all designed to do non-random uniform sampling (i.e. evenly through the input file). This allows reproducibility, and also works on paired sequence files (run the tool twice, once on each file using the same settings).
By sampling uniformly (evenly) through the file, this avoids any bias should reads in any part of the file be of lesser quality (e.g. for high throughput sequencing the reads at the start and end of the file can be of lower quality).
The simplest mode is to take every N-th sequence. For example, taking every 2nd sequence would sample half the file, while taking every 5th sequence would take 20% of the file.
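The every N-th mode can be illustrated with a minimal Python sketch. This is not the tool's own code: the function name sample_every_nth is hypothetical, and starting from the first record (rather than the N-th) is an assumption made here for illustration.

```python
from itertools import islice


def sample_every_nth(records, n):
    """Return an iterator over every n-th record, preserving input order.

    Assumption: sampling starts with the first record; the real tool
    may use a different offset.
    """
    if n < 2:
        raise ValueError("n must be at least 2 to sub-sample")
    # islice steps through any iterable without loading it all into memory.
    return islice(records, 0, None, n)


# Taking every 5th record keeps 20% of a 10-record input.
sampled = list(sample_every_nth(range(10), 5))
```

Because the selection depends only on a record's position, running the same function with the same N on two matched files picks the same positions from each.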
The target count method picks N sequences from the input file, which again will be distributed uniformly (evenly) through the file. This works by first counting the number of records, then calculating the desired percentage of sequences to take. Note that if your input file has exactly N sequences, this selects them all (effectively copying the input file). If your input file has fewer than N sequences, this is treated as an error.
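One way to picture the target count mode is the sketch below. It is an illustration under stated assumptions, not the tool's implementation: the name sample_target_count is hypothetical, the records are held in memory for simplicity, and evenly spaced integer positions are used in place of the percentage calculation described above.

```python
def sample_target_count(records, target):
    """Pick `target` records spread evenly through the input, in order.

    Assumption: positions are chosen as i * total // target, which gives
    `target` distinct indices whenever total >= target.
    """
    records = list(records)  # first pass: count the records
    total = len(records)
    if total < target:
        # Mirrors the documented behaviour: too few records is an error.
        raise ValueError(f"Input has {total} records, fewer than {target}")
    wanted = {i * total // target for i in range(target)}
    return [rec for idx, rec in enumerate(records) if idx in wanted]


# Four records from ten, evenly spaced through the input.
picked = sample_target_count(range(10), 4)
```

When the target equals the number of records, every index is selected, so the input is effectively copied.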
If you tick the interleaved option, the file is processed as pairs of records to ensure your read pairs are not separated by sampling. For example using 20% would take every 5th pair of records, or you could request 1000 read pairs.
If instead of interleaved paired reads you have two matched files (one for each pair), run the tool twice with the same sampling options to make two matched smaller files.
Note interleaved/pair mode does not actually check your read names match a known pair naming scheme!
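The interleaved behaviour can be sketched as follows: consecutive records are grouped into pairs, and the sampling is applied to the pairs rather than to individual records. The function name sample_every_nth_pair is hypothetical, and silently dropping a trailing unpaired record is an assumption of this sketch. As in the tool, read names are not checked.

```python
from itertools import islice


def sample_every_nth_pair(records, n):
    """Yield both records of every n-th pair from an interleaved input.

    Assumption: records arrive as forward, reverse, forward, reverse...
    No read-name check is done, and a trailing odd record is dropped.
    """
    it = iter(records)
    pairs = zip(it, it)  # consume two records at a time from one iterator
    for forward, reverse in islice(pairs, 0, None, n):
        yield forward
        yield reverse


reads = ["a/1", "a/2", "b/1", "b/2", "c/1", "c/2"]
kept = list(sample_every_nth_pair(reads, 2))
```

Here every 2nd pair is kept, so the first and third pairs survive with both mates still adjacent.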
Example Usage
Suppose you have some Illumina paired end data as files R1.fastq and R2.fastq which give an estimated x200 coverage, and you wish to do a de novo assembly with a tool like MIRA which recommends lower coverage. Running the tool twice (on R1.fastq and R2.fastq) taking every 3rd read would reduce the estimated coverage to about x66, and would preserve the pairing as well (as two smaller FASTQ files).
Similarly, if you had some Illumina paired end data interleaved into one file with an estimated x200 coverage, you would run this tool in interleaved mode, taking every 3rd read pair. This would again reduce the estimated coverage to about x66, while preserving the read pairing.
Suppose you have a transcriptome assembly, and wish to look at the species distribution of the top BLAST hits for an initial quality check. Rather than using all your sequences, you could pick just 1000 of them for this.
Citation
This tool uses Biopython, so if you use this Galaxy tool in work leading to a scientific publication please cite the following paper:
Cock et al (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3. https://doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
This tool is available to install into other Galaxy Instances via the Galaxy Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/sample_seqs