Split libraries (version 1.9.1.0)

Metadata mapping filepath:

The file must contain a header line with SampleID in the first column, BarcodeSequence in the second, and LinkerPrimerSequence in the third. It is recommended to check the mapping file using the dedicated tool
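As a quick pre-check before uploading, the header requirement above can be sketched in a few lines of Python. This is only an illustration, not the validation performed by the tool itself: the function name `header_is_valid` and the example rows are hypothetical, and the sketch assumes a tab-separated file whose header line starts with `#`.

```python
import csv
import io

# Column names required in the first three positions, as stated above.
REQUIRED_PREFIX = ["SampleID", "BarcodeSequence", "LinkerPrimerSequence"]


def header_is_valid(mapping_text):
    """Return True if the first three header columns match the required names.

    Assumption: the file is tab-separated and the header may start with '#'.
    """
    reader = csv.reader(io.StringIO(mapping_text), delimiter="\t")
    header = next(reader, [])
    if header:
        # Drop the leading '#' that conventionally marks the header line.
        header[0] = header[0].lstrip("#")
    return header[:3] == REQUIRED_PREFIX


# Hypothetical mapping file content, for illustration only.
example = (
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tDescription\n"
    "S1\tAGCACGAGCCTA\tYATGCTGCCTCCCGTAGGAGT\tcontrol\n"
)
print(header_is_valid(example))
```

A file that swaps the second and third columns fails this check, which is the kind of mistake worth catching before running the tool.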

Input fasta files:

Add quality files:

Minimum sequence length:

Maximum sequence length:

Compute sequence lengths after trimming primers and barcodes?:

Remove primer from sequences?:

Remove barcode from sequences?:

Maximum number of ambiguous bases:

Maximum length of homopolymer run:
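To make the homopolymer criterion concrete, the following minimal sketch computes the longest run of a single repeated base in a read. The helper name `longest_homopolymer` is hypothetical and the code is not taken from the tool; a read would be rejected when this value exceeds the maximum set here.

```python
from itertools import groupby


def longest_homopolymer(seq):
    """Return the length of the longest run of one repeated character."""
    # groupby yields one group per stretch of identical consecutive characters.
    return max((sum(1 for _ in run) for _, run in groupby(seq)), default=0)


# Hypothetical read containing a run of seven T bases.
print(longest_homopolymer("ACGTTTTTTTA"))
```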

Maximum number of primer mismatches:

Type of barcode:

Maximum number of errors in barcode:

Sequence id to use for the first sequence:

Retain sequences which are Unassigned in the output sequence file?:

Disable attempts to find nearest corrected barcode?:

It can improve performance

Disable primer usage when demultiplexing?:

It should be enabled only in unusual circumstances, such as analyzing Sanger sequence data generated with different primers

Enable removal of the reverse primer and any subsequent sequence from the end of each read?:

Median length filtering (optional):

It disables minimum and maximum sequence length filtering, and instead calculates the median sequence length and filters the sequences based upon the number of median absolute deviations specified by this parameter. Any sequences with lengths outside the number of deviations will be removed
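The median-based filter described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's own code: the function name `filter_by_median_length` is hypothetical, and the sketch assumes the accepted window is the median plus or minus the given number of median absolute deviations (MAD), with the bounds included.

```python
from statistics import median


def filter_by_median_length(lengths, n_deviations):
    """Keep lengths within n_deviations MADs of the median length."""
    med = median(lengths)
    # Median absolute deviation: the median distance from the median.
    mad = median([abs(x - med) for x in lengths])
    low = med - n_deviations * mad
    high = med + n_deviations * mad
    return [x for x in lengths if low <= x <= high]


# Hypothetical read lengths with one clear outlier (400).
lengths = [200, 210, 205, 195, 400]
print(filter_by_median_length(lengths, 3))
```

Here the median is 205 and the MAD is 5, so with 3 deviations only lengths between 190 and 220 are kept and the 400-base read is dropped.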

Field to use in the mapping file as additional demultiplexing (optional):

It can be used with or without barcodes. All combinations of barcodes/primers and these fields must be unique. The fields must contain values that can be parsed from the fasta labels such as 'plate=R_2008_12_09'. In this case, 'plate' would be the column header and 'R_2008_12_09' would be the field data (minus quotes) in the mapping file. To use the run prefix from the fasta label, such as 'FLP3FBN01ELBSX', where 'FLP3FBN01' is generated from the run ID, use 'run_prefix' and set the run prefix to be used as the data under the column header 'run_prefix'
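As an illustration of what "values that can be parsed from the fasta labels" means, the sketch below extracts `key=value` pairs from a label line. The function name `parse_label_fields` and the exact label layout are assumptions made for this example; the tool's own parsing may differ.

```python
def parse_label_fields(label):
    """Collect whitespace-separated key=value tokens from a fasta label."""
    fields = {}
    for token in label.lstrip(">").split():
        if "=" in token:
            # Split on the first '=' only, so values may contain '='.
            key, value = token.split("=", 1)
            fields[key] = value
    return fields


# Hypothetical label built from the examples in the text above.
label = ">FLP3FBN01ELBSX length=250 plate=R_2008_12_09"
fields = parse_label_fields(label)
print(fields["plate"])
```

With this label, a mapping file column named 'plate' containing 'R_2008_12_09' would match the parsed field.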

Enable to truncate at the first N character encountered in the sequences?:

This will disable testing for ambiguous bases
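A minimal sketch of this truncation, assuming it simply cuts the read at the first 'N' and keeps everything before it (the helper name `truncate_at_first_n` is hypothetical). Because nothing ambiguous remains after the cut, a separate ambiguous-base test is no longer needed.

```python
def truncate_at_first_n(seq):
    """Return the sequence up to, but not including, the first N."""
    idx = seq.upper().find("N")
    # find() returns -1 when no N is present; the read is then unchanged.
    return seq if idx == -1 else seq[:idx]


print(truncate_at_first_n("ACGTNACGT"))
```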

What it does

This tool splits libraries according to the barcodes specified in the mapping file.

Since newer sequencing technologies provide many reads per run (e.g. the 454 GS FLX Titanium series can produce 400-600 million base pairs with 400-500 base pair read lengths), researchers are now finding it useful to combine multiple samples into a single 454 run. This multiplexing is achieved through the application of a pyrosequencing-tailored nucleotide barcode design (described in Parameswaran et al., 2007). By assigning individual, unique, sample-specific barcodes, multiple sequencing runs may be performed in parallel and the resulting reads can later be binned according to sample. The script split_libraries.py performs this task, in addition to several quality-filtering steps, including user-defined cut-offs for sequence lengths, end-trimming, and minimum quality score. To summarize, by using the fasta, mapping, and quality files, split_libraries.py will parse sequences that meet user-defined quality thresholds and then rename each read with the appropriate Sample ID, thus formatting the sequence data for downstream analysis. If a combination of different sequencing technologies is used in any particular study, split_libraries.py can be used to perform the quality filtering for each library individually, and the output may then be combined.

Sequences from samples that are not found in the mapping file (no corresponding barcode) and sequences without the correct primer sequence will be excluded. Additional scripts can be used to exclude sequences that match a given reference sequence (e.g. the human genome; exclude_seqs_by_blast.py) and/or sequences that are flagged as chimeras (identify_chimeric_seqs.py).
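The binning and exclusion rules described above can be sketched as follows. This is a simplified illustration, not the tool's implementation: the function name `demultiplex` and all data are hypothetical, and the sketch assumes fixed-length barcodes at the start of each read, exact barcode and primer matching (no error correction, no degenerate primer bases), and output names of the form SampleID_counter.

```python
def demultiplex(reads, barcode_to_sample, primer, start_index=1):
    """Assign reads to samples by barcode; drop reads that do not match.

    reads: iterable of (read_id, sequence) pairs.
    barcode_to_sample: dict mapping barcode string to Sample ID.
    """
    assigned = []
    counter = start_index
    # Assumption: every barcode has the same length.
    bc_len = len(next(iter(barcode_to_sample)))
    for read_id, seq in reads:
        barcode, rest = seq[:bc_len], seq[bc_len:]
        sample = barcode_to_sample.get(barcode)
        # Exclude reads with an unknown barcode or without the primer.
        if sample is None or not rest.startswith(primer):
            continue
        # Rename the read with its Sample ID; strip barcode and primer.
        assigned.append((f"{sample}_{counter}", rest[len(primer):]))
        counter += 1
    return assigned


# Hypothetical inputs, for illustration only.
reads = [
    ("r1", "AAAACCGGTTTT"),  # known barcode, correct primer
    ("r2", "GGGGCCGGAAAA"),  # barcode not in the mapping
    ("r3", "AAAATTTTCCCC"),  # known barcode, wrong primer
]
barcodes = {"AAAA": "S1", "CCCC": "S2"}
print(demultiplex(reads, barcodes, primer="CCGG"))
```

Only the first read is kept: the second has no corresponding barcode in the mapping and the third lacks the expected primer.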

More information about this tool is available in the QIIME documentation.