
Perform OTU picking (version 1.9.1.0)

Input sequences file:

Method for picking OTUs:

Sequence similarity threshold:

OTU identifier prefix for the de novo OTU pickers:

Enable reverse strand matching?:

Will double the amount of memory used

Suppress presorting of sequences by abundance?:

Pass the --optimal flag to uclust?:

Pass the --exact flag to uclust?:

Pass the --user_sort flag to uclust?:

Max_accepts value:

Max_rejects value:

Stepwords value:

Word length value:

Don't pass --stable-sort to uclust?:

Don't collapse exact matches before calling uclust?:

Prefix length (prefix_prefilter_length) for prefiltering; sequences whose first prefix_prefilter_length bases are identical are automatically grouped into a single OTU:

This is useful for large sequence collections where OTU picking doesn't scale well

Prefilter data so seqs which are identical prefixes of a longer seq are automatically grouped into a single OTU?:

This is useful for large sequence collections where OTU picking doesn't scale well

Selects subsets of sequences detected as non-chimeras to retain after de novo and reference based chimera detection:

What it does

The OTU picking step assigns similar sequences to operational taxonomic units, or OTUs, by clustering sequences based on a user-defined similarity threshold. Sequences which are similar at or above the threshold level are taken to represent the presence of a taxonomic unit (e.g., a genus, when the similarity threshold is set at 0.94) in the sequence collection.

Currently, the following clustering methods have been implemented in QIIME:

uclust: creates "seeds" of sequences which generate clusters based on percent identity.
uclust_ref: like uclust, but takes a reference database to use as seeds. New clusters can be toggled on or off.
usearch: creates "seeds" of sequences which generate clusters based on percent identity, filters low-abundance clusters, and performs de novo and reference-based chimera detection.
usearch_ref: like usearch, but takes a reference database to use as seeds. New clusters can be toggled on or off.
sumaclust: creates "seeds" of sequences which generate clusters based on a similarity threshold.
swarm: creates "seeds" of sequences which generate clusters based on a resolution threshold.
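
The seed-based idea shared by these methods can be illustrated with a toy sketch. This is not the algorithm used by uclust, usearch, sumaclust, or swarm, which rely on alignments and faster heuristics. It assumes ungapped sequences compared position by position, and the function names identity and greedy_seed_clusters are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (toy, ungapped)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_seed_clusters(seqs, threshold=0.97):
    """Assign each sequence to the first seed it matches at or above threshold.

    seqs is a dict of sequence id -> sequence, processed in insertion order.
    A sequence that matches no existing seed becomes a new seed.
    Returns a dict of seed id -> list of member ids (the seed included).
    """
    seeds = []      # (seed_id, seed_sequence) pairs, in creation order
    clusters = {}   # seed_id -> member ids
    for seq_id, seq in seqs.items():
        for seed_id, seed_seq in seeds:
            if identity(seq, seed_seq) >= threshold:
                clusters[seed_id].append(seq_id)
                break
        else:
            # No seed was similar enough, so this sequence starts a new cluster.
            seeds.append((seq_id, seq))
            clusters[seq_id] = [seq_id]
    return clusters


# Made-up sequences for illustration only.
example = {
    "seq1": "ACGTACGTAC",
    "seq2": "ACGTACGTAT",  # differs from seq1 at one of ten positions
    "seq3": "TTTTTTTTTT",
}
print(greedy_seed_clusters(example, threshold=0.9))
```

In this sketch the result depends on input order, which is one reason the real tools expose options for presorting sequences by abundance.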

Chimera checking with usearch 6.X is implemented in identify_chimeric_seqs.py. Chimera checking should be done first with usearch 6.X, and the filtered resulting fasta file can then be clustered.

The primary inputs for pick_otus.py are:

A FASTA file containing sequences to be clustered;
An OTU threshold (default is 0.97, roughly corresponding to species-level OTUs);
The method to be applied for clustering sequences into OTUs.

pick_otus.py takes a standard fasta file as input.
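
Outside Galaxy, these three inputs map onto pick_otus.py command-line options. The following is a minimal sketch that only assembles the argument list, assuming the QIIME 1.9.1 short options -i (input FASTA), -o (output directory), -m (method), and -s (similarity). The helper name build_pick_otus_command and the file names are hypothetical.

```python
def build_pick_otus_command(input_fasta, output_dir, method="uclust", similarity=0.97):
    """Return an argument list for pick_otus.py without running it."""
    return [
        "pick_otus.py",
        "-i", input_fasta,       # FASTA file containing sequences to be clustered
        "-o", output_dir,        # directory for the .txt and .log output files
        "-m", method,            # clustering method, e.g. uclust or swarm
        "-s", str(similarity),   # OTU similarity threshold
    ]


# Hypothetical file names; adjust to your own data.
cmd = build_pick_otus_command("seqs.fna", "picked_otus")
print(" ".join(cmd))

# To actually run it (requires a working QIIME installation):
# import subprocess
# subprocess.run(cmd, check=True)
```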

The output consists of two files (i.e. seqs_otus.txt and seqs_otus.log). The .txt file is composed of tab-delimited lines, where the first field on each line is an (arbitrary) cluster identifier and the remaining fields are the identifiers of the sequences assigned to that cluster. Sequence identifiers correspond to those provided in the input FASTA file. The usearch method (i.e. usearch quality filter) can additionally produce log files for each intermediate call to usearch.

Example lines from the resulting .txt file:

0	seq1	seq5
1	seq2
2	seq3
3	seq4	seq6	seq7

This result implies that four clusters were created from seven input sequences. The first cluster (cluster id 0) contains two sequences, sequence ids seq1 and seq5; the second cluster (cluster id 1) contains one sequence, sequence id seq2; the third cluster (cluster id 2) contains one sequence, sequence id seq3; and the final cluster (cluster id 3) contains three sequences, sequence ids seq4, seq6, and seq7.
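
The tab-delimited format described above can be read with a few lines of Python. This is a minimal sketch, not part of QIIME; the function name parse_otu_map is hypothetical, and the example lines are the ones shown above.

```python
def parse_otu_map(lines):
    """Parse OTU map lines into a dict of cluster id -> list of sequence ids."""
    otus = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        # First field is the cluster id, the rest are member sequence ids.
        otus[fields[0]] = fields[1:]
    return otus


example_lines = [
    "0\tseq1\tseq5\n",
    "1\tseq2\n",
    "2\tseq3\n",
    "3\tseq4\tseq6\tseq7\n",
]
otus = parse_otu_map(example_lines)
for otu_id, members in otus.items():
    print(otu_id, len(members), members)
```

For a real file, passing an open file handle works the same way, for example parse_otu_map(open("seqs_otus.txt")).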

The resulting .log file contains a list of parameters passed to the pick_otus.py script along with the output location of the resulting .txt file.