Galaxy | Tool Preview

Perform OTU picking (version 1.9.1.0)
Will double the amount of memory used
This is useful for large sequence collections where OTU picking doesn't scale well

What it does

The OTU picking step assigns similar sequences to operational taxonomic units (OTUs) by clustering sequences based on a user-defined similarity threshold. Sequences whose similarity is at or above the threshold are taken to represent the presence of a taxonomic unit (e.g., a genus, when the similarity threshold is set at 0.94) in the sequence collection.

Currently, the following clustering methods have been implemented in QIIME:

  1. uclust: creates "seeds" of sequences which generate clusters based on percent identity.
  2. uclust_ref: as uclust, but takes a reference database to use as seeds. New clusters can be toggled on or off.
  3. usearch: creates "seeds" of sequences which generate clusters based on percent identity, filters low-abundance clusters, and performs de novo and reference-based chimera detection.
  4. usearch_ref: as usearch, but takes a reference database to use as seeds. New clusters can be toggled on or off.
  5. sumaclust: creates "seeds" of sequences which generate clusters based on a similarity threshold.
  6. swarm: creates "seeds" of sequences which generate clusters based on a resolution threshold.
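
The seed-based idea shared by these methods can be illustrated with a toy sketch. This is not the uclust, usearch, sumaclust, or swarm algorithm: the function names are hypothetical, and the similarity score comes from Python's difflib rather than from a sequence alignment. It only shows how a sequence either joins the first seed it matches at or above the threshold or becomes a new seed.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Toy similarity score in [0, 1]; real tools score an alignment instead."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_seed_clusters(seqs: Dict[str, str], threshold: float = 0.97) -> List[List[str]]:
    """Group sequence ids into clusters around greedily chosen seeds.

    Each sequence joins the first cluster whose seed it matches at or above
    the threshold; otherwise it becomes the seed of a new cluster.
    """
    seeds: List[str] = []           # seed sequence of each cluster
    clusters: List[List[str]] = []  # sequence ids assigned to each cluster
    for seq_id, seq in seqs.items():
        for i, seed in enumerate(seeds):
            if identity(seq, seed) >= threshold:
                clusters[i].append(seq_id)
                break
        else:
            # No existing seed was similar enough: start a new cluster.
            seeds.append(seq)
            clusters.append([seq_id])
    return clusters
```

Because the assignment is greedy, the result depends on the order in which sequences are visited, which is one reason real tools sort or prefilter their input before clustering.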

Chimera checking with usearch 6.X is implemented in identify_chimeric_seqs.py. When using usearch 6.X, run chimera checking first, then cluster the resulting filtered FASTA file.

The primary inputs for pick_otus.py are:

  1. A FASTA file containing the sequences to be clustered;
  2. An OTU threshold (default is 0.97, roughly corresponding to species-level OTUs);
  3. The method to be applied for clustering sequences into OTUs.

pick_otus.py takes a standard FASTA file as input.
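
The three inputs above map onto a command line along the following lines. This is a sketch: the short flags (-i, -o, -m, -s) are as recalled from the QIIME 1.9 command-line help and should be confirmed with `pick_otus.py -h`; the helper name and the file names seqs.fna and picked_otus/ are hypothetical.

```python
import subprocess
from typing import List


def build_pick_otus_command(input_fasta: str, output_dir: str,
                            method: str = "uclust",
                            similarity: float = 0.97) -> List[str]:
    """Assemble the argument list for a pick_otus.py call (flags assumed)."""
    return [
        "pick_otus.py",
        "-i", input_fasta,       # FASTA file with the sequences to cluster
        "-o", output_dir,        # directory for the .txt and .log outputs
        "-m", method,            # clustering method, e.g. uclust
        "-s", str(similarity),   # OTU similarity threshold
    ]


cmd = build_pick_otus_command("seqs.fna", "picked_otus/")
# On a machine with QIIME installed, the command could then be run with:
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file paths contain spaces.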

The output consists of two files (i.e. seqs_otus.txt and seqs_otus.log). The .txt file is composed of tab-delimited lines, where the first field on each line is an (arbitrary) cluster identifier and the remaining fields are the identifiers of the sequences assigned to that cluster. Sequence identifiers correspond to those provided in the input FASTA file. The usearch method (i.e. the usearch quality filter) can additionally produce a log file for each intermediate call to usearch.

Example lines from the resulting .txt file:

0 seq1 seq5  
1 seq2    
2 seq3    
3 seq4 seq6 seq7

This result shows that four clusters were created from seven input sequences. The first cluster (cluster id 0) contains two sequences, seq1 and seq5; the second cluster (cluster id 1) contains one sequence, seq2; the third cluster (cluster id 2) contains one sequence, seq3; and the final cluster (cluster id 3) contains three sequences, seq4, seq6, and seq7.
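
Reading this file back into a program is straightforward. The following sketch assumes the tab-delimited layout described above; the function name parse_otu_map is hypothetical and is not part of QIIME.

```python
from typing import Dict, Iterable, List


def parse_otu_map(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each cluster identifier to the list of sequence ids in that cluster."""
    otus: Dict[str, List[str]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        # First field is the cluster id; the rest are sequence ids.
        otus[fields[0]] = fields[1:]
    return otus


example = [
    "0\tseq1\tseq5\n",
    "1\tseq2\n",
    "2\tseq3\n",
    "3\tseq4\tseq6\tseq7\n",
]
otu_map = parse_otu_map(example)
n_sequences = sum(len(ids) for ids in otu_map.values())
```

For the example lines, otu_map holds four clusters and n_sequences is 7, matching the description above. With a real output file, the same function can be called on an open file handle, e.g. `parse_otu_map(open("seqs_otus.txt"))`.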

The resulting .log file contains a list of parameters passed to the pick_otus.py script along with the output location of the resulting .txt file.