Description

Two additional files are generated by this tool: log files in tab-separated and human-readable format, which are hidden from the history list. To view them, click the cogwheel icon in the History panel and select "Include Hidden Datasets".

UCHIME is an algorithm for detecting chimeric sequences. It is implemented in the USEARCH-Tool-Suite.

The fundamental step in UCHIME is a search for a 3-way alignment of a query sequence (Q) with two parent sequences (A and B) such that one parent is more similar to one segment of the query and the other parent is more similar to another segment.

A score is calculated from the alignment. Higher scores indicate a stronger chimeric signal. A score cutoff set by the minh option (0.28 by default) determines whether the query is classified as a chimera.

This search can be performed with a reference database of parent sequences believed to be chimera-free provided by the user, or the database can be constructed de novo from the query sequences. In de novo mode, the sequences are assumed to be derived from one PCR run. In this case, parent sequences should be more abundant than their chimeras because the parent amplicons will have undergone more rounds of amplification.

Please note: The free 32-bit version of USEARCH is limited to 4GB of RAM or less (Linux, OSX). If you are using the free 32-bit version, we recommend reference datasets of up to 800MB in size to avoid "out of memory" errors. Please see the USEARCH site for more information on memory requirements.
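
As a quick pre-flight check, the size of a reference file can be compared against the 800MB guideline before a job is submitted. The following is a minimal sketch and not part of USEARCH or this wrapper: the function name is hypothetical, and reading "800MB" as 800 * 1024 * 1024 bytes is an assumption.

```python
import os

# Assumption: "800MB" is interpreted as 800 * 1024 * 1024 bytes.
MAX_REFERENCE_BYTES = 800 * 1024 * 1024


def reference_size_ok(path, limit=MAX_REFERENCE_BYTES):
    """Return True if the file at `path` is no larger than `limit` bytes."""
    return os.path.getsize(path) <= limit
```

A file that fails this check is not guaranteed to trigger an error; the check only flags databases that exceed the recommended size.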

Parameters

Reference database (ref) mode

A database file of nucleotide sequences must be specified using the Reference Database (ref) option. The database may be in FASTA format. The reference database should include sequences that might appear as parents in the query set. These should be high-quality sequences that are believed to be free of chimeras. Errors in reference sequences will degrade detection accuracy and increase the number of false positives. Chimeras will not be detected if their parents (or sufficiently close relatives) are not present in the database.

De novo mode

This mode performs de novo chimera detection using the UCHIME algorithm. In de novo mode, abundance skew is used to distinguish chimeras from parents. The input file must contain estimated amplicon sequences with integer abundances specified using size annotations, e.g.:

>FQ23BBGZ5;size=23;

The minimum abundance skew is specified by the abskew parameter, which defaults to 2.0 (because one round of PCR doubles the abundance). Abundance is a measure of how many amplicons with a given unique sequence were present in the sample after amplification by PCR. One way to estimate this is to sum the total number of reads in the cluster used to estimate the given amplicon sequence. UCHIME uses only ratios of abundances, so the absolute value does not matter. However, the number of reads is a useful indicator; for example, a cluster containing one read is likely to be spurious. Amplicon sequences and abundances can be estimated using USEARCH, or by using another algorithm such as Chris Quince's PyroNoise or AmpliconNoise. When using de novo mode, sequences should be estimated amplicons from one sequencing run (strictly, one PCR amplification stage); otherwise abundances may not be directly comparable.
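
The skew criterion can be illustrated with a few lines of Python. This is a sketch under assumptions, not UCHIME's implementation: it takes the skew to be the abundance of the less abundant parent divided by the abundance of the query, and both function names are hypothetical.

```python
def abundance_skew(parent_a_size, parent_b_size, query_size):
    """Ratio of the less abundant parent's abundance to the query's abundance."""
    if query_size <= 0:
        raise ValueError("query abundance must be a positive integer")
    return min(parent_a_size, parent_b_size) / query_size


def passes_abskew(parent_a_size, parent_b_size, query_size, abskew=2.0):
    """True if both parents are at least `abskew` times as abundant as the query."""
    return abundance_skew(parent_a_size, parent_b_size, query_size) >= abskew
```

For example, a query with size=23 whose candidate parents have size=120 and size=46 has a skew of 46 / 23 = 2.0, which meets the default threshold. Because only the ratio is used, multiplying all three abundances by the same factor gives the same result.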

Inputs

Reference database mode

An input file containing the sequences in FASTA format.
A reference database file in FASTA format containing nucleotide sequences believed to be free of chimeras.

De novo mode

A FASTA file of estimated amplicon sequences, each with its abundance specified by a size annotation in the label, e.g. >FQ23BBGZ5;size=23;
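
Before running de novo mode, it can help to confirm that every label carries a size annotation. The sketch below scans FASTA lines and reports labels without one. The regular expression and function names are assumptions made for illustration, and they only cover the ;size=N; form shown above.

```python
import re

# Matches ";size=23;" or a trailing ";size=23" in a label such as "FQ23BBGZ5;size=23;".
SIZE_RE = re.compile(r";size=(\d+)(?:;|$)")


def parse_size(label):
    """Return the integer abundance from a FASTA label, or None if absent."""
    match = SIZE_RE.search(label)
    return int(match.group(1)) if match else None


def labels_missing_size(fasta_lines):
    """Return the labels (without the leading '>') that have no size annotation."""
    missing = []
    for line in fasta_lines:
        if line.startswith(">"):
            label = line[1:].strip()
            if parse_size(label) is None:
                missing.append(label)
    return missing
```

Passing an open file handle as fasta_lines works because the function only iterates over lines; an empty result means every label has a size annotation.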

Output

This tool produces four output files, two of which are hidden by default.

To view the hidden files: click on the cogwheel icon in the history panel and select 'Include Hidden Datasets'.

A FASTA file of predicted chimeras
A FASTA file of non-chimeras
(hidden) A human readable file of chimeric alignments
(hidden) A tab-separated file with the following 18 columns:

1	Score	Value >= 0.0, high score means more likely to be a chimera
2	Q	Query label
3	A	Parent A label
4	B	Parent B label
5	T	Top parent (T) label. This is the closest reference sequence; usually either A or B
6	IdQM	Percent identity of query and the model (M) constructed as a segment of A and a segment of B
7	IdQA	Percent identity of Q and A
8	IdQB	Percent identity of Q and B
9	IdAB	Percent identity of A and B
10	IdQT	Percent identity of Q and T
11	LY	Yes votes in left segment
12	LN	No votes in left segment
13	LA	Abstain votes in left segment
14	RY	Yes votes in right segment
15	RN	No votes in right segment
16	RA	Abstain votes in right segment
17	Div	Divergence, defined as (IdQM - IdQT)
18	YN	Y(yes) or N(no) classification as a chimera
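
The tab-separated file can be loaded for downstream filtering with Python's standard csv module. This is a minimal sketch: the column names follow the table above, while the function names and the assumption that the file has no header row are illustrative rather than taken from the tool.

```python
import csv

# Column names taken from the table above, in order.
COLUMNS = ["Score", "Q", "A", "B", "T", "IdQM", "IdQA", "IdQB", "IdAB",
           "IdQT", "LY", "LN", "LA", "RY", "RN", "RA", "Div", "YN"]


def read_uchime_table(lines):
    """Parse tab-separated rows into dicts keyed by column name."""
    rows = []
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        row["Score"] = float(row["Score"])  # other columns are kept as strings
        rows.append(row)
    return rows


def chimeric_queries(rows):
    """Return the query labels classified as chimeras (YN == 'Y')."""
    return [row["Q"] for row in rows if row["YN"] == "Y"]
```

Only the Score column is converted to a number here; the other columns are left as strings so that any placeholder values in the file pass through unchanged.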

Resources

UCHIME

Author

Robert C. Edgar (bob@drive5.com)

Wrapper Author

QFAB Bioinformatics (support@qfab.org)