Error rates are learned by alternating between sample inference and error rate estimation until convergence.
In addition, a plot is generated (with plotErrors) that shows the observed frequency of each transition (e.g. A->C) as a function of the associated quality score, together with the final estimated error rates (if they exist). Optionally, the initial input rates and the expected error rates under the nominal definition of quality scores can be added to the plot.
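As a rough illustration of the "nominal definition of quality scores": a Phred quality score Q corresponds to an error probability of 10^(-Q/10). The sketch below computes that value and, under the labeled assumption that a wrong call is equally likely to be any of the three other bases, a per-transition rate. The function names are hypothetical and are not part of dada2.

```python
def nominal_error_rate(q):
    """Probability that a base call with Phred quality score q is wrong."""
    return 10 ** (-q / 10)


def nominal_transition_rate(q):
    """Nominal rate of one specific transition (e.g. A->C) at quality q.

    Assumption: an erroneous call is spread evenly over the three
    alternative bases, so each transition gets one third of the total.
    """
    return nominal_error_rate(q) / 3


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, nominal_error_rate(q), nominal_transition_rate(q))
```

Observed frequencies lying far above these values at high quality scores would suggest that the quality scores are more optimistic than the data.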
The input is a FASTQ dataset containing the filtered and trimmed reads of the samples.
The output is a dataset of type dada2_errorrates (an RData file containing the output of dada2's learnErrors function) and a plot showing the error rates for each possible transition (A→C, A→G, ...).
The learned error rates are the input to the dada2: dada tool.
The learnErrors method learns a parametric error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).
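The idea behind the initial guess can be illustrated with a toy sketch. It is not dada2's implementation: it assumes all reads have the same length, ignores quality scores and alignment, and uses a hypothetical function name. It treats the most abundant sequence as the only correct one and counts every difference from it as an error.

```python
from collections import Counter


def max_possible_error_rates(reads):
    """Toy estimate of transition rates, assuming only the most abundant
    read is correct. Assumes equal-length reads and ignores quality scores."""
    abundance = Counter(reads)
    reference, _ = abundance.most_common(1)[0]

    # Count how often each reference base is observed as each base.
    counts = Counter()
    for read, n in abundance.items():
        if len(read) != len(reference):
            raise ValueError("this toy sketch needs equal-length reads")
        for ref_base, obs_base in zip(reference, read):
            counts[(ref_base, obs_base)] += n

    # Normalise per reference base so each row of rates sums to 1.
    rates = {}
    for ref_base in "ACGT":
        total = sum(counts[(ref_base, b)] for b in "ACGT")
        if total == 0:
            continue
        for obs_base in "ACGT":
            rates[f"{ref_base}->{obs_base}"] = counts[(ref_base, obs_base)] / total
    return rates


if __name__ == "__main__":
    toy_reads = ["ACGT"] * 8 + ["ACGA"] + ["CCGT"]
    for transition, rate in max_possible_error_rates(toy_reads).items():
        print(transition, rate)
```

Because every non-reference read is counted as erroneous, these rates are an upper bound; the alternating procedure then lowers them as more sequences are recognised as genuine.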
The estimated error rates (black lines in the plot) are expected to fit the observed rates (points in the plot) well, and the error rates should drop with increasing quality. If this is not the case, try increasing the number of bases used for learning.
Error functions:
The intended use of the dada2 tools for paired sequencing data is shown in the following image.
Note: In particular for the analysis of paired collections, the collections should be sorted lexicographically before the analysis.
For single-end data, the steps "Unzip collection" and "mergePairs" are not necessary.
More information can be found on the dada2 homepage: https://benjjneb.github.io/dada2/index.html (in particular the tutorials) or in the documentation of dada2's R package: https://bioconductor.org/packages/release/bioc/html/dada2.html (in particular the PDF, which contains the full documentation of all parameters).
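A small sketch of what lexicographic sorting does to sample names, using hypothetical file names and Python's default string ordering (which may differ in detail from the ordering applied by the workflow platform). The order is not numeric, but sorting the forward and reverse lists the same way keeps the pairs aligned.

```python
# Hypothetical file names, for illustration only.
forward = ["sample2_R1.fastq", "sample10_R1.fastq", "sample1_R1.fastq"]
reverse = ["sample1_R2.fastq", "sample10_R2.fastq", "sample2_R2.fastq"]

# Lexicographic sort: strings are compared character by character.
forward_sorted = sorted(forward)
reverse_sorted = sorted(reverse)

# After sorting both lists identically, element i of each list
# belongs to the same sample.
pairs = list(zip(forward_sorted, reverse_sorted))

if __name__ == "__main__":
    for fwd, rev in pairs:
        print(fwd, rev)
```

Note that "sample10" sorts before "sample2" here, which is harmless as long as both collections are ordered by the same rule.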