Unicycler
Unicycler is a hybrid assembly pipeline for bacterial genomes. It uses both Illumina reads and long reads (PacBio or Nanopore) to produce complete and accurate assemblies. It was written by Ryan Wick at the University of Melbourne's Centre for Systems Genomics. Much of the description below is taken from Unicycler's GitHub page.
Input data
Unicycler accepts short (Illumina) reads in FASTQ format. Galaxy additionally requires that these use Sanger encoding of quality scores. Long reads (from Oxford Nanopore or PacBio) can be in either FASTQ or FASTA format.
The input options are:
-1 SHORT1, --short1 SHORT1
    FASTQ file of short reads (first reads in each pair)
-2 SHORT2, --short2 SHORT2
    FASTQ file of short reads (second reads in each pair)
-s SHORT_UNPAIRED, --short_unpaired SHORT_UNPAIRED
    FASTQ file of unpaired short reads
-l LONG, --long LONG
    FASTQ or FASTA file of long reads, if all reads are available at start.
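As an illustration of how these input options combine, here is a minimal Python sketch that assembles the input portion of a Unicycler command line. The helper name `unicycler_input_args` and the file names are hypothetical, and the rule that both mates of a pair must be given together is an assumption of this sketch. A complete command also needs options that are not covered in this section.

```python
from typing import List, Optional


def unicycler_input_args(
    short1: Optional[str] = None,
    short2: Optional[str] = None,
    short_unpaired: Optional[str] = None,
    long_reads: Optional[str] = None,
) -> List[str]:
    """Build the read-input arguments for Unicycler (hypothetical helper)."""
    # Assumption of this sketch: paired reads need both mates.
    if (short1 is None) != (short2 is None):
        raise ValueError("short1 and short2 must be given together")

    args: List[str] = []
    if short1 is not None and short2 is not None:
        args += ["--short1", short1, "--short2", short2]
    if short_unpaired is not None:
        args += ["--short_unpaired", short_unpaired]
    if long_reads is not None:
        args += ["--long", long_reads]
    if not args:
        raise ValueError("at least one read file is required")
    return args


# Placeholder file names; only the input options from this section are shown.
example = unicycler_input_args(
    short1="reads_1.fastq", short2="reads_2.fastq", long_reads="long.fastq"
)
print(" ".join(["unicycler"] + example))
```

Returning a list of arguments rather than a single string keeps file names with spaces intact when the list is later passed to something like `subprocess.run`.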
Bridging mode
Unicycler can be run in three modes: conservative, normal (the default) and bold, set with the --mode option. Conservative mode is least likely to produce a complete assembly but has a very low risk of misassembly. Bold mode is most likely to produce a complete assembly but carries greater risk of misassembly. Normal mode is intermediate regarding both completeness and misassembly risk. See description of modes for more information.
The available modes are:
--mode {conservative,normal,bold}
    Bridging mode (default: normal)
    conservative = smaller contigs, lowest misassembly rate
    normal = moderate contig size and misassembly rate
    bold = longest contigs, higher misassembly rate
Skip SPAdes error correction step
Sequencing data contains a substantial number of sequencing errors that manifest themselves as deviations (bulges and non-connected components) within the assembly graph. One way to improve the graph even before constructing it is to reduce the number of sequencing errors by performing error correction. SPAdes, which is used by Unicycler for error correction and assembly, uses BayesHammer to correct the reads. Here is a brief summary of what it does:
- SPAdes (or rather BayesHammer) counts k-mers in reads and computes k-mer statistics that take base quality values into account.
- A Hamming graph is constructed in which k-mers are nodes. In this graph, edges connect nodes (k-mers) if they differ from each other by a number of nucleotides up to a certain threshold (the Hamming distance). The graph is central to the error correction algorithm.
- Bayesian subclustering is performed on the graph produced in the previous step. For each k-mer we now know the center of its subcluster.
- Solid k-mers are derived from cluster centers and are assumed to be error free.
- Solid k-mers are mapped back to the reads and used to correct them.
This step takes considerable time, so it can be skipped if one needs to evaluate assemblies quickly. However, skipping it is not recommended if one is trying to produce a final high-quality assembly.
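The summary of error correction above can be made concrete with a toy Python sketch. It is a deliberate simplification and not BayesHammer itself: it ignores base quality values, replaces Bayesian subclustering with plain connected components of the Hamming graph, and takes the most frequent k-mer of each component as its center. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List


def count_kmers(reads: List[str], k: int) -> Counter:
    """Count every k-mer occurring in the reads (quality values are ignored)."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length k-mers differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_centers(counts: Counter, max_dist: int = 1) -> Dict[str, str]:
    """Map each k-mer to the most frequent k-mer of its Hamming-graph component."""
    kmers = list(counts)
    parent = {km: km for km in kmers}

    def find(km: str) -> str:
        # Union-find lookup with path halving.
        while parent[km] != km:
            parent[km] = parent[parent[km]]
            km = parent[km]
        return km

    # Edges of the Hamming graph: k-mers within max_dist mismatches.
    for a, b in combinations(kmers, 2):
        if hamming(a, b) <= max_dist:
            parent[find(a)] = find(b)

    components: Dict[str, List[str]] = {}
    for km in kmers:
        components.setdefault(find(km), []).append(km)

    center_of: Dict[str, str] = {}
    for members in components.values():
        # Simplification: the "solid" center is the most frequent member.
        center = max(members, key=lambda km: (counts[km], km))
        for km in members:
            center_of[km] = center
    return center_of


def correct_read(read: str, k: int, center_of: Dict[str, str]) -> str:
    """Replace each k-mer window of the read by its cluster center."""
    bases = list(read)
    for i in range(len(bases) - k + 1):
        center = center_of.get("".join(bases[i:i + k]))
        if center is not None:
            bases[i:i + k] = list(center)
    return "".join(bases)


# Five error-free copies of a read plus one copy with a single substitution.
reads = ["ACGTACGT"] * 5 + ["ACGAACGT"]
centers = cluster_centers(count_kmers(reads, k=4))
corrected = correct_read("ACGAACGT", 4, centers)
```

The all-pairs comparison is quadratic in the number of distinct k-mers, which is fine for a toy example but would not scale to real sequencing data.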
Do not rotate completed replicons to start at a standard gene
Unicycler uses TBLASTN to search for dnaA or repA alleles in each completed replicon. If one is found, the sequence is rotated and/or flipped so that it begins with that gene encoded on the forward strand. This provides consistently oriented assemblies and reduces the risk that a gene will be split across the start and end of the sequence.
The following option turns rotation on and off:
--no_rotate
    Do not rotate completed replicons to start at a standard gene (default: completed replicons are rotated)
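The rotate-and-flip operation described above can be sketched in a few lines of Python. This is an illustration rather than Unicycler's implementation: the function names are hypothetical, the gene coordinates are assumed to be already known (for example from a search hit), and a gene that spans the start and end of the sequence is not handled.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def rotate_to_gene(seq: str, gene_start: int, gene_end: int, forward: bool) -> str:
    """Rotate a circular sequence so that it begins with the given gene.

    gene_start and gene_end are 0-based, end-exclusive coordinates on seq.
    If the gene lies on the reverse strand, the sequence is flipped first
    so that the gene ends up encoded on the forward strand.
    """
    if forward:
        return seq[gene_start:] + seq[:gene_start]
    flipped = reverse_complement(seq)
    # After flipping, the gene begins where its end used to be,
    # counted from the other side of the sequence.
    new_start = len(seq) - gene_end
    return flipped[new_start:] + flipped[:new_start]


# Toy replicons containing the gene ATGCCC on either strand.
forward_case = rotate_to_gene("GGATGCCCTT", 2, 8, forward=True)
reverse_case = rotate_to_gene("AAGGGCATTT", 2, 8, forward=False)
```

In both toy cases the returned sequence starts with `ATGCCC`, and its length is unchanged because rotation only moves the arbitrary break point of a circular molecule.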
Do not use Pilon to polish the final assembly
Pilon is a tool for improving the overall quality of draft assemblies and finding variation among strains. Unicycler uses it for assembly polishing.
The following option turns the Pilon step of the Unicycler pipeline on and off:
--no_pilon
    Do not use Pilon to polish the final assembly (default: Pilon is used)
Expected number of linear sequences
If you expect your sample to contain linear (non-circular) sequences, set this option:
--linear_seqs EXPECTED_LINEAR_SEQS
    The expected number of linear (i.e. non-circular) sequences in the underlying sequence
SPAdes options
This section provides control of SPAdes options:
--min_kmer_frac MIN_KMER_FRAC
    Lowest k-mer size for SPAdes assembly, expressed as a fraction of the read length (default: 0.2)
--max_kmer_frac MAX_KMER_FRAC
    Highest k-mer size for SPAdes assembly, expressed as a fraction of the read length (default: 0.95)
--kmer_count KMER_COUNT
    Number of k-mer steps to use in SPAdes assembly (default: 10)
--depth_filter DEPTH_FILTER
    Filter out contigs lower than this fraction of the chromosomal depth, if doing so does not result in graph dead ends (default: 0.25)
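How the three k-mer options interact can be illustrated with a rough Python sketch. The rounding to odd values, the even spacing between the lowest and highest k-mer, and the upper limit of 127 are assumptions made for illustration; the selection logic Unicycler actually uses may differ. The function name `kmer_sizes` is hypothetical.

```python
from typing import List


def kmer_sizes(
    read_length: int,
    min_kmer_frac: float = 0.2,
    max_kmer_frac: float = 0.95,
    kmer_count: int = 10,
    upper_limit: int = 127,  # assumed cap on k-mer size
) -> List[int]:
    """Pick up to kmer_count odd k-mer sizes between two read-length fractions."""
    if kmer_count < 1:
        raise ValueError("kmer_count must be at least 1")

    lowest = int(read_length * min_kmer_frac)
    highest = min(int(read_length * max_kmer_frac), upper_limit)

    # Assumption: k-mer sizes are odd, so round the bounds inwards.
    if lowest % 2 == 0:
        lowest += 1
    if highest % 2 == 0:
        highest -= 1
    if highest < lowest:
        raise ValueError("read length too short for the requested fractions")

    candidates = list(range(lowest, highest + 1, 2))
    if len(candidates) <= kmer_count:
        return candidates
    if kmer_count == 1:
        return [candidates[-1]]

    # Evenly spaced picks that always include both end points.
    last = len(candidates) - 1
    picks = {round(i * last / (kmer_count - 1)) for i in range(kmer_count)}
    return [candidates[i] for i in sorted(picks)]


sizes = kmer_sizes(150)
print(sizes)
```

With the defaults and 150 bp reads, the sketch starts at 31 (0.2 of the read length, rounded up to an odd number) and stops at the assumed cap of 127 instead of 0.95 of the read length.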
Rotation options
Unicycler attempts to rotate circular assemblies to make sure that they begin at a consistent starting gene. The following parameters control assembly rotation:
--start_genes START_GENES
    FASTA file of genes for start point of rotated replicons (default: start_genes.fasta)
--start_gene_id START_GENE_ID
    The minimum required BLAST percent identity for a start gene search (default: 90.0)
--start_gene_cov START_GENE_COV
    The minimum required BLAST percent coverage for a start gene search (default: 95.0)
Graph cleaning options
These options control the removal of small leftover sequences after bridging is complete:
--min_component_size MIN_COMPONENT_SIZE
    Unbridged graph components smaller than this size (bp) will be removed from the final graph (default: 1000)
--min_dead_end_size MIN_DEAD_END_SIZE
    Graph dead ends smaller than this size (bp) will be removed from the final graph (default: 1000)
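The effect of a component size threshold can be illustrated with a small Python sketch over a toy graph. It is only a sketch under stated assumptions: the graph is given as segment lengths plus undirected links, every component is treated as unbridged, and a component's size is taken to be the summed length of its segments. The function name `small_components` is hypothetical, and dead-end removal is not covered.

```python
from typing import Dict, List, Set, Tuple


def small_components(
    lengths: Dict[str, int],
    links: List[Tuple[str, str]],
    min_component_size: int = 1000,
) -> Set[str]:
    """Return segments in connected components shorter than the threshold (bp)."""
    neighbours: Dict[str, Set[str]] = {name: set() for name in lengths}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    to_remove: Set[str] = set()
    seen: Set[str] = set()
    for start in lengths:
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        component: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        # Assumed size measure: total length of all segments in the component.
        if sum(lengths[name] for name in component) < min_component_size:
            to_remove |= component
    return to_remove


# Toy graph: segments 1-2 form a large component, 3-4 a small one, 5 stands alone.
toy_lengths = {"1": 5000, "2": 3000, "3": 400, "4": 300, "5": 1200}
toy_links = [("1", "2"), ("3", "4")]
removed = small_components(toy_lengths, toy_links)
```

Here segments 3 and 4 total 700 bp and fall below the default threshold of 1000 bp, while the isolated 1200 bp segment is kept.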
Long read alignment options
These options control the alignment of long reads to the assembly graph:
--contamination CONTAMINATION
    FASTA file of known contamination in long reads
--scores SCORES
    Comma-delimited string of alignment scores: match, mismatch, gap open, gap extend (default: 3,-6,-5,-2)
--low_score LOW_SCORE
    Score threshold - alignments below this are considered poor (default: set threshold automatically)
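To show what the four values in the scores string mean, the following Python sketch parses the string and scores an alignment that has already been computed. The names `Scores`, `parse_scores` and `alignment_score` are hypothetical, and the gap convention is an assumption: the first position of a gap costs the gap open score and each further position costs the gap extend score. Unicycler's aligner may apply the scores differently.

```python
from typing import NamedTuple


class Scores(NamedTuple):
    match: int
    mismatch: int
    gap_open: int
    gap_extend: int


def parse_scores(text: str = "3,-6,-5,-2") -> Scores:
    """Parse 'match,mismatch,gap open,gap extend' into named values."""
    parts = [int(p) for p in text.split(",")]
    if len(parts) != 4:
        raise ValueError("expected four comma-delimited integers")
    return Scores(*parts)


def alignment_score(query: str, target: str, scores: Scores) -> int:
    """Score two pre-aligned, equal-length strings that use '-' for gaps."""
    if len(query) != len(target):
        raise ValueError("aligned strings must have the same length")
    total = 0
    in_gap = False
    for q, t in zip(query, target):
        if q == "-" or t == "-":
            # Assumed convention: open once, then extend.
            total += scores.gap_extend if in_gap else scores.gap_open
            in_gap = True
        else:
            total += scores.match if q == t else scores.mismatch
            in_gap = False
    return total


defaults = parse_scores()
one_gap = alignment_score("ACGT-ACG", "ACGTTACC", defaults)
long_gap = alignment_score("AC--GT", "ACTTGT", defaults)
```

With the default scores, a single mismatch cancels two matches, so alignments of noisy long reads are penalised heavily wherever they disagree with the graph.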
Outputs
Galaxy's wrapper for Unicycler produces two outputs:
- final assembly in FASTA format
- final assembly graph in graph format
While most will likely be interested in the FASTA dataset, the graph dataset is also quite useful and can be visualized using tools such as Bandage.