Unicycler

Unicycler is a hybrid assembly pipeline for bacterial genomes. It uses both Illumina reads and long reads (PacBio or Nanopore) to produce complete and accurate assemblies. It was written by Ryan Wick at the University of Melbourne's Centre for Systems Genomics. Much of the description below is taken from Unicycler's GitHub page.

Input data

Unicycler accepts short (Illumina) reads in FASTQ format. Galaxy additionally requires these to be in FASTQ format with Sanger-encoded quality scores. Long reads (from Oxford Nanopore or PacBio) can be in either FASTQ or FASTA format.

The input options are:

-1 SHORT1, --short1 SHORT1
    FASTQ file of short reads (first reads in each pair)
-2 SHORT2, --short2 SHORT2
    FASTQ file of short reads (second reads in each pair)
-s SHORT_UNPAIRED, --short_unpaired SHORT_UNPAIRED
    FASTQ file of unpaired short reads
-l LONG, --long LONG
    FASTQ or FASTA file of long reads, if all reads are available at start.
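
As an illustration of how the input options combine, the following Python sketch assembles an argument list from them. Only the flags listed above are used; the function name and the file names are hypothetical, and anything else a real run needs (Galaxy supplies those arguments itself) is deliberately left out.

```python
from typing import List, Optional


def build_unicycler_command(
    short1: Optional[str] = None,
    short2: Optional[str] = None,
    short_unpaired: Optional[str] = None,
    long_reads: Optional[str] = None,
) -> List[str]:
    """Build an argument list from the input options described above.

    Hypothetical helper for illustration; it is not part of Unicycler.
    """
    # Paired reads only make sense when both files are given.
    if (short1 is None) != (short2 is None):
        raise ValueError("paired short reads need both -1 and -2")

    cmd = ["unicycler"]
    if short1 is not None and short2 is not None:
        cmd += ["-1", short1, "-2", short2]
    if short_unpaired is not None:
        cmd += ["-s", short_unpaired]
    if long_reads is not None:
        cmd += ["-l", long_reads]
    return cmd


# Placeholder file names, not real datasets.
print(" ".join(build_unicycler_command("reads_1.fastq", "reads_2.fastq",
                                       long_reads="long.fastq")))
```

The list form can be handed to a process launcher without shell quoting problems.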

Bridging mode

Unicycler can be run in three modes: conservative, normal (the default) and bold, set with the --mode option. Conservative mode is least likely to produce a complete assembly but has a very low risk of misassembly. Bold mode is most likely to produce a complete assembly but carries greater risk of misassembly. Normal mode is intermediate regarding both completeness and misassembly risk. See description of modes for more information.

The available modes are:

--mode {conservative,normal,bold}
    Bridging mode (default: normal)
    conservative = smaller contigs, lowest misassembly rate
    normal = moderate contig size and misassembly rate
    bold = longest contigs, higher misassembly rate

Skip SPAdes error correction step

Sequencing data contain a substantial number of sequencing errors that manifest themselves as deviations (bulges and non-connected components) within the assembly graph. One way to improve the graph even before constructing it is to reduce the number of sequencing errors by performing error correction. SPAdes, which is used by Unicycler for error correction and assembly, uses BayesHammer to correct the reads. Here is a brief summary of what it does:

SPAdes (or rather BayesHammer) counts k-mers in reads and computes k-mer statistics that take base quality values into account.

A Hamming graph is constructed in which k-mers are nodes. In this graph, edges connect nodes (k-mers) if they differ from each other by a number of nucleotides up to a certain threshold (the Hamming distance). The graph is central to the error correction algorithm.

Bayesian subclustering is then performed on the graph produced in the previous step. For each k-mer we now know the center of its subcluster.

Solid k-mers are derived from cluster centers and are assumed to be error free.

Solid k-mers are mapped back to the reads and used to correct them.

This step takes considerable time, so it can be skipped if one needs to evaluate assemblies quickly. However, skipping it is not recommended when trying to produce a final high-quality assembly.
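
The first two steps of the summary above (k-mer counting and the Hamming graph) can be sketched in a few lines of Python. This is a toy illustration only: it ignores base qualities, compares all k-mer pairs by brute force, and does not attempt the Bayesian subclustering, so it should not be read as BayesHammer's implementation. The reads and the function names are made up for the example.

```python
from collections import Counter
from itertools import combinations


def count_kmers(reads, k):
    """Count every k-mer that occurs in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def hamming_graph(kmers, max_distance):
    """Connect k-mers that differ at no more than max_distance positions."""
    graph = {kmer: set() for kmer in kmers}
    for a, b in combinations(kmers, 2):
        if hamming_distance(a, b) <= max_distance:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Three identical reads and one read with a single substitution (G -> C).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACCTAC"]
counts = count_kmers(reads, k=4)
graph = hamming_graph(list(counts), max_distance=1)

for kmer in sorted(graph):
    print(kmer, counts[kmer], sorted(graph[kmer]))
```

In this toy data each rare k-mer from the erroneous read ends up adjacent to a frequent k-mer that differs from it by one base, which is the kind of neighbourhood the clustering step then works on.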

Do not rotate completed replicons to start at a standard gene

Unicycler uses TBLASTN to search for dnaA or repA alleles in each completed replicon. If one is found, the sequence is rotated and/or flipped so that it begins with that gene encoded on the forward strand. This provides consistently oriented assemblies and reduces the risk that a gene will be split across the start and end of the sequence.

The following option turns rotation off:

--no_rotate
    Do not rotate completed replicons
    to start at a standard gene
    (default: completed replicons are rotated)
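
To make "rotated and/or flipped" concrete, here is a minimal Python sketch of the operation on a circular sequence. It assumes the position and strand of the start gene are already known (in Unicycler they come from the TBLASTN search); the function names and the coordinate convention are choices made for this example, not Unicycler's code.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def rotate_to_gene(seq, gene_first_base, on_reverse_strand=False):
    """Rotate (and flip if needed) a circular sequence to start at a gene.

    gene_first_base is the 0-based coordinate, on the given sequence, of the
    base that holds or pairs with the first base of the gene. This convention
    is an assumption of the sketch.
    """
    if not 0 <= gene_first_base < len(seq):
        raise ValueError("gene_first_base is outside the sequence")
    if on_reverse_strand:
        # Flip the replicon so the gene is encoded on the forward strand,
        # then translate the coordinate into the flipped sequence.
        seq = reverse_complement(seq)
        gene_first_base = len(seq) - 1 - gene_first_base
    # Rotation of a circular sequence: move the prefix to the end.
    return seq[gene_first_base:] + seq[:gene_first_base]


# "ATG" stands in for the start gene in both toy replicons.
print(rotate_to_gene("GGGATGCCC", 3))
print(rotate_to_gene("GGGCATCCC", 5, on_reverse_strand=True))
```

Both calls yield a sequence that begins with the stand-in gene on the forward strand, so the two replicons become directly comparable.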

Do not use Pilon to polish the final assembly

Pilon is a tool for improving the overall quality of draft assemblies and finding variation among strains. Unicycler uses it for assembly polishing.

The following option turns off the Pilon step of the Unicycler pipeline:

--no_pilon
    Do not use Pilon to polish the
    final assembly (default: Pilon is used)

Expected number of linear sequences

If you expect your sample to contain linear (non-circular) sequences, set this option:

--linear_seqs EXPECTED_LINEAR_SEQS
    The expected number of linear (i.e. non-circular)
    sequences in the underlying sequence

SPAdes options

This section provides control of SPAdes options:

--min_kmer_frac MIN_KMER_FRAC
    Lowest k-mer size for SPAdes assembly,
    expressed as a fraction of the read length
    (default: 0.2)
--max_kmer_frac MAX_KMER_FRAC
    Highest k-mer size for SPAdes assembly,
    expressed as a fraction of the read length
    (default: 0.95)
--kmer_count KMER_COUNT
    Number of k-mer steps to use in
    SPAdes assembly (default: 10)
--depth_filter DEPTH_FILTER
    Filter out contigs lower than this fraction
    of the chromosomal depth, if doing so does
    not result in graph dead ends (default: 0.25)
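
The three k-mer options are easier to picture with numbers. The Python sketch below turns a read length into a list of k-mer sizes by spacing them evenly between the two fractions. It is an approximation written for this page, not Unicycler's actual selection logic; the only outside facts it relies on are that SPAdes needs odd k-mer sizes and accepts none larger than 127.

```python
def kmer_sizes(read_length, min_frac=0.2, max_frac=0.95, count=10, k_limit=127):
    """Pick up to `count` odd k-mer sizes between the two fractions.

    Even spacing is an assumption of this sketch.
    """
    lowest = int(read_length * min_frac)
    highest = min(int(read_length * max_frac), k_limit)
    # SPAdes requires odd k-mer sizes: move inwards to the nearest odd value.
    if lowest % 2 == 0:
        lowest += 1
    if highest % 2 == 0:
        highest -= 1
    if count < 2 or highest <= lowest:
        return [lowest]

    step = (highest - lowest) / (count - 1)
    sizes = set()
    for i in range(count):
        k = int(round(lowest + i * step))
        if k % 2 == 0:
            k -= 1  # stays >= lowest because lowest is odd
        sizes.add(k)
    return sorted(sizes)


# Defaults from the option list above, for 150 bp reads.
print(kmer_sizes(150))
```

With short reads the two bounds move closer together, so fewer distinct sizes may come back than `count` asks for.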

Rotation options

Unicycler attempts to rotate circular assemblies to make sure that they begin at a consistent starting gene. The following parameters control assembly rotation:

--start_genes START_GENES
    FASTA file of genes for start point
    of rotated replicons
    (default: start_genes.fasta)
--start_gene_id START_GENE_ID
    The minimum required BLAST percent identity
    for a start gene search
    (default: 90.0)
--start_gene_cov START_GENE_COV
    The minimum required BLAST percent coverage
    for a start gene search
    (default: 95.0)
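
The two thresholds act together: a start gene hit is used only if it meets both. A minimal Python sketch, assuming a hypothetical `StartGeneHit` record whose identity and coverage are already expressed as percentages (the gene names and values are invented):

```python
from dataclasses import dataclass


@dataclass
class StartGeneHit:
    """Hypothetical record for one BLAST hit; not a Unicycler data type."""
    gene: str
    identity: float  # percent identity of the hit
    coverage: float  # percent of the start gene covered by the hit


def passes_thresholds(hit, start_gene_id=90.0, start_gene_cov=95.0):
    """True if the hit meets both the identity and the coverage threshold."""
    return hit.identity >= start_gene_id and hit.coverage >= start_gene_cov


hits = [
    StartGeneHit("dnaA_example", identity=98.2, coverage=100.0),
    StartGeneHit("repA_example", identity=93.0, coverage=80.0),
]
usable = [hit.gene for hit in hits if passes_thresholds(hit)]
print(usable)
```

Lowering either value admits more distant or more partial matches as rotation start points.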

Graph cleaning options

These options control the removal of small leftover sequences after bridging is complete:

--min_component_size MIN_COMPONENT_SIZE
    Unbridged graph components smaller
    than this size (bp) will be removed
    from the final graph (default: 1000)
--min_dead_end_size MIN_DEAD_END_SIZE
    Graph dead ends smaller than this size (bp)
    will be removed from the final graph
    (default: 1000)
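
The effect of --min_component_size can be sketched on a toy graph. The Python example below finds connected components and reports those whose total length falls under the threshold. Measuring a component by the summed length of its segments is an assumption of this sketch, and the bridged/unbridged distinction and dead-end trimming are not modelled; the segment names and lengths are made up.

```python
def small_components(lengths, links, min_component_size=1000):
    """Return the segments belonging to components shorter than the threshold.

    lengths: dict mapping segment name -> length in bp
    links:   iterable of (segment, segment) pairs that are connected
    """
    neighbours = {name: set() for name in lengths}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    to_remove = set()
    for start in lengths:
        if start in seen:
            continue
        # Depth-first search to collect one connected component.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        if sum(lengths[name] for name in component) < min_component_size:
            to_remove |= component
    return to_remove


lengths = {"1": 5000, "2": 3000, "3": 400, "4": 300}
links = [("1", "2"), ("3", "4")]
print(sorted(small_components(lengths, links)))
```

Here segments 3 and 4 form a 700 bp component and are flagged, while the 8000 bp component is kept.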

Long read alignment options

These options control the alignment of long reads to the assembly graph:

--contamination CONTAMINATION
    FASTA file of known contamination in long reads
--scores SCORES
    Comma-delimited string of alignment scores:
    match, mismatch, gap open, gap extend
    (default: 3,-6,-5,-2)
--low_score LOW_SCORE
    Score threshold - alignments below this
    are considered poor
    (default: set threshold automatically)
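
The --scores string packs four values into one argument. The Python sketch below unpacks it and combines the values into a raw alignment score from event counts. The `Scores` type and both helper functions are hypothetical, and how Unicycler itself counts gap opens versus extensions is not specified here, so the caller supplies those counts directly.

```python
from typing import NamedTuple


class Scores(NamedTuple):
    """The four values of --scores, in the documented order."""
    match: int
    mismatch: int
    gap_open: int
    gap_extend: int


def parse_scores(text="3,-6,-5,-2"):
    """Split a comma-delimited scores string into its four integers."""
    parts = text.split(",")
    if len(parts) != 4:
        raise ValueError("expected four comma-separated values")
    return Scores(*(int(part.strip()) for part in parts))


def raw_score(scores, matches, mismatches=0, gap_opens=0, gap_extensions=0):
    """Sum the contribution of each kind of alignment event."""
    return (matches * scores.match
            + mismatches * scores.mismatch
            + gap_opens * scores.gap_open
            + gap_extensions * scores.gap_extend)


scores = parse_scores()  # the default from the option list above
print(scores)
print(raw_score(scores, matches=10, mismatches=1))
```

With the default values a single mismatch cancels two matches, which is why error-rich long reads produce low scores.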

Outputs

Galaxy's wrapper for Unicycler produces two outputs:

final assembly in FASTA format

final assembly graph in graph format

While most will likely be interested in the FASTA dataset, the graph dataset is also quite useful and can be visualized using tools such as Bandage.