
Flye (version 2.9.3+galaxy0)

Input reads:

Mode:

Number of polishing iterations:

Polishing is performed as the final assembly stage. By default, Flye runs one polishing iteration. Additional iterations might correct a small number of extra errors, because reads may align better to the corrected assembly. If the parameter is set to 0, polishing is not performed.

Minimum overlap between reads:

This sets the minimum overlap length for two reads to be considered overlapping. By default it is chosen automatically based on the read length distribution (reads N90) and does not require manual setting. Typical values are 3k-5k (down to 1k for datasets with shorter reads). Intuitively, we want to set this parameter as high as possible, so that the repeat graph is less tangled. However, higher values might lead to assembly gaps. In some rare cases it makes sense to manually increase the minimum overlap for assemblies of big genomes with long reads and high coverage.
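To make the "reads N90" statistic concrete, here is a minimal Python sketch of how an N90 can be computed from a list of read lengths. It only illustrates the statistic; how Flye derives the final minimum overlap from it is not described here, and the function name `read_n90` is made up for this example.

```python
def read_n90(lengths):
    """Return the N90 of a collection of read lengths.

    N90 is the length L such that reads of length >= L together
    contain at least 90% of all sequenced bases.
    """
    if not lengths:
        raise ValueError("no read lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the 90% mark.
        if running * 10 >= total * 9:
            return length


print(read_n90([10, 8, 6, 4, 2]))  # 4
```

In the toy input, the reads of length 10, 8, 6 and 4 cover 28 of 30 bases, which is the first point at which the 90% mark (27 bases) is passed.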

Keep haplotypes:

By default, Flye collapses graph structures caused by alternative haplotypes (bubbles, superbubbles, roundabouts) to produce longer consensus contigs. This option retains the alternative paths on the graph, producing a less contiguous but more detailed assembly.

Enable scaffolding using graph:

Starting from version 2.9, Flye does not perform scaffolding by default, which guarantees that the assembled sequences do not contain any gaps.

Perform metagenomic assembly:

The metagenome mode is designed for highly non-uniform coverage and is sensitive to underrepresented sequences at low coverage (as low as 2x). In some examples of simple metagenomes, we observed that the normal mode assembled more contiguous bacterial consensus sequences, while the metagenome mode was slightly more fragmented but revealed strain mixtures.

Reduced contig assembly coverage:

Typically, assemblies of large genomes at high coverage require hundreds of GB of RAM. For high coverage assemblies, you can reduce memory usage by using only a subset of the longest reads for the initial contig extension stage (usually the memory bottleneck).
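The idea of "a subset of the longest reads" can be sketched as follows: sort reads by length and keep them until the selected bases reach the target coverage times the genome size. This is an illustrative sketch only, not Flye's implementation, and `longest_read_subset` is a hypothetical helper name.

```python
def longest_read_subset(lengths, genome_size, target_coverage):
    """Pick the longest reads until target_coverage x genome_size bases are reached."""
    target_bases = genome_size * target_coverage
    chosen, total = [], 0
    for length in sorted(lengths, reverse=True):
        if total >= target_bases:
            break
        chosen.append(length)
        total += length
    return chosen


# Toy numbers: a 1 kb "genome" at 15x needs 15,000 bases.
reads = [9000, 1000, 7000, 3000, 5000]
print(longest_read_subset(reads, genome_size=1000, target_coverage=15))  # [9000, 7000]
```

Only the selected reads would take part in the memory-hungry extension stage.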

Remove all non-primary contigs from the assembly:

Generate a log file:

Purpose

Flye is a de novo assembler for single molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PacBio/ONT reads as input and outputs polished contigs. Flye also has a special mode for metagenome assembly.

Quick usage

Input reads can be in FASTA or FASTQ format, uncompressed or compressed with gz. Currently, PacBio (raw, corrected, HiFi) and ONT reads (raw, corrected) are supported. Expected error rates are <30% for raw, <3% for corrected, and <1% for HiFi. Note that Flye was primarily developed to run on raw reads. You may specify multiple files with reads (separated by spaces). Mixing different read types is not yet supported. The --meta option enables the mode for metagenome/uneven coverage assembly.

A genome size estimate is no longer a required option. You only need to provide an estimate when using the --asm-coverage option.

To reduce memory consumption for large genome assemblies, you can use a subset of the longest reads for initial disjointig assembly by specifying --asm-coverage and --genome-size options. Typically, 40x coverage is enough to produce good disjointigs.
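As a rough illustration of how these options combine on the command line (Galaxy builds the actual command for you), the Python sketch below assembles a Flye invocation as an argument list. The helper `build_flye_command` and the file names are hypothetical. `--meta`, `--asm-coverage` and `--genome-size` come from the text above, while `--nano-raw` and `--out-dir` are assumed to match the Flye command-line interface of the installed version.

```python
import shlex


def build_flye_command(reads, out_dir, read_type="--nano-raw", meta=False,
                       genome_size=None, asm_coverage=None):
    """Return a Flye command as a list of arguments (nothing is executed)."""
    # --asm-coverage only makes sense together with a genome size estimate.
    if asm_coverage is not None and genome_size is None:
        raise ValueError("--asm-coverage requires a genome size estimate")
    cmd = ["flye", read_type, *reads, "--out-dir", out_dir]
    if genome_size is not None:
        cmd += ["--genome-size", str(genome_size)]
    if asm_coverage is not None:
        cmd += ["--asm-coverage", str(asm_coverage)]
    if meta:
        cmd.append("--meta")
    return cmd


cmd = build_flye_command(["reads.fastq.gz"], "flye_out",
                         genome_size="5m", asm_coverage=40)
print(shlex.join(cmd))
# flye --nano-raw reads.fastq.gz --out-dir flye_out --genome-size 5m --asm-coverage 40
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting issues.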

Outputs

The main output files are:

Final assembly: contains contigs and possibly scaffolds (see below).
Final repeat graph: note that the edge sequences might be different (shorter) than contig sequences, because contigs might include multiple graph edges.
Extra information about contigs (such as length or coverage).

Each contig is formed by a single unique graph edge. If possible, unique contigs are extended with the sequence from flanking unresolved repeats on the graph. Thus, a contig fully contains the corresponding graph edge (with the same id), but might be longer than this edge. This is somewhat similar to the unitig-contig relation in OLC assemblers. In the rare case when a repetitive graph edge is not covered by the set of "extended" contigs, it will also be output in the assembly file.

Sometimes it is possible to further order contigs into scaffolds based on the repeat graph structure. These ordered contigs will be output as part of a scaffold in the assembly file (with a scaffold prefix). Since it is hard to give a reliable estimate of the gap size, those gaps are represented with the default 100 Ns. The assembly_info.txt file (below) contains additional information about how scaffolds were formed.
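Because scaffold gaps appear as runs of N in the assembly sequence, they can be located with a simple scan. The following Python sketch is an illustration with hypothetical helper names; it treats any run of N as a gap and does not check that the run is exactly 100 bases long.

```python
import re


def scaffold_gaps(sequence):
    """Return (start, end) half-open coordinates of every run of N."""
    return [m.span() for m in re.finditer(r"N+", sequence.upper())]


def split_scaffold(sequence):
    """Split a scaffold into its gap-free pieces."""
    return [part for part in re.split(r"N+", sequence.upper()) if part]


scaffold = "ACGT" + "N" * 100 + "GGCC"
print(scaffold_gaps(scaffold))   # [(4, 104)]
print(split_scaffold(scaffold))  # ['ACGT', 'GGCC']
```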

Extra information about contigs/scaffolds is output into the assembly_info.txt file. It is a tab-delimited table with the columns as follows:

Contig/scaffold id
Length
Coverage
Is circular, (Y)es or (N)o
Is repetitive, (Y)es or (N)o
Multiplicity (based on coverage)
Alternative group
Graph path (graph path corresponding to this contig/scaffold).

In the graph path, scaffold gaps are marked with ?? symbols, and the * symbol denotes a terminal graph node. Alternative contigs (representing alternative haplotypes) will have the same alt. group ID. Primary contigs are marked by * in the alternative group column.
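The table described above is straightforward to load programmatically. The sketch below parses it with Python's `csv` module. It assumes the columns appear in the order listed, that any header line starts with `#`, and that graph path elements are comma-separated. The column keys, the function name and the two sample rows are invented for illustration.

```python
import csv
import io

# Keys chosen for this example; they follow the column order listed above.
COLUMNS = ["seq_name", "length", "coverage", "circular",
           "repeat", "multiplicity", "alt_group", "graph_path"]


def parse_assembly_info(handle):
    """Parse a tab-delimited assembly_info.txt into a list of dicts."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and the assumed '#' header line
        row = dict(zip(COLUMNS, fields))
        row["length"] = int(row["length"])
        row["coverage"] = float(row["coverage"])
        row["multiplicity"] = int(row["multiplicity"])
        row["circular"] = row["circular"] == "Y"
        row["repeat"] = row["repeat"] == "Y"
        row["graph_path"] = row["graph_path"].split(",")
        row["has_gap"] = "??" in row["graph_path"]  # scaffold gap marker
        rows.append(row)
    return rows


# Made-up sample content, for illustration only.
sample = (
    "#seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path\n"
    "contig_1\t4600000\t52\tY\tN\t1\t*\t1\n"
    "scaffold_2\t120100\t48\tN\tN\t1\t*\t*,2,??,-3,*\n"
)
for row in parse_assembly_info(io.StringIO(sample)):
    print(row["seq_name"], row["length"], row["circular"], row["has_gap"])
```

For a real run, pass an open file handle for the "Extra information about contigs" output instead of the `io.StringIO` sample.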

Algorithm Description

This is a brief description of the Flye algorithm. Please refer to the manuscript for more detailed information. The draft contig extension is organized as follows:

K-mer counting / erroneous k-mer pre-filtering
Solid k-mer selection (k-mers with sufficient frequency, which are unlikely to be erroneous)
Contig extension. The algorithm starts from a single read and extends it with the next overlapping read (overlaps are dynamically detected using the selected solid k-mers).
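The first two steps can be illustrated with a toy Python sketch: count every k-mer in the reads, then keep those seen at least a minimum number of times. This is a simplified illustration and not Flye's implementation; the value of k, the frequency threshold and the function names are arbitrary choices for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, min_count):
    """Keep k-mers frequent enough to be unlikely sequencing errors."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


reads = ["ACGTAC", "CGTACG", "ACGTTC"]
counts = count_kmers(reads, k=4)
print(sorted(solid_kmers(counts, min_count=2)))  # ['ACGT', 'CGTA', 'GTAC']
```

K-mers seen only once (such as CGTT and GTTC in the third read) are dropped, which mimics filtering out k-mers that likely contain errors.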

Note that we do not attempt to resolve repeats at this stage, thus the reconstructed contigs might contain misassemblies. Flye then aligns the reads on these draft contigs using minimap2 and calls a consensus. Afterwards, Flye performs repeat analysis as follows:

Repeat graph is constructed from the (possibly misassembled) contigs
In this graph all repeats longer than minimum overlap are collapsed
The algorithm resolves repeats using the read information and graph structure
The unbranching paths in the graph are output as contigs

If enabled, after resolving bridged repeats, the Trestle module attempts to resolve simple unbridged repeats (of multiplicity 2) using the heterogeneities between repeat copies. Finally, Flye polishes the resulting assembly to correct the remaining errors:

Alignment of all reads to the current assembly using minimap2
Partition the alignment into mini-alignments (bubbles)
Error correction of each bubble using a maximum likelihood approach

The polishing steps could be repeated, which might slightly increase quality for some datasets.