HyPo (version 1.0.3+galaxy0)

Illumina FASTQ files:

Draft genome assembly:

BAM with Illumina read alignments:

Input file name containing the alignments of short reads against the draft (must have CIGAR information)

BAM with ONT reads aligned:

Input file name containing the alignments of long reads against the draft (must have CIGAR information). Optional: if this argument is not given, only short-read polishing is performed.

Approximate mean coverage of the short reads:

Approximate size of the genome:

A number can be followed by units k/m/g; e.g. 10m, 2.3g.

Type of short reads:

Advanced options


Purpose

HyPo - a Hybrid Polisher - uses both short and long reads within a single run to polish a long-read assembly of small or large genomes. It exploits unique genomic kmers to selectively polish segments of contigs using partial order alignment (POA) of selected read-segments. As demonstrated on human genome assemblies, HyPo generates a significantly more accurate polished assembly in about one-third of the time and with about half the memory of widely used contemporary polishers such as Racon.

Please note that "short reads" do not have to be NGS short reads; HiFi genomic reads (e.g. CCS), such as those generated by the PacBio Sequel II, can be used instead. The only requirement is that the reads are highly accurate (>98% accuracy).

Input files

HyPo requires the following as input:

Short reads/HiFi reads (in FASTA/FASTQ format; can be compressed)
Draft contigs (in FASTA/FASTQ format; can be compressed)
Alignments between short reads (or HiFi reads) and the draft (should contain CIGAR). If long reads are also to be used for polishing, then alignments between long reads and the draft are required as well.
Expected mean coverage of short reads (or HiFi reads) and approximate size of the genome.

In what follows, short reads can be replaced with HiFi reads.
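
As an illustration of the genome size notation accepted above (a number optionally followed by k/m/g, e.g. 10m or 2.3g), here is a minimal Python sketch that converts such a string into a number of bases. The function name parse_genome_size is hypothetical, and the multipliers (k = 10^3, m = 10^6, g = 10^9) are an assumption; the tool help does not state them, and this is not HyPo's own parser.

```python
import re

# Assumed multipliers (not stated in the tool help): k = 1e3, m = 1e6, g = 1e9.
_UNITS = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9}


def parse_genome_size(text: str) -> int:
    """Convert a size string such as '10m' or '2.3g' into a number of bases."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([kmg]?)\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognised genome size: {text!r}")
    number, unit = match.groups()
    # round() absorbs floating-point noise from values like 2.3 * 1e9.
    return round(float(number) * _UNITS[unit])


if __name__ == "__main__":
    for example in ("10m", "2.3g", "5000"):
        print(example, "->", parse_genome_size(example))
```

A value without a unit is treated as a plain number of bases; anything else (for instance "10x") raises ValueError.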

How it works

Broadly, we (conceptually) divide a draft (uncorrected) contig into two types of regions (segments): strong and weak.

Strong regions are those with strong evidence (support) of their correctness; they do not need polishing. Weak regions, on the other hand, are polished using partial order alignment (POA). Each weak region is polished using either short reads or long reads, with short reads taking precedence over long reads. To identify strong regions, we make use of solid kmers (expected unique genomic kmers). Strong regions also play a role in selecting the read-segments used to polish their neighbouring weak regions. Furthermore, our approach takes into account that long reads, and thus the assemblies generated from them, are prone to homopolymer errors.
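
To make the strong/weak distinction concrete, the following toy Python sketch labels the bases of a draft contig using solid kmers. It is a simplified illustration, not HyPo's implementation. It assumes a kmer is "solid" when its count in the short reads falls inside a caller-supplied window [low, high], that a base is "strong" when at least one solid kmer of the contig covers it, and it ignores reverse complements. All function names are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the given reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, low, high):
    """Assumed definition: k-mers whose count lies in [low, high]."""
    return {kmer for kmer, n in counts.items() if low <= n <= high}


def label_regions(contig, solid, k):
    """Return (label, start, end) tuples with half-open coordinates.

    A base is 'strong' if some solid k-mer of the contig covers it,
    otherwise 'weak'.
    """
    covered = [False] * len(contig)
    for i in range(len(contig) - k + 1):
        if contig[i:i + k] in solid:
            for j in range(i, i + k):
                covered[j] = True

    regions = []
    start = 0
    for pos in range(1, len(contig) + 1):
        # Close the current run at the contig end or when the label flips.
        if pos == len(contig) or covered[pos] != covered[start]:
            label = "strong" if covered[start] else "weak"
            regions.append((label, start, pos))
            start = pos
    return regions


if __name__ == "__main__":
    truth = "ACGTTGCA"
    draft = "ACGTAGCA"  # one substitution at position 4
    solid = solid_kmers(count_kmers([truth] * 3, k=3), low=2, high=4)
    print(label_regions(draft, solid, k=3))
```

In this example only the base carrying the substitution ends up in a weak region, flanked by two strong regions. In the approach described above, such weak regions are the ones passed on to POA-based polishing, with the flanking strong regions helping to select the relevant read-segments.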