
What it does

This tool extracts heterozygous kmer pairs from kmer count databases and performs gymnastics with them. We are able to disentangle genome structure by comparing the sum of kmer pair coverages (CovA + CovB) to their relative coverage (CovB / (CovA + CovB)). Such an approach also allows us to analyze obscure genomes with duplications, various ploidy levels, etc.
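The two quantities compared above can be sketched in a few lines of Python. This is only an illustration, not the tool's own code: the function name `smudge_coordinates` and the example coverages are hypothetical, and the sketch assumes that CovB denotes the less-covered kmer of each pair, so that the relative coverage never exceeds 0.5.

```python
def smudge_coordinates(cov_a, cov_b):
    """Return (CovA + CovB, CovB / (CovA + CovB)) for one heterozygous kmer pair.

    Assumption: CovB is taken to be the smaller of the two coverages,
    so the returned ratio lies in the interval (0, 0.5].
    """
    if cov_a <= 0 or cov_b <= 0:
        raise ValueError("kmer coverages must be positive")
    minor, major = sorted((cov_a, cov_b))
    total = minor + major
    return total, minor / total


# Hypothetical kmer pair coverages, for illustration only.
example_pairs = [(50, 50), (100, 50), (150, 50)]
for cov_a, cov_b in example_pairs:
    total, ratio = smudge_coordinates(cov_a, cov_b)
    print(f"CovA={cov_a} CovB={cov_b} sum={total} relative={ratio:.3f}")
```

Under these assumptions, a pair with equal coverages gives a relative coverage of 1/2, as expected for an AB structure, while a pair whose kmers differ twofold in coverage gives 1/3, as expected for AAB.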

Smudgeplots are computed from raw or, even better, trimmed reads and show the haplotype structure using heterozygous kmer pairs. For example:

Every haplotype structure has a unique smudge on the graph and the heat of the smudge indicates how frequently the haplotype structure is represented in the genome compared to the other structures. The image above is an ideal case, where the sequencing coverage is sufficient to beautifully separate all the smudges, providing very strong and clear evidence of triploidy.

Please see Smudgeplot on GitHub for further documentation and tutorials.

Inputs

You have two choices when running Smudgeplot in Galaxy:

Input reads file(s) for default kmer-counting with Jellyfish

This should be at least one file that provides coverage of your genome of interest. The tool accepts compressed (.gz) inputs. With this option, you can optionally specify manual cutoff values for the kmer dump step. The Smudgeplot docs suggest running GenomeScope on a kmer histogram to choose reasonable lower and upper cutoff values.

Input your own kmer dump file for more control of kmer counting parameters

This file is created by running jellyfish count followed by jellyfish dump; the process is well described on GitHub.

Outputs

smudgeplot.png            smudgeplot image
smudgeplot_log10.png      smudgeplot with log scale
my_genome_summary.tsv     summarized genome statistics
my_genome_verbose.txt     detailed genome statistics
my_genome_warnings.txt    warnings emitted from the Smudgeplot tool

Default operation

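If you choose reads as the input, a default kmer counting procedure is used to create a kmer dump. This default process is summarized as follows (<reads> and <hash_size> are placeholders for the input read file(s) and the Jellyfish hash size; L and U are the lower and upper kmer cutoff values):

jellyfish count -m 21 -s <hash_size> -o counts.jf <reads>
jellyfish histo counts.jf > counts.hist
L=$(smudgeplot.py cutoff counts.hist L)
U=$(smudgeplot.py cutoff counts.hist U)
jellyfish dump -c -L $L -U $U counts.jf > dump.jf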

The kmer dump file is then used to create a smudgeplot:

smudgeplot.py hetkmers -o kmer_pairs dump.jf
smudgeplot.py plot kmer_pairs_coverages.tsv -o my_genome
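
To make the role of the lower and upper cutoffs in the dump step more concrete, the following Python sketch mimics the count filter on a few lines in the two-column "kmer count" layout that jellyfish dump -c writes. It is an illustration only, not part of the tool: the function name `filter_dump_lines`, the example kmers, and the cutoff values 10 and 500 are hypothetical.

```python
def filter_dump_lines(lines, lower, upper):
    """Yield (kmer, count) for lines whose count lies within [lower, upper].

    Each input line is expected to hold a kmer and its count separated
    by whitespace, as in the column output of `jellyfish dump -c`.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        kmer, count_text = line.split()
        count = int(count_text)
        if lower <= count <= upper:
            yield kmer, count


# Hypothetical dump lines and cutoffs, for illustration only.
example_lines = ["ACGTACGTACGTACGTACGTA 3",
                 "CCGTACGTACGTACGTACGTA 25",
                 "TTGTACGTACGTACGTACGTA 900"]
kept = list(filter_dump_lines(example_lines, lower=10, upper=500))
print(kept)
```

In this sketch the kmer seen 3 times is dropped as a likely sequencing error and the kmer seen 900 times is dropped as likely repetitive sequence, which is the kind of filtering the cutoffs are meant to achieve.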