RefSeq Masher Matches (version 0.1.2)

RefSeq Masher - Genomic Distance

Find what NCBI RefSeq genomes most closely match your sequence data using Mash with a Mash sketch database of 54,925 NCBI RefSeq Genomes.

Source code is available on GitHub at github.com/phac-nml/refseq_masher

matches - find the NCBI RefSeq genomes that most closely match your input sequences

Command-line usage information:

Usage: refseq_masher matches [OPTIONS] INPUT...

  Find NCBI RefSeq genome matches for an input genome fasta file

  Input is expected to be one or more FASTA/FASTQ files or one or more
  directories containing FASTA/FASTQ files. Files can be Gzipped.

Options:
  --mash-bin TEXT                 Mash binary path (default="mash")
  -o, --output PATH               Output file path (default="-"/stdout)
  --output-type [tab|csv]         Output file type (tab|csv)
  -n, --top-n-results INTEGER     Output top N results sorted by distance in
                                  ascending order (default=5)
  -m, --min-kmer-threshold INTEGER
                                  Mash sketch of reads: "Minimum copies of
                                  each k-mer required to pass noise filter for
                                  reads" (default=8)
  -h, --help                      Show this message and exit.
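
As a rough illustration of how these options combine, the following Python sketch assembles a `refseq_masher matches` argument list. The helper name `build_matches_command` and its parameter names are hypothetical; only the flags themselves come from the usage text above. The helper builds the command without running it, and it omits any option that is not set so that the tool's own defaults apply.

```python
def build_matches_command(inputs, output=None, output_type=None,
                          top_n=None, min_kmer_threshold=None):
    """Return an argument list for `refseq_masher matches`.

    Hypothetical helper for illustration; the flags are taken from the
    usage text, everything else is an assumption.
    """
    if not inputs:
        raise ValueError("at least one input file or directory is required")
    if output_type is not None and output_type not in ("tab", "csv"):
        raise ValueError("output_type must be 'tab' or 'csv'")

    cmd = ["refseq_masher", "matches"]
    if output is not None:
        cmd += ["--output", str(output)]
    if output_type is not None:
        cmd += ["--output-type", output_type]
    if top_n is not None:
        cmd += ["--top-n-results", str(top_n)]
    if min_kmer_threshold is not None:
        cmd += ["--min-kmer-threshold", str(min_kmer_threshold)]
    # Positional INPUT... arguments go last.
    cmd += [str(path) for path in inputs]
    return cmd


cmd = build_matches_command(
    ["GCF_000329025.1_ASM32902v1_genomic.fna.gz"],
    output="results.csv",
    output_type="csv",
    top_n=10,
)
print(" ".join(cmd))
```

On a machine where refseq_masher and Mash are installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`.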

Example

With the gzipped FASTA file (.fna.gz) for Salmonella enterica subsp. enterica serovar Enteritidis str. CHS44:

# download sequence file
wget ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/329/025/GCF_000329025.1_ASM32902v1/GCF_000329025.1_ASM32902v1_genomic.fna.gz

# find RefSeq matches
refseq_masher -vv matches GCF_000329025.1_ASM32902v1_genomic.fna.gz

Log:

2018-01-29 11:02:13,786 INFO: Collected 1 FASTA inputs and 0 read sets [in ...refseq_masher/refseq_masher/utils.py:185]
2018-01-29 11:02:13,786 INFO: Creating Mash sketch file for ...refseq_masher/GCF_000329025.1_ASM32902v1_genomic.fna.gz [in ...refseq_masher/refseq_masher/mash/sketch.py:24]
2018-01-29 11:02:14,055 INFO: Created Mash sketch file at "/tmp/GCF_000329025.1_ASM32902v1_genomic.msh" [in ...refseq_masher/refseq_masher/mash/sketch.py:40]
2018-01-29 11:02:14,613 INFO: Ran Mash dist successfully (output length=11647035). Parsing Mash dist output [in ...refseq_masher/refseq_masher/mash/dist.py:64]
2018-01-29 11:02:15,320 INFO: Parsed Mash dist output into Pandas DataFrame with 54924 rows [in ...refseq_masher/refseq_masher/mash/dist.py:67]
2018-01-29 11:02:15,321 INFO: Deleting temporary sketch file "/tmp/GCF_000329025.1_ASM32902v1_genomic.msh" [in ...refseq_masher/refseq_masher/mash/dist.py:72]
2018-01-29 11:02:15,321 INFO: Sketch file "/tmp/GCF_000329025.1_ASM32902v1_genomic.msh" deleted! [in ...refseq_masher/refseq_masher/mash/dist.py:74]
2018-01-29 11:02:15,322 INFO: Ran Mash dist on all input. Merging NCBI taxonomic information into results output. [in ...refseq_masher/refseq_masher/cli.py:88]
2018-01-29 11:02:15,323 INFO: Fetching all taxonomy info for 5 unique NCBI Taxonomy UIDs [in ...refseq_masher/refseq_masher/taxonomy.py:35]
2018-01-29 11:02:15,325 INFO: Dropping columns with all NA values (ncol=32) [in ...refseq_masher/refseq_masher/taxonomy.py:38]
2018-01-29 11:02:15,327 INFO: Columns with all NA values dropped (ncol=11) [in ...refseq_masher/refseq_masher/taxonomy.py:40]
2018-01-29 11:02:15,327 INFO: Merging Mash results with relevant taxonomic information [in ...refseq_masher/refseq_masher/taxonomy.py:41]
2018-01-29 11:02:15,329 INFO: Merged Mash results with taxonomy info [in ...refseq_masher/refseq_masher/taxonomy.py:43]
2018-01-29 11:02:15,329 INFO: Merged taxonomic info into results output [in ...refseq_masher/refseq_masher/cli.py:90]
2018-01-29 11:02:15,329 INFO: Reordering output columns [in ...refseq_masher/refseq_masher/cli.py:91]
2018-01-29 11:02:15,331 INFO: Writing output to stdout [in ...refseq_masher/refseq_masher/writers.py:16]

Output

One result row, shown as column name and value:

sample                  GCF_000329025.1_ASM32902v1_genomic
top_taxonomy_name       Salmonella enterica subsp. enterica serovar Enteritidis str. CHS44
distance                0.0
pvalue                  0.0
matching                400/400
full_taxonomy           Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales; Enterobacteriaceae; Salmonella; enterica; subsp. enterica; serovar Enteritidis; str. CHS44
taxonomic_subspecies    Salmonella enterica subsp. enterica
taxonomic_species       Salmonella enterica
taxonomic_genus         Salmonella
taxonomic_family        Enterobacteriaceae
taxonomic_order         Enterobacterales
taxonomic_class         Gammaproteobacteria
taxonomic_phylum        Proteobacteria
taxonomic_superkingdom  Bacteria
subspecies              enterica
serovar                 Enteritidis
plasmid                 (empty)
bioproject              PRJNA185053
biosample               SAMN01041154
taxid                   702979
assembly_accession      NZ_ALFF
match_id                ./rcn/refseq-NZ-702979-PRJNA185053-SAMN01041154-NZ_ALFF-.-Salmonella_enterica_subsp._enterica_serovar_Enteritidis_str._CHS44.fna
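
For downstream scripting, the results can be read back with Python's standard `csv` module. The sketch below assumes the output was written as tab-delimited text with a header row (as `--output-type tab` suggests); the function name `read_matches` is hypothetical, and the sample string keeps only a subset of the columns for brevity.

```python
import csv
import io


def read_matches(handle, delimiter="\t"):
    """Parse delimited refseq_masher results into a list of dicts.

    Hypothetical helper: converts the distance and pvalue columns to
    floats and leaves every other column as a string.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        row["distance"] = float(row["distance"])
        row["pvalue"] = float(row["pvalue"])
        rows.append(row)
    return rows


# Abbreviated sample built from the example row (subset of columns).
sample_output = (
    "sample\ttop_taxonomy_name\tdistance\tpvalue\tmatching\ttaxid\n"
    "GCF_000329025.1_ASM32902v1_genomic\t"
    "Salmonella enterica subsp. enterica serovar Enteritidis str. CHS44\t"
    "0.0\t0.0\t400/400\t702979\n"
)

rows = read_matches(io.StringIO(sample_output))
# The closest match is the row with the smallest distance.
best = min(rows, key=lambda r: r["distance"])
print(best["top_taxonomy_name"], best["distance"], best["matching"])
```

For CSV output, the same function could be called with `delimiter=","`.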

The top match is Salmonella enterica subsp. enterica serovar Enteritidis str. CHS44, with a distance of 0.0 and 400/400 sketch hashes matching. This is the expected result, because the input is the RefSeq assembly for that strain. The results table also includes the full taxonomic lineage, BioProject, BioSample, taxid, and assembly accession for each match.
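
To see why 400/400 matching corresponds to a distance of 0.0, it may help to recall the Mash distance formula, D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard estimate (matching hashes divided by sketch size) and k is the k-mer size. The sketch below is an illustration, not refseq_masher code: the function names are hypothetical, and the default k=21 is Mash's own default k-mer size, which is not necessarily the value used for the RefSeq sketch database. With every hash matching, j = 1 and the distance is 0 for any k.

```python
import math


def parse_matching(matching):
    """Split a 'shared/total' string such as '400/400' into two ints."""
    shared, total = matching.split("/")
    return int(shared), int(total)


def mash_distance(shared, sketch_size, k=21):
    """Mash distance from a matching-hash count (illustrative sketch).

    k is an assumption here; pass the k-mer size of the sketches
    actually being compared.
    """
    if sketch_size <= 0 or not 0 <= shared <= sketch_size:
        raise ValueError("expected 0 <= shared <= sketch_size, sketch_size > 0")
    if shared == sketch_size:
        return 0.0  # identical sketches
    if shared == 0:
        return 1.0  # no shared hashes: capped at 1 instead of infinity
    jaccard = shared / sketch_size
    return -math.log(2 * jaccard / (1 + jaccard)) / k


shared, total = parse_matching("400/400")
print(mash_distance(shared, total))  # prints 0.0
```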

Contact

Gary van Domselaar: gary.vandomselaar@phac-aspc.gc.ca