
SnpEff build: (version 4.3+T.galaxy5)
For E. coli K12, for example, you may want to use 'EcK12'. Note: Spaces are not allowed in the name and will get converted to underscores.
Specify the format of the annotations you are using to create the SnpEff database
This Genbank file will be used to generate the snpEff database
This will generate an additional dataset containing all sequences from the Genbank file in FASTA format
Genbank sequences have version numbers such as B000564.2. This option removes them, leaving only B000564
If this sequence uses a non-standard genetic code, select one from these options
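
The two name-related options above (spaces in the database name, and removal of version numbers) can be pictured with a short Python sketch. It only mirrors the behavior described in this help text. The function names sanitize_database_name and strip_accession_version are hypothetical, and mapping each single space to one underscore is an assumption, since the tool's own implementation is not shown here.

```python
import re


def sanitize_database_name(name):
    # Hypothetical helper: every space becomes one underscore (assumed rule).
    return name.replace(" ", "_")


def strip_accession_version(accession):
    # Hypothetical helper: drop a trailing ".<digits>" version suffix, if any.
    return re.sub(r"\.\d+$", "", accession)
```

For instance, strip_accession_version would turn the B000564.2 example above into B000564, and leave a name without a version suffix unchanged.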

What it does

This tool uses "snpEff build -genbank" or "snpEff build -gff3" commands to create a snpEff database.
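
As a rough illustration of what those two commands expect, the sketch below assembles the file layout, configuration entry, and command line for either format. It assumes the conventions described in the SnpEff manual (a data/<genome>/ directory holding genes.gbk, or genes.gff plus sequences.fa, and a "<genome>.genome" entry in the configuration file). The function name snpeff_build_plan is hypothetical, and how the Galaxy wrapper actually stages its files is not shown in this help text.

```python
from pathlib import PurePosixPath


def snpeff_build_plan(genome, annotation_format, data_dir="data"):
    """Describe (not run) a snpEff build for one genome.

    Returns the expected input file locations, the config entry and the
    command as an argument list. Layout follows SnpEff manual conventions;
    treat it as an assumption for any particular installation.
    """
    genome_dir = PurePosixPath(data_dir) / genome
    if annotation_format == "genbank":
        # Genbank carries both annotations and sequences in one file.
        files = {"annotation": genome_dir / "genes.gbk"}
        flag = "-genbank"
    elif annotation_format == "gff3":
        # GFF3 has coordinates only, so a FASTA file is needed as well.
        files = {
            "annotation": genome_dir / "genes.gff",
            "sequence": genome_dir / "sequences.fa",
        }
        flag = "-gff3"
    else:
        raise ValueError("annotation_format must be 'genbank' or 'gff3'")
    config_line = f"{genome}.genome : {genome}"
    command = ["snpEff", "build", flag, "-v", genome]
    return files, config_line, command
```

Calling snpeff_build_plan("EcK12", "gff3") returns two input paths rather than one, which reflects the extra FASTA input required by the GFF route.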


Working with Genbank files

Using Genbank data for creating databases has several advantages:

  1. Genbank files contain annotations (such as locations of genes) together with sequences. This ensures that these two are in sync with each other.
  2. When you are analyzing small genomes (or not so small) it is much more convenient to create a database on the fly and use it.
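
To illustrate the first point, here is a minimal Python sketch that pulls the sequence out of a Genbank flat file and writes it as FASTA, which is roughly what the optional FASTA output of this tool provides. It is a simplified reader written for illustration, not the code the tool uses: it takes the record name from the VERSION line (falling back to LOCUS), reads the sequence from the ORIGIN block, and ignores everything else. The function name genbank_to_fasta and its remove_version parameter are hypothetical.

```python
import re


def genbank_to_fasta(genbank_text, remove_version=False):
    """Return FASTA text for every record in Genbank flat-file text."""
    fasta_lines = []
    name, chunks, in_origin = None, [], False
    for line in genbank_text.splitlines():
        if line.startswith("//"):
            # End of record: emit it and reset the state for the next one.
            if name is not None:
                if remove_version:
                    name = re.sub(r"\.\d+$", "", name)
                fasta_lines.append(">" + name)
                fasta_lines.append("".join(chunks))
            name, chunks, in_origin = None, [], False
        elif in_origin:
            # Sequence lines look like "        1 agcttttcat tctgactgca";
            # keep the letters only.
            chunks.append(re.sub(r"[^A-Za-z]", "", line))
        elif line.startswith("LOCUS") or line.startswith("VERSION"):
            # VERSION comes after LOCUS, so it overrides the LOCUS name.
            fields = line.split()
            if len(fields) > 1:
                name = fields[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
    return "\n".join(fasta_lines) + ("\n" if fasta_lines else "")
```

Because the FASTA names come from the same file as the annotations, the two cannot drift apart, which is the "in sync" property described above.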

SnpEff errors out on highly fragmented genomes containing multiple scaffolds. This is because a single gene may be split between multiple scaffolds, causing SnpEff to crash. If this happens, use the GFF route described below.


Genbank usage scenario

Suppose you have a series of Illumina reads from an experiment involving E. coli K-12 MG1655. You want to map these reads to the reference genome of K-12 MG1655, call variants, and annotate them using snpEff. This tool lets you carry out the following analysis steps:

  1. Go to NCBI page for K-12 MG1655 genome (note that all NCBI genomes have similar list of files associated with them).
  2. Copy URL for file with extension gbff.gz
  3. Paste the URL into upload tool and set datatype to genbank.gz.
  4. Use this tool to generate a snpEff database and FASTA sequences from the dataset you've uploaded during the previous step.
  5. Map your Illumina reads against the FASTA dataset generated in the previous step using BWA-MEM.
  6. Call variants using Freebayes.
  7. Annotate the VCF output of Freebayes with SnpEff eff using the database generated at step 4 (using the Custom option for the Genome source parameter).

In this scenario the Genbank dataset is used twice. First, it is used to produce the FASTA sequences that BWA maps the reads against. Second, it is used to create the snpEff database. This guarantees that you will not have any issues related to reference sequence naming.
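
When the reference and the database come from different sources, a naming mismatch can be checked for before annotation. The Python sketch below compares the sequence names used in a VCF file with those in a FASTA file. It is an illustrative check, not part of SnpEff or this tool, and the function names are hypothetical.

```python
def fasta_names(fasta_text):
    # The name is the first whitespace-delimited token after ">".
    return {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    }


def vcf_chromosomes(vcf_text):
    # CHROM is the first tab-separated column of every non-header line.
    return {
        line.split("\t", 1)[0]
        for line in vcf_text.splitlines()
        if line and not line.startswith("#")
    }


def unknown_chromosomes(vcf_text, fasta_text):
    # Names used in the VCF that the reference FASTA does not define.
    return sorted(vcf_chromosomes(vcf_text) - fasta_names(fasta_text))
```

An empty result means every variant refers to a known sequence. A typical non-empty result is a name that differs only by its version suffix, such as B000564 versus B000564.2.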


Working with GFF files

Alternatively, you can create a SnpEff database from GFF3 files downloaded from NCBI or any other source. Building a SnpEff database from a GFF dataset requires two inputs:

  1. The GFF file itself
  2. A genome in FASTA format

The GFF file contains coordinates of various features, but does not contain underlying sequences. This is why a FASTA file needs to be provided as well.
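
A small Python sketch makes this concrete: a GFF3 feature line carries only a sequence name, coordinates, and a strand, so recovering the bases of a feature requires looking them up in the genome. The sketch relies on the GFF3 conventions of nine tab-separated columns and 1-based, inclusive coordinates. The function name feature_sequence and the dictionary-based genome are hypothetical simplifications.

```python
# Translation table used to complement bases on the minus strand.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def feature_sequence(gff_line, genome):
    """Return the bases of one GFF3 feature.

    genome maps sequence names to sequence strings (e.g. parsed from FASTA).
    """
    fields = gff_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF3 feature line has nine tab-separated columns")
    seqid, strand = fields[0], fields[6]
    start, end = int(fields[3]), int(fields[4])
    if seqid not in genome:
        raise KeyError(f"sequence {seqid!r} is not in the genome")
    # GFF3 coordinates are 1-based and inclusive; Python slices are 0-based.
    sequence = genome[seqid][start - 1:end]
    if strand == "-":
        # Reverse complement for features on the minus strand.
        sequence = sequence.translate(_COMPLEMENT)[::-1]
    return sequence
```

The KeyError branch corresponds to the situation to avoid in practice: a GFF file whose sequence names do not match those in the FASTA file.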


GFF usage scenario

The following example also uses E. coli K-12 MG1655:

  1. Go to NCBI page for K-12 MG1655 genome.
  2. Copy URLs for files with gff.gz and fna.gz extensions. The first file contains annotations in GFF3 format. The second file contains entire genome as a FASTA record.
  3. Paste URLs into upload tool and set datatypes to gff3 and fasta.gz for annotations and genome, respectively.
  4. Use this tool to generate a snpEff database from the GFF and FASTA datasets.
  5. Map your reads against the FASTA dataset and continue as described in the above example.

Using SnpEff in Galaxy: A few points to remember

SnpEff relies on specially formatted databases to generate annotations. It will not work without them. There are several ways in which these databases can be obtained.

Pre-cached databases

Many standard (e.g., human, mouse, Drosophila) databases are likely pre-cached within a given Galaxy instance. You should be able to see them listed in the Genome drop-down of the SnpEff eff tool.

If you do not see them, keep reading...

Download pre-built databases

The SnpEff project generates a large number of pre-built databases. These are available at https://sourceforge.net/projects/snpeff/files/databases/v4_3/ and can be downloaded. Follow these steps:

  1. Use SnpEff databases tool to generate a list of existing databases. Note the name of the database you need.
  2. Use SnpEff download tool to download the database.
  3. Finally, run SnpEff eff and select the downloaded database from your history using the "Downloaded snpEff database in your history" option of the Genome source parameter.

Alternatively, you can specify the name of the database directly in SnpEff eff using the Download on demand option (again, Genome source parameter). In this case snpEff will download the database before performing annotation.

Create your own database

When you are dealing with bacterial or viral (or, frankly, any other) genomes, it may be easier to create the database yourself. To do so:

  1. Download the Genbank record corresponding to your genome of interest from NCBI, or use annotations in GFF format accompanied by the corresponding genome in FASTA format.
  2. Use SnpEff build to create the database.
  3. Use the database in SnpEff eff (using Custom option for Genome source parameter).

Creating a custom database has one major advantage: it guarantees that you will not have any issues related to reference sequence naming, the most common source of SnpEff errors.


To learn more about snpEff read its manual at http://snpeff.sourceforge.net/SnpEff_manual.html