What it does
This tool runs the "snpEff build -genbank" or "snpEff build -gff3" command to create a snpEff database.
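For readers curious about what such a build involves outside Galaxy, here is a minimal Python sketch. It assumes the layout described in the SnpEff manual: a data/<genome>/ directory holding genes.gbk for Genbank input, or genes.gff plus sequences.fa for GFF3 input, and a matching "<genome>.genome" entry in snpEff.config. The function name stage_build is hypothetical, and how this Galaxy tool stages files internally is not shown here. Check the layout against the manual for your SnpEff version before relying on it.

```python
import shutil
from pathlib import Path


def stage_build(workdir, genome, annotation, fasta=None):
    """Copy inputs into an assumed SnpEff data layout and return the build command.

    With fasta=None the annotation is treated as Genbank; otherwise it is
    treated as GFF3 accompanied by a FASTA genome.
    """
    data_dir = Path(workdir) / "data" / genome
    data_dir.mkdir(parents=True, exist_ok=True)

    if fasta is None:
        # Genbank carries both annotations and sequences in one file.
        shutil.copy(annotation, data_dir / "genes.gbk")
        flag = "-genbank"
    else:
        # GFF3 has coordinates only, so the FASTA genome goes alongside it.
        shutil.copy(annotation, data_dir / "genes.gff")
        shutil.copy(fasta, data_dir / "sequences.fa")
        flag = "-gff3"

    # Line that would have to appear in snpEff.config (assumed format).
    config_line = f"{genome}.genome : {genome}"
    # Command as named in this help text, followed by the genome name.
    command = ["snpEff", "build", flag, genome]
    return config_line, command
```

The returned command list could be passed to subprocess.run once the config entry is in place. The sketch only stages files and does not run SnpEff itself.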
Working with Genbank files
Using Genbank data for creating databases has several advantages:
- Genbank files contain annotations (such as locations of genes) together with sequences. This ensures that these two are in sync with each other.
- For small (or not so small) genomes, it is much more convenient to create a database on the fly and use it.
SnpEff errors out on highly fragmented genomes containing multiple scaffolds. This is because a single gene may be split between multiple scaffolds, causing SnpEff to crash. If this happens, use the GFF route described below.
Genbank usage scenario
Suppose you have a series of Illumina reads from an experiment involving E. coli K-12 MG1655. You want to map these reads to the reference genome of K-12 MG1655, call variants, and annotate them using snpEff. This tool enables the following analysis steps:
- Go to the NCBI page for the K-12 MG1655 genome (note that all NCBI genomes have a similar list of files associated with them).
- Copy the URL of the file with the extension gbff.gz.
- Paste the URL into the upload tool and set the datatype to genbank.gz.
- Use this tool to generate a snpEff database and FASTA sequences from the dataset you've uploaded during the previous step.
- Map your Illumina reads against the FASTA dataset generated in the previous step using BWA-MEM.
- Call variants using Freebayes.
- Annotate the VCF output of Freebayes with SnpEff eff using the database generated by this tool in the earlier step (using the Custom option for the Genome source parameter).
In this scenario the Genbank dataset is used twice. First, it is used to produce the FASTA sequences that BWA maps the reads against. Second, it is used to create the snpEff database. This guarantees that you will not have any issues related to reference sequence naming.
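To illustrate how a single Genbank file can yield the FASTA sequences used for mapping, here is a minimal Python sketch that pulls the sequence out of each record's ORIGIN block. It is not the tool's own implementation. The function names are hypothetical, and using the LOCUS name as the FASTA header is an assumption; the identifier the tool actually writes is not stated here.

```python
def genbank_to_records(text):
    """Return a list of (name, sequence) tuples from Genbank flat-file text."""
    records = []
    name, parts, in_origin = None, [], False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            # Assumption: the second field of the LOCUS line names the sequence.
            name = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            # End of record: emit what was collected and reset the state.
            if name is not None:
                records.append((name, "".join(parts).upper()))
            name, parts, in_origin = None, [], False
        elif in_origin:
            # Sequence lines look like "  1 acgtacgtac gtacgtacgt";
            # drop the leading position number and keep the bases.
            parts.extend(line.split()[1:])
    return records


def format_fasta(records, width=60):
    """Format (name, sequence) tuples as FASTA text wrapped at `width` columns."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

For a gbff.gz download, the text could be read with gzip.open(path, "rt") before being passed to genbank_to_records. Because the FASTA headers and the annotations come from the same records, the sequence names stay consistent between mapping and annotation.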
Working with GFF files
Alternatively, you can create a SnpEff database from GFF3 files downloaded from NCBI or any other source. Building a SnpEff database from a GFF dataset requires two inputs:
- The GFF file itself
- A genome in FASTA format
The GFF file contains coordinates of various features, but does not contain underlying sequences. This is why a FASTA file needs to be provided as well.
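The dependence of GFF coordinates on a separate FASTA file can be sketched in Python as follows. This is an illustrative example only. The function names are hypothetical, strand is ignored (the forward-strand slice is returned), and the GFF is assumed to have no embedded FASTA section.

```python
def read_fasta(text):
    """Return {sequence_name: sequence} from FASTA text."""
    parts, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # The name is the first word of the header line.
            name = fields[0] if fields else ""
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(p) for n, p in parts.items()}


def feature_sequences(gff_text, fasta_text, feature_type="gene"):
    """Return (attributes, sequence) for each GFF3 feature of the given type."""
    genome = read_fasta(fasta_text)
    found = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        seqid, start, end = cols[0], int(cols[3]), int(cols[4])
        if seqid not in genome:
            # Column 1 of the GFF must match a FASTA sequence name.
            raise KeyError(f"GFF sequence {seqid!r} not found in FASTA")
        # GFF3 coordinates are 1-based and inclusive.
        found.append((cols[8], genome[seqid][start - 1:end]))
    return found
```

The KeyError branch shows the kind of naming mismatch that arises when the GFF and FASTA files come from different sources.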
GFF usage scenario
The following example also uses E. coli K-12 MG1655:
Using SnpEff in Galaxy: A few points to remember
SnpEff relies on specially formatted databases to generate annotations. It will not work without them. There are several ways in which these databases can be obtained.
Pre-cached databases
Many standard (e.g., human, mouse, Drosophila) databases are likely pre-cached within a given Galaxy instance. You should be able to see them listed in the Genome drop-down of the SnpEff eff tool.
If you do not see them, keep reading...
Download pre-built databases
The SnpEff project generates a large number of pre-built databases. These are available at https://sourceforge.net/projects/snpeff/files/databases/v4_3/ and can be downloaded. Follow these steps:
- Use the SnpEff databases tool to generate a list of existing databases. Note the name of the database you need.
- Use the SnpEff download tool to download the database.
- Finally, run SnpEff eff with the downloaded database, selecting it from the history via the Downloaded snpEff database in your history option of the Genome source parameter.
Alternatively, you can specify the name of the database directly in SnpEff eff using the Download on demand option (again, Genome source parameter). In this case snpEff will download the database before performing annotation.
Create your own database
When you are dealing with bacterial or viral (or, frankly, any other) genomes, it may be easier to create the database yourself. To do this:
- Download the Genbank record corresponding to your genome of interest from NCBI, or use annotations in GFF format accompanied by the corresponding genome in FASTA format.
- Use SnpEff build to create the database.
- Use the database in SnpEff eff (using the Custom option for the Genome source parameter).
Creating a custom database has one major advantage: it guarantees that you will not have any issues related to reference sequence naming, the most common source of SnpEff errors.
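A quick way to spot a naming mismatch before annotation is to compare the chromosome names in a VCF file with the sequence names in the reference FASTA. The Python sketch below is a hedged illustration with hypothetical function names. It checks only that names agree and does not inspect the SnpEff database itself.

```python
def fasta_names(fasta_text):
    """Return the set of sequence names (first word of each FASTA header)."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def vcf_chroms(vcf_text):
    """Return the set of values in the first (CHROM) column of VCF data lines."""
    chroms = set()
    for line in vcf_text.splitlines():
        if line and not line.startswith("#"):
            chroms.add(line.split("\t", 1)[0])
    return chroms


def unmatched_chroms(vcf_text, fasta_text):
    """Return sorted VCF chromosome names that are absent from the FASTA."""
    return sorted(vcf_chroms(vcf_text) - fasta_names(fasta_text))
```

An empty result means every variant refers to a sequence name present in the reference. Any name returned (for example "chr" when the reference uses an accession such as NC_000913.3) points to the kind of mismatch described above.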
To learn more about snpEff read its manual at http://snpeff.sourceforge.net/SnpEff_manual.html