
What it does

SnpEff is a variant annotation and effect prediction tool. It annotates genetic variants and predicts their effects (such as amino acid changes).

A typical SnpEff use case would be:

Input: Predicted variants (SNPs, insertions, deletions and MNPs). The input file is usually obtained from a sequencing experiment and is usually in Variant Call Format (VCF).

Output: SnpEff analyzes the input variants. It annotates the variants and calculates the effects they produce on known genes (e.g. amino acid changes). A list of the effects and annotations that SnpEff can calculate is given in the SnpEff manual.

By genetic variant we mean a difference between a genome and a "reference" genome. As an example, imagine we are sequencing a "sample". Here "sample" can mean anything that you are interested in studying, from a cell culture to a mouse or a cancer patient. It is standard procedure to compare your sample sequences against the corresponding "reference genome". For instance, you may compare the cancer patient's genome against the human reference genome.

In a typical sequencing experiment, you will find many places in the genome where your sample differs from the reference genome. These are called "genomic variants" or just "variants". Typically, variants are categorized as follows:

SNP (Single-Nucleotide Polymorphism) Reference = 'A', Sample = 'C'

Ins (Insertion) Reference = 'A', Sample = 'AGT'

Del (Deletion) Reference = 'AC', Sample = 'C'

MNP (Multiple-nucleotide polymorphism) Reference = 'ATA', Sample = 'GTC'

MIXED (Multiple-nucleotide and an InDel) Reference = 'ATA', Sample = 'GTCAGT'

This is not a comprehensive list; it is only meant to give you an idea.
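The categories above can be illustrated with a small Python sketch that assigns a type to a reference/sample allele pair. This is a simplified heuristic written for illustration only, not SnpEff's own classification logic; the function name `classify_variant` is hypothetical.

```python
def classify_variant(ref, alt):
    """Classify a variant from its reference and sample alleles.

    Simplified heuristic for illustration; not SnpEff's implementation.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("reference and sample alleles are identical")
    # Same length: a single-base change is a SNP, a longer one is an MNP.
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "MNP"
    # Sample allele contains the whole reference allele at one end: insertion.
    if len(alt) > len(ref) and (alt.startswith(ref) or alt.endswith(ref)):
        return "INS"
    # Reference allele contains the whole sample allele at one end: deletion.
    if len(ref) > len(alt) and (ref.startswith(alt) or ref.endswith(alt)):
        return "DEL"
    # Length change combined with base substitutions.
    return "MIXED"


# The five example pairs from the list above.
examples = [("A", "C"), ("A", "AGT"), ("AC", "C"), ("ATA", "GTC"), ("ATA", "GTCAGT")]
for ref, alt in examples:
    print(ref, alt, classify_variant(ref, alt))
```

Run on the five examples from the list, the sketch reports SNP, INS, DEL, MNP and MIXED, in that order.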

Suppose you have a huge file describing all the differences between your sample and the reference genome, but you want to know more about these variants than just their genomic coordinates. For example: Are they in a gene? In an exon? Do they change the protein coding sequence? Do they cause premature stop codons? SnpEff can help you answer all these questions. The process of adding this information about the variants is called "annotation". SnpEff provides several degrees of annotation, from simple (e.g. which gene each variant affects) to extremely complex (e.g. will this non-coding variant affect the expression of a gene?). Note that the more complex the annotation, the more it relies on computational prediction. Such predictions can be incorrect, so results from SnpEff (or any prediction algorithm) cannot be trusted blindly; they must be analyzed and independently validated by corresponding wet-lab experiments.
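To make the idea of annotation concrete: in VCF output, recent SnpEff versions record each annotation in an ANN entry of the INFO column, as pipe-separated fields (allele, effect, impact, gene name, and so on). The Python sketch below splits such an entry into named fields. The helper `parse_ann`, the short field names, and the sample record are illustrative assumptions, not part of SnpEff itself.

```python
# Short labels for the 16 pipe-separated ANN sub-fields (names chosen here for readability).
ANN_FIELDS = [
    "allele", "annotation", "impact", "gene_name", "gene_id",
    "feature_type", "feature_id", "transcript_biotype", "rank",
    "hgvs_c", "hgvs_p", "cdna_pos", "cds_pos", "aa_pos",
    "distance", "messages",
]


def parse_ann(info):
    """Return one dict per annotation found in the ANN entry of a VCF INFO string."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            # Several annotations for the same variant are comma-separated.
            records = entry[len("ANN="):].split(",")
            return [dict(zip(ANN_FIELDS, record.split("|"))) for record in records]
    return []


# Illustrative INFO value; the record is an example, not real output from your data.
info = ("DP=20;ANN=T|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|"
        "ENST00000359963|protein_coding|1/1|c.1406G>A|p.Gly469Glu|"
        "1666/2034|1406/1674|469/557||")

for ann in parse_ann(info):
    print(ann["gene_name"], ann["annotation"], ann["impact"], ann["hgvs_p"])
```

A variant that overlaps several transcripts carries several comma-separated annotations, which is why the helper returns a list.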

Using SnpEff in Galaxy: A few points to remember

SnpEff relies on specially formatted databases to generate annotations. It will not work without them. There are several ways in which these databases can be obtained.

Pre-cached databases

Many standard databases (e.g., human, mouse, Drosophila) are likely pre-cached within a given Galaxy instance. You should be able to see them listed in the Genome drop-down of the SnpEff eff tool.

If you do not see them, keep reading.

Download pre-built databases

The SnpEff project generates a large number of pre-built databases. These are available at https://sourceforge.net/projects/snpeff/files/databases/v4_3/ and can be downloaded. Follow these steps:

Use the SnpEff databases tool to generate a list of existing databases. Note the name of the database you need.

Use the SnpEff download tool to download the database.

Finally, run SnpEff eff, choosing the downloaded database from the history with the Downloaded snpEff database in your history option of the Genome source parameter.

Alternatively, you can specify the name of the database directly in SnpEff eff using the Download on demand option (again, of the Genome source parameter). In this case, SnpEff will download the database before performing the annotation.

Create your own database

When you are dealing with bacterial or viral (or, frankly, any other) genomes, it may be easier to create the database yourself. To do this:

Download the GenBank record corresponding to your genome of interest from NCBI, or use annotations in GFF format accompanied by the corresponding genome in FASTA format.

Use SnpEff build to create the database.

Use the database in SnpEff eff (using the Custom option of the Genome source parameter).

Creating a custom database has one major advantage: it guarantees that you will not have any issues related to reference sequence naming, the most common source of SnpEff errors.
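A naming mismatch typically means that the sequence names in the VCF (for example "chr2") differ from those in the reference used to build the database (for example "2"). The Python sketch below shows one way to spot such mismatches before running the annotation, assuming the reference is available as FASTA text. The helper names and the tiny inline files are hypothetical and only illustrate the check.

```python
def fasta_names(fasta_text):
    """Sequence names: the first word after '>' on each FASTA header line."""
    return {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    }


def vcf_chroms(vcf_text):
    """Values of the first (CHROM) column of every non-header VCF line."""
    return {
        line.split("\t")[0]
        for line in vcf_text.splitlines()
        if line.strip() and not line.startswith("#")
    }


def unmatched_chroms(vcf_text, fasta_text):
    """Sequence names used in the VCF that are absent from the FASTA."""
    return sorted(vcf_chroms(vcf_text) - fasta_names(fasta_text))


# Made-up miniature inputs, for illustration only.
fasta_text = ">1 first sequence\nACGTACGT\n>2 second sequence\nTTGACCA\n"
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t.\tA\tC\t.\t.\t.\n"
    "chr2\t200\t.\tG\tT\t.\t.\t.\n"
)

print(unmatched_chroms(vcf_text, fasta_text))
```

An empty result means every sequence name in the VCF also appears in the FASTA; any name that is reported would need to be renamed on one side before annotation.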

To learn more about SnpEff, read its manual at http://snpeff.sourceforge.net/SnpEff_manual.html