What it does

SnpSift Extract Fields extracts selected fields from a VCF dataset and writes them as columns in tab-delimited format.

How to know which fields to extract?

A VCF dataset contains mandatory fields as well as optional fields. Mandatory fields are required by VCF specifications and present in any valid VCF dataset. The Fields to extract input box of the tool above is already pre-filled with names of mandatory fields.

To find out what other fields are available in a given VCF file, look at its header. The INFO and FORMAT lines describe the fields present in the file. For example, if you see a line:

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">

you can use NS as the field name.
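As a rough illustration of reading the header, the following Python sketch lists the IDs declared in INFO and FORMAT lines. It is not part of SnpSift; the function name `list_header_fields` and the example header are hypothetical and chosen only for this illustration.

```python
import re

# Matches structured header lines such as ##INFO=<ID=NS,...> or ##FORMAT=<ID=GT,...>
HEADER_ID = re.compile(r'^##(INFO|FORMAT)=<ID=([^,>]+)')

def list_header_fields(lines):
    """Return (kind, field_id) pairs for the INFO and FORMAT header lines."""
    fields = []
    for line in lines:
        if not line.startswith('##'):
            break  # meta-information lines end at the #CHROM column line
        match = HEADER_ID.match(line)
        if match:
            fields.append((match.group(1), match.group(2)))
    return fields

# Illustrative header, not taken from a real dataset
example_header = [
    '##fileformat=VCFv4.2',
    '##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO',
]

for kind, field_id in list_header_fields(example_header):
    print(kind, field_id)
```

Each ID reported for an INFO line is a candidate name for the Fields to extract box.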

Dealing with fields generated by SnpEff

The current version of SnpEff produces so-called ANN fields:

"ANN[*].ALLELE" (alias GENOTYPE)
"ANN[*].EFFECT" (alias ANNOTATION): Effect in Sequence ontology terms (e.g. 'missense_variant', 'synonymous_variant', 'stop_gained', etc.)
"ANN[*].IMPACT" { HIGH, MODERATE, LOW, MODIFIER }
"ANN[*].GENE" Gene name (e.g. 'PSD3')
"ANN[*].GENEID" Gene ID
"ANN[*].FEATURE"
"ANN[*].FEATUREID" (alias TRID: Transcript ID)
"ANN[*].BIOTYPE" Biotype, as described by the annotations (e.g. 'protein_coding')
"ANN[*].RANK" Exon or Intron rank (i.e. exon number in a transcript)
"ANN[*].HGVS_C" (alias HGVS_DNA, CODON): Variant in HGVS (DNA) notation
"ANN[*].HGVS_P" (alias HGVS, HGVS_PROT, AA): Variant in HGVS (protein) notation
"ANN[*].CDNA_POS" (alias POS_CDNA)
"ANN[*].CDNA_LEN" (alias LEN_CDNA)
"ANN[*].CDS_POS" (alias POS_CDS)
"ANN[*].CDS_LEN" (alias LEN_CDS)
"ANN[*].AA_POS" (alias POS_AA)
"ANN[*].AA_LEN" (alias LEN_AA)
"ANN[*].DISTANCE"
"ANN[*].ERRORS" (alias WARNING, INFOS)
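To make the sub-field names above more concrete, here is a minimal Python sketch of how a raw ANN value breaks down. It assumes the SnpEff ANN layout, in which annotations are separated by commas and sub-fields by `|`, and it assumes that `[*]` refers to every annotation of a variant. Only the first eight sub-fields are mapped. The function name `split_ann` and the example value (including the IDs `GENE_ID_1`, `TR_1` and `TR_2`) are hypothetical.

```python
# Names of the first eight ANN sub-fields, in the order they appear in the value
ANN_NAMES = [
    'ALLELE', 'EFFECT', 'IMPACT', 'GENE',
    'GENEID', 'FEATURE', 'FEATUREID', 'BIOTYPE',
]

def split_ann(ann_value):
    """Split a raw ANN value into one dict per annotation."""
    entries = []
    for annotation in ann_value.split(','):   # one entry per annotation
        parts = annotation.split('|')         # sub-fields within an annotation
        entries.append(dict(zip(ANN_NAMES, parts)))
    return entries

# Hypothetical ANN value with two annotations for the same variant
ann = ('A|missense_variant|MODERATE|PSD3|GENE_ID_1|transcript|TR_1|protein_coding,'
       'A|synonymous_variant|LOW|PSD3|GENE_ID_1|transcript|TR_2|protein_coding')

entries = split_ann(ann)
print([entry['EFFECT'] for entry in entries])     # roughly what ANN[*].EFFECT covers
print([entry['FEATUREID'] for entry in entries])  # roughly what ANN[*].FEATUREID covers
```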

Older versions produced EFF fields:

"EFF[*].EFFECT"
"EFF[*].IMPACT"
"EFF[*].FUNCLASS"
"EFF[*].CODON"
"EFF[*].AA"
"EFF[*].AA_LEN"
"EFF[*].GENE"
"EFF[*].BIOTYPE"
"EFF[*].CODING"
"EFF[*].TRID"
"EFF[*].RANK"

In addition there are LOF and NMD fields:

"LOF[*].GENE"
"LOF[*].GENEID"
"LOF[*].NUMTR"
"LOF[*].PERC"

"NMD[*].GENE"
"NMD[*].GENEID"
"NMD[*].NUMTR"
"NMD[*].PERC"

To find out whether your VCF contains ANN or EFF annotations, look at its header.
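A header check of this kind could be sketched in Python as follows. This is an illustration only, assuming the annotations are declared with `##INFO=<ID=ANN,...>` or `##INFO=<ID=EFF,...>` header lines. The function name `annotation_style` and the example header are hypothetical.

```python
def annotation_style(header_lines):
    """Return the sorted list of SnpEff annotation tags (ANN, EFF) declared in the header."""
    found = set()
    for line in header_lines:
        if line.startswith('##INFO=<ID=ANN,'):
            found.add('ANN')
        elif line.startswith('##INFO=<ID=EFF,'):
            found.add('EFF')
    return sorted(found)

# Illustrative header lines with shortened descriptions
header = [
    '##fileformat=VCFv4.2',
    '##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations">',
    '##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects">',
]

print(annotation_style(header))
```

If the result contains ANN, use the ANN[*] field names; if it contains EFF, use the EFF[*] names.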

Usage examples

Extracting chromosome, position, ID and allele frequency from a VCF file:

CHROM POS ID AF
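To show what such a selection amounts to, the Python sketch below pulls the same four fields from a single VCF data line. It is not how SnpSift is implemented, and it handles only the fixed columns and simple INFO keys. The function name `extract_fields` is hypothetical, and the record is an illustrative one in the style of the VCF specification examples.

```python
FIXED_COLUMNS = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER']

def extract_fields(vcf_line, fields):
    """Return the requested fixed columns or INFO keys from one VCF data line."""
    columns = vcf_line.rstrip('\n').split('\t')
    record = dict(zip(FIXED_COLUMNS, columns))
    # INFO is the 8th column: semicolon-separated key=value pairs or bare flags
    for entry in columns[7].split(';'):
        key, _, value = entry.partition('=')
        record[key] = value  # bare flags get an empty value in this sketch
    # Fields missing from the record are returned as empty strings
    return [record.get(name, '') for name in fields]

# Illustrative record
line = '20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2'

row = extract_fields(line, ['CHROM', 'POS', 'ID', 'AF'])
print('\t'.join(row))
```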