Galaxy | Tool Preview

SnpSift Extract Fields (version 4.3+t.galaxy0)
Separated by spaces. See help below for an explanation
When variants have more than one effect, lists one effect per line, while all other parameters in the line are repeated across mutiple lines
Separate multiple fields in one column with this character, e.g. a comma, rather than a column for each of the multiple values
Represent empty fields with this value, rather than leaving them blank

What is does

SnpSift Extract Fields selects columns from a VCF dataset into a Tab-delimited format.


How to know which fields to extract?

A VCF dataset contains mandatory fields as well as optional fields. Mandatory fields are required by VCF specifications and present in any valid VCF dataset. The Fields to extract input box of the tool above is already pre-filled with names of mandatory fields.

To know what other fields are available in a given VCF file simply look at its header. INFO and FORMAT lines will contain description of existing fields. For example, if you see a line:

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">

you can use NS as the field name.


Dealing with field generated with SnpEff

The current version of SnpEff produces so called ANN fields:

"ANN[*].ALLELE" (alias GENOTYPE)
"ANN[*].EFFECT" (alias ANNOTATION): Effect in Sequence ontology terms (e.g. 'missense_variant', 'synonymous_variant', 'stop_gained', etc.)
"ANN[*].IMPACT" { HIGH, MODERATE, LOW, MODIFIER }
"ANN[*].GENE" Gene name (e.g. 'PSD3')
"ANN[*].GENEID" Gene ID
"ANN[*].FEATURE"
"ANN[*].FEATUREID" (alias TRID: Transcript ID)
"ANN[*].BIOTYPE" Biotype, as described by the annotations (e.g. 'protein_coding')
"ANN[*].RANK" Exon or Intron rank (i.e. exon number in a transcript)
"ANN[*].HGVS_C" (alias HGVS_DNA, CODON): Variant in HGVS (DNA) notation
"ANN[*].HGVS_P" (alias HGVS, HGVS_PROT, AA): Variant in HGVS (protein) notation
"ANN[*].CDNA_POS" (alias POS_CDNA)
"ANN[*].CDNA_LEN" (alias LEN_CDNA)
"ANN[*].CDS_POS" (alias POS_CDS)
"ANN[*].CDS_LEN" (alias LEN_CDS)
"ANN[*].AA_POS" (alias POS_AA)
"ANN[*].AA_LEN" (alias LEN_AA)
"ANN[*].DISTANCE"
"ANN[*].ERRORS" (alias WARNING, INFOS)

Older versions produced EFF fields:

"EFF[*].EFFECT"
"EFF[*].IMPACT"
"EFF[*].FUNCLASS"
"EFF[*].CODON"
"EFF[*].AA"
"EFF[*].AA_LEN"
"EFF[*].GENE"
"EFF[*].BIOTYPE"
"EFF[*].CODING"
"EFF[*].TRID"
"EFF[*].RANK"

In addition there are LOF and NMD fields:

"LOF[*].GENE"
"LOF[*].GENEID"
"LOF[*].NUMTR"
"LOF[*].PERC"

"NMD[*].GENE"
"NMD[*].GENEID"
"NMD[*].NUMTR"
"NMD[*].PERC"

To find our whether your VCF contains ANN or EFF annotations simply look at its header.


Usage examples

Extracting chromosome, position, ID and allele frequency from a VCF file:

CHROM POS ID AF

The result will look something like:

#CHROM        POS        ID            AF
1             69134                    0.086
1             69496      rs150690004   0.001

Extracting genotype fields:

CHROM POS ID THETA GEN[0].GL[1] GEN[1].GL GEN[3].GL[*] GEN[*].GT

This means to extract:

The result will look something like:

#CHROM  POS     ID              THETA   GEN[0].GL[1]    GEN[1].GL               GEN[3].GL[*]            GEN[*].GT
1       10583   rs58108140      0.0046  -0.47           -0.24,-0.44,-1.16       -0.48   -0.48   -0.48   0|0     0|0     0|0     0|1     0|0     0|1     0|0     0|0     0|1
1       10611   rs189107123     0.0077  -0.48           -0.24,-0.44,-1.16       -0.48   -0.48   -0.48   0|0     0|1     0|0     0|0     0|0     0|0     0|0     0|0     0|0
1       13302   rs180734498     0.0048  -0.58           -2.45,-0.00,-5.00       -0.48   -0.48   -0.48   0|0     0|1     0|0     0|0     0|0     1|0     0|0     0|1     0|0
Extracting fields with multiple values:
(notice that there are multiple effect columns per line because there are multiple effects per variant)

CHROM POS REF ALT ANN[*].EFFECT

The result will look something like:

#CHROM  POS REF ALT ANN[*].EFFECT
22  17071756    T   C   3_prime_UTR_variant downstream_gene_variant
22  17072035    C   T   missense_variant    downstream_gene_variant
22  17072258    C   A   missense_variant    downstream_gene_variant

Extracting fields with multiple values using a comma as a multiple field separator:

CHROM POS REF ALT ANN[*].EFFECT ANN[*].HGVS_P

The result will look something like:

#CHROM  POS REF ALT ANN[*].EFFECT   ANN[*].HGVS_P
22  17071756    T   C   3_prime_UTR_variant,downstream_gene_variant .,.
22  17072035    C   T   missense_variant,downstream_gene_variant    p.Gly469Glu,.
22  17072258    C   A   missense_variant,downstream_gene_variant    p.Gly395Cys,.

Extracting fields with multiple values, one effect per line:

CHROM POS REF ALT ANN[*].EFFECT

The result will look something like:

#CHROM  POS REF ALT ANN[*].EFFECT
22  17071756    T   C   3_prime_UTR_variant
22  17071756    T   C   downstream_gene_variant
22  17072035    C   T   missense_variant
22  17072035    C   T   downstream_gene_variant
22  17072258    C   A   missense_variant
22  17072258    C   A   downstream_gene_variant

For details about this tool, please go to: