What it does
SnpSift Extract Fields extracts selected fields from a VCF dataset and writes them as columns of a tab-delimited table.
How to know which fields to extract?
A VCF dataset contains mandatory fields as well as optional fields. Mandatory fields are required by VCF specifications and present in any valid VCF dataset. The Fields to extract input box of the tool above is already pre-filled with names of mandatory fields.
To see which other fields are available in a given VCF file, look at its header. The INFO and FORMAT lines describe the existing fields. For example, if you see the line:
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
you can use NS as the field name.
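As a rough illustration of reading field names from a header, the sketch below collects the IDs declared on INFO and FORMAT lines. The function name `list_header_fields` and the example header (apart from the NS line quoted above) are made up for this example and are not part of SnpSift.

```python
import re

# Matches lines such as ##INFO=<ID=NS,Number=1,...> and captures the kind and the ID.
_HEADER_FIELD = re.compile(r"^##(INFO|FORMAT)=<ID=([^,>]+)")

def list_header_fields(lines):
    """Return {'INFO': [...], 'FORMAT': [...]} with field IDs in header order."""
    fields = {"INFO": [], "FORMAT": []}
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM line
        match = _HEADER_FIELD.match(line)
        if match:
            kind, field_id = match.groups()
            fields[kind].append(field_id)
    return fields

# Illustrative header; only the NS line is taken from the text above.
example_header = [
    "##fileformat=VCFv4.1",
    '##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
print(list_header_fields(example_header))
```

Any ID listed under INFO can be used directly as a field name; IDs listed under FORMAT are per-sample values and are addressed through GEN[...] expressions.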
Dealing with fields generated by SnpEff
The current version of SnpEff produces so-called ANN fields:
"ANN[*].ALLELE" (alias GENOTYPE)
"ANN[*].EFFECT" (alias ANNOTATION): Effect in Sequence Ontology terms (e.g. 'missense_variant', 'synonymous_variant', 'stop_gained', etc.)
"ANN[*].IMPACT" { HIGH, MODERATE, LOW, MODIFIER }
"ANN[*].GENE" Gene name (e.g. 'PSD3')
"ANN[*].GENEID" Gene ID
"ANN[*].FEATURE"
"ANN[*].FEATUREID" (alias TRID: Transcript ID)
"ANN[*].BIOTYPE" Biotype, as described by the annotations (e.g. 'protein_coding')
"ANN[*].RANK" Exon or Intron rank (i.e. exon number in a transcript)
"ANN[*].HGVS_C" (alias HGVS_DNA, CODON): Variant in HGVS (DNA) notation
"ANN[*].HGVS_P" (alias HGVS, HGVS_PROT, AA): Variant in HGVS (protein) notation
"ANN[*].CDNA_POS" (alias POS_CDNA)
"ANN[*].CDNA_LEN" (alias LEN_CDNA)
"ANN[*].CDS_POS" (alias POS_CDS)
"ANN[*].CDS_LEN" (alias LEN_CDS)
"ANN[*].AA_POS" (alias POS_AA)
"ANN[*].AA_LEN" (alias LEN_AA)
"ANN[*].DISTANCE"
"ANN[*].ERRORS" (alias WARNING, INFOS)
Older versions produced EFF fields:
"EFF[*].EFFECT"
"EFF[*].IMPACT"
"EFF[*].FUNCLASS"
"EFF[*].CODON"
"EFF[*].AA"
"EFF[*].AA_LEN"
"EFF[*].GENE"
"EFF[*].BIOTYPE"
"EFF[*].CODING"
"EFF[*].TRID"
"EFF[*].RANK"
In addition there are LOF and NMD fields:
"LOF[*].GENE"
"LOF[*].GENEID"
"LOF[*].NUMTR"
"LOF[*].PERC"
"NMD[*].GENE"
"NMD[*].GENEID"
"NMD[*].NUMTR"
"NMD[*].PERC"
To find out whether your VCF contains ANN or EFF annotations, look at its header.
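The header check can also be scripted. The following is a minimal sketch that reports which of the two annotation tags are declared as INFO fields; the helper name `annotation_tags` is hypothetical and the Description texts in the example are shortened placeholders, not real SnpEff output.

```python
def annotation_tags(header_lines):
    """Return the set of SnpEff annotation tags ('ANN', 'EFF') declared in the header."""
    found = set()
    for line in header_lines:
        for tag in ("ANN", "EFF"):
            # An INFO declaration starts with ##INFO=<ID=<tag>, followed by its attributes.
            if line.startswith(f"##INFO=<ID={tag},"):
                found.add(tag)
    return found

# Shortened, illustrative header lines.
new_style_header = [
    "##fileformat=VCFv4.1",
    '##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations (shortened)">',
]
old_style_header = [
    "##fileformat=VCFv4.1",
    '##INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects (shortened)">',
]
print(annotation_tags(new_style_header))
print(annotation_tags(old_style_header))
```

If the result contains ANN, use the ANN[*] field names listed above; if it contains EFF, use the EFF[*] names.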
Usage examples
Extracting chromosome, position, ID and allele frequency from a VCF file:
CHROM POS ID AF
The result will look something like:
#CHROM  POS    ID           AF
1       69134               0.086
1       69496  rs150690004  0.001
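To make the lookup concrete, here is a small sketch of how such a row could be assembled from one VCF data line: fixed columns are read by position and everything else is looked up in the INFO column. The function `extract_fields` is a hypothetical helper, the REF, ALT and NS values in the example lines are invented, and printing a missing value ('.') as an empty cell is an assumption chosen to mirror the sample output above.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER"]

def extract_fields(vcf_line, names):
    """Return the requested fixed or INFO fields of one VCF data line as a list of strings."""
    columns = vcf_line.rstrip("\n").split("\t")
    fixed = dict(zip(FIXED_COLUMNS, columns))
    info = {}
    for entry in columns[7].split(";"):
        key, _, value = entry.partition("=")  # flag entries have no '=' and get an empty value
        info[key] = value
    row = []
    for name in names:
        value = fixed.get(name, info.get(name, ""))
        row.append("" if value == "." else value)  # assumption: missing values become empty cells
    return row

# Illustrative data lines (REF, ALT and NS are invented).
line_with_id = "1\t69496\trs150690004\tG\tA\t.\tPASS\tAF=0.001;NS=3"
line_without_id = "1\t69134\t.\tA\tG\t.\tPASS\tAF=0.086;NS=3"
for line in (line_without_id, line_with_id):
    print("\t".join(extract_fields(line, ["CHROM", "POS", "ID", "AF"])))
```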
Extracting genotype fields:
CHROM POS ID THETA GEN[0].GL[1] GEN[1].GL GEN[3].GL[*] GEN[*].GT
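This means to extract:
CHROM POS ID: regular fields (as in the previous example)
THETA: a field from the INFO column
GEN[0].GL[1]: the second likelihood of the first genotype (sample)
GEN[1].GL: the whole GL field of the second sample (all entries, not split)
GEN[3].GL[*]: all likelihoods of the fourth sample, each in its own tab-separated column
GEN[*].GT: the genotype (GT) sub-field of all samples, tab-separated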
The result will look something like:
#CHROM  POS    ID           THETA   GEN[0].GL[1]  GEN[1].GL          GEN[3].GL[*]       GEN[*].GT
1       10583  rs58108140   0.0046  -0.47         -0.24,-0.44,-1.16  -0.48 -0.48 -0.48  0|0 0|0 0|0 0|1 0|0 0|1 0|0 0|0 0|1
1       10611  rs189107123  0.0077  -0.48         -0.24,-0.44,-1.16  -0.48 -0.48 -0.48  0|0 0|1 0|0 0|0 0|0 0|0 0|0 0|0 0|0
1       13302  rs180734498  0.0048  -0.58         -2.45,-0.00,-5.00  -0.48 -0.48 -0.48  0|0 0|1 0|0 0|0 0|0 1|0 0|0 0|1 0|0
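The genotype expressions index first into the samples and then into a FORMAT sub-field. The sketch below mimics that lookup for a single data line; `genotype_value` is a hypothetical helper, and the sample columns are invented so that they agree with the first output row above (only four samples are shown).

```python
def genotype_value(format_col, samples, sample_index, subfield, value_index=None):
    """Resolve GEN[sample_index].subfield, or GEN[sample_index].subfield[value_index]."""
    keys = format_col.split(":")                  # e.g. ['GT', 'GL']
    values = samples[sample_index].split(":")     # values of one sample, in the same order
    value = values[keys.index(subfield)]
    if value_index is None:
        return value                              # whole sub-field, commas kept
    return value.split(",")[value_index]          # one entry of a multi-valued sub-field

format_col = "GT:GL"
# Invented sample columns, consistent with the first row of the example output.
samples = [
    "0|0:-0.18,-0.47,-2.42",
    "0|0:-0.24,-0.44,-1.16",
    "0|0:-0.30,-0.30,-1.00",
    "0|1:-0.48,-0.48,-0.48",
]

print(genotype_value(format_col, samples, 0, "GL", 1))       # GEN[0].GL[1]
print(genotype_value(format_col, samples, 1, "GL"))          # GEN[1].GL
print(genotype_value(format_col, samples, 3, "GL").split(","))  # GEN[3].GL[*]
print([genotype_value(format_col, samples, i, "GT") for i in range(len(samples))])  # GEN[*].GT
```

Indices are zero-based, so GEN[0] is the first sample and GL[1] is the second likelihood.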
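Extracting fields with multiple values (a line can contain several effect columns because a variant can have several effects):
CHROM POS REF ALT ANN[*].EFFECT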
The result will look something like:
#CHROM  POS       REF  ALT  ANN[*].EFFECT
22      17071756  T    C    3_prime_UTR_variant  downstream_gene_variant
22      17072035  C    T    missense_variant     downstream_gene_variant
22      17072258  C    A    missense_variant     downstream_gene_variant
Extracting fields with multiple values using a comma as a multiple field separator:
CHROM POS REF ALT ANN[*].EFFECT ANN[*].HGVS_P
The result will look something like:
#CHROM  POS       REF  ALT  ANN[*].EFFECT                                ANN[*].HGVS_P
22      17071756  T    C    3_prime_UTR_variant,downstream_gene_variant  .,.
22      17072035  C    T    missense_variant,downstream_gene_variant     p.Gly469Glu,.
22      17072258  C    A    missense_variant,downstream_gene_variant     p.Gly395Cys,.
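Both multi-value layouts shown so far come from the same list of per-annotation values; only the separator differs. The sketch below assumes the usual ANN layout, in which annotations are separated by commas, sub-fields by '|', and the effect is the second sub-field. The helper `ann_effects` is hypothetical, and the INFO string is a shortened, invented example (real ANN entries carry more sub-fields, and GENE1/GENE2 are placeholders).

```python
def ann_effects(info):
    """Return the effect of every annotation in the ANN entry of an INFO string."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "ANN":
            # One annotation per comma-separated item; the effect is sub-field 1 (zero-based).
            return [annotation.split("|")[1] for annotation in value.split(",")]
    return []  # no ANN entry present

# Shortened, invented INFO value.
info = "AC=1;ANN=C|3_prime_UTR_variant|MODIFIER|GENE1,C|downstream_gene_variant|MODIFIER|GENE2"
effects = ann_effects(info)
print("\t".join(effects))  # one column per effect
print(",".join(effects))   # all effects in one comma-separated column
```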
Extracting fields with multiple values, one effect per line:
CHROM POS REF ALT ANN[*].EFFECT
The result will look something like:
#CHROM  POS       REF  ALT  ANN[*].EFFECT
22      17071756  T    C    3_prime_UTR_variant
22      17071756  T    C    downstream_gene_variant
22      17072035  C    T    missense_variant
22      17072035  C    T    downstream_gene_variant
22      17072258  C    A    missense_variant
22      17072258  C    A    downstream_gene_variant
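The one-effect-per-line layout can be thought of as an expansion of the comma-separated form: the fixed columns are repeated once for each value. A minimal sketch of that expansion follows; `one_effect_per_line` is a hypothetical helper written for illustration and is not how the tool itself is implemented.

```python
def one_effect_per_line(row, multi_index, sep=","):
    """Expand one row into several rows, one per value in column `multi_index`."""
    values = row[multi_index].split(sep)
    return [row[:multi_index] + [value] + row[multi_index + 1:] for value in values]

# Row taken from the comma-separated example output.
row = ["22", "17071756", "T", "C", "3_prime_UTR_variant,downstream_gene_variant"]
for expanded in one_effect_per_line(row, 4):
    print("\t".join(expanded))
```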
For details about this tool, please go to: