Annotate and edit VCF/BCF files.
Examples:
# Remove three fields
bcftools annotate -x ID,INFO/DP,FORMAT/DP file.vcf.gz

# Remove all INFO fields and all FORMAT fields except for GT and PL
bcftools annotate -x INFO,^FORMAT/GT,FORMAT/PL file.vcf

# Add ID, QUAL and INFO/TAG, not replacing TAG if already present
bcftools annotate -a src.bcf -c ID,QUAL,+TAG dst.bcf

# Carry over all INFO and FORMAT annotations except FORMAT/GT
bcftools annotate -a src.bcf -c INFO,^FORMAT/GT dst.bcf

# Annotate from a tab-delimited file with six columns (the fifth is ignored),
# first indexing with tabix. The coordinates are 1-based.
tabix -s1 -b2 -e2 annots.tab.gz
bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,POS,REF,ALT,-,TAG file.vcf

# Annotate from a tab-delimited file with regions (1-based coordinates, inclusive)
tabix -s1 -b2 -e3 annots.tab.gz
bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf

# Annotate from a bed file (0-based coordinates, half-closed, half-open intervals)
bcftools annotate -a annots.bed.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf
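The tab-delimited examples above assume that an annotation file and a header file already exist. The following is a minimal sketch of what the two files might contain for the six-column case. The coordinates, values and the tag description are made up for illustration, and the scratch directory is an assumption of this sketch rather than anything bcftools requires.

```shell
# Write the example files into a scratch directory.
workdir=$(mktemp -d)

# Six tab-separated columns matching -c CHROM,POS,REF,ALT,-,TAG.
# The fifth column is skipped by "-"; all values here are made up.
printf '20\t14370\tG\tA\tignored\t0.5\n'   >  "$workdir/annots.tab"
printf '20\t17330\tT\tA\tignored\t0.017\n' >> "$workdir/annots.tab"

# Header line declaring the new INFO tag (the description is hypothetical).
printf '##INFO=<ID=TAG,Number=1,Type=Float,Description="Example annotation">\n' \
    > "$workdir/annots.hdr"

# Compress, index and annotate only if the tools and an input file are present.
if command -v bgzip >/dev/null && command -v tabix >/dev/null \
   && command -v bcftools >/dev/null && [ -f file.vcf ]; then
    bgzip -c "$workdir/annots.tab" > "$workdir/annots.tab.gz"
    tabix -s1 -b2 -e2 "$workdir/annots.tab.gz"
    bcftools annotate -a "$workdir/annots.tab.gz" -h "$workdir/annots.hdr" \
        -c CHROM,POS,REF,ALT,-,TAG file.vcf
fi
```

The header file is what makes TAG known to the output VCF; the Number and Type shown here are one plausible choice for a single floating-point value per site.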
Regions can be specified in a VCF, BED, or tab-delimited file (the default). The columns of the tab-delimited file are CHROM, POS, and, optionally, POS_TO, where positions are 1-based and inclusive. Uncompressed files are stored in memory, while bgzip-compressed and tabix-indexed region files are streamed. Note that sequence names must match exactly: "chr20" is not the same as "20". Also note that chromosome ordering in FILE is respected: the VCF is processed in the order in which chromosomes first appear in FILE. Within a chromosome, however, the VCF is always processed in ascending genomic coordinate order, regardless of the order in which the regions appear in FILE. Overlapping regions in FILE can result in duplicated, out-of-order positions in the output. This option requires indexed VCF/BCF files.
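The two coordinate conventions mentioned above are easy to mix up. As a small sketch with made-up coordinates, the same regions can be written in the tab-delimited convention (1-based, inclusive) and converted to the BED convention (0-based, half-open): the start shifts down by one and the end is unchanged. The variable names are hypothetical.

```shell
# Tab-delimited regions: CHROM, POS, POS_TO (1-based, inclusive).
# The second region covers a single position.
regions_tab=$(printf '20\t1000\t2000\n20\t5000\t5000\n')

# The same regions in BED convention (0-based, half-open):
# subtract one from the start, keep the end as it is.
regions_bed=$(printf '%s\n' "$regions_tab" |
    awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 - 1, $3 }')

printf '%s\n' "$regions_bed"
```

A single-base region such as 20:5000-5000 therefore becomes the BED interval 4999 to 5000.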
Valid expressions may contain:
numerical constants, string constants
1, 1.0, 1e-4
"String"
arithmetic operators
+,*,-,/
comparison operators
== (same as =), >, >=, <=, <, !=
regex operators "~" and its negation "!~"
INFO/HAYSTACK ~ "needle"
parentheses
(, )
logical operators
&& (same as &), ||, |
INFO tags, FORMAT tags, column names
INFO/DP or DP
FORMAT/DV, FMT/DV, or DV
FILTER, QUAL, ID, REF, ALT[0]
1 (or 0) to test the presence (or absence) of a flag
FlagA=1 && FlagB=0
"." to test missing values
DP=".", DP!=".", ALT="."
missing genotypes can be matched regardless of phase and ploidy (".|.", "./.", ".") using this expression
GT="."
TYPE for variant type in REF,ALT columns (indel,snp,mnp,ref,other)
TYPE="indel" | TYPE="snp"
array subscripts, "*" for any field
(DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0.3
DP4[*] == 0
CSQ[*] ~ "missense_variant.*deleterious"
functions on FORMAT tags (over samples) and INFO tags (over vector fields)
MAX, MIN, AVG, SUM, STRLEN, ABS
variables calculated on the fly if not present: number of alternate alleles; number of samples; count of alternate alleles; minor allele count (similar to AC, but the corresponding frequency never exceeds 0.5); frequency of alternate alleles (AF=AC/AN); frequency of minor alleles (MAF=MAC/AN); number of alleles in called genotypes
N_ALT, N_SAMPLES, AC, MAC, AF, MAF, AN
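The relations AF=AC/AN and MAF=MAC/AN can be checked with a little arithmetic. The sketch below uses a hypothetical helper, allele_stats, and assumes a biallelic site, where the minor allele count can be taken as the smaller of AC and AN-AC. That formula is an assumption of this example and is not a description of how the values are computed internally.

```shell
# Derive AF, MAC and MAF from AC and AN for a biallelic site.
# MAC = min(AC, AN - AC) is an assumption that holds for biallelic sites only.
allele_stats() {
    awk -v ac="$1" -v an="$2" 'BEGIN {
        mac = (ac < an - ac) ? ac : an - ac
        printf "AF=%.2f MAC=%d MAF=%.2f\n", ac / an, mac, mac / an
    }'
}

allele_stats 3 10   # alternate allele is the minor allele
allele_stats 8 10   # alternate allele is the major allele
```

In the second call the alternate allele is the common one, so AF and MAF differ.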
Notes:
Examples:
MIN(DV)>5
MIN(DV/DP)>0.3
MIN(DP)>10 & MIN(DV)>3
FMT/DP>10 & FMT/GQ>10 .. both conditions must be satisfied within one sample
FMT/DP>10 && FMT/GQ>10 .. the conditions can be satisfied in different samples
QUAL>10 | FMT/GQ>10 .. selects only GQ>10 samples
QUAL>10 || FMT/GQ>10 .. selects all samples at QUAL>10 sites
TYPE="snp" && QUAL>=10 && (DP4[2]+DP4[3] > 2)
MIN(DP)>35 && AVG(GQ)>50
ID=@file .. selects lines with ID present in the file
ID!=@~/file .. skip lines with ID present in the ~/file
MAF[0]<0.05 .. select rare variants at 5% cutoff
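The difference between "&" and "&&" on FORMAT tags is the subtlest point in the examples above. The sketch below emulates it in awk, following only the one-line descriptions given there; it does not call bcftools, and the helper name site_passes and the sample values are hypothetical. Each input line holds the DP and GQ of one sample at a single site.

```shell
# Emulate "FMT/DP>10 & FMT/GQ>10" versus "FMT/DP>10 && FMT/GQ>10".
# Input: one "DP GQ" pair per sample (made-up values).
site_passes() {
    awk '
        {
            dp_ok = ($1 > 10)
            gq_ok = ($2 > 10)
            if (dp_ok && gq_ok) same = 1   # both hold within one sample
            if (dp_ok) any_dp = 1          # DP condition holds in some sample
            if (gq_ok) any_gq = 1          # GQ condition holds in some sample
        }
        END { printf "&:%d &&:%d\n", same + 0, (any_dp && any_gq) + 0 }
    '
}

# Sample 1 satisfies only DP>10, sample 2 satisfies only GQ>10.
printf '20 5\n3 40\n' | site_passes
```

Here the site fails under "&" because no single sample meets both conditions, but passes under "&&" because each condition is met by some sample. With bcftools itself, such expressions are typically passed to the -i (include) or -e (exclude) option, for example bcftools view -i 'FMT/DP>10 & FMT/GQ>10' file.vcf.gz.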