What it does
This tool parses variant counts from a VCF file that reports per-strand read counts for each variant (see Input Format). It counts simple variants (A/C/G/T SNVs), calculates the number of alleles at each site, and calculates the minor allele frequency. It can apply filters based on coverage, strand bias, and minor allele frequency cutoffs.
Input Format
Note: variants that are not A/C/G/T SNVs will be ignored!
The input VCF should be like the output of the Naive Variant Detector tool (using the stranded option). The sample column(s) must give the read count for each variant on each strand. Below is an example of a valid sample column entry (the important part is after the last colon):
0/0:1:0.02:+T=27,+G=1,-T=22,
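As an illustration of how such an entry can be read, here is a minimal Python sketch. It is not the tool's own code; the function name parse_stranded_counts is hypothetical, and the sketch assumes each variant is written as strand sign, allele, "=", count, as in the example above.

```python
def parse_stranded_counts(sample_field):
    """Return {'+': {base: count}, '-': {base: count}} for A/C/G/T SNVs.

    Hypothetical helper: only the part after the last colon is read,
    and anything that is not a single A/C/G/T base is ignored.
    """
    counts = {"+": dict.fromkeys("ACGT", 0), "-": dict.fromkeys("ACGT", 0)}
    variant_part = sample_field.rsplit(":", 1)[-1]
    for entry in variant_part.split(","):
        if not entry:
            continue  # skip the empty piece left by a trailing comma
        variant, _, count = entry.partition("=")
        strand, allele = variant[:1], variant[1:]
        if strand in counts and allele in counts[strand]:
            counts[strand][allele] += int(count)
    return counts


counts = parse_stranded_counts("0/0:1:0.02:+T=27,+G=1,-T=22,")
print(counts["+"])  # plus-strand counts per base
print(counts["-"])  # minus-strand counts per base
```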
Output
Each row represents one site in one sample. For unstranded output, 13 fields give information about that site:
1. SAMPLE - Sample name (from VCF sample column labels)
2. CHR - Chromosome of the site
3. POS - Chromosomal coordinate of the site
4. A - Number of reads supporting an 'A'
5. C - 'C' reads
6. G - 'G' reads
7. T - 'T' reads
8. CVRG - Total (number of reads supporting one of the four bases above)
9. ALLELES - Number of qualifying alleles
10. MAJOR - Major allele
11. MINOR - Minor allele (2nd most prevalent variant)
12. MAF - Frequency of minor allele
13. BIAS - Strand bias measure
For stranded output, instead of using 4 columns to report read counts per base, 8 are used to report the stranded counts per base:
1       2    3    4   5   6   7   8   9   10  11  12    13       14     15     16   17
SAMPLE  CHR  POS  +A  +C  +G  +T  -A  -C  -G  -T  CVRG  ALLELES  MAJOR  MINOR  MAF  BIAS
Example
Below is a header line, followed by some example data lines. Since the input contained three samples, the data for each site is normally reported on three consecutive lines. However, if a sample falls below the coverage threshold at a site, that sample's line is omitted (as with BLOOD_2 at position 100):
#SAMPLE  CHR    POS  A   C    G    T  CVRG  ALLELES  MAJOR  MINOR  MAF      BIAS
BLOOD_1  chr20  99   0   101  1    2  104   1        C      T      0.01923  0.33657
BLOOD_2  chr20  99   82  44   0    1  127   2        A      C      0.34646  0.07823
BLOOD_3  chr20  99   0   110  1    0  111   1        C      G      0.009    1.00909
BLOOD_1  chr20  100  3   5    100  0  108   1        G      C      0.0463   0.15986
BLOOD_3  chr20  100  1   118  11   0  130   0        C      G      0.08462  0.04154
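The CVRG, MAJOR, MINOR, and MAF values in these rows can be reproduced from the four base counts. The Python sketch below assumes that MAF is the minor allele's read count divided by CVRG, which is inferred from the example rows rather than stated by the tool. The function name summarize_site is hypothetical.

```python
def summarize_site(counts):
    """Return (coverage, major, minor, maf) from unstranded base counts.

    Assumption: MAF = minor allele reads / total A/C/G/T reads.
    Ties are broken here by base order; the tool picks a tied minor
    allele at random.
    """
    coverage = sum(counts.values())
    ranked = sorted(counts, key=counts.get, reverse=True)
    major, minor = ranked[0], ranked[1]
    maf = counts[minor] / coverage if coverage else None
    return coverage, major, minor, maf


# BLOOD_2 at chr20:99 from the example above.
coverage, major, minor, maf = summarize_site({"A": 82, "C": 44, "G": 0, "T": 1})
print(coverage, major, minor, round(maf, 5))
```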
Site printing and allele tallying requirements
Coverage threshold:
If a coverage threshold is used, the number of reads on each strand must be at or above the threshold. If either strand is below the threshold, the line will be omitted. N.B. this means the total coverage for each printed site will be at least twice the number you give in the "coverage threshold" option. Also, since only simple variants are counted, a site with 100 reads, all supporting a deletion variant, would not be printed.
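The per-strand check described above can be sketched in Python as follows. The function name passes_coverage and the dictionary layout are illustrative assumptions, not the tool's internals.

```python
def passes_coverage(plus_counts, minus_counts, threshold):
    """True only if BOTH strands have at least `threshold` A/C/G/T reads.

    Only simple variants are tallied, so reads supporting indels do not
    contribute to either strand's total.
    """
    plus_total = sum(plus_counts.values())
    minus_total = sum(minus_counts.values())
    return plus_total >= threshold and minus_total >= threshold


plus = {"A": 0, "C": 0, "G": 1, "T": 27}   # 28 reads on the plus strand
minus = {"A": 0, "C": 0, "G": 0, "T": 22}  # 22 reads on the minus strand
print(passes_coverage(plus, minus, 22))  # both strands reach 22
print(passes_coverage(plus, minus, 25))  # minus strand falls short
```

With a threshold of 25, this site has 50 reads in total but is still dropped, because the minus strand alone has only 22.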
Frequency threshold:
If a frequency threshold is used, alleles are only counted (in the ALLELES column) if they meet or exceed this minor allele frequency threshold.
Strand bias:
The alleles passing the threshold on each strand must match (though not in order), or the allele count will be 0. So a site with A, C, G on the plus strand and A, G on the minus strand will get an allele count of zero, though the (strand-independent) major allele, minor allele, and minor allele frequency will still be reported. If there is a tie for the minor allele, one will be randomly chosen.
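A rough Python sketch of the strand-matching rule is shown below. It assumes that each allele's frequency is computed separately on each strand (count divided by that strand's total); the text does not specify this, so treat it as one possible reading. The function name count_alleles is hypothetical.

```python
def count_alleles(plus_counts, minus_counts, freq_threshold):
    """Count alleles that pass the frequency threshold on both strands.

    Assumption: frequency is evaluated per strand. If the sets of
    passing alleles differ between strands, the allele count is 0.
    """
    def passing(strand_counts):
        total = sum(strand_counts.values())
        if total == 0:
            return set()
        return {base for base, n in strand_counts.items()
                if n > 0 and n / total >= freq_threshold}

    plus_set = passing(plus_counts)
    minus_set = passing(minus_counts)
    return len(plus_set) if plus_set == minus_set else 0


# A, C, G pass on the plus strand but only A, G on the minus strand.
plus = {"A": 50, "C": 10, "G": 40, "T": 0}
minus = {"A": 60, "C": 0, "G": 40, "T": 0}
print(count_alleles(plus, minus, 0.05))
```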
Additionally, a measure of strand bias is given in the last column. This is calculated using the method of Guo et al., 2012. A value of "." is given when there is no valid result of the calculation due to a zero denominator. This occurs when there are no reads on one of the strands, or when there is no minor allele.
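For orientation, here is a Python sketch of a strand bias score. It assumes the measure is the SB score from Guo et al., 2012, SB = |b/(a+b) - d/(c+d)| / ((b+d)/(a+b+c+d)), where a and c are the major allele counts on the plus and minus strands and b and d are the minor allele counts. This choice is an assumption; the text names only the paper, not the formula. The function name strand_bias is hypothetical.

```python
def strand_bias(a, b, c, d):
    """SB score, assuming the Guo et al. (2012) SB definition.

    a, c: major allele reads on the plus and minus strands
    b, d: minor allele reads on the plus and minus strands
    Returns None where the tool would print "." (zero denominator).
    """
    try:
        return abs(b / (a + b) - d / (c + d)) / ((b + d) / (a + b + c + d))
    except ZeroDivisionError:
        return None


print(strand_bias(50, 10, 50, 10))  # minor allele balanced across strands
print(strand_bias(10, 0, 10, 0))    # no minor allele
print(strand_bias(0, 0, 10, 5))     # no reads on the plus strand
```

The two zero-denominator cases in the sketch correspond to the situations listed above: an empty strand, or no minor allele reads at all.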