annotate allele-counts.xml @ 4:898eb3daab43

Complete documentation
author nick
date Tue, 04 Jun 2013 00:16:29 -0400
parents 933a9435939c
children 31361191d2d2
<tool id="allele_counts_1" version="1.0" name="Count alleles">
  <description>and minor allele frequencies</description>
  <command interpreter="python">allele-counts.py -i $input -o $output -f $freq -c $covg $header</command>
  <inputs>
    <param name="input" type="data" format="vcf" label="Input variants from Naive Variant Detector"/>
    <param name="freq" type="float" value="1.0" min="0" max="100" label="Minor allele frequency threshold (in percent)"/>
    <param name="covg" type="integer" value="10" min="0" label="Coverage threshold (per strand)"/>
    <param name="header" type="boolean" truevalue="-H" falsevalue="" checked="True" label="Write header line" />
  </inputs>
  <outputs>
    <data name="output" format="tabular"/>
  </outputs>
  <stdio>
    <exit_code range="1:" err_level="fatal"/>
    <exit_code range=":-1" err_level="fatal"/>
  </stdio>

  <help>

.. class:: infomark

**What it does**

This tool parses variant counts from a special VCF file (normally the output of the **Naive Variant Detector** tool). It counts simple (ACGT) variants, calculates numbers of alleles, and calculates minor allele frequency. It applies filters based on coverage, strand bias, and minor allele frequency cutoffs.

-----

.. class:: warningmark

**Note**

The VCF must have a certain genotype field in the sample columns, giving the read count of each type of variant. Also, the variant data **must be stranded**. The **Naive Variant Detector** tool produces this type of VCF.

-----

.. class:: infomark

**Output columns**

Each row represents one site in one sample. 12 fields give information about that site::

    1.  SAMPLE - Sample names (from VCF sample column labels)
    2.  CHR - Chromosome of the site
    3.  POS - Chromosomal coordinate of the site
    4.  A - Number of reads supporting an 'A'
    5.  C - ditto, for 'C'
    6.  G - ditto, for 'G'
    7.  T - ditto, for 'T'
    8.  CVRG - Total (number of reads supporting one of the four bases above)
    9.  ALLELES - Number of qualifying alleles
    10. MAJOR - Major allele base
    11. MINOR - Minor allele base (2nd most prevalent variant)
    12. MINOR.FREQ.PERC. - Frequency of minor allele

**Example**

Below is the header line, followed by some example data lines. Note that samples and/or sites that fall below the coverage threshold are omitted from the output::

    #SAMPLE  CHR    POS  A   C    G    T  CVRG  ALLELES  MAJOR  MINOR  MINOR.FREQ.PERC.
    BLOOD_1  chr20  99   0   101  1    2  104   1        C      T      0.01923
    BLOOD_2  chr20  99   82  44   0    1  127   2        A      C      0.34646
    BLOOD_3  chr20  99   0   110  1    0  111   1        C      G      0.009
    BLOOD_1  chr20  100  3   5    100  0  108   1        G      C      0.0463
    BLOOD_3  chr20  100  1   118  11   0  130   0        C      G      0.08462
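
As a rough check on the example rows, the last column appears to equal the minor allele's read count divided by CVRG, so the values look like fractions rather than percentages. The sketch below is an assumption inferred from the example values, not the tool's own code; `minor_fraction` is a hypothetical helper name.

```python
def minor_fraction(counts):
    """Return (major, minor, fraction) from a dict of base -> read count.

    Hypothetical helper: the fraction is the second-highest count divided
    by the total of the four base counts (the CVRG column).
    """
    total = sum(counts.values())
    # Sort bases by descending count; ties are not handled specially here.
    ranked = sorted(counts, key=counts.get, reverse=True)
    major, minor = ranked[0], ranked[1]
    return major, minor, round(counts[minor] / total, 5)


# First example row: BLOOD_1 chr20 99 with A=0, C=101, G=1, T=2
print(minor_fraction({"A": 0, "C": 101, "G": 1, "T": 2}))  # ('C', 'T', 0.01923)
```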

-----

.. class:: warningmark

**Site printing and allele tallying requirements**

Each line is printed only when the site is covered by the threshold number of reads **on each strand**. If coverage of either strand is below the threshold, the line (sample + site combination) is omitted.

**N.B.**: This means the total coverage for each printed site will be at least twice the number you give in the "coverage threshold" option.

Also, reads supporting a variant outside the canonical 4 nucleotides will not count towards the coverage requirement. For instance, a site/sample line with 100x coverage, all of which support a deletion variant, will not be printed.
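
The coverage rule above can be sketched as follows. This is only an illustration: the function name `passes_coverage`, the per-strand dictionaries, and the `"d1"` label for a deletion are assumptions, and the actual parsing done by allele-counts.py is not shown.

```python
CANONICAL = ("A", "C", "G", "T")


def passes_coverage(plus_counts, minus_counts, threshold):
    """Return True if both strands reach the coverage threshold.

    Only reads supporting A, C, G, or T are counted; other variants
    (for example deletions) are ignored, as described above.
    """
    plus_total = sum(plus_counts.get(base, 0) for base in CANONICAL)
    minus_total = sum(minus_counts.get(base, 0) for base in CANONICAL)
    return plus_total >= threshold and minus_total >= threshold


# 100x coverage that only supports a deletion ("d1" is a made-up label):
print(passes_coverage({"d1": 50}, {"d1": 50}, 10))  # False
# 12 canonical reads on each strand at a threshold of 10:
print(passes_coverage({"A": 12}, {"A": 10, "C": 2}, 10))  # True
```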

Alleles are only counted (in column 9) if they meet or exceed the minor allele frequency threshold. For example, a site/sample line with three types of variants (96% A, 3.3% C, and 0.7% G) will count as 2 alleles at a 1% threshold.
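
A minimal sketch of this counting rule, ignoring strands for simplicity. The function name `count_alleles` and the read counts are hypothetical; they are chosen only to reproduce the percentages in the example above.

```python
def count_alleles(counts, min_freq_percent):
    """Count bases whose frequency meets or exceeds the threshold (in percent)."""
    total = sum(counts.values())
    if total == 0:
        return 0
    return sum(
        1 for n in counts.values()
        if 100.0 * n / total >= min_freq_percent
    )


# 96% A, 3.3% C, 0.7% G at a 1% threshold
print(count_alleles({"A": 960, "C": 33, "G": 7, "T": 0}, 1.0))  # 2
```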

Strand bias: the alleles passing the threshold on each strand have to match (though not necessarily in the same order). Otherwise, the allele count will be 0. For example, a site/sample line whose + strand shows 70% A, 27% C, and 3% G, and whose - strand shows 70% A and 30% C, will have an allele count of 0. The minor allele and minor allele frequency, though, will always be reported\*.

In this version, however, the strands are not required to show similar allele frequencies, as long as the same alleles pass the threshold on both strands.
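
The strand-bias rule can be sketched like this. It is a simplified illustration under the assumption that each strand is filtered independently and the resulting allele sets are compared; the function names are hypothetical and not taken from allele-counts.py.

```python
def alleles_passing(counts, min_freq_percent):
    """Return the set of bases at or above the frequency threshold on one strand."""
    total = sum(counts.values())
    if total == 0:
        return set()
    return {
        base for base, n in counts.items()
        if 100.0 * n / total >= min_freq_percent
    }


def stranded_allele_count(plus_counts, minus_counts, min_freq_percent):
    """Return the allele count, or 0 when the two strands disagree."""
    plus = alleles_passing(plus_counts, min_freq_percent)
    minus = alleles_passing(minus_counts, min_freq_percent)
    # Sets ignore order, so only the identity of the alleles must match.
    return len(plus) if plus == minus else 0


# + strand: 70% A, 27% C, 3% G; - strand: 70% A, 30% C (1% threshold)
print(stranded_allele_count({"A": 70, "C": 27, "G": 3}, {"A": 70, "C": 30}, 1.0))  # 0
# Same alleles on both strands but different frequencies: still counted
print(stranded_allele_count({"A": 90, "C": 10}, {"A": 60, "C": 40}, 1.0))  # 2
```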

\*One specific case does affect the reported minor allele identity and frequency: if there is a tie for the minor allele (between the 2nd and 3rd most common alleles), the minor allele will be reported as 'N', and the frequency as 0.0.
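
The tie rule might look like the sketch below. The function name `minor_allele` is hypothetical, the input is assumed to hold counts for all four bases, and treating a tie at zero reads the same way as any other tie is an assumption rather than documented behavior.

```python
def minor_allele(counts):
    """Return (minor_base, frequency), or ('N', 0.0) on a tie for 2nd place."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    second, third = ranked[1], ranked[2]
    if second[1] == third[1]:
        # 2nd and 3rd most common alleles are tied: minor allele is ambiguous.
        # Assumption: a tie at zero reads is reported the same way.
        return "N", 0.0
    return second[0], second[1] / total


print(minor_allele({"A": 90, "C": 5, "G": 5, "T": 0}))  # ('N', 0.0)
print(minor_allele({"A": 90, "C": 8, "G": 2, "T": 0}))  # ('C', 0.08)
```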

  </help>

</tool>