annotate vcf_gff.py @ 6:f201e8c6e004 draft default tip

Uploaded
author ben-warren
date Mon, 07 Jul 2014 19:28:17 -0400
parents b321e0517be3
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
1
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
1 #!/usr/local/bin/python2.6
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
2 """
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
3 Usage vcf_gff.py <vcf_file input> <gff_file output>
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
4
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
5 Input vcf file format:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
6 CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
7
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
8 Note: Generating vcf from a single merged bam file, using multiple bam files results in multiple FORMAT columns!!!
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
9
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
10 Output gff format:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
11 SEQID SOURCE TYPE START END SCORE STRAND PHASE ATTRIBUTES
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
12
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
13 """
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
14 #Copyright 2012 Susan Thomson
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
15 #New Zealand Institute for Plant and Food Research
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
16 #This program is free software: you can redistribute it and/or modify
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
17 # it under the terms of the GNU General Public License as published by
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
18 # the Free Software Foundation, either version 3 of the License, or
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
19 # (at your option) any later version.
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
20 #
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
21 # This program is distributed in the hope that it will be useful,
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
22 # but WITHOUT ANY WARRANTY; without even the implied warranty of
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
23 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
24 # GNU General Public License for more details.
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
25 #
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
26 # You should have received a copy of the GNU General Public License
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
27 # along with this program. If not, see <http://www.gnu.org/licenses/>.
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
28
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
29
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
30 import sys
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
31
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
32 in_vcf_file = open(sys.argv[1], 'r')
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
33 out_gff_file = open(sys.argv[2], 'w')
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
34
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
35 def get_info(attribute_input):
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
36 """
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
37 Get the record_type by determining if INDEL is stated in column INFO; get
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
38 raw read count
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
39 """
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
40 INFO = {}
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
41 rec = attribute_input.split(";")
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
42 if "INDEL" in rec:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
43 record_type = "INDEL"
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
44 rec = rec[1:]
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
45 else:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
46 record_type = "SNP"
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
47 for entry in rec:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
48 detail = entry.split("=")
5
b321e0517be3 Uploaded
ben-warren
parents: 3
diff changeset
49 if len(detail) < 2:
b321e0517be3 Uploaded
ben-warren
parents: 3
diff changeset
50 continue
1
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
51 INFO[detail[0]] = detail[1]
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
52 if INFO.has_key("DP"):
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
53 reads = INFO.get("DP")
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
54 else:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
55 reads = "NA"
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
56 data = (record_type, reads)
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
57 return data
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
58
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
59 def get_gen(formatcols, ref):
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
60 """
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
61 Get info on heterozyosity or homozygosity (could be useful later),
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
62 by estimating genotype(GT) calling ref allele=0, variant allele=1
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
63
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
64 """
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
65 formats = []
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
66 sample_dict = {}
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
67 genos = ""
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
68 format_types = formatcols[0].split(":")
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
69 samples = formatcols[1:]
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
70 for entry in format_types:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
71 formats.append(entry)
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
72 for sample in samples:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
73 var = ""
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
74 data = sample.split(":")
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
75 sample_dict = dict(zip(formats, data))
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
76 if sample_dict.has_key("DP"):
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
77 reads = sample_dict.get("DP")
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
78 else:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
79 reads = "NA"
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
80 if sample_dict.has_key("NF"):
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
81 """
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
82 polysnp tool output
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
83 """
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
84 freq = sample_dict.get("NF")
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
85 variants = freq.split("|")
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
86 if int(variants[0]) >= 1:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
87 var += "A"
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
88 if int(variants[1]) >= 1:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
89 var += "C"
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
90 if int(variants[2]) >= 1:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
91 var += "G"
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
92 if int(variants[3]) >= 1:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
93 var += "T"
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
94 if var == "":
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
95 gen = "NA"
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
96 elif var == ref:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
97 gen = "HOM_ref"
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
98 elif len(var) >= 2:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
99 gen = "HET"
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
100 else:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
101 gen = "HOM_mut"
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
102 elif sample_dict.has_key("GT"):
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
103 """
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
104 mpileup output, recommend ALWAYS have GT, note this only good scoring for diploids too!
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
105 """
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
106 genotypes = sample_dict.get("GT")
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
107 if genotypes == "1/1":
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
108 gen = "HOM_mut"
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
109 if genotypes == "0/1":
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
110 gen = "HET"
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
111 if genotypes == "0/0":
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
112 gen = "HOM_ref"
5
b321e0517be3 Uploaded
ben-warren
parents: 3
diff changeset
113 try: # set gen to 'NA' if still unset
b321e0517be3 Uploaded
ben-warren
parents: 3
diff changeset
114 gen
b321e0517be3 Uploaded
ben-warren
parents: 3
diff changeset
115 except NameError:
1
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
116 gen = "NA"
3
402c3f0fe807 Uploaded revised vcf_gff.py from Github to fix bug
john-mccallum
parents: 1
diff changeset
117 geno = ("%s:%s " % (reads, gen))
1
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
118 genos += geno
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
119 sample_dict = {}
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
120 return genos
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
121
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
122 attributes = {}
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
123 """
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
124 Get relevant info from vcf file and put to proper gff columns
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
125 """
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
126
3
402c3f0fe807 Uploaded revised vcf_gff.py from Github to fix bug
john-mccallum
parents: 1
diff changeset
127 out_gff_file.write("#gff-version 3\n")
1
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
128 for line in in_vcf_file:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
129 if line.startswith("#") == False:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
130 info = line.split()
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
131 seqid = info[0].strip()
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
132 source = "SAMTOOLS"
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
133 start = int(info[1])
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
134 score = info[5]
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
135 strand = "."
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
136 phase = "."
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
137 reference = info[3].strip()
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
138 variant = info[4].strip()
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
139 attr = get_info(info[7])
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
140 record_type = attr[0]
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
141 reads = attr[1]
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
142 if record_type == "INDEL":
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
143 end = start + len(reference)
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
144 else:
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
145 end = start
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
146 gen = get_gen(info[8:], reference)
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
147 out_gff_file.write(
5
b321e0517be3 Uploaded
ben-warren
parents: 3
diff changeset
148 ("%s\t%s\t%s\t%d\t%d\t%s\t%s\t%s\tID=%s:%s:%s:%d;Variant" +
3
402c3f0fe807 Uploaded revised vcf_gff.py from Github to fix bug
john-mccallum
parents: 1
diff changeset
149 "_seq=%s;Reference_seq=%s;Total_reads=%s;Zygosity=%s\n") %
1
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
150 ( seqid, source,record_type, start, end, score, strand, phase,seqid,
5
b321e0517be3 Uploaded
ben-warren
parents: 3
diff changeset
151 source, record_type, start, variant, reference, reads, gen))
b321e0517be3 Uploaded
ben-warren
parents: 3
diff changeset
152
1
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
153 out_gff_file.close()
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
154
a0689dc29b7f Updated vcf to gff conversion tool
john-mccallum
parents:
diff changeset
155