annotate dedup_barcode_fingerprint.py @ 0:d1d0ee366702 draft default tip

Uploaded first version
author brenninc
date Wed, 11 May 2016 04:53:30 -0400
parents
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
1 """
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
2 .. module:: fingerprint
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
3 :platform: Unix
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
4 :synopsis: Use UMI to count transcripts
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
5 :version: 1.0
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
6
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
7 .. moduleauthor:: Mickael Mendez <mendez.mickael@gmail.com>
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
8
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
9 .. source: https://github.com/mmendez12/umicount
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
10
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
11 """
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
12
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
13 import csv
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
14 import itertools
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
15 import subprocess
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
16 import argparse
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
17 import tempfile
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
18 import os
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
19 import shutil
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
20 from collections import defaultdict
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
21
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
22 import bed12
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
23
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
24
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
25 def get_fingerprint(read):
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
26 """Get fingerprint id from the read's name. It assumes that the read's name
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
27 contains the following pattern *FP:XXX;* where *XXX* is the fingerprint id.
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
28
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
29 Args:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
30 read: A list of twelve elements where each element refers to a field in the BED format.
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
31
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
32 Returns:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
33 A string containing the fingerprint id
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
34
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
35 >>> read = ['chrX', '100', '200', 'BC:ATGC;FP:0012', '12', '+', '100', '110', '255,0,0', '2', '21,25', '0,75']
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
36 >>> get_fingerprint(read)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
37 '0012'
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
38 """
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
39 return read[3].split('FP:')[1].split(';')[0]
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
40
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
41
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
42 def get_barcode(read):
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
43 """Get barcode from the read's name. It assumes that the read's name
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
44 contains the following pattern *BC:XXX;* where *XXX* is the actual barcode.
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
45
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
46 Args:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
47 read: A list of twelve elements where each element refers to a field in the BED format.
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
48
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
49 Returns:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
50 A string containing the barcode
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
51
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
52 >>> read = ['chrX', '100', '200', 'BC:ATGC;FP:0012', '12', '+', '100', '110', '255,0,0', '2', '21,25', '0,75']
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
53 >>> get_barcode(read)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
54 'ATGC'
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
55 """
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
56 return read[3].split('BC:')[1].split(';')[0]
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
57
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
58
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
59 def print_read_to_bed12(key, reads):
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
60 """ Merge the reads by blocks and print a single read in the BED12 format on stdout.
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
61 It assumes that the reads are on the same TSS and contains
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
62 barcode and fingerprint information in the read's name.
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
63
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
64 Args:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
65 key: A tuple that contain the chromosome, barcode and fingerprint information.
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
66
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
67 reads: A list of reads (in a list) from the same TSS, that have similar barcode and fingerprint.
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
68
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
69 >>> reads = []
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
70 >>> reads.append(['chrX', '100', '200', 'BC:AAA;FP:0012', '12', '+', '100', '110', '255,0,0', '2', '20,25', '0,75'])
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
71 >>> reads.append(['chrX', '100', '300', 'BC:AAA;FP:0012', '12', '+', '100', '110', '255,0,0', '3', '20,25', '0,175'])
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
72 >>> print_read_to_bed12(('chrX', 'AAA', '0012'), reads) #doctest: +NORMALIZE_WHITESPACE
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
73 chrX 100 300 BC:AAA;FP:0012 2 + 100 120 255,0,0 3 20,25,25 0,75,175
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
74 """
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
75 block_sizes, block_starts = bed12.merge_overlapping_blocks(reads)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
76
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
77 #bed12
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
78 first_read = sorted(reads, key = bed12.get_start)[0]
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
79 chrom, barcode, fingerprint = key
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
80 start = bed12.get_start(first_read)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
81 end = start + block_starts[-1] + block_sizes[-1]
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
82 name = "BC:{0};FP:{1}".format(barcode, fingerprint)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
83 score = len(reads)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
84
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
85 strand = bed12.get_strand(first_read)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
86
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
87 if strand == '+':
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
88 thick_start = start
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
89 thick_end = start + block_sizes[0]
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
90 else:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
91 thick_start = end - block_sizes[-1]
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
92 thick_end = end
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
93
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
94 color = "255,0,0"
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
95 block_count = len(block_sizes)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
96 block_sizes = ','.join(map(str, block_sizes))
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
97 block_starts = ','.join(map(str, block_starts))
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
98
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
99 output = [chrom, start, end, name, score, strand, thick_start, thick_end,
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
100 color, block_count, block_sizes, block_starts]
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
101
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
102 output_str = map(str, output)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
103 print '\t'.join(output_str)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
104
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
105
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
106 def main():
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
107
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
108 #PARSER TODO: move this code somewhere else
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
109 parser = argparse.ArgumentParser()
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
110 group = parser.add_mutually_exclusive_group(required=True)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
111 group.add_argument("-d", "--directory", help="absolute path of the folder containing the bed files")
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
112 group.add_argument("-f", "--file", help="a bed file")
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
113 parser.add_argument("-o", help='name of the output file. Only works if the script is called with the -f option, \
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
114 ignored otherwise.')
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
115
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
116 args = parser.parse_args()
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
117
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
118 if args.directory:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
119 path, folder, files = os.walk(args.directory).next()
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
120 elif args.file:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
121 path = ''
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
122 files = [args.file]
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
123 #ENDPARSER
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
124
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
125 #create a temporary directory
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
126 tmp_dir = tempfile.mkdtemp()
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
127
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
128 plus_strand_tmp_file = open(os.path.join(tmp_dir, '+'), 'w')
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
129 minus_strand_tmp_file = open(os.path.join(tmp_dir, '-'), 'w')
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
130 plus_and_minus_sorted_path = os.path.join(tmp_dir, '+-s')
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
131
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
132 #creates two temporary bed files containing either reads on the plus or minus strand
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
133 for bed_file in files:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
134
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
135 with open(os.path.join(path, bed_file)) as bed_file:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
136 reader = csv.reader(bed_file, delimiter='\t')
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
137
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
138 for read in reader:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
139 strand = bed12.get_strand(read)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
140 if strand == '+':
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
141 plus_strand_tmp_file.write('\t'.join(read) + '\n')
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
142 elif strand == '-':
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
143 minus_strand_tmp_file.write('\t'.join(read) + '\n')
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
144
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
145
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
146 #close the files
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
147 plus_strand_tmp_file.close()
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
148 minus_strand_tmp_file.close()
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
149
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
150 #call unix sort on the file containing reads on the plus strand by tss
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
151 with open(os.path.join(tmp_dir, '+sorted'), "w") as outfile:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
152 subprocess.call(["sort", '-k2,2n', os.path.join(tmp_dir, '+')], stdout=outfile)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
153
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
154 #call unix sort on the file containing reads on the minus strand by tss
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
155 with open(os.path.join(tmp_dir, '-sorted'), "w") as outfile:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
156 subprocess.call(["sort", '-k3,3n', os.path.join(tmp_dir, '-')], stdout=outfile)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
157
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
158 #concatenate the files sorted by tss
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
159 with open(plus_and_minus_sorted_path, "w") as outfile:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
160 subprocess.call(['cat', os.path.join(tmp_dir, '+sorted'), os.path.join(tmp_dir, '-sorted')], stdout=outfile)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
161
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
162 with open(plus_and_minus_sorted_path) as bedfile:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
163 reader = csv.reader(bedfile, delimiter='\t')
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
164 reads = (line for line in reader)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
165
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
166 #for each reads on the same tss
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
167 for tss, reads in itertools.groupby(reads, bed12.get_tss):
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
168 d = defaultdict(list)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
169
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
170 #group the reads by chr, barcode and fingerprint
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
171 for read in reads:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
172 key = (bed12.get_chrom(read), get_barcode(read), get_fingerprint(read))
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
173 d[key].append(read)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
174
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
175 #merge and print the reads that have similar tss, barcode and fingerprint
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
176 for key, reads in d.iteritems():
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
177 print_read_to_bed12(key, reads)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
178
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
179 shutil.rmtree(tmp_dir)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
180
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
181
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
182 if __name__ == '__main__':
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
183 main()
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
184
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
185 #TODO: combine this script with dedup_fingerprint