annotate cage_scan_clustering.py @ 0:d1d0ee366702 draft default tip

Uploaded first version
author brenninc
date Wed, 11 May 2016 04:53:30 -0400
parents
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
1 """
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
2 .. module:: cage_scan_clustering
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
3 :platform: Unix
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
4 :synopsis: cluster cage scan reads based on arbitrary distance. It assumes that the reads are sorted by chrom, TSS and strand.
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
5 :version: 1.0
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
6
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
7 .. moduleauthor:: Mickael Mendez <mendez.mickael@gmail.com>
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
8
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
9 .. source: https://github.com/mmendez12/umicount
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
10 """
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
11
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
12 import csv
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
13 import argparse
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
14
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
15
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
16 import bed12
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
17
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
18 #TODO: rewrite tests
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
19 def print_read_to_bed12(reads):
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
20 """ Merge the reads by blocks and print a single read in the BED12 format on stdout.
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
21 It assumes that the reads are on the same TSS and contains
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
22 fingerprint information in the read's name.
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
23
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
24 Args:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
25 reads: A list of reads
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
26
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
27 """
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
28 block_sizes, block_starts = bed12.merge_overlapping_blocks(reads)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
29
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
30 #bed12
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
31 first_read = sorted(reads, key=bed12.get_start)[0]
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
32 chrom = bed12.get_chrom(first_read)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
33 start = bed12.get_start(first_read)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
34 end = start + block_starts[-1] + block_sizes[-1]
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
35
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
36 score = len(reads)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
37
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
38 strand = bed12.get_strand(first_read)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
39
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
40 if strand == '+':
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
41 thick_start = start
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
42 thick_end = start + block_sizes[0]
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
43 else:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
44 thick_start = end - block_sizes[-1]
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
45 thick_end = end
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
46
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
47 color = "255,0,0"
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
48 block_count = len(block_sizes)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
49 block_sizes = ','.join(map(str, block_sizes))
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
50 block_starts = ','.join(map(str, block_starts))
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
51
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
52 name = map(str, [chrom, start, end, strand])
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
53 name = ":".join(name)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
54
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
55 output = [chrom, start, end, name, score, strand, thick_start, thick_end,
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
56 color, block_count, block_sizes, block_starts]
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
57
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
58 output_str = map(str, output)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
59 print '\t'.join(output_str)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
60
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
61
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
62 def overlapping_reads(reads, distance):
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
63 """returns all the overlapping reads within a given distance"""
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
64
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
65 reads_list = []
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
66 cur_tss = 0
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
67 cur_chrom = ''
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
68
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
69 for read in reads:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
70
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
71 if not cur_tss:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
72 cur_tss = bed12.get_tss(read)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
73 reads_list.append(read)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
74 cur_chrom = bed12.get_chrom(read)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
75 continue
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
76
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
77
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
78 tss = bed12.get_tss(read)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
79 chrom = bed12.get_chrom(read)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
80
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
81 #if not overlap
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
82 if (tss - cur_tss > distance) or (chrom != cur_chrom):
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
83 yield reads_list
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
84 reads_list = [read]
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
85 cur_tss = tss
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
86 cur_chrom = chrom
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
87 else:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
88 reads_list.append(read)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
89 cur_tss = tss
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
90
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
91 yield reads_list
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
92
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
93
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
94 def main():
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
95
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
96 parser = argparse.ArgumentParser()
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
97 parser.add_argument('bed_file', help='input file')
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
98 parser.add_argument('-t', '--tag_distance', default=20, type=int, help='cluster all the cage tags at distance d')
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
99
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
100 args = parser.parse_args()
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
101
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
102 with open(args.bed_file) as bedfile:
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
103
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
104 reader = csv.reader(bedfile, delimiter='\t')
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
105 reads = (line for line in reader)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
106
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
107 #for each reads on the same tss
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
108 for read_list in overlapping_reads(reads, args.tag_distance):
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
109 print_read_to_bed12(read_list)
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
110
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
111
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
112 if __name__ == '__main__':
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
113 main()
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
114
d1d0ee366702 Uploaded first version
brenninc
parents:
diff changeset
115 #TODO: combine this script with fingerprint.py