annotate rDiff/tools/GFFParser.py @ 3:29a698dc5c7e default tip

Merge multiple heads.
author Dave Bouvier <dave@bx.psu.edu>
date Mon, 27 Jan 2014 14:15:36 -0500
parents 233c30f91d66
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
2
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
1 #!/usr/bin/env python
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
2 """
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
3 Extract genome annotation from a GFF (a tab delimited format for storing sequence features and annotations) file.
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
4
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
5 Requirements:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
6 Numpy :- http://numpy.org/
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
7 Scipy :- http://scipy.org/
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
8
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
9 Copyright (C)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
10
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
11 2009-2012 Friedrich Miescher Laboratory of the Max Planck Society, Tubingen, Germany.
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
12 2012-2013 Memorial Sloan-Kettering Cancer Center, New York City, USA.
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
13 """
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
14
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
15 import re
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
16 import os
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
17 import sys
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
18 import urllib
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
19 import numpy as np
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
20 import scipy.io as sio
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
21 from collections import defaultdict
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
22 import helper as utils
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
23
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
24 def _attribute_tags(col9):
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
25 """
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
26 Split the key-value tags from the attribute column, it takes column number 9 from GTF/GFF file
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
27 """
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
28 info = defaultdict(list)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
29 is_gff = False
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
30
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
31 if not col9:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
32 return is_gff, info
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
33
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
34 # trim the line ending semi-colon ucsc may have some white-space
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
35 col9 = col9.rstrip(';| ')
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
36 # attributes from 9th column
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
37 atbs = col9.split(" ; ")
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
38 if len(atbs) == 1:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
39 atbs = col9.split("; ")
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
40 if len(atbs) == 1:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
41 atbs = col9.split(";")
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
42 # check the GFF3 pattern which has key value pairs like:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
43 gff3_pat = re.compile("\w+=")
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
44 # sometime GTF have: gene_id uc002zkg.1;
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
45 gtf_pat = re.compile("\s?\w+\s")
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
46
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
47 key_vals = []
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
48
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
49 if gff3_pat.match(atbs[0]): # gff3 pattern
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
50 is_gff = True
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
51 key_vals = [at.split('=') for at in atbs]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
52 elif gtf_pat.match(atbs[0]): # gtf pattern
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
53 for at in atbs:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
54 key_vals.append(at.strip().split(" ",1))
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
55 else:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
56 # to handle attribute column has only single value
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
57 key_vals.append(['ID', atbs[0]])
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
58 # get key, val items
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
59 for item in key_vals:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
60 key, val = item
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
61 # replace the double qoutes from feature identifier
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
62 val = re.sub('"', '', val)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
63 # replace the web formating place holders to plain text format
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
64 info[key].extend([urllib.unquote(v) for v in val.split(',') if v])
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
65
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
66 return is_gff, info
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
67
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
68 def _spec_features_keywd(gff_parts):
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
69 """
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
70 Specify the feature key word according to the GFF specifications
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
71 """
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
72 for t_id in ["transcript_id", "transcriptId", "proteinId"]:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
73 try:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
74 gff_parts["info"]["Parent"] = gff_parts["info"][t_id]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
75 break
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
76 except KeyError:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
77 pass
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
78 for g_id in ["gene_id", "geneid", "geneId", "name", "gene_name", "genename"]:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
79 try:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
80 gff_parts["info"]["GParent"] = gff_parts["info"][g_id]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
81 break
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
82 except KeyError:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
83 pass
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
84 ## TODO key words
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
85 for flat_name in ["Transcript", "CDS"]:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
86 if gff_parts["info"].has_key(flat_name):
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
87 # parents
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
88 if gff_parts['type'] in [flat_name] or re.search(r'transcript', gff_parts['type'], re.IGNORECASE):
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
89 if not gff_parts['id']:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
90 gff_parts['id'] = gff_parts['info'][flat_name][0]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
91 #gff_parts["info"]["ID"] = [gff_parts["id"]]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
92 # children
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
93 elif gff_parts["type"] in ["intron", "exon", "pseudogenic_exon", "three_prime_UTR",
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
94 "coding_exon", "five_prime_UTR", "CDS", "stop_codon",
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
95 "start_codon"]:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
96 gff_parts["info"]["Parent"] = gff_parts["info"][flat_name]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
97 break
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
98 return gff_parts
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
99
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
100 def Parse(ga_file):
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
101 """
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
102 Parsing GFF/GTF file based on feature relationship, it takes the input file.
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
103 """
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
104 child_map = defaultdict(list)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
105 parent_map = dict()
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
106
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
107 ga_handle = utils._open_file(ga_file)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
108
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
109 for rec in ga_handle:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
110 rec = rec.strip('\n\r')
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
111
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
112 # skip empty line fasta identifier and commented line
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
113 if not rec or rec[0] in ['#', '>']:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
114 continue
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
115 # skip the genome sequence
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
116 if not re.search('\t', rec):
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
117 continue
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
118
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
119 parts = rec.split('\t')
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
120 assert len(parts) >= 8, rec
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
121
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
122 # process the attribute column (9th column)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
123 ftype, tags = _attribute_tags(parts[-1])
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
124 if not tags: # skip the line if no attribute column.
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
125 continue
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
126
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
127 # extract fields
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
128 if parts[1]:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
129 tags["source"] = parts[1]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
130 if parts[7]:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
131 tags["phase"] = parts[7]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
132
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
133 gff_info = dict()
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
134 gff_info['info'] = dict(tags)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
135 #gff_info["is_gff3"] = ftype
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
136 gff_info['chr'] = parts[0]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
137
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
138 if parts[3] and parts[4]:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
139 gff_info['location'] = [int(parts[3]) ,
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
140 int(parts[4])]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
141 gff_info['type'] = parts[2]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
142 gff_info['id'] = tags.get('ID', [''])[0]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
143 if parts[6] in ['?', '.']:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
144 parts[6] = None
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
145 gff_info['strand'] = parts[6]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
146
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
147 # key word according to the GFF spec.
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
148 if not ftype:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
149 gff_info = _spec_features_keywd(gff_info)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
150
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
151 # link the feature relationships
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
152 if gff_info['info'].has_key('Parent'):
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
153 for p in gff_info['info']['Parent']:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
154 if p == gff_info['id']:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
155 gff_info['id'] = ''
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
156 break
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
157 rec_category = 'child'
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
158 elif gff_info['id']:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
159 rec_category = 'parent'
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
160 else:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
161 rec_category = 'record'
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
162
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
163 # depends on the record category organize the features
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
164 if rec_category == 'child':
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
165 for p in gff_info['info']['Parent']:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
166 # create the data structure based on source and feature id
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
167 child_map[(gff_info['chr'], gff_info['info']['source'], p)].append(
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
168 dict( type = gff_info['type'],
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
169 location = gff_info['location'],
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
170 strand = gff_info['strand'],
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
171 ID = gff_info['id'],
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
172 gene_id = gff_info['info'].get('GParent', '')
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
173 ))
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
174 elif rec_category == 'parent':
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
175 parent_map[(gff_info['chr'], gff_info['info']['source'], gff_info['id'])] = dict(
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
176 type = gff_info['type'],
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
177 location = gff_info['location'],
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
178 strand = gff_info['strand'],
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
179 name = tags.get('Name', [''])[0])
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
180 elif rec_category == 'record':
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
181 #TODO how to handle plain records?
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
182 c = 1
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
183 ga_handle.close()
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
184
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
185 # depends on file type create parent feature
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
186 if not ftype:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
187 parent_map, child_map = _create_missing_feature_type(parent_map, child_map)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
188
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
189 # connecting parent child relations
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
190 # // essentially the parent child features are here from any type of GTF/GFF2/GFF3 file
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
191 gene_mat = _format_gene_models(parent_map, child_map)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
192
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
193 return gene_mat
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
194
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
195 def _format_gene_models(parent_nf_map, child_nf_map):
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
196 """
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
197 Genarate GeneObject based on the parsed file contents
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
198
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
199 parent_map: parent features with source and chromosome information
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
200 child_map: transctipt and exon information are encoded
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
201 """
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
202 g_cnt = 0
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
203 gene = np.zeros((len(parent_nf_map),), dtype = utils.init_gene_DE())
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
204
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
205 for pkey, pdet in parent_nf_map.items():
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
206 # considering only gene features
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
207 if not re.search(r'gene', pdet.get('type', '')):
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
208 continue
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
209 # infer the gene start and stop if not there in the
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
210 if not pdet.get('location', []):
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
211 GNS, GNE = [], []
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
212 # multiple number of transcripts
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
213 for L1 in child_nf_map[pkey]:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
214 GNS.append(L1.get('location', [])[0])
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
215 GNE.append(L1.get('location', [])[1])
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
216 GNS.sort()
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
217 GNE.sort()
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
218 pdet['location'] = [GNS[0], GNE[-1]]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
219 orient = pdet.get('strand', '')
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
220
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
221 gene[g_cnt]['id'] = g_cnt +1
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
222 gene[g_cnt]['chr'] = pkey[0]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
223 gene[g_cnt]['source'] = pkey[1]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
224 gene[g_cnt]['name'] = pkey[-1]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
225 gene[g_cnt]['start'] = pdet.get('location', [])[0]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
226 gene[g_cnt]['stop'] = pdet.get('location', [])[1]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
227 gene[g_cnt]['strand'] = orient
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
228
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
229 # default value
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
230 gene[g_cnt]['is_alt_spliced'] = 0
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
231 if len(child_nf_map[pkey]) > 1:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
232 gene[g_cnt]['is_alt_spliced'] = 1
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
233
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
234 # complete sub-feature for all transcripts
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
235 dim = len(child_nf_map[pkey])
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
236 TRS = np.zeros((dim,), dtype=np.object)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
237 EXON = np.zeros((dim,), dtype=np.object)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
238
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
239 # fetching corresponding transcripts
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
240 for xq, Lv1 in enumerate(child_nf_map[pkey]):
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
241
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
242 TID = Lv1.get('ID', '')
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
243 TRS[xq]= np.array([TID])
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
244
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
245 orient = Lv1.get('strand', '')
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
246
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
247 # fetching different sub-features
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
248 child_feat = defaultdict(list)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
249 for Lv2 in child_nf_map[(pkey[0], pkey[1], TID)]:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
250 E_TYP = Lv2.get('type', '')
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
251 child_feat[E_TYP].append(Lv2.get('location'))
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
252
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
253 # make exon coordinate from cds and utr regions
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
254 if not child_feat.get('exon'):
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
255 if child_feat.get('CDS'):
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
256 exon_cod = utils.make_Exon_cod( orient,
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
257 NonetoemptyList(child_feat.get('five_prime_UTR')),
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
258 NonetoemptyList(child_feat.get('CDS')),
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
259 NonetoemptyList(child_feat.get('three_prime_UTR')))
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
260 child_feat['exon'] = exon_cod
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
261 else:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
262 # searching through keys to find a pattern describing exon feature
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
263 ex_key_pattern = [k for k in child_feat if k.endswith("exon")]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
264 child_feat['exon'] = child_feat[ex_key_pattern[0]]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
265 # TODO only UTR's
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
266
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
267 # make general ascending order of coordinates
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
268 if orient == '-':
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
269 for etype, excod in child_feat.items():
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
270 if len(excod) > 1:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
271 if excod[0][0] > excod[-1][0]:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
272 excod.reverse()
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
273 child_feat[etype] = excod
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
274
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
275 # add sub-feature # make array for export to different out
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
276 EXON[xq] = np.array(child_feat.get('exon'), np.float64)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
277
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
278 # add sub-features to the parent gene feature
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
279 gene[g_cnt]['transcripts'] = TRS
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
280 gene[g_cnt]['exons'] = EXON
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
281
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
282 gene[g_cnt]['gene_info'] = dict( ID = pkey[-1],
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
283 Name = pdet.get('name'),
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
284 Source = pkey[1])
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
285 g_cnt += 1
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
286
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
287 ## deleting empty gene records from the main array
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
288 for XP, ens in enumerate(gene):
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
289 if ens[0]==0:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
290 break
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
291
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
292 XQC = range(XP, len(gene)+1)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
293 gene = np.delete(gene, XQC)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
294
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
295 return gene
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
296
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
297 def NonetoemptyList(XS):
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
298 """
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
299 Convert a None type to empty list
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
300 """
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
301 return [] if XS is None else XS
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
302
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
303 def _create_missing_feature_type(p_feat, c_feat):
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
304 """
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
305 GFF/GTF file defines only child features. This function tries to create
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
306 the parent feature from the information provided in the attribute column.
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
307
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
308 example:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
309 chr21 hg19_knownGene exon 9690071 9690100 0.000000 + . gene_id "uc002zkg.1"; transcript_id "uc002zkg.1";
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
310 chr21 hg19_knownGene exon 9692178 9692207 0.000000 + . gene_id "uc021wgt.1"; transcript_id "uc021wgt.1";
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
311 chr21 hg19_knownGene exon 9711935 9712038 0.000000 + . gene_id "uc011abu.2"; transcript_id "uc011abu.2";
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
312
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
313 This function gets the parsed feature annotations.
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
314 """
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
315 child_n_map = defaultdict(list)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
316 for fid, det in c_feat.items():
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
317 # get the details from grand child
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
318 GID = STRD = None
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
319 SPOS, EPOS = [], []
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
320 TYP = dict()
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
321 for gchild in det:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
322 GID = gchild.get('gene_id', [''])[0]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
323 SPOS.append(gchild.get('location', [])[0])
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
324 EPOS.append(gchild.get('location', [])[1])
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
325 STRD = gchild.get('strand', '')
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
326 TYP[gchild.get('type', '')] = 1
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
327 SPOS.sort()
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
328 EPOS.sort()
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
329
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
330 # infer transcript type
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
331 transcript_type = 'transcript'
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
332 transcript_type = 'mRNA' if TYP.get('CDS', '') or TYP.get('cds', '') else transcript_type
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
333
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
334 # gene id and transcript id are same
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
335 transcript_id = fid[-1]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
336 if GID == transcript_id:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
337 transcript_id = 'Transcript:' + str(GID)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
338
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
339 # level -1 feature type
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
340 p_feat[(fid[0], fid[1], GID)] = dict( type = 'gene',
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
341 location = [], ## infer location based on multiple transcripts
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
342 strand = STRD,
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
343 name = GID )
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
344 # level -2 feature type
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
345 child_n_map[(fid[0], fid[1], GID)].append(
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
346 dict( type = transcript_type,
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
347 location = [SPOS[0], EPOS[-1]],
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
348 strand = STRD,
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
349 ID = transcript_id,
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
350 gene_id = '' ))
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
351 # reorganizing the grand child
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
352 for gchild in det:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
353 child_n_map[(fid[0], fid[1], transcript_id)].append(
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
354 dict( type = gchild.get('type', ''),
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
355 location = gchild.get('location'),
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
356 strand = gchild.get('strand'),
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
357 ID = gchild.get('ID'),
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
358 gene_id = '' ))
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
359 return p_feat, child_n_map
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
360
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
361
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
362 ## General instruction to use the above functions:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
363 ## Usage: GFFParser.py in.gff3 out.mat
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
364
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
365 try:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
366 gff_file = sys.argv[1]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
367 out_mat = sys.argv[2]
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
368 except:
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
369 print __doc__
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
370 sys.exit(-1)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
371
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
372 ## Parse the file accoring to the type and returns the genes informations --
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
373 gene_struct = Parse(gff_file)
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
374
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
375 ## Write the gene annotations to a matlab struct array format --
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
376 sio.savemat(out_mat,
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
377 mdict = dict(genes = gene_struct),
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
378 format = '5',
233c30f91d66 updated python based GFF parsing module which will handle GTF/GFF/GFF3 file types
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
379 oned_as = 'row')