annotate query_mass_repos.py @ 60:35f506f30ae4

fixed small rule in pdfread, and other small enhancements
author pieter.lukasse@wur.nl
date Fri, 19 Dec 2014 11:30:22 +0100
parents 60b53f2aa48a
children
rev   line source
23
85fd05d0d16c New tool to Query multiple public repositories for elemental compositions
pieter.lukasse@wur.nl
parents:
diff changeset
#!/usr/bin/env python
# encoding: utf-8
'''
Module to query a set of accurate mass values detected by high-resolution mass spectrometers
against various repositories/services such as METabolomics EXPlorer database or the
MFSearcher service (http://webs2.kazusa.or.jp/mfsearcher/).

It will take the input file and for each record it will query the
molecular mass in the selected repository/service. If one or more compounds are found
then extra information regarding these compounds is added to the output file.

The output file is thus the input file enriched with information about
related items found in the selected repository/service.

The service should implement the following interface:

http://service_url/mass?targetMs=500&margin=1&marginUnit=ppm&output=txth    (txth means there is guaranteed to be a header line before the data)

The output should be tab separated and should contain the following columns (in this order)
db-name    molecular-formula    dbe    formula-weight    id    description


'''
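The query interface described in the docstring above can be illustrated with a small sketch. It is written for Python 3 (the module itself targets Python 2 and urllib2), and the helper name `build_mass_query` is hypothetical, not part of this module. It only assembles the URL; it does not contact any service.

```python
from urllib.parse import urlencode


def build_mass_query(service_url, target_mass, margin, margin_unit="ppm"):
    """Assemble a query URL following the documented interface.

    Hypothetical helper for illustration; parameter names mirror the
    query string shown in the module docstring.
    """
    # A list of pairs keeps the parameters in the documented order.
    params = [
        ("targetMs", target_mass),
        ("margin", margin),
        ("marginUnit", margin_unit),
        ("output", "txth"),  # txth: header line guaranteed before the data
    ]
    return service_url.rstrip("/") + "/mass?" + urlencode(params)


url = build_mass_query("http://service_url", 500, 1)
print(url)
```

Using `urlencode` rather than string concatenation also escapes any characters that are not URL-safe, which the concatenation used further down in this module does not do.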
import csv
import sys
import fileinput
import urllib2
import time
from collections import OrderedDict

__author__ = "Pieter Lukasse"
__contact__ = "pieter.lukasse@wur.nl"
__copyright__ = "Copyright, 2014, Plant Research International, WUR"
__license__ = "Apache v2"

def _process_file(in_xsv, delim='\t'):
    '''
    Generic method to parse a tab-separated file returning a dictionary with named columns
    @param in_xsv: input filename to be parsed
    @param delim: column delimiter (tab by default)
    '''
    # Close the file as soon as all rows have been read:
    with open(in_xsv, 'rU') as in_file:
        data = list(csv.reader(in_file, delimiter=delim))
    return _process_data(data)

def _process_data(data):
    '''
    Converts a list of rows (header row first) into an ordered dictionary
    that maps each column name to the list of values found in that column
    '''
    header = data.pop(0)
    # Create dictionary with column name as key
    output = OrderedDict()
    for index in xrange(len(header)):
        output[header[index]] = [row[index] for row in data]
    return output
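The column-oriented layout produced by `_process_data` may be easier to picture with a tiny example. The sketch below is a Python 3 illustration with made-up rows; `rows_to_columns` is a hypothetical name and the sample values are not taken from any real input file.

```python
from collections import OrderedDict


def rows_to_columns(rows):
    """Map each header name to the list of values in that column."""
    header, records = rows[0], rows[1:]
    columns = OrderedDict()
    for index, name in enumerate(header):
        columns[name] = [row[index] for row in records]
    return columns


# Made-up sample: header row followed by two records.
rows = [["id", "mass"], ["a", "100.5"], ["b", "200.25"]]
columns = rows_to_columns(rows)
print(list(columns.keys()))
print(columns["mass"])
```

Unlike `_process_data`, this sketch does not modify the list passed in, since it slices instead of calling `pop(0)`.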


def _query_and_add_data(input_data, molecular_mass_col, repository_dblink, error_margin, margin_unit):

    '''
    This method will iterate over the records in the input_data and
    will enrich them with the related information found (if any) in the
    chosen repository/service

    # TODO : could optimize this with multi-threading, see also nice example at http://stackoverflow.com/questions/2846653/python-multithreading-for-dummies
    '''
    merged = []

    for i in xrange(len(input_data[input_data.keys()[0]])):
        # Get the record in same dictionary format as input_data, but containing
        # a value at each column instead of a list of all values of all records:
        input_data_record = OrderedDict(zip(input_data.keys(), [input_data[key][i] for key in input_data.keys()]))

        # read the molecular mass :
        molecular_mass = input_data_record[molecular_mass_col]


        # search for related records in repository/service:
        data_found = None
        if molecular_mass != "":
            molecular_mass = float(molecular_mass)

            # 1- search for data around this MM:
            query_link = repository_dblink + "/mass?targetMs=" + str(molecular_mass) + "&margin=" + str(error_margin) + "&marginUnit=" + margin_unit + "&output=txth"

            data_found = _fire_query_and_return_dict(query_link + "&_format_result=tsv")
            data_type_found = "MM"


        if data_found is None:
            # If still nothing found, just add empty columns
            extra_cols = ['', '', '', '', '', '']
        else:
            # Add info found:
            extra_cols = _get_extra_info_and_link_cols(data_found, data_type_found, query_link)

        # Take all data and merge it into a "flat"/simple array of values:
        field_values_list = _merge_data(input_data_record, extra_cols)

        merged.append(field_values_list)

    # return the merged/enriched records:
    return merged
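The TODO in `_query_and_add_data` mentions multi-threading. One possible shape for that, sketched in Python 3 with `concurrent.futures`, is shown below. This is only an illustration of the idea, not the author's implementation: `query_all` and `fake_lookup` are hypothetical names, and the lookup is a stand-in that makes no network call.

```python
from concurrent.futures import ThreadPoolExecutor


def query_all(masses, lookup, max_workers=4):
    """Run lookup(mass) for every mass on a small thread pool.

    Executor.map yields results in the order of the inputs, so each
    result can still be matched to its input record by position.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lookup, masses))


def fake_lookup(mass):
    # Hypothetical stand-in for the per-record web-service query.
    return "hit-for-%s" % mass


results = query_all([100.5, 200.25, 300.0], fake_lookup)
print(results)
```

Threads suit this case because each query mostly waits on the network. A real version would also need to respect whatever request rate the queried service tolerates.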


def _get_extra_info_and_link_cols(data_found, data_type_found, query_link):
    '''
    This method will go over the data found and will return a
    list with the following items:
    - type of data found
    - unique values found for each of the columns
      db-name    molecular-formula    id    description
      (one "|" separated string per column)
    - Link that executes same query

    '''

    # Default to empty lists; set() below makes each list of values unique:
    db_name_set = []
    molecular_formula_set = []
    id_set = []
    description_set = []


    if 'db-name' in data_found:
        db_name_set = set(data_found['db-name'])
    elif '# db-name' in data_found:
        db_name_set = set(data_found['# db-name'])
    if 'molecular-formula' in data_found:
        molecular_formula_set = set(data_found['molecular-formula'])
    if 'id' in data_found:
        id_set = set(data_found['id'])
    if 'description' in data_found:
        description_set = set(data_found['description'])

    result = [data_type_found,
              _to_xsv(db_name_set),
              _to_xsv(molecular_formula_set),
              _to_xsv(id_set),
              _to_xsv(description_set),
              #To let Excel interpret as link, use e.g. =HYPERLINK("http://stackoverflow.com", "friendly name"):
              "=HYPERLINK(\""+ query_link + "\", \"Link to entries found in DB \")"]
    return result


def _to_xsv(data_set):
    '''
    Concatenates the items into one string, each item followed by a "|" separator
    '''
    result = ""
    for item in data_set:
        result = result + str(item) + "|"
    return result
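Because the values passed to `_to_xsv` come from `set()`, their order in the output is not guaranteed, and the loop leaves a trailing "|". If a stable order and no trailing separator were preferred, one alternative could look like the Python 3 sketch below. `unique_pipe_joined` is a hypothetical name and the sample values are made up; this changes the output format, so it is not a drop-in replacement.

```python
from collections import OrderedDict


def unique_pipe_joined(values):
    """Drop duplicates, keep first-seen order, join with '|'."""
    unique = list(OrderedDict.fromkeys(values))
    return "|".join(str(value) for value in unique)


joined = unique_pipe_joined(["DB-A", "DB-B", "DB-A"])
print(joined)
```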


def _fire_query_and_return_dict(url):
    '''
    This method will fire the query as a web-service call and
    return the results as a dictionary that maps each column name to the
    list of values in that column, or None if the response contains no hits
    '''

    try:
        data = urllib2.urlopen(url).read()

        # transform to dictionary:
        result = []
        data_rows = data.split("\n")

        # remove comment lines if any (only leave the one that has "molecular-formula" word in it...compatible with kazusa service):
        data_rows_to_remove = []
        for data_row in data_rows:
            if data_row == "" or (data_row[0] == '#' and "molecular-formula" not in data_row):
                data_rows_to_remove.append(data_row)

        for data_row in data_rows_to_remove:
            data_rows.remove(data_row)

        # check if there is any data in the response:
        if len(data_rows) <= 1 or data_rows[1].strip() == '':
            # means there is only the header row...so no hits:
            return None

        for data_row in data_rows:
            if not data_row.strip() == '':
                row_as_list = _str_to_list(data_row, delimiter='\t')
                result.append(row_as_list)

        # return result processed into a dict:
        return _process_data(result)

    except urllib2.HTTPError as e:
        raise Exception("HTTP error for URL: " + url + " : %s - " % e.code + str(e.reason))
    except urllib2.URLError as e:
        # e.reason can be a plain string or a socket error, so convert it rather than index into it:
        raise Exception("Network error: %s" % str(e.reason) + ". Administrator: please check if service [" + url + "] is accessible from your Galaxy server. ")
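The response handling in `_fire_query_and_return_dict` can be tried without a network connection. The Python 3 sketch below applies the same filtering rule (drop empty lines and "#" comment lines, except the one that contains "molecular-formula") to a made-up response. `parse_txth` is a hypothetical name, and the sample text and its values are invented for illustration, not real service output.

```python
def parse_txth(text):
    """Parse a tab-separated response into {column name: [values]}.

    Returns None when only the header row (or nothing) remains.
    """
    rows = [
        line for line in text.split("\n")
        if line.strip() != ""
        and not (line.startswith("#") and "molecular-formula" not in line)
    ]
    if len(rows) <= 1:
        return None
    table = [row.split("\t") for row in rows]
    header, records = table[0], table[1:]
    return {name: [rec[i] for rec in records] for i, name in enumerate(header)}


# Invented sample response: one comment line, a "#"-prefixed header, one hit.
sample = (
    "# some comment line\n"
    "# db-name\tmolecular-formula\tdbe\tformula-weight\tid\tdescription\n"
    "ExampleDB\tC6H12O6\t1\t180.0634\tEX0001\texample compound\n"
)
parsed = parse_txth(sample)
print(parsed["molecular-formula"])
```

Note that the first column keeps its "# db-name" key when the header line starts with "#", which is why `_get_extra_info_and_link_cols` checks for both spellings.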

def _str_to_list(data_row, delimiter='\t'):
    result = []
    for column in data_row.split(delimiter):
        result.append(column)
    return result
85fd05d0d16c New tool to Query multiple public repositories for elemental compositions
pieter.lukasse@wur.nl
parents:
diff changeset
192
85fd05d0d16c New tool to Query multiple public repositories for elemental compositions
pieter.lukasse@wur.nl
parents:
diff changeset
193
# alternative: ?
# s = requests.Session()
# s.verify = False
# #s.auth = (token01, token02)
# resp = s.get(url, params={'name': 'anonymous'}, stream=True)
# content = resp.content
# # transform to dictionary:




def _merge_data(input_data_record, extra_cols):
    '''
    Adds the extra information to the existing data record and returns
    the combined new record.
    '''
    record = []
    for column in input_data_record:
        record.append(input_data_record[column])


    # add extra columns
    for column in extra_cols:
        record.append(column)

    return record


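# Sketch, not part of the original tool: _merge_data emits values in the
# iteration order of the record mapping, and main() builds the header row from
# the keys of the input data, so both must iterate in the same order. Assuming
# the record is an OrderedDict (an assumption; the type returned by
# _process_file is not shown in this section), the merge could be written as
# below. The name merge_record and the example columns are hypothetical.

```python
from collections import OrderedDict


def merge_record(input_data_record, extra_cols):
    """Return the record's values in key order, followed by extra_cols."""
    return list(input_data_record.values()) + list(extra_cols)


# Hypothetical example record; the column names are made up for illustration.
example_record = OrderedDict([("name", "peak_1"), ("mass", "180.0634")])
example_merged = merge_record(example_record, ["2", "KEGG"])
```

# With an ordered mapping, the values line up with a header row built from the
# same keys plus the extra column names.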
def _save_data(data_rows, headers, out_csv):
    '''
    Writes tab-separated data to file
    @param data_rows: rows of the merged/enriched dataset, one sequence of column values per row
    @param headers: column names, written as the first row
    @param out_csv: output csv file
    '''

    # Open output file for writing; the with-block closes it even on errors
    with open(out_csv, 'wb') as outfile_single_handle:
        output_single_handle = csv.writer(outfile_single_handle, delimiter="\t")

        # Write headers
        output_single_handle.writerow(headers)

        # Write one line for each row
        for data_row in data_rows:
            output_single_handle.writerow(data_row)

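# Sketch, not part of the original tool: _save_data opens the output in 'wb'
# mode, which is what the Python 2 csv module expects. Under Python 3 the csv
# module needs a text-mode file opened with newline=''. A minimal Python 3
# variant, assuming the same arguments, might look like this (the name
# save_rows is hypothetical):

```python
import csv


def save_rows(data_rows, headers, out_path):
    """Write headers and rows as tab-separated text (Python 3 variant)."""
    # newline='' lets the csv module control line endings itself.
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(headers)
        writer.writerows(data_rows)
```

# The with-block closes the file even if writing a row raises an exception.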
def _get_repository_URL(repository_file):
    '''
    Read out and return the URL stored in the given file.
    '''
    file_input = fileinput.input(repository_file)
    try:
        for line in file_input:
            if line[0] != '#':
                # just return the first line that is not a comment line:
                return line
    finally:
        file_input.close()


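# Sketch, not part of the original tool: _get_repository_URL returns the first
# line that does not start with '#', including its trailing newline, and it
# would also return a blank line if one precedes the URL. Whether the caller
# strips the value is not shown in this section. A more defensive variant,
# assuming blank lines should be skipped and surrounding whitespace removed,
# could look like this (the name read_repository_url is hypothetical):

```python
def read_repository_url(repository_file):
    """Return the first non-blank, non-comment line, stripped of whitespace."""
    with open(repository_file) as handle:
        for line in handle:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                return stripped
    # No usable line found.
    return None
```

# Returning None for a file without a URL lets the caller report a clear
# configuration error instead of querying an empty address.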
def main():
    '''
    Query main function

    The input file can be any tabular file, as long as it contains a column for the molecular mass.
    This column is then used to query against the chosen repository/service database.
    '''
    seconds_start = int(round(time.time()))

    input_file = sys.argv[1]
    molecular_mass_col = sys.argv[2]
    repository_file = sys.argv[3]
    error_margin = float(sys.argv[4])
    margin_unit = sys.argv[5]
    output_result = sys.argv[6]

    # Parse repository_file to find the URL to the service:
    repository_dblink = _get_repository_URL(repository_file)

    # Parse tabular input file into dictionary/array:
    input_data = _process_file(input_file)

    # Query data against repository:
    enriched_data = _query_and_add_data(input_data, molecular_mass_col, repository_dblink, error_margin, margin_unit)
    headers = input_data.keys() + ['SEARCH hits for ','SEARCH hits: db-names', 'SEARCH hits: molecular-formulas ',
                                   'SEARCH hits: ids','SEARCH hits: descriptions', 'Link to SEARCH hits']  # TODO - add min and max formula weight columns

    _save_data(enriched_data, headers, output_result)

    seconds_end = int(round(time.time()))
    print "Took " + str(seconds_end - seconds_start) + " seconds"



if __name__ == '__main__':
    main()
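
# Sketch, not part of the original tool: main() reads six positional values
# straight from sys.argv. The same interface could be declared with argparse,
# which adds usage messages and type checking for the error margin. The
# argument names mirror the variables in main(); the function name parse_args
# and the description text are hypothetical.

```python
import argparse


def parse_args(argv):
    """Parse the six positional arguments that main() reads from sys.argv."""
    parser = argparse.ArgumentParser(
        description="Query molecular masses against a repository/service")
    parser.add_argument("input_file")
    parser.add_argument("molecular_mass_col")
    parser.add_argument("repository_file")
    parser.add_argument("error_margin", type=float)
    parser.add_argument("margin_unit")
    parser.add_argument("output_result")
    return parser.parse_args(argv)
```

# Calling parse_args(sys.argv[1:]) would yield an object whose attributes
# correspond one-to-one to the local variables assigned at the top of main().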