annotate COG/bac-genomics-scripts/tbl2tab/tbl2tab.pl @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
#!/usr/bin/perl

#######
# POD #
#######

=pod

=head1 NAME

C<tbl2tab.pl> - convert tbl to tab-separated format and back

=head1 SYNOPSIS

C<perl tbl2tab.pl -m tbl2tab -i feature_table.tbl -s -l locus_prefix>

B<or>

C<perl tbl2tab.pl -m tab2tbl -i feature_table.tab -g -l locus_prefix
-p "gnl|dbname|">

=head1 DESCRIPTION

NCBI's feature table (B<tbl>) format is needed for the submission of
genomic data to GenBank with the NCBI tools
L<Sequin|http://www.ncbi.nlm.nih.gov/Sequin/> or
L<tbl2asn|http://www.ncbi.nlm.nih.gov/genbank/tbl2asn2>. tbl files
can be created with automatic annotation systems like
L<Prokka|http://www.vicbioinformatics.com/software.prokka.shtml>.
C<tbl2tab.pl> can convert a tbl file to a tab-separated format (tab)
and back to the tbl format. The tab-delimited format is useful for
manipulating the data more comfortably in spreadsheet software (e.g.
LibreOffice or MS Excel). For a conversion back to tbl format, save
the file in the spreadsheet software as a tab-delimited text file.
The script is intended for microbial genomes, but might also be
useful for eukaryotes.

Regular expressions are applied in mode B<tbl2tab> to correct gene
names and words in '/product' values to lowercase initials (with
the exception of 'Rossman' and 'Willebrand'). The resulting tab file
can then be used to check for possible errors.

The first four header columns of the B<tab> format are mandatory:
'seq_id' for the SeqID and, for each primary tag/feature (e.g. CDS,
RNAs, repeat_region etc.), 'start', 'stop', and 'primary_tag'. These
mandatory columns have to be filled in every row in the tab file.
All the following columns will be included as tags/qualifiers (e.g.
'/locus_tag', '/product', '/EC_number', '/note' etc.) in the
conversion to the tbl file if a value is present.

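As a minimal sketch of this column layout (not part of the script's
own code), the following splits one tab row and checks the four
mandatory columns. The row contents ('contig1', 'EPE_00001' and the
product string) are hypothetical values chosen for illustration.

    use strict;
    use warnings;

    # hypothetical row: four mandatory columns, then qualifier columns
    my $row = join("\t", 'contig1', 190, 255, 'CDS', 'EPE_00001',
        'hypothetical protein');

    # limit -1 keeps trailing empty qualifier columns
    my ($seq_id, $start, $stop, $primary_tag, @qualifiers)
        = split(/\t/, $row, -1);

    # every mandatory column has to carry a value
    foreach my $column ($seq_id, $start, $stop, $primary_tag) {
        die "Mandatory column is empty!\n"
            if (!defined $column || $column eq '');
    }
    print "$primary_tag on $seq_id: $start..$stop\n";

The limit of -1 in C<split> is a choice of this sketch; without it
Perl would silently drop empty columns at the end of a row.
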
There are three special cases:

B<First>, '/pseudo' will be included as a tag if I<any> value (the
script uses 'T' for true) is present in the B<tab> format. If a
primary tag is indicated as pseudo, both the primary tag and the
accessory 'gene' primary tag (for CDS/RNA features with option
B<-g>) will include a '/pseudo' qualifier in the resulting B<tbl>
file. B<Pseudo-genes> are indicated by 'pseudo' in the 'primary_tag'
column, thus the 'pseudo' column is ignored in these cases.

B<Second>, tag '/gene_desc' is reserved for the 'product' values of
pseudo-genes, thus a 'gene_desc' column in a tab file will be
ignored in the conversion to tbl.

B<Third>, column 'protein_id' in a tab file will also be ignored in
the conversion. '/protein_id' values are created from option B<-p>
and the locus_tag for each CDS primary feature.

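How such a '/protein_id' value comes together can be sketched as
follows. This assumes the prefix and the locus_tag are simply
concatenated, which the third special case suggests but does not
spell out; the locus_tag 'EPE_00001' is a hypothetical value.

    use strict;
    use warnings;

    my $protein_id_prefix = 'gnl|goetting|'; # default of option -p
    my $locus_tag = 'EPE_00001'; # hypothetical locus_tag of a CDS

    # assumption: plain concatenation of prefix and locus_tag
    my $protein_id = $protein_id_prefix . $locus_tag;
    print "protein_id\t$protein_id\n";

A 'protein_id' column edited in the spreadsheet plays no part in
this, in line with the third special case.
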
Furthermore, with option B<-s> G2L-style spreadsheet formulas
(L<Goettingen Genomics
Laboratory|http://appmibio.uni-goettingen.de/>) can be included with
additional columns, 'spreadsheet_locus_tag', 'position', 'distance',
'gene_number', and 'contig_order'. These columns will not be
included in a conversion to the tbl format. Thus, if you want to
include e.g. the locus_tags from the formula in column
'spreadsheet_locus_tag' in the resulting tbl file, copy the B<values>
to the column 'locus_tag'!

To illustrate the process, two example files are included in the
repository, F<example.tbl> and F<example2.tab>, which are
interconvertible (see L</"EXAMPLES"> below).

B<Warning>: be aware of possible errors introduced by automatic
format conversions in spreadsheet software like MS Excel, see
e.g. Zeeberg et al. 2004
(L<http://www.ncbi.nlm.nih.gov/pubmed/15214961>).

For more information regarding the feature table and the submission
process see NCBI's L<prokaryotic annotation
guide|http://www.ncbi.nlm.nih.gov/genbank/genomesubmit> and the
L<bacterial genome submission
guide|http://www.ncbi.nlm.nih.gov/genbank/genomesubmit_annotation>.

=head1 OPTIONS

=head2 Mandatory options

=over 20

=item B<-m>=I<tbl2tab|tab2tbl>, B<-mode>=I<tbl2tab|tab2tbl>

Conversion mode, either 'tbl2tab' or 'tab2tbl' [default = 'tbl2tab']

=item B<-i>=I<str>, B<-input>=I<str>

Input tbl or tab file to be converted to the other format

=back

=head2 Optional options

=over 20

=item B<-h>, B<-help>

Help (perldoc POD)

=item B<-v>, B<-version>

Print version number to C<STDERR>

=back

=head3 Mode B<tbl2tab>

=over 20

=item B<-l>=I<str>, B<-locus_prefix>=I<str>

Only in combination with option B<-s>, where it is mandatory: the
locus_tag prefix to include in the formula for column
'spreadsheet_locus_tag'

=item B<-c>, B<-concat>

Concatenate values of identical tags within one primary tag with '~'
(e.g. several '/EC_number' or '/inference' tags)

=item B<-e>=I<str>, B<-empty>=I<str>

String used for primary features without value for a tag [default = '']

=item B<-s>, B<-spreadsheet>

Include formulas for spreadsheet editing

=item B<-f>=I<e|g>, B<-formula_lang>=I<e|g>

Syntax language of the spreadsheet formulas, either 'English' or
'German'. If you're still encountering problems with the formulas,
set the decimal and thousands separator manually in the options of
the spreadsheet software (instead of using the operating system
separators). [default = 'e']

=back

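The effect of option B<-c> can be sketched in a few lines of Perl.
This is an illustration under assumptions, not the script's own
code: the hash layout and the two '/EC_number' values are
hypothetical.

    use strict;
    use warnings;

    # hypothetical: two '/EC_number' tags found within one CDS
    my %tags = (EC_number => ['2.7.2.4', '1.1.1.3']);

    # with -c, identical tags share one column, joined by '~'
    my $cell = join('~', @{ $tags{EC_number} });
    print "$cell\n";

Splitting such a cell on '~' would recover the individual values,
provided the separator does not occur within a value.
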
=head3 Mode B<tab2tbl>

=over 20

=item B<-l>=I<str>, B<-locus_prefix>=I<str>

Prefix to the SeqID if not present already in the SeqID

=item B<-g>, B<-gene>

Include accessory 'gene' primary tags (with '/gene', '/locus_tag'
and possibly '/pseudo' tags) for 'CDS/RNA' primary tags; NCBI standard

=item B<-t>, B<-tags_full>

Only in combination with option B<-g>, include '/gene' and
'/locus_tag' tags additionally in primary tag, not only in accessory
'gene' primary tag

=item B<-p>=I<str>, B<-protein_id_prefix>=I<str>

Prefix for '/protein_id' tags; don't forget the double quotes for
the string, otherwise the shell will interpret '|' as a pipe
[default = 'gnl|goetting|']

=back

=head1 OUTPUT

=over 20

=item F<*.tab|tbl>

Result file in the opposite format

=item (F<hypo_putative_genes.txt>)

Created in mode 'tab2tbl', indicates if CDSs are annotated as
'hypothetical/putative/predicted protein' but still have a gene name

=back

=head1 EXAMPLES

=over

=item C<perl tbl2tab.pl -m tbl2tab -i example.tbl -s -l EPE>

=item C<perl tbl2tab.pl -m tab2tbl -i example2.tab -g -l EPE>

=back

=head1 VERSION

0.2 update: 29-10-2014
0.1 24-06-2014

=head1 AUTHOR

Andreas Leimbach aleimba[at]gmx[dot]de

=head1 LICENSE

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 (GPLv3) of the License,
or (at your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see L<http://www.gnu.org/licenses/>.

=cut


########
# MAIN #
########

use strict;
use warnings;
use autodie;
use Getopt::Long;
use Pod::Usage;



### Get options with Getopt::Long, works also abbreviated and with two "--": -i, --i, -input ...
my $Input_File; # input file
my $Mode = 'tbl2tab'; # mode of script, i.e. either convert from tbl2tab or from tab2tbl; default 'tbl2tab'
my $Locus_Prefix = ''; # required for option 'spreadsheet' in mode 'tbl2tab', in mode 'tab2tbl' optional
my $Opt_Concat; # optionally, concatenate values of the same tag within one primary tag in a tbl file in one column in the resulting tab file with '~' (e.g. several 'EC_number' tags etc.)
my $Empty = ''; # optionally, set what should be used for tags without a value in resulting tab file; default is nothing
my $Opt_Spreadsheet; # optionally, include formulas for spreadsheet editing (e.g. Libre Office, MS Excel)
my $Formula_Lang_Spreadsheet = 'e'; # optionally, either German or English formulas in Spreadsheet option; default 'e' for English
my $Opt_Gene; # optionally, include accessory gene primary tags (with '/gene' and '/locus_tag' [and '/pseudo'] tags) for CDS|RNA primary tags
my $Opt_Tags_Full; # optionally, include '/gene' and '/locus_tag' additionally in primary tag not only accessory 'gene' primary tag
my $Protein_Id_Prefix = 'gnl|goetting|'; # optionally give a different string to prefix the '/protein_id' tags
my $VERSION = 0.2;
my ($Opt_Version, $Opt_Help);
GetOptions ('input=s' => \$Input_File,
            'mode=s' => \$Mode,
            'locus_prefix:s' => \$Locus_Prefix,
            'concat' => \$Opt_Concat,
            'empty:s' => \$Empty,
            'spreadsheet' => \$Opt_Spreadsheet,
            'formula_lang:s' => \$Formula_Lang_Spreadsheet,
            'gene' => \$Opt_Gene,
            'tags_full' => \$Opt_Tags_Full,
            'protein_id_prefix:s' => \$Protein_Id_Prefix,
            'version' => \$Opt_Version,
            'help|?' => \$Opt_Help);



### Run perldoc on POD
pod2usage(-verbose => 2) if ($Opt_Help);
die "$0 $VERSION\n" if ($Opt_Version);
if (!$Input_File) {
    my $warning = "\n### Fatal error: Option '-i' or its argument is missing!\n";
    pod2usage(-verbose => 1, -message => $warning, -exitval => 2);
} elsif (!($Mode =~ /tbl2tab/i || $Mode =~ /tab2tbl/i)) { # case-insensitive
    my $warning = "\n### Fatal error: Incorrect run mode with option '-m' given! Please choose either 'tbl2tab' or 'tab2tbl' for '-m'!\n";
    pod2usage(-verbose => 1, -message => $warning, -exitval => 2);
}



### Enforce mandatory or optional options
if ($Mode =~ /tbl2tab/i && ($Opt_Gene || $Opt_Tags_Full || $Protein_Id_Prefix ne 'gnl|goetting|')) {
    warn "\nIncompatible option(s) '-g', '-p', or '-t' set with mode 'tbl2tab'. Ignoring the option(s)!\n";
} elsif ($Mode =~ /tab2tbl/i && ($Opt_Concat || $Empty || $Opt_Spreadsheet || $Formula_Lang_Spreadsheet ne 'e')) {
    warn "\nIncompatible option(s) '-c', '-e', '-f', or '-s' set with mode 'tab2tbl'. Ignoring the option(s)!\n";
    $Formula_Lang_Spreadsheet = 'e' if ($Formula_Lang_Spreadsheet ne 'e'); # avoid die with error below if option not set correctly
}

if ($Mode =~ /tbl2tab/i && ($Locus_Prefix || $Formula_Lang_Spreadsheet ne 'e') && !$Opt_Spreadsheet) {
    warn "\nOption(s) '-l' or '-f' set, but not '-s'. Forcing option '-s'!\n";
    $Opt_Spreadsheet = 1;
}
if ($Mode =~ /tbl2tab/i && !$Locus_Prefix && $Opt_Spreadsheet) {
    warn "\nOption '-s' set, but not '-l'. Please give a prefix for the locus tags: ";
    chomp($Locus_Prefix = <STDIN>); # read the answer from STDIN; '<>' would try to open leftover arguments in @ARGV as files
}

if ($Formula_Lang_Spreadsheet !~ /^(e|g)/i) {
    die "\n### Fatal error: Incorrect language for option '-f' given! Please choose either 'e|eng' or 'g|ger' for '-f'!\n";
}

if ($Mode =~ /tab2tbl/i && !$Opt_Gene && $Opt_Tags_Full) {
    warn "\nOption '-t' set, but not '-g'. Forcing option '-g'!\n";
    $Opt_Gene = 1;
}



### Read in tbl or tab-separated data and write to result file in the opposite format
my $Out_File = $Input_File;
$Out_File =~ s/^(.+)\.\w+$/$1/; # strip filename extension
my $Error_File = 'hypo_putative_genes.txt';
if ($Mode =~ /tbl2tab/i) {
    my ($data_hash_ref, $tags_max_count_hash_ref) = read_tbl(); # subroutine
    $Out_File .= '.tab';
    write_tab($data_hash_ref, $tags_max_count_hash_ref); # subroutine

} elsif ($Mode =~ /tab2tbl/i) {
    $Out_File .= '.tbl';
    read_tab_write_tbl(); # subroutine
}



### Message which file was created
if ($Mode =~ /tbl2tab/i) {
    print "Input tbl file '$Input_File' was converted to tab output file '$Out_File'!\n";
} elsif ($Mode =~ /tab2tbl/i) {
    print "Input tab file '$Input_File' was converted to tbl output file '$Out_File'!\n";
}

if (-e $Error_File) {
    if (-s $Error_File >= 40) { # larger than just the header, which should be 27 bytes
        warn "\n### Warning: CDSs found that are annotated with 'hypothetical|putative|predicted protein' but still include a '/gene' tag, see file '$Error_File'!\n";
    } else { # at most the header was written, remove the file
        unlink $Error_File;
    }
}


exit;

e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
350 ###############
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
351 # Subroutines #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
352 ###############
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
353
### Subroutine to test for file existence and give warning to STDERR
sub file_exist {
    my $file = shift;
    if (-e $file) {
        warn "\nThe result file \'$file\' exists already and will be overwritten!\n\n";
        return 1;
    }
    return 0;
}



### Print tag values to tab result file
sub print_tag2tab {
    my ($tag, $data_hash_ref, $seq_id, $pos, $tag_max_count) = @_;

    if ($Opt_Concat) { # values concatenated by '~' in $data_hash_ref
        if ($data_hash_ref->{$seq_id}->{$pos}->{$tag}) {
            print "\t$data_hash_ref->{$seq_id}->{$pos}->{$tag}";
            return 1;
        } else {
            print "\t$Empty";
            return 1;
        }

    } elsif (!$Opt_Concat) { # split concatenated values in individual values
        my $value = $data_hash_ref->{$seq_id}->{$pos}->{$tag};
        my @values = $value ? split(/~/, $value) : (); # avoid 'my ... if (...)', whose behaviour is undefined in Perl
        if (@values) {
            foreach (@values) {
                print "\t$_";
            }
            print "\t$Empty" x ($tag_max_count - @values); # fill residual columns till maximum occurrence
        } else {
            print "\t$Empty" x $tag_max_count;
        }
        return 1;
    }

    return 0;
}



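=begin comment

The padding logic of the non-concatenated branch above can be sketched in isolation. This is only an illustration under stated assumptions: 'pad_values' is a hypothetical helper that is not part of tbl2tab.pl, and it treats an undefined or empty value as "tag absent", which is slightly stricter than the truthiness test used in the script.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of tbl2tab.pl): split a '~'-joined value and
# pad the result with a placeholder up to a fixed number of columns.
sub pad_values {
    my ($joined, $max_count, $empty) = @_;
    my @values = (defined $joined && length $joined) ? split(/~/, $joined) : ();
    push(@values, ($empty) x ($max_count - @values)) if (@values < $max_count);
    return @values;
}

# Two EC numbers in a table whose 'EC_number' tag occurs at most three times
print join("\t", pad_values('1.1.1.1~2.7.7.7', 3, '-')), "\n";
```

Every row then contributes the same number of cells per tag, which keeps the columns of the tab file aligned with its header.

=end comment

=cut
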
### Print tag values to tbl result file
sub print_tag2tbl {
    my ($tag, $value) = @_;
    return 0 if ($value =~ /^\Q$Empty\E$/); # quote $Empty, so regex metacharacters in it are matched literally
    if ($tag =~ /pseudo/) {
        print "\t\t\t$tag\n";
        return 1;
    }

    ### remove quotations from values introduced by Excel by saving as tab-separated file:
    ### https://office.microsoft.com/en-001/excel-help/excel-formatting-and-features-that-are-not-transferred-to-other-file-formats-HP010014105.aspx
    ### - if a cell contains a comma, the cell contents are enclosed in double quotation marks
    ### - if the data contains a quotation mark, double quotation marks will replace the quotation mark, and the cell contents are also enclosed in double quotation marks
    $value =~ s/""/"/g;
    $value =~ s/^"//;
    $value =~ s/"$//;

    foreach (split(/~/, $value)) {
        print "\t\t\t$tag\t$_\n";
    }
    return 1;
}


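=begin comment

The effect of the three quote substitutions above is easiest to see on a concrete cell. The following is a minimal sketch; 'unquote_excel_cell' is a hypothetical helper introduced only for illustration and is not part of tbl2tab.pl.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of tbl2tab.pl) applying the same three
# substitutions, in the same order, as print_tag2tbl
sub unquote_excel_cell {
    my $value = shift;
    $value =~ s/""/"/g; # doubled quotation marks back to single ones
    $value =~ s/^"//;   # strip the enclosing quotation mark at the start ...
    $value =~ s/"$//;   # ... and at the end
    return $value;
}

# A cell that Excel enclosed in quotation marks because it contains a comma
print unquote_excel_cell('"DNA polymerase III, alpha subunit"'), "\n";
```

Cells without quotation marks pass through unchanged, so the substitutions are safe to apply to every value.

=end comment

=cut
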
### Read in data from tab input file and write it to tbl output file
sub read_tab_write_tbl {
    file_exist($Error_File); # subroutine
    open (my $error_file_fh, ">", $Error_File);
    print $error_file_fh "row\tlocus_tag\tgene\tproduct\n";

    open (my $input_file_fh, "<", $Input_File);
    my $header = <$input_file_fh>;
    $header =~ s/\R/\012/; # convert line to unix-style line endings
    chomp $header;
    if ($header !~ /^seq_id\tstart\tstop\tprimary_tag\t/) { # check if tab file starts with mandatory header fields or quit
        die "\n### Fatal error: Input tab file '$Input_File' doesn't start with the mandatory 'seq_id', 'start', 'stop', and 'primary_tag' tab-separated header fields. Sure this is a valid tab file?\nExiting program!\n\n";
    }

    my @tags;
    foreach (split(/\t/, $header)) {
        last if (/spreadsheet_locus_tag/); # skip all optional extra spreadsheet columns
        push(@tags, $_); # store all header fields/columns to associate with each field in each line
    }

    file_exist($Out_File); # subroutine
    open (my $out_file_fh, ">", $Out_File);
    select $out_file_fh; # select fh for standard print/f output

    my $row = 1; # count row numbers of tab input file for $Error_File (start with '1' as header already parsed above)
    my $seq_id = ''; # store previous SeqID for multi-contig/replicon tab files
    while (<$input_file_fh>) {
        $row++;
        $_ =~ s/\R/\012/; # convert line to unix-style line ending
        chomp;
        next if ($_ =~ /^\s+$/ || $_ =~ /^$/); # skip empty lines

        my ($locus_tag, $gene, $hypo_putative) = ('', '', ''); # needed for $Error_File

        my @cells = split(/\t/, $_);
        for (my $i = 0; $i < 4; $i++) { # check each row for mandatory fields
            if ($cells[$i] =~ /^$/) {
                close $out_file_fh;
                unlink $Out_File;
                die "\n### Fatal error: Row $row of input tab file '$Input_File' is missing a value for one of the mandatory fields 'seq_id', 'start', 'stop', or 'primary_tag'!\nExiting program!\n\n";
            }
        }

        ### print SeqID
        if ($cells[0] ne $seq_id) { # print new contig for multi-contig/replicon tab files
            $seq_id = $cells[0];
            $cells[0] = $Locus_Prefix."_".$cells[0] if ($cells[0] !~ /$Locus_Prefix/ && $Locus_Prefix); # prepend locus_tag prefix only if not present already
            print ">Feature $cells[0]\n";
        }

        ### print accessory 'gene' primary tags with '/locus_tag', '/gene', and potential '/pseudo' tags
        if ($Opt_Gene && $cells[3] =~ /CDS|RNA/) { # accessory 'gene' primary tags only for CDS and RNA (rRNA, tRNA ...) features
            print "$cells[1]\t$cells[2]\tgene\n";
            my $column_count = 0;
            foreach my $tag (@tags) {
                if ($tag =~ /locus_tag/ && $cells[$column_count] =~ /^$/) { # CDSs|RNAs require a '/locus_tag' with option '-g'
                    close $out_file_fh;
                    unlink $Out_File;
                    die "\n### Fatal error: Row $row of input tab file '$Input_File' is missing a 'locus_tag' which is mandatory for option '-g'!\nExiting program!\n\n";
                }
                print_tag2tbl($tag, $cells[$column_count]) if ($tag =~ /locus_tag|^gene$|pseudo/); # subroutine; '^gene$' needed, so 'gene_desc' isn't hit (see below)
                $column_count++;
            }
        }

        ### print primary tag
        print "$cells[1]\t$cells[2]"; # start\tstop
        if ($cells[3] =~ /pseudo/) { # pseudo-gene should include '/pseudo' tag
            print "\tgene\n";
            print "\t\t\tpseudo\n";
        } else {
            print "\t$cells[3]\n";
        }

        ### print tags with values
        for (my $i = 4; $i < @tags; $i++) { # start with field 5 of array with header fields/columns (the first 4 are mandatory, see above)
            next if ($tags[$i] =~ /gene_desc/); # skip 'gene_desc' fields in tab file, reserved for pseudo-genes (see below)

            ### enforce mandatory tags for CDS primary tags
            if ($tags[$i] =~ /product/ && $cells[3] =~ /CDS/) {
                if ($cells[$i] =~ /(hypothetical|putative|predicted) protein/) { # needed for $Error_File
                    $hypo_putative = $cells[$i];
                } elsif ($cells[$i] =~ /^$/) { # CDSs require a value for '/product'
                    close $out_file_fh;
                    unlink $Out_File;
                    die "\n### Fatal error: Row $row of input tab file '$Input_File' is missing a 'product' value which is mandatory for CDS primary tags!\nExiting program!\n\n";
                }
            }

            if ($tags[$i] =~ /locus_tag/ && $cells[3] =~ /CDS/) { # '/protein_id' mandatory for 'CDS' primary tags
                $locus_tag = $cells[$i]; # needed for $Error_File
                my $protein_id = "$Protein_Id_Prefix".$cells[$i];
                print_tag2tbl('protein_id', $protein_id); # subroutine
            }
            next if ($tags[$i] =~ /protein_id/); # skip 'protein_id' field in tab file as the values are created from the 'locus_tag' column

            $gene = $cells[$i] if ($tags[$i] =~ /^gene$/ && $cells[3] =~ /CDS/); # needed for $Error_File

            ### enforce mandatory tags for CDS/RNA primary tags
            next if ($tags[$i] =~ /locus_tag|^gene$/ && $Opt_Gene && !$Opt_Tags_Full && $cells[3] =~ /CDS|RNA/); # skip '/locus_tag' and '/gene' tags if accessory gene primary tags are present for CDS|RNA features (unless option '-t' is set)
            if ($tags[$i] =~ /product/ && $cells[3] =~ /RNA/ && $cells[$i] =~ /^$/) { # RNAs require a value for '/product'
                close $out_file_fh;
                unlink $Out_File;
                die "\n### Fatal error: Row $row of input tab file '$Input_File' is missing a 'product' value which is mandatory for RNA primary tags!\nExiting program!\n\n";
            }

            ### enforce mandatory tag for pseudo-genes (have 'pseudo' as primary tag in tab file)
            if ($tags[$i] =~ /locus_tag/ && $Opt_Gene && $cells[3] =~ /pseudo/ && $cells[$i] =~ /^$/) { # pseudo-genes require a '/locus_tag' with option '-g'
                close $out_file_fh;
                unlink $Out_File;
                die "\n### Fatal error: Row $row of input tab file '$Input_File' is missing a 'locus_tag' which is mandatory for option '-g'!\nExiting program!\n\n";
            }

            ### the rest
            if ($tags[$i] =~ /product/ && $cells[3] =~ /pseudo/) { # write 'product' values for pseudo-genes to '/gene_desc' tags
                print_tag2tbl('gene_desc', $cells[$i]); # subroutine
            } elsif ($tags[$i] =~ /pseudo/ && $cells[3] =~ /pseudo/) { # skip 'pseudo' tag if pseudo-gene
                next;
            } else {
                print_tag2tbl($tags[$i], $cells[$i]); # subroutine
            }

        }
        print $error_file_fh "$row\t$locus_tag\t$gene\t$hypo_putative\n" if ($hypo_putative && $gene);
    }

    select STDOUT;
    close $input_file_fh;
    close $error_file_fh;
    close $out_file_fh;
    return 1;
}



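=begin comment

The tbl layout written by read_tab_write_tbl (a '>Feature' line per SeqID, a 'start stop primary_tag' line per feature, and one triple-tab-indented 'tag value' line per qualifier) can be illustrated with a heavily simplified sketch. 'row_to_tbl' and the sample row are hypothetical and not part of tbl2tab.pl; the sketch deliberately leaves out the '-g' handling, the mandatory-field checks, and the 'protein_id' generation.

```perl
use strict;
use warnings;

# Hypothetical, simplified converter for a single tab row
sub row_to_tbl {
    my ($header_ref, $cells_ref) = @_;
    my $tbl = "$cells_ref->[1]\t$cells_ref->[2]\t$cells_ref->[3]\n"; # start, stop, primary tag
    for my $i (4 .. $#{$header_ref}) { # qualifier columns follow the four mandatory ones
        my $value = $cells_ref->[$i];
        next if (!defined $value || $value eq '');
        $tbl .= "\t\t\t$header_ref->[$i]\t$_\n" foreach (split(/~/, $value)); # one line per '~'-separated value
    }
    return $tbl;
}

my @header = qw(seq_id start stop primary_tag locus_tag product EC_number);
my @cells  = ('contig1', 1, 900, 'CDS', 'ABC_0001', 'alcohol dehydrogenase', '1.1.1.1~1.1.1.2');
print ">Feature $cells[0]\n", row_to_tbl(\@header, \@cells);
```

The two '~'-separated EC numbers end up as two separate 'EC_number' qualifier lines, mirroring the split in print_tag2tbl.

=end comment

=cut
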
### Read in data from tbl input file
sub read_tbl {
    open (my $input_file_fh, "<", $Input_File);

    my $seq_id = <$input_file_fh>; # SeqID
    $seq_id =~ s/\R/\012/; # convert line to unix-style line endings
    chomp $seq_id;
    if ($seq_id !~ /^>Feature/) { # check if tbl file starts with mandatory '>Feature' and get first SeqID
        die "\n### Fatal error: tbl file doesn't start with a '>Feature SeqID' line. Sure this is a valid tbl file?\nExiting program!\n\n";
    } else {
        $seq_id =~ s/>Feature (\S+)\s*$/$1/; # only use non-whitespace characters as SeqID
    }

    my %data; # hash-in-hash-in-hash to store tbl input data
    my $pos_key; # store start..stop for each primary tag and use as key in %data
    my $primary_tag; # store previous primary tag to determine if values for repeatedly occurring tags should be concatenated
    my %tags_max_count; # hash to store all occurring tags with their maximal number of occurrences (within a single primary tag) in the tbl file for final tab column headers
    my @tags; # array to store all tags of each primary tag, supplement to %tags_max_count

    while (<$input_file_fh>) {
        $_ =~ s/\R/\012/; # convert line to unix-style line ending
        chomp;
        next if ($_ =~ /^\s*$/); # skip empty and whitespace-only lines
        my @fields = split(/\t/, $_);

        ### get next SeqID from '>Feature' line for multi-contig/replicon tbl files
        if ($fields[0] =~ /^>Feature (\S+)\s*$/) {
            $seq_id = $1;

        ### get primary tags/features and fill %tags_max_count
        } elsif ($fields[0] =~ /^\d+$/) { # $fields[2] with primary tag
            foreach my $tag (@tags) { # fill %tags_max_count for previous primary tag
                my $count = grep { $_ eq $tag } @tags; # exact match, so e.g. 'gene' doesn't also count 'gene_synonym'
                $tags_max_count{$tag} = $count if (!$tags_max_count{$tag} || $tags_max_count{$tag} < $count);
            }
            @tags = (); # empty tags array for new current primary tag

            $pos_key = "$fields[0]..$fields[1]"; # position of primary tag used as key for %data, "start..stop"
            $primary_tag = $fields[2];
            if (!$data{$seq_id}->{$pos_key}->{'primary_tag'} || $data{$seq_id}->{$pos_key}->{'primary_tag'} =~ /gene/) { # if primary tag not present or overwrite accessory 'gene' primary tag
                $data{$seq_id}->{$pos_key}->{'primary_tag'} = $primary_tag; # store data in anonymous hash-in-hash-in-hash
                $data{$seq_id}->{$pos_key}->{'start'} = $fields[0]; # to be able to sort afterwards via the start position
            } elsif ($data{$seq_id}->{$pos_key}->{'primary_tag'} =~ /pseudo/) { # 'gene' primary tag with '/pseudo' tag will be replaced by 'pseudo' primary tag for pseudo-genes (see below), however if 'gene' primary tag is ACCESSORY to CDS|RNA primary tag replace by this primary tag and include '/pseudo' tag
                $data{$seq_id}->{$pos_key}->{'primary_tag'} = $primary_tag;
                $data{$seq_id}->{$pos_key}->{'pseudo'} = 'T'; # value 'T' for true
            }

        ### get tags/qualifiers
        } elsif ($fields[3] =~ /^\w+/) {
            push(@tags, $fields[3]) if ($fields[3] !~ /gene_desc/); # store tags for current primary tag; skip '/gene_desc' as reserved for pseudo-genes (replaced by '/product' see below)
            if ($fields[3] =~ /pseudo/) {
                if ($data{$seq_id}->{$pos_key}->{'primary_tag'} =~ /gene/) { # change 'gene' primary tag of pseudo-genes to 'pseudo' (if accessory 'gene' primary tag will be replaced by *actual* primary tag, see above)
                    $data{$seq_id}->{$pos_key}->{'primary_tag'} = 'pseudo';
                } else { # else include a '/pseudo' tag with value 'T' for true
                    $data{$seq_id}->{$pos_key}->{'pseudo'} = 'T';
                }
                next; # next line
            }

            ### remove quotations from values introduced by Excel by saving as tab-separated file (see above)
            $fields[4] =~ s/""/"/g;
            $fields[4] =~ s/^"//;
            $fields[4] =~ s/"$//;

            ### adjust '/gene' and '/product' values to NCBI standard
            if ($fields[3] =~ /gene/) {
                $fields[4] =~ s/(\w+)/\l$1/; # first character of gene name should be lower case
                $fields[4] =~ s/^(\w)$/\u$1/; # one letter phage genes should be upper case
            } elsif ($fields[3] =~ /product/) {
                $fields[4] =~ s/\b([A-Z][a-z]{3,})/\l$1/g; # lower the case for '/product' value initials
                $fields[4] =~ s/(rossman|willebrand)/\u$1/; # exception to the rule
            }

            if ($fields[3] =~ /gene_desc/) { # '/gene_desc' tags from pseudo-genes replaced by '/product' for resulting tab file
                $data{$seq_id}->{$pos_key}->{'product'} = $fields[4];
                next;
            }
            if ($data{$seq_id}->{$pos_key}->{$fields[3]} && $data{$seq_id}->{$pos_key}->{'primary_tag'} =~ /$primary_tag/) { # tag already exists for this position (e.g. several EC_numbers), concatenate the additional values with '~' as separator only WITHIN the same primary tag (OTHERWISE overwrite in 'else' below)
                $data{$seq_id}->{$pos_key}->{$fields[3]} .= '~'.$fields[4];
            } else { # tag doesn't exist yet or overwrite if current primary tag at the same position of previous (e.g. accessory 'gene' primary tag to CDS/RNA primary tag)
                $data{$seq_id}->{$pos_key}->{$fields[3]} = $fields[4];
            }
        }
    }
    foreach my $tag (@tags) { # fill %tags_max_count for last primary tag
        my $count = grep { $_ eq $tag } @tags; # exact match, see above
        $tags_max_count{$tag} = $count if (!$tags_max_count{$tag} || $tags_max_count{$tag} < $count);
    }
    close $input_file_fh;
    return \%data, \%tags_max_count;
}



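=begin comment

read_tbl tracks, for every tag, the largest number of times it occurs within a single feature, because that number determines how many columns the tag gets in the tab header. A minimal sketch of that bookkeeping follows; 'update_max_count' and the sample tag lists are hypothetical and not part of tbl2tab.pl, and the sketch assumes exact tag-name matching.

```perl
use strict;
use warnings;

# Hypothetical helper: update the running per-tag maxima with the tags of one feature
sub update_max_count {
    my ($max_count_ref, @tags) = @_;
    my %count;
    $count{$_}++ foreach (@tags); # occurrences of each tag within this feature
    foreach my $tag (keys %count) {
        $max_count_ref->{$tag} = $count{$tag}
            if (!$max_count_ref->{$tag} || $max_count_ref->{$tag} < $count{$tag});
    }
    return;
}

my %tags_max_count;
update_max_count(\%tags_max_count, qw(locus_tag product EC_number EC_number));
update_max_count(\%tags_max_count, qw(locus_tag gene gene_synonym product));
print "$_\t$tags_max_count{$_}\n" foreach (sort keys %tags_max_count);
```

Counting with a hash per feature avoids substring matches between tag names such as 'gene' and 'gene_synonym'.

=end comment

=cut
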
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
656 ### Write data to tab output file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
657 sub write_tab {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
658 my ($data_hash_ref, $tags_max_count_hash_ref) = @_;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
659 file_exist($Out_File); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
660 open (my $out_file_fh, ">", $Out_File);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
661 select $out_file_fh; # select fh for standard print/f output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
662
    ### print header for tab result file
    print "seq_id\tstart\tstop\tprimary_tag\tlocus_tag"; # mandatory columns/fields in tab file
    if ($Opt_Concat) {
        foreach (sort keys %{$tags_max_count_hash_ref}) { # print residual tags
            print "\t$_" if (!/locus_tag/);
        }
    } else {
        foreach (sort keys %{$tags_max_count_hash_ref}) {
            print "\t$_" x $tags_max_count_hash_ref->{$_} if (!/locus_tag/); # print max occurrence (in tbl) of each residual tag
        }
    }

    print "\tspreadsheet_locus_tag\tposition\tdistance\tgene_number\tcontig_order" if ($Opt_Spreadsheet); # print optional spreadsheet header columns
    print "\n";

    ### variables for optional spreadsheet formulas
    my @spread_columns = ("A".."AZ"); # columns in spreadsheet software for formulas
    my ($tags_column_count, $spread_row_count, $spread_contig_order) = (0, 1, 1); # declared unconditionally, as the behaviour of 'my ... if' is undefined
    if ($Opt_Spreadsheet) {
        if ($Opt_Concat) {
            $tags_column_count = (scalar keys %{$tags_max_count_hash_ref}) - scalar grep($_ =~ /locus_tag/, keys %{$tags_max_count_hash_ref}); # subtract locus_tag for correct spreadsheet formulas
        } else {
            foreach (keys %{$tags_max_count_hash_ref}) {
                next if ($_ =~ /locus_tag/);
                $tags_column_count += $tags_max_count_hash_ref->{$_};
            }
        }
    }

    ### print data from hash into tab result file, optionally with G2L-style spreadsheet formulas
    foreach my $seq_id (sort keys %{$data_hash_ref}) {
        foreach my $pos (sort {$data_hash_ref->{$seq_id}->{$a}->{'start'} <=> $data_hash_ref->{$seq_id}->{$b}->{'start'}} keys %{$data_hash_ref->{$seq_id}}) { # sort each position entry in %data via start position
            print $seq_id;
            my ($start, $stop) = split(/\.\./, $pos);
            print "\t$start\t$stop";
            print "\t$data_hash_ref->{$seq_id}->{$pos}->{'primary_tag'}"; # primary_tag should always be present
            print_tag2tab('locus_tag', $data_hash_ref, $seq_id, $pos, 1); # subroutine; locus_tag should always occur just once per primary tag
            foreach (sort keys %{$tags_max_count_hash_ref}) { # print residual tags
                print_tag2tab($_, $data_hash_ref, $seq_id, $pos, $tags_max_count_hash_ref->{$_}) if (!/locus_tag/); # subroutine
            }

            ### G2L-style spreadsheet formulas
            if ($Opt_Spreadsheet) {
                $spread_row_count++;
                if ($Formula_Lang_Spreadsheet =~ /^e/i) { # English formula syntax: ',' separates arguments
                    print "\t=\"$Locus_Prefix\"", '&"_"&A', "$spread_row_count&TEXT(", $spread_columns[$tags_column_count+8], $spread_row_count, ',"0000")&"0"'; # spreadsheet column 'spreadsheet_locus_tag'
                } elsif ($Formula_Lang_Spreadsheet =~ /^g/i) { # German formula syntax: ';' separates arguments
                    print "\t=\"$Locus_Prefix\"", '&"_"&A', "$spread_row_count&TEXT(", $spread_columns[$tags_column_count+8], $spread_row_count, ';"0000")&"0"';
                }
                print "\t=MIN(B$spread_row_count:C$spread_row_count)"; # spreadsheet column 'position'
                print "\t=", $spread_columns[$tags_column_count+6], $spread_row_count + 1, "-MAX(B$spread_row_count:C$spread_row_count)"; # spreadsheet column 'distance'
                if ($Formula_Lang_Spreadsheet =~ /^e/i) {
                    print "\t=IF(", $spread_columns[$tags_column_count+9], $spread_row_count, '=', $spread_columns[$tags_column_count+9], $spread_row_count - 1, ",$spread_columns[$tags_column_count+8]", $spread_row_count - 1, '+1,1)'; # spreadsheet column 'gene_number'
                } elsif ($Formula_Lang_Spreadsheet =~ /^g/i) { # 'WENN' is the German name of 'IF'
                    print "\t=WENN(", $spread_columns[$tags_column_count+9], $spread_row_count, '=', $spread_columns[$tags_column_count+9], $spread_row_count - 1, ";$spread_columns[$tags_column_count+8]", $spread_row_count - 1, '+1;1)';
                }
                print "\t$spread_contig_order"; # spreadsheet column 'contig_order'
            }

            print "\n";
        }
        $spread_contig_order++; # next contig/replicon (SeqID) in tbl file
    }

    select STDOUT;
    close $out_file_fh;
    return 1;
}