Mercurial > repos > dereeper > pangenome_explorer
comparison COG/bac-genomics-scripts/prot_finder/binary_group_stats.pl @ 3:e42d30da7a74 draft
Uploaded
author | dereeper |
---|---|
date | Thu, 30 May 2024 11:52:25 +0000 |
parents | |
children |
comparison
equal
deleted
inserted
replaced
2:97e4e3e818b6 | 3:e42d30da7a74 |
---|---|
1 #!/usr/bin/perl | |
2 | |
3 ####### | |
4 # POD # | |
5 ####### | |
6 | |
7 =pod | |
8 | |
9 =head1 NAME | |
10 | |
11 C<binary_group_stats.pl> - categorize binary matrix rows according to | |
12 column groups | |
13 | |
14 =head1 SYNOPSIS | |
15 | |
16 C<perl binary_group_stats.pl -i binary_matrix.tsv -g group_file.tsv | |
17 -p E<gt> overall_stats.tsv> | |
18 | |
19 =head1 DESCRIPTION | |
20 | |
21 Categorize rows of a delimited TEXT input B<binary> matrix (option | |
22 B<-i>) according to column group affiliations. All fields of the binary | |
23 matrix need to be filled with either a B<0> indicating absence or a | |
24 B<1> indicating presence, i.e. all rows need to have the same number of | |
25 columns. Use option B<-d> to set the delimiter of the input matrix, | |
26 default is set to tab-delimited/separated matrices. | |
27 | |
28 The group affiliations of the columns are intended to get overall | |
29 presence/absence statistics for groups of columns and not simply | |
30 single columns of the matrix. Percentage inclusion (option | |
31 B<-cut_i>) and exclusion (option B<-cut_e>) cutoffs can be set to | |
32 define how strict the presence/absence of column groups within a row | |
33 are defined. Of course groups can also hold only single column | |
34 headers to get single column statistics. Group affiliations are | |
35 defined in a mandatory B<tab-delimited> group input file (option | |
36 B<-g>), including the column headers from the input binary matrix, | |
37 with B<minimal two> and B<maximal four> groups. | |
38 | |
39 The script was designed to handle a presence/absence query protein | |
40 binary matrix from a C<prot_finder.pl> protein homolog search in | |
41 bacterial genomes with a subsequent C<prot_binary_matrix.pl> run to | |
42 create the matrix. But, any binary matrix can be used for | |
43 C<binary_group_stats.pl>, optionally beforehand transposed with | |
44 C<transpose_matrix.pl>. However, column headers in the first row and | |
45 row headers in the first column are B<mandatory> for the input | |
46 binary matrix. Only alphanumeric (a-z, A-Z, 0-9), underscore (_), | |
47 dash (-), and period (.) characters are allowed for the B<column | |
48 headers> and B<group names> in the group file (option B<-g>) to | |
49 avoid downstream problems with the operating/file system. As a | |
50 consequence, also no whitespaces are allowed in these! Additionally, | |
51 B<column headers>, B<row headers>, and B<group names> need to be | |
52 B<unique>. | |
53 | |
54 <binary_group_stats.pl> is based upon | |
55 L<C<po2group_stats.pl>|/po2group_stats>, which does the same thing | |
56 for genomes in an ortholog/paralog output matrix from a | |
57 L<B<Proteinortho5>|http://www.bioinf.uni-leipzig.de/Software/proteinortho/> | |
58 calculation. | |
59 | |
60 To explain the logic behind the categorization, the following | |
61 annotation for example groups will be used. A following '1' | |
62 exemplifies a group column presence count in a respective row E<gt>= | |
63 the rounded inclusion cutoff, a '0' a group column presence count | |
64 E<lt>= the rounded exclusion cutoff. The presence and absence of | |
65 rows for the column group affiliations are structured in different | |
66 categories depending on the number of groups. For B<two groups> | |
67 (e.g. A and B) there are five categories: 'A specific' (A:B = 1:0), | |
68 'B specific' (0:1), 'cutoff core' (1:1), 'underrepresented' (0:0), | |
69 and 'unspecific'. Unspecific rows have a column presence count for | |
70 at least B<one> group outside the cutoffs (exclusion cutoff E<lt> | |
71 column presence count E<lt> inclusion cutoff) and thus cannot be | |
72 categorized. The respective row headers of these 'unspecific' rows | |
73 will only be printed to a final result file with option B<-u>. | |
74 Overall stats for all categories are printed to C<STDOUT> in a final | |
75 tab-delimited output matrix. | |
76 | |
77 B<Three groups> (A, B, and C) have the following nine categories: 'A | |
78 specific' (A:B:C = 1:0:0), 'B specific' (0:1:0), 'C specific' | |
79 (0:0:1), 'A absent' (0:1:1), 'B absent' (1:0:1), 'C absent' (1:1:0), | |
80 'cutoff core' (1:1:1), 'underrepresented' (0:0:0), and 'unspecific'. | |
81 | |
82 B<Four groups> (A, B, C, and D) are classified in 17 categories: 'A | |
83 specific' (A:B:C:D = 1:0:0:0), 'B specific' (0:1:0:0), 'C specific' | |
84 (0:0:1:0), 'D specific' (0:0:0:1), 'A-B specific' (1:1:0:0), 'A-C | |
85 specific' (1:0:1:0), 'A-D specific' (1:0:0:1), 'B-C specific' | |
86 (0:1:1:0), 'B-D specific' (0:1:0:1), 'C-D specific' (0:0:1:1), 'A | |
87 absent' (0:1:1:1), 'B absent' (1:0:1:1), 'C absent' (1:1:0:1), 'D | |
88 absent' (1:1:1:0), 'cutoff core' (1:1:1:1), 'underrepresented' | |
89 (0:0:0:0), and 'unspecific'. | |
90 | |
91 The resulting B<group> presence/absence (according to the cutoffs) | |
92 can also be printed to a group binary matrix (option B<-b>) in the | |
93 result directory (option B<-r>), excluding the 'unspecific' | |
94 category. Since the categories are the logics underlying venn | |
95 diagrams, you can also plot the results in a venn diagram using the | |
96 group binary matrix (option B<-p>). The 'underrepresented' category | |
97 is exempt from the venn diagram, because it is outside of venn | |
98 diagram logics. | |
99 | |
100 For an illustration of the logics have a look at the example venn | |
101 diagrams in the L<C<po2group_stats.pl>|/po2group_stats> folder | |
102 F</po2group_stats/pics/venn_diagram_logics.[svg|png]>. | |
103 | |
104 There are two optional categories (which are only considered for the | |
105 final print outs and in the final stats matrix, not for the group | |
106 binary matrix and the venn diagram): 'strict core' (option B<-co>) | |
107 for rows where B<all> columns have a presence '1'. Of course all the | |
108 'strict core' rows are also included in the 'cutoff core' category | |
109 ('strict core' is identical to 'cutoff core' with B<-cut_i> 1 and | |
110 B<-cut_e> 0). Option B<-s> activates the detection of | |
111 'singleton/column-specific' rows with a '1' in only B<one> column. | |
112 Depending on the cutoffs and number of columns in the groups, | |
113 category 'underrepresented' includes most of these singletons. | |
114 | |
115 Each row header is printed in the respective category output file in | |
116 the result directory. The number of row headers in the category | |
117 result files are the same as listed in the venn diagram and the | |
118 final stats matrix. Groups with only a single column header will | |
119 have the same result as the respective 'singleton' category (with | |
120 option B<-s>). | |
121 | |
122 =head1 OPTIONS | |
123 | |
124 =head2 Mandatory options | |
125 | |
126 =over 20 | |
127 | |
128 =item B<-i>=I<str>, B<-input>=I<str> | |
129 | |
130 Input delimited TEXT binary matrix (e.g. *.tsv, *.csv, or *.txt), | |
131 see option B<-d> | |
132 | |
133 =item B<-g>=I<str>, B<-groups_file>=I<str> | |
134 | |
135 Tab-delimited file with group affiliation for the columns from the | |
136 input binary matrix with B<minimal two> and B<maximal four> groups | |
137 (easiest to create in a spreadsheet software and save in | |
138 tab-separated format). B<All> column headers from the input binary | |
139 matrix need to be included. Column headers and group names can only | |
140 include alphanumeric (a-z, A-Z, 0-9), underscore (_), dash (-), and | |
141 period (.) characters (no whitespaces allowed either). Example | |
142 format with two column headers in group A, three in group B and D, | |
143 and one in group C: | |
144 | |
145 group_A\tgroup_B\tgroup_C\tgroup_D | |
146 column_header1\tcolumn_header9\tcolumn_header3\tcolumn_header8 | |
147 column_header7\tcolumn_header6\t\tcolumn_header5 | |
148 \tcolumn_header4\t\tcolumn_header2 | |
149 | |
150 =back | |
151 | |
152 =head2 Optional options | |
153 | |
154 =over 20 | |
155 | |
156 =item B<-h>, B<-help> | |
157 | |
158 Help (perldoc POD) | |
159 | |
160 =item B<-d>=I<str>, B<-delimiter>=I<str> | |
161 | |
162 Set delimiter of input binary matrix (e.g. comma ',', single | |
163 space ' ' etc.) [default = tab-delimited/separated] | |
164 | |
165 =item B<-r>=I<str>, B<-result_dir>=I<str> | |
166 | |
167 Path to result folder [default = inclusion and exclusion percentage | |
168 cutoff, './results_i#_e#'] | |
169 | |
170 =item B<-cut_i>=I<float>, B<-cut_inclusion>=I<float> | |
171 | |
172 Percentage inclusion cutoff for column presence counts in a group | |
173 per row, has to be E<gt> 0 and E<lt>= 1. Cutoff will be rounded | |
174 according to the column header number in each group and has to be | |
175 E<gt> the rounded exclusion cutoff in this group. [default = 0.9] | |
176 | |
177 =item B<-cut_e>=I<float>, B<-cut_exclusion>=I<float> | |
178 | |
179 Percentage exclusion cutoff, has to be E<gt>= 0 and E<lt> 1. Rounded | |
180 cutoff has to be E<lt> rounded inclusion cutoff. [default = 0.1] | |
181 | |
182 =item B<-b>, B<-binary_group_matrix> | |
183 | |
184 Print a group binary matrix with the presence/absence column group | |
185 results according to the cutoffs (excluding 'unspecific' category | |
186 rows) | |
187 | |
188 =item B<-p>, B<-plot_venn> | |
189 | |
190 Plot venn diagram from the group binary matrix (except 'unspecific' | |
191 and 'underrepresented' categories) with function C<venn> from B<R> | |
192 package B<gplots>, requires option B<-b> | |
193 | |
194 =item B<-co>, B<-core_strict> | |
195 | |
196 Include 'strict core' category in output for rows where B<all> | |
197 columns have a '1' | |
198 | |
199 =item B<-s>, B<-singletons> | |
200 | |
201 Include singleton/column-specific rows for each column header in the | |
202 output, activates also overall column header presence ('1') counts | |
203 in final stats matrix for columns with singletons | |
204 | |
205 =item B<-u>, B<-unspecific> | |
206 | |
207 Include 'unspecific' category in output | |
208 | |
209 =item B<-a>, B<-all_column_presence_overall> | |
210 | |
211 Report overall presence counts for all column headers (appended to | |
212 the final stats matrix), also those without singletons; will include | |
213 all overall column header presence counts without option B<-s> | |
214 | |
215 =item B<-v>, B<-version> | |
216 | |
217 Print version number to C<STDERR> | |
218 | |
219 =back | |
220 | |
221 =head1 OUTPUT | |
222 | |
223 =over 20 | |
224 | |
225 =item C<STDOUT> | |
226 | |
227 The tab-delimited final stats matrix is printed to C<STDOUT>. | |
228 Redirect or pipe into another tool as needed. | |
229 | |
230 =item F<./results_i#_e#> | |
231 | |
232 All output files are stored in a results folder | |
233 | |
234 =item F<./results_i#_e#/[*_specific|*_absent|cutoff_core|underrepresented]_rows.txt> | |
235 | |
236 Files including the row headers for rows in non-optional categories | |
237 | |
238 =item (F<./results_i#_e#/[*_singletons|strict_core|unspecific]_rows.txt>) | |
239 | |
240 Optional category output files with the respective row headers | |
241 | |
242 =item (F<./results_i#_e#/binary_matrix.tsv>) | |
243 | |
244 Tab-delimited binary matrix of group presence/absence results | |
245 according to cutoffs (excluding 'unspecific' rows) | |
246 | |
247 =item (F<./results_i#_e#/venn_diagram.pdf>) | |
248 | |
249 Venn diagram for non-optional categories (except 'unspecific' and | |
250 'underrepresented' categories) | |
251 | |
252 =back | |
253 | |
254 =head1 EXAMPLES | |
255 | |
256 =over | |
257 | |
258 =item C<perl binary_group_stats.pl -i binary_matrix_transposed.csv -g group_file.tsv -d , -r result_dir -cut_i 0.7 -cut_e 0.2 -b -p -co -s -u -a E<gt> overall_stats.tsv> | |
259 | |
260 =back | |
261 | |
262 =head1 DEPENDENCIES | |
263 | |
264 =over | |
265 | |
266 =item B<Statistical computing language L<R|http://www.r-project.org/>> | |
267 | |
268 C<Rscript> is needed to plot the venn diagram with option B<-p>, | |
269 tested with version 3.2.2 | |
270 | |
271 =item B<L<gplots|https://cran.r-project.org/web/packages/gplots/index.html>> | |
272 | |
273 Package needed for B<R> to plot the venn diagram with function | |
274 C<venn>. Tested with B<gplots> version 2.17.0. | |
275 | |
276 =back | |
277 | |
278 =head1 VERSION | |
279 | |
280 0.1 06-06-2016 | |
281 | |
282 =head1 AUTHOR | |
283 | |
284 Andreas Leimbach aleimba[at]gmx[dot]de | |
285 | |
286 =head1 LICENSE | |
287 | |
288 This program is free software: you can redistribute it and/or modify | |
289 it under the terms of the GNU General Public License as published by | |
290 the Free Software Foundation; either version 3 (GPLv3) of the | |
291 License, or (at your option) any later version. | |
292 | |
293 This program is distributed in the hope that it will be useful, but | |
294 WITHOUT ANY WARRANTY; without even the implied warranty of | |
295 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU | |
296 General Public License for more details. | |
297 | |
298 You should have received a copy of the GNU General Public License | |
299 along with this program. If not, see L<http://www.gnu.org/licenses/>. | |
300 | |
301 =cut | |
302 | |
303 | |
304 ######## | |
305 # MAIN # | |
306 ######## | |
307 | |
308 use strict; | |
309 use warnings; | |
310 use autodie; | |
311 use Getopt::Long; | |
312 use Pod::Usage; | |
313 | |
314 my $Cmdline = "$0 @ARGV"; # used call command | |
315 | |
316 ### Get the options with Getopt::Long | |
317 my $Binary_Matrix_File; # binary input matrix | |
318 my $Groups_File; # tab-separated file with group affiliation for the input binary matrix column headers | |
319 my $Delimiter = "\t"; # set delimiter/separator of input matrix | |
320 my $Result_Dir; # path to results folder; default is set below to '"./results_i".$Inclusion_Cutoff."c".$Exclusion_Cutoff' | |
321 my $Inclusion_Cutoff = 0.9; # inclusion percentage cutoff for column presence ('1') count per row (floating point number >0 && <=1) | |
322 my $Exclusion_Cutoff = 0.1; # exclusion percentage cutoff (floating point number >=0 && <1) | |
323 my $Opt_Group_Binary; # print resulting group binary matrix according to the group inclusion/exclusion results | |
324 my $Opt_Venn; # create venn diagram with R package 'gplots' function 'venn'; requires $Opt_Group_Binary | |
325 my $Opt_Strict_Core; # report also strict core, i.e. rows with presence in ALL columns of input binary matrix (identical to 'cutoff_core' with inclusion and exclusion cutoffs of 1 and 0 resp.) | |
326 my $Opt_Singletons; # report also singletons for each column header (outside of the group classifications) | |
327 my $Opt_Unspecific; # report also group-unspecific rows ('exclusion < row_group_column-presence_count < inclusion' for any group) | |
328 my $Opt_Columns_Overall; # report overall row presence counts also for column headers without singletons | |
329 my $VERSION = 0.1; | |
330 my ($Opt_Version, $Opt_Help); | |
331 GetOptions ('input=s' => \$Binary_Matrix_File, | |
332 'groups_file=s' => \$Groups_File, | |
333 'delimiter=s' => \$Delimiter, | |
334 'result_dir=s' => \$Result_Dir, | |
335 'cut_inclusion=f' => \$Inclusion_Cutoff, | |
336 'cut_exclusion=f' => \$Exclusion_Cutoff, | |
337 'binary_group_matrix' => \$Opt_Group_Binary, | |
338 'plot_venn' => \$Opt_Venn, | |
339 'core_strict' => \$Opt_Strict_Core, | |
340 'singletons' => \$Opt_Singletons, | |
341 'unspecific' => \$Opt_Unspecific, | |
342 'all_column_presence_overall' => \$Opt_Columns_Overall, | |
343 'version' => \$Opt_Version, | |
344 'help|?' => \$Opt_Help) | |
345 or pod2usage(-verbose => 1, -exitval => 2); | |
346 | |
347 | |
348 | |
349 ### Run perldoc on POD and enforce mandatory options | |
350 pod2usage(-verbose => 2) if ($Opt_Help); | |
351 die "$0 $VERSION\n" if ($Opt_Version); | |
352 | |
353 if (!$Binary_Matrix_File || !$Groups_File) { | |
354 my $warning = "\n### Fatal error: Mandatory options '-i' or '-g' or their arguments are missing!\n"; | |
355 pod2usage(-verbose => 1, -message => $warning, -exitval => 2); | |
356 } | |
357 | |
358 if (($Inclusion_Cutoff <= 0 || $Inclusion_Cutoff > 1) || ($Exclusion_Cutoff < 0 || $Exclusion_Cutoff >= 1)) { | |
359 my $warning = "\n### Fatal error:\nInclusion (0 < '-cut_i' <= 1) or exclusion (0 <= '-cut_e' < 1) cutoff(s) not chosen correctly!\n"; | |
360 pod2usage(-verbose => 1, -message => $warning, -exitval => 2); | |
361 } | |
362 | |
363 print STDERR "Script call command: $Cmdline\n"; # print call command after '-h|-v' | |
364 | |
365 if ($Opt_Venn && !$Opt_Group_Binary) { | |
366 warn "Option '-p' to draw the venn diagram set, but not it's required option '-b' for the output group binary matrix. Forcing option '-b'!\n"; | |
367 $Opt_Group_Binary = 1; | |
368 } | |
369 | |
370 | |
371 | |
372 ### Parse column headers (from the input binary matrix) in groups file | |
373 print STDERR "Parsing group file '$Groups_File' "; # run status of script | |
374 open (my $Groups_File_Fh, "<", "$Groups_File"); | |
375 | |
376 my @Group_Names; # store group name column order | |
377 my %Column_Header_Groups; # hash of hash with group name as key, internal hash with resp. input binary matrix column header anonymous array (key 'column_headers'), or calculated integer inclusion/exclusion cutoffs (key 'inclusion' etc.) | |
378 while (<$Groups_File_Fh>) { | |
379 chomp; | |
380 next if (/^\s*$/); # skip empty lines | |
381 | |
382 # check groups file header and get group names | |
383 if ($. == 1) { # header of groups file (first line) | |
384 die "\n### Fatal error:\nGroups file '$Groups_File' does not have a correctly formatted header line with at least TWO and maximal FOUR tab-separated group name columns (without other whitespaces)!\n" if (!/^\S+\t\S+\t?\S*\t?\S*$/); | |
385 foreach my $group_name (split(/\t/)) { | |
386 check_file_system_conformity($group_name, 'group names'); # subroutine to check string only contains alphanumeric '0-9a-zA-Z', underscore '_', dash '-', and period '.' characters | |
387 die "\n### Fatal error:\nGroup name '$group_name' is not unique in the group file '$Groups_File'! Please choose unique names for the groups in the group file!\n" if (grep {/$group_name/} @Group_Names); | |
388 push (@Group_Names, $group_name); # store groups array | |
389 } | |
390 next; # skip header line for subsequent code | |
391 } | |
392 | |
393 # parse input binary matrix column headers in groups file | |
394 die "\n### Fatal error:\nLine '$.' in the groups file '$Groups_File' does not have the correct format with maximal four tab-separated column header names from the input binary matrix '$Binary_Matrix_File' according to the groups (without other whitespaces):\n$_\n" if (!/^\S*\t?\S*\t?\S*\t?\S*$/); | |
395 my $column = -1; # group file column counter to associate input binary matrix column header to correct group (@Group_Names array zero-based, thus start with '-1') | |
396 foreach my $column_header (split(/\t/)) { | |
397 $column++; | |
398 next if ($column_header =~ /^$/); # skip empty columns in group file, in case the groups have uneven column header numbers | |
399 warn "\n### Warning:\nInput binary matrix column header '$column_header' in the group file has a binary ('0' or '1') instead of an alphanumeric string, sure this is a column header? The matrix needs to have mandatory row and column headers, see help with '-h'!\n" if ($column_header =~ /^(0|1)$/); | |
400 check_file_system_conformity($column_header, 'column headers'); # subroutine | |
401 die "\n### Fatal error:\nInput binary matrix column header '$column_header' is not unique in the group file '$Groups_File', but needs to be in the group file and the input binary matrix '$Binary_Matrix_File'! The group file needs to contain all of the column headers from the input binary matrix and vice versa. Make sure the strings correspond exactly (also lower/uppercase) to the header of the matrix!\n\n" if (check_column_header_in_groups($column_header)); # subroutine to check if a column header is present in %Column_Header_Groups | |
402 push (@{ $Column_Header_Groups{$Group_Names[$column]}->{'column_headers'} }, $column_header); # push into anonymous array in hash of hash | |
403 } | |
404 } | |
405 close $Groups_File_Fh; | |
406 print STDERR "with ", scalar map(@{ $Column_Header_Groups{$_}->{'column_headers'} }, keys %Column_Header_Groups), " input binary matrix column headers ...\n"; # run status of script | |
407 | |
408 foreach (@Group_Names) { | |
409 die "\n### Fatal error:\nGroup '$_' does not contain an element, please add a suitable input binary matrix column header or remove the group from the group file '$Groups_File'!\n" if (!$Column_Header_Groups{$_}); | |
410 } | |
411 | |
412 | |
413 | |
414 ### Round inclusion and exclusion count cutoffs for each group | |
415 foreach (sort keys %Column_Header_Groups) { | |
416 $Column_Header_Groups{$_}->{'inclusion'} = int((scalar @{ $Column_Header_Groups{$_}->{'column_headers'} } * $Inclusion_Cutoff) + 0.5); # round positive number half up to integer ((s)printf rounds half to even), see e.g. https://stackoverflow.com/questions/178539/how-do-you-round-a-floating-point-number-in-perl | |
417 print STDERR "Group '$_' with ", scalar @{ $Column_Header_Groups{$_}->{'column_headers'} }, " input binary matrix column headers has a rounded inclusion cutoff of $Column_Header_Groups{$_}->{'inclusion'} column header(s) and "; | |
418 | |
419 $Column_Header_Groups{$_}->{'exclusion'} = int((scalar @{ $Column_Header_Groups{$_}->{'column_headers'} } * $Exclusion_Cutoff) + 0.5); | |
420 print STDERR "an exclusion cutoff of $Column_Header_Groups{$_}->{'exclusion'} column header(s)!\n"; | |
421 die "\n### Fatal error:\nThe rounded inclusion cutoff has to be greater than the exclusion cutoff for all groups, please choose appropriate cutoffs!\n" if ($Column_Header_Groups{$_}->{'inclusion'} <= $Column_Header_Groups{$_}->{'exclusion'}); | |
422 } | |
423 | |
424 | |
425 | |
426 ### Create result folder | |
427 if (!$Result_Dir) { # can't give default before 'GetOptions' in case cutoffs are set by the user | |
428 $Result_Dir = "./results_i".$Inclusion_Cutoff."_e".$Exclusion_Cutoff; | |
429 } else { | |
430 $Result_Dir =~ s/\/$//; # get rid of a potential '/' at the end of $Result_Dir path | |
431 } | |
432 if (-e $Result_Dir) { | |
433 empty_dir($Result_Dir); # subroutine to empty a directory with user interaction | |
434 } else { | |
435 mkdir $Result_Dir; | |
436 } | |
437 | |
438 | |
439 | |
440 ### Parse rows in input binary matrix | |
441 print STDERR "Parsing input binary matrix '$Binary_Matrix_File' ...\n"; # run status of script | |
442 open (my $Binary_Matrix_Fh, "<", "$Binary_Matrix_File"); | |
443 | |
444 my @Column_Headers; # store input binary matrix column header order | |
445 my %Row_Header_Groups; # hash of hash with row header as key (first column of each row) and internal hash with column header as key and '1' for presence ('0' absence is not stored) | |
446 while (<$Binary_Matrix_Fh>) { | |
447 chomp; | |
448 next if (/^\s*$/); # skip empty lines | |
449 die "\n### Fatal error:\nSet delimiter/separator '$Delimiter' (option '-d') not found in line '$.' of the input binary matrix '$Binary_Matrix_File', sure the correct one is set?\n" if ($_ !~ /$Delimiter/); | |
450 | |
451 # check input binary matrix file header and get column header names | |
452 if ($. == 1) { # header of input binary matrix file (first line) | |
453 my (undef, @headers) = split(/$Delimiter/); # get rid of first line row header | |
454 foreach my $column_header (@headers) { | |
455 die "\n### Fatal error:\nColumn header '$column_header' is present in the input binary matrix '$Binary_Matrix_File' but not in the groups file '$Groups_File'. However, the group file needs to contain all of the column headers from the matrix and vice versa. Make sure the strings correspond exactly (also lower/uppercase) to the matrix header!\n" unless (check_column_header_in_groups($column_header)); # subroutine to check if a column header is present in %Column_Header_Groups | |
456 push (@Column_Headers, $column_header); | |
457 } | |
458 die "\n### Fatal error:\nNot the same amount of input binary matrix column headers in the group file '$Groups_File' and matrix '$Binary_Matrix_File', but the group file needs to contain all of the column headers in the matrix and vice versa!\n" if (@Column_Headers != map(@{ $Column_Header_Groups{$_}->{'column_headers'} }, keys %Column_Header_Groups)); | |
459 next; # skip header line of input binary matrix for subsequent code | |
460 } | |
461 | |
462 # parse input binary matrix | |
463 my ($row_header, @presence_absence) = split(/$Delimiter/); | |
464 die "\n### Fatal error:\nInput binary matrix '$Binary_Matrix_File' row '$row_header' doesn't have the same amount of columns as overall column headers, however the matrix needs to be filled/consistent in every row!\n" unless (@Column_Headers == @presence_absence); | |
465 die "\n### Fatal error:\nInput binary matrix row header '$row_header' is not unique in the input binary matrix '$Binary_Matrix_File', however all row headers need to be unique!\n" if ($Row_Header_Groups{$row_header}); | |
466 warn "\n### Warning:\nLine number '$.' of the binary input matrix '$Binary_Matrix_File' has a binary ('0' or '1') row header instead of an alphanumeric string, sure this is a row header? The matrix needs to have mandatory row and column headers, see help with '-h'!\n\n" if ($row_header =~ /^(0|1)$/); | |
467 my $column = -1; # column counter to associate presence/absence field to correct column header (@Column_Headers array zero-based, thus start with '-1') | |
468 foreach (@presence_absence) { | |
469 $column++; | |
470 die "\n###Fatal error:\nThere's an empty field in the input binary matrix '$Binary_Matrix_File' in row '$row_header', column '$column+2'. This shouldn't happen, as all the fields of the BINARY matrix should either have a '0' indicating absence or a '1' indicating presence!\n" if (/^$/); | |
471 die "\n### Fatal error:\nThe input binary matrix '$Binary_Matrix_File' contains a field with '$_' in row/line '$.', column '$column+2', instead of the mandatory binary '0' indicating absence or '1' indicating presence!\n" if (!/^(0|1)$/); | |
472 next if (!$_); # skip absence columns in binary input matrix (indicated by '0'), have to skip after '$column++' otherwise column won't fit | |
473 $Row_Header_Groups{$row_header}{$Column_Headers[$column]} = $_; # store presence ('1') in hash of hash | |
474 } | |
475 } | |
476 close $Binary_Matrix_Fh; | |
477 | |
478 | |
479 | |
480 ### Calculate column stats from input binary matrix according to the group cutoffs, categorize the results and print to result files | |
481 print STDERR "Calculating column stats from the input binary matrix according to the group cutoffs ...\n"; # run status of script | |
482 | |
483 # open result file for group binary matrix and print header | |
484 my $Group_Binary_Matrix_File; # filename needed also for venn diagram R script below | |
485 my $Group_Binary_Matrix_Fh; | |
486 if ($Opt_Group_Binary) { | |
487 $Group_Binary_Matrix_File = "$Result_Dir/group_binary_matrix.tsv"; | |
488 open ($Group_Binary_Matrix_Fh, ">", "$Group_Binary_Matrix_File"); | |
489 print $Group_Binary_Matrix_Fh join("\t", @Group_Names), "\n"; # print header for group binary matrix | |
490 } | |
491 | |
492 my %Count_Group_Pres_Abs; # hash of hash to store presence count and output filehandle for each category ('core', 'group-specific' etc.) | |
493 foreach my $row_header (sort {lc($a) cmp lc($b)} keys %Row_Header_Groups) { | |
494 my %present_row_columns; # hash of array to store input binary matrix columns of a group with PRESENCE ('1') in the current row for cutoff tests and print outs | |
495 | |
496 # 'strict_core' category row, if ALL columns present in row (with option '-co') | |
497 if ($Opt_Strict_Core && keys %{$Row_Header_Groups{$row_header}} == @Column_Headers) { | |
498 $Count_Group_Pres_Abs{'strict_core'}->{'presence_count'}++; | |
499 open_result_file("$Result_Dir/strict_core_rows.txt", 'strict_core') if (!$Count_Group_Pres_Abs{'strict_core'}->{'fh'}); # subroutine to open result file and store fh (if not opened yet) | |
500 print { $Count_Group_Pres_Abs{'strict_core'}->{'fh'} } "$row_header\n"; # block around fh needed if not a simple string variable (or glob) | |
501 } | |
502 | |
503 # count and save number of columns with presence ('1') for the current $row_header with their corresponding group affiliation | |
504 foreach my $group (@Group_Names) { | |
505 foreach my $column_header (@{ $Column_Header_Groups{$group}->{'column_headers'} }) { | |
506 if ($Row_Header_Groups{$row_header}->{$column_header}) { # $column_header with presence in current $row_header | |
507 push(@{ $present_row_columns{$group} }, $column_header); # push into anonymous array in hash | |
508 | |
509 # 'singleton' category row if only ONE column with presence (with option '-s') | |
510 if ($Opt_Singletons && keys %{$Row_Header_Groups{$row_header}} == 1) { | |
511 $Count_Group_Pres_Abs{$column_header}->{'presence_count'}++; # here: category = $column_header | |
512 open_result_file("$Result_Dir/$column_header\_singletons.txt", $column_header) if (!$Count_Group_Pres_Abs{$column_header}->{'fh'}); # subroutine | |
513 print { $Count_Group_Pres_Abs{$column_header}->{'fh'} } "$row_header\n"; | |
514 } | |
515 } | |
516 } | |
517 } | |
518 | |
519 # filter column presence counts of the groups in this row according to inclusion-/exclusion-cutoffs | |
520 my @row_included_groups; # array for groups with number of present column in this row >= the inclusion cutoff | |
521 my @row_excluded_groups; # array for groups with column presence <= the exclusion cutoff | |
522 my @row_unspec_groups; # array for unspecific groups with 'exclusion < column_presence_count < inclusion' | |
523 foreach my $group (@Group_Names) { | |
524 if ($present_row_columns{$group}) { # only run if group has a column presence in this row (otherwise undefined ARRAY reference) | |
525 if (@{ $present_row_columns{$group} } >= $Column_Header_Groups{$group}->{'inclusion'}) { | |
526 push(@row_included_groups, $group); | |
527 } elsif (@{ $present_row_columns{$group} } <= $Column_Header_Groups{$group}->{'exclusion'}) { | |
528 push(@row_excluded_groups, $group); | |
529 } elsif (@{ $present_row_columns{$group} } > $Column_Header_Groups{$group}->{'exclusion'} && @{ $present_row_columns{$group} } < $Column_Header_Groups{$group}->{'inclusion'}) { | |
530 push(@row_unspec_groups, $group); | |
531 } | |
532 } else { # $present_row_columns{$group} == 0 for group (undef anonymous array), i.e. always <= exclusion | |
533 push(@row_excluded_groups, $group); | |
534 } | |
535 } | |
536 | |
537 # 'unspecific' category row if at least ONE group present in '@row_unspec_groups' | |
538 # no further categorization, i.e. skip rest of logics | |
539 if (@row_unspec_groups) { | |
540 $Count_Group_Pres_Abs{'unspecific'}->{'presence_count'}++; | |
541 if ($Opt_Unspecific) { # print only to output file with option '-u' | |
542 open_result_file("$Result_Dir/unspecific_rows.txt", 'unspecific') if (!$Count_Group_Pres_Abs{'unspecific'}->{'fh'}); # subroutine | |
543 print { $Count_Group_Pres_Abs{'unspecific'}->{'fh'} } "$row_header\n"; | |
544 } | |
545 next; # skip to next $row_header, because now unspec and not suitable for any further categories | |
546 } | |
547 | |
548 # print group binary matrix (in contrast to rows in category 'unspecific' above, 'underrepresented' rows below not skipped because package 'venn' can handle a row with only zeros in the group binary matrix) | |
549 if ($Opt_Group_Binary) { | |
550 my $i = 0; | |
551 foreach my $group (@Group_Names) { | |
552 $i++; | |
553 print $Group_Binary_Matrix_Fh "1" if (grep {/$group/} @row_included_groups); | |
554 print $Group_Binary_Matrix_Fh "0" if (grep {/$group/} @row_excluded_groups); | |
555 print $Group_Binary_Matrix_Fh "\t" if ($i < @Group_Names); | |
556 } | |
557 print $Group_Binary_Matrix_Fh "\n"; | |
558 } | |
559 | |
560 # 'cutoff_core' category row if ALL groups >= inclusion (includes 'strict_core' rows where ALL columns are present in the row) | |
561 # 2 groups A/B [2 inclus :0 exclus], 3 groups A/B/C [3:0], 4 groups A/B/C/D [4:0] | |
562 if (@row_included_groups == @Group_Names && @row_excluded_groups == 0) { | |
563 $Count_Group_Pres_Abs{'cutoff_core'}->{'presence_count'}++; | |
564 open_result_file("$Result_Dir/cutoff_core_rows.txt", 'cutoff_core') if (!$Count_Group_Pres_Abs{'cutoff_core'}->{'fh'}); # subroutine | |
565 print { $Count_Group_Pres_Abs{'cutoff_core'}->{'fh'} } "$row_header\n"; | |
566 | |
567 # 'underrepresented' category row if ALL groups <= exclusion_cutoff (depending on the cutoffs and number of columns in the groups will include most singletons) | |
568 # 2 groups [0:2], 3 [0:3], 4 [0:4] | |
569 } elsif (@row_included_groups == 0 && @row_excluded_groups == @Group_Names) { | |
570 $Count_Group_Pres_Abs{'underrepresented'}->{'presence_count'}++; | |
571 open_result_file("$Result_Dir/underrepresented_rows.txt", 'underrepresented') if (!$Count_Group_Pres_Abs{'underrepresented'}->{'fh'}); # subroutine | |
572 print { $Count_Group_Pres_Abs{'underrepresented'}->{'fh'} } "$row_header\n"; | |
573 | |
574 # 'group-specific' category row if only ONE group >= inclusion_cutoff | |
575 # 2 groups [1:1], 3 [1:2], 4 [1:3] | |
576 } elsif (@row_included_groups == 1 && @row_excluded_groups == @Group_Names - 1) { | |
577 $Count_Group_Pres_Abs{"$row_included_groups[0]_specific"}->{'presence_count'}++; # only one element in array thus [0] | |
578 open_result_file("$Result_Dir/$row_included_groups[0]_specific_rows.txt", "$row_included_groups[0]_specific") if (!$Count_Group_Pres_Abs{"$row_included_groups[0]_specific"}->{'fh'}); # subroutine | |
579 print { $Count_Group_Pres_Abs{"$row_included_groups[0]_specific"}->{'fh'} } "$row_header\n"; | |
580 | |
581 # 'group-absent' category row if ONE group <= exclusion_cutoff (only for 3 and 4 groups) | |
582 # 3 groups (A+B not C [2:1], A+C not B, and B+C not A) and 4 groups (A+B+C not D [3:1], A+B+D not C, A+C+D not B, B+C+D not A) | |
583 } elsif (@Group_Names >= 3 && (@row_included_groups == @Group_Names - 1 && @row_excluded_groups == 1)) { | |
584 $Count_Group_Pres_Abs{"$row_excluded_groups[0]_absent"}->{'presence_count'}++; | |
585 open_result_file("$Result_Dir/$row_excluded_groups[0]_absent_rows.txt", "$row_excluded_groups[0]_absent") if (!$Count_Group_Pres_Abs{"$row_excluded_groups[0]_absent"}->{'fh'}); # subroutine | |
586 print { $Count_Group_Pres_Abs{"$row_excluded_groups[0]_absent"}->{'fh'} } "$row_header\n"; | |
587 | |
588 # 'two group-specific' category row if TWO groups >= inclusion_cutoff (only for 4 groups) | |
589 # 4 groups (B+D not A or C [2:2], B+C not A or D, A+D not B or C, A+C not B or D, A+B not C or D, C+D not A or B) | |
590 } elsif (@Group_Names == 4 && (@row_included_groups == @row_excluded_groups)) { | |
591 $Count_Group_Pres_Abs{"$row_included_groups[0]-$row_included_groups[1]_specific"}->{'presence_count'}++; | |
592 open_result_file("$Result_Dir/$row_included_groups[0]-$row_included_groups[1]_specific_rows.txt", "$row_included_groups[0]-$row_included_groups[1]_specific") if (!$Count_Group_Pres_Abs{"$row_included_groups[0]-$row_included_groups[1]_specific"}->{'fh'}); # subroutine | |
593 print { $Count_Group_Pres_Abs{"$row_included_groups[0]-$row_included_groups[1]_specific"}->{'fh'} } "$row_header\n"; | |
594 | |
595 } else { | |
596 die "\n### Fatal error:\nNo category logic test fits for row '$row_header', which shouldn't happen. Have you smuggled only one or five groups into the groups file '$Groups_File' (which also shouldn't be possible)? The script can handle only two, three, or four groups please edit the input!\n"; # shouldn't happen, since regex in group file parse checks for minimally two and maximally four tab-separated group names and all category logics should be covered | |
597 } | |
598 } | |
599 | |
600 | |
601 | |
602 ### Close all output files | |
603 close $Group_Binary_Matrix_Fh if ($Opt_Group_Binary); | |
604 map { close $Count_Group_Pres_Abs{$_}->{'fh'} } keys %Count_Group_Pres_Abs; | |
605 | |
606 | |
607 | |
608 ### Plot the venn diagram with R package 'gplots' function 'venn' | |
609 if ($Opt_Venn) { | |
610 my $tmp_r_script = "$Result_Dir/tmp_venn.R"; # filename of the R script | |
611 my $venn_diagram_name = "$Result_Dir/venn_diagram.pdf"; # filename of the output PDF histogram | |
612 print STDERR "Drawing venn diagram (option '-p') '$venn_diagram_name' ...\n"; # run status of script | |
613 open (my $r_fh, ">", "$tmp_r_script"); | |
614 | |
615 select $r_fh; # select fh for standard print/f output | |
616 print "#!/usr/bin/Rscript --vanilla --slave\n"; # header of R script | |
617 print "suppressMessages(library(gplots))\n"; # load library 'gplots' for function 'venn' and suppress STDOUT message from loading | |
618 print "group_binary_matrix <- read.table(\"$Group_Binary_Matrix_File\", header=TRUE, sep=\"\\t\")\n"; | |
619 print "pdf(\"$venn_diagram_name\")\n"; | |
620 print "venn(group_binary_matrix)\n"; # create the venn diagram | |
621 print "out <- dev.off()\n"; # write histogram to PDF and suppress STDOUT output by diverting it | |
622 select STDOUT; # change standard print/f output back to STDOUT | |
623 close $r_fh; | |
624 | |
625 # execute tmp R script with 'Rscript' | |
626 system("Rscript $tmp_r_script") == 0 or die "### Fatal error:\nPlotting the venn diagram didn't work. Either statistical programming language 'R' or the required R package 'gplots' is not installed, not in global \$PATH, or something wrong with the group binary matrix '$Group_Binary_Matrix_File' or the temporary R script '$tmp_r_script', which runs the drawing of the venn diagram. Function 'venn' of the 'gplots' package is required and needs to be installed!\n"; | |
627 unlink $tmp_r_script; | |
628 } | |
629 | |
630 | |
631 | |
632 ### Count overall category rows, and plausibility check for above logics | |
633 my $Total_Category_Rows = 0; | |
634 foreach my $category (keys %Count_Group_Pres_Abs) { # can't use a map in case of optionally strict_core, singletons categories (which overlap non-optional categories) | |
635 next if ($category =~ /strict_core/ || grep {$category eq $_} @Column_Headers); # skip optional 'strict_core' (option '-co') or singleton ('-s') categories | |
636 | |
637 $Total_Category_Rows += $Count_Group_Pres_Abs{$category}->{'presence_count'}; | |
638 } | |
639 die "\n### Fatal error:\nResulting categories don't include all present rows, this shouldn't happen!\n" if (keys %Row_Header_Groups != $Total_Category_Rows); # is this really always the case? | |
640 | |
641 | |
642 | |
643 ### Print final output stats matrix | |
644 print STDERR "Finished printing output files to result directory '$Result_Dir', printing final stats matrix to STDOUT ...\n"; # run status of script | |
645 print "# category\tpresence_count\n"; # header for matrix | |
646 print "total_rows\t", scalar keys %Row_Header_Groups, "\n"; | |
647 foreach my $category (sort {lc($a) cmp lc($b)} keys %Count_Group_Pres_Abs) { | |
648 print "$category"; | |
649 if (grep {$category eq $_} @Column_Headers) { # print overall presence counts for column headers with singletons (option '-s') | |
650 print " overall_presence\t", scalar grep($Row_Header_Groups{$_}->{$category} , keys %Row_Header_Groups), "\n"; | |
651 print "$category singletons"; # include category "key word" singleton for category stats below | |
652 } | |
653 print "\t$Count_Group_Pres_Abs{$category}->{'presence_count'}\n"; | |
654 } | |
655 | |
656 # print out overall presence counts for all columns even without singletons (option '-a') | |
657 if ($Opt_Columns_Overall) { | |
658 foreach my $column_header (sort {lc($a) cmp lc($b)} @Column_Headers) { | |
659 next if (grep {$column_header eq $_} keys %Count_Group_Pres_Abs); # skip column headers with singletons (option '-s'), already printed above | |
660 print "$column_header overall_presence\t", scalar grep($Row_Header_Groups{$_}->{$column_header} , keys %Row_Header_Groups), "\n"; | |
661 } | |
662 } | |
663 | |
664 exit; | |
665 | |
666 | |
667 ############# | |
668 #Subroutines# | |
669 ############# | |
670 | |
671 ### Subroutine to check if a column header is present in hash %Column_Header_Groups (hash filled from the $Groups_File) | |
672 sub check_column_header_in_groups { | |
673 my $column_header = shift; | |
674 foreach my $group (keys %Column_Header_Groups) { | |
675 return 1 if (grep {$_ eq $column_header} @{ $Column_Header_Groups{$group}->{'column_headers'} }); | |
676 } | |
677 return; # return 'undef' | |
678 } | |
679 | |
680 | |
681 | |
682 ### Subroutine to check if a string adheres to file system conformity, esp. for filenames | |
683 sub check_file_system_conformity { | |
684 my ($string, $type) = @_; | |
685 die "\n\n### Fatal error:\nOnly alphanumeric '0-9a-zA-Z', underscore '_', dash '-', and period '.' characters allowed for $type to avoid operating/file system problems. Thus, please adapt '$string' in the group file '$Groups_File' and if necessary also in the input binary matrix '$Binary_Matrix_File' accordingly!\n" if ($string !~ /^[\w_\-\.]+$/); | |
686 return 1; | |
687 } | |
688 | |
689 | |
690 | |
691 ### Subroutine to empty a directory with user interaction | |
692 sub empty_dir { | |
693 my $dir = shift; | |
694 print STDERR "\nDirectory '$dir' already exists! You can use either option '-r' to set a different output result directory name, or do you want to replace the directory and all its contents [y|n]? "; | |
695 my $user_ask = <STDIN>; | |
696 if ($user_ask =~ /y/i) { | |
697 unlink glob "$dir/*"; # remove all files in results directory | |
698 } else { | |
699 die "\nScript abborted!\n"; | |
700 } | |
701 return 1; | |
702 } | |
703 | |
704 | |
705 | |
706 ### Open a result file and print the header | |
707 sub open_result_file { | |
708 my ($filename, $category) = @_; | |
709 open ($Count_Group_Pres_Abs{$category}->{'fh'}, ">", "$filename"); # store filehandle in hash of hash | |
710 print { $Count_Group_Pres_Abs{$category}->{'fh'} } "# row_header\n"; # block around fh needed if not a simple string variable (or glob) | |
711 return 1; | |
712 } |