annotate COG/bac-genomics-scripts/prot_finder/binary_group_stats.pl @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
parents
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
3
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
1 #!/usr/bin/perl
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
2
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
3 #######
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
4 # POD #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
5 #######
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
6
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
7 =pod
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
8
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
9 =head1 NAME
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
10
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
11 C<binary_group_stats.pl> - categorize binary matrix rows according to
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
12 column groups
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
13
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
14 =head1 SYNOPSIS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
15
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
16 C<perl binary_group_stats.pl -i binary_matrix.tsv -g group_file.tsv
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
17 -p E<gt> overall_stats.tsv>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
18
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
19 =head1 DESCRIPTION
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
21 Categorize rows of a delimited TEXT input B<binary> matrix (option
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
22 B<-i>) according to column group affiliations. All fields of the binary
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
23 matrix need to be filled with either a B<0> indicating absence or a
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
24 B<1> indicating presence, i.e. all rows need to have the same number of
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
25 columns. Use option B<-d> to set the delimiter of the input matrix,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
26 default is set to tab-delimited/separated matrices.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
27
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
28 The group affiliations of the columns are intended to get overall
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
29 presence/absence statistics for groups of columns and not simply
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
30 single columns of the matrix. Percentage inclusion (option
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
31 B<-cut_i>) and exclusion (option B<-cut_e>) cutoffs can be set to
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
32 define how strict the presence/absence of column groups within a row
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
33 are defined. Of course groups can also hold only single column
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
34 headers to get single column statistics. Group affiliations are
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
35 defined in a mandatory B<tab-delimited> group input file (option
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
36 B<-g>), including the column headers from the input binary matrix,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
37 with B<minimal two> and B<maximal four> groups.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
38
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
39 The script was designed to handle a presence/absence query protein
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
40 binary matrix from a C<prot_finder.pl> protein homolog search in
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
41 bacterial genomes with a subsequent C<prot_binary_matrix.pl> run to
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
42 create the matrix. But, any binary matrix can be used for
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
43 C<binary_group_stats.pl>, optionally beforehand transposed with
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
44 C<transpose_matrix.pl>. However, column headers in the first row and
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
45 row headers in the first column are B<mandatory> for the input
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
46 binary matrix. Only alphanumeric (a-z, A-Z, 0-9), underscore (_),
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
47 dash (-), and period (.) characters are allowed for the B<column
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
48 headers> and B<group names> in the group file (option B<-g>) to
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
49 avoid downstream problems with the operating/file system. As a
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
50 consequence, also no whitespaces are allowed in these! Additionally,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
51 B<column headers>, B<row headers>, and B<group names> need to be
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
52 B<unique>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
53
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
54 <binary_group_stats.pl> is based upon
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
55 L<C<po2group_stats.pl>|/po2group_stats>, which does the same thing
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
56 for genomes in an ortholog/paralog output matrix from a
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
57 L<B<Proteinortho5>|http://www.bioinf.uni-leipzig.de/Software/proteinortho/>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
58 calculation.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
59
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
60 To explain the logic behind the categorization, the following
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
61 annotation for example groups will be used. A following '1'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
62 exemplifies a group column presence count in a respective row E<gt>=
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
63 the rounded inclusion cutoff, a '0' a group column presence count
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
64 E<lt>= the rounded exclusion cutoff. The presence and absence of
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
65 rows for the column group affiliations are structured in different
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
66 categories depending on the number of groups. For B<two groups>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
67 (e.g. A and B) there are five categories: 'A specific' (A:B = 1:0),
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
68 'B specific' (0:1), 'cutoff core' (1:1), 'underrepresented' (0:0),
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
69 and 'unspecific'. Unspecific rows have a column presence count for
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
70 at least B<one> group outside the cutoffs (exclusion cutoff E<lt>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
71 column presence count E<lt> inclusion cutoff) and thus cannot be
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
72 categorized. The respective row headers of these 'unspecific' rows
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
73 will only be printed to a final result file with option B<-u>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
74 Overall stats for all categories are printed to C<STDOUT> in a final
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
75 tab-delimited output matrix.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
76
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
77 B<Three groups> (A, B, and C) have the following nine categories: 'A
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
78 specific' (A:B:C = 1:0:0), 'B specific' (0:1:0), 'C specific'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
79 (0:0:1), 'A absent' (0:1:1), 'B absent' (1:0:1), 'C absent' (1:1:0),
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
80 'cutoff core' (1:1:1), 'underrepresented' (0:0:0), and 'unspecific'.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
81
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
82 B<Four groups> (A, B, C, and D) are classified in 17 categories: 'A
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
83 specific' (A:B:C:D = 1:0:0:0), 'B specific' (0:1:0:0), 'C specific'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
84 (0:0:1:0), 'D specific' (0:0:0:1), 'A-B specific' (1:1:0:0), 'A-C
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
85 specific' (1:0:1:0), 'A-D specific' (1:0:0:1), 'B-C specific'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
86 (0:1:1:0), 'B-D specific' (0:1:0:1), 'C-D specific' (0:0:1:1), 'A
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
87 absent' (0:1:1:1), 'B absent' (1:0:1:1), 'C absent' (1:1:0:1), 'D
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
88 absent' (1:1:1:0), 'cutoff core' (1:1:1:1), 'underrepresented'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
89 (0:0:0:0), and 'unspecific'.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
90
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
91 The resulting B<group> presence/absence (according to the cutoffs)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
92 can also be printed to a group binary matrix (option B<-b>) in the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
93 result directory (option B<-r>), excluding the 'unspecific'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
94 category. Since the categories are the logics underlying venn
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
95 diagrams, you can also plot the results in a venn diagram using the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
96 group binary matrix (option B<-p>). The 'underrepresented' category
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
97 is exempt from the venn diagram, because it is outside of venn
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
98 diagram logics.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
99
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
100 For an illustration of the logics have a look at the example venn
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
101 diagrams in the L<C<po2group_stats.pl>|/po2group_stats> folder
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
102 F</po2group_stats/pics/venn_diagram_logics.[svg|png]>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
103
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
104 There are two optional categories (which are only considered for the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
105 final print outs and in the final stats matrix, not for the group
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
106 binary matrix and the venn diagram): 'strict core' (option B<-co>)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
107 for rows where B<all> columns have a presence '1'. Of course all the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
108 'strict core' rows are also included in the 'cutoff core' category
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
109 ('strict core' is identical to 'cutoff core' with B<-cut_i> 1 and
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
110 B<-cut_e> 0). Option B<-s> activates the detection of
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
111 'singleton/column-specific' rows with a '1' in only B<one> column.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
112 Depending on the cutoffs and number of columns in the groups,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
113 category 'underrepresented' includes most of these singletons.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
114
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
115 Each row header is printed in the respective category output file in
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
116 the result directory. The number of row headers in the category
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
117 result files are the same as listed in the venn diagram and the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
118 final stats matrix. Groups with only a single column header will
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
119 have the same result as the respective 'singleton' category (with
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
120 option B<-s>).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
121
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
122 =head1 OPTIONS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
123
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
124 =head2 Mandatory options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
125
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
126 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
127
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
128 =item B<-i>=I<str>, B<-input>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
129
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
130 Input delimited TEXT binary matrix (e.g. *.tsv, *.csv, or *.txt),
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
131 see option B<-d>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
132
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
133 =item B<-g>=I<str>, B<-groups_file>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
134
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
135 Tab-delimited file with group affiliation for the columns from the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
136 input binary matrix with B<minimal two> and B<maximal four> groups
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
137 (easiest to create in a spreadsheet software and save in
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
138 tab-separated format). B<All> column headers from the input binary
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
139 matrix need to be included. Column headers and group names can only
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
140 include alphanumeric (a-z, A-Z, 0-9), underscore (_), dash (-), and
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
141 period (.) characters (no whitespaces allowed either). Example
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
142 format with two column headers in group A, three in group B and D,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
143 and one in group C:
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
144
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
145 group_A\tgroup_B\tgroup_C\tgroup_D
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
146 column_header1\tcolumn_header9\tcolumn_header3\tcolumn_header8
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
147 column_header7\tcolumn_header6\t\tcolumn_header5
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
148 \tcolumn_header4\t\tcolumn_header2
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
149
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
150 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
151
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
152 =head2 Optional options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
153
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
154 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
155
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
156 =item B<-h>, B<-help>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
157
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
158 Help (perldoc POD)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
159
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
160 =item B<-d>=I<str>, B<-delimiter>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
161
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
162 Set delimiter of input binary matrix (e.g. comma ',', single
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
163 space ' ' etc.) [default = tab-delimited/separated]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
164
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
165 =item B<-r>=I<str>, B<-result_dir>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
166
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
167 Path to result folder [default = inclusion and exclusion percentage
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
168 cutoff, './results_i#_e#']
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
169
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
170 =item B<-cut_i>=I<float>, B<-cut_inclusion>=I<float>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
171
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
172 Percentage inclusion cutoff for column presence counts in a group
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
173 per row, has to be E<gt> 0 and E<lt>= 1. Cutoff will be rounded
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
174 according to the column header number in each group and has to be
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
175 E<gt> the rounded exclusion cutoff in this group. [default = 0.9]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
176
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
177 =item B<-cut_e>=I<float>, B<-cut_exclusion>=I<float>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
178
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
179 Percentage exclusion cutoff, has to be E<gt>= 0 and E<lt> 1. Rounded
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
180 cutoff has to be E<lt> rounded inclusion cutoff. [default = 0.1]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
181
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
182 =item B<-b>, B<-binary_group_matrix>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
183
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
184 Print a group binary matrix with the presence/absence column group
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
185 results according to the cutoffs (excluding 'unspecific' category
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
186 rows)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
187
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
188 =item B<-p>, B<-plot_venn>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
189
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
190 Plot venn diagram from the group binary matrix (except 'unspecific'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
191 and 'underrepresented' categories) with function C<venn> from B<R>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
192 package B<gplots>, requires option B<-b>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
193
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
194 =item B<-co>, B<-core_strict>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
195
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
196 Include 'strict core' category in output for rows where B<all>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
197 columns have a '1'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
198
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
199 =item B<-s>, B<-singletons>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
200
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
201 Include singleton/column-specific rows for each column header in the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
202 output, activates also overall column header presence ('1') counts
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
203 in final stats matrix for columns with singletons
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
204
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
205 =item B<-u>, B<-unspecific>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
206
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
207 Include 'unspecific' category in output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
208
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
209 =item B<-a>, B<-all_column_presence_overall>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
210
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
211 Report overall presence counts for all column headers (appended to
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
212 the final stats matrix), also those without singletons; will include
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
213 all overall column header presence counts without option B<-s>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
214
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
215 =item B<-v>, B<-version>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
216
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
217 Print version number to C<STDERR>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
218
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
219 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
220
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
221 =head1 OUTPUT
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
222
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
223 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
224
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
225 =item C<STDOUT>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
226
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
227 The tab-delimited final stats matrix is printed to C<STDOUT>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
228 Redirect or pipe into another tool as needed.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
229
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
230 =item F<./results_i#_e#>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
231
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
232 All output files are stored in a results folder
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
233
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
234 =item F<./results_i#_e#/[*_specific|*_absent|cutoff_core|underrepresented]_rows.txt>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
235
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
236 Files including the row headers for rows in non-optional categories
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
237
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
238 =item (F<./results_i#_e#/[*_singletons|strict_core|unspecific]_rows.txt>)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
239
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
240 Optional category output files with the respective row headers
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
241
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
242 =item (F<./results_i#_e#/binary_matrix.tsv>)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
243
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
244 Tab-delimited binary matrix of group presence/absence results
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
245 according to cutoffs (excluding 'unspecific' rows)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
246
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
247 =item (F<./results_i#_e#/venn_diagram.pdf>)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
248
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
249 Venn diagram for non-optional categories (except 'unspecific' and
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
250 'underrepresented' categories)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
251
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
252 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
253
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
254 =head1 EXAMPLES
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
255
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
256 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
257
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
258 =item C<perl binary_group_stats.pl -i binary_matrix_transposed.csv -g group_file.tsv -d , -r result_dir -cut_i 0.7 -cut_e 0.2 -b -p -co -s -u -a E<gt> overall_stats.tsv>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
259
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
260 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
261
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
262 =head1 DEPENDENCIES
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
263
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
264 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
265
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
266 =item B<Statistical computing language L<R|http://www.r-project.org/>>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
267
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
268 C<Rscript> is needed to plot the venn diagram with option B<-p>,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
269 tested with version 3.2.2
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
270
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
271 =item B<L<gplots|https://cran.r-project.org/web/packages/gplots/index.html>>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
272
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
273 Package needed for B<R> to plot the venn diagram with function
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
274 C<venn>. Tested with B<gplots> version 2.17.0.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
275
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
276 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
277
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
278 =head1 VERSION
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
279
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
280 0.1 06-06-2016
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
281
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
282 =head1 AUTHOR
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
283
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
284 Andreas Leimbach aleimba[at]gmx[dot]de
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
285
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
286 =head1 LICENSE
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
287
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
288 This program is free software: you can redistribute it and/or modify
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
289 it under the terms of the GNU General Public License as published by
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
290 the Free Software Foundation; either version 3 (GPLv3) of the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
291 License, or (at your option) any later version.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
292
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
293 This program is distributed in the hope that it will be useful, but
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
294 WITHOUT ANY WARRANTY; without even the implied warranty of
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
295 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
296 General Public License for more details.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
297
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
298 You should have received a copy of the GNU General Public License
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
299 along with this program. If not, see L<http://www.gnu.org/licenses/>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
300
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
301 =cut
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
302
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
303
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
304 ########
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
305 # MAIN #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
306 ########
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
307
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
308 use strict;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
309 use warnings;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
310 use autodie;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
311 use Getopt::Long;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
312 use Pod::Usage;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
313
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
314 my $Cmdline = "$0 @ARGV"; # used call command
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
315
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
316 ### Get the options with Getopt::Long
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
317 my $Binary_Matrix_File; # binary input matrix
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
318 my $Groups_File; # tab-separated file with group affiliation for the input binary matrix column headers
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
319 my $Delimiter = "\t"; # set delimiter/separator of input matrix
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
320 my $Result_Dir; # path to results folder; default is set below to '"./results_i".$Inclusion_Cutoff."c".$Exclusion_Cutoff'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
321 my $Inclusion_Cutoff = 0.9; # inclusion percentage cutoff for column presence ('1') count per row (floating point number >0 && <=1)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
322 my $Exclusion_Cutoff = 0.1; # exclusion percentage cutoff (floating point number >=0 && <1)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
323 my $Opt_Group_Binary; # print resulting group binary matrix according to the group inclusion/exclusion results
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
324 my $Opt_Venn; # create venn diagram with R package 'gplots' function 'venn'; requires $Opt_Group_Binary
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
325 my $Opt_Strict_Core; # report also strict core, i.e. rows with presence in ALL columns of input binary matrix (identical to 'cutoff_core' with inclusion and exclusion cutoffs of 1 and 0 resp.)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
326 my $Opt_Singletons; # report also singletons for each column header (outside of the group classifications)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
327 my $Opt_Unspecific; # report also group-unspecific rows ('exclusion < row_group_column-presence_count < inclusion' for any group)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
328 my $Opt_Columns_Overall; # report overall row presence counts also for column headers without singletons
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
329 my $VERSION = 0.1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
330 my ($Opt_Version, $Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
331 GetOptions ('input=s' => \$Binary_Matrix_File,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
332 'groups_file=s' => \$Groups_File,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
333 'delimiter=s' => \$Delimiter,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
334 'result_dir=s' => \$Result_Dir,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
335 'cut_inclusion=f' => \$Inclusion_Cutoff,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
336 'cut_exclusion=f' => \$Exclusion_Cutoff,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
337 'binary_group_matrix' => \$Opt_Group_Binary,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
338 'plot_venn' => \$Opt_Venn,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
339 'core_strict' => \$Opt_Strict_Core,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
340 'singletons' => \$Opt_Singletons,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
341 'unspecific' => \$Opt_Unspecific,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
342 'all_column_presence_overall' => \$Opt_Columns_Overall,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
343 'version' => \$Opt_Version,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
344 'help|?' => \$Opt_Help)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
345 or pod2usage(-verbose => 1, -exitval => 2);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
346
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
347
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
348
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
349 ### Run perldoc on POD and enforce mandatory options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
350 pod2usage(-verbose => 2) if ($Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
351 die "$0 $VERSION\n" if ($Opt_Version);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
352
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
353 if (!$Binary_Matrix_File || !$Groups_File) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
354 my $warning = "\n### Fatal error: Mandatory options '-i' or '-g' or their arguments are missing!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
355 pod2usage(-verbose => 1, -message => $warning, -exitval => 2);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
356 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
357
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
358 if (($Inclusion_Cutoff <= 0 || $Inclusion_Cutoff > 1) || ($Exclusion_Cutoff < 0 || $Exclusion_Cutoff >= 1)) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
359 my $warning = "\n### Fatal error:\nInclusion (0 < '-cut_i' <= 1) or exclusion (0 <= '-cut_e' < 1) cutoff(s) not chosen correctly!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
360 pod2usage(-verbose => 1, -message => $warning, -exitval => 2);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
361 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
362
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
363 print STDERR "Script call command: $Cmdline\n"; # print call command after '-h|-v'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
364
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
365 if ($Opt_Venn && !$Opt_Group_Binary) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
366 warn "Option '-p' to draw the venn diagram set, but not it's required option '-b' for the output group binary matrix. Forcing option '-b'!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
367 $Opt_Group_Binary = 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
368 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
369
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
370
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
371
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
372 ### Parse column headers (from the input binary matrix) in groups file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
373 print STDERR "Parsing group file '$Groups_File' "; # run status of script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
374 open (my $Groups_File_Fh, "<", "$Groups_File");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
375
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
376 my @Group_Names; # store group name column order
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
377 my %Column_Header_Groups; # hash of hash with group name as key, internal hash with resp. input binary matrix column header anonymous array (key 'column_headers'), or calculated integer inclusion/exclusion cutoffs (key 'inclusion' etc.)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
378 while (<$Groups_File_Fh>) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
379 chomp;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
380 next if (/^\s*$/); # skip empty lines
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
381
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
382 # check groups file header and get group names
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
383 if ($. == 1) { # header of groups file (first line)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
384 die "\n### Fatal error:\nGroups file '$Groups_File' does not have a correctly formatted header line with at least TWO and maximal FOUR tab-separated group name columns (without other whitespaces)!\n" if (!/^\S+\t\S+\t?\S*\t?\S*$/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
385 foreach my $group_name (split(/\t/)) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
386 check_file_system_conformity($group_name, 'group names'); # subroutine to check string only contains alphanumeric '0-9a-zA-Z', underscore '_', dash '-', and period '.' characters
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
387 die "\n### Fatal error:\nGroup name '$group_name' is not unique in the group file '$Groups_File'! Please choose unique names for the groups in the group file!\n" if (grep {/$group_name/} @Group_Names);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
388 push (@Group_Names, $group_name); # store groups array
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
389 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
390 next; # skip header line for subsequent code
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
391 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
392
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
393 # parse input binary matrix column headers in groups file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
394 die "\n### Fatal error:\nLine '$.' in the groups file '$Groups_File' does not have the correct format with maximal four tab-separated column header names from the input binary matrix '$Binary_Matrix_File' according to the groups (without other whitespaces):\n$_\n" if (!/^\S*\t?\S*\t?\S*\t?\S*$/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
395 my $column = -1; # group file column counter to associate input binary matrix column header to correct group (@Group_Names array zero-based, thus start with '-1')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
396 foreach my $column_header (split(/\t/)) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
397 $column++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
398 next if ($column_header =~ /^$/); # skip empty columns in group file, in case the groups have uneven column header numbers
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
399 warn "\n### Warning:\nInput binary matrix column header '$column_header' in the group file has a binary ('0' or '1') instead of an alphanumeric string, sure this is a column header? The matrix needs to have mandatory row and column headers, see help with '-h'!\n" if ($column_header =~ /^(0|1)$/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
400 check_file_system_conformity($column_header, 'column headers'); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
401 die "\n### Fatal error:\nInput binary matrix column header '$column_header' is not unique in the group file '$Groups_File', but needs to be in the group file and the input binary matrix '$Binary_Matrix_File'! The group file needs to contain all of the column headers from the input binary matrix and vice versa. Make sure the strings correspond exactly (also lower/uppercase) to the header of the matrix!\n\n" if (check_column_header_in_groups($column_header)); # subroutine to check if a column header is present in %Column_Header_Groups
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
402 push (@{ $Column_Header_Groups{$Group_Names[$column]}->{'column_headers'} }, $column_header); # push into anonymous array in hash of hash
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
403 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
404 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
405 close $Groups_File_Fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
406 print STDERR "with ", scalar map(@{ $Column_Header_Groups{$_}->{'column_headers'} }, keys %Column_Header_Groups), " input binary matrix column headers ...\n"; # run status of script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
407
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
408 foreach (@Group_Names) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
409 die "\n### Fatal error:\nGroup '$_' does not contain an element, please add a suitable input binary matrix column header or remove the group from the group file '$Groups_File'!\n" if (!$Column_Header_Groups{$_});
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
410 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
411
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
412
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
413
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
414 ### Round inclusion and exclusion count cutoffs for each group
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
415 foreach (sort keys %Column_Header_Groups) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
416 $Column_Header_Groups{$_}->{'inclusion'} = int((scalar @{ $Column_Header_Groups{$_}->{'column_headers'} } * $Inclusion_Cutoff) + 0.5); # round positive number half up to integer ((s)printf rounds half to even), see e.g. https://stackoverflow.com/questions/178539/how-do-you-round-a-floating-point-number-in-perl
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
417 print STDERR "Group '$_' with ", scalar @{ $Column_Header_Groups{$_}->{'column_headers'} }, " input binary matrix column headers has a rounded inclusion cutoff of $Column_Header_Groups{$_}->{'inclusion'} column header(s) and ";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
418
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
419 $Column_Header_Groups{$_}->{'exclusion'} = int((scalar @{ $Column_Header_Groups{$_}->{'column_headers'} } * $Exclusion_Cutoff) + 0.5);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
420 print STDERR "an exclusion cutoff of $Column_Header_Groups{$_}->{'exclusion'} column header(s)!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
421 die "\n### Fatal error:\nThe rounded inclusion cutoff has to be greater than the exclusion cutoff for all groups, please choose appropriate cutoffs!\n" if ($Column_Header_Groups{$_}->{'inclusion'} <= $Column_Header_Groups{$_}->{'exclusion'});
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
422 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
423
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
424
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
425
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
426 ### Create result folder
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
427 if (!$Result_Dir) { # can't give default before 'GetOptions' in case cutoffs are set by the user
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
428 $Result_Dir = "./results_i".$Inclusion_Cutoff."_e".$Exclusion_Cutoff;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
429 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
430 $Result_Dir =~ s/\/$//; # get rid of a potential '/' at the end of $Result_Dir path
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
431 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
432 if (-e $Result_Dir) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
433 empty_dir($Result_Dir); # subroutine to empty a directory with user interaction
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
434 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
435 mkdir $Result_Dir;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
436 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
437
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
438
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
439
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
440 ### Parse rows in input binary matrix
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
441 print STDERR "Parsing input binary matrix '$Binary_Matrix_File' ...\n"; # run status of script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
442 open (my $Binary_Matrix_Fh, "<", "$Binary_Matrix_File");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
443
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
444 my @Column_Headers; # store input binary matrix column header order
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
445 my %Row_Header_Groups; # hash of hash with row header as key (first column of each row) and internal hash with column header as key and '1' for presence ('0' absence is not stored)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
446 while (<$Binary_Matrix_Fh>) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
447 chomp;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
448 next if (/^\s*$/); # skip empty lines
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
449 die "\n### Fatal error:\nSet delimiter/separator '$Delimiter' (option '-d') not found in line '$.' of the input binary matrix '$Binary_Matrix_File', sure the correct one is set?\n" if ($_ !~ /$Delimiter/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
450
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
451 # check input binary matrix file header and get column header names
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
452 if ($. == 1) { # header of input binary matrix file (first line)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
453 my (undef, @headers) = split(/$Delimiter/); # get rid of first line row header
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
454 foreach my $column_header (@headers) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
455 die "\n### Fatal error:\nColumn header '$column_header' is present in the input binary matrix '$Binary_Matrix_File' but not in the groups file '$Groups_File'. However, the group file needs to contain all of the column headers from the matrix and vice versa. Make sure the strings correspond exactly (also lower/uppercase) to the matrix header!\n" unless (check_column_header_in_groups($column_header)); # subroutine to check if a column header is present in %Column_Header_Groups
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
456 push (@Column_Headers, $column_header);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
457 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
458 die "\n### Fatal error:\nNot the same amount of input binary matrix column headers in the group file '$Groups_File' and matrix '$Binary_Matrix_File', but the group file needs to contain all of the column headers in the matrix and vice versa!\n" if (@Column_Headers != map(@{ $Column_Header_Groups{$_}->{'column_headers'} }, keys %Column_Header_Groups));
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
459 next; # skip header line of input binary matrix for subsequent code
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
460 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
461
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
462 # parse input binary matrix
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
463 my ($row_header, @presence_absence) = split(/$Delimiter/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
464 die "\n### Fatal error:\nInput binary matrix '$Binary_Matrix_File' row '$row_header' doesn't have the same amount of columns as overall column headers, however the matrix needs to be filled/consistent in every row!\n" unless (@Column_Headers == @presence_absence);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
465 die "\n### Fatal error:\nInput binary matrix row header '$row_header' is not unique in the input binary matrix '$Binary_Matrix_File', however all row headers need to be unique!\n" if ($Row_Header_Groups{$row_header});
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
466 warn "\n### Warning:\nLine number '$.' of the binary input matrix '$Binary_Matrix_File' has a binary ('0' or '1') row header instead of an alphanumeric string, sure this is a row header? The matrix needs to have mandatory row and column headers, see help with '-h'!\n\n" if ($row_header =~ /^(0|1)$/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
467 my $column = -1; # column counter to associate presence/absence field to correct column header (@Column_Headers array zero-based, thus start with '-1')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
468 foreach (@presence_absence) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
469 $column++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
470 die "\n###Fatal error:\nThere's an empty field in the input binary matrix '$Binary_Matrix_File' in row '$row_header', column '$column+2'. This shouldn't happen, as all the fields of the BINARY matrix should either have a '0' indicating absence or a '1' indicating presence!\n" if (/^$/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
471 die "\n### Fatal error:\nThe input binary matrix '$Binary_Matrix_File' contains a field with '$_' in row/line '$.', column '$column+2', instead of the mandatory binary '0' indicating absence or '1' indicating presence!\n" if (!/^(0|1)$/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
472 next if (!$_); # skip absence columns in binary input matrix (indicated by '0'), have to skip after '$column++' otherwise column won't fit
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
473 $Row_Header_Groups{$row_header}{$Column_Headers[$column]} = $_; # store presence ('1') in hash of hash
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
474 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
475 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
476 close $Binary_Matrix_Fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
477
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
478
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
479
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
480 ### Calculate column stats from input binary matrix according to the group cutoffs, categorize the results and print to result files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
481 print STDERR "Calculating column stats from the input binary matrix according to the group cutoffs ...\n"; # run status of script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
482
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
483 # open result file for group binary matrix and print header
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
484 my $Group_Binary_Matrix_File; # filename needed also for venn diagram R script below
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
485 my $Group_Binary_Matrix_Fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
486 if ($Opt_Group_Binary) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
487 $Group_Binary_Matrix_File = "$Result_Dir/group_binary_matrix.tsv";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
488 open ($Group_Binary_Matrix_Fh, ">", "$Group_Binary_Matrix_File");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
489 print $Group_Binary_Matrix_Fh join("\t", @Group_Names), "\n"; # print header for group binary matrix
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
490 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
491
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
492 my %Count_Group_Pres_Abs; # hash of hash to store presence count and output filehandle for each category ('core', 'group-specific' etc.)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
493 foreach my $row_header (sort {lc($a) cmp lc($b)} keys %Row_Header_Groups) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
494 my %present_row_columns; # hash of array to store input binary matrix columns of a group with PRESENCE ('1') in the current row for cutoff tests and print outs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
495
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
496 # 'strict_core' category row, if ALL columns present in row (with option '-co')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
497 if ($Opt_Strict_Core && keys %{$Row_Header_Groups{$row_header}} == @Column_Headers) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
498 $Count_Group_Pres_Abs{'strict_core'}->{'presence_count'}++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
499 open_result_file("$Result_Dir/strict_core_rows.txt", 'strict_core') if (!$Count_Group_Pres_Abs{'strict_core'}->{'fh'}); # subroutine to open result file and store fh (if not opened yet)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
500 print { $Count_Group_Pres_Abs{'strict_core'}->{'fh'} } "$row_header\n"; # block around fh needed if not a simple string variable (or glob)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
501 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
502
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
503 # count and save number of columns with presence ('1') for the current $row_header with their corresponding group affiliation
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
504 foreach my $group (@Group_Names) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
505 foreach my $column_header (@{ $Column_Header_Groups{$group}->{'column_headers'} }) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
506 if ($Row_Header_Groups{$row_header}->{$column_header}) { # $column_header with presence in current $row_header
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
507 push(@{ $present_row_columns{$group} }, $column_header); # push into anonymous array in hash
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
508
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
509 # 'singleton' category row if only ONE column with presence (with option '-s')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
510 if ($Opt_Singletons && keys %{$Row_Header_Groups{$row_header}} == 1) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
511 $Count_Group_Pres_Abs{$column_header}->{'presence_count'}++; # here: category = $column_header
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
512 open_result_file("$Result_Dir/$column_header\_singletons.txt", $column_header) if (!$Count_Group_Pres_Abs{$column_header}->{'fh'}); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
513 print { $Count_Group_Pres_Abs{$column_header}->{'fh'} } "$row_header\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
514 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
515 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
516 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
517 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
518
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
519 # filter column presence counts of the groups in this row according to inclusion-/exclusion-cutoffs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
520 my @row_included_groups; # array for groups with number of present column in this row >= the inclusion cutoff
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
521 my @row_excluded_groups; # array for groups with column presence <= the exclusion cutoff
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
522 my @row_unspec_groups; # array for unspecific groups with 'exclusion < column_presence_count < inclusion'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
523 foreach my $group (@Group_Names) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
524 if ($present_row_columns{$group}) { # only run if group has a column presence in this row (otherwise undefined ARRAY reference)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
525 if (@{ $present_row_columns{$group} } >= $Column_Header_Groups{$group}->{'inclusion'}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
526 push(@row_included_groups, $group);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
527 } elsif (@{ $present_row_columns{$group} } <= $Column_Header_Groups{$group}->{'exclusion'}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
528 push(@row_excluded_groups, $group);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
529 } elsif (@{ $present_row_columns{$group} } > $Column_Header_Groups{$group}->{'exclusion'} && @{ $present_row_columns{$group} } < $Column_Header_Groups{$group}->{'inclusion'}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
530 push(@row_unspec_groups, $group);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
531 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
532 } else { # $present_row_columns{$group} == 0 for group (undef anonymous array), i.e. always <= exclusion
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
533 push(@row_excluded_groups, $group);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
534 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
535 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
536
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
537 # 'unspecific' category row if at least ONE group present in '@row_unspec_groups'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
538 # no further categorization, i.e. skip rest of logics
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
539 if (@row_unspec_groups) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
540 $Count_Group_Pres_Abs{'unspecific'}->{'presence_count'}++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
541 if ($Opt_Unspecific) { # print only to output file with option '-u'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
542 open_result_file("$Result_Dir/unspecific_rows.txt", 'unspecific') if (!$Count_Group_Pres_Abs{'unspecific'}->{'fh'}); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
543 print { $Count_Group_Pres_Abs{'unspecific'}->{'fh'} } "$row_header\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
544 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
545 next; # skip to next $row_header, because now unspec and not suitable for any further categories
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
546 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
547
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
548 # print group binary matrix (in contrast to rows in category 'unspecific' above, 'underrepresented' rows below not skipped because package 'venn' can handle a row with only zeros in the group binary matrix)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
549 if ($Opt_Group_Binary) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
550 my $i = 0;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
551 foreach my $group (@Group_Names) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
552 $i++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
553 print $Group_Binary_Matrix_Fh "1" if (grep {/$group/} @row_included_groups);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
554 print $Group_Binary_Matrix_Fh "0" if (grep {/$group/} @row_excluded_groups);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
555 print $Group_Binary_Matrix_Fh "\t" if ($i < @Group_Names);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
556 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
557 print $Group_Binary_Matrix_Fh "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
558 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
559
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
560 # 'cutoff_core' category row if ALL groups >= inclusion (includes 'strict_core' rows where ALL columns are present in the row)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
561 # 2 groups A/B [2 inclus :0 exclus], 3 groups A/B/C [3:0], 4 groups A/B/C/D [4:0]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
562 if (@row_included_groups == @Group_Names && @row_excluded_groups == 0) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
563 $Count_Group_Pres_Abs{'cutoff_core'}->{'presence_count'}++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
564 open_result_file("$Result_Dir/cutoff_core_rows.txt", 'cutoff_core') if (!$Count_Group_Pres_Abs{'cutoff_core'}->{'fh'}); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
565 print { $Count_Group_Pres_Abs{'cutoff_core'}->{'fh'} } "$row_header\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
566
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
567 # 'underrepresented' category row if ALL groups <= exclusion_cutoff (depending on the cutoffs and number of columns in the groups will include most singletons)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
568 # 2 groups [0:2], 3 [0:3], 4 [0:4]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
569 } elsif (@row_included_groups == 0 && @row_excluded_groups == @Group_Names) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
570 $Count_Group_Pres_Abs{'underrepresented'}->{'presence_count'}++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
571 open_result_file("$Result_Dir/underrepresented_rows.txt", 'underrepresented') if (!$Count_Group_Pres_Abs{'underrepresented'}->{'fh'}); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
572 print { $Count_Group_Pres_Abs{'underrepresented'}->{'fh'} } "$row_header\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
573
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
574 # 'group-specific' category row if only ONE group >= inclusion_cutoff
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
575 # 2 groups [1:1], 3 [1:2], 4 [1:3]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
576 } elsif (@row_included_groups == 1 && @row_excluded_groups == @Group_Names - 1) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
577 $Count_Group_Pres_Abs{"$row_included_groups[0]_specific"}->{'presence_count'}++; # only one element in array thus [0]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
578 open_result_file("$Result_Dir/$row_included_groups[0]_specific_rows.txt", "$row_included_groups[0]_specific") if (!$Count_Group_Pres_Abs{"$row_included_groups[0]_specific"}->{'fh'}); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
579 print { $Count_Group_Pres_Abs{"$row_included_groups[0]_specific"}->{'fh'} } "$row_header\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
580
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
581 # 'group-absent' category row if ONE group <= exclusion_cutoff (only for 3 and 4 groups)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
582 # 3 groups (A+B not C [2:1], A+C not B, and B+C not A) and 4 groups (A+B+C not D [3:1], A+B+D not C, A+C+D not B, B+C+D not A)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
583 } elsif (@Group_Names >= 3 && (@row_included_groups == @Group_Names - 1 && @row_excluded_groups == 1)) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
584 $Count_Group_Pres_Abs{"$row_excluded_groups[0]_absent"}->{'presence_count'}++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
585 open_result_file("$Result_Dir/$row_excluded_groups[0]_absent_rows.txt", "$row_excluded_groups[0]_absent") if (!$Count_Group_Pres_Abs{"$row_excluded_groups[0]_absent"}->{'fh'}); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
586 print { $Count_Group_Pres_Abs{"$row_excluded_groups[0]_absent"}->{'fh'} } "$row_header\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
587
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
588 # 'two group-specific' category row if TWO groups >= inclusion_cutoff (only for 4 groups)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
589 # 4 groups (B+D not A or C [2:2], B+C not A or D, A+D not B or C, A+C not B or D, A+B not C or D, C+D not A or B)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
590 } elsif (@Group_Names == 4 && (@row_included_groups == @row_excluded_groups)) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
591 $Count_Group_Pres_Abs{"$row_included_groups[0]-$row_included_groups[1]_specific"}->{'presence_count'}++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
592 open_result_file("$Result_Dir/$row_included_groups[0]-$row_included_groups[1]_specific_rows.txt", "$row_included_groups[0]-$row_included_groups[1]_specific") if (!$Count_Group_Pres_Abs{"$row_included_groups[0]-$row_included_groups[1]_specific"}->{'fh'}); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
593 print { $Count_Group_Pres_Abs{"$row_included_groups[0]-$row_included_groups[1]_specific"}->{'fh'} } "$row_header\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
594
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
595 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
596 die "\n### Fatal error:\nNo category logic test fits for row '$row_header', which shouldn't happen. Have you smuggled only one or five groups into the groups file '$Groups_File' (which also shouldn't be possible)? The script can handle only two, three, or four groups please edit the input!\n"; # shouldn't happen, since regex in group file parse checks for minimally two and maximally four tab-separated group names and all category logics should be covered
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
597 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
598 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
599
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
600
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
601
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
602 ### Close all output files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
603 close $Group_Binary_Matrix_Fh if ($Opt_Group_Binary);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
604 map { close $Count_Group_Pres_Abs{$_}->{'fh'} } keys %Count_Group_Pres_Abs;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
605
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
606
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
607
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
608 ### Plot the venn diagram with R package 'gplots' function 'venn'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
609 if ($Opt_Venn) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
610 my $tmp_r_script = "$Result_Dir/tmp_venn.R"; # filename of the R script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
611 my $venn_diagram_name = "$Result_Dir/venn_diagram.pdf"; # filename of the output PDF histogram
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
612 print STDERR "Drawing venn diagram (option '-p') '$venn_diagram_name' ...\n"; # run status of script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
613 open (my $r_fh, ">", "$tmp_r_script");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
614
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
615 select $r_fh; # select fh for standard print/f output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
616 print "#!/usr/bin/Rscript --vanilla --slave\n"; # header of R script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
617 print "suppressMessages(library(gplots))\n"; # load library 'gplots' for function 'venn' and suppress STDOUT message from loading
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
618 print "group_binary_matrix <- read.table(\"$Group_Binary_Matrix_File\", header=TRUE, sep=\"\\t\")\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
619 print "pdf(\"$venn_diagram_name\")\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
620 print "venn(group_binary_matrix)\n"; # create the venn diagram
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
621 print "out <- dev.off()\n"; # write histogram to PDF and suppress STDOUT output by diverting it
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
622 select STDOUT; # change standard print/f output back to STDOUT
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
623 close $r_fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
624
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
625 # execute tmp R script with 'Rscript'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
626 system("Rscript $tmp_r_script") == 0 or die "### Fatal error:\nPlotting the venn diagram didn't work. Either statistical programming language 'R' or the required R package 'gplots' is not installed, not in global \$PATH, or something wrong with the group binary matrix '$Group_Binary_Matrix_File' or the temporary R script '$tmp_r_script', which runs the drawing of the venn diagram. Function 'venn' of the 'gplots' package is required and needs to be installed!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
627 unlink $tmp_r_script;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
628 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
629
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
630
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
631
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
632 ### Count overall category rows, and plausibility check for above logics
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
633 my $Total_Category_Rows = 0;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
634 foreach my $category (keys %Count_Group_Pres_Abs) { # can't use a map in case of optionally strict_core, singletons categories (which overlap non-optional categories)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
635 next if ($category =~ /strict_core/ || grep {$category eq $_} @Column_Headers); # skip optional 'strict_core' (option '-co') or singleton ('-s') categories
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
636
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
637 $Total_Category_Rows += $Count_Group_Pres_Abs{$category}->{'presence_count'};
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
638 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
639 die "\n### Fatal error:\nResulting categories don't include all present rows, this shouldn't happen!\n" if (keys %Row_Header_Groups != $Total_Category_Rows); # is this really always the case?
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
640
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
641
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
642
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
643 ### Print final output stats matrix
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
644 print STDERR "Finished printing output files to result directory '$Result_Dir', printing final stats matrix to STDOUT ...\n"; # run status of script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
645 print "# category\tpresence_count\n"; # header for matrix
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
646 print "total_rows\t", scalar keys %Row_Header_Groups, "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
647 foreach my $category (sort {lc($a) cmp lc($b)} keys %Count_Group_Pres_Abs) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
648 print "$category";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
649 if (grep {$category eq $_} @Column_Headers) { # print overall presence counts for column headers with singletons (option '-s')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
650 print " overall_presence\t", scalar grep($Row_Header_Groups{$_}->{$category} , keys %Row_Header_Groups), "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
651 print "$category singletons"; # include category "key word" singleton for category stats below
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
652 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
653 print "\t$Count_Group_Pres_Abs{$category}->{'presence_count'}\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
654 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
655
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
656 # print out overall presence counts for all columns even without singletons (option '-a')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
657 if ($Opt_Columns_Overall) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
658 foreach my $column_header (sort {lc($a) cmp lc($b)} @Column_Headers) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
659 next if (grep {$column_header eq $_} keys %Count_Group_Pres_Abs); # skip column headers with singletons (option '-s'), already printed above
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
660 print "$column_header overall_presence\t", scalar grep($Row_Header_Groups{$_}->{$column_header} , keys %Row_Header_Groups), "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
661 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
662 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
663
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
664 exit;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
665
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
666
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
667 #############
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
668 #Subroutines#
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
669 #############
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
670
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
671 ### Subroutine to check if a column header is present in hash %Column_Header_Groups (hash filled from the $Groups_File)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
672 sub check_column_header_in_groups {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
673 my $column_header = shift;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
674 foreach my $group (keys %Column_Header_Groups) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
675 return 1 if (grep {$_ eq $column_header} @{ $Column_Header_Groups{$group}->{'column_headers'} });
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
676 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
677 return; # return 'undef'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
678 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
679
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
680
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
681
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
682 ### Subroutine to check if a string adheres to file system conformity, esp. for filenames
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
683 sub check_file_system_conformity {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
684 my ($string, $type) = @_;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
685 die "\n\n### Fatal error:\nOnly alphanumeric '0-9a-zA-Z', underscore '_', dash '-', and period '.' characters allowed for $type to avoid operating/file system problems. Thus, please adapt '$string' in the group file '$Groups_File' and if necessary also in the input binary matrix '$Binary_Matrix_File' accordingly!\n" if ($string !~ /^[\w_\-\.]+$/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
686 return 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
687 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
688
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
689
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
690
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
691 ### Subroutine to empty a directory with user interaction
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
692 sub empty_dir {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
693 my $dir = shift;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
694 print STDERR "\nDirectory '$dir' already exists! You can use either option '-r' to set a different output result directory name, or do you want to replace the directory and all its contents [y|n]? ";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
695 my $user_ask = <STDIN>;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
696 if ($user_ask =~ /y/i) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
697 unlink glob "$dir/*"; # remove all files in results directory
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
698 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
699 die "\nScript abborted!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
700 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
701 return 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
702 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
703
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
704
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
705
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
706 ### Open a result file and print the header
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
707 sub open_result_file {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
708 my ($filename, $category) = @_;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
709 open ($Count_Group_Pres_Abs{$category}->{'fh'}, ">", "$filename"); # store filehandle in hash of hash
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
710 print { $Count_Group_Pres_Abs{$category}->{'fh'} } "# row_header\n"; # block around fh needed if not a simple string variable (or glob)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
711 return 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
712 }