annotate COG/bac-genomics-scripts/po2group_stats/po2group_stats.pl @ 4:53df1177ff97 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:54:27 +0000
parents e42d30da7a74
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
3
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
1 #!/usr/bin/perl
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
2
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
3 #######
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
4 # POD #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
5 #######
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
6
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
7 =pod
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
8
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
9 =head1 NAME
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
10
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
11 C<po2group_stats.pl> - categorize orthologs from Proteinortho5
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
12 output according to genome groups
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
13
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
14 =head1 SYNOPSIS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
15
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
16 C<perl po2group_stats.pl -i matrix.proteinortho -d genome_fasta_dir/
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
17 -g group_file.tsv -p E<gt> overall_stats.tsv>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
18
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
19 =head1 DESCRIPTION
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
21 Categorize the genomes in an ortholog/paralog output matrix (option
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
22 B<-i>) from a
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
23 L<B<Proteinortho5>|http://www.bioinf.uni-leipzig.de/Software/proteinortho/>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
24 calculation according to group affiliations. The group
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
25 affiliations of the genomes are intended to get overall
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
26 presence/absence statistics for groups of genomes and not simply
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
27 single genomes (e.g. comparing 'marine', 'earth', 'commensal',
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
28 'pathogenic' etc. genome groups). Percentage inclusion (option
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
29 B<-cut_i>) and exclusion (option B<-cut_e>) cutoffs can be set to
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
30 define how strict the presence/absence of genome groups within an
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
31 orthologous group (OG) are defined. Of course groups can also hold
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
32 only single genomes to get single genome statistics. Group
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
33 affiliations are defined in a mandatory B<tab-delimited> group input
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
34 file (option B<-g>) with B<minimal two> and B<maximal four> groups.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
35
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
36 Only alphanumeric (a-z, A-Z, 0-9), underscore (_), dash (-), and
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
37 period (.) characters are allowed for the B<group names> in the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
38 group file to avoid downstream problems with the operating/file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
39 system. As a consequence, also no whitespaces are allowed in these!
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
40 Additionally, B<group names>, B<genome filenames> (should be
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
41 enforced by the file system), and B<FASTA IDs> considering B<all>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
42 genome files (mostly locus tags; should be enforced by
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
43 Proteinortho5) need to be B<unique>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
44
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
45 B<Proteinortho5> (PO) has to be run with option B<-singles> to
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
46 include also genes without orthologs, so-called singletons/ORFans,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
47 for each genome in the PO matrix (see the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
48 L<PO manual|http://www.bioinf.uni-leipzig.de/Software/proteinortho/manual.html>).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
49 Additionally, option B<-selfblast> is recommended to enhance
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
50 paralog detection by PO.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
51
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
52 To explain the logic behind the categorization, the following
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
53 annotation for example groups will be used. A '1' exemplifies a
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
54 group genome count in a respective OG E<gt>= the rounded inclusion
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
55 cutoff, a '0' a group genome count E<lt>= the rounded exclusion
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
56 cutoff. The presence and absence of OGs for the group affiliations
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
57 are structured in different categories depending on the number of
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
58 groups. For B<two groups> (e.g. A and B) there are five categories:
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
59 'A specific' (A:B = 1:0), 'B specific' (0:1), 'cutoff core' (1:1),
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
60 'underrepresented' (0:0), and 'unspecific'. Unspecific OGs have a
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
61 genome count for at least B<one> group outside the cutoffs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
62 (exclusion cutoff E<lt> genome count E<lt> inclusion cutoff) and
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
63 thus cannot be categorized. These 'unspecific' OGs will only be
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
64 printed to a final annotation result file with option B<-u>. Overall
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
65 stats for all categories are printed to C<STDOUT> in a final
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
66 tab-delimited output matrix.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
67
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
68 B<Three groups> (A, B, and C) have the following nine categories: 'A
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
69 specific' (A:B:C = 1:0:0), 'B specific' (0:1:0), 'C specific'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
70 (0:0:1), 'A absent' (0:1:1), 'B absent' (1:0:1), 'C absent' (1:1:0),
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
71 'cutoff core' (1:1:1), 'underrepresented' (0:0:0), and 'unspecific'.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
72
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
73 B<Four groups> (A, B, C, and D) are classified in 17 categories: 'A
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
74 specific' (A:B:C:D = 1:0:0:0), 'B specific' (0:1:0:0), 'C specific'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
75 (0:0:1:0), 'D specific' (0:0:0:1), 'A-B specific' (1:1:0:0), 'A-C
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
76 specific' (1:0:1:0), 'A-D specific' (1:0:0:1), 'B-C specific'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
77 (0:1:1:0), 'B-D specific' (0:1:0:1), 'C-D specific' (0:0:1:1), 'A
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
78 absent' (0:1:1:1), 'B absent' (1:0:1:1), 'C absent' (1:1:0:1), 'D
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
79 absent' (1:1:1:0), 'cutoff core' (1:1:1:1), 'underrepresented'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
80 (0:0:0:0), and 'unspecific'.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
81
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
82 The resulting group presence/absence (according to the cutoffs) can
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
83 also be printed to a binary matrix (option B<-b>) in the result
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
84 directory (option B<-r>), excluding the 'unspecific' category. Since
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
85 the categories are the logics underlying venn diagrams, you can also
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
86 plot the results in a venn diagram using the binary matrix (option
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
87 B<-p>). The 'underrepresented' category is exempt from the venn
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
88 diagram, because it is outside of venn diagram logics.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
89
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
90 For an illustration of the logics have a look at the example venn
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
91 diagrams F<./pics/venn_diagram_logics.[svg|png]>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
92
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
93 There are two optional categories (which are only considered for the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
94 final print outs and in the final stats matrix, not for the binary
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
95 matrix and the venn diagram): 'strict core' (option B<-co>) for OGs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
96 where B<all> genomes have an ortholog. Of course all the 'strict
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
97 core' OGs are also included in the 'cutoff core' category ('strict
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
98 core' is identical to 'cutoff core' with B<-cut_i> 1 and B<-cut_e>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
99 0). Option B<-s> activates the detection of 'singleton/ORFan' OGs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
100 present in only B<one> genome. Depending on the cutoffs and number
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
101 of genomes in the groups, category 'underrepresented' includes most
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
102 of these singletons.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
103
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
104 Additionally, annotation is retrieved from multi-FASTA files created
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
105 with L<C<cds_extractor.pl>|/cds_extractor>. See
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
106 L<C<cds_extractor.pl>|/cds_extractor> for a description of the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
107 format. These files are used as input for the PO analysis and with
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
108 option B<-d> for C<po2group_stats.pl>. The annotations are printed
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
109 in category output files in the result directory.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
110
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
111 Annotations are only pulled from one representative genome for each
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
112 category present in the current OG. With option B<-co> you can set a
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
113 specific genome for the representative annotation for category
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
114 'strict core'. For all other categories the representative genome is
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
115 chosen according to the order of the genomes in the group files,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
116 depending on the presence in each OG. Thus, the best annotated
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
117 genome should be in the first group at the topmost position
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
118 (especially for 'cutoff core'), as well as the best annotated ones
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
119 at the top in all other groups.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
120
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
121 In the result files, each orthologous group (OG) is listed in a row
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
122 of the resulting category files, the first column holds the OG
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
123 numbers from the PO input matrix (i.e. line number minus one). The
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
124 following columns specify the ID for each CDS, gene, EC number(s),
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
125 product, and organism are shown depending on their presence in the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
126 CDS's annotation. The ID is in most cases the locus tag (see
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
127 L<C<cds_extractor.pl>|/cds_extractor>). If several EC numbers exist
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
128 for a single CDS they are separated by a ';'. If the representative
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
129 genome within an OG includes paralogs (co-orthologs) these will be
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
130 printed in the following row(s) B<without> a new OG number in the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
131 first column.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
132
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
133 The number of OGs in the category annotation result files are the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
134 same as listed in the venn diagram and the final stats matrix.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
135 However, since only annotation from one representative annotation is
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
136 used the CDS number will be different to the final stats. The final
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
137 stats include B<all> the CDS in this category in B<all> genomes
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
138 present in the OG in groups E<gt>= the inclusion cutoff (i.e. for
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
139 'strict core' the CDS for all genomes in this OG are counted). Two
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
140 categories are different, for 'unspecific' all unspecific groups are
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
141 included, for 'underrepresented' all groups E<lt>= the exclusion
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
142 cutoffs. This is also the reason, the 'pangenome' CDS count is
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
143 greater than the 'included in categories' CDS count in the final
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
144 stats matrix, as genomes in excluded groups are exempt from the CDS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
145 counts for most categories. 'Included in categories' is the OG/CDS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
146 sum of all non-optional categories ('*specific', '*absent', 'cutoff
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
147 core', 'underrepresented', and 'unspecific'), since the optional
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
148 categories are included in non-optionals. An exception to the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
149 difference in CDS counts are the 'singletons' category where OG and
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
150 CDS counts are identical in the result files and in the overall
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
151 final output matrix (as there is only one genome), as well as in
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
152 group-'specific' categories for groups including only one genome.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
153
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
154 At last, if you want the respective representative sequences for a
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
155 category you can first filter the locus tags from the result file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
156 with Unix command-line tools:
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
157
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
158 C<grep -v "^#" result_file.tsv | cut -f 2 E<gt> locus_tags.txt>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
159
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
160 And then feed the locus tag list to
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
161 L<C<cds_extractor.pl>|/cds_extractor> with option B<-l>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
162
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
163 As a final note, in the L<F<prot_finder>|/prot_finder> workflow is a
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
164 script, C<binary_group_stats.pl>, based upon C<po2group_stats.pl>,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
165 which can calculate overall presence/absence statistics for column
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
166 groups in a delimited TEXT binary matrix (as with genomes here).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
167
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
168 =head1 OPTIONS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
169
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
170 =head2 Mandatory options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
171
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
172 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
173
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
174 =item B<-i>=I<str>, B<-input>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
175
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
176 Proteinortho (PO) result matrix (*.proteinortho or *.poff)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
177
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
178 =item B<-d>=I<str>, B<-dir_genome>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
179
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
180 Path to the directory including the genome multi-FASTA PO input
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
181 files (*.faa or *.ffn), created with
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
182 L<C<cds_extractor.pl>|/cds_extractor>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
183
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
184 =item B<-g>=I<str>, B<-groups_file>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
185
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
186 Tab-delimited file with group affiliation for the genomes with
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
187 B<minimal two> and B<maximal four> groups (easiest to create in a
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
188 spreadsheet software and save in tab-separated format). B<All>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
189 genomes from the PO matrix need to be included. Group names can only
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
190 include alphanumeric (a-z, A-Z, 0-9), underscore (_), dash (-), and
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
191 period (.) characters (no whitespaces allowed either). Example
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
192 format with two genomes in group A, three in group B and D, and one
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
193 in group C:
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
194
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
195 group_A\tgroup_B\tgroup_C\tgroup_D
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
196 genome1.faa\tgenome2.faa\tgenome3.faa\tgenome4.faa
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
197 genome5.faa\tgenome6.faa\t\tgenome7.faa
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
198 \tgenome8.faa\t\tgenome9.faa
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
199
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
200 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
201
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
202 =head2 Optional options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
203
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
204 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
205
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
206 =item B<-h>, B<-help>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
207
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
208 Help (perldoc POD)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
209
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
210 =item B<-r>=I<str>, B<-result_dir>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
211
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
212 Path to result folder [default = inclusion and exclusion percentage
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
213 cutoff, './results_i#_e#']
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
214
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
215 =item B<-cut_i>=I<float>, B<-cut_inclusion>=I<float>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
216
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
217 Percentage inclusion cutoff for genomes in a group per OG, has to be
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
218 E<gt> 0 and E<lt>= 1. Cutoff will be rounded according to the genome
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
219 number in each group and has to be E<gt> the rounded exclusion
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
220 cutoff in this group. [default = 0.9]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
221
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
222 =item B<-cut_e>=I<float>, B<-cut_exclusion>=I<float>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
223
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
224 Percentage exclusion cutoff, has to be E<gt>= 0 and E<lt> 1. Rounded
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
225 cutoff has to be E<lt> rounded inclusion cutoff. [default = 0.1]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
226
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
227 =item B<-b>, B<-binary_matrix>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
228
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
229 Print a binary matrix with the presence/absence genome group results
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
230 according to the cutoffs (excluding 'unspecific' category OGs)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
231
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
232 =item B<-p>, B<-plot_venn>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
233
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
234 Plot venn diagram from the binary matrix (except 'unspecific' and
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
235 'underrepresented' categories) with function C<venn> from B<R>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
236 package B<gplots>, requires option B<-b>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
237
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
238 =item B<-co>=(I<str>), B<-core_strict>=(I<str>)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
239
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
240 Include 'strict core' category in output. Optionally, give a genome
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
241 name from the PO matrix to use for the representative output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
242 annotation. [default = topmost genome in first group]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
243
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
244 =item B<-s>, B<-singletons>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
245
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
246 Include singletons/ORFans for each genome in the output, activates
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
247 also overall genome OG/CDS stats in final stats matrix for genomes
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
248 with singletons
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
249
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
250 =item B<-u>, B<-unspecific>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
251
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
252 Include 'unspecific' category representative annotation file in
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
253 result directory
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
254
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
255 =item B<-a>, B<-all_genomes_overall>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
256
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
257 Report overall stats for all genomes (appended to the final stats
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
258 matrix), also those without singletons; will include all overall
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
259 genome stats without option B<-s>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
260
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
261 =item B<-v>, B<-version>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
262
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
263 Print version number to C<STDERR>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
264
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
265 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
266
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
267 =head1 OUTPUT
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
268
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
269 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
270
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
271 =item C<STDOUT>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
272
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
273 The tab-delimited final stats matrix is printed to C<STDOUT>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
274 Redirect or pipe into another tool as needed.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
275
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
276 =item F<./results_i#_e#>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
277
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
278 All output files are stored in a results folder
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
279
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
280 =item F<./results_i#_e#/[*_specific|*_absent|cutoff_core|underrepresented]_OGs.tsv>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
281
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
282 Tab-delimited files with OG annotation from a representative genome
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
283 for non-optional categories
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
284
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
285 =item (F<./results_i#_e#/[*_singletons|strict_core|unspecific]_OGs.tsv>)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
286
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
287 Optional category tab-delimited output files with representative
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
288 annotation
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
289
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
290 =item (F<./results_i#_e#/binary_matrix.tsv>)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
291
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
292 Tab-delimited binary matrix of group presence/absence results
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
293 according to cutoffs (excluding 'unspecific' category)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
294
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
295 =item (F<./results_i#_e#/venn_diagram.pdf>)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
296
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
297 Venn diagram for non-optional categories (except 'unspecific' and
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
298 'underrepresented' categories)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
299
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
300 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
301
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
302 =head1 EXAMPLES
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
303
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
304 =head2 L<C<cds_extractor.pl>|/cds_extractor>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
305
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
306 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
307
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
308 =item C<for i in *.[gbk|embl]; do perl cds_extractor.pl -i $i [-p|-n]; done>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
309
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
310 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
311
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
312 =head2 L<B<Proteinortho5>|http://www.bioinf.uni-leipzig.de/Software/proteinortho/>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
313
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
314 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
315
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
316 =item C<proteinortho5.pl -graph [-synteny] -cpus=# -selfblast -singles -identity=50 -cov=50 -blastParameters='-use_sw_tback [-seg no|-dust no]' *.[faa|ffn]>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
317
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
318 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
319
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
320 =head2 C<po2group_stats.pl>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
321
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
322 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
323
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
324 =item C<perl po2group_stats.pl -i matrix.[proteinortho|poff] -d genome_fasta_dir/ -g group_file.tsv -r result_dir -cut_i 0.7 -cut_e 0.2 -b -p -co genome4.[faa|ffn] -s -u -a E<gt> overall_stats.tsv>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
325
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
326 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
327
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
328 =head1 DEPENDENCIES
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
329
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
330 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
331
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
332 =item B<Statistical computing language L<R|http://www.r-project.org/>>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
333
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
334 C<Rscript> is needed to plot the venn diagram with option B<-p>,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
335 tested with version 3.2.2
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
336
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
337 =item B<L<gplots|https://cran.r-project.org/web/packages/gplots/index.html>>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
338
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
339 Package needed for B<R> to plot the venn diagram with function
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
340 C<venn>. Tested with B<gplots> version 2.17.0.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
341
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
342 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
343
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
344 =head1 VERSION
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
345
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
346 0.1.3 update: 06-06-2016
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
347 0.1 23-10-2015
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
348
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
349 =head1 AUTHOR
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
350
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
351 Andreas Leimbach aleimba[at]gmx[dot]de
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
352
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
353 =head1 LICENSE
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
354
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
355 This program is free software: you can redistribute it and/or modify
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
356 it under the terms of the GNU General Public License as published by
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
357 the Free Software Foundation; either version 3 (GPLv3) of the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
358 License, or (at your option) any later version.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
359
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
360 This program is distributed in the hope that it will be useful, but
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
361 WITHOUT ANY WARRANTY; without even the implied warranty of
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
362 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
363 General Public License for more details.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
364
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
365 You should have received a copy of the GNU General Public License
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
366 along with this program. If not, see L<http://www.gnu.org/licenses/>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
367
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
368 =cut
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
369
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
370
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
371 ########
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
372 # MAIN #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
373 ########
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
374
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
375 use strict;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
376 use warnings;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
377 use autodie;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
378 use Getopt::Long;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
379 use Pod::Usage;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
380
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
381 my $Cmdline = "$0 @ARGV"; # used call command
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
382
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
383 ### Get the options with Getopt::Long
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
384 my $PO_Matrix_File; # PO result matrix (*.proteinortho or *.poff)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
385 my $Genome_Dir; # directory with genome multi-FASTA files (PO input)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
386 my $Groups_File; # tab-separated file with group classification for the genomes
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
387 my $Result_Dir; # path to results folder; default is set below to '"./results_i".$Inclusion_Cutoff."c".$Exclusion_Cutoff'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
388 my $Inclusion_Cutoff = 0.9; # inclusion percentage cutoff for genome count per OG (floating point number >0 && <=1)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
389 my $Exclusion_Cutoff = 0.1; # exclusion percentage cutoff (floating point number >=0 && <1)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
390 my $Opt_Binary; # print resulting binary matrix according to the group inclusion/exclusion results
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
391 my $Opt_Venn; # create venn diagram with R package 'gplots' function 'venn'; requires $Opt_Binary
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
392 my $Strict_Core; # report also strict core genome, i.e. OGs present in ALL genomes of PO matrix; either option flag (and use first genome in first group) or give genome file for result annotation print
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
393 my $Opt_Singletons; # report also singletons for each genome (outside of the group classifications)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
394 my $Opt_Unspecific; # report also group-unspecific OGs ('exclusion < OG_group_genome_count < inclusion' for any group)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
395 my $Opt_Genomes_Overall; # report overall stats also for genomes without singletons
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
396 my $VERSION = '0.1.3';
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
397 my ($Opt_Version, $Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
398 GetOptions ('input=s' => \$PO_Matrix_File,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
399 'dir_genome=s' => \$Genome_Dir,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
400 'groups_file=s' => \$Groups_File,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
401 'result_dir=s' => \$Result_Dir,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
402 'cut_inclusion=f' => \$Inclusion_Cutoff,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
403 'cut_exclusion=f' => \$Exclusion_Cutoff,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
404 'binary_matrix' => \$Opt_Binary,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
405 'plot_venn' => \$Opt_Venn,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
406 'core_strict:s' => \$Strict_Core,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
407 'singletons' => \$Opt_Singletons,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
408 'unspecific' => \$Opt_Unspecific,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
409 'all_genomes_overall' => \$Opt_Genomes_Overall,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
410 'version' => \$Opt_Version,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
411 'help|?' => \$Opt_Help)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
412 or pod2usage(-verbose => 1, -exitval => 2);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
413
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
414
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
415
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
416 ### Run perldoc on POD and enforce mandatory options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
417 pod2usage(-verbose => 2) if ($Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
418 die "$0 $VERSION\n" if ($Opt_Version);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
419
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
420 if (!$PO_Matrix_File || !$Genome_Dir || !$Groups_File) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
421 my $warning = "\n### Fatal error: Mandatory options '-i', '-d', or '-g' or their arguments are missing!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
422 pod2usage(-verbose => 1, -message => $warning, -exitval => 2);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
423 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
424 $Genome_Dir =~ s/\/$//; # get rid of a potential '/' at the end of $Genome_Dir path
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
425 die "\n### Fatal error: Genome directory '$Genome_Dir' does not exist: $!\n" if (!-d $Genome_Dir);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
426
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
427 if (($Inclusion_Cutoff <= 0 || $Inclusion_Cutoff > 1) || ($Exclusion_Cutoff < 0 || $Exclusion_Cutoff >= 1)) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
428 my $warning = "\n### Fatal error:\nInclusion (0 < '-cut_i' <= 1) or exclusion (0 <= '-cut_e' < 1) cutoff(s) not chosen correctly!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
429 pod2usage(-verbose => 1, -message => $warning, -exitval => 2);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
430 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
431
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
432 print STDERR "Script call command: $Cmdline\n"; # print call command after '-h|-v'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
433
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
434 if ($Opt_Venn && !$Opt_Binary) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
435 warn "Option '-p' to draw the venn diagram set, but not it's required option '-b' for the binary matrix. Forcing option '-b'!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
436 $Opt_Binary = 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
437 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
438
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
439
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
440
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
441 ### Parse genomes in groups file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
442 print STDERR "Parsing group file '$Groups_File' "; # run status of script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
443 open (my $Groups_File_Fh, "<", "$Groups_File");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
444
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
445 my @Group_Names; # store group name column order
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
446 my %Genome_Groups; # hash of hash with group name as key, internal hash with resp. genomes anonymous array (key 'genomes'), or calculated integer inclusion/exclusion cutoffs (key 'inclusion' etc.)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
447 while (<$Groups_File_Fh>) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
448 chomp;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
449 next if (/^\s*$/); # skip empty lines
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
450
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
451 # check groups file header and get group names
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
452 if ($. == 1) { # header of groups file (first line)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
453 die "\n### Fatal error:\nGroups file '$Groups_File' does not have a correctly formatted header line with at least TWO and maximal FOUR tab-separated group name columns (without other whitespaces)!\n" if (!/^\S+\t\S+\t?\S*\t?\S*$/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
454 foreach my $group_name (split(/\t/)) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
455 check_file_system_conformity($group_name, 'group names'); # subroutine to check string only contains alphanumeric '0-9a-zA-Z', underscore '_', dash '-', and period '.' characters
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
456 die "\n### Fatal error:\nGroup name '$group_name' is not unique in the group file '$Groups_File'! Please choose unique characters/names for the groups in the group file!\n" if (grep {/$group_name/} @Group_Names);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
457 push (@Group_Names, $group_name); # store groups array
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
458 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
459 next; # skip header line for subsequent code
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
460 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
461
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
462 # parse genomes in groups file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
463 die "\n### Fatal error:\nLine '$.' in the groups file '$Groups_File' does not have the correct format with maximal four tab-separated genome file names according to the groups in the header (without other whitespaces):\n$_\n" if (!/^\S*\t?\S*\t?\S*\t?\S*$/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
464 my $column = -1; # column counter to associate genome filename (column) to correct group in header (@Group_Names array zero-based, thus start with '-1')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
465 foreach my $genome (split(/\t/)) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
466 $column++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
467 next if ($genome =~ /^$/); # skip empty columns in group file, in case the groups have uneven genome numbers
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
468 die "\n### Fatal error:\nGenome '$genome' is not unique in the group file '$Groups_File'! However, the group file needs to contain all of the genomes from the header in the Proteinortho matrix '$PO_Matrix_File' and vice versa. Make sure the strings correspond exactly (also lower/uppercase) to the header of the matrix and the input files in directory '$Genome_Dir'!\n" if (check_genome_in_groups($genome)); # subroutine to check if a genome is present in %Genome_Groups
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
469 push (@{ $Genome_Groups{$Group_Names[$column]}->{'genomes'} }, $genome); # push into anonymous array in hash of hash
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
470 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
471 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
472 close $Groups_File_Fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
473 print STDERR "with ", scalar map(@{ $Genome_Groups{$_}->{'genomes'} }, keys %Genome_Groups), " genomes ...\n"; # run status of script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
474
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
475 foreach (@Group_Names) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
476 die "\n### Fatal error:\nGroup '$_' does not contain a genome, please add a suitable genome filename or remove the group from the group file '$Groups_File'!\n" if (!$Genome_Groups{$_});
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
477 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
478
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
479
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
480
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
481 ### Check $Strict_Core and set genome if not given
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
482 if ($Strict_Core) { # if a genome filename is given with option '-co'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
483 die "\n### Fatal error:\nThe given genome '$Strict_Core' with option '-co' is not present in the genomes of the group file '$Groups_File'. The string has to be exactly the same, please check!\n" unless (check_genome_in_groups($Strict_Core)); # subroutine to check if a genome is present in %Genome_Groups
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
484 } elsif (defined $Strict_Core && !$Strict_Core) { # if option '-co' is set without argument
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
485 $Strict_Core = $Genome_Groups{$Group_Names[0]}->{'genomes'}[0];
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
486 print STDERR "Genome for option '-co' was set automatically to the first genome in the first group, '$Strict_Core'!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
487 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
488
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
489
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
490
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
491 ### Round inclusion and exclusion count cutoffs for each group
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
492 foreach (sort keys %Genome_Groups) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
493 $Genome_Groups{$_}->{'inclusion'} = int((scalar @{ $Genome_Groups{$_}->{'genomes'} } * $Inclusion_Cutoff) + 0.5); # round positive number half up to integer ((s)printf rounds half to even), see e.g. https://stackoverflow.com/questions/178539/how-do-you-round-a-floating-point-number-in-perl
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
494 print STDERR "Group '$_' with ", scalar @{ $Genome_Groups{$_}->{'genomes'} }, " genomes has a rounded inclusion cutoff of $Genome_Groups{$_}->{'inclusion'} genome(s) and ";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
495
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
496 $Genome_Groups{$_}->{'exclusion'} = int((scalar @{ $Genome_Groups{$_}->{'genomes'} } * $Exclusion_Cutoff) + 0.5);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
497 print STDERR "an exclusion cutoff of $Genome_Groups{$_}->{'exclusion'} genome(s)!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
498 die "\n### Fatal error:\nThe rounded inclusion cutoff has to be greater than the exclusion cutoff for all groups, please choose appropriate cutoffs!\n" if ($Genome_Groups{$_}->{'inclusion'} <= $Genome_Groups{$_}->{'exclusion'});
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
499 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
500
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
501
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
502
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
503 ### Create result folder
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
504 if (!$Result_Dir) { # can't give default before 'GetOptions' in case cutoffs are set by the user
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
505 $Result_Dir = "./results_i".$Inclusion_Cutoff."_e".$Exclusion_Cutoff;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
506 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
507 $Result_Dir =~ s/\/$//; # get rid of a potential '/' at the end of $Result_Dir path
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
508 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
509 if (-e $Result_Dir) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
510 empty_dir($Result_Dir); # subroutine to empty a directory with user interaction
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
511 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
512 mkdir $Result_Dir;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
513 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
514
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
515
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
516
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
517 ### Parse OGs in input PO matrix
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
518 print STDERR "Parsing Proteinortho input matrix '$PO_Matrix_File' ...\n"; # run status of script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
519 open (my $PO_Matrix_Fh, "<", "$PO_Matrix_File");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
520
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
521 my @Genome_Files; # store genome filename column order
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
522 my %Ortho_Groups; # hash of hash with OG# as key, internal hash with genome filename as key and IDs (e.g. locus tags) in anonymous array
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
523 while (<$PO_Matrix_Fh>) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
524 chomp;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
525
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
526 # check PO input file header and get genome file names
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
527 if ($. == 1) { # header of PO matrix file (first line)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
528 die "\n### Fatal error:\nProteinortho input matrix '$PO_Matrix_File' does not have the mandatory header line, which starts with the first three tab-separated mandatory columns:\n# Species\tGenes\tAlg.-Conn.\n" if (!/^# Species\tGenes\tAlg\.-Conn\./);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
529
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
530 # get input genomes from PO matrix and check if fit to %Genome_Groups from $Groups_File
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
531 foreach my $genome (split(/\t/)) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
532 next if ($genome =~ /# Species|Genes|Alg\.-Conn\./); # non-genome header columns
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
533 die "\n### Fatal error:\nGenome '$genome' present in the header of the Proteinortho input matrix '$PO_Matrix_File' but not in the groups file '$Groups_File'. However, the group file needs to contain all of the genomes from the matrix header and vice versa. Make sure the strings correspond exactly (also lower/uppercase) to the matrix header and the input files in directory '$Genome_Dir'!\n" unless (check_genome_in_groups($genome)); # subroutine to check if a genome is present in %Genome_Groups
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
534 push (@Genome_Files, $genome);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
535 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
536 die "\n### Fatal error:\nNot the same amount of genomes in the group file '$Groups_File' and the Proteinortho input matrix header '$PO_Matrix_File', but the group file needs to contain all of the genomes from the header in the Proteinortho matrix and vice versa!\n" if (@Genome_Files != map(@{ $Genome_Groups{$_}->{'genomes'} }, keys %Genome_Groups));
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
537 next; # skip header line of PO matrix for subsequent code
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
538 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
539
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
540 # parse PO ortholog matrix
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
541 my ($organisms, $genes, $alg_conn, @og) = split(/\t/); # $alg_conn not used
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
542 my $organism_count = 0; # count number of organisms in OG to compare to '$organisms' (should be ok from PO)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
543 my $gene_count = 0; # count number of genes in OG to compare to '$genes' (should be ok from PO)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
544 my $column = -1; # column counter to associate column to correct genome file in OG (@Genome_Files array zero-based, thus start with '-1')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
545 foreach (@og) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
546 $column++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
547 die "\n###Fatal error:\nThere's an empty field in the Proteinortho input matrix '$PO_Matrix_File' in row '$.', column '$column+4'. This shouldn't happen, as all the fields of the matrix should either have a FASTA ID from the input genome files or a '*' in case the respective genome doesn't have an ortholog in this orthologous group!\n" if (/^$/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
548 next if (/^\*$/); # skip columns without a FASTA ID (ortholog) in PO matrix (indicated by '*'), have to skip after '$column++' otherwise column won't fit
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
549 $organism_count++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
550 my @paralogs = split(',');
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
551 foreach (sort {$a cmp $b} @paralogs) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
552 $gene_count++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
553 push(@{ $Ortho_Groups{$.-1}->{$Genome_Files[$column]} }, $_); # push into anonymous array in hash of hash ($.-1 = OG number, because of header)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
554 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
555 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
556 die "\n### Fatal error:\nThe indicated number of species ($organisms) does not fit to the counted number of organisms ($organism_count) in the orthologous group of line '$.' of the Proteinortho input matrix '$PO_Matrix_File'!\n" if ($organism_count != $organisms); # to check PO matrix (should be ok from PO)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
557 die "\n### Fatal error:\nThe indicated number of genes ($genes) does not fit to the counted number of orthologs/paralogs ($gene_count) in the orthologous group of line '$.' of the Proteinortho input matrix '$PO_Matrix_File'!\n" if ($gene_count != $genes); # to check PO matrix (should be ok from PO)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
558 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
559 close $PO_Matrix_Fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
560
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
561
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
562
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
563 ### Parse annotations in genome multi-FASTAs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
564 print STDERR "Parsing annotation from multi-FASTA CDS genome files in directory '$Genome_Dir' ...\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
565 my %Annotation; # hash of hash to store the annotation of the genome files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
566 foreach my $genome (@Genome_Files) { # @Genome_Files filled from PO matrix
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
567 my $genome_file_path = "$Genome_Dir/$genome";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
568 check_file_exist($genome_file_path); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
569 open (my $genome_fh, "<", "$genome_file_path");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
570
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
571 # parse anno in current genome file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
572 while (my $anno_line = <$genome_fh>) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
573 chomp $anno_line;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
574 die "\n### Fatal error:\n'$genome_file_path' is not a FASTA input file. First line of the file should be a FASTA ID/header line and start with a '>':\n$anno_line\n" if ($anno_line !~ /^>/ && $. == 1);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
575 next if ($anno_line !~ /^>/); # skip non-ID FASTA lines
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
576
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
577 my ($id, $gene, $product, $length, $organism, $ec); # $length not used
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
578 foreach (split(/\s/, $anno_line)) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
579 if (/^>(.+)$/) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
580 $id = $1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
581 } elsif (/g=(.*)$/) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
582 $gene = $1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
583 } elsif (/p=(.*)$/) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
584 warn "\n### Warning:\nNo product annotation (p=) in file '$genome_file_path' on line:\n$anno_line\nProceeding ...\n" if (!$1);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
585 $product = $1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
586 $product =~ s/\_/ /g;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
587 } elsif (/l=(.*)$/) { # $length not used
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
588 my ($start, $stop) = split(/\.\./, $1);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
589 $length = abs($stop - $start) + 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
590 } elsif (/o=(.*)$/) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
591 $organism = $1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
592 $organism =~ s/\_/ /g;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
593 } elsif (/ec=(.*)$/) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
594 $ec = $1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
595 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
596 die "\n### Fatal error:\nAnnotation in file '$genome_file_path' is not recognized on the following line. Please use 'cds_extractor.pl' to create your multi-FASTA protein genome files.\n$anno_line\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
597 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
598 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
599 die "\n### Fatal error:\nThe following FASTA ID of file '$genome_file_path' is not unique but has to be considering ALL input genome files (actually Proteinortho should have complained already). Please modify all repetitive occurences.\n$id\n" if ($Annotation{$id});
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
600 #die "\n### Fatal error:\nThe FASTA ID '$id' is present in the multi-FASTA genome file '$genome_file_path' but not for this genome in the Proteinortho matrix '$PO_Matrix_File'. However, all IDs from the genome files in directory '$Genome_Dir' need to be present in the matrix and vice versa. Please run Proteinortho5 with option '-singles' to include also genes without orthologs, so-called singletons/ORFans (recommended is also option '-selfblast' to enhance paralog detection).\n" unless (check_po_id($id, $genome)); # subroutine to check if ID is present in %Ortho_Groups (from PO matrix); the reversed check (an ID that's present in PO matrix but not in the corresponding genome file) is in sub 'print_og_results' # this code was 1.) way too slow and 2.) somehow broke '$Query_Specific_OGs' and '$Query_Singletons' during implementation in 'po2anno.pl', no idea why ... (thus handle with care here)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
601
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
602 # fill annotation hash of hash
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
603 $Annotation{$id} = {'genome' => $genome,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
604 'gene' => $gene,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
605 'product' => $product,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
606 'organism' => $organism,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
607 'ec' => $ec};
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
608 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
609 close $genome_fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
610 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
611
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
612
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
613
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
614 ### Check if CDS counts in multi-FASTA genome files and PO matrix are equal
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
615 print STDERR "Comparing annotation CDS counts to orthologous group CDS counts (test if everything is present) ...\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
616 foreach my $genome (@Genome_Files) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
617 my $ortho_cds_count = map ($Ortho_Groups{$_}->{$genome} ? @{ $Ortho_Groups{$_}->{$genome} } : (), keys %Ortho_Groups); # de-reference anonymous array in hash of hash
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
618 # map evaluates BLOCK or EXPR in list context and returns the LIST value composed of the results of each such evaluation (each element of LIST may produce zero, one, or more elements in the returned value <=> grep returns only one value). In scalar context, returns the total number of elements so generated.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
619 # 'grep($Ortho_Groups{$_}->{$category} , keys %Ortho_Groups)' should also work (see result matrix)?
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
620
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
621 # the map line above is a fancy way of writing
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
622 #foreach (keys %Ortho_Groups) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
623 #$ortho_cds_count += @{ $Ortho_Groups{$_}->{$genome} } if ($Ortho_Groups{$_}->{$genome});
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
624 #}
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
625
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
626 my $anno_cds_count = grep ($Annotation{$_}->{'genome'} eq $genome, keys %Annotation);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
627 die "\n### Fatal error:\nThere are more CDSs in file '$genome' than for this genome in the Proteinortho matrix '$PO_Matrix_File', but counts have to be equal. Please run Proteinortho5 with option '-singles' to include also genes without orthologs, so-called singletons/ORFans (recommended is also option '-selfblast' to enhance paralog detection).\n" if ($ortho_cds_count < $anno_cds_count); # overlaps with discarded (commented) sub 'check_po_id', but much more efficient (albeit not ID-specific)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
628 die "\n### Fatal error:\nThere are less CDSs in file '$genome' than for this genome in the Proteinortho matrix '$PO_Matrix_File', but counts have to be equal. Please check if the files are correct.\n" if ($ortho_cds_count > $anno_cds_count); # overlaps with ID-specific check in sub 'print_og_results', but here checks all IDs, the sub checks only those used for result file annotations
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
629 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
630
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
631
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
632
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
633 ### Calculate stats from PO matrix according to the group cutoffs, categorize the results and print to result files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
634 print STDERR "Calculating stats from the Proteinortho matrix according to the group cutoffs ...\n"; # run status of script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
635
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
636 # open result file for binary matrix and print header
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
637 my $Binary_Matrix_File; # filename needed also for venn diagram R script below
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
638 my $Binary_Matrix_Fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
639 if ($Opt_Binary) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
640 $Binary_Matrix_File = "$Result_Dir/binary_matrix.tsv";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
641 open ($Binary_Matrix_Fh, ">", "$Binary_Matrix_File");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
642 print $Binary_Matrix_Fh join("\t", @Group_Names), "\n"; # print header for binary matrix; print array-elements tab-separated and after the last element a newline
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
643 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
644
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
645 my %Count_Group_OG; # hash of hash to store OG/CDS count and output filehandle for each category ('core', 'group-specific' etc.)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
646 foreach my $og (sort { $a <=> $b } keys %Ortho_Groups) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
647 my %present_og_genomes; # hash of array to store genomes of a group PRESENT in the current OG for cutoff tests and print outs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
648
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
649 # 'strict_core' category OG if present in ALL genomes (with option '-co')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
650 if ($Strict_Core && keys %{$Ortho_Groups{$og}} == @Genome_Files) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
651 # always count OGs and CDSs for each category for final stats and plausability checks
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
652 $Count_Group_OG{'strict_core'}->{'OG_count'}++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
653 count_category_CDS(\@Genome_Files, 'strict_core', $og); # subroutine to count all CDS (also paralogs) in the current OG for ALL genomes in the indicated category (strict core category includes all genomes, thus @Genome_Files)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
654 open_result_file("$Result_Dir/strict_core_OGs.tsv", 'strict_core') if (!$Count_Group_OG{'strict_core'}->{'fh'}); # subroutine to open result file and store fh (if not opened yet)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
655 print_og_results($og, $Strict_Core, $Count_Group_OG{'strict_core'}->{'fh'}); # subroutine to print representative annotation for specified genome to result file (here genome is '$Strict_Core')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
656 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
657
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
658 # count and save number of genomes in the current $og with their corresponding group affiliation
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
659 foreach my $group (@Group_Names) { # keep initial group order in groups file (for representative annotation)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
660 foreach my $genome (@{ $Genome_Groups{$group}->{'genomes'} }) { # keep initial genome order for each group in groups file (for representative annotation)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
661 if ($Ortho_Groups{$og}->{$genome}) { # $genome present in the current $og
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
662 push(@{ $present_og_genomes{$group} }, $genome); # push into anonymous array in hash; keeps initial genome order in groups
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
663
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
664 # 'singleton/ORFan' category OG if only present in ONE genome (with option '-s')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
665 if ($Opt_Singletons && keys %{$Ortho_Groups{$og}} == 1) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
666 $Count_Group_OG{$genome}->{'OG_count'}++; # here: category = $genome
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
667 $Count_Group_OG{$genome}->{'CDS_count'} += @{ $Ortho_Groups{$og}->{$genome} }; # don't need sub 'count_category_CDS' because only one genome
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
668 if (!$Count_Group_OG{$genome}->{'fh'}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
669 my $file_no_extension = $genome =~ s/\.\w+$//r; # remove file extensions from @Genome_Files to have nicer output files; non-destructive modifier causes the result of the substitution to be returned instead of modifying $_ (see http://perldoc.perl.org/perlrequick.html#Search-and-replace)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
670 open_result_file("$Result_Dir/$file_no_extension\_singletons.tsv", $genome); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
671 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
672 print_og_results($og, $genome, $Count_Group_OG{$genome}->{'fh'}); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
673 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
674 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
675 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
676 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
677
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
678 # filter genome counts of the groups in this OG according to inclusion-/exclusion-cutoffs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
679 my @OG_included_groups; # array for groups with number of genomes in this OG >= the inclusion cutoff
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
680 my @OG_excluded_groups; # array for groups with genomes <= the exclusion cutoff
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
681 my @OG_unspec_groups; # array for groups with 'exclusion < genome_count < inclusion'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
682 foreach my $group (@Group_Names) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
683 if ($present_og_genomes{$group}) { # only run if group has a genome in this OG (otherwise undefined ARRAY reference)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
684 if (@{ $present_og_genomes{$group} } >= $Genome_Groups{$group}->{'inclusion'}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
685 push(@OG_included_groups, $group);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
686 } elsif (@{ $present_og_genomes{$group} } <= $Genome_Groups{$group}->{'exclusion'}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
687 push(@OG_excluded_groups, $group);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
688 } elsif (@{ $present_og_genomes{$group} } > $Genome_Groups{$group}->{'exclusion'} && @{ $present_og_genomes{$group} } < $Genome_Groups{$group}->{'inclusion'}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
689 push(@OG_unspec_groups, $group);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
690 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
691 } else { # $present_og_genomes{$group} == 0 for group (undef anonymous array), i.e. always <= exclusion
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
692 push(@OG_excluded_groups, $group);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
693 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
694 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
695
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
696 # 'unspecific' category OG if at least ONE group present in '@OG_unspec_groups'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
697 # no further categorization, i.e. skip rest of logics
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
698 if (@OG_unspec_groups) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
699 $Count_Group_OG{'unspecific'}->{'OG_count'}++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
700 foreach my $group (@OG_unspec_groups) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
701 count_category_CDS(\@{ $present_og_genomes{$group} }, 'unspecific', $og); # subroutine to count all CDS in the current OG for ALL genomes in the included groups (here unspec groups); %present_og_genomes contains only genomes present in this OG (see above 'push')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
702 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
703 if ($Opt_Unspecific) { # print only to output file with option '-u'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
704 open_result_file("$Result_Dir/unspecific_OGs.tsv", 'unspecific') if (!$Count_Group_OG{'unspecific'}->{'fh'}); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
705 print_og_results($og, @{ $present_og_genomes{$OG_unspec_groups[0]} }[0], $Count_Group_OG{'unspecific'}->{'fh'}); # subroutine; use first genome in first group which is present in current OG
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
706 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
707 next; # skip to next $og, because now unspec and not suitable for any further categories
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
708 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
709
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
710 # print binary matrix (in contrast to OGs in category 'unspecific' above, 'underrepresented' OGs below not skipped because package 'venn' can handle a row with only zeros)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
711 if ($Opt_Binary) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
712 my $i = 0;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
713 foreach my $group (@Group_Names) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
714 $i++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
715 print $Binary_Matrix_Fh "1" if (grep {/$group/} @OG_included_groups);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
716 print $Binary_Matrix_Fh "0" if (grep {/$group/} @OG_excluded_groups);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
717 print $Binary_Matrix_Fh "\t" if ($i < @Group_Names);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
718 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
719 print $Binary_Matrix_Fh "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
720 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
721
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
722 # 'cutoff_core' category OG if ALL groups >= inclusion (includes 'strict_core' OGs where ALL genomes are present in the OG)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
723 # 2 groups A/B [2 inclus :0 exclus], 3 groups A/B/C [3:0], 4 groups A/B/C/D [4:0]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
724 if (@OG_included_groups == @Group_Names && @OG_excluded_groups == 0) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
725 $Count_Group_OG{'cutoff_core'}->{'OG_count'}++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
726 foreach my $group (@OG_included_groups) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
727 count_category_CDS(\@{ $present_og_genomes{$group} }, 'cutoff_core', $og); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
728 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
729 open_result_file("$Result_Dir/cutoff_core_OGs.tsv", 'cutoff_core') if (!$Count_Group_OG{'cutoff_core'}->{'fh'}); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
730 print_og_results($og, @{ $present_og_genomes{$OG_included_groups[0]} }[0], $Count_Group_OG{'cutoff_core'}->{'fh'}); # subroutine; again use first genome in first group which is present in current OG
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
731
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
732 # 'underrepresented' category OG if ALL groups <= exclusion_cutoff (depending on the cutoffs and number of genomes in the groups will include most singletons)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
733 # 2 groups [0:2], 3 [0:3], 4 [0:4]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
734 } elsif (@OG_included_groups == 0 && @OG_excluded_groups == @Group_Names) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
735 $Count_Group_OG{'underrepresented'}->{'OG_count'}++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
736 open_result_file("$Result_Dir/underrepresented_OGs.tsv", 'underrepresented') if (!$Count_Group_OG{'underrepresented'}->{'fh'}); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
737
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
738 my $genome_found; # indicates if a genome with annotation found
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
739 foreach my $group (@OG_excluded_groups) { # have to go through all groups in '@OG_excluded_groups' in case groups have no genomes ('$present_og_genomes{$group} == 0' above) in this OG; at least one group will have a genome (otherwise wouldn't be an OG in the PO matrix)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
740 if ($present_og_genomes{$group}) { # group with genome in this OG
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
741 count_category_CDS(\@{ $present_og_genomes{$group} }, 'underrepresented', $og); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
742 next if ($genome_found); # print annotation to output only for first genome found
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
743 $genome_found = print_og_results($og, @{ $present_og_genomes{$group} }[0], $Count_Group_OG{'underrepresented'}->{'fh'}); # subroutine returns 1
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
744 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
745 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
746
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
747 # 'group-specific' category OG if only ONE group >= inclusion_cutoff
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
748 # 2 groups [1:1], 3 [1:2], 4 [1:3]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
749 } elsif (@OG_included_groups == 1 && @OG_excluded_groups == @Group_Names - 1) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
750 $Count_Group_OG{"$OG_included_groups[0]_specific"}->{'OG_count'}++; # only one element in array thus [0]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
751 foreach my $group (@OG_included_groups) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
752 count_category_CDS(\@{ $present_og_genomes{$group} }, "$OG_included_groups[0]_specific", $og); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
753 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
754 open_result_file("$Result_Dir/$OG_included_groups[0]_specific_OGs.tsv", "$OG_included_groups[0]_specific") if (!$Count_Group_OG{"$OG_included_groups[0]_specific"}->{'fh'}); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
755 print_og_results($og, @{ $present_og_genomes{$OG_included_groups[0]} }[0], $Count_Group_OG{"$OG_included_groups[0]_specific"}->{'fh'}); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
756
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
757 # 'group-absent' category OG if ONE group <= exclusion_cutoff (only for 3 and 4 groups)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
758 # 3 groups (A+B not C [2:1], A+C not B, and B+C not A) and 4 groups (A+B+C not D [3:1], A+B+D not C, A+C+D not B, B+C+D not A)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
759 } elsif (@Group_Names >= 3 && (@OG_included_groups == @Group_Names - 1 && @OG_excluded_groups == 1)) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
760 $Count_Group_OG{"$OG_excluded_groups[0]_absent"}->{'OG_count'}++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
761 foreach my $group (@OG_included_groups) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
762 count_category_CDS(\@{ $present_og_genomes{$group} }, "$OG_excluded_groups[0]_absent", $og); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
763 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
764 open_result_file("$Result_Dir/$OG_excluded_groups[0]_absent_OGs.tsv", "$OG_excluded_groups[0]_absent") if (!$Count_Group_OG{"$OG_excluded_groups[0]_absent"}->{'fh'}); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
765 print_og_results($og, @{ $present_og_genomes{$OG_included_groups[0]} }[0], $Count_Group_OG{"$OG_excluded_groups[0]_absent"}->{'fh'}); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
766
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
767 # 'two group-specific' category OG if TWO groups >= inclusion_cutoff (only for 4 groups)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
768 # 4 groups (B+D not A or C [2:2], B+C not A or D, A+D not B or C, A+C not B or D, A+B not C or D, C+D not A or B)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
769 } elsif (@Group_Names == 4 && (@OG_included_groups == @OG_excluded_groups)) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
770 $Count_Group_OG{"$OG_included_groups[0]-$OG_included_groups[1]_specific"}->{'OG_count'}++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
771 foreach my $group (@OG_included_groups) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
772 count_category_CDS(\@{ $present_og_genomes{$group} }, "$OG_included_groups[0]-$OG_included_groups[1]_specific", $og); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
773 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
774 open_result_file("$Result_Dir/$OG_included_groups[0]-$OG_included_groups[1]_specific_OGs.tsv", "$OG_included_groups[0]-$OG_included_groups[1]_specific") if (!$Count_Group_OG{"$OG_included_groups[0]-$OG_included_groups[1]_specific"}->{'fh'}); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
775 print_og_results($og, @{ $present_og_genomes{$OG_included_groups[0]} }[0], $Count_Group_OG{"$OG_included_groups[0]-$OG_included_groups[1]_specific"}->{'fh'}); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
776
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
777 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
778 die "\n### Fatal error:\nNo category logic test fits for OG '$og', which shouldn't happen. Have you smuggled only one or five groups into the groups file '$Groups_File' (which also shouldn't be possible)? The script can handle only two, three, or four groups please edit the input!\n"; # shouldn't happen, since regex in group file parse checks for minimally two and maximally four tab-separated group names and all category logics should be covered
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
779 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
780 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
781
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
782
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
783
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
784 ### Close all output files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
785 close $Binary_Matrix_Fh if ($Opt_Binary);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
786 map { close $Count_Group_OG{$_}->{'fh'} } keys %Count_Group_OG;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
787
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
788
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
789
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
790 ### Plot the venn diagram with R package 'gplots' function 'venn'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
791 if ($Opt_Venn) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
792 my $tmp_r_script = "$Result_Dir/tmp_venn.R"; # filename of the R script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
793 my $venn_diagram_name = "$Result_Dir/venn_diagram.pdf"; # filename of the output PDF histogram
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
794 print STDERR "Drawing venn diagram (option '-p') '$venn_diagram_name' ...\n"; # run status of script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
795 open (my $r_fh, ">", "$tmp_r_script");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
796
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
797 select $r_fh; # select fh for standard print/f output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
798 print "#!/usr/bin/Rscript --vanilla --slave\n"; # header of R script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
799 print "suppressMessages(library(gplots))\n"; # load library 'gplots' for function 'venn' and suppress STDOUT message from loading
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
800 print "binary_matrix <- read.table(\"$Binary_Matrix_File\", header=TRUE, sep=\"\\t\")\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
801 print "pdf(\"$venn_diagram_name\")\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
802 print "venn(binary_matrix)\n"; # create the venn diagram
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
803 print "out <- dev.off()\n"; # write histogram to PDF and suppress STDOUT output by diverting it
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
804 select STDOUT; # change standard print/f output back to STDOUT
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
805 close $r_fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
806
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
807 # execute tmp R script with 'Rscript'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
808 system("Rscript $tmp_r_script") == 0 or die "### Fatal error:\nPlotting the venn diagram didn't work. Either statistical programming language 'R' or the required R package 'gplots' is not installed, not in global \$PATH, or something wrong with the binary matrix '$Binary_Matrix_File' or the temporary R script '$tmp_r_script', which runs the drawing of the venn diagram. Function 'venn' of the 'gplots' package is required and needs to be installed!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
809 unlink $tmp_r_script;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
810 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
811
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
812
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
813
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
814 ### Count overall category OGs and CDS, and plausibility check for above logics
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
815 my ($Total_Category_CDS, $Total_Category_OGs) = (0, 0);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
816 foreach my $category (keys %Count_Group_OG) { # can't use a map in case of strict_core, singletons (categories which overlap non-optional categories)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
817 next if ($category =~ /strict_core/ || grep {$category eq $_} @Genome_Files); # skip optional 'strict_core' (option '-co') or singleton ('-s') categories
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
818
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
819 $Total_Category_OGs += $Count_Group_OG{$category}->{'OG_count'};
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
820 $Total_Category_CDS += $Count_Group_OG{$category}->{'CDS_count'};
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
821 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
822 die "\n### Fatal error:\nResulting categories don't include all present OGs, this shouldn't happen!\n" if (keys %Ortho_Groups != $Total_Category_OGs); # is this really always the case?
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
823
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
824
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
825
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
826 ### Print final output stats matrix
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
827 print STDERR "Finished printing output files to result directory '$Result_Dir', printing final stats matrix to STDOUT ...\n"; # run status of script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
828 print "# category\tOG_count\tCDS_count (all included genomes)\n"; # header for matrix
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
829 print "pangenome\t", scalar keys %Ortho_Groups, "\t", scalar keys %Annotation, "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
830 print "included_in_categories\t$Total_Category_OGs\t$Total_Category_CDS\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
831 foreach my $category (sort {lc($a) cmp lc($b)} keys %Count_Group_OG) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
832 print "$category";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
833 if (grep {$category eq $_} @Genome_Files) { # print overall stats for genomes with singletons (option '-s')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
834 print " genome\t", scalar grep($Ortho_Groups{$_}->{$category} , keys %Ortho_Groups), "\t", scalar grep($Annotation{$_}->{'genome'} eq $category, keys %Annotation), "\n"; # why does it work without the fancy map during CDS count check?
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
835 print "$category singletons"; # include category "key word" singleton for category stats below
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
836 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
837 print "\t$Count_Group_OG{$category}->{'OG_count'}\t$Count_Group_OG{$category}->{'CDS_count'}\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
838 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
839
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
840 # print out overall stats for all genomes even without singletons (option '-a')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
841 if ($Opt_Genomes_Overall) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
842 foreach my $genome (sort {lc($a) cmp lc($b)} @Genome_Files) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
843 next if (grep {$genome eq $_} keys %Count_Group_OG); # skip genomes with singletons (option '-s'), already printed above
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
844 print "$genome genome\t", scalar grep($Ortho_Groups{$_}->{$genome} , keys %Ortho_Groups), "\t", scalar grep($Annotation{$_}->{'genome'} eq $genome, keys %Annotation), "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
845 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
846 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
847
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
848 exit;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
849
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
850
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
851 #############
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
852 #Subroutines#
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
853 #############
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
854
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
855 ### Subroutine to test for multi-FASTA genome file existence
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
856 sub check_file_exist {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
857 my $file = shift;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
858 die "\n### Fatal error:\nGenome file '$file' is listed in the Proteinortho matrix, but does not exist in directory '$Genome_Dir': $!\n" if (!-e $file);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
859 return 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
860 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
861
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
862
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
863
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
864 ### Subroutine to check if a string adheres to file system conformity, esp. for filenames
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
865 sub check_file_system_conformity {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
866 my ($string, $type) = @_;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
867 die "\n\n### Fatal error:\nOnly alphanumeric '0-9a-zA-Z', underscore '_', dash '-', and period '.' characters allowed for $type to avoid operating/file system problems. Thus, please adapt '$string' in the group file '$Groups_File' accordingly!\n" if ($string !~ /^[\w_\-\.]+$/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
868 return 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
869 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
870
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
871
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
872
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
873 ### Subroutine to check if a genome filename is present in hash %Genome_Groups (hash filled from the $Groups_File)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
874 sub check_genome_in_groups {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
875 my $genome = shift;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
876 foreach my $group (keys %Genome_Groups) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
877 return 1 if (grep {$_ eq $genome} @{ $Genome_Groups{$group}->{'genomes'} });
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
878 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
879 return; # return 'undef'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
880 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
881
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
882
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
883
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
884 ### Subroutine to check if an ID is present in hash %Ortho_Group (hash filled from the PO matrix)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
885 # this code was 1.) way too slow and 2.) somehow broke '$Query_Specific_OGs' and '$Query_Singletons' during implementation in 'po2anno.pl', no idea why ... (thus handle with care here)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
886 #sub check_po_id {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
887 #my ($id, $genome) = @_;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
888 #foreach my $og (keys %Ortho_Groups) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
889 #return 1 if (grep {$_ eq $id} @{ $Ortho_Groups{$og}->{$genome} });
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
890 #}
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
891 #return;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
892 #}
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
893
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
894
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
895
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
896 ### Subroutine to count and sum all CDS for ALL genomes in a certain category for a specific OG
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
897 sub count_category_CDS {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
898 my ($genomes_array_ref, $category, $og) = @_;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
899 foreach my $genome (@{ $genomes_array_ref }) { # de-reference array
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
900 $Count_Group_OG{$category}->{'CDS_count'} += @{ $Ortho_Groups{$og}->{$genome} };
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
901 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
902 return 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
903 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
904
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
905
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
906
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
907 ### Subroutine to empty a directory with user interaction
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
908 sub empty_dir {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
909 my $dir = shift;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
910 print STDERR "\nDirectory '$dir' already exists! You can use either option '-r' to set a different output result directory name, or do you want to replace the directory and all its contents [y|n]? ";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
911 my $user_ask = <STDIN>;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
912 if ($user_ask =~ /y/i) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
913 unlink glob "$dir/*"; # remove all files in results directory
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
914 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
915 die "\nScript abborted!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
916 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
917 return 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
918 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
919
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
920
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
921
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
922 ### Open a result file and print the header
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
923 sub open_result_file {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
924 my ($filename, $category) = @_;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
925 open ($Count_Group_OG{$category}->{'fh'}, ">", "$filename"); # store filehandle in hash of hash
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
926 print { $Count_Group_OG{$category}->{'fh'} } "# OG#\tID\tgene\tEC_number\tproduct\torganism\n"; # block around fh needed if not a simple string variable (or glob)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
927 return 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
928 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
929
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
930
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
931
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
932 ### Print annotation of the OGs to the result files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
933 sub print_og_results {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
934 my ($og, $genome, $fh) = @_;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
935
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
936 print $fh "$og"; # print OG number only for the first paralog
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
937 foreach my $id (@{ $Ortho_Groups{$og}->{$genome} }) { # de-reference anonymous array for putative paralogs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
938 die "\n### Fatal error:\nID '$id' present in the Proteinortho matrix '$PO_Matrix_File' in OG '$og' but not present in the corresponding genome '$genome' multi-FASTA file. However, all IDs in the matrix and the genome files in directory '$Genome_Dir' need to be correspondent to each other. Make sure you chose the correct directory with the input genome files for the current Proteinortho matrix!\n" unless ($Annotation{$id}); # overlaps with CDS count check in multi-FASTA and PO matrix files, but here only IDs checked used for print out # the reverse check (an ID that's present in a genome file but not in the PO matrix) WAS during the parsing of the annotation from the multi-FASTA files and subroutine 'check_po_id($id)' (see sub why not)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
939
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
940 print $fh "\t$id\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
941 if ($Annotation{$id}->{'gene'}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
942 print $fh $Annotation{$id}->{'gene'}, "\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
943 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
944 print $fh "\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
945 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
946 if ($Annotation{$id}->{'ec'}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
947 print $fh $Annotation{$id}->{'ec'}, "\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
948 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
949 print $fh "\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
950 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
951 if ($Annotation{$id}->{'product'}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
952 print $fh $Annotation{$id}->{'product'}, "\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
953 } else { # very unlikely that a CDS has no product annotation
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
954 print $fh "\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
955 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
956 if ($Annotation{$id}->{'organism'}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
957 print $fh $Annotation{$id}->{'organism'}, "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
958 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
959 print $fh "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
960 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
961 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
962 return 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
963 }