annotate COG/bac-genomics-scripts/po2anno/po2anno.pl @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
parents
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
3
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
1 #!/usr/bin/perl
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
2
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
3 #######
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
4 # POD #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
5 #######
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
6
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
7 =pod
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
8
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
9 =head1 NAME
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
10
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
11 C<po2anno.pl> - create an annotation comparison matrix from Proteinortho5 output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
12
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
13 =head1 SYNOPSIS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
14
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
15 C<perl po2anno.pl -i matrix.proteinortho -d genome_fasta_dir/ -l -a E<gt> annotation_comparison.tsv>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
16
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
17 =head1 DESCRIPTION
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
18
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
19 Supplement an ortholog/paralog output matrix from a
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
20 L<B<Proteinortho5>|http://www.bioinf.uni-leipzig.de/Software/proteinortho/>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
21 calculation with annotation information. The resulting tab-separated
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
22 annotation comparison matrix (ACM) is mainly intended for the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
23 transfer of high quality annotations from reference genomes to
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
24 homologs (orthologs and co-orthologs/paralogs) in a query genome
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
25 (e.g. in conjunction with L<C<tbl2tab.pl>|/tbl2tab>). But of course
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
26 it can also be used to have a quick glance at the annotation of
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
27 genes present only in a couple of input genomes in comparison to the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
28 others.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
29
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
30 Annotation is retrieved from multi-FASTA files created with
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
31 L<C<cds_extractor.pl>|/cds_extractor>. See
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
32 L<C<cds_extractor.pl>|/cds_extractor> for a description of the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
33 format. These files are used as input for the PO analysis and option
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
34 B<-d> for C<po2anno.pl>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
35
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
36 B<Proteinortho5> (PO) has to be run with option B<-singles> to include
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
37 also genes without orthologs, so-called singletons/ORFans, for each
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
38 genome in the PO matrix (see the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
39 L<PO manual|http://www.bioinf.uni-leipzig.de/Software/proteinortho/manual.html>).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
40 Additionally, option B<-selfblast> is recommended to enhance paralog
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
41 detection by PO.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
42
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
43 Each orthologous group (OG) is listed in a row of the resulting ACM,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
44 the first column holds the OG numbers from the PO input matrix (i.e.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
45 line number minus one). The following columns specify the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
46 orthologous CDS for each input genome. For each CDS the ID,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
47 optionally the length in bp (option B<-l>), gene, EC number(s), and
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
48 product are shown depending on their presence in the CDS's
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
49 annotation. The ID is in most cases the locus tag (see
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
50 L<C<cds_extractor.pl>|/cds_extractor>). If several EC numbers exist
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
51 for a single CDS they're separated by ';'. If an OG includes
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
52 paralogs, i.e. co-orthologs from a single genome, these will be
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
53 printed in the following row(s) B<without> a new OG number in the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
54 first column. The order of paralogous CDSs within an OG is
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
55 arbitrarily.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
56
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
57 The OGs are sorted numerically via the query ID (see option B<-q>).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
58 If option B<-a> is set, the non-query OGs are appended to the output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
59 after the query OGs, sorted numerically via OG number.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
60
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
61 =head1 OPTIONS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
62
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
63 =head2 Mandatory options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
64
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
65 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
66
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
67 =item B<-i>=I<str>, B<-input>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
68
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
69 Proteinortho (PO) result matrix (*.proteinortho or *.poff), or piped C<STDIN> (-)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
70
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
71 =item B<-d>=I<str>, B<-dir_genome>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
72
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
73 Path to the directory including the genome multi-FASTA PO input
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
74 files (*.faa or *.ffn), created with L<C<cds_extractor.pl>|/cds_extractor>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
75
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
76 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
77
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
78 =head2 Optional options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
79
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
80 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
81
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
82 =item B<-h>, B<-help>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
83
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
84 Help (perldoc POD)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
85
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
86 =item B<-q>=I<str>, B<-query>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
87
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
88 Query genome (has to be identical to the string in the PO matrix)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
89 [default = first one in alphabetical order]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
90
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
91 =item B<-l>, B<-length>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
92
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
93 Include length of each CDS in bp
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
94
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
95 =item B<-a>, B<-all>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
96
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
97 Append non-query orthologous groups (OGs) to the output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
98
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
99 =item B<-v>, B<-version>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
100
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
101 Print version number to C<STDERR>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
102
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
103 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
104
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
105 =head1 OUTPUT
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
106
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
107 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
108
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
109 =item C<STDOUT>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
110
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
111 The resulting tab-delimited ACM is printed to C<STDOUT>. Redirect or
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
112 pipe into another tool as needed (e.g. C<cut>, C<grep>, C<head>, or
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
113 C<tail>).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
114
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
115 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
116
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
117 =head1 EXAMPLES
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
118
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
119 =head2 L<C<cds_extractor.pl>|/cds_extractor>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
120
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
121 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
122
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
123 =item C<for i in *.[gbk|embl]; do perl cds_extractor.pl -i $i [-p|-n]; done>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
124
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
125 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
126
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
127 =head2 L<B<Proteinortho5>|http://www.bioinf.uni-leipzig.de/Software/proteinortho/>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
128
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
129 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
130
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
131 =item C<proteinortho5.pl -graph [-synteny] -cpus=# -selfblast -singles -identity=50 -cov=50 -blastParameters='-use_sw_tback [-seg no|-dust no]' *.[faa|ffn]>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
132
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
133 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
134
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
135 =head2 C<po2anno.pl>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
136
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
137 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
138
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
139 =item C<perl po2anno.pl -i matrix.[proteinortho|poff] -d genome_fasta_dir/ -q query.[faa|ffn] -l -a E<gt> annotation_comparison.tsv>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
140
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
141 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
142
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
143 =head1 VERSION
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
144
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
145 0.2.2 update: 23-10-2015
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
146 0.1 18-12-2014
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
147
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
148 =head1 AUTHOR
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
149
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
150 Andreas Leimbach aleimba[at]gmx[dot]de
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
151
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
152 =head1 LICENSE
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
153
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
154 This program is free software: you can redistribute it and/or modify
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
155 it under the terms of the GNU General Public License as published by
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
156 the Free Software Foundation; either version 3 (GPLv3) of the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
157 License, or (at your option) any later version.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
158
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
159 This program is distributed in the hope that it will be useful, but
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
160 WITHOUT ANY WARRANTY; without even the implied warranty of
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
161 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
162 General Public License for more details.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
163
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
164 You should have received a copy of the GNU General Public License
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
165 along with this program. If not, see L<http://www.gnu.org/licenses/>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
166
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
167 =cut
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
168
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
169
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
170 ########
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
171 # MAIN #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
172 ########
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
173
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
174 use strict;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
175 use warnings;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
176 use autodie;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
177 use Getopt::Long;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
178 use Pod::Usage;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
179
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
180 my $Cmdline = "$0 @ARGV"; # used call command
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
181
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
182 ### Get the options with Getopt::Long
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
183 my $PO_Matrix_File; # PO result matrix (*.proteinortho or *.poff)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
184 my $Genome_Dir; # directory with genome multi-FASTAs (PO input)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
185 my $Query; # query genome (first column in output)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
186 my $Opt_Length; # include CDS nucleotide lengths in output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
187 my $Opt_All_OGs; # print also all non-query OGs to output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
188 my $VERSION = '0.2.2';
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
189 my ($Opt_Version, $Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
190 GetOptions ('input=s' => \$PO_Matrix_File,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
191 'dir_genome=s' => \$Genome_Dir,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
192 'query=s' => \$Query,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
193 'length' => \$Opt_Length,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
194 'all' => \$Opt_All_OGs,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
195 'version' => \$Opt_Version,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
196 'help|?' => \$Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
197
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
198
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
199
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
200 ### Run perldoc on POD
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
201 pod2usage(-verbose => 2) if ($Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
202 die "$0 $VERSION\n" if ($Opt_Version);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
203
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
204 if (!$PO_Matrix_File || !$Genome_Dir) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
205 my $warning = "\n### Fatal error: Mandatory options '-i' or '-d' or their arguments are missing!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
206 pod2usage(-verbose => 1, -message => $warning, -exitval => 2);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
207 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
208 die "\n### Fatal error: Proteinortho matrix file '$PO_Matrix_File' does not exist: $!\n" if (!-e $PO_Matrix_File);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
209 $Genome_Dir =~ s/\/$//; # get rid of a potential '/' at the end of $Genome_Dir path
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
210 die "\n### Fatal error: Genome directory '$Genome_Dir' does not exist: $!\n" if (!-d $Genome_Dir);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
211
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
212 print STDERR "Script call command: $Cmdline\n"; # print call command after '-h|-v'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
213
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
214
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
215
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
216 ### Pipe input from STDIN or open input file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
217 my $PO_Matrix_Fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
218 if ($PO_Matrix_File eq '-') { # file input via STDIN
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
219 $PO_Matrix_Fh = *STDIN; # capture typeglob of STDIN
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
220 } else { # input via input file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
221 open ($PO_Matrix_Fh, "<", "$PO_Matrix_File");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
222 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
223
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
224
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
225
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
226 ### Parse OGs in input PO matrix
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
227 print STDERR "Parsing Proteinortho input matrix '$PO_Matrix_File' ...\n"; # run status of script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
228 my @Genome_Files; # store genome filename column order
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
229 my %Ortho_Groups; # hash of hash with OG# as key, internal hash with genome filename as key and IDs (e.g. locus tags) in anonymous array
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
230 while (<$PO_Matrix_Fh>) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
231 chomp;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
232
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
233 # check PO input file header and get genome file names
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
234 if ($. == 1) { # header of PO matrix file (first line)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
235 die "\n### Fatal error:\nProteinortho input matrix '$PO_Matrix_File' does not have the mandatory header line, which starts with the first three tab-separated mandatory columns:\n# Species\tGenes\tAlg.-Conn.\n" if (!/^# Species\tGenes\tAlg\.-Conn\./);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
236
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
237 # get input genomes from PO matrix and check $Query existence
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
238 my $query_present = 0 if ($Query);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
239 foreach (split(/\t/, $_)) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
240 next if (/# Species|Genes|Alg\.-Conn\./); # non-genome header columns
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
241 $query_present = 1 if ($Query && $Query eq $_);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
242 push (@Genome_Files, $_); # store genome in array
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
243 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
244 die "\n### Fatal error:\nGiven query '$Query' not found in the header of the input Proteinortho matrix '$PO_Matrix_File'! Make sure the string corresponds exactly (also lower/uppercase) to the header of the matrix and the input file in '$Genome_Dir'!\n" if ($Query && !$query_present);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
245 next; # skip header line of PO matrix for subsequent code
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
246 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
247
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
248 # parse PO ortholog matrix
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
249 my ($organisms, $genes, $alg_conn, @og) = split(/\t/); # $alg_conn not used
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
250 my $organism_count = 0; # count number of organisms in OG to compare to '$organisms' (should be ok from PO)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
251 my $gene_count = 0; # count number of genes in OG to compare to '$genes' (should be ok from PO)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
252 my $column = -1; # column counter to associate column to correct genome file in OG (@Genome_Files array zero-based, thus start with '-1')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
253 foreach (@og) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
254 $column++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
255 next if (/^\*$/); # skip empty columns in PO matrix (indicated by '*')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
256 $organism_count++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
257 my @paralogs = split(',');
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
258 foreach (sort {$a cmp $b} @paralogs) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
259 $gene_count++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
260 push(@{ $Ortho_Groups{$.-1}->{$Genome_Files[$column]} }, $_); # push into anonymous array in hash of hash ($.-1 = OG number, because of header)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
261 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
262 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
263 die "\n### Fatal error:\nThe indicated number of species ($organisms) does not fit to the counted number of organisms ($organism_count) in the orthologous group of line '$.' of the Proteinortho input matrix '$PO_Matrix_File'!\n" if ($organism_count != $organisms); # to check PO matrix
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
264 die "\n### Fatal error:\nThe indicated number of genes ($genes) does not fit to the counted number of orthologs/paralogs ($gene_count) in the orthologous group of line '$.' of the Proteinortho input matrix '$PO_Matrix_File'!\n" if ($gene_count != $genes); # to check PO matrix
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
265 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
266 close $PO_Matrix_Fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
267
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
268
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
269
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
270 ### Sort @Genome_Files and set $Query as first element (not possible above because has to follow $column!)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
271 @Genome_Files = sort {lc($a) cmp lc($b)} @Genome_Files; # sort regardless of upper/lowercase
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
272 if ($Query) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
273 @Genome_Files = grep ($_ ne $Query, @Genome_Files); # remove $Query from @Genome_Files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
274 unshift (@Genome_Files, $Query); # set as first element
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
275 } elsif (!$Query) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
276 $Query = $Genome_Files[0]; # $Query first element in sorted array, when not given with option '-q'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
277 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
278
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
279
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
280
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
281 ### Parse annotations in genome multi-FASTAs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
282 print STDERR "Parsing annotation from multi-FASTA CDS genome files in directory '$Genome_Dir' ...\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
283 my %Annotation; # hash of hash to store the annotation of the genome files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
284 my %Anno_Features; # hash of hash to store which annotation features are present overall in each individual genome multi-FASTA (optional are 'gene/g=', 'product/p=' and 'EC_number/ec=' tags)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
285 foreach my $genome (@Genome_Files) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
286 my $genome_file_path = "$Genome_Dir/$genome";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
287 check_file_exist($genome_file_path); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
288 open (my $genome_fh, "<", "$genome_file_path");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
289
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
290 # parse anno in current genome file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
291 while (my $anno_line = <$genome_fh>) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
292 chomp $anno_line;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
293 die "\n### Fatal error:\n'$genome_file_path' is not a FASTA input file. First line of the file should be a FASTA ID/header line and start with a '>':\n$anno_line\n" if ($anno_line !~ /^>/ && $. == 1);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
294 next if ($anno_line !~ /^>/); # skip non-ID FASTA lines
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
295
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
296 my ($id, $gene, $product, $length, $organism, $ec); # $organism not used
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
297 foreach (split(/\s/, $anno_line)) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
298 if (/^>(.+)$/) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
299 $id = $1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
300 } elsif (/g=(.*)$/) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
301 $gene = $1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
302 } elsif (/p=(.*)$/) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
303 warn "\n### Warning:\nNo product annotation (p=) in file '$genome_file_path' on line:\n$anno_line\nProceeding ...\n" if (!$1);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
304 $product = $1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
305 $product =~ s/\_/ /g;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
306 } elsif (/l=(.*)$/) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
307 my ($start, $stop) = split(/\.\./, $1);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
308 $length = abs($stop - $start) + 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
309 } elsif (/o=(.*)$/) { # $organism not used
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
310 $organism = $1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
311 $organism =~ s/\_/ /g;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
312 } elsif (/ec=(.*)$/) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
313 $ec = $1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
314 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
315 die "\n### Fatal error:\nAnnotation in file '$genome_file_path' is not recognized on the following line. Please use 'cds_extractor.pl' to create your multi-FASTA protein genome files.\n$anno_line\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
316 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
317 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
318 die "\n### Fatal error:\nThe following FASTA ID of file '$genome_file_path' is not unique but has to be considering ALL input genome files (actually Proteinortho should have complained already). Please modify all repetitive occurences.\n$id\n" if ($Annotation{$id});
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
319
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
320 # fill annotation hash of hash
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
321 $Annotation{$id} = {'genome' => $genome,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
322 'gene' => $gene,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
323 'product' => $product,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
324 'length' => $length,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
325 'ec' => $ec};
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
326
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
327 # store which 'optional' annotation features are present overall in a genome
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
328 # (used to have lots of greps on %Annotation for this purpose below, but lets save performance)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
329 $Anno_Features{$genome}{'gene'} = 1 if ($gene && !$Anno_Features{$genome}->{'gene'});
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
330 $Anno_Features{$genome}{'product'} = 1 if ($product && !$Anno_Features{$genome}->{'product'});
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
331 $Anno_Features{$genome}{'ec'} = 1 if ($ec && !$Anno_Features{$genome}->{'ec'});
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
332 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
333 close $genome_fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
334 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
335
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
336
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
337
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
338 ### Check if CDS counts in multi-FASTA genome files and PO matrix are equal and print header of output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
339 print STDERR "Comparing annotation CDS counts to orthologous group CDS counts and printing header for output matrix ...\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
340 my $Query_CDS_Count; # store number of query CDS for stat report at end
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
341 print "# OG"; # first column of header
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
342 foreach my $genome (@Genome_Files) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
343 my $ortho_cds_count = map ($Ortho_Groups{$_}->{$genome} ? @{ $Ortho_Groups{$_}->{$genome} } : (), keys %Ortho_Groups); # de-reference anonymous array in hash of hash
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
344 # map evaluates BLOCK or EXPR in list context and returns the LIST value composed of the results of each such evaluation (each element of LIST may produce zero, one, or more elements in the returned value <=> grep returns only one value). In scalar context, returns the total number of elements so generated.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
345
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
346 # the map line above is a fancy way of writing
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
347 #foreach (keys %Ortho_Groups) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
348 #$ortho_cds_count += @{ $Ortho_Groups{$_}->{$genome} } if ($Ortho_Groups{$_}->{$genome});
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
349 #}
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
350
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
351 my $anno_cds_count = grep ($Annotation{$_}->{'genome'} eq $genome, keys %Annotation);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
352 $Query_CDS_Count = $anno_cds_count if ($Query eq $genome);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
353 die "\n### Fatal error:\nThere are more CDSs in file '$genome' than for this genome in the Proteinortho matrix '$PO_Matrix_File', but counts have to be equal. Please run Proteinortho5 with option '-singles' to include also genes without orthologs, so-called singletons/ORFans (recommended is also option '-selfblast' to enhance paralog detection).\n" if ($ortho_cds_count < $anno_cds_count);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
354 die "\n### Fatal error:\nThere are less CDSs in file '$genome' than for this genome in the Proteinortho matrix '$PO_Matrix_File', but counts have to be equal. Please check if the files are correct.\n" if ($ortho_cds_count > $anno_cds_count); # overlaps with ID-specific check in sub 'print_matrix'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
355
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
356 # print header fields for output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
357 print "\t$genome";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
358 print "\tlength [bp]" if ($Opt_Length);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
359 print "\tgene" if ($Anno_Features{$genome}->{'gene'});
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
360 print "\tec" if ($Anno_Features{$genome}->{'ec'});
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
361 print "\tproduct" if ($Anno_Features{$genome}->{'product'}); # very unlikely that there's a whole genome without product annotation ...
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
362 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
363 print "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
364
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
365
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
366
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
367 ### Print annotation comparison matrix
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
368 print STDERR "Printing output annotation comparison matrix for query OGs ...\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
369 my %Query_ID_Seen; # query IDs already processed for output (because of paralogous CDSs in the same OGs)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
370 my %Query_OGs; # OGs containing query CDSs, to skip for non-query OGs (option '-a') below
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
371 my $Query_Specific_OGs = 0; # count number of OGs including only query CDSs (not including CDSs of the other genomes)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
372 my $Query_Singletons = 0; # count query singletons/ORFans (different to '$Query_Singleton_OGs' because of paralogous CDSs)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
373 foreach my $id (sort { $a cmp $b } grep ($Annotation{$_}->{'genome'} eq $Query, keys %Annotation)) { # get IDs only for the query genome, sorted
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
374 next if ($Query_ID_Seen{$id});
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
375
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
376 # get OG for current query $id (I'm sure this can also be resolved by a fancy map/grep construction, but I didn't get it to work)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
377 my $id_og; # OG containing the current query ID
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
378 foreach my $og (keys %Ortho_Groups) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
379 foreach (@{ $Ortho_Groups{$og}->{$Query} }) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
380 if ($_ eq $id) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
381 $id_og = $og;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
382 last;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
383 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
384 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
385 last if ($id_og);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
386 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
387 $Query_OGs{$id_og} = 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
388
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
389 print_matrix($id_og, \%Query_ID_Seen); # subroutine to print the annotation comparison matrix for the current OG to STDOUT
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
390 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
391
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
392
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
393
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
394 ### Optionally, print also non-query OGs/singletons
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
395 if ($Opt_All_OGs) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
396 print STDERR "Printing output annotation comparison matrix for non-query OGs (option '-all') ...\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
397 foreach my $og (sort {$a <=> $b} keys %Ortho_Groups) { # sort from smallest to largest OG
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
398 next if ($Query_OGs{$og});
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
399 print_matrix($og); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
400 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
401 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
402
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
403
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
404
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
405 ### Print some final stats
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
406 print STDERR "\nFinished creating Proteinortho annotation comparison matrix with query genome '$Query':\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
407 print STDERR "Total genomes: ", scalar @Genome_Files, "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
408 print STDERR "Total CDSs: ", scalar keys %Annotation, "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
409 print STDERR "Total query CDSs: $Query_CDS_Count\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
410 print STDERR "Total OGs: ", scalar keys %Ortho_Groups, "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
411 print STDERR "OGs including query CDSs: ", scalar keys %Query_OGs, "\n"; # 'scalar grep($Ortho_Groups{$_}->{$Query} , keys %Ortho_Groups)' should actually be the same, but somehow doesn't work ... (what about the fancy map in the CDS count check above?)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
412 print STDERR "Query-specific OGs (not including CDSs of the other genomes): $Query_Specific_OGs\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
413 print STDERR "Total query singletons/ORFans: $Query_Singletons\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
414
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
415
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
416 exit;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
417
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
418
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
419 #############
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
420 #Subroutines#
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
421 #############
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
422
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
423 ### Subroutine to test for file existence
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
424 sub check_file_exist {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
425 my $file = shift;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
426 die "\n### Fatal error:\nGenome file '$file' is listed in the Proteinortho matrix, but does not exist in directory '$Genome_Dir': $!\n" if (!-e $file);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
427 return 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
428 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
429
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
430
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
431
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
432 ### Subroutine to print resulting annotation comparison matrix to STDOUT
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
433 sub print_matrix {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
434 my ($og, $query_id_seen) = @_; # $query_id_seen hash-ref to %Query_ID_Seen ($query_id_seen not given/defined if called above for non-query OGs, option '-a')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
435
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
436 # get max count of paralogs/co-orthologs over all genomes in current OG for row printing below
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
437 my $max_paralog_count = 0;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
438 foreach my $genome (keys %{$Ortho_Groups{$og}}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
439 my $paralog_count = 0;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
440 foreach (@{ $Ortho_Groups{$og}->{$genome} }) { # only needed for genomes in the CURRENT $og <=> ALL genomes needed below
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
441 $query_id_seen->{$_} = 1 if ($query_id_seen && $genome eq $Query); # store query IDs that have been processed
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
442 $paralog_count++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
443 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
444 if ($genome eq $Query && keys %{$Ortho_Groups{$og}} == 1) { # query-specific OGs (not including CDSs of the other genomes)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
445 $Query_Specific_OGs++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
446 $Query_Singletons += $paralog_count;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
447 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
448 $max_paralog_count = $paralog_count if ($paralog_count > $max_paralog_count);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
449 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
450
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
451 # print output matrix for current OG
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
452 print "$og";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
453 for (my $i = 0; $i < $max_paralog_count; $i++) { # '<' because array zero-based
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
454 print "\t"; # tab after OG-nr
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
455
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
456 foreach my $genome (@Genome_Files) { # for ALL genomes
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
457
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
458 # CDS(s) of genome present in current $og and within $max_paralog_count
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
459 if ($Ortho_Groups{$og}->{$genome}->[$i]) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
460 die "\n### Fatal error:\nID '$Ortho_Groups{$og}->{$genome}->[$i]' present in the Proteinortho matrix '$PO_Matrix_File' in OG '$og' but not present in the corresponding genome '$genome' multi-FASTA file. However, all IDs in the matrix and the genome files in directory '$Genome_Dir' need to be correspondent to each other. Make sure you chose the correct directory with the input genome files for the current Proteinortho matrix!\n" unless ($Annotation{$Ortho_Groups{$og}->{$genome}->[$i]}); # overlaps with check during the CDS count comparison of FASTA files and PO matrix (but here not all IDs are controlled without option '-a')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
461
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
462 print "$Ortho_Groups{$og}->{$genome}->[$i]\t"; # print CDS ID
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
463 print $Annotation{$Ortho_Groups{$og}->{$genome}->[$i]}->{'length'}, "\t" if ($Opt_Length);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
464 if ($Annotation{$Ortho_Groups{$og}->{$genome}->[$i]}->{'gene'}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
465 print $Annotation{$Ortho_Groups{$og}->{$genome}->[$i]}->{'gene'}, "\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
466 } elsif ($Anno_Features{$genome}->{'gene'}) { # tab only if this genome has a 'g=/gene' annotation at all (stored in hash of hash %Anno_Features)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
467 print "\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
468 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
469 if ($Annotation{$Ortho_Groups{$og}->{$genome}->[$i]}->{'ec'}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
470 print $Annotation{$Ortho_Groups{$og}->{$genome}->[$i]}->{'ec'}, "\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
471 } elsif ($Anno_Features{$genome}->{'ec'}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
472 print "\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
473 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
474 if ($Annotation{$Ortho_Groups{$og}->{$genome}->[$i]}->{'product'}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
475 print $Annotation{$Ortho_Groups{$og}->{$genome}->[$i]}->{'product'}, "\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
476 } elsif ($Anno_Features{$genome}->{'product'}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
477 warn "\n### Warning:\nNo product annotation (p=) in file '$genome' for ID:\n$Ortho_Groups{$og}->{$genome}->[$i]\nProceeding ...\n"; # overlaps with 'warn' for missing product annotation in parsing the annotations of the genomes
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
478 print "\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
479 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
480
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
481 } else { # genome without a CDS in current $og or less paralogous CDSs than $max_paralog_count
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
482 my $times = 5;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
483 $times -= 1 if (!$Opt_Length);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
484 $times -= 1 if (!$Anno_Features{$genome}->{'gene'}); # genome doesn't have a 'g=/gene' annotation at all
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
485 $times -= 1 if (!$Anno_Features{$genome}->{'ec'});
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
486 $times -= 1 if (!$Anno_Features{$genome}->{'product'}); # very unlikely that there's a whole genome without product annotation ...
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
487 print "\t" x $times;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
488 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
489 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
490 print "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
491 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
492
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
493 return 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
494 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
495