annotate COG/bac-genomics-scripts/cdd2cog/cdd2cog.pl @ 15:dbde253606c5 draft

Uploaded
author dereeper
date Wed, 11 Dec 2024 08:25:06 +0000
parents e42d30da7a74
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
3
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
1 #!/usr/bin/perl
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
2
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
3 #######
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
4 # POD #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
5 #######
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
6
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
7 =pod
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
8
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
9 =head1 NAME
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
10
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
11 C<cdd2cog.pl> - assign COG categories to protein sequences
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
12
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
13 =head1 SYNOPSIS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
14
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
15 C<perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
16
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
17 =head1 DESCRIPTION
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
18
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
19 The script assigns COG (L<cluster of orthologous
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
20 groups|http://www.ncbi.nlm.nih.gov/COG/>) categories to proteins.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
21 For this purpose, the query proteins need to be blasted with
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
22 RPS-BLAST+ (L<Reverse Position-Specific BLAST|http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download>)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
23 against NCBI's Conserved Domain Database
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
24 (L<CDD|http://www.ncbi.nlm.nih.gov/cdd>). Use
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
25 L<C<cds_extractor.pl>|/cds_extractor> beforehand to extract multi-fasta
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
26 protein files from GENBANK or EMBL files.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
27
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
28 Both tab-delimited RPS-BLAST+ outformats, B<-outfmt 6> and B<-outfmt
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
29 7>, can be processed by C<cdd2cog.pl>. By default, RPS-BLAST+ hits
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
30 for each query protein are filtered for the best hit (lowest
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
31 e-value). Use option B<-a|all_hits> to assign COGs to all BLAST hits
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
32 and e.g. do a downstream filtering in a spreadsheet application.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
33 Results are written to tab-delimited files in the F<./results>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
34 folder, overall assignment statistics are printed to C<STDOUT>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
35
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
36 Several files are needed from NCBI's FTP server to run the RPS-BLAST+
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
37 and C<cdd2cog.pl>:
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
38
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
39 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
40
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
41 =item 1.) L<CDD|ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
42
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
43 More information about the files in the CDD FTP archive can be found
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
44 in the respective F<README> file.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
45
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
46 =item 1.1.) F<cddid.tbl.gz>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
47
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
48 The file needs to be unpacked:
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
49
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
50 C<gunzip cddid.tbl.gz>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
51
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
52 Contains summary information about the CD models in a tab-delimited
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
53 format. The columns are: PSSM-Id, CD accession (e.g. COG#), CD short
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
54 name, CD description, and PSSM (position-specific scoring matrices)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
55 length.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
56
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
57 =item 1.2.) F<./little_endian/Cog_LE.tar.gz>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
58
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
59 Unpack and untar via:
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
60
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
61 C<tar xvfz Cog_LE.tar.gz>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
62
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
63 Preformatted RPS-BLAST+ database of the CDD COG distribution for
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
64 Intel CPUs and Unix/Windows architectures.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
65
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
66 =item 2.) L<COG|ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
67
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
68 Read F<readme> for more information about the respective files in
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
69 the COG FTP archive.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
70
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
71 =item 2.1.) F<fun.txt>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
72
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
73 One-letter functional classification used in the COG database.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
74
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
75 =item 2.2.) F<whog>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
76
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
77 Name, description, and corresponding functional classification of
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
78 each COG.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
79
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
80 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
81
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
82 =head1 OPTIONS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
83
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
84 =head2 Mandatory options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
85
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
86 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
87
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
88 =item B<-r>=I<str>, B<-rps_report>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
89
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
90 Path to RPS-BLAST+ report/output, outfmt 6 or 7
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
91
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
92 =item B<-c>=I<str>, B<-cddid>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
93
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
94 Path to CDD's F<cddid.tbl> file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
95
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
96 =item B<-f>=I<str>, B<-fun>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
97
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
98 Path to COG's F<fun.txt> file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
99
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
100 =item B<-w>=I<str>, B<-whog>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
101
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
102 Path to COG's F<whog> file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
103
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
104 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
105
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
106 =head2 Optional options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
107
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
108 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
109
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
110 =item B<-h>, B<-help>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
111
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
112 Help (perldoc POD)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
113
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
114 =item B<-a>, B<-all_hits>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
115
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
116 Don't filter RPS-BLAST+ output for the best hit, rather assign COGs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
117 to all hits
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
118
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
119 =item B<-v>, B<-version>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
120
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
121 Print version number to C<STDERR>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
122
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
123 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
124
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
125 =head1 OUTPUT
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
126
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
127 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
128
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
129 =item C<STDOUT>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
130
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
131 Overall assignment statistics
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
132
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
133 =item F<./results>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
134
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
135 All tab-delimited output files are stored in this result folder
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
136
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
137 =item F<rps-blast_cog.txt>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
138
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
139 COG assignments concatenated to the RPS-BLAST+ results for filtering
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
140
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
141 =item F<protein-id_cog.txt>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
142
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
143 Slimmed down F<rps-blast_cog.txt> only including query id (first
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
144 BLAST report column), COGs, and functional categories
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
145
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
146 =item F<cog_stats.txt>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
147
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
148 Assignment counts for each used COG
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
149
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
150 =item F<func_stats.txt>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
151
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
152 Assignment counts for single-letter functional categories
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
153
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
154 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
155
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
156 =head1 EXAMPLES
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
157
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
158 =head2 RPS-BLAST+
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
159
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
160 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
161
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
162 =item C<rpsblast -query protein.fasta -db Cog -out rps-blast.out
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
163 -evalue 1e-2 -outfmt 6>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
164
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
165 =item C<rpsblast -query protein.fasta -db Cog -out rps-blast.out
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
166 -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gapopen
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
167 qstart qend sstart send evalue bitscore qcovs'>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
168
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
169 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
170
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
171 =head2 C<cdd2cog.pl>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
172
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
173 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
174
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
175 =item C<perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
176 -w whog -a>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
177
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
178 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
179
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
180 =head1 VERSION
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
181
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
182 0.2 update: 2017-02-16
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
183 0.1 2013-08-01
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
184
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
185 =head1 AUTHOR
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
186
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
187 Andreas Leimbach aleimba[at]gmx[dot]de
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
188
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
189 =head1 ACKNOWLEDGEMENTS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
190
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
191 I got the idea for using NCBI's CDD PSSMs for COG assignment from JGI's L<IMG/ER annotation
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
192 system|http://img.jgi.doe.gov/>, which employes the same technique.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
193
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
194
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
195 =head1 LICENSE
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
196
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
197 This program is free software: you can redistribute it and/or modify
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
198 it under the terms of the GNU General Public License as published by
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
199 the Free Software Foundation; either version 3 (GPLv3) of the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
200 License, or (at your option) any later version.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
201
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
202 This program is distributed in the hope that it will be useful, but
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
203 WITHOUT ANY WARRANTY; without even the implied warranty of
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
204 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
205 General Public License for more details.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
206
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
207 You should have received a copy of the GNU General Public License
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
208 along with this program. If not, see L<http://www.gnu.org/licenses/>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
209
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
210 =cut
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
211
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
212
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
213 ########
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
214 # MAIN #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
215 ########
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
216
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
217 use strict;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
218 use warnings;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
219 use autodie;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
220 use Getopt::Long;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
221 use Pod::Usage;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
222
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
223
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
224 ### Get the options with Getopt::Long
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
225 my $Rps_Report; # path to the rps-blast report/output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
226 my $CDDid_File; # path to the CDD 'cddid.tbl' file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
227 my $Fun_File; # path to the COG 'fun' file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
228 my $Whog_File; # path to the COG 'whog' file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
229 my $Opt_All_Hits; # give all blast hits for a query, not just the best (lowest evalue)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
230 my $VERSION = 0.1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
231 my ($Opt_Version, $Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
232 GetOptions ('rps_report=s' => \$Rps_Report,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
233 'cddid=s' => \$CDDid_File,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
234 'fun=s' => \$Fun_File,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
235 'whog=s' => \$Whog_File,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
236 'all_hits' => \$Opt_All_Hits,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
237 'version' => \$Opt_Version,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
238 'help|?' => \$Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
239
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
240
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
241
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
242 ### Run perldoc on POD
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
243 pod2usage(-verbose => 2) if ($Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
244 die "$0 $VERSION\n" if ($Opt_Version);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
245 if (!$Rps_Report || !$CDDid_File || !$Fun_File || !$Whog_File) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
246 my $warning = "\n### Fatal error: Option(s) or arguments for '-r', '-c', '-f', or '-w' are missing!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
247 pod2usage(-verbose => 1, -message => $warning, -exitval => 2);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
248 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
249
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
250
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
251
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
252 ### Parse the 'cddid.tbl', 'fun.txt' and 'whog' file contents and store info in hash structures
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
253 my (%CDDid, %Fun, %Whog); # global hashes
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
254 parse_cdd_cog(); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
255
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
256
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
257
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
258 ### Create results directory for output files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
259 my $Out_Dir = './results/';
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
260 if (-e $Out_Dir) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
261 print "###Directory '$Out_Dir' already exists! Replace the directory and all its contents [y|n]? ";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
262 my $user_ask = <STDIN>;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
263 if ($user_ask =~ /y/i) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
264 unlink glob "$Out_Dir*"; # remove all files in results directory
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
265 rmdir $Out_Dir; # remove the empty directory
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
266 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
267 die "Script abborted!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
268 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
269 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
270 mkdir $Out_Dir or die "Can't create directory \"$Out_Dir\": $!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
271
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
272
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
273
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
274 ### Parse the rps-blast report/output file and assign COGs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
275 my %Cog_Stats; # store the total number of query protein hits for each COG, written to '$Cogstats_Out' below
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
276
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
277 my $Blast_Out = 'rps-blast_cog.txt'; # output file for COG assignments appended to RPS-BLAST results
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
278 open (my $Blast_Out_Fh, ">", "$Out_Dir"."$Blast_Out");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
279 print $Blast_Out_Fh "query id\tsubject id\t% identity\talignment length\tmismatches\tgap opens\tq. start\tq. end\ts. start\ts. end\tevalue\tbit score\tCOG#\tfunctional categories\t\t\t\t\tCOG protein description\n"; # header for $Blast_Out
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
280
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
281 my $Locus_Cog = "protein-id_cog.txt"; # slimmed down $Blast_Out only including locus_tags, COGs, and functional categories
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
282 open (my $Locus_Cog_Fh, ">", "$Out_Dir"."$Locus_Cog");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
283
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
284 print "Parsing RPS-BLAST report ...\n"; # status message
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
285 my $Skip = ''; # only keep best blast hit per query (lowest e-value), except option 'all_hits' is given
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
286 open (my $Rps_Report_Fh, "<", "$Rps_Report");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
287 while (<$Rps_Report_Fh>) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
288 if (/^#/) { # skip comment lines in blast report for BLAST+ "outfmt 7"
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
289 next;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
290 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
291 chomp;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
292
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
293 my @line = split(/\t/, $_); # split tab-separated RPS-BLAST report
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
294
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
295 if ($Skip eq $line[0] && !$Opt_All_Hits) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
296 # only keep best blast hit per query, only if option 'all_hits' is NOT set
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
297 # $line[0] is query id, and should be locus_tag or specific ID from multi-fasta protein query file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
298 next;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
299 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
300 $Skip = $line[0];
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
301
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
302 my $pssm_id = $1 if $line[1] =~ /CDD\|(\d+)/; # get PSSM-Id from the subject hit
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
303 my $cog = $CDDid{$pssm_id}; # get the COG# according to the PSSM-Id as listed in 'cddid.tbl'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
304 print "COG: $cog \n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
305 $Cog_Stats{$cog}++; # increment hit-number for specific COG
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
306
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
307 ### Collect functional categories stats
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
308 my @functions = split('', $Whog{$cog}->{'function'}); # split the single-letter functional categories to count them and join them as tab-separated below
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
309 foreach (@functions) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
310 $Fun{$_}->{'count'}++; # increment hit-number for specific functional category
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
311 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
312
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
313 ### Print to result files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
314 my $functions = join("\t", @functions); # join functional categories tab-separated
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
315 print $Locus_Cog_Fh "$line[0]\t$cog\t$functions\n"; # locus_tag\tCOG\tfunctional categories
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
316 $functions .= "\t" x (5 - @functions); # add additional tabs for COGs with fewer than five functions (which is the maximum number)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
317 print $Blast_Out_Fh "$_\t$cog\t$functions\t$Whog{$cog}->{'desc'}\n"; # $_ = RPS-BLAST line
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
318 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
319
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
320 close $Rps_Report_Fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
321 close $Blast_Out_Fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
322 close $Locus_Cog_Fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
323
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
324
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
325
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
326 ### Total COG and functional categories stats
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
327 print "Writing assignment statistic files in '$Out_Dir' folder ...\n"; # status message
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
328
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
329 my $Cogstats_Out = 'cog_stats.txt'; # output file for assignment numbers for each COG
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
330 open (my $Cog_Stats_Fh, ">", "$Out_Dir"."$Cogstats_Out");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
331 my $prot_stats = 0; # store total number of query proteins, which have a COG assignment
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
332 foreach my $cog (sort keys %Cog_Stats) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
333 print $Cog_Stats_Fh "$cog\t$Whog{$cog}->{'desc'}\t$Cog_Stats{$cog}\n"; # COG protein descriptions stored in %Whog
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
334 $prot_stats += $Cog_Stats{$cog}; # sum up total COG assignments
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
335 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
336 close $Cog_Stats_Fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
337
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
338 my $Funcstats_Out = 'func_stats.txt'; # output file for assignment numbers for each functional category
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
339 open (my $Func_Stats_Fh, ">", "$Out_Dir"."$Funcstats_Out");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
340 my $func_cats = 0; # store total number of assigned functional categories
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
341 foreach my $func (sort keys %Fun) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
342 print $Func_Stats_Fh "$func\t$Fun{$func}->{'desc'}\t$Fun{$func}->{'count'}\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
343 $func_cats += $Fun{$func}->{'count'}; # sum up total functional category assignments
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
344 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
345 close $Func_Stats_Fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
346
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
347
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
348
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
349 ### State which files were created and print overall statistics
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
350 print "\n############################################################################\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
351 print "The following tab-delimited files were created in the '$Out_Dir' directory:\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
352 print "- $Blast_Out: COG assignments concatenated to the RPS-BLAST results for filtering\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
353 print "- $Locus_Cog: Slimmed down '$Blast_Out' only including query id (first BLAST report column), COGs, and functional categories\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
354 print "- $Cogstats_Out: COG assignment counts\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
355 print "- $Funcstats_Out: Functional category assignment counts\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
356 print "##############################################################################\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
357 print "Overall assignment statistics:\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
358 print "~ Total query proteins categorized into COGs: $prot_stats\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
359 print "~ Total COGs used for the query proteins [of ", scalar keys %CDDid, " overall]: ", scalar keys %Cog_Stats, "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
360 print "~ Total number of assigned functional categories: $func_cats\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
361 print "~ Total functional categories used for the query proteins [of ", scalar keys %Fun, " overall]: ", scalar grep ($Fun{$_}->{'count'} > 0, keys %Fun), "\n\n"; # grep for functional categories with a count > 0 to get the ones with assigned query proteins
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
362
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
363 exit;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
364
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
365
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
366
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
367 ###############
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
368 # Subroutines #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
369 ###############
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
370
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
371 ### Subroutine to parse the 'cddid.tbl', 'fun' and 'whog' file contents and store in hash structures
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
372 sub parse_cdd_cog {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
373
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
374 ### 'cddid.tbl'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
375 open (my $cddid_fh, "<", "$CDDid_File");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
376 print "\nParsing CDDs '$CDDid_File' file ...\n"; # status message
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
377 while (<$cddid_fh>) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
378 chomp;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
379 my @line = split(/\t/, $_); # split line at the tabs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
380 if ($line[1] =~ /^COG\d{4}$/) { # search for COG CD accessions in cddid
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
381 $CDDid{$line[0]} = $line[1]; # hash to store info; $line[0] = PSSM-Id
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
382 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
383 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
384 close $cddid_fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
385
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
386 ### 'fun.txt'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
387 open (my $fun_fh, "<", "$Fun_File");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
388 print "Parsing COGs '$Fun_File' file ...\n"; # status message
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
389 while (<$fun_fh>) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
390 chomp;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
391 $_ =~ s/^\s*|\s+$//g; # get rid of all leading and trailing whitespaces
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
392 if (/^\[(\w)\]\s*(.+)$/) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
393 $Fun{$1} = {'desc' => $2, 'count' => 0}; # anonymous hash in hash
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
394 # $1 = single-letter functional category, $2 = description of functional category
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
395 # count used to find functional categories not present in the query proteins for final overall assignment statistics
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
396 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
397 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
398 close $fun_fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
399
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
400 ### 'whog'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
401 open (my $whog_fh, "<", "$Whog_File");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
402 print "Parsing COGs '$Whog_File' file ...\n"; # status message
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
403 while (<$whog_fh>) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
404 chomp;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
405 $_ =~ s/^\s*|\s+$//g; # get rid of all leading and trailing whitespaces
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
406 if (/^\[(\w+)\]\s*(COG\d{4})\s+(.+)$/) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
407 $Whog{$2} = {'function' => $1, 'desc' => $3}; # anonymous hash in hash
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
408 # $1 = single-letter functional categories, maximal five per COG (only COG5032 with five)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
409 # $2 = COG#, $3 = COG protein description
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
410 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
411 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
412 close $whog_fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
413
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
414 return 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
415 }