cdd2cog
=======

`cdd2cog.pl` is a script to assign COG categories to query protein sequences.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
    * [RPS-BLAST+](#rps-blast)
    * [cdd2cog](#cdd2cog)
* [Options](#options)
    * [Mandatory options](#mandatory-options)
    * [Optional options](#optional-options)
* [Output](#output)
* [Run environment](#run-environment)
* [Author - contact](#author---contact)
* [Acknowledgements](#acknowledgements)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog

## Description
For troubleshooting and a working example please see issue [#1](https://github.com/aleimba/bac-genomics-scripts/issues/1).

The script assigns COG ([cluster of orthologous
groups](http://www.ncbi.nlm.nih.gov/COG/)) categories to proteins.
For this purpose, the query proteins need to be blasted with
RPS-BLAST+ ([Reverse Position-Specific BLAST](http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download))
against NCBI's Conserved Domain Database
([CDD](http://www.ncbi.nlm.nih.gov/cdd)). Use
[`cds_extractor.pl`](/cds_extractor) beforehand to extract multi-fasta protein
files from GenBank or EMBL files.

Both tab-delimited RPS-BLAST+ outformats, **-outfmt 6** and **-outfmt
7**, can be processed by `cdd2cog.pl`. By default, RPS-BLAST+ hits
for each query protein are filtered for the best hit (lowest
e-value). Use option **-a|-all\_hits** to assign COGs to all BLAST hits
and then, e.g., filter them downstream in a spreadsheet application.
Results are written to tab-delimited files in the './results'
folder; overall assignment statistics are printed to *STDOUT*.

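
The default best-hit filtering can be approximated outside the script, for example to preview a report before running `cdd2cog.pl`. The following is a minimal sketch and not code taken from `cdd2cog.pl`: `best_hits` is a hypothetical helper name, and it assumes the query ID is in column 1 and the e-value in column 11 (true for the default **-outfmt 6** layout) and a `sort` that supports general-numeric ordering via `-g` (GNU and BSD `sort` do).

```shell
# Hypothetical helper, not part of cdd2cog.pl: keep the lowest e-value hit per query.
# Usage: best_hits rps-blast.out
# Assumes tab-delimited columns with the query ID in column 1 and the e-value in column 11.
# Steps: drop '#' comment lines (-outfmt 7), sort by query and then by ascending
# e-value, and print only the first line seen for each query.
best_hits() {
    tab=$(printf '\t')
    grep -v '^#' "$1" |
        LC_ALL=C sort -t "$tab" -k1,1 -k11,11g |
        awk -F '\t' '!seen[$1]++'
}
```

How `cdd2cog.pl` itself breaks ties between hits with identical e-values is not covered here; the sketch simply keeps whichever of them `sort` places first.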
Several files from NCBI's FTP server are needed to run RPS-BLAST+ and `cdd2cog.pl`:

1. **CDD** (ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/)

    More information about the files in the CDD FTP archive can be found in the respective 'README' file.

    1. 'cddid.tbl.gz'

        The file needs to be unpacked:

        `gunzip cddid.tbl.gz`

        Contains summary information about the CD models in a tab-delimited format. The columns are: PSSM-Id, CD accession (e.g. COG#), CD short name, CD description, and PSSM (position-specific scoring matrix) length.

    2. './little_endian/Cog_LE.tar.gz'

        Unpack and untar via:

        `tar xvfz Cog_LE.tar.gz`

        Preformatted RPS-BLAST+ database of the CDD COG distribution for Intel CPUs and Unix/Windows architectures.

2. **COG** (ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/)

    Read 'readme' for more information about the respective files in the COG FTP archive.

    1. 'fun.txt'

        One-letter functional classification used in the COG database.

    2. 'whog'

        Name, description, and corresponding functional classification of each COG.

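
The column layout of 'cddid.tbl' given above is what links an RPS-BLAST+ hit to a COG: the PSSM-Id of the hit is looked up in column 1 and the CD accession is read from column 2. The sketch below illustrates that lookup with `awk`; it is not the script's own code. `add_accession` is a hypothetical helper name, and the assumption that the subject ID in report column 2 ends in the PSSM-Id (as in `CDD:223104` or `gnl|CDD|223104`) should be checked against your own report.

```shell
# Hypothetical helper, not part of cdd2cog.pl: append the CD accession to each hit.
# Usage: add_accession cddid.tbl rps-blast.out
# Assumes the subject ID (report column 2) ends in the PSSM-Id, e.g. 'CDD:223104'
# or 'gnl|CDD|223104'. Hits whose PSSM-Id is missing from cddid.tbl get 'NA'.
add_accession() {
    awk -F '\t' -v OFS='\t' '
        # First file (cddid.tbl): remember PSSM-Id -> CD accession.
        NR == FNR { acc[$1] = $2; next }
        # Second file (report): skip the comment lines of -outfmt 7.
        /^#/ { next }
        {
            # The PSSM-Id is taken as the last "|"- or ":"-separated component.
            n = split($2, part, /[|:]/)
            id = part[n]
            print $0, ((id in acc) ? acc[id] : "NA")
        }
    ' "$1" "$2"
}
```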
## Usage

### RPS-BLAST+

    rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6
    rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs'

### cdd2cog

    perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog -a

## Options

### Mandatory options

- -r, -rps\_report

    Path to RPS-BLAST+ report/output, outfmt 6 or 7

- -c, -cddid

    Path to CDD's 'cddid.tbl' file

- -f, -fun

    Path to COG's 'fun.txt' file

- -w, -whog

    Path to COG's 'whog' file

### Optional options

- -h, -help

    Help (perldoc POD)

- -a, -all\_hits

    Don't filter RPS-BLAST+ output for the best hit, rather assign COGs to all hits

- -v, -version

    Print version number to *STDERR*

## Output

- *STDOUT*

    Overall assignment statistics

- ./results

    All tab-delimited output files are stored in this result folder

- rps-blast_cog.txt

    COG assignments concatenated to the RPS-BLAST+ results for filtering

- protein-id_cog.txt

    Slimmed down 'rps-blast_cog.txt' only including query id (first BLAST report column), COGs, and functional categories

- cog_stats.txt

    Assignment counts for each used COG

- func_stats.txt

    Assignment counts for single-letter functional categories

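
When the script is run with **-a**, 'rps-blast_cog.txt' holds every hit, and filtering it by e-value can be done on the command line instead of in a spreadsheet. The sketch below is one way to do that, under the assumption that the leading columns of 'rps-blast_cog.txt' keep the order of the RPS-BLAST+ report, so that the e-value is in column 11; verify this against your own file first. `filter_evalue` is a hypothetical helper name.

```shell
# Hypothetical helper, not part of cdd2cog.pl: keep hits at or below an e-value cutoff.
# Usage: filter_evalue 1e-10 results/rps-blast_cog.txt
# Assumes the e-value sits in column 11, as in the RPS-BLAST+ report.
filter_evalue() {
    awk -F '\t' -v max="$1" '
        # Skip comment lines, if any.
        /^#/ { next }
        # Print lines whose column 11 starts with a digit and does not exceed
        # the cutoff; a non-numeric header row, if present, is dropped.
        $11 ~ /^[0-9]/ && $11 + 0 <= max + 0
    ' "$2"
}
```

For example, `filter_evalue 1e-10 results/rps-blast_cog.txt > filtered.txt` would write the retained hits to a new file (the output file name is arbitrary).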
## Run environment

The Perl script runs under UNIX flavors.

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Acknowledgements

I got the idea for using NCBI's CDD PSSMs for COG assignment from JGI's [IMG/ER annotation system](http://img.jgi.doe.gov/), which employs the same technique.

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

* v0.2 (2017-02-16)
    * Adapted to new NCBI FASTA header format for CDD RPS-BLAST+ output
* v0.1 (2013-08-01)