comparison COG/bac-genomics-scripts/cdd2cog/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
2:97e4e3e818b6 3:e42d30da7a74
1 cdd2cog
2 =======
3
4 `cdd2cog.pl` is a script to assign COG categories to query protein sequences.
5
6 * [Synopsis](#synopsis)
7 * [Description](#description)
8 * [Usage](#usage)
9 * [RPS-BLAST+](#rps-blast)
10 * [cdd2cog](#cdd2cog)
11 * [Options](#options)
12 * [Mandatory options](#mandatory-options)
13 * [Optional options](#optional-options)
14 * [Output](#output)
15 * [Run environment](#run-environment)
16 * [Author - contact](#author---contact)
17 * [Acknowledgements](#acknowledgements)
18 * [Citation, installation, and license](#citation-installation-and-license)
19 * [Changelog](#changelog)
20
21 ## Synopsis
22
23 perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog
24
25 ## Description
26 For troubleshooting and a working example, please see issue [#1](https://github.com/aleimba/bac-genomics-scripts/issues/1).
27
28 The script assigns COG ([Clusters of Orthologous
29 Groups](http://www.ncbi.nlm.nih.gov/COG/)) categories to proteins.
30 For this purpose, the query proteins first need to be searched with
31 RPS-BLAST+ ([Reverse Position-Specific BLAST](http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download))
32 against NCBI's Conserved Domain Database
33 ([CDD](http://www.ncbi.nlm.nih.gov/cdd)). Use
34 [`cds_extractor.pl`](/cds_extractor) beforehand to extract multi-FASTA protein
35 files from GenBank or EMBL files.
36
37 Both tab-delimited RPS-BLAST+ output formats, **-outfmt 6** and **-outfmt
38 7**, can be processed by `cdd2cog.pl`. By default, the RPS-BLAST+ hits
39 for each query protein are filtered for the best hit (lowest
40 e-value). Use option **-a|-all\_hits** to assign COGs to all BLAST hits
41 instead, e.g. to filter them downstream in a spreadsheet application.
42 Results are written to tab-delimited files in the './results'
43 folder; overall assignment statistics are printed to *STDOUT*.
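
As a rough illustration of the default best-hit filtering described above, the following shell sketch keeps the hit with the lowest e-value for each query. It is not taken from `cdd2cog.pl`: the function name `best_hits` is hypothetical, and it assumes the default tabular column order, in which column 1 is the query ID and column 11 is the e-value. Ties keep the hit that appears first in the report.

```shell
# Hypothetical helper (not part of cdd2cog.pl): keep the lowest-e-value hit
# per query from a tab-delimited RPS-BLAST+ report (-outfmt 6 or 7).
# Assumption: column 1 is the query ID and column 11 is the e-value.
best_hits() {
    awk -F '\t' '
        /^#/ || NF < 11 { next }                 # skip outfmt 7 comments and short lines
        !($1 in best) || ($11 + 0) < best[$1] {
            if (!($1 in best)) order[++n] = $1   # remember first-seen query order
            best[$1] = $11 + 0                   # lowest e-value seen so far
            line[$1] = $0                        # full report line of that hit
        }
        END { for (i = 1; i <= n; i++) print line[order[i]] }
    ' "$@"
}
```

For example, `best_hits rps-blast.out > best-hits.tsv` would write one line per query (the output filename is arbitrary).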
44
45 Several files from NCBI's FTP server are needed to run RPS-BLAST+ and `cdd2cog.pl`:
46
47 1. **CDD** (ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/)
48
49 More information about the files in the CDD FTP archive can be found in the respective 'README' file.
50
51 1. 'cddid.tbl.gz'
52
53 The file needs to be unpacked:
54
55 `gunzip cddid.tbl.gz`
56
57 Contains summary information about the CD models in a tab-delimited format. The columns are: PSSM-Id, CD accession (e.g. COG#), CD short name, CD description, and PSSM (position-specific scoring matrix) length.
58
59 2. './little_endian/Cog_LE.tar.gz'
60
61 Unpack and untar via:
62
63 `tar xvfz Cog_LE.tar.gz`
64
65 Preformatted RPS-BLAST+ database of the CDD COG distribution for Intel CPUs and Unix/Windows architectures.
66
67 2. **COG** (ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/)
68
69 Read 'readme' for more information about the respective files in the COG FTP archive.
70
71 1. 'fun.txt'
72
73 One-letter functional classification used in the COG database.
74
75 2. 'whog'
76
77 Name, description, and corresponding functional classification of each COG.
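
The role of 'cddid.tbl' can be sketched as a lookup from PSSM-Id (column 1) to CD accession (column 2). The helper below, `add_cd_accession`, is hypothetical and not part of `cdd2cog.pl`. It assumes that the subject ID in column 2 of the RPS-BLAST+ report ends in the PSSM-Id, in a form like `CDD:<PSSM-Id>` or `gnl|CDD|<PSSM-Id>`; the exact form may differ between BLAST+ versions.

```shell
# Hypothetical helper: append the CD accession from cddid.tbl to each hit.
# Assumption: the subject ID (column 2) ends in the PSSM-Id, preceded by
# a "|" or ":" separator; adjust the sub() pattern if your reports differ.
add_cd_accession() {
    cddid=$1
    report=$2
    awk -F '\t' -v OFS='\t' '
        FNR == NR { acc[$1] = $2; next }   # first file: PSSM-Id -> CD accession
        /^#/ { next }                      # skip outfmt 7 comment lines
        {
            id = $2
            sub(/^.*[|:]/, "", id)         # strip everything up to the last | or :
            print $0, ((id in acc) ? acc[id] : "NA")
        }
    ' "$cddid" "$report"
}
```

Hits whose PSSM-Id is not found in 'cddid.tbl' get `NA` in the appended column, which makes mismatched database versions easy to spot.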
78
79 ## Usage
80
81 ### RPS-BLAST+
82
83 rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6
84 rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs'
85
86 ### cdd2cog
87
88 perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog -a
89
90 ## Options
91
92 ### Mandatory options
93
94 - -r, -rps\_report
95
96 Path to RPS-BLAST+ report/output, outfmt 6 or 7
97
98 - -c, -cddid
99
100 Path to CDD's 'cddid.tbl' file
101
102 - -f, -fun
103
104 Path to COG's 'fun.txt' file
105
106 - -w, -whog
107
108 Path to COG's 'whog' file
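
Because all four inputs are mandatory, it can help to check them before starting a run. The wrapper below is a minimal sketch under stated assumptions: the name `run_cdd2cog` is hypothetical, and it assumes `cdd2cog.pl` sits in the current directory. Only the options listed above are used.

```shell
# Hypothetical wrapper: verify that the four mandatory inputs are readable
# before handing them to cdd2cog.pl. Extra arguments (e.g. -a) are passed
# through unchanged. Assumption: cdd2cog.pl is in the current directory.
run_cdd2cog() {
    if [ "$#" -lt 4 ]; then
        echo "usage: run_cdd2cog REPORT CDDID FUN WHOG [extra options]" >&2
        return 2
    fi
    report=$1 cddid=$2 fun=$3 whog=$4
    shift 4
    for f in "$report" "$cddid" "$fun" "$whog"; do
        if [ ! -r "$f" ]; then
            echo "missing or unreadable input: $f" >&2
            return 1
        fi
    done
    perl cdd2cog.pl -r "$report" -c "$cddid" -f "$fun" -w "$whog" "$@"
}
```

A call such as `run_cdd2cog rps-blast.out cddid.tbl fun.txt whog -a` would then stop early with a message naming the first missing file, instead of failing inside the script.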
109
110 ### Optional options
111
112 - -h, -help
113
114 Help (perldoc POD)
115
116 - -a, -all\_hits
117
118 Don't filter RPS-BLAST+ output for the best hit; instead, assign COGs to all hits
119
120 - -v, -version
121
122 Print version number to *STDERR*
123
124 ## Output
125
126 - *STDOUT*
127
128 Overall assignment statistics
129
130 - ./results
131
132 All tab-delimited output files are stored in this result folder
133
134 - rps-blast_cog.txt
135
136 COG assignments concatenated to the RPS-BLAST+ results for filtering
137
138 - protein-id_cog.txt
139
140 Slimmed-down version of 'rps-blast_cog.txt' including only the query ID (first BLAST report column), COGs, and functional categories
141
142 - cog_stats.txt
143
144 Assignment counts for each used COG
145
146 - func_stats.txt
147
148 Assignment counts for single-letter functional categories
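
When all hits were kept with **-a**, 'rps-blast_cog.txt' can also be filtered on the command line rather than in a spreadsheet. The sketch below is not part of `cdd2cog.pl`, and the function name `filter_by_evalue` is hypothetical. It assumes that the original BLAST columns lead each row, so the e-value sits in column 11; if the layout of your file differs, pass the column number as the third argument.

```shell
# Hypothetical helper: keep rows whose e-value is at or below a cutoff.
# Assumption: the e-value is in column 11 unless another column is given.
# Rows whose e-value column is not numeric (e.g. a header) are dropped.
filter_by_evalue() {
    file=$1
    cutoff=$2
    col=${3:-11}
    awk -F '\t' -v cutoff="$cutoff" -v col="$col" '
        /^#/ { next }   # skip comment lines, if any
        $col ~ /^[0-9.]+([eE][-+]?[0-9]+)?$/ && ($col + 0) <= (cutoff + 0)
    ' "$file"
}
```

For example, `filter_by_evalue results/rps-blast_cog.txt 1e-5` would print only the hits with an e-value of 1e-5 or lower.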
149
150 ## Run environment
151
152 The Perl script runs under UNIX flavors.
153
154 ## Author - contact
155
156 Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
157
158 ## Acknowledgements
159
160 I got the idea for using NCBI's CDD PSSMs for COG assignment from JGI's [IMG/ER annotation system](http://img.jgi.doe.gov/), which employs the same technique.
161
162 ## Citation, installation, and license
163
164 For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
165
166 ## Changelog
167
168 * v0.2 (2017-02-16)
169 * Adapted to new NCBI FASTA header format for CDD RPS-BLAST+ output
170 * v0.1 (2013-08-01)