annotate COG/bac-genomics-scripts/cdd2cog/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
cdd2cog
=======

`cdd2cog.pl` is a script to assign COG categories to query protein sequences.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
  * [RPS-BLAST+](#rps-blast)
  * [cdd2cog](#cdd2cog)
* [Options](#options)
  * [Mandatory options](#mandatory-options)
  * [Optional options](#optional-options)
* [Output](#output)
* [Run environment](#run-environment)
* [Author - contact](#author---contact)
* [Acknowledgements](#acknowledgements)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog

## Description
For troubleshooting and a working example please see issue [#1](https://github.com/aleimba/bac-genomics-scripts/issues/1).

The script assigns COG ([Clusters of Orthologous
Groups](http://www.ncbi.nlm.nih.gov/COG/)) categories to proteins.
For this purpose, the query proteins need to be searched with
RPS-BLAST+ ([Reverse Position-Specific BLAST](http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download))
against NCBI's Conserved Domain Database
([CDD](http://www.ncbi.nlm.nih.gov/cdd)). Use
[`cds_extractor.pl`](/cds_extractor) beforehand to extract multi-fasta protein
files from GenBank or EMBL files.

Both tab-delimited RPS-BLAST+ output formats, **-outfmt 6** and **-outfmt
7**, can be processed by `cdd2cog.pl`. By default, RPS-BLAST+ hits
for each query protein are filtered for the best hit (lowest
e-value). Use option **-a|all\_hits** to assign COGs to all BLAST hits
and, for example, filter them downstream in a spreadsheet application.
Results are written to tab-delimited files in the './results'
folder; overall assignment statistics are printed to *STDOUT*.
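
As a rough illustration of what best-hit filtering by lowest e-value amounts to, here is a minimal shell sketch. It is not the code used by `cdd2cog.pl`. It assumes the report has the query id in column 1 and the e-value in column 11; the function name `best_hits` and the demo rows are made up for this example.

```shell
#!/bin/sh
# Sketch only: keep the lowest e-value hit per query from a tab-delimited
# RPS-BLAST+ report. Assumes column 1 is the query id and column 11 the e-value.

best_hits() {
    # $1: path to the report; '#' comment lines (as written by -outfmt 7) are skipped
    awk -F '\t' '
        /^#/ { next }
        !($1 in best) || ($11 + 0) < (best[$1] + 0) {
            if (!($1 in best)) order[++n] = $1   # remember first-seen query order
            best[$1] = $11
            line[$1] = $0
        }
        END { for (i = 1; i <= n; i++) print line[order[i]] }
    ' "$1"
}

# Demo with made-up hits: two for protA, one for protB.
demo=$(mktemp)
printf 'protA\tCDD:100001\t30.0\t200\t140\t2\t1\t200\t5\t204\t1e-05\t60\n'  >  "$demo"
printf 'protA\tCDD:100002\t45.0\t210\t110\t1\t1\t210\t3\t212\t1e-40\t150\n' >> "$demo"
printf 'protB\tCDD:100003\t38.0\t180\t105\t3\t4\t183\t1\t180\t2e-12\t80\n'  >> "$demo"
best_hits "$demo"
rm -f "$demo"
```

When two hits of one query share the same e-value, this sketch keeps the one that appears first in the report.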

Several files are needed from NCBI's FTP server to run RPS-BLAST+ and `cdd2cog.pl`:

1. **CDD** (ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/)

    More information about the files in the CDD FTP archive can be found in the respective 'README' file.

    1. 'cddid.tbl.gz'

        The file needs to be unpacked:

        `gunzip cddid.tbl.gz`

        This file contains summary information about the CD models in tab-delimited format. The columns are: PSSM-Id, CD accession (e.g. COG#), CD short name, CD description, and PSSM (position-specific scoring matrix) length.

    2. './little_endian/Cog_LE.tar.gz'

        Unpack and untar via:

        `tar xvfz Cog_LE.tar.gz`

        Preformatted RPS-BLAST+ database of the CDD COG distribution for Intel CPUs and Unix/Windows architectures.

2. **COG** (ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/)

    Read 'readme' for more information about the respective files in the COG FTP archive.

    1. 'fun.txt'

        One-letter functional classification used in the COG database.

    2. 'whog'

        Name, description, and corresponding functional classification of each COG.
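
To make the role of 'cddid.tbl' concrete, the following shell sketch joins an RPS-BLAST+ report with that table to get a CD accession for each hit. It is an illustration under stated assumptions, not the logic of `cdd2cog.pl`: it takes column 1 of 'cddid.tbl' as the PSSM-Id and column 2 as the CD accession (as listed above), and it assumes the subject id in report column 2 ends in the PSSM-Id after a `|` or `:` separator (for example `gnl|CDD|100001` or `CDD:100001`). The function name and all demo rows are made up.

```shell
#!/bin/sh
# Sketch only: map each hit to a CD accession via cddid.tbl.
# Assumes cddid.tbl column 1 = PSSM-Id, column 2 = CD accession, and that the
# subject id (report column 2) ends in the PSSM-Id.

pssm_to_accession() {
    # $1: cddid.tbl, $2: RPS-BLAST+ report; prints "query<TAB>accession"
    awk -F '\t' '
        NR == FNR { acc[$1] = $2; next }   # first file: PSSM-Id -> accession
        /^#/ { next }                      # skip -outfmt 7 comment lines
        {
            id = $2
            sub(/^.*[|:]/, "", id)         # drop any prefix up to the last | or :
            if (id in acc) print $1 "\t" acc[id]
        }
    ' "$1" "$2"
}

# Demo with made-up table rows and hits.
tbl=$(mktemp)
rep=$(mktemp)
printf '100001\tCOG0001\tExampleA\tMade-up description\t300\n' >  "$tbl"
printf '100002\tCOG0002\tExampleB\tMade-up description\t250\n' >> "$tbl"
printf 'protA\tgnl|CDD|100001\t45.0\t210\t110\t1\t1\t210\t3\t212\t1e-40\t150\n' >  "$rep"
printf 'protB\tCDD:100002\t38.0\t180\t105\t3\t4\t183\t1\t180\t2e-12\t80\n'      >> "$rep"
pssm_to_accession "$tbl" "$rep"
rm -f "$tbl" "$rep"
```

Hits whose PSSM-Id is missing from the table are silently dropped in this sketch.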

## Usage

### RPS-BLAST+

    rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6
    rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs'

### cdd2cog

    perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog -a
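
The two steps can be chained in a small wrapper. The sketch below is one possible arrangement, not part of the repository: the function names `require_files` and `run_cdd2cog` are hypothetical, and it assumes 'cddid.tbl', 'fun.txt', 'whog', and `cdd2cog.pl` sit in the current directory. Only the command-line flags shown above are used.

```shell
#!/bin/sh
# Sketch only: check inputs, then run RPS-BLAST+ followed by cdd2cog.pl.

require_files() {
    # Return non-zero if any of the given files is missing or unreadable.
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "missing or unreadable input: $f" >&2
            return 1
        fi
    done
}

run_cdd2cog() {
    # $1: protein multi-fasta file, $2: RPS-BLAST+ database name (e.g. Cog)
    require_files "$1" cddid.tbl fun.txt whog || return 1
    rpsblast -query "$1" -db "$2" -out rps-blast.out -evalue 1e-2 -outfmt 6 || return 1
    perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog
}

# Example call (commented out because it needs rpsblast and the NCBI files):
# run_cdd2cog protein.fasta Cog
```

Checking the inputs first avoids a long RPS-BLAST+ run that would only fail at the `cdd2cog.pl` step.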

## Options

### Mandatory options

- -r, -rps\_report

    Path to RPS-BLAST+ report/output, outfmt 6 or 7

- -c, -cddid

    Path to CDD's 'cddid.tbl' file

- -f, -fun

    Path to COG's 'fun.txt' file

- -w, -whog

    Path to COG's 'whog' file

### Optional options

- -h, -help

    Help (perldoc POD)

- -a, -all\_hits

    Don't filter RPS-BLAST+ output for the best hit, rather assign COGs to all hits

- -v, -version

    Print version number to *STDERR*

## Output

- *STDOUT*

    Overall assignment statistics

- ./results

    All tab-delimited output files are stored in this result folder

- rps-blast_cog.txt

    COG assignments concatenated to the RPS-BLAST+ results for filtering

- protein-id_cog.txt

    Slimmed down 'rps-blast_cog.txt' only including query id (first BLAST report column), COGs, and functional categories

- cog_stats.txt

    Assignment counts for each used COG

- func_stats.txt

    Assignment counts for single-letter functional categories
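
Downstream filtering of 'rps-blast_cog.txt' does not have to happen in a spreadsheet. The shell sketch below keeps only rows at or below an e-value cutoff. It rests on an assumption that is not confirmed here: that the RPS-BLAST+ columns come first in the file, so the e-value is still in column 11. The function name `filter_by_evalue` and the demo rows (including the appended columns) are made up for illustration.

```shell
#!/bin/sh
# Sketch only: keep rows of a tab-delimited file at or below an e-value cutoff.
# Assumes the e-value is in column 11, as in the RPS-BLAST+ report.

filter_by_evalue() {
    # $1: tab-delimited file, $2: maximum e-value (e.g. 1e-10)
    awk -F '\t' -v max="$2" '
        # pass through rows whose 11th field is not a number (e.g. a header row)
        $11 !~ /^[0-9.]+([eE][-+]?[0-9]+)?$/ { print; next }
        ($11 + 0) <= (max + 0)
    ' "$1"
}

# Demo with made-up rows: only the protA hit passes a cutoff of 1e-10.
demo=$(mktemp)
printf 'protA\tCDD:100002\t45.0\t210\t110\t1\t1\t210\t3\t212\t1e-40\t150\tCOG0002\n' >  "$demo"
printf 'protB\tCDD:100001\t30.0\t200\t140\t2\t1\t200\t5\t204\t1e-05\t60\tCOG0001\n'  >> "$demo"
filter_by_evalue "$demo" 1e-10
rm -f "$demo"
```

A typical call would be `filter_by_evalue results/rps-blast_cog.txt 1e-10 > filtered.txt`, after running `cdd2cog.pl` with **-a**.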

## Run environment

The Perl script runs under UNIX flavors.

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Acknowledgements

I got the idea for using NCBI's CDD PSSMs for COG assignment from JGI's [IMG/ER annotation system](http://img.jgi.doe.gov/), which employs the same technique.

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

* v0.2 (2017-02-16)
    * Adapted to new NCBI FASTA header format for CDD RPS-BLAST+ output
* v0.1 (2013-08-01)