comparison COG/bac-genomics-scripts/cdd2cog/README.md @ 3:e42d30da7a74 draft
Uploaded
author | dereeper |
---|---|
date | Thu, 30 May 2024 11:52:25 +0000 |
cdd2cog
=======

`cdd2cog.pl` is a script to assign COG categories to query protein sequences.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
  * [RPS-BLAST+](#rps-blast)
  * [cdd2cog](#cdd2cog)
* [Options](#options)
  * [Mandatory options](#mandatory-options)
  * [Optional options](#optional-options)
* [Output](#output)
* [Run environment](#run-environment)
* [Author - contact](#author---contact)
* [Acknowledgements](#acknowledgements)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog

## Description

For troubleshooting and a working example, please see issue [#1](https://github.com/aleimba/bac-genomics-scripts/issues/1).

The script assigns COG ([cluster of orthologous
groups](http://www.ncbi.nlm.nih.gov/COG/)) categories to proteins.
For this purpose, the query proteins first need to be blasted with
RPS-BLAST+ ([Reverse Position-Specific BLAST](http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download))
against NCBI's Conserved Domain Database
([CDD](http://www.ncbi.nlm.nih.gov/cdd)). Use
[`cds_extractor.pl`](/cds_extractor) beforehand to extract multi-fasta protein
files from GENBANK or EMBL files.

Both tab-delimited RPS-BLAST+ output formats, **-outfmt 6** and **-outfmt
7**, can be processed by `cdd2cog.pl`. By default, the RPS-BLAST+ hits
of each query protein are filtered for the best hit (lowest
e-value). Use option **-a|all\_hits** to assign COGs to all BLAST hits
instead, e.g. to filter them downstream in a spreadsheet application.
Results are written to tab-delimited files in the './results'
folder; overall assignment statistics are printed to *STDOUT*.

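The best-hit idea described above can be sketched with `awk`. This is only an illustration, not the code used inside `cdd2cog.pl`: the function name `best_hits` is hypothetical, and the sketch assumes the default 12-column tabular layout, in which column 1 is the query ID and column 11 the e-value. On ties, the first hit seen is kept, which may differ from the script's behavior.

```shell
# Keep the lowest e-value hit per query from tabular RPS-BLAST+ output.
# Assumption: column 1 = query ID, column 11 = e-value (default -outfmt 6
# column order). Comment lines of -outfmt 7 start with '#' and are skipped.
best_hits() {
    # $1 = path to the RPS-BLAST+ report
    awk -F '\t' '
        /^#/ || NF < 11 { next }                  # skip comments and short lines
        !($1 in best) || ($11 + 0) < best[$1] {
            if (!($1 in best)) order[++n] = $1    # remember first-seen query order
            best[$1] = $11 + 0                    # force numeric e-value
            line[$1] = $0                         # keep the whole hit line
        }
        END { for (i = 1; i <= n; i++) print line[order[i]] }
    ' "$1"
}
```

A possible call would be `best_hits rps-blast.out > best_hits.tsv`; the output keeps the queries in the order they first appear in the report.
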
Several files are needed from NCBI's FTP server to run RPS-BLAST+ and `cdd2cog.pl`:

1. **CDD** (ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/)

    More information about the files in the CDD FTP archive can be found in the respective 'README' file.

    1. 'cddid.tbl.gz'

        The file needs to be unpacked:

        `gunzip cddid.tbl.gz`

        Contains summary information about the CD models in a tab-delimited format. The columns are: PSSM-Id, CD accession (e.g. COG#), CD short name, CD description, and PSSM (position-specific scoring matrix) length.

    2. './little_endian/Cog_LE.tar.gz'

        Unpack and untar via:

        `tar xvfz Cog_LE.tar.gz`

        Preformatted RPS-BLAST+ database of the CDD COG distribution for Intel CPUs and Unix/Windows architectures.

2. **COG** (ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/)

    Read 'readme' for more information about the respective files in the COG FTP archive.

    1. 'fun.txt'

        One-letter functional classification used in the COG database.

    2. 'whog'

        Name, description, and corresponding functional classification of each COG.

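To make the 'cddid.tbl' column layout listed above more concrete, here is a minimal lookup sketch. The function name `pssm_lookup` is hypothetical; the only thing taken from the description is the column order (PSSM-Id first, CD accession second, CD short name third).

```shell
# Print the CD accession and short name for one PSSM-Id from 'cddid.tbl'.
# Assumed tab-delimited column order: PSSM-Id, CD accession,
# CD short name, CD description, PSSM length.
pssm_lookup() {
    # $1 = path to the unpacked cddid.tbl, $2 = PSSM-Id to look up
    awk -F '\t' -v id="$2" '
        $1 == id { print $2 "\t" $3; found = 1 }
        END { exit !found }                 # non-zero exit status if no row matched
    ' "$1"
}
```

Called as `pssm_lookup cddid.tbl <PSSM-Id>`, it prints one tab-delimited line for a known ID and returns a non-zero exit status otherwise, so it can be used in shell conditionals.
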
## Usage

### RPS-BLAST+

    rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6
    rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs'

### cdd2cog

    perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog -a

## Options

### Mandatory options

- -r, -rps\_report

    Path to RPS-BLAST+ report/output, outfmt 6 or 7

- -c, -cddid

    Path to CDD's 'cddid.tbl' file

- -f, -fun

    Path to COG's 'fun.txt' file

- -w, -whog

    Path to COG's 'whog' file

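Since all four inputs are mandatory, it can be convenient to verify them before starting a run. The helper below is a hypothetical convenience wrapper, not part of `cdd2cog.pl`; the file names in the usage comment are the ones from the Synopsis.

```shell
# Check that every mandatory input exists and is non-empty.
# Returns 0 only if all given files pass; problems are reported on STDERR.
check_inputs() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage sketch: run the script only when all four inputs are present.
# check_inputs rps-blast.out cddid.tbl fun.txt whog &&
#     perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog
```
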
### Optional options

- -h, -help

    Help (perldoc POD)

- -a, -all\_hits

    Don't filter RPS-BLAST+ output for the best hit, rather assign COGs to all hits

- -v, -version

    Print version number to *STDERR*

## Output

- *STDOUT*

    Overall assignment statistics

- ./results

    All tab-delimited output files are stored in this result folder

- rps-blast_cog.txt

    COG assignments concatenated to the RPS-BLAST+ results for filtering

- protein-id_cog.txt

    Slimmed down 'rps-blast_cog.txt' only including query id (first BLAST report column), COGs, and functional categories

- cog_stats.txt

    Assignment counts for each used COG

- func_stats.txt

    Assignment counts for single-letter functional categories

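One downstream check that the output files allow is listing the query proteins that did not receive any COG assignment. The sketch below relies only on the first column of 'protein-id_cog.txt' being the query ID, as stated above. It further assumes that the query ID equals the FASTA header up to the first whitespace; the function name `unassigned_queries` is hypothetical.

```shell
# List FASTA sequence IDs that do not occur in the first column of
# 'protein-id_cog.txt', i.e. queries without a COG assignment.
# Assumption: query ID = FASTA header text up to the first whitespace.
unassigned_queries() {
    # $1 = protein FASTA used as rpsblast query, $2 = protein-id_cog.txt
    awk -F '\t' '
        FILENAME == ARGV[1] { seen[$1] = 1; next }   # first file: assigned query IDs
        /^>/ {
            id = substr($0, 2)                       # drop the leading ">"
            sub(/[ \t].*/, "", id)                   # keep text up to first whitespace
            if (!(id in seen)) print id
        }
    ' "$2" "$1"
}
```

For example, `unassigned_queries protein.fasta results/protein-id_cog.txt` prints one ID per line.
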
## Run environment

The Perl script runs under UNIX flavors.

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Acknowledgements

I got the idea for using NCBI's CDD PSSMs for COG assignment from JGI's [IMG/ER annotation system](http://img.jgi.doe.gov/), which employs the same technique.

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

* v0.2 (2017-02-16)
    * Adapted to new NCBI FASTA header format for CDD RPS-BLAST+ output
* v0.1 (2013-08-01)