comparison COG/bac-genomics-scripts/cdd2cog/README.md @ 3:e42d30da7a74 draft
Uploaded
| author | dereeper |
|---|---|
| date | Thu, 30 May 2024 11:52:25 +0000 |
| parents | |
| children |
| 2:97e4e3e818b6 | 3:e42d30da7a74 |
|---|---|
| 1 cdd2cog | |
| 2 ======= | |
| 3 | |
| 4 `cdd2cog.pl` is a script to assign COG categories to query protein sequences. | |
| 5 | |
| 6 * [Synopsis](#synopsis) | |
| 7 * [Description](#description) | |
| 8 * [Usage](#usage) | |
| 9 * [RPS-BLAST+](#rps-blast) | |
| 10 * [cdd2cog](#cdd2cog) | |
| 11 * [Options](#options) | |
| 12 * [Mandatory options](#mandatory-options) | |
| 13 * [Optional options](#optional-options) | |
| 14 * [Output](#output) | |
| 15 * [Run environment](#run-environment) | |
| 16 * [Author - contact](#author---contact) | |
| 17 * [Acknowledgements](#acknowledgements) | |
| 18 * [Citation, installation, and license](#citation-installation-and-license) | |
| 19 * [Changelog](#changelog) | |
| 20 | |
| 21 ## Synopsis | |
| 22 | |
| 23 perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog | |
| 24 | |
| 25 ## Description | |
| 26 For troubleshooting and a working example please see issue [#1](https://github.com/aleimba/bac-genomics-scripts/issues/1). | |
| 27 | |
| 28 The script assigns COG ([cluster of orthologous | |
| 29 groups](http://www.ncbi.nlm.nih.gov/COG/)) categories to proteins. | |
| 30 For this purpose, the query proteins need to be blasted with | |
| 31 RPS-BLAST+ ([Reverse Position-Specific BLAST](http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download)) | |
| 32 against NCBI's Conserved Domain Database | |
| 33 ([CDD](http://www.ncbi.nlm.nih.gov/cdd)). Use | |
| 34 [`cds_extractor.pl`](/cds_extractor) beforehand to extract multi-fasta protein | |
| 35 files from GenBank or EMBL files. | |
| 36 | |
| 37 Both tab-delimited RPS-BLAST+ outformats, **-outfmt 6** and **-outfmt | |
| 38 7**, can be processed by `cdd2cog.pl`. By default, RPS-BLAST+ hits | |
| 39 for each query protein are filtered for the best hit (lowest | |
| 40 e-value). Use option **-a|all\_hits** to assign COGs to all BLAST hits | |
| 41 instead, e.g. to filter them downstream in a spreadsheet application. | |
| 42 Results are written to tab-delimited files in the './results' | |
| 43 folder; overall assignment statistics are printed to *STDOUT*. | |
| 44 | |
| 45 Several files are needed from NCBI's FTP server to run RPS-BLAST+ and `cdd2cog.pl`: | |
| 46 | |
| 47 1. **CDD** (ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/) | |
| 48 | |
| 49 More information about the files in the CDD FTP archive can be found in the respective 'README' file. | |
| 50 | |
| 51 1. 'cddid.tbl.gz' | |
| 52 | |
| 53 The file needs to be unpacked: | |
| 54 | |
| 55 `gunzip cddid.tbl.gz` | |
| 56 | |
| 57 Contains summary information about the CD models in a tab-delimited format. The columns are: PSSM-Id, CD accession (e.g. COG#), CD short name, CD description, and PSSM (position-specific scoring matrix) length. | |
| 58 | |
| 59 2. './little_endian/Cog_LE.tar.gz' | |
| 60 | |
| 61 Unpack and untar via: | |
| 62 | |
| 63 `tar xvfz Cog_LE.tar.gz` | |
| 64 | |
| 65 Preformatted RPS-BLAST+ database of the CDD COG distribution for Intel CPUs and Unix/Windows architectures. | |
| 66 | |
| 67 2. **COG** (ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/) | |
| 68 | |
| 69 Read 'readme' for more information about the respective files in the COG FTP archive. | |
| 70 | |
| 71 1. 'fun.txt' | |
| 72 | |
| 73 One-letter functional classification used in the COG database. | |
| 74 | |
| 75 2. 'whog' | |
| 76 | |
| 77 Name, description, and corresponding functional classification of each COG. | |
| 78 | |
| 79 ## Usage | |
| 80 | |
| 81 ### RPS-BLAST+ | |
| 82 | |
| 83 rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6 | |
| 84 rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs' | |
| 85 | |
| 86 ### cdd2cog | |
| 87 | |
| 88 perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog -a | |
| 89 | |
| 90 ## Options | |
| 91 | |
| 92 ### Mandatory options | |
| 93 | |
| 94 - -r, -rps\_report | |
| 95 | |
| 96 Path to RPS-BLAST+ report/output, outfmt 6 or 7 | |
| 97 | |
| 98 - -c, -cddid | |
| 99 | |
| 100 Path to CDD's 'cddid.tbl' file | |
| 101 | |
| 102 - -f, -fun | |
| 103 | |
| 104 Path to COG's 'fun.txt' file | |
| 105 | |
| 106 - -w, -whog | |
| 107 | |
| 108 Path to COG's 'whog' file | |
| 109 | |
| 110 ### Optional options | |
| 111 | |
| 112 - -h, -help | |
| 113 | |
| 114 Help (perldoc POD) | |
| 115 | |
| 116 - -a, -all\_hits | |
| 117 | |
| 118 Don't filter RPS-BLAST+ output for the best hit, rather assign COGs to all hits | |
| 119 | |
| 120 - -v, -version | |
| 121 | |
| 122 Print version number to *STDERR* | |
| 123 | |
| 124 ## Output | |
| 125 | |
| 126 - *STDOUT* | |
| 127 | |
| 128 Overall assignment statistics | |
| 129 | |
| 130 - ./results | |
| 131 | |
| 132 All tab-delimited output files are stored in this result folder | |
| 133 | |
| 134 - rps-blast_cog.txt | |
| 135 | |
| 136 COG assignments concatenated to the RPS-BLAST+ results for filtering | |
| 137 | |
| 138 - protein-id_cog.txt | |
| 139 | |
| 140 Slimmed-down version of 'rps-blast_cog.txt' that includes only the query ID (first BLAST report column), COGs, and functional categories | |
| 141 | |
| 142 - cog_stats.txt | |
| 143 | |
| 144 Assignment counts for each used COG | |
| 145 | |
| 146 - func_stats.txt | |
| 147 | |
| 148 Assignment counts for single-letter functional categories | |
| 149 | |
| 150 ## Run environment | |
| 151 | |
| 152 The Perl script runs under UNIX flavors. | |
| 153 | |
| 154 ## Author - contact | |
| 155 | |
| 156 Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) | |
| 157 | |
| 158 ## Acknowledgements | |
| 159 | |
| 160 I got the idea for using NCBI's CDD PSSMs for COG assignment from JGI's [IMG/ER annotation system](http://img.jgi.doe.gov/), which employs the same technique. | |
| 161 | |
| 162 ## Citation, installation, and license | |
| 163 | |
| 164 For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). | |
| 165 | |
| 166 ## Changelog | |
| 167 | |
| 168 * v0.2 (2017-02-16) | |
| 169 * Adapted to new NCBI FASTA header format for CDD RPS-BLAST+ output | |
| 170 * v0.1 (2013-08-01) |
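
The default best-hit filtering mentioned in the Description (keep the hit with the lowest e-value for each query protein) can be illustrated with a minimal Python sketch. This is not the code of `cdd2cog.pl`; it is an independent illustration that assumes the e-value sits in column 11 of the tab-delimited report, as in the standard **-outfmt 6** layout and the **-outfmt 7** column list shown under Usage. The function name `best_hits`, the tie-breaking rule (the first hit seen wins), and the sample rows are hypothetical.

```python
def best_hits(lines):
    """Return {query_id: fields} with the lowest e-value hit per query.

    Assumes tab-delimited RPS-BLAST+ output (outfmt 6 or 7) with the
    query id in column 1 and the e-value in column 11.
    """
    best = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        # outfmt 7 adds comment lines starting with '#'; skip them and blanks
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            # not a standard 12+ column hit line; ignored in this sketch
            continue
        query = fields[0]
        evalue = float(fields[10])
        # strict '<' keeps the first hit seen when e-values tie (assumption)
        if query not in best or evalue < float(best[query][10]):
            best[query] = fields
    return best


# Made-up sample rows for illustration only (not real RPS-BLAST+ results)
sample = [
    "# RPSBLAST comment line as written by -outfmt 7",
    "prot1\tsubjectA\t35.0\t100\t60\t2\t1\t100\t5\t104\t1e-20\t85.0",
    "prot1\tsubjectB\t30.0\t90\t60\t2\t1\t90\t5\t94\t1e-05\t40.0",
    "prot2\tsubjectC\t40.0\t80\t45\t1\t3\t82\t1\t80\t3e-10\t55.0",
]

for query, fields in best_hits(sample).items():
    print(query, fields[1], fields[10], sep="\t")
```

To try it on a real report, the list `sample` could be replaced by an open file handle, e.g. `best_hits(open("rps-blast.out"))`. Running `cdd2cog.pl` with **-a** and applying a filter like this afterwards is one alternative to filtering in a spreadsheet application.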
