cdd2cog
=======

`cdd2cog.pl` is a script to assign COG categories to query protein sequences.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
  * [RPS-BLAST+](#rps-blast)
  * [cdd2cog](#cdd2cog)
* [Options](#options)
  * [Mandatory options](#mandatory-options)
  * [Optional options](#optional-options)
* [Output](#output)
* [Run environment](#run-environment)
* [Author - contact](#author---contact)
* [Acknowledgements](#acknowledgements)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog

## Description

For troubleshooting and a working example, please see issue [#1](https://github.com/aleimba/bac-genomics-scripts/issues/1).

The script assigns COG ([cluster of orthologous
groups](http://www.ncbi.nlm.nih.gov/COG/)) categories to proteins.
For this purpose, the query proteins need to be searched with
RPS-BLAST+ ([Reverse Position-Specific BLAST](http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download))
against NCBI's Conserved Domain Database
([CDD](http://www.ncbi.nlm.nih.gov/cdd)). Use
[`cds_extractor.pl`](/cds_extractor) beforehand to extract multi-FASTA protein
files from GenBank or EMBL files.

Both tab-delimited RPS-BLAST+ output formats, **-outfmt 6** and **-outfmt
7**, can be processed by `cdd2cog.pl`. By default, RPS-BLAST+ hits
for each query protein are filtered for the best hit (lowest
e-value). Use option **-a|all\_hits** to assign COGs to all BLAST hits
and, e.g., filter them downstream in a spreadsheet application.
Results are written to tab-delimited files in the './results'
folder; overall assignment statistics are printed to *STDOUT*.

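To illustrate what "best hit (lowest e-value)" means for a tabular report, here is a minimal sketch of such a filter with `awk`. It is not the code used by `cdd2cog.pl`; the function name `best_hits` is hypothetical, and the sketch assumes the default 12-column layout of **-outfmt 6**, where the query id is column 1 and the e-value is column 11.

```shell
#!/bin/sh
# Hypothetical helper: keep, for every query (column 1), the hit with the
# lowest e-value (column 11). Assumes the default 12-column tabular layout;
# the filter inside cdd2cog.pl may treat ties or extra columns differently.
best_hits() {
    awk -F '\t' '
        /^#/ || NF < 11 { next }        # skip -outfmt 7 comments and short lines
        {
            e = $11 + 0                 # force a numeric value (handles 1e-50)
            if (!($1 in best) || e < best[$1]) {
                best[$1] = e            # lowest e-value seen so far
                line[$1] = $0           # full report line of that hit
            }
        }
        END { for (q in line) print line[q] }
    ' "$1" | sort
}
```

It could be called as `best_hits rps-blast.out > best_hits.tsv`, for example to compare a report produced with **-a** against the default behaviour.
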
Several files from NCBI's FTP server are needed to run RPS-BLAST+ and `cdd2cog.pl`:

1. **CDD** (ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/)

    More information about the files in the CDD FTP archive can be found in the respective 'README' file.

    1. 'cddid.tbl.gz'

        The file needs to be unpacked:

        `gunzip cddid.tbl.gz`

        Contains summary information about the CD models in a tab-delimited format. The columns are: PSSM-Id, CD accession (e.g. COG#), CD short name, CD description, and PSSM (position-specific scoring matrix) length.

    2. './little_endian/Cog_LE.tar.gz'

        Unpack and untar via:

        `tar xvfz Cog_LE.tar.gz`

        Preformatted RPS-BLAST+ database of the CDD COG distribution for Intel CPUs and Unix/Windows architectures.

2. **COG** (ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/)

    Read 'readme' for more information about the respective files in the COG FTP archive.

    1. 'fun.txt'

        One-letter functional classification used in the COG database.

    2. 'whog'

        Name, description, and corresponding functional classification of each COG.

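The downloads listed above can be collected in a small script. The following is a sketch only: the URLs are simply composed from the two FTP directories and the file names given in the list, the function names `list_urls` and `fetch_all` are hypothetical, and `wget` is an assumption (any FTP-capable client would do). Run as is, the script only prints the URLs; nothing is downloaded until `fetch_all` is called.

```shell
#!/bin/sh
# Base directories as given in the list above.
CDD_URL="ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd"
COG_URL="ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG"

# Print the four file URLs, one per line.
list_urls() {
    printf '%s\n' \
        "$CDD_URL/cddid.tbl.gz" \
        "$CDD_URL/little_endian/Cog_LE.tar.gz" \
        "$COG_URL/fun.txt" \
        "$COG_URL/whog"
}

# Download all files into the current directory and unpack the two archives.
# Requires network access plus wget, gunzip, and tar.
fetch_all() {
    for url in $(list_urls); do
        wget "$url" || return 1
    done
    gunzip cddid.tbl.gz
    tar xvfz Cog_LE.tar.gz
}

# Dry run: only show what would be downloaded.
list_urls
```

Keeping the URL list separate from the download step makes it easy to check the paths against the FTP server's 'README' files before fetching anything.
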
## Usage

### RPS-BLAST+

    rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6
    rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs'

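To check a report before running `cdd2cog.pl`, it can help to see which CD accession each hit refers to. The sketch below joins the subject id of every hit (column 2 of the report) with the PSSM-Id column of 'cddid.tbl'. It rests on an assumption not stated in this README: that the subject id ends in the numeric PSSM-Id (e.g. `CDD:223080` or `gnl|CDD|223080`). The function name `hit_accessions` is hypothetical, and this is not how `cdd2cog.pl` itself is implemented.

```shell
#!/bin/sh
# Hypothetical helper: print query id, PSSM-Id, and CD accession for each hit.
#   $1 = cddid.tbl (columns: PSSM-Id, CD accession, short name, description, length)
#   $2 = RPS-BLAST+ report in -outfmt 6 or 7
hit_accessions() {
    awk -F '\t' '
        NR == FNR { acc[$1] = $2; next }   # first file: PSSM-Id -> CD accession
        /^#/ || NF < 2 { next }            # skip -outfmt 7 comments and short lines
        {
            id = $2
            sub(/^.*[^0-9]/, "", id)       # assumption: id ends in the PSSM-Id
            print $1 "\t" id "\t" ((id in acc) ? acc[id] : "NA")
        }
    ' "$1" "$2"
}
```

For example, `hit_accessions cddid.tbl rps-blast.out | head` lists the first few hits; an `NA` in the last column would point to a mismatch between the report and the 'cddid.tbl' version in use.
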
### cdd2cog

    perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog -a

## Options

### Mandatory options

- -r, -rps\_report

    Path to RPS-BLAST+ report/output, outfmt 6 or 7

- -c, -cddid

    Path to CDD's 'cddid.tbl' file

- -f, -fun

    Path to COG's 'fun.txt' file

- -w, -whog

    Path to COG's 'whog' file

### Optional options

- -h, -help

    Help (perldoc POD)

- -a, -all\_hits

    Don't filter RPS-BLAST+ output for the best hit, rather assign COGs to all hits

- -v, -version

    Print version number to *STDERR*

## Output

- *STDOUT*

    Overall assignment statistics

- ./results

    All tab-delimited output files are stored in this result folder

- rps-blast_cog.txt

    COG assignments concatenated to the RPS-BLAST+ results for filtering

- protein-id_cog.txt

    Slimmed down 'rps-blast_cog.txt' only including query id (first BLAST report column), COGs, and functional categories

- cog_stats.txt

    Assignment counts for each used COG

- func_stats.txt

    Assignment counts for single-letter functional categories

## Run environment

The Perl script runs under UNIX flavors.

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Acknowledgements

I got the idea for using NCBI's CDD PSSMs for COG assignment from JGI's [IMG/ER annotation system](http://img.jgi.doe.gov/), which employs the same technique.

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

* v0.2 (2017-02-16)
    * Adapted to new NCBI FASTA header format for CDD RPS-BLAST+ output
* v0.1 (2013-08-01)