diff COG/bac-genomics-scripts/cdd2cog/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
parents
children
line wrap: on
line diff
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/COG/bac-genomics-scripts/cdd2cog/README.md	Thu May 30 11:52:25 2024 +0000
@@ -0,0 +1,170 @@
+cdd2cog
+=======
+
+`cdd2cog.pl` is a script to assign COG categories to query protein sequences.
+
+* [Synopsis](#synopsis)
+* [Description](#description)
+* [Usage](#usage)
+  * [RPS-BLAST+](#rps-blast)
+  * [cdd2cog](#cdd2cog)
+* [Options](#options)
+  * [Mandatory options](#mandatory-options)
+  * [Optional options](#optional-options)
+* [Output](#output)
+* [Run environment](#run-environment)
+* [Author - contact](#author---contact)
+* [Acknowledgements](#acknowledgements)
+* [Citation, installation, and license](#citation-installation-and-license)
+* [Changelog](#changelog)
+
+## Synopsis
+
+    perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog
+
+## Description
+For troubleshooting and a working example please see issue [#1](https://github.com/aleimba/bac-genomics-scripts/issues/1).
+
+The script assigns COG ([cluster of orthologous
+groups](http://www.ncbi.nlm.nih.gov/COG/)) categories to proteins.
+For this purpose, the query proteins need to be blasted with
+RPS-BLAST+ ([Reverse Position-Specific BLAST](http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download))
+against NCBI's Conserved Domain Database
+([CDD](http://www.ncbi.nlm.nih.gov/cdd)). Use
+[`cds_extractor.pl`](/cds_extractor) beforehand to extract multi-fasta protein
+files from GENBANK or EMBL files.
+
+Both tab-delimited RPS-BLAST+ outformats, **-outfmt 6** and **-outfmt
+7**, can be processed by `cdd2cog.pl`. By default, RPS-BLAST+ hits
+for each query protein are filtered for the best hit (lowest
+e-value). Use option **-a|all\_hits** to assign COGs to all BLAST hits
+and e.g. do a downstream filtering in a spreadsheet application.
+Results are written to tab-delimited files in the './results'
+folder, overall assignment statistics are printed to *STDOUT*.
+
+Several files are needed from NCBI's FTP server to run the RPS-BLAST+ and `cdd2cog.pl`:
+
+1. **CDD** (ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/)
+
+    More information about the files in the CDD FTP archive can be found in the respective 'README' file.
+
+  1. 'cddid.tbl.gz'
+
+    The file needs to be unpacked:
+
+    `gunzip cddid.tbl.gz`
+
+    Contains summary information about the CD models in a tab-delimited format. The columns are: PSSM-Id, CD accession (e.g. COG#), CD short name, CD description, and PSSM (position-specific scoring matrices) length.
+
+  2. './little_endian/Cog_LE.tar.gz'
+
+    Unpack and untar via:
+
+    `tar xvfz Cog_LE.tar.gz`
+
+    Preformatted RPS-BLAST+ database of the CDD COG distribution for Intel CPUs and Unix/Windows architectures.
+
+2. **COG** (ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/)
+
+    Read 'readme' for more information about the respective files in the COG FTP archive.
+
+  1. 'fun.txt'
+
+    One-letter functional classification used in the COG database.
+
+  2. 'whog'
+
+    Name, description, and corresponding functional classification of each COG.
+
+## Usage
+
+### RPS-BLAST+
+
+    rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6
+    rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs'
+
+### cdd2cog
+
+    perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog -a
+
+## Options
+
+### Mandatory options
+
+- -r, -rps\_report
+
+    Path to RPS-BLAST+ report/output, outfmt 6 or 7
+
+- -c, -cddid
+
+    Path to CDD's 'cddid.tbl' file
+
+- -f, -fun
+
+    Path to COG's 'fun.txt' file
+
+- -w, -whog
+
+    Path to COG's 'whog' file
+
+### Optional options
+
+- -h, -help
+
+    Help (perldoc POD)
+
+- -a, -all\_hits
+
+    Don't filter RPS-BLAST+ output for the best hit, rather assign COGs to all hits
+
+- -v, -version
+
+    Print version number to *STDERR*
+
+## Output
+
+- *STDOUT*
+
+    Overall assignment statistics
+
+- ./results
+
+    All tab-delimited output files are stored in this result folder
+
+- rps-blast_cog.txt
+
+    COG assignments concatenated to the RPS-BLAST+ results for filtering
+
+- protein-id_cog.txt
+
+    Slimmed down 'rps-blast_cog.txt' only including query id (first BLAST report column), COGs, and functional categories
+
+- cog_stats.txt
+
+    Assignment counts for each used COG
+
+- func_stats.txt
+
+    Assignment counts for single-letter functional categories
+
+## Run environment
+
+The Perl script runs under UNIX flavors.
+
+## Author - contact
+
+Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
+
+## Acknowledgements
+
+I got the idea for using NCBI's CDD PSSMs for COG assignment from JGI's [IMG/ER annotation system](http://img.jgi.doe.gov/), which employes the same technique.
+
+## Citation, installation, and license
+
+For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
+
+## Changelog
+
+* v0.2 (2017-02-16)
+    * Adapted to new NCBI FASTA header format for CDD RPS-BLAST+ output
+* v0.1 (2013-08-01)