Mercurial > repos > dereeper > pangenome_explorer
view COG/bac-genomics-scripts/cdd2cog/README.md @ 3:e42d30da7a74 draft
Uploaded
author | dereeper |
---|---|
date | Thu, 30 May 2024 11:52:25 +0000 |
parents | |
children |
line wrap: on
line source
cdd2cog ======= `cdd2cog.pl` is a script to assign COG categories to query protein sequences. * [Synopsis](#synopsis) * [Description](#description) * [Usage](#usage) * [RPS-BLAST+](#rps-blast) * [cdd2cog](#cdd2cog) * [Options](#options) * [Mandatory options](#mandatory-options) * [Optional options](#optional-options) * [Output](#output) * [Run environment](#run-environment) * [Author - contact](#author---contact) * [Acknowledgements](#acknowledgements) * [Citation, installation, and license](#citation-installation-and-license) * [Changelog](#changelog) ## Synopsis perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog ## Description For troubleshooting and a working example please see issue [#1](https://github.com/aleimba/bac-genomics-scripts/issues/1). The script assigns COG ([cluster of orthologous groups](http://www.ncbi.nlm.nih.gov/COG/)) categories to proteins. For this purpose, the query proteins need to be blasted with RPS-BLAST+ ([Reverse Position-Specific BLAST](http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download)) against NCBI's Conserved Domain Database ([CDD](http://www.ncbi.nlm.nih.gov/cdd)). Use [`cds_extractor.pl`](/cds_extractor) beforehand to extract multi-fasta protein files from GENBANK or EMBL files. Both tab-delimited RPS-BLAST+ outformats, **-outfmt 6** and **-outfmt 7**, can be processed by `cdd2cog.pl`. By default, RPS-BLAST+ hits for each query protein are filtered for the best hit (lowest e-value). Use option **-a|all\_hits** to assign COGs to all BLAST hits and e.g. do a downstream filtering in a spreadsheet application. Results are written to tab-delimited files in the './results' folder, overall assignment statistics are printed to *STDOUT*. Several files are needed from NCBI's FTP server to run the RPS-BLAST+ and `cdd2cog.pl`: 1. **CDD** (ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/) More information about the files in the CDD FTP archive can be found in the respective 'README' file. 1. 'cddid.tbl.gz' The file needs to be unpacked: `gunzip cddid.tbl.gz` Contains summary information about the CD models in a tab-delimited format. The columns are: PSSM-Id, CD accession (e.g. COG#), CD short name, CD description, and PSSM (position-specific scoring matrices) length. 2. './little_endian/Cog_LE.tar.gz' Unpack and untar via: `tar xvfz Cog_LE.tar.gz` Preformatted RPS-BLAST+ database of the CDD COG distribution for Intel CPUs and Unix/Windows architectures. 2. **COG** (ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/) Read 'readme' for more information about the respective files in the COG FTP archive. 1. 'fun.txt' One-letter functional classification used in the COG database. 2. 'whog' Name, description, and corresponding functional classification of each COG. ## Usage ### RPS-BLAST+ rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6 rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs' ### cdd2cog perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog -a ## Options ### Mandatory options - -r, -rps\_report Path to RPS-BLAST+ report/output, outfmt 6 or 7 - -c, -cddid Path to CDD's 'cddid.tbl' file - -f, -fun Path to COG's 'fun.txt' file - -w, -whog Path to COG's 'whog' file ### Optional options - -h, -help Help (perldoc POD) - -a, -all\_hits Don't filter RPS-BLAST+ output for the best hit, rather assign COGs to all hits - -v, -version Print version number to *STDERR* ## Output - *STDOUT* Overall assignment statistics - ./results All tab-delimited output files are stored in this result folder - rps-blast_cog.txt COG assignments concatenated to the RPS-BLAST+ results for filtering - protein-id_cog.txt Slimmed down 'rps-blast_cog.txt' only including query id (first BLAST report column), COGs, and functional categories - cog_stats.txt Assignment counts for each used COG - func_stats.txt Assignment counts for single-letter functional categories ## Run environment The Perl script runs under UNIX flavors. ## Author - contact Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) ## Acknowledgements I got the idea for using NCBI's CDD PSSMs for COG assignment from JGI's [IMG/ER annotation system](http://img.jgi.doe.gov/), which employes the same technique. ## Citation, installation, and license For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). ## Changelog * v0.2 (2017-02-16) * Adapted to new NCBI FASTA header format for CDD RPS-BLAST+ output * v0.1 (2013-08-01)