Mercurial > repos > dereeper > pangenome_explorer
diff COG/bac-genomics-scripts/po2anno/README.md @ 3:e42d30da7a74 draft
Uploaded
author | dereeper |
---|---|
date | Thu, 30 May 2024 11:52:25 +0000 |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/COG/bac-genomics-scripts/po2anno/README.md Thu May 30 11:52:25 2024 +0000
@@ -0,0 +1,147 @@
+po2anno
+=======
+
+`po2anno.pl` is a script to create an annotation comparison matrix from [Proteinortho5](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) output.
+
+* [Synopsis](#synopsis)
+* [Description](#description)
+* [Usage](#usage)
+    * [cds_extractor](#cds_extractor)
+    * [Proteinortho5](#proteinortho5)
+    * [po2anno](#po2anno)
+* [Options](#options)
+    * [Mandatory options](#mandatory-options)
+    * [Optional options](#optional-options)
+* [Output](#output)
+* [Run environment](#run-environment)
+* [Author - contact](#author---contact)
+* [Citation, installation, and license](#citation-installation-and-license)
+* [Changelog](#changelog)
+
+## Synopsis
+
+    perl po2anno.pl -i matrix.proteinortho -d genome_fasta_dir/ -l -a > annotation_comparison.tsv
+
+## Description
+
+Supplement an ortholog/paralog output matrix from a
+[**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/)
+calculation with annotation information. The resulting tab-separated
+annotation comparison matrix (ACM) is mainly intended for the
+transfer of high quality annotations from reference genomes to
+homologs (orthologs and co-orthologs/paralogs) in a query genome
+(e.g. in conjunction with [`tbl2tab.pl`](/tbl2tab)). But of course
+it can also be used to have a quick glance at the annotation of
+genes present only in a couple of input genomes in comparison to the
+others.
+
+Annotation is retrieved from multi-FASTA files created with
+[`cds_extractor.pl`](/cds_extractor). See
+[`cds_extractor.pl`](/cds_extractor) for a description of the
+format. These files are used as input for the PO analysis and option
+**-d** for `po2anno.pl`.
+
+**Proteinortho5** (PO) has to be run with option **-singles** to also
+include genes without orthologs, so-called singletons/ORFans, for each
+genome in the PO matrix (see the
+[PO manual](http://www.bioinf.uni-leipzig.de/Software/proteinortho/manual.html)).
+Additionally, option **-selfblast** is recommended to enhance paralog
+detection by PO.
+
+Each orthologous group (OG) is listed in a row of the resulting ACM;
+the first column holds the OG numbers from the PO input matrix (i.e.
+line number minus one). The following columns specify the
+orthologous CDS for each input genome. For each CDS the ID,
+optionally the length in bp (option **-l**), gene, EC number(s), and
+product are shown depending on their presence in the CDS's
+annotation. The ID is in most cases the locus tag (see
+[`cds_extractor.pl`](/cds_extractor)). If several EC numbers exist
+for a single CDS they're separated by ';'. If an OG includes
+paralogs, i.e. co-orthologs from a single genome, these will be
+printed in the following row(s) **without** a new OG number in the
+first column. The order of paralogous CDSs within an OG is
+arbitrary.
+
+The OGs are sorted numerically by the query ID (see option **-q**).
+If option **-a** is set, the non-query OGs are appended to the output
+after the query OGs, sorted numerically by OG number.
+
+## Usage
+
+### [`cds_extractor`](/cds_extractor)
+
+    for i in *.[gbk|embl]; do perl cds_extractor.pl -i $i [-p|-n]; done
+
+### [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/)
+
+    proteinortho5.pl -graph [-synteny] -cpus=# -selfblast -singles -identity=50 -cov=50 -blastParameters='-use_sw_tback [-seg no|-dust no]' *.[faa|ffn]
+
+### po2anno
+
+    perl po2anno.pl -i matrix.[proteinortho|poff] -d genome_fasta_dir/ -q query.[faa|ffn] -l -a > annotation_comparison.tsv
+
+## Options
+
+### Mandatory options
+
+- **-i**=_str_, **-input**=_str_
+
+    Proteinortho (PO) result matrix (\*.proteinortho or \*.poff), or piped *STDIN* (-)
+
+- **-d**=_str_, **-dir\_genome**=_str_
+
+    Path to the directory containing the genome multi-FASTA PO input files (\*.faa or \*.ffn), created with [`cds_extractor.pl`](/cds_extractor)
+
+### Optional options
+
+- **-h**, **-help**
+
+    Help (perldoc POD)
+
+- **-q**=_str_, **-query**=_str_
+
+    Query genome (has to be identical to the string in the PO matrix) [default = first one in alphabetical order]
+
+- **-l**, **-length**
+
+    Include length of each CDS in bp
+
+- **-a**, **-all**
+
+    Append non-query orthologous groups (OGs) to the output
+
+- **-v**, **-version**
+
+    Print version number to *STDERR*
+
+## Output
+
+- *STDOUT*
+
+    The resulting tab-delimited ACM is printed to *STDOUT*. Redirect or pipe into another tool as needed (e.g. `cut`, `grep`, `head`, or `tail`).
+
+## Run environment
+
+The Perl script runs under Windows and UNIX flavors.
+
+## Author - contact
+
+Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
+
+## Citation, installation, and license
+
+For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
+
+## Changelog
+
+* v0.2.2 (23.10.2015)
+    * minor syntax changes to `po2anno.pl` and README
+    * changed option **-g|-genome_dir** to **-d|-dir_genome** for consistency with [`po2group_stats.pl`](/po2group_stats)
+* v0.2.1 (07.09.2015)
+    * get rid of underscores in product annotation strings (from [`cds_extractor.pl`](/cds_extractor))
+    * debugged hard-coded relative path for `$genome_file_path`
+* v0.2 (15.01.2015)
+    * give number of query-specific OGs and total query singletons/ORFans in final stat output
+    * changed final stat output to a more readable format
+    * fixed bug: %Query_ID_Seen also included non-query IDs, which luckily had no consequences
+* v0.1 (18.12.2014)
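
The Description in the README says that paralog rows follow their orthologous group (OG) without repeating the OG number in the first column. When the ACM is post-processed, those continuation rows need to be attached to the preceding OG. The following is only a minimal Python sketch of that idea and is not part of `po2anno.pl`. The function name `group_acm_rows` and the demo cell values are hypothetical, and it assumes that continuation rows have an empty first field and that the file has no header line (the README does not say either way).

```python
import csv
import io


def group_acm_rows(handle):
    """Group tab-separated ACM rows by OG number.

    Assumption (not confirmed by the README): a paralog continuation row
    has an empty first field, so it is attached to the most recent OG.
    """
    groups = {}
    current_og = None
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue  # skip completely empty lines
        og = row[0].strip()
        if og:
            current_og = og  # a new OG starts on this row
        if current_og is None:
            continue  # continuation row before any OG number: ignore
        # Store the per-genome cells; the first column is the OG number.
        groups.setdefault(current_og, []).append(row[1:])
    return groups


# Made-up example: OG 1 has a paralog row, OG 2 does not.
demo = (
    "1\tcdsA1\tcdsB1\n"
    "\tcdsA2\t\n"
    "2\tcdsA3\tcdsB2\n"
)

for og, rows in group_acm_rows(io.StringIO(demo)).items():
    print(og, len(rows), rows)
```

With a real file, the same function could be called on `open("annotation_comparison.tsv", newline="")`. If the actual output does start with a header line, that line would show up as its own group and should be skipped first.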