Mercurial > repos > dereeper > pangenome_explorer
diff COG/bac-genomics-scripts/po2group_stats/README.md @ 3:e42d30da7a74 draft
Uploaded
author | dereeper |
---|---|
date | Thu, 30 May 2024 11:52:25 +0000 |
parents | |
children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/COG/bac-genomics-scripts/po2group_stats/README.md Thu May 30 11:52:25 2024 +0000 @@ -0,0 +1,316 @@ +po2group_stats +============== + +`po2group_stats.pl` is a script to categorize orthologs from [Proteinortho5](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) output according to genome groups. In the [prot_finder](/prot_finder) workflow is a script, `binary_group_stats.pl`, which does the same thing for column groups in a delimited TEXT binary matrix. + +* [Synopsis](#synopsis) +* [Description](#description) +* [Usage](#usage) + * [cds_extractor](#cds_extractor) + * [Proteinortho5](#proteinortho5) + * [po2group_stats](#po2group_stats) +* [Options](#options) + * [Mandatory options](#mandatory-options) + * [Optional options](#optional-options) +* [Output](#output) +* [Dependencies](#dependencies) +* [Run environment](#run-environment) +* [Author - contact](#author---contact) +* [Citation, installation, and license](#citation-installation-and-license) +* [Changelog](#changelog) + +## Synopsis + + perl po2group_stats.pl -i matrix.proteinortho -d genome_fasta_dir/ -g group_file.tsv -p > overall_stats.tsv + +## Description + +Categorize the genomes in an ortholog/paralog output matrix (option **-i**) from a +[**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) +calculation according to group affiliations. The group +affiliations of the genomes are intended to get overall +presence/absence statistics for groups of genomes and not simply +single genomes (e.g. comparing 'marine', 'earth', 'commensal', +'pathogenic' etc. genome groups). Percentage inclusion (option +**-cut\_i**) and exclusion (option **-cut\_e**) cutoffs can be set to +define how strict the presence/absence of genome groups within an +orthologous group (OG) are defined. Of course groups can also hold +only single genomes to get single genome statistics. Group +affiliations are defined in a mandatory **tab-delimited** group input +file (option **-g**) with **minimal two** and **maximal four** groups. + +Only alphanumeric (a-z, A-Z, 0-9), underscore (\_), dash (-), and +period (.) characters are allowed for the **group names** in the +group file to avoid downstream problems with the operating/file +system. As a consequence, also no whitespaces are allowed in these! +Additionally, **group names**, **genome filenames** (should be +enforced by the file system), and **FASTA IDs** considering **all** +genome files (mostly locus tags; should be enforced by Proteinortho5) +need to be **unique**. + +**Proteinortho5** (PO) has to be run with option **-singles** to +include also genes without orthologs, so-called singletons/ORFans, +for each genome in the PO matrix (see the +[PO manual](http://www.bioinf.uni-leipzig.de/Software/proteinortho/manual.html)). +Additionally, option **-selfblast** is recommended to enhance +paralog detection by PO. + +To explain the logic behind the categorization, the following +annotation for example groups will be used. A '1' exemplifies a +group genome count in a respective OG >= the rounded inclusion +cutoff, a '0' a group genome count <= the rounded exclusion cutoff. +The presence and absence of OGs for the group affiliations are +structured in different categories depending on the number of +groups. For **two groups** (e.g. A and B) there are five categories: +'A specific' (A:B = 1:0), 'B specific' (0:1), 'cutoff core' (1:1), +'underrepresented' (0:0), and 'unspecific'. Unspecific OGs have a +genome count for at least **one** group outside the cutoffs +(exclusion cutoff < genome count < inclusion cutoff) and +thus cannot be categorized. These 'unspecific' OGs will only be +printed to a final annotation result file with option **-u**. Overall +stats for all categories are printed to *STDOUT* in a final +tab-delimited output matrix. + +**Three groups** (A, B, and C) have the following nine categories: 'A +specific' (A:B:C = 1:0:0), 'B specific' (0:1:0), 'C specific' +(0:0:1), 'A absent' (0:1:1), 'B absent' (1:0:1), 'C absent' (1:1:0), +'cutoff core' (1:1:1), 'underrepresented' (0:0:0), and 'unspecific'. + +**Four groups** (A, B, C, and D) are classified in 17 categories: 'A +specific' (A:B:C:D = 1:0:0:0), 'B specific' (0:1:0:0), 'C specific' +(0:0:1:0), 'D specific' (0:0:0:1), 'A-B specific' (1:1:0:0), 'A-C +specific' (1:0:1:0), 'A-D specific' (1:0:0:1), 'B-C specific' +(0:1:1:0), 'B-D specific' (0:1:0:1), 'C-D specific' (0:0:1:1), 'A +absent' (0:1:1:1), 'B absent' (1:0:1:1), 'C absent' (1:1:0:1), 'D +absent' (1:1:1:0), 'cutoff core' (1:1:1:1), 'underrepresented' +(0:0:0:0), and 'unspecific'. + +The resulting group presence/absence (according to the cutoffs) can +also be printed to a binary matrix (option **-b**) in the result +directory (option **-r**), excluding the 'unspecific' category. Since +the categories are the logics underlying venn diagrams, you can also +plot the results in a venn diagram using the binary matrix (option +**-p**). The 'underrepresented' category is exempt from the venn +diagram, because it is outside of venn diagram logics. + +Here are venn diagrams illustrating the logic categories (see folder ['pics'](./pics)): + +<p align="center"> + <img alt="venn diagram logics" src="https://github.com/aleimba/bac-genomics-scripts/raw/master/po2group_stats/pics/venn_diagram_logics.png"> +</p> + +There are two optional categories (which are only considered for the +final print outs and in the final stats matrix, not for the binary +matrix and the venn diagram): 'strict core' (option **-co**) for +OGs where **all** genomes have an ortholog, independent of the +cutoffs. Of course all the 'strict core' OGs are also included in +the 'cutoff\_core' category ('strict core' is identical to 'cutoff +core' with **-cut\_i** 1 and **-cut\_e** 0). Option **-s** activates the +detection of 'singleton/ORFan' OGs present in only **one** genome. +Depending on the cutoffs and number of genomes in the groups, +category 'underrepresented' includes most of these singletons. + +Additionally, annotation is retrieved from multi-FASTA files created +with [`cds_extractor.pl`](/cds_extractor). See +[`cds_extractor.pl`](/cds_extractor) for a description of the +format. These files are used as input for the PO analysis and with +option **-d** for `po2group_stats.pl`. The annotations are printed +in category output files in the result directory. + +Annotations are only pulled from one representative genome for each +category present in the current OG. With option **-co** you can set a +specific genome for the representative annotation for category +'strict core'. For all other categories the representative genome is +chosen according to the order of the genomes in the group files, +depending on the presence in each OG. Thus, the best annotated +genome should be in the first group at the topmost position +(especially for 'cutoff core'), as well as the best annotated ones +at the top in all other groups. + +In the result files, each orthologous group (OG) is listed in a row +of the resulting category files, the first column holds the OG +numbers from the PO input matrix (i.e. line number minus one). The +following columns specify the ID for each CDS, gene, EC number(s), +product, and organism are shown depending on their presence in the +CDS's annotation. The ID is in most cases the locus tag (see +[`cds_extractor.pl`](/cds_extractor)). If several EC numbers exist +for a single CDS they are separated by a ';'. If the representative +genome within an OG includes paralogs (co-orthologs) these will be +printed in the following row(s) **without** a new OG number in the +first column. + +The number of OGs in the category annotation result files are the +same as listed in the venn diagram and the final stats matrix. +However, since only annotation from one representative annotation is +used the CDS number will be different to the final stats. The final +stats include **all** the CDS in this category in **all** genomes +present in the OG in groups >= the inclusion cutoff (i.e. for +'strict core' the CDS for all genomes in this OG are counted). Two +categories are different, for 'unspecific' all unspecific groups are +included, for 'underrepresented' all groups <= the exclusion +cutoffs. This is also the reason, the 'pangenome' CDS count is +greater than the 'included in categories' CDS count in the final +stats matrix, as genomes in excluded groups are exempt from the CDS +counts for most categories. 'Included in categories' is the OG/CDS +sum of all non-optional categories ('\*specific', '\*absent', 'cutoff +core', 'underrepresented', and 'unspecific'), since the optional +categories are included in non-optionals. An exception to the +difference in CDS counts are the 'singletons' category where OG and +CDS counts are identical in the result files and in the overall +final output matrix (as there is only one genome), as well as in +group-'specific' categories for groups including only one genome. + +At last, if you want the respective representative sequences for a +category you can first filter the locus tags from the result file +with Unix command-line tools: + + grep -v "^#" result_file.tsv | cut -f 2 > locus_tags.txt + +And then feed the locus tag list to +[`cds_extractor.pl`](/cds_extractor) with option **-l**. + +As a final note, in the [prot_finder](/prot_finder) workflow is a +script, `binary_group_stats.pl`, based upon `po2group_stats.pl`, +which can calculate overall presence/absence statistics for column +groups in a delimited TEXT binary matrix (as with genomes here). + +## Usage + +### [`cds_extractor`](/cds_extractor) + + for i in *.[gbk|embl]; do perl cds_extractor.pl -i $i [-p|-n]; done + +### [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) + + proteinortho5.pl -graph [-synteny] -cpus=# -selfblast -singles -identity=50 -cov=50 -blastParameters='-use_sw_tback [-seg no|-dust no]' *.[faa|ffn] + +### po2group_stats + + perl po2group_stats.pl -i matrix.[proteinortho|poff] -d genome_fasta_dir/ -g group_file.tsv -r result_dir -cut_i 0.7 -cut_e 0.2 -b -p -co genome4.[faa|ffn] -s -u -a > overall_stats.tsv + +## Options + +### Mandatory options + +- **-i**=_str_, **-input**=_str_ + + Proteinortho (PO) result matrix (\*.proteinortho or \*.poff) + +- **-d**=_str_, **-dir\_genome**=_str_ + + Path to the directory including the genome multi-FASTA PO input files (\*.faa or \*.ffn), created with [`cds_extractor.pl`](/cds_extractor) + +- **-g**=_str_, **-groups\_file**=_str_ + + Tab-delimited file with group affiliation for the genomes with **minimal two** and **maximal four** groups (easiest to create in a spreadsheet software and save in tab-separated format). **All** genomes from the PO matrix need to be included. Group names can only include alphanumeric (a-z, A-Z, 0-9), underscore (\_), dash (-), and period (.) characters (no whitespaces allowed either). Example format with two genomes in group A, three genomes in group B and D, and one genome in group C: + + group\_A group\_B group\_C group\_D<br> + genome1.faa genome2.faa genome3.faa genome4.faa<br> + genome5.faa genome6.faa  genome7.faa<br> +  genome8.faa  genome9.faa + +### Optional options + +- **-h**, **-help** + + Help (perldoc POD) + +- **-r**=_str_, **-result\_dir**=_str_ + + Path to result folder \[default = inclusion and exclusion percentage cutoff, './results\_i#\_e#'\] + +- **-cut\_i**=_float_, **-cut\_inclusion**=_float_ + + Percentage inclusion cutoff for genomes in a group per OG, has to be > 0 and <= 1. Cutoff will be rounded according to the genome number in each group and has to be > the rounded exclusion cutoff in this group. \[default = 0.9\] + +- **-cut\_e**=_float_, **-cut\_exclusion**=_float_ + + Percentage exclusion cutoff, has to be >= 0 and < 1. Rounded cutoff has to be < rounded inclusion cutoff. \[default = 0.1\] + +- **-b**, **-binary\_matrix** + + Print a binary matrix with the presence/absence genome group results according to the cutoffs (excluding 'unspecific' category OGs) + +- **-p**, **-plot\_venn** + + Plot venn diagram from the binary matrix (except 'unspecific' and 'underrepresented' categories) with function `venn` from **R** package **gplots**, requires option **-b** + +- **-co**=(_str_), **-core_strict**=(_str_) + + Include 'strict core' category in output. Optionally, give a genome name from the PO matrix to use for the representative output annotation. \[default = topmost genome in first group\] + +- **-s**, **-singletons** + + Include singletons/ORFans for each genome in the output, activates also overall genome OG/CDS stats in final stats matrix for genomes with singletons + +- **-u**, **-unspecific** + + Include 'unspecific' category representative annotation file in result directory + +- **-a**, **-all\_genomes\_overall** + + Report overall stats for all genomes (appended to the final stats matrix), also those without singletons; will include all overall genome stats without option **-s** + +- **-v**, **-version** + + Print version number to *STDERR* + +## Output + +- *STDOUT* + + The tab-delimited final stats matrix is printed to *STDOUT*. Redirect or pipe into another tool as needed. + +- ./results_i#_e# + + All output files are stored in a results folder + +- ./results_i#_e#/[\*_specific|\*_absent|cutoff_core|underrepresented]_OGs.tsv + + Tab-delimited files with OG annotation from a representative genome for non-optional categories + +- (./results_i#_e#/[\*_singletons|strict_core|unspecific]_OGs.tsv) + + Optional category tab-delimited output files with representative annotation + +- (./results_i#_e#/binary_matrix.tsv) + + Tab-delimited binary matrix of group presence/absence results according to cutoffs (excluding 'unspecific' category) + +- (./results_i#_e#/venn_diagram.pdf) + + Venn diagram for non-optional categories (except 'unspecific' and 'underrepresented' categories) + +## Dependencies + +- **Statistical computing language [R](http://www.r-project.org/)** + + `Rscript` is needed to plot the venn diagram with option **-p**, tested with version 3.2.2 + +- **gplots (https://cran.r-project.org/web/packages/gplots/index.html)** + + Package needed for **R** to plot the venn diagram with function `venn`. Tested with **gplots** version 2.17.0. + +## Run environment + +The Perl script runs under UNIX and Windows flavors. + +## Author - contact + +Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) + +## Citation, installation, and license + +For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). + +## Changelog + +* v0.1.3 (06.06.2016) + * included check for file system conformity for group names + * some minor syntax changes and additions to error messages, basically adapting to [`binary_group_stats.pl`](/prot_finder) +* v0.1.2 (19.11.2015) + * added `pod2usage`-die for Getopts::Long call + * minor POD/README change +* v0.1.1 (30.10.2015) + * fixed bug for representative annotation in output files, the representative genome was not chosen according to genome order in the groups file +* v0.1 (23.10.2015)