Mercurial > repos > dereeper > pangenome_explorer
comparison COG/bac-genomics-scripts/po2group_stats/README.md @ 3:e42d30da7a74 draft
Uploaded
author | dereeper |
---|---|
date | Thu, 30 May 2024 11:52:25 +0000 |
parents | |
children |
comparison
equal
deleted
inserted
replaced
2:97e4e3e818b6 | 3:e42d30da7a74 |
---|---|
1 po2group_stats | |
2 ============== | |
3 | |
4 `po2group_stats.pl` is a script to categorize orthologs from [Proteinortho5](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) output according to genome groups. In the [prot_finder](/prot_finder) workflow is a script, `binary_group_stats.pl`, which does the same thing for column groups in a delimited TEXT binary matrix. | |
5 | |
6 * [Synopsis](#synopsis) | |
7 * [Description](#description) | |
8 * [Usage](#usage) | |
9 * [cds_extractor](#cds_extractor) | |
10 * [Proteinortho5](#proteinortho5) | |
11 * [po2group_stats](#po2group_stats) | |
12 * [Options](#options) | |
13 * [Mandatory options](#mandatory-options) | |
14 * [Optional options](#optional-options) | |
15 * [Output](#output) | |
16 * [Dependencies](#dependencies) | |
17 * [Run environment](#run-environment) | |
18 * [Author - contact](#author---contact) | |
19 * [Citation, installation, and license](#citation-installation-and-license) | |
20 * [Changelog](#changelog) | |
21 | |
22 ## Synopsis | |
23 | |
24 perl po2group_stats.pl -i matrix.proteinortho -d genome_fasta_dir/ -g group_file.tsv -p > overall_stats.tsv | |
25 | |
26 ## Description | |
27 | |
28 Categorize the genomes in an ortholog/paralog output matrix (option **-i**) from a | |
29 [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) | |
30 calculation according to group affiliations. The group | |
31 affiliations of the genomes are intended to get overall | |
32 presence/absence statistics for groups of genomes and not simply | |
33 single genomes (e.g. comparing 'marine', 'earth', 'commensal', | |
34 'pathogenic' etc. genome groups). Percentage inclusion (option | |
35 **-cut\_i**) and exclusion (option **-cut\_e**) cutoffs can be set to | |
36 define how strict the presence/absence of genome groups within an | |
37 orthologous group (OG) are defined. Of course groups can also hold | |
38 only single genomes to get single genome statistics. Group | |
39 affiliations are defined in a mandatory **tab-delimited** group input | |
40 file (option **-g**) with **minimal two** and **maximal four** groups. | |
41 | |
42 Only alphanumeric (a-z, A-Z, 0-9), underscore (\_), dash (-), and | |
43 period (.) characters are allowed for the **group names** in the | |
44 group file to avoid downstream problems with the operating/file | |
45 system. As a consequence, also no whitespaces are allowed in these! | |
46 Additionally, **group names**, **genome filenames** (should be | |
47 enforced by the file system), and **FASTA IDs** considering **all** | |
48 genome files (mostly locus tags; should be enforced by Proteinortho5) | |
49 need to be **unique**. | |
50 | |
51 **Proteinortho5** (PO) has to be run with option **-singles** to | |
52 include also genes without orthologs, so-called singletons/ORFans, | |
53 for each genome in the PO matrix (see the | |
54 [PO manual](http://www.bioinf.uni-leipzig.de/Software/proteinortho/manual.html)). | |
55 Additionally, option **-selfblast** is recommended to enhance | |
56 paralog detection by PO. | |
57 | |
58 To explain the logic behind the categorization, the following | |
59 annotation for example groups will be used. A '1' exemplifies a | |
60 group genome count in a respective OG >= the rounded inclusion | |
61 cutoff, a '0' a group genome count <= the rounded exclusion cutoff. | |
62 The presence and absence of OGs for the group affiliations are | |
63 structured in different categories depending on the number of | |
64 groups. For **two groups** (e.g. A and B) there are five categories: | |
65 'A specific' (A:B = 1:0), 'B specific' (0:1), 'cutoff core' (1:1), | |
66 'underrepresented' (0:0), and 'unspecific'. Unspecific OGs have a | |
67 genome count for at least **one** group outside the cutoffs | |
68 (exclusion cutoff < genome count < inclusion cutoff) and | |
69 thus cannot be categorized. These 'unspecific' OGs will only be | |
70 printed to a final annotation result file with option **-u**. Overall | |
71 stats for all categories are printed to *STDOUT* in a final | |
72 tab-delimited output matrix. | |
73 | |
74 **Three groups** (A, B, and C) have the following nine categories: 'A | |
75 specific' (A:B:C = 1:0:0), 'B specific' (0:1:0), 'C specific' | |
76 (0:0:1), 'A absent' (0:1:1), 'B absent' (1:0:1), 'C absent' (1:1:0), | |
77 'cutoff core' (1:1:1), 'underrepresented' (0:0:0), and 'unspecific'. | |
78 | |
79 **Four groups** (A, B, C, and D) are classified in 17 categories: 'A | |
80 specific' (A:B:C:D = 1:0:0:0), 'B specific' (0:1:0:0), 'C specific' | |
81 (0:0:1:0), 'D specific' (0:0:0:1), 'A-B specific' (1:1:0:0), 'A-C | |
82 specific' (1:0:1:0), 'A-D specific' (1:0:0:1), 'B-C specific' | |
83 (0:1:1:0), 'B-D specific' (0:1:0:1), 'C-D specific' (0:0:1:1), 'A | |
84 absent' (0:1:1:1), 'B absent' (1:0:1:1), 'C absent' (1:1:0:1), 'D | |
85 absent' (1:1:1:0), 'cutoff core' (1:1:1:1), 'underrepresented' | |
86 (0:0:0:0), and 'unspecific'. | |
87 | |
88 The resulting group presence/absence (according to the cutoffs) can | |
89 also be printed to a binary matrix (option **-b**) in the result | |
90 directory (option **-r**), excluding the 'unspecific' category. Since | |
91 the categories are the logics underlying venn diagrams, you can also | |
92 plot the results in a venn diagram using the binary matrix (option | |
93 **-p**). The 'underrepresented' category is exempt from the venn | |
94 diagram, because it is outside of venn diagram logics. | |
95 | |
96 Here are venn diagrams illustrating the logic categories (see folder ['pics'](./pics)): | |
97 | |
98 <p align="center"> | |
99 <img alt="venn diagram logics" src="https://github.com/aleimba/bac-genomics-scripts/raw/master/po2group_stats/pics/venn_diagram_logics.png"> | |
100 </p> | |
101 | |
102 There are two optional categories (which are only considered for the | |
103 final print outs and in the final stats matrix, not for the binary | |
104 matrix and the venn diagram): 'strict core' (option **-co**) for | |
105 OGs where **all** genomes have an ortholog, independent of the | |
106 cutoffs. Of course all the 'strict core' OGs are also included in | |
107 the 'cutoff\_core' category ('strict core' is identical to 'cutoff | |
108 core' with **-cut\_i** 1 and **-cut\_e** 0). Option **-s** activates the | |
109 detection of 'singleton/ORFan' OGs present in only **one** genome. | |
110 Depending on the cutoffs and number of genomes in the groups, | |
111 category 'underrepresented' includes most of these singletons. | |
112 | |
113 Additionally, annotation is retrieved from multi-FASTA files created | |
114 with [`cds_extractor.pl`](/cds_extractor). See | |
115 [`cds_extractor.pl`](/cds_extractor) for a description of the | |
116 format. These files are used as input for the PO analysis and with | |
117 option **-d** for `po2group_stats.pl`. The annotations are printed | |
118 in category output files in the result directory. | |
119 | |
120 Annotations are only pulled from one representative genome for each | |
121 category present in the current OG. With option **-co** you can set a | |
122 specific genome for the representative annotation for category | |
123 'strict core'. For all other categories the representative genome is | |
124 chosen according to the order of the genomes in the group files, | |
125 depending on the presence in each OG. Thus, the best annotated | |
126 genome should be in the first group at the topmost position | |
127 (especially for 'cutoff core'), as well as the best annotated ones | |
128 at the top in all other groups. | |
129 | |
130 In the result files, each orthologous group (OG) is listed in a row | |
131 of the resulting category files, the first column holds the OG | |
132 numbers from the PO input matrix (i.e. line number minus one). The | |
133 following columns specify the ID for each CDS, gene, EC number(s), | |
134 product, and organism are shown depending on their presence in the | |
135 CDS's annotation. The ID is in most cases the locus tag (see | |
136 [`cds_extractor.pl`](/cds_extractor)). If several EC numbers exist | |
137 for a single CDS they are separated by a ';'. If the representative | |
138 genome within an OG includes paralogs (co-orthologs) these will be | |
139 printed in the following row(s) **without** a new OG number in the | |
140 first column. | |
141 | |
142 The number of OGs in the category annotation result files are the | |
143 same as listed in the venn diagram and the final stats matrix. | |
144 However, since only annotation from one representative annotation is | |
145 used the CDS number will be different to the final stats. The final | |
146 stats include **all** the CDS in this category in **all** genomes | |
147 present in the OG in groups >= the inclusion cutoff (i.e. for | |
148 'strict core' the CDS for all genomes in this OG are counted). Two | |
149 categories are different, for 'unspecific' all unspecific groups are | |
150 included, for 'underrepresented' all groups <= the exclusion | |
151 cutoffs. This is also the reason, the 'pangenome' CDS count is | |
152 greater than the 'included in categories' CDS count in the final | |
153 stats matrix, as genomes in excluded groups are exempt from the CDS | |
154 counts for most categories. 'Included in categories' is the OG/CDS | |
155 sum of all non-optional categories ('\*specific', '\*absent', 'cutoff | |
156 core', 'underrepresented', and 'unspecific'), since the optional | |
157 categories are included in non-optionals. An exception to the | |
158 difference in CDS counts are the 'singletons' category where OG and | |
159 CDS counts are identical in the result files and in the overall | |
160 final output matrix (as there is only one genome), as well as in | |
161 group-'specific' categories for groups including only one genome. | |
162 | |
163 At last, if you want the respective representative sequences for a | |
164 category you can first filter the locus tags from the result file | |
165 with Unix command-line tools: | |
166 | |
167 grep -v "^#" result_file.tsv | cut -f 2 > locus_tags.txt | |
168 | |
169 And then feed the locus tag list to | |
170 [`cds_extractor.pl`](/cds_extractor) with option **-l**. | |
171 | |
172 As a final note, in the [prot_finder](/prot_finder) workflow is a | |
173 script, `binary_group_stats.pl`, based upon `po2group_stats.pl`, | |
174 which can calculate overall presence/absence statistics for column | |
175 groups in a delimited TEXT binary matrix (as with genomes here). | |
176 | |
177 ## Usage | |
178 | |
179 ### [`cds_extractor`](/cds_extractor) | |
180 | |
181 for i in *.[gbk|embl]; do perl cds_extractor.pl -i $i [-p|-n]; done | |
182 | |
183 ### [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) | |
184 | |
185 proteinortho5.pl -graph [-synteny] -cpus=# -selfblast -singles -identity=50 -cov=50 -blastParameters='-use_sw_tback [-seg no|-dust no]' *.[faa|ffn] | |
186 | |
187 ### po2group_stats | |
188 | |
189 perl po2group_stats.pl -i matrix.[proteinortho|poff] -d genome_fasta_dir/ -g group_file.tsv -r result_dir -cut_i 0.7 -cut_e 0.2 -b -p -co genome4.[faa|ffn] -s -u -a > overall_stats.tsv | |
190 | |
191 ## Options | |
192 | |
193 ### Mandatory options | |
194 | |
195 - **-i**=_str_, **-input**=_str_ | |
196 | |
197 Proteinortho (PO) result matrix (\*.proteinortho or \*.poff) | |
198 | |
199 - **-d**=_str_, **-dir\_genome**=_str_ | |
200 | |
201 Path to the directory including the genome multi-FASTA PO input files (\*.faa or \*.ffn), created with [`cds_extractor.pl`](/cds_extractor) | |
202 | |
203 - **-g**=_str_, **-groups\_file**=_str_ | |
204 | |
205 Tab-delimited file with group affiliation for the genomes with **minimal two** and **maximal four** groups (easiest to create in a spreadsheet software and save in tab-separated format). **All** genomes from the PO matrix need to be included. Group names can only include alphanumeric (a-z, A-Z, 0-9), underscore (\_), dash (-), and period (.) characters (no whitespaces allowed either). Example format with two genomes in group A, three genomes in group B and D, and one genome in group C: | |
206 | |
207 group\_A group\_B group\_C group\_D<br> | |
208 genome1.faa genome2.faa genome3.faa genome4.faa<br> | |
209 genome5.faa genome6.faa  genome7.faa<br> | |
210  genome8.faa  genome9.faa | |
211 | |
212 ### Optional options | |
213 | |
214 - **-h**, **-help** | |
215 | |
216 Help (perldoc POD) | |
217 | |
218 - **-r**=_str_, **-result\_dir**=_str_ | |
219 | |
220 Path to result folder \[default = inclusion and exclusion percentage cutoff, './results\_i#\_e#'\] | |
221 | |
222 - **-cut\_i**=_float_, **-cut\_inclusion**=_float_ | |
223 | |
224 Percentage inclusion cutoff for genomes in a group per OG, has to be > 0 and <= 1. Cutoff will be rounded according to the genome number in each group and has to be > the rounded exclusion cutoff in this group. \[default = 0.9\] | |
225 | |
226 - **-cut\_e**=_float_, **-cut\_exclusion**=_float_ | |
227 | |
228 Percentage exclusion cutoff, has to be >= 0 and < 1. Rounded cutoff has to be < rounded inclusion cutoff. \[default = 0.1\] | |
229 | |
230 - **-b**, **-binary\_matrix** | |
231 | |
232 Print a binary matrix with the presence/absence genome group results according to the cutoffs (excluding 'unspecific' category OGs) | |
233 | |
234 - **-p**, **-plot\_venn** | |
235 | |
236 Plot venn diagram from the binary matrix (except 'unspecific' and 'underrepresented' categories) with function `venn` from **R** package **gplots**, requires option **-b** | |
237 | |
238 - **-co**=(_str_), **-core_strict**=(_str_) | |
239 | |
240 Include 'strict core' category in output. Optionally, give a genome name from the PO matrix to use for the representative output annotation. \[default = topmost genome in first group\] | |
241 | |
242 - **-s**, **-singletons** | |
243 | |
244 Include singletons/ORFans for each genome in the output, activates also overall genome OG/CDS stats in final stats matrix for genomes with singletons | |
245 | |
246 - **-u**, **-unspecific** | |
247 | |
248 Include 'unspecific' category representative annotation file in result directory | |
249 | |
250 - **-a**, **-all\_genomes\_overall** | |
251 | |
252 Report overall stats for all genomes (appended to the final stats matrix), also those without singletons; will include all overall genome stats without option **-s** | |
253 | |
254 - **-v**, **-version** | |
255 | |
256 Print version number to *STDERR* | |
257 | |
258 ## Output | |
259 | |
260 - *STDOUT* | |
261 | |
262 The tab-delimited final stats matrix is printed to *STDOUT*. Redirect or pipe into another tool as needed. | |
263 | |
264 - ./results_i#_e# | |
265 | |
266 All output files are stored in a results folder | |
267 | |
268 - ./results_i#_e#/[\*_specific|\*_absent|cutoff_core|underrepresented]_OGs.tsv | |
269 | |
270 Tab-delimited files with OG annotation from a representative genome for non-optional categories | |
271 | |
272 - (./results_i#_e#/[\*_singletons|strict_core|unspecific]_OGs.tsv) | |
273 | |
274 Optional category tab-delimited output files with representative annotation | |
275 | |
276 - (./results_i#_e#/binary_matrix.tsv) | |
277 | |
278 Tab-delimited binary matrix of group presence/absence results according to cutoffs (excluding 'unspecific' category) | |
279 | |
280 - (./results_i#_e#/venn_diagram.pdf) | |
281 | |
282 Venn diagram for non-optional categories (except 'unspecific' and 'underrepresented' categories) | |
283 | |
284 ## Dependencies | |
285 | |
286 - **Statistical computing language [R](http://www.r-project.org/)** | |
287 | |
288 `Rscript` is needed to plot the venn diagram with option **-p**, tested with version 3.2.2 | |
289 | |
290 - **gplots (https://cran.r-project.org/web/packages/gplots/index.html)** | |
291 | |
292 Package needed for **R** to plot the venn diagram with function `venn`. Tested with **gplots** version 2.17.0. | |
293 | |
294 ## Run environment | |
295 | |
296 The Perl script runs under UNIX and Windows flavors. | |
297 | |
298 ## Author - contact | |
299 | |
300 Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) | |
301 | |
302 ## Citation, installation, and license | |
303 | |
304 For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). | |
305 | |
306 ## Changelog | |
307 | |
308 * v0.1.3 (06.06.2016) | |
309 * included check for file system conformity for group names | |
310 * some minor syntax changes and additions to error messages, basically adapting to [`binary_group_stats.pl`](/prot_finder) | |
311 * v0.1.2 (19.11.2015) | |
312 * added `pod2usage`-die for Getopts::Long call | |
313 * minor POD/README change | |
314 * v0.1.1 (30.10.2015) | |
315 * fixed bug for representative annotation in output files, the representative genome was not chosen according to genome order in the groups file | |
316 * v0.1 (23.10.2015) |