comparison COG/bac-genomics-scripts/po2group_stats/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
parents
children
comparison
equal deleted inserted replaced
2:97e4e3e818b6 3:e42d30da7a74
1 po2group_stats
2 ==============
3
4 `po2group_stats.pl` is a script to categorize orthologs from [Proteinortho5](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) output according to genome groups. In the [prot_finder](/prot_finder) workflow is a script, `binary_group_stats.pl`, which does the same thing for column groups in a delimited TEXT binary matrix.
5
6 * [Synopsis](#synopsis)
7 * [Description](#description)
8 * [Usage](#usage)
9 * [cds_extractor](#cds_extractor)
10 * [Proteinortho5](#proteinortho5)
11 * [po2group_stats](#po2group_stats)
12 * [Options](#options)
13 * [Mandatory options](#mandatory-options)
14 * [Optional options](#optional-options)
15 * [Output](#output)
16 * [Dependencies](#dependencies)
17 * [Run environment](#run-environment)
18 * [Author - contact](#author---contact)
19 * [Citation, installation, and license](#citation-installation-and-license)
20 * [Changelog](#changelog)
21
22 ## Synopsis
23
24 perl po2group_stats.pl -i matrix.proteinortho -d genome_fasta_dir/ -g group_file.tsv -p > overall_stats.tsv
25
26 ## Description
27
28 Categorize the genomes in an ortholog/paralog output matrix (option **-i**) from a
29 [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/)
30 calculation according to group affiliations. The group
31 affiliations of the genomes are intended to get overall
32 presence/absence statistics for groups of genomes and not simply
33 single genomes (e.g. comparing 'marine', 'earth', 'commensal',
34 'pathogenic' etc. genome groups). Percentage inclusion (option
35 **-cut\_i**) and exclusion (option **-cut\_e**) cutoffs can be set to
36 define how strict the presence/absence of genome groups within an
37 orthologous group (OG) are defined. Of course groups can also hold
38 only single genomes to get single genome statistics. Group
39 affiliations are defined in a mandatory **tab-delimited** group input
40 file (option **-g**) with **minimal two** and **maximal four** groups.
41
42 Only alphanumeric (a-z, A-Z, 0-9), underscore (\_), dash (-), and
43 period (.) characters are allowed for the **group names** in the
44 group file to avoid downstream problems with the operating/file
45 system. As a consequence, also no whitespaces are allowed in these!
46 Additionally, **group names**, **genome filenames** (should be
47 enforced by the file system), and **FASTA IDs** considering **all**
48 genome files (mostly locus tags; should be enforced by Proteinortho5)
49 need to be **unique**.
50
51 **Proteinortho5** (PO) has to be run with option **-singles** to
52 include also genes without orthologs, so-called singletons/ORFans,
53 for each genome in the PO matrix (see the
54 [PO manual](http://www.bioinf.uni-leipzig.de/Software/proteinortho/manual.html)).
55 Additionally, option **-selfblast** is recommended to enhance
56 paralog detection by PO.
57
58 To explain the logic behind the categorization, the following
59 annotation for example groups will be used. A '1' exemplifies a
60 group genome count in a respective OG >= the rounded inclusion
61 cutoff, a '0' a group genome count <= the rounded exclusion cutoff.
62 The presence and absence of OGs for the group affiliations are
63 structured in different categories depending on the number of
64 groups. For **two groups** (e.g. A and B) there are five categories:
65 'A specific' (A:B = 1:0), 'B specific' (0:1), 'cutoff core' (1:1),
66 'underrepresented' (0:0), and 'unspecific'. Unspecific OGs have a
67 genome count for at least **one** group outside the cutoffs
68 (exclusion cutoff < genome count < inclusion cutoff) and
69 thus cannot be categorized. These 'unspecific' OGs will only be
70 printed to a final annotation result file with option **-u**. Overall
71 stats for all categories are printed to *STDOUT* in a final
72 tab-delimited output matrix.
73
74 **Three groups** (A, B, and C) have the following nine categories: 'A
75 specific' (A:B:C = 1:0:0), 'B specific' (0:1:0), 'C specific'
76 (0:0:1), 'A absent' (0:1:1), 'B absent' (1:0:1), 'C absent' (1:1:0),
77 'cutoff core' (1:1:1), 'underrepresented' (0:0:0), and 'unspecific'.
78
79 **Four groups** (A, B, C, and D) are classified in 17 categories: 'A
80 specific' (A:B:C:D = 1:0:0:0), 'B specific' (0:1:0:0), 'C specific'
81 (0:0:1:0), 'D specific' (0:0:0:1), 'A-B specific' (1:1:0:0), 'A-C
82 specific' (1:0:1:0), 'A-D specific' (1:0:0:1), 'B-C specific'
83 (0:1:1:0), 'B-D specific' (0:1:0:1), 'C-D specific' (0:0:1:1), 'A
84 absent' (0:1:1:1), 'B absent' (1:0:1:1), 'C absent' (1:1:0:1), 'D
85 absent' (1:1:1:0), 'cutoff core' (1:1:1:1), 'underrepresented'
86 (0:0:0:0), and 'unspecific'.
87
88 The resulting group presence/absence (according to the cutoffs) can
89 also be printed to a binary matrix (option **-b**) in the result
90 directory (option **-r**), excluding the 'unspecific' category. Since
91 the categories are the logics underlying venn diagrams, you can also
92 plot the results in a venn diagram using the binary matrix (option
93 **-p**). The 'underrepresented' category is exempt from the venn
94 diagram, because it is outside of venn diagram logics.
95
96 Here are venn diagrams illustrating the logic categories (see folder ['pics'](./pics)):
97
98 <p align="center">
99 <img alt="venn diagram logics" src="https://github.com/aleimba/bac-genomics-scripts/raw/master/po2group_stats/pics/venn_diagram_logics.png">
100 </p>
101
102 There are two optional categories (which are only considered for the
103 final print outs and in the final stats matrix, not for the binary
104 matrix and the venn diagram): 'strict core' (option **-co**) for
105 OGs where **all** genomes have an ortholog, independent of the
106 cutoffs. Of course all the 'strict core' OGs are also included in
107 the 'cutoff\_core' category ('strict core' is identical to 'cutoff
108 core' with **-cut\_i** 1 and **-cut\_e** 0). Option **-s** activates the
109 detection of 'singleton/ORFan' OGs present in only **one** genome.
110 Depending on the cutoffs and number of genomes in the groups,
111 category 'underrepresented' includes most of these singletons.
112
113 Additionally, annotation is retrieved from multi-FASTA files created
114 with [`cds_extractor.pl`](/cds_extractor). See
115 [`cds_extractor.pl`](/cds_extractor) for a description of the
116 format. These files are used as input for the PO analysis and with
117 option **-d** for `po2group_stats.pl`. The annotations are printed
118 in category output files in the result directory.
119
120 Annotations are only pulled from one representative genome for each
121 category present in the current OG. With option **-co** you can set a
122 specific genome for the representative annotation for category
123 'strict core'. For all other categories the representative genome is
124 chosen according to the order of the genomes in the group files,
125 depending on the presence in each OG. Thus, the best annotated
126 genome should be in the first group at the topmost position
127 (especially for 'cutoff core'), as well as the best annotated ones
128 at the top in all other groups.
129
130 In the result files, each orthologous group (OG) is listed in a row
131 of the resulting category files, the first column holds the OG
132 numbers from the PO input matrix (i.e. line number minus one). The
133 following columns specify the ID for each CDS, gene, EC number(s),
134 product, and organism are shown depending on their presence in the
135 CDS's annotation. The ID is in most cases the locus tag (see
136 [`cds_extractor.pl`](/cds_extractor)). If several EC numbers exist
137 for a single CDS they are separated by a ';'. If the representative
138 genome within an OG includes paralogs (co-orthologs) these will be
139 printed in the following row(s) **without** a new OG number in the
140 first column.
141
142 The number of OGs in the category annotation result files are the
143 same as listed in the venn diagram and the final stats matrix.
144 However, since only annotation from one representative annotation is
145 used the CDS number will be different to the final stats. The final
146 stats include **all** the CDS in this category in **all** genomes
147 present in the OG in groups >= the inclusion cutoff (i.e. for
148 'strict core' the CDS for all genomes in this OG are counted). Two
149 categories are different, for 'unspecific' all unspecific groups are
150 included, for 'underrepresented' all groups <= the exclusion
151 cutoffs. This is also the reason, the 'pangenome' CDS count is
152 greater than the 'included in categories' CDS count in the final
153 stats matrix, as genomes in excluded groups are exempt from the CDS
154 counts for most categories. 'Included in categories' is the OG/CDS
155 sum of all non-optional categories ('\*specific', '\*absent', 'cutoff
156 core', 'underrepresented', and 'unspecific'), since the optional
157 categories are included in non-optionals. An exception to the
158 difference in CDS counts are the 'singletons' category where OG and
159 CDS counts are identical in the result files and in the overall
160 final output matrix (as there is only one genome), as well as in
161 group-'specific' categories for groups including only one genome.
162
163 At last, if you want the respective representative sequences for a
164 category you can first filter the locus tags from the result file
165 with Unix command-line tools:
166
167 grep -v "^#" result_file.tsv | cut -f 2 > locus_tags.txt
168
169 And then feed the locus tag list to
170 [`cds_extractor.pl`](/cds_extractor) with option **-l**.
171
172 As a final note, in the [prot_finder](/prot_finder) workflow is a
173 script, `binary_group_stats.pl`, based upon `po2group_stats.pl`,
174 which can calculate overall presence/absence statistics for column
175 groups in a delimited TEXT binary matrix (as with genomes here).
176
177 ## Usage
178
179 ### [`cds_extractor`](/cds_extractor)
180
181 for i in *.[gbk|embl]; do perl cds_extractor.pl -i $i [-p|-n]; done
182
183 ### [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/)
184
185 proteinortho5.pl -graph [-synteny] -cpus=# -selfblast -singles -identity=50 -cov=50 -blastParameters='-use_sw_tback [-seg no|-dust no]' *.[faa|ffn]
186
187 ### po2group_stats
188
189 perl po2group_stats.pl -i matrix.[proteinortho|poff] -d genome_fasta_dir/ -g group_file.tsv -r result_dir -cut_i 0.7 -cut_e 0.2 -b -p -co genome4.[faa|ffn] -s -u -a > overall_stats.tsv
190
191 ## Options
192
193 ### Mandatory options
194
195 - **-i**=_str_, **-input**=_str_
196
197 Proteinortho (PO) result matrix (\*.proteinortho or \*.poff)
198
199 - **-d**=_str_, **-dir\_genome**=_str_
200
201 Path to the directory including the genome multi-FASTA PO input files (\*.faa or \*.ffn), created with [`cds_extractor.pl`](/cds_extractor)
202
203 - **-g**=_str_, **-groups\_file**=_str_
204
205 Tab-delimited file with group affiliation for the genomes with **minimal two** and **maximal four** groups (easiest to create in a spreadsheet software and save in tab-separated format). **All** genomes from the PO matrix need to be included. Group names can only include alphanumeric (a-z, A-Z, 0-9), underscore (\_), dash (-), and period (.) characters (no whitespaces allowed either). Example format with two genomes in group A, three genomes in group B and D, and one genome in group C:
206
207 group\_A&emsp;group\_B&emsp;group\_C&emsp;group\_D<br>
208 genome1.faa&emsp;genome2.faa&emsp;genome3.faa&emsp;genome4.faa<br>
209 genome5.faa&emsp;genome6.faa&emsp;&emsp;genome7.faa<br>
210 &emsp;genome8.faa&emsp;&emsp;genome9.faa
211
212 ### Optional options
213
214 - **-h**, **-help**
215
216 Help (perldoc POD)
217
218 - **-r**=_str_, **-result\_dir**=_str_
219
220 Path to result folder \[default = inclusion and exclusion percentage cutoff, './results\_i#\_e#'\]
221
222 - **-cut\_i**=_float_, **-cut\_inclusion**=_float_
223
224 Percentage inclusion cutoff for genomes in a group per OG, has to be > 0 and <= 1. Cutoff will be rounded according to the genome number in each group and has to be > the rounded exclusion cutoff in this group. \[default = 0.9\]
225
226 - **-cut\_e**=_float_, **-cut\_exclusion**=_float_
227
228 Percentage exclusion cutoff, has to be >= 0 and < 1. Rounded cutoff has to be < rounded inclusion cutoff. \[default = 0.1\]
229
230 - **-b**, **-binary\_matrix**
231
232 Print a binary matrix with the presence/absence genome group results according to the cutoffs (excluding 'unspecific' category OGs)
233
234 - **-p**, **-plot\_venn**
235
236 Plot venn diagram from the binary matrix (except 'unspecific' and 'underrepresented' categories) with function `venn` from **R** package **gplots**, requires option **-b**
237
238 - **-co**=(_str_), **-core_strict**=(_str_)
239
240 Include 'strict core' category in output. Optionally, give a genome name from the PO matrix to use for the representative output annotation. \[default = topmost genome in first group\]
241
242 - **-s**, **-singletons**
243
244 Include singletons/ORFans for each genome in the output, activates also overall genome OG/CDS stats in final stats matrix for genomes with singletons
245
246 - **-u**, **-unspecific**
247
248 Include 'unspecific' category representative annotation file in result directory
249
250 - **-a**, **-all\_genomes\_overall**
251
252 Report overall stats for all genomes (appended to the final stats matrix), also those without singletons; will include all overall genome stats without option **-s**
253
254 - **-v**, **-version**
255
256 Print version number to *STDERR*
257
258 ## Output
259
260 - *STDOUT*
261
262 The tab-delimited final stats matrix is printed to *STDOUT*. Redirect or pipe into another tool as needed.
263
264 - ./results_i#_e#
265
266 All output files are stored in a results folder
267
268 - ./results_i#_e#/[\*_specific|\*_absent|cutoff_core|underrepresented]_OGs.tsv
269
270 Tab-delimited files with OG annotation from a representative genome for non-optional categories
271
272 - (./results_i#_e#/[\*_singletons|strict_core|unspecific]_OGs.tsv)
273
274 Optional category tab-delimited output files with representative annotation
275
276 - (./results_i#_e#/binary_matrix.tsv)
277
278 Tab-delimited binary matrix of group presence/absence results according to cutoffs (excluding 'unspecific' category)
279
280 - (./results_i#_e#/venn_diagram.pdf)
281
282 Venn diagram for non-optional categories (except 'unspecific' and 'underrepresented' categories)
283
284 ## Dependencies
285
286 - **Statistical computing language [R](http://www.r-project.org/)**
287
288 `Rscript` is needed to plot the venn diagram with option **-p**, tested with version 3.2.2
289
290 - **gplots (https://cran.r-project.org/web/packages/gplots/index.html)**
291
292 Package needed for **R** to plot the venn diagram with function `venn`. Tested with **gplots** version 2.17.0.
293
294 ## Run environment
295
296 The Perl script runs under UNIX and Windows flavors.
297
298 ## Author - contact
299
300 Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
301
302 ## Citation, installation, and license
303
304 For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
305
306 ## Changelog
307
308 * v0.1.3 (06.06.2016)
309 * included check for file system conformity for group names
310 * some minor syntax changes and additions to error messages, basically adapting to [`binary_group_stats.pl`](/prot_finder)
311 * v0.1.2 (19.11.2015)
312 * added `pod2usage`-die for Getopts::Long call
313 * minor POD/README change
314 * v0.1.1 (30.10.2015)
315 * fixed bug for representative annotation in output files, the representative genome was not chosen according to genome order in the groups file
316 * v0.1 (23.10.2015)