comparison COG/bac-genomics-scripts/po2anno/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
2:97e4e3e818b6 3:e42d30da7a74
1 po2anno
2 =======
3
4 `po2anno.pl` is a script to create an annotation comparison matrix from [Proteinortho5](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) output.
5
6 * [Synopsis](#synopsis)
7 * [Description](#description)
8 * [Usage](#usage)
9 * [cds_extractor](#cds_extractor)
10 * [Proteinortho5](#proteinortho5)
11 * [po2anno](#po2anno)
12 * [Options](#options)
13 * [Mandatory options](#mandatory-options)
14 * [Optional options](#optional-options)
15 * [Output](#output)
16 * [Run environment](#run-environment)
17 * [Author - contact](#author---contact)
18 * [Citation, installation, and license](#citation-installation-and-license)
19 * [Changelog](#changelog)
20
21 ## Synopsis
22
23 perl po2anno.pl -i matrix.proteinortho -d genome_fasta_dir/ -l -a > annotation_comparison.tsv
24
25 ## Description
26
27 Supplement an ortholog/paralog output matrix from a
28 [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/)
29 calculation with annotation information. The resulting tab-separated
30 annotation comparison matrix (ACM) is mainly intended for the
31 transfer of high quality annotations from reference genomes to
32 homologs (orthologs and co-orthologs/paralogs) in a query genome
33 (e.g. in conjunction with [`tbl2tab.pl`](/tbl2tab)). It can also be
34 used for a quick look at the annotation of genes that are present
35 in only a few of the input genomes, in comparison to the
36 others.
37
38 Annotation is retrieved from multi-FASTA files created with
39 [`cds_extractor.pl`](/cds_extractor). See
40 [`cds_extractor.pl`](/cds_extractor) for a description of the
41 format. These files are used as input for the PO analysis and for option
42 **-d** of `po2anno.pl`.
43
44 **Proteinortho5** (PO) has to be run with option **-singles** to also
45 include genes without orthologs, so-called singletons/ORFans, for each
46 genome in the PO matrix (see the
47 [PO manual](http://www.bioinf.uni-leipzig.de/Software/proteinortho/manual.html)).
48 Additionally, option **-selfblast** is recommended to enhance paralog
49 detection by PO.
50
51 Each orthologous group (OG) is listed in a row of the resulting ACM;
52 the first column holds the OG numbers from the PO input matrix (i.e.
53 line number minus one). The following columns specify the
54 orthologous CDS for each input genome. For each CDS the ID,
55 optionally the length in bp (option **-l**), gene, EC number(s), and
56 product are shown depending on their presence in the CDS's
57 annotation. The ID is in most cases the locus tag (see
58 [`cds_extractor.pl`](/cds_extractor)). If several EC numbers exist
59 for a single CDS they're separated by ';'. If an OG includes
60 paralogs, i.e. co-orthologs from a single genome, these will be
61 printed in the following row(s) **without** a new OG number in the
62 first column. The order of paralogous CDSs within an OG is
63 arbitrary.
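
The row layout described above can be handled with a few lines of shell. The following is a minimal sketch, not part of `po2anno.pl`: the function names `fill_og` and `po_row` are hypothetical, the sample rows are invented, and the cell contents are kept opaque because the exact cell layout should be checked against real output. `fill_og` repeats the OG number on paralog continuation rows so that every row can be filtered or sorted on its own, and `po_row` looks up the PO matrix line that belongs to an OG number.

```shell
# fill_og: repeat the OG number on paralog continuation rows of an ACM.
fill_og() {
    awk 'BEGIN { FS = OFS = "\t" }
         $1 != "" { og = $1 }   # row opens a new OG: remember its number
         $1 == "" { $1 = og }   # continuation row: reuse the remembered number
         { print }' "$@"
}

# po_row: print the Proteinortho matrix row for an OG number
# (OG number = line number minus one, so OG n sits on line n + 1).
po_row() {
    sed -n "$(( $1 + 1 ))p" "$2"
}

# Hypothetical excerpt (assumed layout): OG 7 has a second CDS in genome B.
printf '7\tA_001\tB_010\n\t\tB_011\n8\tA_002\tB_020\n' | fill_og
```

A real file would be passed as an argument instead, e.g. `fill_og annotation_comparison.tsv` or `po_row 7 matrix.proteinortho`.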
64
65 The OGs are sorted numerically via the query ID (see option **-q**).
66 If option **-a** is set, the non-query OGs are appended to the output
67 after the query OGs, sorted numerically via OG number.
68
69 ## Usage
70
71 ### [`cds_extractor`](/cds_extractor)
72
73 for i in *.[gbk|embl]; do perl cds_extractor.pl -i $i [-p|-n]; done
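
The bracket notation in the line above is shorthand for alternatives, not literal shell syntax. As a hedged sketch of one way to expand it, the hypothetical helper `extract_cmds` below prints one `cds_extractor.pl` call per `*.gbk` or `*.embl` file in a directory, with the mode flag (**-p** or **-n**, as in the usage line) passed through unchanged. It only prints the commands, so they can be reviewed before anything is run.

```shell
# extract_cmds DIR MODE: print one cds_extractor.pl call per *.gbk / *.embl file in DIR.
# MODE is -p or -n, taken from the usage line; this helper is illustrative only.
extract_cmds() {
    for f in "$1"/*.gbk "$1"/*.embl; do
        [ -e "$f" ] || continue          # skip a pattern that matched nothing
        printf 'perl cds_extractor.pl -i "%s" %s\n' "$f" "$2"
    done
}

extract_cmds . -p    # dry run: review the printed commands before executing them
```

Once the printed commands look right, they can be piped into `sh` or pasted into a terminal.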
74
75 ### [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/)
76
77 proteinortho5.pl -graph [-synteny] -cpus=# -selfblast -singles -identity=50 -cov=50 -blastParameters='-use_sw_tback [-seg no|-dust no]' *.[faa|ffn]
78
79 ### po2anno
80
81 perl po2anno.pl -i matrix.[proteinortho|poff] -d genome_fasta_dir/ -q query.[faa|ffn] -l -a > annotation_comparison.tsv
82
83 ## Options
84
85 ### Mandatory options
86
87 - **-i**=_str_, **-input**=_str_
88
89 Proteinortho (PO) result matrix (\*.proteinortho or \*.poff), or piped *STDIN* (-)
90
91 - **-d**=_str_, **-dir\_genome**=_str_
92
93 Path to the directory including the genome multi-FASTA PO input files (\*.faa or \*.ffn), created with [`cds_extractor.pl`](/cds_extractor)
94
95 ### Optional options
96
97 - **-h**, **-help**
98
99 Help (perldoc POD)
100
101 - **-q**=_str_, **-query**=_str_
102
103 Query genome (has to be identical to the string in the PO matrix) [default = first one in alphabetical order]
104
105 - **-l**, **-length**
106
107 Include length of each CDS in bp
108
109 - **-a**, **-all**
110
111 Append non-query orthologous groups (OGs) to the output
112
113 - **-v**, **-version**
114
115 Print version number to *STDERR*
116
117 ## Output
118
119 - *STDOUT*
120
121 The resulting tab-delimited ACM is printed to *STDOUT*. Redirect or pipe into another tool as needed (e.g. `cut`, `grep`, `head`, or `tail`).
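
Because the ACM is plain tab-separated text, the tools named above work on it directly. The following sketch uses an invented three-row excerpt with opaque cell contents; the locus tags, gene names, and column order are assumptions for illustration, not real `po2anno.pl` output.

```shell
acm=$(mktemp)
# Hypothetical ACM excerpt: OG number, then one column per genome
printf '1\tA_001 dnaA\tB_001 dnaA\n2\tA_002 transposase\tB_002 transposase\n3\tA_003 recA\tB_003 recA\n' > "$acm"

# Keep the OG number and the second genome column only
cut -f 1,3 "$acm"

# Rows whose annotation mentions a keyword, case-insensitively
grep -i 'transposase' "$acm"

# First two rows, e.g. for a quick look at a large matrix
head -n 2 "$acm"

rm -f "$acm"
```

With real output, replace `"$acm"` with the redirected file, e.g. `annotation_comparison.tsv`.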
122
123 ## Run environment
124
125 The Perl script runs under Windows and UNIX flavors.
126
127 ## Author - contact
128
129 Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
130
131 ## Citation, installation, and license
132
133 For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
134
135 ## Changelog
136
137 * v0.2.2 (23.10.2015)
138 * minor syntax changes to `po2anno.pl` and README
139 * changed option **-g|-genome_dir** to **-d|-dir_genome** for consistency with [`po2group_stats.pl`](/po2group_stats)
140 * v0.2.1 (07.09.2015)
141 * get rid of underscores in product annotation strings (from [`cds_extractor.pl`](/cds_extractor))
142 * debugged hard-coded relative path for `$genome_file_path`
143 * v0.2 (15.01.2015)
144 * give number of query-specific OGs and total query singletons/ORFans in final stat output
145 * changed final stat output to a more readable format
146 * fixed bug: %Query_ID_Seen included also non-query IDs, which luckily had no consequences
147 * v0.1 (18.12.2014)