po2anno
=======

`po2anno.pl` is a script to create an annotation comparison matrix from [Proteinortho5](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) output.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
    * [cds_extractor](#cds_extractor)
    * [Proteinortho5](#proteinortho5)
    * [po2anno](#po2anno)
* [Options](#options)
    * [Mandatory options](#mandatory-options)
    * [Optional options](#optional-options)
* [Output](#output)
* [Run environment](#run-environment)
* [Author - contact](#author---contact)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl po2anno.pl -i matrix.proteinortho -d genome_fasta_dir/ -l -a > annotation_comparison.tsv

## Description

Supplement an ortholog/paralog output matrix from a
[**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/)
calculation with annotation information. The resulting tab-separated
annotation comparison matrix (ACM) is mainly intended for the
transfer of high-quality annotations from reference genomes to
homologs (orthologs and co-orthologs/paralogs) in a query genome
(e.g. in conjunction with [`tbl2tab.pl`](/tbl2tab)). It can also be
used for a quick look at the annotation of genes that are present in
only a few of the input genomes and absent from the others.

Annotation is retrieved from multi-FASTA files created with
[`cds_extractor.pl`](/cds_extractor). See
[`cds_extractor.pl`](/cds_extractor) for a description of the
format. The same files serve as input for the Proteinortho5 analysis,
and the directory containing them is passed to `po2anno.pl` with
option **-d**.

**Proteinortho5** (PO) has to be run with option **-singles** to also
include genes without orthologs, so-called singletons/ORFans, for each
genome in the PO matrix (see the
[PO manual](http://www.bioinf.uni-leipzig.de/Software/proteinortho/manual.html)).
Additionally, option **-selfblast** is recommended to enhance paralog
detection by PO.
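
A quick sanity check for the **-singles** requirement could look like the sketch below. It is not part of `po2anno.pl`. It assumes the usual PO matrix layout, in which the first two columns hold the number of species and the number of genes of each group and header lines start with `#`; the function name `count_singletons` is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of po2anno.pl.
# Count the rows of a PO matrix that contain exactly one gene from one genome.
# Assumption: column 1 = number of species, column 2 = number of genes,
# and header/comment lines start with '#'.
count_singletons() {
    awk -F '\t' '!/^#/ && $1 == 1 && $2 == 1 { n++ } END { print n + 0 }' "$1"
}
```

Calling `count_singletons matrix.proteinortho` prints the number of such rows. A result of `0` for a set of real genomes suggests that PO was run without **-singles**.
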

Each orthologous group (OG) is listed in a row of the resulting ACM.
The first column holds the OG numbers from the PO input matrix (i.e.
line number minus one). The following columns specify the
orthologous CDS for each input genome. For each CDS the ID,
optionally the length in bp (option **-l**), gene, EC number(s), and
product are shown depending on their presence in the CDS's
annotation. The ID is in most cases the locus tag (see
[`cds_extractor.pl`](/cds_extractor)). If several EC numbers exist
for a single CDS they're separated by ';'. If an OG includes
paralogs, i.e. co-orthologs from a single genome, these will be
printed in the following row(s) **without** a new OG number in the
first column. The order of paralogous CDSs within an OG is
arbitrary.
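
Because paralog rows carry no OG number, line-based tools such as `grep` or `sort` lose the link between a paralog and its group. The sketch below copies the OG number down into those rows. It is a post-processing idea only, not something `po2anno.pl` does; it assumes that paralog rows have an empty first field and that any header line starts with `#`. The name `fill_og_numbers` is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of po2anno.pl.
# Repeat the OG number in paralog rows of an ACM.
# Assumption: paralog rows have an empty first tab-separated field,
# and header/comment lines (if any) start with '#'.
fill_og_numbers() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/     { print; next }   # pass header/comment lines through unchanged
         $1 != "" { og = $1 }       # remember the OG number of the current group
         $1 == "" { $1 = og }       # paralog row: reuse the remembered number
                  { print }' "$1"
}
```

For example, `fill_og_numbers annotation_comparison.tsv > acm_filled.tsv` writes a copy in which every row starts with its OG number.
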

The OGs are sorted numerically via the query ID (see option **-q**).
If option **-a** is set, the non-query OGs are appended to the output
after the query OGs, sorted numerically via OG number.

## Usage

### [`cds_extractor`](/cds_extractor)

    for i in *.[gbk|embl]; do perl cds_extractor.pl -i $i [-p|-n]; done

### [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/)

    proteinortho5.pl -graph [-synteny] -cpus=# -selfblast -singles -identity=50 -cov=50 -blastParameters='-use_sw_tback [-seg no|-dust no]' *.[faa|ffn]

### po2anno

    perl po2anno.pl -i matrix.[proteinortho|poff] -d genome_fasta_dir/ -q query.[faa|ffn] -l -a > annotation_comparison.tsv

## Options

### Mandatory options

- **-i**=_str_, **-input**=_str_

    Proteinortho (PO) result matrix (\*.proteinortho or \*.poff), or piped *STDIN* (-)

- **-d**=_str_, **-dir\_genome**=_str_

    Path to the directory including the genome multi-FASTA PO input files (\*.faa or \*.ffn), created with [`cds_extractor.pl`](/cds_extractor)

### Optional options

- **-h**, **-help**

    Help (perldoc POD)

- **-q**=_str_, **-query**=_str_

    Query genome (has to be identical to the string in the PO matrix) [default = first one in alphabetical order]

- **-l**, **-length**

    Include length of each CDS in bp

- **-a**, **-all**

    Append non-query orthologous groups (OGs) to the output

- **-v**, **-version**

    Print version number to *STDERR*
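
Since the **-q** string has to match the genome name in the PO matrix exactly, it can help to list the names before running `po2anno.pl`. The sketch below assumes the usual PO header, whose first three columns are bookkeeping (`# Species`, `Genes`, `Alg.-Conn.`) and whose remaining columns are the genome file names. The function name `list_po_genomes` is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of po2anno.pl.
# Print the genome names of a PO matrix, one per line.
# Assumption: the first line is the header and the genome columns start at field 4.
list_po_genomes() {
    head -n 1 "$1" | cut -f 4- | tr '\t' '\n'
}
```

Run it as `list_po_genomes matrix.proteinortho` and copy the wanted name verbatim into **-q**.
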

## Output

- *STDOUT*

    The resulting tab-delimited ACM is printed to *STDOUT*. Redirect or pipe into another tool as needed (e.g. `cut`, `grep`, `head`, or `tail`).
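
As one possible way to filter the ACM, the sketch below searches it case-insensitively and keeps the first line, so a header (if the file has one, which is an assumption here) is not lost. The function name `grep_acm` is made up for this example. Note that a match in a paralog row is printed without the OG number of its group.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of po2anno.pl.
# Case-insensitive search in an ACM that always keeps the first line.
# Usage: grep_acm PATTERN [FILE]   (reads STDIN when FILE is omitted)
# Assumption: the first line of the ACM is a header worth keeping.
grep_acm() {
    awk -v pat="$1" 'NR == 1 || tolower($0) ~ tolower(pat)' "${2:--}"
}
```

Typical calls would be `grep_acm transposase annotation_comparison.tsv` or, in a pipe, `perl po2anno.pl ... | grep_acm transposase`.
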

## Run environment

The Perl script runs under Windows and UNIX flavors.

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

* v0.2.2 (23.10.2015)
    * minor syntax changes to `po2anno.pl` and README
    * changed option **-g|-genome_dir** to **-d|-dir_genome** for consistency with [`po2group_stats.pl`](/po2group_stats)
* v0.2.1 (07.09.2015)
    * get rid of underscores in product annotation strings (from [`cds_extractor.pl`](/cds_extractor))
    * debugged hard-coded relative path for `$genome_file_path`
* v0.2 (15.01.2015)
    * give number of query-specific OGs and total query singletons/ORFans in final stat output
    * changed final stat output to a more readable format
    * fixed bug: `%Query_ID_Seen` also included non-query IDs, which luckily had no consequences
* v0.1 (18.12.2014)