comparison COG/bac-genomics-scripts/po2anno/README.md @ 3:e42d30da7a74 draft
Uploaded
author | dereeper |
---|---|
date | Thu, 30 May 2024 11:52:25 +0000 |
parents | |
children |
comparison
2:97e4e3e818b6 | 3:e42d30da7a74 |
---|---|
1 po2anno | |
2 ======= | |
3 | |
4 `po2anno.pl` is a script to create an annotation comparison matrix from [Proteinortho5](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) output. | |
5 | |
6 * [Synopsis](#synopsis) | |
7 * [Description](#description) | |
8 * [Usage](#usage) | |
9 * [cds_extractor](#cds_extractor) | |
10 * [Proteinortho5](#proteinortho5) | |
11 * [po2anno](#po2anno) | |
12 * [Options](#options) | |
13 * [Mandatory options](#mandatory-options) | |
14 * [Optional options](#optional-options) | |
15 * [Output](#output) | |
16 * [Run environment](#run-environment) | |
17 * [Author - contact](#author---contact) | |
18 * [Citation, installation, and license](#citation-installation-and-license) | |
19 * [Changelog](#changelog) | |
20 | |
21 ## Synopsis | |
22 | |
23 perl po2anno.pl -i matrix.proteinortho -d genome_fasta_dir/ -l -a > annotation_comparison.tsv | |
24 | |
25 ## Description | |
26 | |
27 Supplement an ortholog/paralog output matrix from a | |
28 [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) | |
29 calculation with annotation information. The resulting tab-separated | |
30 annotation comparison matrix (ACM) is mainly intended for the | |
31 transfer of high quality annotations from reference genomes to | |
32 homologs (orthologs and co-orthologs/paralogs) in a query genome | |
33 (e.g. in conjunction with [`tbl2tab.pl`](/tbl2tab)). It can also | |
34 be used for a quick look at the annotation of genes that are | |
35 present in only a few of the input genomes and absent from the | |
36 others. | |
37 | |
38 Annotation is retrieved from multi-FASTA files created with | |
39 [`cds_extractor.pl`](/cds_extractor). See | |
40 [`cds_extractor.pl`](/cds_extractor) for a description of the | |
41 format. These files are used as input for the Proteinortho5 (PO) | |
42 analysis and for option **-d** of `po2anno.pl`. | |
43 | |
44 **Proteinortho5** (PO) has to be run with option **-singles** to also | |
45 include genes without orthologs, so-called singletons/ORFans, for each | |
46 genome in the PO matrix (see the | |
47 [PO manual](http://www.bioinf.uni-leipzig.de/Software/proteinortho/manual.html)). | |
48 Additionally, option **-selfblast** is recommended to enhance paralog | |
49 detection by PO. | |
50 | |
51 Each orthologous group (OG) is listed in a row of the resulting ACM; | |
52 the first column holds the OG numbers from the PO input matrix (i.e. | |
53 line number minus one). The following columns specify the | |
54 orthologous CDS for each input genome. For each CDS the ID, | |
55 optionally the length in bp (option **-l**), gene, EC number(s), and | |
56 product are shown depending on their presence in the CDS's | |
57 annotation. The ID is in most cases the locus tag (see | |
58 [`cds_extractor.pl`](/cds_extractor)). If several EC numbers exist | |
59 for a single CDS they're separated by ';'. If an OG includes | |
60 paralogs, i.e. co-orthologs from a single genome, these will be | |
61 printed in the following row(s) **without** a new OG number in the | |
62 first column. The order of paralogous CDSs within an OG is | |
63 arbitrary. | |
64 | |
65 The OGs are sorted numerically by the query ID (see option **-q**). | |
66 If option **-a** is set, the non-query OGs are appended to the output | |
67 after the query OGs, sorted numerically by OG number. | |
68 | |
69 ## Usage | |
70 | |
71 ### [`cds_extractor`](/cds_extractor) | |
72 | |
73 for i in *.[gbk|embl]; do perl cds_extractor.pl -i $i [-p|-n]; done | |
74 | |
75 ### [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) | |
76 | |
77 proteinortho5.pl -graph [-synteny] -cpus=# -selfblast -singles -identity=50 -cov=50 -blastParameters='-use_sw_tback [-seg no|-dust no]' *.[faa|ffn] | |
78 | |
79 ### po2anno | |
80 | |
81 perl po2anno.pl -i matrix.[proteinortho|poff] -d genome_fasta_dir/ -q query.[faa|ffn] -l -a > annotation_comparison.tsv | |
82 | |
83 ## Options | |
84 | |
85 ### Mandatory options | |
86 | |
87 - **-i**=_str_, **-input**=_str_ | |
88 | |
89 Proteinortho (PO) result matrix (\*.proteinortho or \*.poff), or piped *STDIN* (-) | |
90 | |
91 - **-d**=_str_, **-dir\_genome**=_str_ | |
92 | |
93 Path to the directory including the genome multi-FASTA PO input files (\*.faa or \*.ffn), created with [`cds_extractor.pl`](/cds_extractor) | |
94 | |
95 ### Optional options | |
96 | |
97 - **-h**, **-help** | |
98 | |
99 Help (perldoc POD) | |
100 | |
101 - **-q**=_str_, **-query**=_str_ | |
102 | |
103 Query genome (has to be identical to the string in the PO matrix) [default = first one in alphabetical order] | |
104 | |
105 - **-l**, **-length** | |
106 | |
107 Include length of each CDS in bp | |
108 | |
109 - **-a**, **-all** | |
110 | |
111 Append non-query orthologous groups (OGs) to the output | |
112 | |
113 - **-v**, **-version** | |
114 | |
115 Print version number to *STDERR* | |
116 | |
117 ## Output | |
118 | |
119 - *STDOUT* | |
120 | |
121 The resulting tab-delimited ACM is printed to *STDOUT*. Redirect or pipe into another tool as needed (e.g. `cut`, `grep`, `head`, or `tail`). | |
122 | |
123 ## Run environment | |
124 | |
125 The Perl script runs under Windows and UNIX flavors. | |
126 | |
127 ## Author - contact | |
128 | |
129 Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) | |
130 | |
131 ## Citation, installation, and license | |
132 | |
133 For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). | |
134 | |
135 ## Changelog | |
136 | |
137 * v0.2.2 (23.10.2015) | |
138 * minor syntax changes to `po2anno.pl` and README | |
139 * changed option **-g|-genome_dir** to **-d|-dir_genome** for consistency with [`po2group_stats.pl`](/po2group_stats) | |
140 * v0.2.1 (07.09.2015) | |
141 * get rid of underscores in product annotation strings (from [`cds_extractor.pl`](/cds_extractor)) | |
142 * debugged hard-coded relative path for `$genome_file_path` | |
143 * v0.2 (15.01.2015) | |
144 * give number of query-specific OGs and total query singletons/ORFans in final stat output | |
144 * changed final stat output to a more readable format | |
145 * fixed bug: %Query_ID_Seen also included non-query IDs, which luckily had no consequences | |
147 * v0.1 (18.12.2014) |
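
The Description notes that paralog rows are printed without an OG number in the first column, which can trip up tools such as `grep` or `cut` when the ACM is filtered row by row. The following is a minimal sketch, not part of `po2anno.pl`, of one way to copy the OG number down into those rows with `awk`. The helper names `acm_sample` and `fill_og` and the three-column sample rows are hypothetical and only illustrate the layout described above; the real cell contents and any header line depend on `po2anno.pl` and the input genomes.

```shell
# Hypothetical three-column ACM excerpt (OG number, genome A, genome B).
# The cell contents are invented for illustration only.
acm_sample() {
  printf '1\tA_0001 dnaA\tB_0001 dnaA\n'
  printf '\t\tB_0457 dnaA\n'   # paralog row: first (OG) column is empty
  printf '2\tA_0002 dnaN\tB_0002 dnaN\n'
}

# Remember the last non-empty OG number and write it into rows
# whose first column is empty; all other columns are left untouched.
fill_og() {
  awk 'BEGIN { FS = OFS = "\t" }
       $1 != "" { og = $1 }
       $1 == "" { $1 = og }
       { print }'
}

acm_sample | fill_og
```

With a real matrix the filter could be placed at the end of the documented call, for example `perl po2anno.pl -i matrix.proteinortho -d genome_fasta_dir/ -l -a | fill_og > annotation_comparison_filled.tsv` (the output file name is arbitrary). A single tab is used as the field separator so that empty cells for genomes without a CDS in the OG are preserved.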