annotate COG/bac-genomics-scripts/po2anno/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
po2anno
=======

`po2anno.pl` is a script to create an annotation comparison matrix from [Proteinortho5](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) output.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
    * [cds_extractor](#cds_extractor)
    * [Proteinortho5](#proteinortho5)
    * [po2anno](#po2anno)
* [Options](#options)
    * [Mandatory options](#mandatory-options)
    * [Optional options](#optional-options)
* [Output](#output)
* [Run environment](#run-environment)
* [Author - contact](#author---contact)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl po2anno.pl -i matrix.proteinortho -d genome_fasta_dir/ -l -a > annotation_comparison.tsv

## Description

Supplement an ortholog/paralog output matrix from a
[**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/)
calculation with annotation information. The resulting tab-separated
annotation comparison matrix (ACM) is mainly intended for the
transfer of high-quality annotations from reference genomes to
homologs (orthologs and co-orthologs/paralogs) in a query genome
(e.g. in conjunction with [`tbl2tab.pl`](/tbl2tab)). It can also
be used for a quick glance at the annotation of genes present in
only a few of the input genomes compared to the others.

Annotation is retrieved from multi-FASTA files created with
[`cds_extractor.pl`](/cds_extractor). See
[`cds_extractor.pl`](/cds_extractor) for a description of the
format. These files are used as input for the PO analysis and for
option **-d** of `po2anno.pl`.

**Proteinortho5** (PO) has to be run with option **-singles** to also
include genes without orthologs, so-called singletons/ORFans, for each
genome in the PO matrix (see the
[PO manual](http://www.bioinf.uni-leipzig.de/Software/proteinortho/manual.html)).
Additionally, option **-selfblast** is recommended to enhance paralog
detection by PO.

Each orthologous group (OG) is listed in a row of the resulting ACM.
The first column holds the OG numbers from the PO input matrix (i.e.
line number minus one). The following columns specify the
orthologous CDS for each input genome. For each CDS the ID,
optionally the length in bp (option **-l**), gene, EC number(s), and
product are shown depending on their presence in the CDS's
annotation. The ID is in most cases the locus tag (see
[`cds_extractor.pl`](/cds_extractor)). If several EC numbers exist
for a single CDS they're separated by ';'. If an OG includes
paralogs, i.e. co-orthologs from a single genome, these will be
printed in the following row(s) **without** a new OG number in the
first column. The order of paralogous CDSs within an OG is
arbitrary.

The OGs are sorted numerically via the query ID (see option **-q**).
If option **-a** is set, the non-query OGs are appended to the output
after the query OGs, sorted numerically via OG number.
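
Because paralog rows carry no OG number in the first column, line-based tools such as `grep` or `sort` lose the link between a paralog and its OG. The following is a minimal shell sketch that repeats the last seen OG number on such rows. It assumes that the first tab-separated field of a paralog row is empty; `fill_og` is a hypothetical helper name and not part of `po2anno.pl`.

```shell
# fill_og: read a tab-separated ACM on STDIN and write it to STDOUT with the
# most recent OG number copied into rows whose first column is empty
# (assumed here to be the paralog continuation rows).
fill_og() {
    awk 'BEGIN { FS = OFS = "\t" }
         $1 != "" { og = $1 }   # remember the most recent OG number
         $1 == "" { $1 = og }   # fill it into continuation rows
         { print }'
}
```

It could be used as `fill_og < annotation_comparison.tsv > annotation_comparison_filled.tsv`, where the output file name is only an example.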

## Usage

### [`cds_extractor`](/cds_extractor)

    for i in *.[gbk|embl]; do perl cds_extractor.pl -i $i [-p|-n]; done
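
In the loop above, `*.[gbk|embl]` and `[-p|-n]` are shorthand for "either extension" and "one of the two options". Taken literally, a shell reads square brackets in a file pattern as a single-character class, so the pattern would not match `*.gbk` or `*.embl` files. A sketch of a literal equivalent, where `annotation_files` is a hypothetical helper name:

```shell
# annotation_files: print each *.gbk and *.embl file in the current
# directory, one per line.
annotation_files() {
    for f in *.gbk *.embl; do
        if [ -e "$f" ]; then      # an unmatched pattern stays literal; skip it
            printf '%s\n' "$f"
        fi
    done
}

# Usage (needs cds_extractor.pl in the current directory; pick -p or -n
# as in the usage line above):
# annotation_files | while IFS= read -r f; do perl cds_extractor.pl -i "$f" -p; done
```

Quoting `"$f"` keeps file names with spaces intact.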

### [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/)

    proteinortho5.pl -graph [-synteny] -cpus=# -selfblast -singles -identity=50 -cov=50 -blastParameters='-use_sw_tback [-seg no|-dust no]' *.[faa|ffn]

### po2anno

    perl po2anno.pl -i matrix.[proteinortho|poff] -d genome_fasta_dir/ -q query.[faa|ffn] -l -a > annotation_comparison.tsv

## Options

### Mandatory options

- **-i**=_str_, **-input**=_str_

    Proteinortho (PO) result matrix (\*.proteinortho or \*.poff), or piped *STDIN* (-)

- **-d**=_str_, **-dir\_genome**=_str_

    Path to the directory including the genome multi-FASTA PO input files (\*.faa or \*.ffn), created with [`cds_extractor.pl`](/cds_extractor)

### Optional options

- **-h**, **-help**

    Help (perldoc POD)

- **-q**=_str_, **-query**=_str_

    Query genome (has to be identical to the string in the PO matrix) [default = first one in alphabetical order]

- **-l**, **-length**

    Include length of each CDS in bp

- **-a**, **-all**

    Append non-query orthologous groups (OGs) to the output

- **-v**, **-version**

    Print version number to *STDERR*
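
The string given to **-q** has to match a genome name in the PO matrix exactly. The following sketch lists these names. It assumes that the Proteinortho5 matrix starts with a tab-separated header line whose first three columns are `# Species`, `Genes`, and `Alg.-Conn.`, followed by one column per genome file; `po_genomes` is a hypothetical helper name.

```shell
# po_genomes: read a Proteinortho matrix on STDIN and print the genome
# names from its header line, one per line.
# Assumption: the header is the first line and the genome columns start
# at the fourth tab-separated field.
po_genomes() {
    head -n 1 | cut -f 4- | tr '\t' '\n'
}
```

With this, `po_genomes < matrix.proteinortho | LC_ALL=C sort | head -n 1` prints the first name in byte order. Whether that is the same genome `po2anno.pl` picks as its default query depends on how the script sorts, which is an assumption here.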

## Output

- *STDOUT*

    The resulting tab-delimited ACM is printed to *STDOUT*. Redirect or pipe into another tool as needed (e.g. `cut`, `grep`, `head`, or `tail`).

## Run environment

The Perl script runs under Windows and UNIX flavors.

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

* v0.2.2 (23.10.2015)
    * minor syntax changes to `po2anno.pl` and README
    * changed option **-g|-genome_dir** to **-d|-dir_genome** for consistency with [`po2group_stats.pl`](/po2group_stats)
* v0.2.1 (07.09.2015)
    * get rid of underscores in product annotation strings (from [`cds_extractor.pl`](/cds_extractor))
    * debugged hard-coded relative path for `$genome_file_path`
* v0.2 (15.01.2015)
    * give number of query-specific OGs and total query singletons/ORFans in final stat output
    * changed final stat output to an easier readable format
    * fixed bug: %Query_ID_Seen included also non-query IDs, which luckily had no consequences
* v0.1 (18.12.2014)