Proteinortho with POFF - An orthology detection tool
What it does
Proteinortho is a tool to detect orthologous proteins/genes within different species (at least 2).
It compares the similarity of the given gene/protein sequences and clusters them to find significant groups. The algorithm was designed to handle large-scale data and can be applied to hundreds of species at once. Details can be found in doi:10.1186/1471-2105-12-124 and doi:10.3389/fbinf.2023.1322477.
To enhance prediction accuracy, the relative order of genes (synteny) can be used as an additional feature for discriminating orthologs. The corresponding extension, PoFF (for details see doi:10.1371/journal.pone.0105015), is already built into Proteinortho.
Proteinortho in a nutshell
(i) Build adaptive reciprocal best hit graph (RBH)
Using a BLAST-like search algorithm (diamond, blast, blat, ...), all input sequences are compared against each other. If two proteins find each other with respect to multiple criteria, such as a minimal evalue and a similarity close to that of the best hit, an edge is drawn between the two proteins. The result of this step is written to the RBH output.
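The idea of an adaptive reciprocal best hit can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Proteinortho's implementation: the function names, the tuple layout of the hits, and the thresholds `max_evalue` and `sim` are hypothetical. A target is kept for a query if its evalue is small enough and its bitscore is close to that of the query's best hit; an edge is drawn only if this holds in both search directions.

```python
from collections import defaultdict


def adaptive_best_hits(hits, max_evalue=1e-5, sim=0.95):
    """Map each query to the targets scoring close to its best hit.

    hits: iterable of (query, target, evalue, bitscore) for ONE search direction.
    max_evalue and sim are illustrative thresholds (assumptions).
    """
    by_query = defaultdict(list)
    for query, target, evalue, bitscore in hits:
        if evalue <= max_evalue:  # discard insignificant hits
            by_query[query].append((target, bitscore))

    kept = {}
    for query, candidates in by_query.items():
        best = max(score for _, score in candidates)
        # "adaptive": keep every target within a fraction `sim` of the best score
        kept[query] = {t for t, score in candidates if score >= sim * best}
    return kept


def reciprocal_edges(hits_ab, hits_ba, **thresholds):
    """Return sorted (a, b) pairs that select each other in both directions."""
    best_ab = adaptive_best_hits(hits_ab, **thresholds)
    best_ba = adaptive_best_hits(hits_ba, **thresholds)
    return sorted(
        (a, b)
        for a, targets in best_ab.items()
        for b in targets
        if a in best_ba.get(b, set())
    )
```

With `sim=1.0` this reduces to a strict reciprocal best hit; lower values admit near-best hits such as co-orthologs.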
(ii) Cluster the RBH
Using two clustering algorithms, edges that only weakly connect two connected components are removed to reduce false-positive hits. The resulting connected components are written to the orthology-groups and orthology-pairs outputs.
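The final part of this step, reading off connected components from the remaining edges, can be sketched as follows. The sketch deliberately omits the edge-removal logic of the two clustering algorithms and only shows how groups emerge from a given edge list; the function name `connected_components` is hypothetical.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Group node ids into connected components of an undirected graph.

    edges: iterable of (node_a, node_b) pairs, e.g. the edges left after clustering.
    Returns a list of sorted components, ordered by their smallest node id.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # breadth-first search collects everything reachable from `start`
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components
```

Each returned component corresponds to one candidate orthology group.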
Proteinortho output files
RBH
The result of step (i), the reciprocal best hit graph. The first two comment lines announce the 2 species (# ecoli.faa human.faa) and the median values (evalue_ab, bitscore_ab, evalue_ba, bitscore_ba). Following these header lines, each line corresponds to a reciprocal best hit of 2 proteins/genes (columns 1 and 2) of the announced species. The output format is shown below.
seqidA, seqidB = the 2 ids/names of the proteins involved
evalue_ab = evalue with seqidA as query and seqidB as part of the database
bitscore_ab = bitscore with seqidA as query ...
evalue_ba = evalue with seqidB as query ...
seqidA | seqidB | evalue_ab | bitscore_ab | evalue_ba | bitscore_ba |
# ecoli.faa | human.faa | | | | |
# 1.91e-112 | 357.5 | 1.825e-113 | 360 | | |
L_10 | C_10;test | 4.32e-151 | 447 | 4.30e-151 | 446 |
L_11 | C_11 | 1.17e-68 | 209 | 3.00e-69 | 210 |
L_14 | C_14 | 3.64e-139 | 422 | 1.19e-142 | 431 |
L_15 | C_15 | 3.51e-100 | 303 | 2.12e-102 | 308 |
L_16 | C_16 | 3.75e-49 | 157 | 7.06e-50 | 159 |
L_17 | C_17 | 2.96e-195 | 578 | 5.50e-196 | 579 |
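For downstream scripting, an RBH file of this shape can be read with a small parser. The sketch below assumes the file itself is tab-separated (the table above is only a rendering) and uses a simple heuristic: a comment line with exactly two fields names the current species pair, while other comment lines (such as the medians) are skipped. The function name `parse_rbh` and the dictionary keys are hypothetical.

```python
def parse_rbh(lines, sep="\t"):
    """Parse RBH lines into a list of dicts, one per reciprocal best hit.

    Assumes tab-separated columns:
    seqidA, seqidB, evalue_ab, bitscore_ab, evalue_ba, bitscore_ba.
    """
    records = []
    species = None
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("#"):
            fields = line.lstrip("#").strip().split(sep)
            if len(fields) == 2:  # heuristic: "# speciesA <tab> speciesB"
                species = tuple(f.strip() for f in fields)
            continue  # other comment lines (e.g. medians) are ignored here
        fields = [f.strip() for f in line.split(sep)]
        if len(fields) < 6:
            continue  # skip malformed lines instead of guessing
        records.append({
            "species": species,
            "seqidA": fields[0],
            "seqidB": fields[1],
            "evalue_ab": float(fields[2]),
            "bitscore_ab": float(fields[3]),
            "evalue_ba": float(fields[4]),
            "bitscore_ba": float(fields[5]),
        })
    return records
```

A typical call would be `parse_rbh(open(path))`, where `path` points to the RBH output of a run.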
orthology-groups
The result of step (ii), the clustered reciprocal best hit graph, i.e. the orthology groups. Every line corresponds to an orthology group. The first 3 columns characterize the general properties of that group: number of species, number of proteins/genes, and algebraic connectivity. The higher the algebraic connectivity, the more edges there are and the better connected the group is in general. Then a column follows for each species, containing the proteins of that species. If a species contributes more than one protein to a group of orthologs, they are ordered by descending connectivity. A '*' indicates that the species does not contribute to the group.
Species | Genes | alg.-conn. | ecoli.faa | human.faa | snail.faa | wale.faa | ebola.faa |
5 | 5 | 0.715 | C_10 | C_10;test | E_10 | L_10 | M_10 |
4 | 6 | 0.115 | C_12 | E_315 | L_313 | M_313 | |
4 | 5 | 0.167 | C_63 | E_19 | L_19 | M_19 | |
4 | 4 | 0.816 | C_64 | E_18 | L_18 | M_18 |
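A table of this shape can be loaded into Python for further analysis. The sketch below rests on several assumptions that are not stated above: the file is tab-separated, the first non-empty line is the header carrying the species names, and multiple proteins of one species are separated by commas. Only the '*' placeholder for a non-contributing species is taken from the description above. The function name `parse_groups` and the dictionary keys are hypothetical.

```python
def parse_groups(lines, sep="\t", protein_sep=","):
    """Parse orthology-groups lines into a list of dicts, one per group.

    Assumed columns: species count, gene count, algebraic connectivity,
    then one column per species.
    """
    species_names = None
    groups = []
    for line in lines:
        fields = [f.strip() for f in line.rstrip("\n").split(sep)]
        if len(fields) < 4:
            continue  # blank or malformed line
        if species_names is None:
            species_names = fields[3:]  # header row: species columns start at index 3
            continue
        members = {}
        for name, cell in zip(species_names, fields[3:]):
            if cell and cell != "*":  # '*' means the species is absent
                members[name] = cell.split(protein_sep)
        groups.append({
            "n_species": int(fields[0]),
            "n_genes": int(fields[1]),
            "alg_conn": float(fields[2]),
            "members": members,
        })
    return groups
```

Filtering for groups that contain every species is then a matter of comparing `len(group["members"])` with the number of species columns.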
orthology-pairs
The same as orthology-groups but every edge is printed one-by-one instead of the whole group. The output is formatted the same as the RBH graph:
seqidA | seqidB | evalue_ab | bitscore_ab | evalue_ba | bitscore_ba |
Proteinortho-Tools for downstream analysis
More information can be found on GitLab: https://gitlab.com/paulklemm_PHD/proteinortho