
Proteinortho (version 6.3.1+galaxy0)

Select the input fasta files (>=2):

The input fasta files. At least 2 are needed!

Similarity comparison algorithm:

In the first step, Proteinortho builds an all-versus-all reciprocal best hit graph from the input files using this algorithm.

Minimal reciprocal similarity in %:

This and --evalue are the main parameters for generating the reciprocal best hit graph. 1 = only the best reciprocal hits are reported; 0 = all possible reciprocal BLAST matches (within the E-value cutoff) are reported.

Minimal algebraic connectivity:

This is the main parameter for the clustering step. The larger the value, the more splits are performed, resulting in more and smaller clusters. A value of 0 corresponds to no clustering.
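For intuition, the algebraic connectivity of a graph is the second-smallest eigenvalue of its Laplacian matrix: it is 0 for a disconnected graph and grows as the graph becomes more densely connected. The following is a minimal sketch using shifted power iteration on the textbook unnormalized Laplacian. The function name and the iteration scheme are illustrative assumptions, not Proteinortho's implementation, and Proteinortho may scale its reported value differently, so the numbers are not directly comparable to the tool's output.

```python
def algebraic_connectivity(n, edges, iterations=1000):
    """Approximate the second-smallest eigenvalue of the Laplacian L = D - A."""
    neighbours = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbours[u].add(v)
            neighbours[v].add(u)
    max_degree = max((len(nb) for nb in neighbours), default=0)
    if n < 2 or max_degree == 0:
        return 0.0

    # All Laplacian eigenvalues lie in [0, 2 * max_degree], so shift * I - L
    # is positive definite. On the subspace orthogonal to the all-ones vector
    # its largest eigenvalue is shift - lambda_2, which power iteration finds.
    shift = 2.0 * max_degree + 1.0

    def apply_laplacian(x):
        return [len(neighbours[i]) * x[i] - sum(x[j] for j in neighbours[i])
                for i in range(n)]

    def centre_and_normalise(x):
        mean = sum(x) / n
        centred = [xi - mean for xi in x]
        norm = sum(c * c for c in centred) ** 0.5
        return [c / norm for c in centred]

    # Deterministic start vector; assumed not to be orthogonal to the
    # eigenvector we are looking for (true for the examples below).
    x = centre_and_normalise([float(i) for i in range(n)])
    for _ in range(iterations):
        lx = apply_laplacian(x)
        x = centre_and_normalise([shift * xi - lxi for xi, lxi in zip(x, lx)])

    # Rayleigh quotient of the converged unit vector.
    lx = apply_laplacian(x)
    return sum(xi * lxi for xi, lxi in zip(x, lx))


# Two triangles joined by a single bridging edge: weakly connected.
two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
# A complete graph on four nodes: densely connected.
complete_four = [(i, j) for i in range(4) for j in range(i + 1, 4)]

print(round(algebraic_connectivity(6, two_triangles), 4))
print(round(algebraic_connectivity(4, complete_four), 4))
```

The bridged triangles score far lower than the complete graph, which is the kind of weakly connected group that a higher connectivity threshold would split.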

Additional Options


Activate synteny feature (POFF):

To enhance the prediction accuracy, the relative order of genes (synteny) can be used as an additional feature for the discrimination of orthologs. For more details see doi:10.1371/journal.pone.0105015.

Proteinortho with POFF - An orthology detection tool

What it does

Proteinortho is a tool to detect orthologous proteins/genes within different species (at least 2).

It compares similarities of given gene/protein sequences and clusters them to find significant groups.

The algorithm was designed to handle large-scale data and can be applied to hundreds of species at once.

Details can be found in (doi:10.1186/1471-2105-12-124 and doi:10.3389/fbinf.2023.1322477).

To enhance the prediction accuracy, the relative order of genes (synteny) can be used as an additional feature for the discrimination of orthologs. The corresponding extension, PoFF (for details see doi:10.1371/journal.pone.0105015), is already built into Proteinortho.

Proteinortho in a nutshell

(i) Build adaptive reciprocal best hit graph (RBH)

Using a BLAST-like search algorithm (diamond, blast, blat, ...), all input sequences are compared against each other.

If two proteins find each other with respect to multiple criteria, such as a minimal E-value and similarity relative to the best hit, an edge is drawn between them.

The result of this step is written to the RBH output.

(ii) Cluster the RBH

Using two clustering algorithms, edges that only weakly connect two densely connected subgraphs are removed to reduce false-positive hits.

The resulting connected components are written to the orthology-groups and orthology-pairs outputs.
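Step (i) can be illustrated with a minimal Python sketch. It assumes the search hits are already available as (query, subject, bitscore) tuples for each direction, and it interprets the similarity parameter as a fraction of the best bitscore per query. The function names and this exact rule are illustrative assumptions rather than Proteinortho's actual code, which applies further criteria such as the E-value cutoff.

```python
from collections import defaultdict


def adaptive_best_hits(hits, sim):
    """Per query, keep every subject scoring at least sim * best bitscore."""
    by_query = defaultdict(list)
    for query, subject, bitscore in hits:
        by_query[query].append((subject, bitscore))
    kept = {}
    for query, candidates in by_query.items():
        best = max(score for _, score in candidates)
        kept[query] = {subj for subj, score in candidates if score >= sim * best}
    return kept


def reciprocal_edges(hits_ab, hits_ba, sim):
    """Return (a, b) pairs that select each other in both search directions."""
    best_ab = adaptive_best_hits(hits_ab, sim)
    best_ba = adaptive_best_hits(hits_ba, sim)
    edges = set()
    for a, subjects in best_ab.items():
        for b in subjects:
            if a in best_ba.get(b, ()):
                edges.add((a, b))
    return edges


# Hypothetical hits: species A proteins as queries, then species B as queries.
hits_ab = [("L_10", "C_10", 447), ("L_10", "C_11", 430), ("L_11", "C_11", 209)]
hits_ba = [("C_10", "L_10", 446), ("C_11", "L_11", 210), ("C_11", "L_10", 205)]

print(sorted(reciprocal_edges(hits_ab, hits_ba, sim=1.0)))
print(sorted(reciprocal_edges(hits_ab, hits_ba, sim=0.95)))
```

With sim=1.0 only the strict best hits survive; lowering the value admits near-best hits such as the L_10/C_11 pair, which is what makes the graph "adaptive".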

Proteinortho output files

RBH

The result of the (i) step, the reciprocal best hit graph.

The first two comment lines announce the 2 species (# ecoli.faa human.faa) and the median values (evalue_ab, bitscore_ab, evalue_ba, bitscore_ba).

Following these header lines, each line corresponds to a reciprocal best hit of 2 proteins/genes (columns 1 and 2) of the announced species. The output format is shown below.

seqidA, seqidB = the 2 ids/names of the proteins involved

evalue_ab = evalue with seqidA as query and seqidB as part of the database

bitscore_ab = bitscore with seqidA as query ...

evalue_ba = evalue with seqidB as query ...

seqidA	seqidB	evalue_ab	bitscore_ab	evalue_ba	bitscore_ba
# ecoli.faa	human.faa
# 1.91e-112	357.5	1.825e-113	360
L_10	C_10;test	4.32e-151	447	4.30e-151	446
L_11	C_11	1.17e-68	209	3.00e-69	210
L_14	C_14	3.64e-139	422	1.19e-142	431
L_15	C_15	3.51e-100	303	2.12e-102	308
L_16	C_16	3.75e-49	157	7.06e-50	159
L_17	C_17	2.96e-195	578	5.50e-196	579
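For downstream scripting, a file in this layout can be read with a few lines of Python. The sketch below assumes tab-separated fields, treats a two-field comment line as the species announcement, and skips every other comment line as well as the column legend; `parse_rbh` is a hypothetical helper, not part of Proteinortho.

```python
import io


def parse_rbh(handle):
    """Yield one dict per reciprocal best hit from an RBH-style file."""
    species = None
    for raw in handle:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("#"):
            fields = line.lstrip("#").strip().split("\t")
            # Assumption: a two-field comment names the species pair; other
            # comment lines (such as the medians) are skipped.
            if len(fields) == 2:
                species = (fields[0], fields[1])
            continue
        fields = line.split("\t")
        if fields[0] == "seqidA":  # column legend, not a data row
            continue
        yield {
            "species": species,
            "seqid_a": fields[0],
            "seqid_b": fields[1],
            "evalue_ab": float(fields[2]),
            "bitscore_ab": float(fields[3]),
            "evalue_ba": float(fields[4]),
            "bitscore_ba": float(fields[5]),
        }


sample = (
    "seqidA\tseqidB\tevalue_ab\tbitscore_ab\tevalue_ba\tbitscore_ba\n"
    "# ecoli.faa\thuman.faa\n"
    "# 1.91e-112\t357.5\t1.825e-113\t360\n"
    "L_10\tC_10;test\t4.32e-151\t447\t4.30e-151\t446\n"
    "L_11\tC_11\t1.17e-68\t209\t3.00e-69\t210\n"
)
records = list(parse_rbh(io.StringIO(sample)))
for record in records:
    print(record["species"], record["seqid_a"], record["seqid_b"], record["bitscore_ab"])
```

Each record carries the species pair announced by the most recent comment line, so hits from files covering several species pairs stay distinguishable.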

orthology-groups

The result of the (ii) step, the clustered reciprocal best hit graph or the orthology groups.

Every line corresponds to an orthology group.

The first 3 columns characterize the general properties of the group: number of species, number of proteins/genes, and algebraic connectivity. The higher the algebraic connectivity, the more edges the group contains and the better connected it is overall.

Then a column for each species follows containing the proteins of these species.

If a species contributes with more than one protein to a group of orthologs, then they are ordered by descending connectivity.

A '*' indicates that the species does not contribute to the group.

Species	Genes	alg.-conn.	ecoli.faa	human.faa	snail.faa	wale.faa	ebola.faa
5	5	0.715	C_10	C_10;test	E_10	L_10	M_10
4	6	0.115	*	C_12	E_315	L_313	M_313
4	5	0.167	*	C_63	E_19	L_19	M_19
4	4	0.816	*	C_64	E_18	L_18	M_18
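A table in this layout can be loaded into per-group dictionaries as sketched below. The sketch assumes tab-separated columns, treats both '*' and an empty cell as "species absent", and assumes that multiple proteins of one species are separated by commas; `parse_orthology_groups` is a hypothetical helper name.

```python
import io


def parse_orthology_groups(handle):
    """Return a list of groups, each with its summary values and members."""
    species_names = None
    groups = []
    for raw in handle:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if species_names is None:
            # Header row: the first three columns are summary columns,
            # the remaining ones name the species (input files).
            species_names = fields[3:]
            continue
        members = {}
        for name, cell in zip(species_names, fields[3:]):
            cell = cell.strip()
            if cell and cell != "*":  # '*' or empty: species not in group
                members[name] = cell.split(",")  # assumed separator
        groups.append({
            "species": int(fields[0]),
            "genes": int(fields[1]),
            "alg_conn": float(fields[2]),
            "members": members,
        })
    return groups


sample = (
    "Species\tGenes\talg.-conn.\tecoli.faa\thuman.faa\tsnail.faa\twale.faa\tebola.faa\n"
    "5\t5\t0.715\tC_10\tC_10;test\tE_10\tL_10\tM_10\n"
    "4\t4\t0.816\t*\tC_64\tE_18\tL_18\tM_18\n"
)
groups = parse_orthology_groups(io.StringIO(sample))
for group in groups:
    print(group["species"], group["alg_conn"], sorted(group["members"]))
```

A common follow-up is to filter the list, for example keeping only groups in which every species is present.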

orthology-pairs

The same as orthology-groups but every edge is printed one-by-one instead of the whole group. The output is formatted the same as the RBH graph:

seqidA	seqidB	evalue_ab	bitscore_ab	evalue_ba	bitscore_ba

Proteinortho-Tools for downstream analysis

proteinortho grab proteins : Finds gene(s)/protein(s) in a given fasta file and retrieves their sequence(s). You can also use an orthology-groups file or a subset of it (e.g. filtered by Species>10).
proteinortho summary : Summarizes the orthology-pairs/RBH files to determine how the species are connected to each other.

More information can be found on GitLab: https://gitlab.com/paulklemm_PHD/proteinortho

Citations: