view proteinortho_clustering.xml @ 0:0be284f6abf4 draft default tip

planemo upload for repository https://gitlab.com/paulklemm_PHD/proteinortho commit 621d623b1acbdf230f0c6a2480bb49968ca1e2d5
author iuc
date Tue, 17 Jun 2025 07:50:24 +0000

<tool id="proteinortho_clustering" name="Proteinortho clustering" version="@TOOL_VERSION@+galaxy@WRAPPER_VERSION@" profile="@PROFILE@">
    <description>Spectral partitioning algorithm</description>
    <macros>
        <import>proteinortho_macros.xml</import>
    </macros>
    <expand macro="biotools"/>
    <expand macro="requirements"/>
    <expand macro="version_command"/>
    <command detect_errors="exit_code"><![CDATA[
        export TERM=dumb &&
        proteinortho_clustering 
            -conn '$conn'
            -minspecies '$minspecies'
            $core
            $abc
            -cpus "\${GALAXY_SLOTS:-4}"
            #if str($seed) != '':
                -seed '$seed'
            #end if
            -lapack '$lapack'
            '$blastgraph'
        2>&1 
        >output.tsv
    ]]></command>
    <inputs>
        <param name="blastgraph" type="data" format="tabular" label="Proteinortho blastgraph file (RBH) or an undirected weighted graph file in ABC format (needs -abc)"/>
        <param argument="-conn" type="float" value="0.1" label="Minimal connectivity (-conn)" help="Split if algebraic connectivity is below this value."/>
        <param argument="-minspecies" type="float" value="1" label="Min species ratio (-minspecies)" help="Stop clustering if the ratio of species to nodes in a cluster falls below this value. Takes precedence over the conn threshold."/>
        <param argument="-core" type="boolean" truevalue="-core" falsevalue="" label="Enable core mode" help="Only split a cluster if all species are preserved in the resulting clusters. Disables the conn and minspecies thresholds."/>
        <param argument="-abc" type="boolean" truevalue="-abc" falsevalue="" label="Input is ABC format" help="Use this if your graph file is in ABC format (node1(tab)node2(tab)positive weight). The graph is expected to be undirected (if the edge a-b is present, b-a is ignored)."/>
        <param argument="-seed" type="integer" optional="true" label="Random seed (-seed)" help="Set srand seed for reproducibility."/>
        <param argument="-lapack" type="select" optional="true" label="Use LAPACK ssyevr for algebraic connectivity calculation">
            <option value="0">Use power iteration instead of LAPACK</option>
            <option value="1" selected="true">Yes, if applicable</option>
            <option value="2">Always</option>
        </param>
    </inputs>
    <outputs>
        <data name="clustering" format="tabular" label="${tool.name} on ${on_string}: Clustering Result (conn=${conn})" from_work_dir="output.tsv"/>
    </outputs>
    <tests>
        <test expect_num_outputs="1">
            <param name="blastgraph" value="result.blast-graph"/>
            <output name="clustering">
                <assert_contents>
                    <has_text text="# Species"/>
                    <has_text text="C_11"/>
                    <has_text text="C_14"/>
                    <has_text text="C_17"/>
                    <has_text text="4"/>
                </assert_contents>
            </output>
        </test>
    </tests>
    <help><![CDATA[proteinortho clustering

The spectral clustering approach follows a bisecting paradigm. Groups are successively divided until a predefined algebraic connectivity threshold (**conn**) is met. The choice of this threshold directly affects the size and quality of the reported (co-)orthologous groups. A high threshold returns only sets of mutually similar proteins, but it can fragment the orthology graph into numerous small connected components, so that orthologous groups fall apart into several subsets. A low threshold, on the other hand, might return large, non-informative connected components with multiple putative co-orthologs for each species; such components actually represent unions of several orthologous groups.

The default threshold applied by Proteinortho was defined empirically and represents a reasonable trade-off between both extremes.

The **core** parameter assumes that members of orthologous groups should be found in all species. Iterative spectral clustering is applied irrespective of the graph’s connectivity until the graph would split into two subgraphs of which neither covers all species that were covered by the original connected component. 
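
As a rough orientation, and assuming the wrapper's default values (``-conn 0.1``, ``-minspecies 1``, ``-lapack 1``), the settings above correspond to command lines like the following sketch. The input file name ``input.blast-graph`` is a placeholder, and the CPU count is whatever Galaxy assigns to the job::

    # default mode: groups are split until the connectivity threshold is met
    proteinortho_clustering -conn 0.1 -minspecies 1 -cpus 4 -lapack 1 input.blast-graph > output.tsv

    # core mode: add -core, which disables the conn and minspecies thresholds
    proteinortho_clustering -conn 0.1 -minspecies 1 -core -cpus 4 -lapack 1 input.blast-graph > output.tsv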

With **abc** you can enable the processing of an ABC-formatted input graph (node1<tab>node2<tab>similarity_weight):

.. csv-table::
    
    node1,node2,10
    node2,node3,2
    node1,node3,3

You can use double dashes to indicate species, like this:

.. csv-table::
    
    node1--species1,node2--species1,10
    node1--species1,node1--species2,2
    node2--species1,node2--species2,3

Only edges between different species are removed. If no species are defined, each node is treated as an individual species.
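
A minimal sketch of the matching command line for ABC input, assuming the wrapper defaults for the other options; ``graph.abc`` is a placeholder name for a file holding tab-separated lines like the ones above::

    # -abc tells the tool to read node1<tab>node2<tab>weight lines instead of a blast-graph
    proteinortho_clustering -conn 0.1 -minspecies 1 -abc -cpus 4 -lapack 1 graph.abc > output.tsv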

**Output**

The result of the clustering is a tab-separated file.
The first three columns characterize the general properties of each group: number of species, number of genes (proteins), and algebraic connectivity. The higher the algebraic connectivity, the more edges the group contains and the better its members are connected to each other.
These are followed by one column per species (or node), listing the proteins that species contributes to the group.
If a species contributes more than one protein to a group of orthologs, the proteins are ordered by descending connectivity.
A '*' indicates that the species does not contribute to the group.

.. csv-table::
    
    Species,Genes,alg.-conn.,ecoli.faa,human.faa,snail.faa,wale.faa,mouse.faa
    5,5,0.715,C_10,C_10;test,E_10,L_10,M_10
    4,6,0.115,*,C_12,E_315,L_313,M_313
    4,5,0.167,*,C_63,E_19,L_19,M_19
    4,4,0.816,*,C_64,E_18,L_18,M_18

More information can be found on GitLab: https://gitlab.com/paulklemm_PHD/proteinortho
]]>
    </help>
    <citations>
        <citation type="doi">10.3389/fbinf.2023.1322477</citation>
        <citation type="doi">10.1186/1471-2105-12-124</citation>
        <citation type="doi">10.1371/journal.pone.0105015</citation>
    </citations>
</tool>