view cluster.xml @ 0:32849c52aa54 draft

planemo upload for repository https://github.com/galaxyecology/tools-ecology/tree/master/tools/Ecoregionalization_workflow commit 2a2ae892fa2dbc1eff9c6a59c3ad8f3c27c1c78d
author ecology
date Wed, 18 Oct 2023 09:59:19 +0000
parents
children edb8d19735a6

<tool id="ecoregion_clara_cluster" name="ClaraClust" version="0.1.0+galaxy0" profile="22.05">
    <description>Make cluster from BRT prediction with Clara algorithm</description>
    <requirements>
       <requirement type="package" version="4.2.3">r-base</requirement>
       <requirement type="package" version="2.1.4">r-cluster</requirement>
       <requirement type="package" version="1.1.1">r-dplyr</requirement>
       <requirement type="package" version="2.0.0">r-tidyverse</requirement>
    </requirements>
    <command detect_errors="exit_code"><![CDATA[
    Rscript
        '$__tool_directory__/cluster_ceamarc.R'
        '$predictionmatrix'
        '$envfile'
        '$predictionfile'
        '$k'
        '$metric'
        '$sample'
        '$output1'
        '$output2'
        '$output3'
    ]]></command>
    <inputs>
      <param name="predictionmatrix" type="data" format="tabular" label="Prediction matrix (data to cluster from Cluster Estimate tool)"/>
      <param name="envfile" type="data" format="txt,csv,tabular" label="Environmental file"/>
      <param name="predictionfile" type="data" format="tabular" label="Prediction file (data.bio table from Cluster Estimate tool)"/>
      <param name="k" type="integer" label="Number of clusters wanted" min="1" value="2"/>
      <param name="metric" type="select" label="Metric used to calculate dissimilarities between observations">
             <option value="manhattan">manhattan</option>
             <option value="euclidean">euclidean</option>
             <option value="jaccard">jaccard</option>
      </param>
      <param name="sample" type="integer" label="The number of samples to be drawn from the dataset" min="5" value="10"/>
    </inputs>
    <outputs>
      <data name="output1" from_work_dir="sih.png" format="png" label="Silhouette plot"/>
      <data name="output2" from_work_dir="points_clus.txt" format="txt" label="Cluster points"/>
      <data name="output3" from_work_dir="clus.txt" format="txt" label="Cluster info"/>
    </outputs>
    <tests>
        <test>
            <param name="predictionmatrix" value="Data_to_cluster.tsv"/>
            <param name="envfile" value="ceamarc_env.csv"/>
            <param name="predictionfile" value="Data.bio_table.tsv"/>
            <param name='k' value="2"/>
            <param name='metric' value="manhattan"/>
            <param name='sample' value="10"/>
            <output name='output1'>
                <assert_contents>
                    <has_size value="8128" delta="500"/>
                </assert_contents>
            </output>
            <output name='output2'>
                <assert_contents>
                    <has_size value="255" delta="1000" />
                </assert_contents>
            </output>
            <output name='output3' >
                <assert_contents>
                    <has_size value="2008" delta="1000000" />
                </assert_contents>
            </output>
        </test>
    </tests>
    <help><![CDATA[
====================
**What does it do?**
====================

This tool partitions the pixels of the environmental layers according to their BRT prediction index values. Because the datasets are large, the clara function of the R cluster package is used: it applies the Partitioning Around Medoids (PAM) algorithm to a representative sample of the data rather than to the full dataset, which speeds up the clustering.
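
The sampling idea described above can be illustrated with a minimal Python sketch. It is not the code run by this tool (which calls the R clara function): the function names, the toy data, and the ``samples`` and ``sampsize`` arguments are assumptions made for illustration, and an exhaustive medoid search on each subsample stands in for the PAM build and swap phases.

```python
import itertools
import random


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def total_cost(points, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)


def clara_sketch(points, k, samples=5, sampsize=10, seed=0):
    """Return (medoids, labels) from the best of `samples` random subsamples."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(samples):
        subset = rng.sample(points, min(sampsize, len(points)))
        # Exhaustive medoid search on the subsample (stand-in for PAM).
        medoids = min(itertools.combinations(subset, k),
                      key=lambda cand: total_cost(subset, cand))
        # Each candidate medoid set is scored on the full dataset.
        cost = total_cost(points, medoids)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    # Assign every point to its nearest medoid.
    labels = [min(range(k), key=lambda j: manhattan(p, best_medoids[j]))
              for p in points]
    return list(best_medoids), labels


# Toy data: two well-separated groups of 2-D points (made up for illustration).
toy = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (0.1, 0.2),
       (5.0, 5.1), (5.1, 5.0), (5.2, 5.1), (5.1, 5.2)]
medoids, labels = clara_sketch(toy, k=2, samples=3, sampsize=6)
print(medoids, labels)
```

Only the subsample is searched for medoids, so the cost of the search depends on the sample size rather than on the size of the full dataset, while the final labels still cover every point.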


==================
**How to use it?**
==================

The tool takes as input the prediction matrix obtained in the previous step of the workflow, the prediction file, and the file containing the environmental parameters (see the examples below). You must select the metric used to calculate the dissimilarities between observations:

- manhattan (sum of the absolute differences between the coordinates of two real-valued vectors): default metric in this workflow

- euclidean (straight-line distance between two points)

- jaccard
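
As a rough illustration of the first two metrics, the following Python sketch computes them for the first two rows of the example prediction matrix shown further down in this help. The helper functions are written here for illustration and are not part of the tool. The jaccard metric is left out because its exact definition depends on the implementation in the R cluster package.

```python
import math

# First two rows of the example prediction matrix from this help section.
row1 = [0.186146570494706, 0.239854021968922, 0.525876102108555]
row2 = [0.18639581256804, 0.240065974372315, 0.525876102108555]


def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print("manhattan:", manhattan(row1, row2))
print("euclidean:", euclidean(row1, row2))
```

The Euclidean distance is never larger than the Manhattan distance for the same pair of observations, so the choice of metric changes how strongly many small differences across taxa weigh against a single large one.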

The sample size used to perform the clustering must be specified. A fairly high value, representative of the data, is recommended: a sample that is too small may lose information compared with using the entire dataset.
Finally, enter the number of clusters that the clara function will use to partition the data.

The tool produces three outputs: two data files and a graph. The first file (rather large) contains all the information related to the clusters created, with the following columns: latitude, longitude, the corresponding cluster number, and the prediction value for each taxon. The second file is the one used in the rest of the workflow: it contains the latitude and longitude of each environmental pixel and the associated cluster number. This file is referred to later as the cluster file.
Finally, the tool produces a silhouette plot. A silhouette plot visualizes the silhouette index of each observation in a clustered dataset. It makes it possible to assess the quality of the clusters, that is, their coherence and separation. In a silhouette plot, each observation is represented by a horizontal bar whose length is proportional to its silhouette index. The longer the bar, the better the observation fits its own cluster and the better it is separated from the other clusters. The silhouette index ranges from -1 to 1. Values close to 1 indicate that objects are well grouped and separated from other clusters, while values close to -1 indicate that objects are poorly grouped and may be closer to another cluster. A value close to 0 indicates that an object lies on the border between two neighboring clusters.
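
The silhouette index can be sketched in Python using its usual definition, s(i) = (b - a) / max(a, b), where a is the mean distance from observation i to the other members of its cluster and b is the smallest mean distance from i to the members of any other cluster. The function and the toy data below are illustrative assumptions, not the code used to draw the plot, and at least two clusters are assumed.

```python
def silhouette(points, labels, dist):
    """Return the silhouette index of every point (needs >= 2 clusters)."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to each cluster, leaving p itself out.
        mean_to = {}
        for c in clusters:
            d = [dist(p, q) for j, q in enumerate(points)
                 if labels[j] == c and j != i]
            mean_to[c] = sum(d) / len(d) if d else None
        a = mean_to[labels[i]]
        if a is None:
            # A point alone in its cluster gets an index of 0 by convention.
            scores.append(0.0)
            continue
        b = min(v for c, v in mean_to.items() if c != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def absolute_difference(p, q):
    """Distance between two 1-D values."""
    return abs(p - q)


# Toy 1-D data (made up): two tight groups that are far apart.
values = [0.0, 1.0, 10.0, 11.0]
good = silhouette(values, [0, 0, 1, 1], absolute_difference)
bad = silhouette(values, [0, 1, 0, 1], absolute_difference)
print(good)
print(bad)
```

With the natural grouping every index is close to 1, whereas the deliberately mixed labelling yields negative values, which is the pattern the bars of the silhouette plot make visible.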

**Example of the prediction matrix :**

+--------------------+--------------------------+--------------------+-----+
|    Acarnidae       | Antarctotetilla_grandis  | Antarctotetilla_sp | ... |
+--------------------+--------------------------+--------------------+-----+
| 0.186146570494706  |    0.239854021968922     |  0.525876102108555 | ... |
+--------------------+--------------------------+--------------------+-----+ 
| 0.18639581256804   |    0.240065974372315     |  0.525876102108555 | ... |
+--------------------+--------------------------+--------------------+-----+
|        ...         |            ...           |         ...        | ... |
+--------------------+--------------------------+--------------------+-----+


**Example of the prediction file :**

+-----------+----------+-----------------------+-------------+
|    lat    |   long   |   Prediction.index    |     spe     |
+-----------+----------+-----------------------+-------------+
|  -65.57   |  139.22  |   0.122438487221909   |  Acarnidae  |
+-----------+----------+-----------------------+-------------+
|  -65.57   |  139.32  |   0.119154535627801   |  Acarnidae  |
+-----------+----------+-----------------------+-------------+
|   ...     |   ...    |         ...           |     ...     |
+-----------+----------+-----------------------+-------------+


**Example of the environmental parameters :**


+------+------+---------+------+--------------+-----+
| long | lat  |  Carbo  | Grav |  Maxbearing  | ... |
+------+------+---------+------+--------------+-----+
|139.22|-65.57|   0.88  |28.59 |     3.67     | ... |
+------+------+---------+------+--------------+-----+
|139.22|-65.57|   0.88  |28.61 |     3.64     | ... |
+------+------+---------+------+--------------+-----+
| ...  | ...  |   ...   | ...  |     ...      | ... |
+------+------+---------+------+--------------+-----+


    ]]></help>
</tool>