view imputation.xml @ 2:caba07f41453 draft default tip

"planemo upload for repository https://github.com/secimTools/SECIMTools/tree/main/galaxy commit 498abad641099412df56f04ff6e144e4193bbc34-dirty"
author malex
date Thu, 10 Jun 2021 15:41:17 +0000
parents 2e7d47c0b027
children
line wrap: on
line source

<tool id="secimtools_imputation" name="Imputation (Mean, Median, K-Nearest Neighbours, Stochastic)" version="@WRAPPER_VERSION@">
    <description>of missing values using selected algorithm.</description>
    <macros>
        <import>macros.xml</import>
    </macros>
    <expand macro="requirements" />
    <stdio>
        <exit_code range="1:" level="warning" description="UserWarning"/>
        <exit_code range="1:" level="warning" description="VisibleDeprecationWarning"/>
    </stdio>
    <command detect_errors="exit_code"><![CDATA[
imputation.py
--input $input
--design $design
--ID $uniqID
--group $group

--output $imputed

--knn $k
--strategy $imputation
--row_cutoff $rowCutoff
--col_cutoff $colCutoff
--distribution $distribution

#if $noZero
    --no_zero
#end if

#if $noNeg
    --no_negative
#end if

#if $exclude
    --exclude $exclude
#end if
    ]]></command>
    <inputs>
        <param name="input" type="data" format="tabular" label="Wide Dataset" help="Input your tab-separated wide format dataset. If file is not tab separated see TIP below."/>
        <param name="design" type="data" format="tabular" label="Design File" help="Input your design file (tab-separated). Note you need a 'sampleID' column. If not tab separated see TIP below."/>
        <param name="uniqID" type="text" size="30" value="" label="Unique Feature ID" help="Name of the column in your wide dataset that has unique identifiers.."/>
        <param name="group" type="text" size="30" label="Group/Treatment" help="Name of the column in your design file that contains group classifications."/>
        <param name="imputation" size="30" type="select" value="" label="Imputation Strategy" help="Choose an imputation strategy.">
            <option value="knn" selected="true">K-Nearest Neighbors</option>
            <option value="bayesian" selected="true">Stochastic</option>
            <option value="mean" selected="true">Mean</option>
            <option value="median" selected="true">Median</option>
        </param>
        <param name="noZero" type="boolean" label="Count Zeroes as missing" help="Zeroes can be counted as missing or left as data."/>
        <param name="noNeg" type="boolean" label="Count Negative as missing" help="Negatives can be counted as missing or left as data."/>
        <param name="exclude" type="text" size="30" value="" label="Additional values to treat as missing [Optional]" help="Separate additional values to treat as missing data with commas."/>
        <param name="rowCutoff" type="text" size="30" value=".5" label="Row Percent Cutoff Value" help="Proportion of missing values allowed per group per row. If the proportion of missing values for each feature is greater than the specifed value, then the sample mean is imputed instead of values from the K-Nearest Neighbors algorithm. Default: 0.5 (50%)."/>
        <param name="k" type="text" size="30" value="5" label="K value" help="Only for K-Nearest Neighbors Imputation, ignore for other imputation methods. K value is the number of neighbors to search. Default: 5. If less then 5 neighbours are available, all are used."/>
        <param name="colCutoff" type="text" size="30" value=".8" label="Column Percent Cutoff Value" help="Only for K-Nearest Neighbors Imputation, ignore for other imputation methods. If the proportion of missing values is greater than the specified value, the imputation stops and the tool returns an error. Default: 0.8 (80%)."/>
        <param name="distribution" size="30" type="select" value="" label="Bayesian Distribution" help="Only for Stochastic Imputation, ignore for other imputation methods.  Choose between normal and Poisson distributions.">
            <option value="Poisson" selected="true">Poisson</option>
            <option value="Normal" selected="true">Normal</option>
        </param>
    </inputs>
    <outputs>
        <data format="tabular" name="imputed" label="${tool.name} on ${on_string}"/>
    </outputs>
    <tests>
        <test>
            <param name="input"      value="ST000006_data.tsv"/>
            <param name="design"     value="ST000006_design.tsv"/>
            <param name="uniqID"     value="Retention_Index" />
            <param name="group"      value="White_wine_type_and_source" />
            <param name="imputation" value="knn" />
            <output name="imputed"   file="ST000006_imputation_output_KNN.tsv" />
        </test>
    </tests>
    <help><![CDATA[

@TIP_AND_WARNING@

**Tool Description**

The tool performs an imputation procedure for missing data based on three conceptually different methods:

(1) Naive imputation (mean, median)
(2) K-nearest neighbor imputation (KNN) 
(3) Stochastic imputation (based on normal and Poisson distributions)

Imputations are performed separately for each sample group since treatment groups are expected to be different.
If only a single sample (column) is available for a given group, nothing is imputed and the sample is kept intact.
An option to select which values should be treated as missing is included.
The default value for missing data is an empty cell in the dataset with the option to treat zeroes, negative values and user-defined value(s) as missing and subsequently impute missing values.

(1) Naive imputation:

Computes the mean (or median) of the features within the samples for a given group and uses that value to impute values for that feature among the missing samples.
Feature values for all missing samples in the group get the same value equal to the mean (median) of the available samples, provided the allowed missing threshold is met.

(2) K-Nearest Neighbors (KNN) imputation:

Based on the procedure where nearest neighbor samples (K value default = 5) for the given sample within each group are considered.
The neighboring samples are used to generate the missing value for the current samples.
If less than the specified K value number of neighbors are available for the current sample in the current group, then the maximum available number of neighbors is used.
If the proportion of missing values for each row (feature) is greater than the specified Row Percent Cutoff value (default 0.5), then the column (sample) mean is imputed instead of values from the KNN algorithm.
The proportion of missing values for each column (sample) can be specified (Column Percent Cutoff default =  0.8) and determines whether a sample should be imputed or not.
If the proportion of missing values for each sample is greater than the specified value, then the missing values are not imputed and the imputation process is interrupted.
The algorithm is deterministic and always imputes the same missing values for the same settings.
More details on the algorithm are available via the reference and link below:

Olga Troyanskaya,  Michael Cantor,  Gavin Sherlock,  Pat Brown,  Trevor Hastie,  Robert Tibshirani,  David Botstein and Russ B. Altman, Missing value estimation methods for DNA microarrays BIOINFORMATICS Vol. 17 no. 6. 2001 Pages 520-525.

https://bioconductor.org/packages/release/bioc/html/impute.html

(3) Stochastic imputation:

Based on the assumption that each feature within a given group follows some underlying distribution.
As a result, all missing values are generated from the underlying distribution.
The parameter(s) of the underlying distribution is (are) estimated from the observed features.
Two distribution options are available:

Normal (recommended for logged and negative data) and Poisson (recommended for nonnegative counts).
The normal distribution parameters are estimated by the mean and standard deviation of the observed samples for a given feature.
If all observed values for a feature are the same, then the standard deviation is assumed to be 1/3 the absolute value of the mean.
The Poisson distribution parameter is estimated by the mean of the observed values for a given feature and is expected to be positive for the imputation procedure to work correctly.

--------------------------------------------------------------------------------

**Note**

- This tool currently treats all variables as continuous numeric
  variables. Running the tool on categorical variables might result in
  incorrect results.
- Rows containing non-numeric (or missing) data in any
  of the chosen columns will be skipped from the analysis.

--------------------------------------------------------------------------------

**Input**

    - Two input datasets are required.

@WIDE@

**NOTE:** The sample IDs must match the sample IDs in the Design File (below).
Extra columns will automatically be ignored.

@METADATA@

@UNIQID@

@GROUP@

**Imputation Strategy.**

    - Select an imputation strategy.

**Count Zeroes as missing.**

    - Zeroes can be treated as missing or left as data.

**Count Negative as missing.**

    - Negatives can be treated as missing or left as data.

**Additional values to treat missing [Optional].**

    - Additional values to treat as missing data, separate with commas.

**Row Percent Cutoff Value.**

    - Proportion of missing values allowed per group per row. If the proportion of missing samples in the row is greater than the cutoff value specified, nothing will be imputed for that row. Default: 0.5 (50%).

**K value.**

    - If you are not using the KNN Imputation, then ignore. K value is the number of neighbors to search. Default: 5. If less then 5 neighbours are available, all are used.

**Column Percent Cutoff Value.**

    - If you are not using the KNN Imputation, then ignore. The maximum proportion of missing data allowed in any data column (sample). Default: 0.8 (80%). The imputation will fail if the proportion in the data exceeds this cutoff!

**Bayesian Distribution.**

    - Choose between Normal and Poisson distributions for stochastic imputation.

--------------------------------------------------------------------------------

**Output**

TSV file containing the same column names as the original Wide Dataset, where the values in each cell correspond to either the original values or to values obtained during the imputation procedure.


    ]]></help>
    <expand macro="citations"/>
</tool>