Mercurial > repos > bgruening > sklearn_sample_generator

<tool id="sklearn_sample_generator" name="Generate" version="@VERSION@">
    <description>random samples with controlled size and complexity</description>
    <macros>
        <import>main_macros.xml</import>
    </macros>
    <expand macro="python_requirements"/>
    <expand macro="macro_stdio"/>
    <version_command>echo "@VERSION@"</version_command>
    <command>
        <![CDATA[
        python "$sample_generator_script" '$inputs'
        ]]>
    </command>
    <configfiles>
        <inputs name="inputs" />
        <configfile name="sample_generator_script">
            <![CDATA[
import sys
import json
import pandas
import numpy as np
from sklearn import datasets

input_json_path = sys.argv[1]
with open(input_json_path, "r") as param_handler:
    params = json.load(param_handler)

my_function = getattr(datasets, params["sample_generators"]["selected_generator"])
options = params["sample_generators"]["options"]
my_points, my_targets = my_function(**options)

points_df = pandas.DataFrame(my_points)
targets_df = pandas.DataFrame(my_targets)
res = pandas.concat([points_df, targets_df], axis=1)
res.to_csv(path_or_buf = "$outfile", sep="\t", index=False)

            ]]>
        </configfile>
    </configfiles>
    <inputs>
        <conditional name="sample_generators">
            <param name="selected_generator" type="select" label="Sample type:">
                <option value="make_blobs" selected="true">Isotropic Gaussian blobs for clustering</option>
                <option value="make_classification">Random n-class classification problem</option>
                <option value="make_gaussian_quantiles">Isotropic Gaussian and label samples by quantile</option>
                <option value="make_hastie_10_2">Data for binary classification (used in Hastie et al)</option>
                <option value="make_circles">Large circle containing a smaller circle in 2d (circles)</option>
                <option value="make_moons">Two interleaving half circles (moons)</option>
                <option value="make_regression">Random regression problem</option>
                <option value="make_sparse_uncorrelated">Random regression problem with sparse uncorrelated design</option>
                <option value="make_friedman1">“Friedman #1” regression problem</option>
                <option value="make_friedman2">“Friedman #2” regression problem</option>
                <option value="make_friedman3">“Friedman #3” regression problem</option>
                <!--option value="make_low_rank_matrix">Mostly low rank matrix with bell-shaped singular values</option-->
                <!--option value="make_multilabel_classification">Random multilabel classification problem</option-->
                <option value="make_s_curve">S curve dataset</option>
                <option value="make_swiss_roll">Swiss roll dataset</option>
                <!--option value="make_sparse_coded_signal">Sparse combination of dictionary elements</option-->
                <!--option value="make_sparse_spd_matrix">Sparse symmetric definite positive matrix</option-->
                <!--option value="make_spd_matrix">Random symmetric, positive-definite matrix</option-->
                <!--option value="make_biclusters">Array with constant block diagonal structure for biclustering</option>
                <option value="make_checkerboard">Array with block checkerboard structure for biclustering</option-->
            </param>
            <when value="make_blobs">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="n_samples"/>
                    <expand macro="n_features"/>
                    <param argument="centers" type="integer" optional="true" value="3" label="Number of centers to generate" help=" "/>
                    <!--todo: expand centers type : int or array of shape [n_centers, n_features]-->
                    <param argument="cluster_std" type="float" optional="true" value="1.0" label="Standard deviation of the clusters" help=" "/>
                    <!--todo: expand cluster_std type : float or sequence of floats-->
                    <!--param argument=center_box-->
                    <expand macro="shuffle" label="Shuffle the samples"/>
                    <expand macro="random_state"/>
                </section>
            </when>
            <when value="make_classification">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="n_samples"/>
                    <expand macro="n_features" default_value="20"/>
                    <param argument="n_informative" type="integer" optional="true" value="2" label="Number of informative features" help="Each class is composed of a number of gaussian clusters each located around the vertices of a hypercube in a subspace of dimension n_informative. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined within each cluster in order to add covariance. The clusters are then placed on the vertices of the hypercube. "/>
                    <param argument="n_redundant" type="integer" optional="true" value="2" label="Number of redundant features" help="These features are generated as random linear combinations of the informative features. "/>
                    <param argument="n_repeated" type="integer" optional="true" value="0" label="Number of duplicated features" help="These are drawn randomly from the informative and the redundant features. "/>
                    <param argument="n_classes" type="integer" optional="true" value="2" label="Number of classes" help="The number of classes (or labels) of the classification problem. "/>
                    <param argument="n_clusters_per_class" type="integer" optional="true" value="2" label="Number of clusters per class" help=" "/>
                    <!--param argument = weights-->
                    <param argument="flip_y" type="float" optional="true" value="0.01" label="Fraction of samples with randomly exchanged class labels" help=" "/>
                    <!--param argument = class_sep-->
                    <!--param argument = hypercube-->
                    <!--param argument = shift-->
                    <!--param argument = scale-->
                    <expand macro="shuffle" label="Shuffle the samples"/>
                    <expand macro="random_state"/>
                </section>
            </when>
            <when value="make_gaussian_quantiles">
                <section name="options" title="Advanced Options" expanded="False">
                    <!--param argument = mean-->
                    <expand macro="n_samples"/>
                    <expand macro="n_features"/>
                    <param argument="cov" type="float" optional="true" value="1" label="Unit matrix coefficient" help="The covariance matrix will be this value times the unit matrix. This dataset only produces symmetric normal distributions. "/>
                    <param argument="n_classes" type="integer" optional="true" value="2" label="Number of classes" help="The number of classes (or labels) of the classification problem. "/>
                    <expand macro="shuffle" label="Shuffle the samples"/>
                    <expand macro="random_state"/>
                </section>
            </when>
            <when value="make_hastie_10_2">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="n_samples" default_value="12000"/>
                    <expand macro="random_state"/>
                </section>
            </when>
            <when value="make_circles">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="n_samples"/>
                    <expand macro="shuffle" label="Shuffle the samples"/>
                    <expand macro="noise" default_value=""/>
                    <param argument="factor" type="float" optional="true" value="0.8" label="Scale factor between inner and outer circle" help=" Floating point number less than 1. "/>
                    <expand macro="random_state"/>
                </section>
            </when>
            <when value="make_moons">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="n_samples"/>
                    <expand macro="shuffle" label="Shuffle the samples"/>
                    <expand macro="noise" default_value=""/>
                    <expand macro="random_state"/>
                </section>
            </when>
            <when value="make_regression">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="n_samples"/>
                    <expand macro="n_features" default_value="100"/>
                    <param argument="n_informative" type="integer" optional="true" value="10" label="Number of informative features" help="the number of features used to build the linear model used to generate the output "/>
                    <param argument="n_targets" type="integer" optional="true" value="1" label="Number of regression targets" help="The dimension of the y output vector associated with a sample. By default, the output is a scalar."/>
                    <param argument="bias" type="float" optional="true" value="0.0" label="Bias of the true function" help="The bias term in the underlying linear model. "/>
                    <!--param argument = effective_rank-->
                    <!--param argument = tail_strength-->
                    <!--param argument = coef-->
                    <expand macro="noise"/>
                    <expand macro="random_state"/>
                </section>
            </when>
            <when value="make_sparse_uncorrelated">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="n_samples"/>
                    <expand macro="n_features" default_value="10"/>
                    <expand macro="random_state"/>
                </section>
            </when>
            <when value="make_friedman1">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="n_samples"/>
                    <expand macro="n_features" default_value="10"/>
                    <expand macro="noise"/>
                    <expand macro="random_state"/>
                </section>
            </when>
            <when value="make_friedman2">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="n_samples"/>
                    <expand macro="noise"/>
                    <expand macro="random_state"/>
                </section>
            </when>
            <when value="make_friedman3">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="n_samples"/>
                    <expand macro="noise"/>
                    <expand macro="random_state"/>
                </section>
            </when>
            <!--when value="make_low_rank_matrix">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="n_samples"/>
                    <expand macro="n_features" default_value="100"/>
                    <expand macro="random_state"/>
                </section>
            </when-->
            <!--when value="make_multilabel_classification">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="n_samples"/>
                    <expand macro="n_features" default_value="20"/>
                    <expand macro="random_state"/>
                </section>
            </when-->
            <when value="make_s_curve">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="n_samples"/>
                    <expand macro="noise"/>
                    <expand macro="random_state"/>
                </section>
            </when>
            <when value="make_swiss_roll">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="n_samples"/>
                    <expand macro="noise"/>
                    <expand macro="random_state"/>
                </section>
            </when>
            <!--when value="make_sparse_coded_signal">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="n_samples" default_value=""/>
                    <expand macro="n_features" default_value=""/>
                    <expand macro="random_state"/>
                </section>
            </when-->
            <!--when value="make_spd_matrix">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="random_state"/>
                </section>
            </when-->
            <!--when value="make_sparse_spd_matrix">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="random_state"/>
                </section>
            </when-->
            <!--when value="make_biclusters">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="shuffle" label="Shuffle the samples"/>
                    <expand macro="noise"/>
                    <expand macro="random_state"/>
                </section>
            </when>
            <when value="make_checkerboard">
                <section name="options" title="Advanced Options" expanded="False">
                    <expand macro="shuffle" label="Shuffle the samples"/>
                    <expand macro="noise"/>
                    <expand macro="random_state"/>
                </section>
            </when-->
        </conditional>
    </inputs>
    <outputs>
        <data format="tabular" name="outfile"/>
    </outputs>
    <tests>
        <test>
            <param name="selected_generator" value="make_blobs"/>
            <param name="random_state" value="100"/>
            <output name="outfile" file="blobs.txt"/>
        </test>
        <test>
            <param name="selected_generator" value="make_classification"/>
            <param name="random_state" value="100"/>
            <output name="outfile" file="class.txt" compare="sim_size" />
        </test>
        <test>
            <param name="selected_generator" value="make_circles"/>
            <param name="random_state" value="100"/>
            <output name="outfile" file="circles.txt"/>
        </test>
        <test>
            <param name="selected_generator" value="make_friedman1"/>
            <param name="random_state" value="100"/>
            <output name="outfile" file="friedman1.txt"/>
        </test>
        <test>
            <param name="selected_generator" value="make_friedman2"/>
            <param name="random_state" value="100"/>
            <output name="outfile" file="friedman2.txt"/>
        </test>
        <test>
            <param name="selected_generator" value="make_friedman3"/>
            <param name="random_state" value="100"/>
            <output name="outfile" file="friedman3.txt"/>
        </test>
        <test>
            <param name="selected_generator" value="make_gaussian_quantiles"/>
            <param name="random_state" value="100"/>
            <output name="outfile" file="gaus.txt"/>
        </test>
        <test>
            <param name="selected_generator" value="make_hastie_10_2"/>
            <param name="random_state" value="100"/>
            <output name="outfile" file="hastie.txt" compare="contains"/>
        </test>
        <test>
            <param name="selected_generator" value="make_moons"/>
            <param name="random_state" value="100"/>
            <output name="outfile" file="moons.txt"/>
        </test>
        <test>
            <param name="selected_generator" value="make_regression"/>
            <param name="random_state" value="100"/>
            <output name="outfile" file="regression.txt" compare="sim_size" />
        </test>
        <test>
            <param name="selected_generator" value="make_s_curve"/>
            <param name="random_state" value="100"/>
            <output name="outfile" file="scurve.txt"/>
        </test>
        <test>
            <param name="selected_generator" value="make_sparse_uncorrelated"/>
            <param name="random_state" value="100"/>
            <output name="outfile" file="sparse_u.txt"/>
        </test>
        <test>
            <param name="selected_generator" value="make_swiss_roll"/>
            <param name="random_state" value="100"/>
            <output name="outfile" file="swiss_r.txt"/>
        </test>
    </tests>
    <help>
        <![CDATA[
**What it does**

This tool generates artificial data samples with specified size and controlled complexity.
It provides sample generators for the following machine learning problems:


**1 - Single_label classification and clustering**

 These generators produce a file containing the data samples. It is a tabular representation with samples in rows having features in columns.
 (In machine learning, each numerical property of a sample is called a feature.)
 The corresponding discrete targets are generated in a separate column. This column is added as the last coulmn of the data.


 **Example**
 Sample data with 4 features and a single target (n_samples=8 , n_features=4) :


 features columns
 ::

  4.01163365529    -6.10797684314    8.29829894763     -9.10139563721
  10.0788438916    1.59539821454     10.0684278289     4.16975127881
  -5.17607775503   -0.878286135332   6.92941850665     -5.27083063186
  4.00975406235    -7.11847496542    9.3802423585      -9.36732159584
  4.61204065139    -5.71217537352    9.12509610964     -9.2260804162
  8.26530668997    2.96705005011     8.88881190248     2.75339082289
  2.96366327113    -3.76295851562    11.7113372463     -9.79136150321
  8.13319631944    -0.223645298585   10.5820605308     4.47715318678

 target column
 ::

  1
  0
  2
  1
  1
  0
  1
  0

 The following generators are included in this section:


   * **Isotropic Gaussian blobs for clustering** creates multiclass datasets by allocating each class one or more normally-distributed clusters of points (isotropic = equally distributed in all directions). It provides control regarding the centers and standard deviations of each cluster, and is used to demonstrate clustering.

   * **Random n-class classification problem** does the same specialising in introducing noise by way of: correlated, redundant and uninformative features; multiple Gaussian clusters per class; and linear transformations of the feature space.

   * **Isotropic Gaussian and label samples by quantile** divides a single Gaussian cluster into near-equal-size classes separated by concentric hyperspheres.

   * **Data for binary classification (Hastie)** generates a binary problem similar to the above with 10 features.

   * **Circles** and **moons** generate 2-dimensional binary classification datasets that are challenging to certain algorithms (e.g. centroid-based clustering or linear classification), including optional Gaussian noise. They are useful for visualisation.

**2 - Generators for regression**

 These generators produce output with same same format as in section 1, thoguh aimed for regression problems. The following generators are included in this section:

  * **Random regression problem** produces regression targets as an optionally-sparse random linear combination of random features, with noise. Its informative features may be uncorrelated, or low rank (few features account for most of the variance). It can produce multiple targets for each point.

  * **Random regression problem with sparse uncorrelated design** produces a target as a linear combination of four features with fixed coefficients.

  * **Nonlinear generators** encode explicitly non-linear relations: **“Friedman #1”** is related by polynomial and sine transforms; **“Friedman #2”** includes feature multiplication and reciprocation; and **“Friedman #3”** is similar with an arctan transformation on the target.

**3 - Generators for manifold learning**

  Generators belonging to this group produce datasets suitable for non-linear dimensionality reduction problems. The idea behind this type of problem is that the dimensionality of many data sets is only artificially high. **S curve dataset** and **Swiss roll dataset** produce the same points-targets output format, sample points are 3-dimensional and the target column indicates the univariate position of the sample according to the main dimension of the points in the manifold.

        ]]>
    </help>
    <expand macro="sklearn_citation"/>
</tool>
author	bgruening
date	Tue, 07 Aug 2018 05:45:58 -0400
parents	c02c2bf137ab
children	4ba68dd788b3