comparison sample_generator.xml @ 0:dec5e182fd0c draft

planemo upload for repository https://github.com/bgruening/galaxytools/tools/sklearn commit 0e582cf1f3134c777cce3aa57d71b80ed95e6ba9
author bgruening
date Fri, 16 Feb 2018 09:16:51 -0500
parents
children c02c2bf137ab
-1:000000000000 0:dec5e182fd0c
1 <tool id="sklearn_sample_generator" name="Generate" version="@VERSION@">
2 <description>random samples with controlled size and complexity</description>
3 <macros>
4 <import>main_macros.xml</import>
5 </macros>
6 <expand macro="python_requirements"/>
7 <expand macro="macro_stdio"/>
8 <version_command>echo "@VERSION@"</version_command>
9 <command>
10 <![CDATA[
11 python "$sample_generator_script" '$inputs'
12 ]]>
13 </command>
14 <configfiles>
15 <inputs name="inputs" />
16 <configfile name="sample_generator_script">
17 <![CDATA[
18 import sys
19 import json
20 import pandas
21 import numpy as np
22 from sklearn import datasets
23
24 input_json_path = sys.argv[1]
25 params = json.load(open(input_json_path, "r"))
26
27 my_function = getattr(datasets, params["sample_generators"]["selected_generator"])
28 options = params["sample_generators"]["options"]
29 my_points, my_targets = my_function(**options)
30
31 points_df = pandas.DataFrame(my_points)
32 targets_df = pandas.DataFrame(my_targets)
33 res = pandas.concat([points_df, targets_df], axis=1)
34 res.to_csv(path_or_buf = "$outfile", sep="\t", index=False)
35
36 ]]>
37 </configfile>
38 </configfiles>
39 <inputs>
40 <conditional name="sample_generators">
41 <param name="selected_generator" type="select" label="Sample type:">
42 <option value="make_blobs" selected="true">Isotropic Gaussian blobs for clustering</option>
43 <option value="make_classification">Random n-class classification problem</option>
44 <option value="make_gaussian_quantiles">Isotropic Gaussian and label samples by quantile</option>
45 <option value="make_hastie_10_2">Data for binary classification (used in Hastie et al)</option>
46 <option value="make_circles">Large circle containing a smaller circle in 2d (circles)</option>
47 <option value="make_moons">Two interleaving half circles (moons)</option>
48 <option value="make_regression">Random regression problem</option>
49 <option value="make_sparse_uncorrelated">Random regression problem with sparse uncorrelated design</option>
50 <option value="make_friedman1">“Friedman #1” regression problem</option>
51 <option value="make_friedman2">“Friedman #2” regression problem</option>
52 <option value="make_friedman3">“Friedman #3” regression problem</option>
53 <!--option value="make_low_rank_matrix">Mostly low rank matrix with bell-shaped singular values</option-->
54 <!--option value="make_multilabel_classification">Random multilabel classification problem</option-->
55 <option value="make_s_curve">S curve dataset</option>
56 <option value="make_swiss_roll">Swiss roll dataset</option>
57 <!--option value="make_sparse_coded_signal">Sparse combination of dictionary elements</option-->
58 <!--option value="make_sparse_spd_matrix">Sparse symmetric definite positive matrix</option-->
59 <!--option value="make_spd_matrix">Random symmetric, positive-definite matrix</option-->
60 <!--option value="make_biclusters">Array with constant block diagonal structure for biclustering</option>
61 <option value="make_checkerboard">Array with block checkerboard structure for biclustering</option-->
62 </param>
63 <when value="make_blobs">
64 <section name="options" title="Advanced Options" expanded="False">
65 <expand macro="n_samples"/>
66 <expand macro="n_features"/>
67 <param argument="centers" type="integer" optional="true" value="3" label="Number of centers to generate" help=" "/>
68 <!--todo: expand centers type : int or array of shape [n_centers, n_features]-->
69 <param argument="cluster_std" type="float" optional="true" value="1.0" label="Standard deviation of the clusters" help=" "/>
70 <!--todo: expand cluster_std type : float or sequence of floats-->
71 <!--param argument=center_box-->
72 <expand macro="shuffle" label="Shuffle the samples"/>
73 <expand macro="random_state"/>
74 </section>
75 </when>
76 <when value="make_classification">
77 <section name="options" title="Advanced Options" expanded="False">
78 <expand macro="n_samples"/>
79 <expand macro="n_features" default_value="20"/>
80 <param argument="n_informative" type="integer" optional="true" value="2" label="Number of informative features" help="Each class is composed of a number of gaussian clusters each located around the vertices of a hypercube in a subspace of dimension n_informative. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined within each cluster in order to add covariance. The clusters are then placed on the vertices of the hypercube. "/>
81 <param argument="n_redundant" type="integer" optional="true" value="2" label="Number of redundant features" help="These features are generated as random linear combinations of the informative features. "/>
82 <param argument="n_repeated" type="integer" optional="true" value="0" label="Number of duplicated features" help="These are drawn randomly from the informative and the redundant features. "/>
83 <param argument="n_classes" type="integer" optional="true" value="2" label="Number of classes" help="The number of classes (or labels) of the classification problem. "/>
84 <param argument="n_clusters_per_class" type="integer" optional="true" value="2" label="Number of clusters per class" help=" "/>
85 <!--param argument = weights-->
86 <param argument="flip_y" type="float" optional="true" value="0.01" label="Fraction of samples with randomly exchanged class labels" help=" "/>
87 <!--param argument = class_sep-->
88 <!--param argument = hypercube-->
89 <!--param argument = shift-->
90 <!--param argument = scale-->
91 <expand macro="shuffle" label="Shuffle the samples"/>
92 <expand macro="random_state"/>
93 </section>
94 </when>
95 <when value="make_gaussian_quantiles">
96 <section name="options" title="Advanced Options" expanded="False">
97 <!--param argument = mean-->
98 <expand macro="n_samples"/>
99 <expand macro="n_features"/>
100 <param argument="cov" type="float" optional="true" value="1" label="Unit matrix coefficient" help="The covariance matrix will be this value times the unit matrix. This dataset only produces symmetric normal distributions. "/>
101 <param argument="n_classes" type="integer" optional="true" value="2" label="Number of classes" help="The number of classes (or labels) of the classification problem. "/>
102 <expand macro="shuffle" label="Shuffle the samples"/>
103 <expand macro="random_state"/>
104 </section>
105 </when>
106 <when value="make_hastie_10_2">
107 <section name="options" title="Advanced Options" expanded="False">
108 <expand macro="n_samples" default_value="12000"/>
109 <expand macro="random_state"/>
110 </section>
111 </when>
112 <when value="make_circles">
113 <section name="options" title="Advanced Options" expanded="False">
114 <expand macro="n_samples"/>
115 <expand macro="shuffle" label="Shuffle the samples"/>
116 <expand macro="noise" default_value=""/>
117 <param argument="factor" type="float" optional="true" value="0.8" label="Scale factor between inner and outer circle" help=" Floating point number less than 1. "/>
118 <expand macro="random_state"/>
119 </section>
120 </when>
121 <when value="make_moons">
122 <section name="options" title="Advanced Options" expanded="False">
123 <expand macro="n_samples"/>
124 <expand macro="shuffle" label="Shuffle the samples"/>
125 <expand macro="noise" default_value=""/>
126 <expand macro="random_state"/>
127 </section>
128 </when>
129 <when value="make_regression">
130 <section name="options" title="Advanced Options" expanded="False">
131 <expand macro="n_samples"/>
132 <expand macro="n_features" default_value="100"/>
133 <param argument="n_informative" type="integer" optional="true" value="10" label="Number of informative features" help="The number of features used to build the linear model that generates the output. "/>
134 <param argument="n_targets" type="integer" optional="true" value="1" label="Number of regression targets" help="The dimension of the y output vector associated with a sample. By default, the output is a scalar."/>
135 <param argument="bias" type="float" optional="true" value="0.0" label="Bias of the true function" help="The bias term in the underlying linear model. "/>
136 <!--param argument = effective_rank-->
137 <!--param argument = tail_strength-->
138 <!--param argument = coef-->
139 <expand macro="noise"/>
140 <expand macro="random_state"/>
141 </section>
142 </when>
143 <when value="make_sparse_uncorrelated">
144 <section name="options" title="Advanced Options" expanded="False">
145 <expand macro="n_samples"/>
146 <expand macro="n_features" default_value="10"/>
147 <expand macro="random_state"/>
148 </section>
149 </when>
150 <when value="make_friedman1">
151 <section name="options" title="Advanced Options" expanded="False">
152 <expand macro="n_samples"/>
153 <expand macro="n_features" default_value="10"/>
154 <expand macro="noise"/>
155 <expand macro="random_state"/>
156 </section>
157 </when>
158 <when value="make_friedman2">
159 <section name="options" title="Advanced Options" expanded="False">
160 <expand macro="n_samples"/>
161 <expand macro="noise"/>
162 <expand macro="random_state"/>
163 </section>
164 </when>
165 <when value="make_friedman3">
166 <section name="options" title="Advanced Options" expanded="False">
167 <expand macro="n_samples"/>
168 <expand macro="noise"/>
169 <expand macro="random_state"/>
170 </section>
171 </when>
172 <!--when value="make_low_rank_matrix">
173 <section name="options" title="Advanced Options" expanded="False">
174 <expand macro="n_samples"/>
175 <expand macro="n_features" default_value="100"/>
176 <expand macro="random_state"/>
177 </section>
178 </when-->
179 <!--when value="make_multilabel_classification">
180 <section name="options" title="Advanced Options" expanded="False">
181 <expand macro="n_samples"/>
182 <expand macro="n_features" default_value="20"/>
183 <expand macro="random_state"/>
184 </section>
185 </when-->
186 <when value="make_s_curve">
187 <section name="options" title="Advanced Options" expanded="False">
188 <expand macro="n_samples"/>
189 <expand macro="noise"/>
190 <expand macro="random_state"/>
191 </section>
192 </when>
193 <when value="make_swiss_roll">
194 <section name="options" title="Advanced Options" expanded="False">
195 <expand macro="n_samples"/>
196 <expand macro="noise"/>
197 <expand macro="random_state"/>
198 </section>
199 </when>
200 <!--when value="make_sparse_coded_signal">
201 <section name="options" title="Advanced Options" expanded="False">
202 <expand macro="n_samples" default_value=""/>
203 <expand macro="n_features" default_value=""/>
204 <expand macro="random_state"/>
205 </section>
206 </when-->
207 <!--when value="make_spd_matrix">
208 <section name="options" title="Advanced Options" expanded="False">
209 <expand macro="random_state"/>
210 </section>
211 </when-->
212 <!--when value="make_sparse_spd_matrix">
213 <section name="options" title="Advanced Options" expanded="False">
214 <expand macro="random_state"/>
215 </section>
216 </when-->
217 <!--when value="make_biclusters">
218 <section name="options" title="Advanced Options" expanded="False">
219 <expand macro="shuffle" label="Shuffle the samples"/>
220 <expand macro="noise"/>
221 <expand macro="random_state"/>
222 </section>
223 </when>
224 <when value="make_checkerboard">
225 <section name="options" title="Advanced Options" expanded="False">
226 <expand macro="shuffle" label="Shuffle the samples"/>
227 <expand macro="noise"/>
228 <expand macro="random_state"/>
229 </section>
230 </when-->
231 </conditional>
232 </inputs>
233 <outputs>
234 <data format="tabular" name="outfile"/>
235 </outputs>
236 <tests>
237 <test>
238 <param name="selected_generator" value="make_blobs"/>
239 <param name="random_state" value="100"/>
240 <output name="outfile" file="blobs.txt"/>
241 </test>
242 <test>
243 <param name="selected_generator" value="make_classification"/>
244 <param name="random_state" value="100"/>
245 <output name="outfile" file="class.txt" compare="sim_size" />
246 </test>
247 <test>
248 <param name="selected_generator" value="make_circles"/>
249 <param name="random_state" value="100"/>
250 <output name="outfile" file="circles.txt"/>
251 </test>
252 <test>
253 <param name="selected_generator" value="make_friedman1"/>
254 <param name="random_state" value="100"/>
255 <output name="outfile" file="friedman1.txt"/>
256 </test>
257 <test>
258 <param name="selected_generator" value="make_friedman2"/>
259 <param name="random_state" value="100"/>
260 <output name="outfile" file="friedman2.txt"/>
261 </test>
262 <test>
263 <param name="selected_generator" value="make_friedman3"/>
264 <param name="random_state" value="100"/>
265 <output name="outfile" file="friedman3.txt"/>
266 </test>
267 <test>
268 <param name="selected_generator" value="make_gaussian_quantiles"/>
269 <param name="random_state" value="100"/>
270 <output name="outfile" file="gaus.txt"/>
271 </test>
272 <test>
273 <param name="selected_generator" value="make_hastie_10_2"/>
274 <param name="random_state" value="100"/>
275 <output name="outfile" file="hastie.txt" compare="contains"/>
276 </test>
277 <test>
278 <param name="selected_generator" value="make_moons"/>
279 <param name="random_state" value="100"/>
280 <output name="outfile" file="moons.txt"/>
281 </test>
282 <test>
283 <param name="selected_generator" value="make_regression"/>
284 <param name="random_state" value="100"/>
285 <output name="outfile" file="regression.txt" compare="sim_size" />
286 </test>
287 <test>
288 <param name="selected_generator" value="make_s_curve"/>
289 <param name="random_state" value="100"/>
290 <output name="outfile" file="scurve.txt"/>
291 </test>
292 <test>
293 <param name="selected_generator" value="make_sparse_uncorrelated"/>
294 <param name="random_state" value="100"/>
295 <output name="outfile" file="sparse_u.txt"/>
296 </test>
297 <test>
298 <param name="selected_generator" value="make_swiss_roll"/>
299 <param name="random_state" value="100"/>
300 <output name="outfile" file="swiss_r.txt"/>
301 </test>
302 </tests>
303 <help>
304 <![CDATA[
305 **What it does**
306
307 This tool generates artificial data samples with specified size and controlled complexity.
308 It provides sample generators for the following machine learning problems:
309
310
311
312 **1 - Single-label classification and clustering**
313
314 These generators produce a file containing the data samples. It is a tabular representation with samples in rows having features in columns.
315 (In machine learning, each numerical property of a sample is called a feature.)
316 The corresponding discrete targets are generated in a separate column, which is appended as the last column of the data.
317
318
319 **Example**
320 Sample data with 4 features and a single target (n_samples=8, n_features=4):
321
322
323 features columns
324 ::
325
326 4.01163365529 -6.10797684314 8.29829894763 -9.10139563721
327 10.0788438916 1.59539821454 10.0684278289 4.16975127881
328 -5.17607775503 -0.878286135332 6.92941850665 -5.27083063186
329 4.00975406235 -7.11847496542 9.3802423585 -9.36732159584
330 4.61204065139 -5.71217537352 9.12509610964 -9.2260804162
331 8.26530668997 2.96705005011 8.88881190248 2.75339082289
332 2.96366327113 -3.76295851562 11.7113372463 -9.79136150321
333 8.13319631944 -0.223645298585 10.5820605308 4.47715318678
334
335 target column
336 ::
337
338 1
339 0
340 2
341 1
342 1
343 0
344 1
345 0
346
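As a rough illustration of this layout, the sketch below joins feature rows and targets into one tab-separated table with the target as the last column. It uses only the Python standard library rather than the tool's own pandas-based script, and `write_samples` is a hypothetical helper name introduced for this example.

```python
import csv
import io


def write_samples(points, targets, handle):
    """Write one sample per row: feature values first, target last."""
    if len(points) != len(targets):
        raise ValueError("points and targets must have the same length")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for features, target in zip(points, targets):
        writer.writerow(list(features) + [target])


# Two samples with 4 features each, taken from the example above.
points = [
    [4.01163365529, -6.10797684314, 8.29829894763, -9.10139563721],
    [10.0788438916, 1.59539821454, 10.0684278289, 4.16975127881],
]
targets = [1, 0]

buffer = io.StringIO()
write_samples(points, targets, buffer)
table = buffer.getvalue()
print(table)
```

The tool's script writes its table through pandas, which by default also emits a header line of column labels; the sketch omits that line to mirror the example shown here.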
347 The following generators are included in this section:
348
349
350 * **Isotropic Gaussian blobs for clustering** creates multiclass datasets by allocating each class one or more normally-distributed clusters of points (isotropic = equally distributed in all directions). It provides control regarding the centers and standard deviations of each cluster, and is used to demonstrate clustering.
351
352 * **Random n-class classification problem** also creates multiclass datasets, but specialises in introducing noise through correlated, redundant and uninformative features; multiple Gaussian clusters per class; and linear transformations of the feature space.
353
354 * **Isotropic Gaussian and label samples by quantile** divides a single Gaussian cluster into near-equal-size classes separated by concentric hyperspheres.
355
356 * **Data for binary classification (Hastie)** generates a binary classification problem with 10 features, similar to the one above.
357
358 * **Circles** and **moons** generate 2-dimensional binary classification datasets that are challenging for certain algorithms (e.g. centroid-based clustering or linear classification), and can include optional Gaussian noise. They are useful for visualisation.
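
To make one of these labelling rules concrete, here is a minimal sketch of the Hastie-style problem. It assumes the rule given in the scikit-learn documentation for `make_hastie_10_2`: ten standard-normal features, with target 1 when the sum of squared features exceeds 9.34 and -1 otherwise. The function name `hastie_like` and its parameters are hypothetical, and the code uses Python's `random` module, so it will not reproduce the tool's exact numbers.

```python
import random


def hastie_like(n_samples=5, n_features=10, threshold=9.34, seed=0):
    """Draw standard-normal features; label +1 when the squared norm exceeds threshold."""
    rng = random.Random(seed)
    points, targets = [], []
    for _ in range(n_samples):
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        squared_norm = sum(value * value for value in row)
        points.append(row)
        targets.append(1.0 if squared_norm > threshold else -1.0)
    return points, targets


points, targets = hastie_like()
for row, target in zip(points, targets):
    # Show the squared norm next to the label it produces.
    print(round(sum(v * v for v in row), 3), target)
```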
359
360 **2 - Generators for regression**
361
362 These generators produce output in the same format as in section 1, though they are aimed at regression problems. The following generators are included in this section:
363
364 * **Random regression problem** produces regression targets as an optionally-sparse random linear combination of random features, with noise. Its informative features may be uncorrelated, or low rank (few features account for most of the variance). It can produce multiple targets for each point.
365
366 * **Random regression problem with sparse uncorrelated design** produces a target as a linear combination of four features with fixed coefficients.
367
368 * **Nonlinear generators** encode explicitly non-linear relations: **“Friedman #1”** is related by polynomial and sine transforms; **“Friedman #2”** includes feature multiplication and reciprocation; and **“Friedman #3”** is similar with an arctan transformation on the target.
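
The polynomial and sine transforms of "Friedman #1" can be sketched as follows. This assumes the formula given in the scikit-learn documentation for `make_friedman1`: features are uniform on [0, 1], only the first five enter the target, and the noise is Gaussian scaled by `noise`. The helper name `friedman1_target` is hypothetical, and the code is a standard-library illustration rather than the tool's implementation.

```python
import math
import random


def friedman1_target(x, noise=0.0, rng=None):
    """Friedman #1 target for one sample; only the first five features are used."""
    rng = rng or random.Random(0)
    return (
        10.0 * math.sin(math.pi * x[0] * x[1])
        + 20.0 * (x[2] - 0.5) ** 2
        + 10.0 * x[3]
        + 5.0 * x[4]
        + noise * rng.gauss(0.0, 1.0)
    )


rng = random.Random(100)
n_features = 10
sample = [rng.random() for _ in range(n_features)]  # uniform on [0, 1)
print(friedman1_target(sample))
```

Features beyond the fifth are independent of the target, which is why `n_features` can be raised to add uninformative columns.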
369
370 **3 - Generators for manifold learning**
371
372 Generators belonging to this group produce datasets suitable for non-linear dimensionality reduction problems. The idea behind this type of problem is that the dimensionality of many data sets is only artificially high. **S curve dataset** and **Swiss roll dataset** produce the same points-targets output format. The sample points are 3-dimensional, and the target column gives the univariate position of each sample along the main dimension of the manifold.
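
The relation between the 3-dimensional points and the univariate target can be sketched for the Swiss roll. This assumes the parametrisation used in scikit-learn's `make_swiss_roll`: a position t drawn from [1.5π, 4.5π], a point (t·cos t, h, t·sin t) with h uniform on [0, 21], and t itself as the target. The name `swiss_roll_like` is hypothetical and the code uses only the standard library, so its random draws differ from the tool's.

```python
import math
import random


def swiss_roll_like(n_samples=5, noise=0.0, seed=0):
    """Return 3-d points on a Swiss roll and their position t along the roll."""
    rng = random.Random(seed)
    points, targets = [], []
    for _ in range(n_samples):
        t = 1.5 * math.pi * (1.0 + 2.0 * rng.random())
        height = 21.0 * rng.random()
        point = [t * math.cos(t), height, t * math.sin(t)]
        # Optional Gaussian noise added to every coordinate.
        point = [value + noise * rng.gauss(0.0, 1.0) for value in point]
        points.append(point)
        targets.append(t)
    return points, targets


points, targets = swiss_roll_like()
for point, t in zip(points, targets):
    print([round(value, 3) for value in point], round(t, 3))
```

Without noise, the distance of each point from the roll's axis equals its target, which is the one-dimensional structure a manifold learning method is expected to recover.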
373
374 ]]>
375 </help>
376 <expand macro="sklearn_citation"/>
377 </tool>