annotate train_test_split.py @ 5:f4a7c3aa1e10 draft
"planemo upload for repository https://github.com/bgruening/galaxytools/tree/master/tools/sklearn commit eb703290e2589561ea215c84aa9f71bcfe1712c6"
| author | bgruening | 
|---|---|
| date | Fri, 01 Nov 2019 17:21:38 -0400 | 
| parents | |
| children | 27fabe5feedc | 
```python
import argparse
import json
import pandas as pd
import warnings

from galaxy_ml.model_validations import train_test_split
from galaxy_ml.utils import get_cv, read_columns


def _get_single_cv_split(params, array, infile_labels=None,
                         infile_groups=None):
    """Return the (train, test) subsets for one split of a CV splitter.

    Parameters
    ----------
    params : dict
        Galaxy tool inputs
    array : pandas DataFrame object
        The target dataset to split
    infile_labels : str
        File path to dataset containing target values
    infile_groups : str
        File path to dataset containing group values
    """
    y = None
    groups = None

    # 1-based position of the split to return
    nth_split = params['mode_selection']['nth_split']

    # read groups
    if infile_groups:
        header = 'infer' if (params['mode_selection']['cv_selector']
                             ['groups_selector']['header_g']) else None
        column_option = (params['mode_selection']['cv_selector']
                         ['groups_selector']['column_selector_options_g']
                         ['selected_column_selector_option_g'])
        if column_option in ['by_index_number', 'all_but_by_index_number',
                             'by_header_name', 'all_but_by_header_name']:
            c = (params['mode_selection']['cv_selector']['groups_selector']
                 ['column_selector_options_g']['col_g'])
        else:
            c = None

        groups = read_columns(infile_groups, c=c, c_option=column_option,
                              sep='\t', header=header, parse_dates=True)
        groups = groups.ravel()

        # replace the selector settings with the group values themselves
        params['mode_selection']['cv_selector']['groups_selector'] = groups

    # read labels
    if infile_labels:
        target_input = (params['mode_selection']
                        ['cv_selector'].pop('target_input'))
        header = 'infer' if target_input['header1'] else None
        # convert the 1-based column number to a 0-based index
        col_index = target_input['col'][0] - 1
        df = pd.read_csv(infile_labels, sep='\t', header=header,
                         parse_dates=True)
        y = df.iloc[:, col_index].values

    # construct the cv splitter object
    splitter, groups = get_cv(params['mode_selection']['cv_selector'])

    total_n_splits = splitter.get_n_splits(array.values, y=y, groups=groups)
    if nth_split > total_n_splits:
        raise ValueError("Total number of splits is {}, but got `nth_split` "
                         "= {}".format(total_n_splits, nth_split))

    i = 1
    for train_index, test_index in splitter.split(array.values, y=y, groups=groups):
        # assumes nth_split >= 1
        if i == nth_split:
            break
        else:
            i += 1

    train = array.iloc[train_index, :]
    test = array.iloc[test_index, :]

    return train, test
```
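The loop at the end of `_get_single_cv_split` counts splits by hand until it reaches `nth_split`. The same 1-based selection can be sketched with `itertools.islice`. This is only an illustration under stated assumptions: `simple_kfold` is a hypothetical stand-in for the scikit-learn-style splitter that `get_cv` returns, and `nth_split` is a hypothetical helper that is not part of the script. Unlike the original loop, the sketch also rejects values below 1 instead of assuming they cannot occur.

```python
from itertools import islice


def simple_kfold(n_samples, n_splits):
    """Hypothetical stand-in for a splitter's ``split`` method.

    Yields (train_indices, test_indices) over contiguous folds.
    """
    indices = list(range(n_samples))
    fold_sizes = [n_samples // n_splits] * n_splits
    # spread the remainder over the first folds
    for k in range(n_samples % n_splits):
        fold_sizes[k] += 1
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def nth_split(splits, n):
    """Return the n-th (1-based) item from an iterable of splits."""
    if n < 1:
        raise ValueError("`nth_split` must be >= 1, got {}".format(n))
    try:
        # skip the first n - 1 splits, then take the next one
        return next(islice(splits, n - 1, None))
    except StopIteration:
        raise ValueError("fewer than {} splits available".format(n))


train_idx, test_idx = nth_split(simple_kfold(n_samples=6, n_splits=3), 2)
print(train_idx, test_idx)  # [0, 1, 4, 5] [2, 3]
```

The returned index lists play the role of `train_index` and `test_index`, which the script then passes to `array.iloc[...]`.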
```python
def main(inputs, infile_array, outfile_train, outfile_test,
         infile_labels=None, infile_groups=None):
    """
    Parameters
    ----------
    inputs : str
        File path to the Galaxy tool parameters (JSON)

    infile_array : str
        File path to the tabular dataset to split

    infile_labels : str
        File path to dataset containing labels

    infile_groups : str
        File path to dataset containing groups

    outfile_train : str
        File path to dataset containing train split

    outfile_test : str
        File path to dataset containing test split
    """
    warnings.simplefilter('ignore')

    with open(inputs, 'r') as param_handler:
        params = json.load(param_handler)

    input_header = params['header0']
    header = 'infer' if input_header else None
    array = pd.read_csv(infile_array, sep='\t', header=header,
                        parse_dates=True)

    # train test split
    if params['mode_selection']['selected_mode'] == 'train_test_split':
        options = params['mode_selection']['options']
        shuffle_selection = options.pop('shuffle_selection')
        options['shuffle'] = shuffle_selection['shuffle']
        if infile_labels:
            header = 'infer' if shuffle_selection['header1'] else None
            # convert the 1-based column number to a 0-based index
            col_index = shuffle_selection['col'][0] - 1
            df = pd.read_csv(infile_labels, sep='\t', header=header,
                             parse_dates=True)
            labels = df.iloc[:, col_index].values
            options['labels'] = labels

        train, test = train_test_split(array, **options)

    # cv splitter
    else:
        train, test = _get_single_cv_split(params, array,
                                           infile_labels=infile_labels,
                                           infile_groups=infile_groups)

    print("Input shape: %s" % repr(array.shape))
    print("Train shape: %s" % repr(train.shape))
    print("Test shape: %s" % repr(test.shape))
    train.to_csv(outfile_train, sep='\t', header=input_header, index=False)
    test.to_csv(outfile_test, sep='\t', header=input_header, index=False)
```
if __name__ == '__main__':
    aparser = argparse.ArgumentParser()
    aparser.add_argument("-i", "--inputs", dest="inputs", required=True)
    aparser.add_argument("-X", "--infile_array", dest="infile_array")
    aparser.add_argument("-y", "--infile_labels", dest="infile_labels")
    aparser.add_argument("-g", "--infile_groups", dest="infile_groups")
    aparser.add_argument("-o", "--outfile_train", dest="outfile_train")
    aparser.add_argument("-t", "--outfile_test", dest="outfile_test")
    args = aparser.parse_args()

    main(args.inputs, args.infile_array, args.outfile_train,
         args.outfile_test, args.infile_labels, args.infile_groups)
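
The command-line entry point above can be tried in isolation. The following is a minimal sketch, not part of the tool itself: it rebuilds the same six options in a helper named `build_parser` (a hypothetical name) and parses a made-up argument list. The file names `params.json`, `data.tsv`, `train.tsv` and `test.tsv` are assumptions chosen purely for illustration. It shows that only `-i/--inputs` is required, and that omitted options such as `-y` and `-g` reach `main` as `None`.

```python
import argparse


def build_parser():
    """Rebuild the option set used by the script's __main__ block."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--inputs", dest="inputs", required=True)
    parser.add_argument("-X", "--infile_array", dest="infile_array")
    parser.add_argument("-y", "--infile_labels", dest="infile_labels")
    parser.add_argument("-g", "--infile_groups", dest="infile_groups")
    parser.add_argument("-o", "--outfile_train", dest="outfile_train")
    parser.add_argument("-t", "--outfile_test", dest="outfile_test")
    return parser


# Hypothetical file names, used only for illustration.
example_argv = ["-i", "params.json", "-X", "data.tsv",
                "-o", "train.tsv", "-t", "test.tsv"]
args = build_parser().parse_args(example_argv)

# Same positional order as the call to main() in the script:
# inputs, infile_array, outfile_train, outfile_test, infile_labels, infile_groups
main_args = (args.inputs, args.infile_array, args.outfile_train,
             args.outfile_test, args.infile_labels, args.infile_groups)
print(main_args)
```

Because the label and group files are optional at the parser level, any check that a given splitter actually needs them has to happen downstream, inside `main` or the helpers it calls.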
