comparison README.rst @ 0:29899feb4d44 draft

planemo upload for repository https://github.com/bgruening/galaxytools/tools/sklearn commit 0e582cf1f3134c777cce3aa57d71b80ed95e6ba9
author bgruening
date Fri, 16 Feb 2018 09:18:41 -0500
parents
children f0f1e5ba6fca
-1:000000000000 0:29899feb4d44
1 ***************************************
2 Galaxy wrapper for scikit-learn library
3 ***************************************
4
5 Contents
6 ========
7 - `What is scikit-learn?`_
8 - `Scikit-learn main package groups`_
9 - `Tools offered by this wrapper`_
10
11 - `Machine learning workflows`_
12 - `Supervised learning workflows`_
13 - `Unsupervised learning workflows`_
14
15
16 ____________________________
17
18
19 .. _What is scikit-learn?:
20
21 What is scikit-learn?
22 ===========================
23
24 Scikit-learn is an open-source machine learning library for the Python programming language. It offers various algorithms for supervised and unsupervised learning, as well as data preprocessing and transformation, model selection and evaluation, and dataset utilities. It is built on the SciPy (Scientific Python) library.
25
26 Scikit-learn source code can be accessed at https://github.com/scikit-learn/scikit-learn.
27 Detailed installation instructions can be found at http://scikit-learn.org/stable/install.html
28
29
30 .. _Scikit-learn main package groups:
31
32 ================================
33 Scikit-learn main package groups
34 ================================
35
36 Scikit-learn organizes its functionality into several main groups of related operations.
37 These are:
38
39 - Classification
40     - Identifying to which category an object belongs.
41 - Regression
42     - Predicting a continuous-valued attribute associated with an object.
43 - Clustering
44     - Automatic grouping of similar objects into sets.
45 - Preprocessing
46     - Feature extraction and normalization.
47 - Model selection and evaluation
48     - Comparing, validating and choosing parameters and models.
49 - Dimensionality reduction
50     - Reducing the number of random variables to consider.
51
52 Each group consists of a number of well-known algorithms from that category. For example, hierarchical, spectral, k-means, and other clustering methods can be found in the ``sklearn.cluster`` package.
53
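As an illustration of how the estimators in one group share an interface, the following minimal sketch runs three methods from ``sklearn.cluster`` on the same data. The synthetic data set, the number of clusters, and all variable names are assumptions made for this example and are not part of the wrapper.

```python
# Illustrative sketch: three clustering estimators from sklearn.cluster
# applied to the same synthetic data (data set and settings are assumptions).
from sklearn.cluster import AgglomerativeClustering, KMeans, SpectralClustering
from sklearn.datasets import make_blobs

# Synthetic 2-D points drawn around three centers.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

estimators = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "hierarchical": AgglomerativeClustering(n_clusters=3),
    "spectral": SpectralClustering(n_clusters=3, random_state=0),
}

# Each clustering estimator exposes fit_predict, which returns one
# integer cluster label per sample.
labels = {name: est.fit_predict(X) for name, est in estimators.items()}

for name, assigned in labels.items():
    print(name, "assigned", len(set(assigned)), "distinct labels")
```

Because each estimator implements ``fit_predict``, swapping one clustering method for another only changes the constructor call.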
54
55 .. _Tools offered by this wrapper:
56
57 ======================================
58 Available tools in the current wrapper
59 ======================================
60
61 The current release of the wrapper offers a subset of the packages from the scikit-learn library. It includes:
62
63 - A subset of classification metric functions
64 - Linear and quadratic discriminant classifiers
65 - Random forest and AdaBoost classifiers and regressors
66 - All the clustering methods
67 - All support vector machine classifiers
68 - A subset of data preprocessing estimator classes
69 - Pairwise metric measurement functions
70
71 In addition, several tools for matrix operations, problem-specific dataset generation, text encoding, and feature extraction are provided to help the user with more advanced operations.
72
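To give a feel for the kind of operations mentioned above, here is a small sketch using plain scikit-learn functions for dataset generation, text encoding, and pairwise metrics. It does not show the wrapper's tools themselves; the choice of functions, the toy documents, and the parameter values are assumptions made for illustration.

```python
# Illustrative sketch of dataset generation, text encoding and pairwise
# metrics with plain scikit-learn (function choices are assumptions).
from sklearn.datasets import make_classification
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances

# Generate a small synthetic classification problem.
X, y = make_classification(
    n_samples=20, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Encode short text documents as a sparse matrix of token counts.
docs = ["galaxy wraps scikit-learn", "scikit-learn offers clustering tools"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Euclidean distance between every pair of the first four samples.
distances = pairwise_distances(X[:4], metric="euclidean")

print("data:", X.shape, "token counts:", counts.shape, "distances:", distances.shape)
```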
73 .. _Machine learning workflows:
74
75 Machine learning workflows
76 ==========================
77
78 Machine learning is as much about process as about algorithms. Whatever learning algorithm we use, applying typical workflows and dataflows helps produce more robust models and better predictions.
79 Here we discuss the supervised and unsupervised learning workflows.
80
81 .. _Supervised learning workflows:
82
83 =====================================
84 Supervised machine learning workflows
85 =====================================
86
87 **What is supervised learning?**
88
89 In this machine learning task, the sample data are labeled, and the aim is to build a model that can predict the labels of new observations.
90 In practice, five steps lead from raw input data to reasonable predictions for new samples:
91
92 1. Preprocess the data::
93
94 * Change the collected data into the proper format and datatype.
95 * Adjust the data quality by filling the missing values, performing
96 required scaling and normalizations, etc.
97 * Extract the features which are most meaningful for the learning task.
98 * Split the ready dataset into training and test samples.
99
100 2. Choose an algorithm::
101
102 * These factors help one to choose a learning algorithm:
103 - Nature of the data (e.g. linear vs. nonlinear data)
104 - Structure of the predicted output (e.g. binary vs. multilabel classification)
105 - Memory and time usage of the training
106 - Predictive accuracy on new data
107 - Interpretability of the predictions
108
109 3. Choose a validation method
110
111 Every machine learning model should be evaluated before being put into practicical use.
112 There are numerous performance metrics to evaluate machine learning models.
113 For supervised learning, usually classification or regression metrics are used.
114
115 A validation method helps to evaluate the performance metrics of a trained model in order
116 to optimize its performance or ultimately switch to a more efficient model.
117 Cross-validation is a widely used validation method.
118
119 4. Fit a model
120
121 Given the learning algorithm, validation method, and performance metric(s)
122 repeat the following steps::
123
124 * Train the model.
125 * Evaluate based on metrics.
126 * Optimize until satisfied.
127
128 5. Use fitted model for prediction::
129
130 This is a final evaluation in which the optimized model is used to make predictions
131 on unseen samples (here, the test set). After this, the model is put into production.
132
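The five steps above can be sketched with plain scikit-learn as follows. This is a minimal sketch, not the wrapper's implementation: it assumes a recent scikit-learn release (``sklearn.impute`` is not available in older ones), and the iris data set, the random forest classifier, the parameter grid, and accuracy as the metric are illustrative choices.

```python
# Minimal supervised-learning sketch (data set, model and grid are assumptions).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: preprocess. Load labeled data and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Imputation and scaling sit inside the pipeline so that they are fitted
# on the training folds only. Iris has no missing values, so the imputer
# is included only to show where that step belongs.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    # Step 2: choose an algorithm.
    ("model", RandomForestClassifier(random_state=0)),
])

# Step 3: choose a validation method and metric:
# 5-fold cross-validation scored by accuracy.
search = GridSearchCV(
    pipeline,
    param_grid={"model__n_estimators": [25, 50], "model__max_depth": [2, None]},
    cv=5,
    scoring="accuracy",
)

# Step 4: fit. Train, evaluate and optimize over the parameter grid.
search.fit(X_train, y_train)

# Step 5: use the fitted (best) model on the held-out test samples.
y_pred = search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
print("test accuracy:", test_accuracy)
```

Keeping the test set out of the grid search means the final accuracy is measured on samples that played no part in training or tuning.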
133 .. _Unsupervised learning workflows:
134
135 =======================================
136 Unsupervised machine learning workflows
137 =======================================
138
139 **What is unsupervised learning?**
140
141 Unlike in supervised learning, and as is more likely in real life, the initial data here are not labeled.
142 The task is to extract the structure from the data and group the samples based on their similarities.
143 Clustering and dimensionality reduction are two well-known examples of unsupervised learning tasks.
144
145 In this case, the workflow is as follows::
146
147 * Preprocess the data (without splitting it into training and test sets).
148 * Train a model.
149 * Evaluate and tune parameters.
150 * Analyse the model and test on real data.
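
The workflow above can be sketched with plain scikit-learn as follows. This is only an illustration under stated assumptions: the synthetic data, k-means as the model, the silhouette score as the evaluation metric, and the candidate range for the number of clusters are all choices made for this example. A few held-back points stand in for the "real data" of the last step.

```python
# Minimal unsupervised-learning sketch (data, model and metric are assumptions).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Unlabeled synthetic data; the last ten points stand in for future data.
X_all, _ = make_blobs(n_samples=310, centers=4, cluster_std=0.6, random_state=42)
X, X_new = X_all[:300], X_all[300:]

# Preprocess: scale the features (no train/test split).
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train and evaluate: fit k-means for several cluster counts and score
# each clustering with the silhouette coefficient (higher is better).
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    scores[k] = silhouette_score(X_scaled, model.fit_predict(X_scaled))

# Tune: keep the cluster count with the best score and refit.
best_k = max(scores, key=scores.get)
final_model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X_scaled)

# Analyse: assign the new samples to the learned clusters.
new_labels = final_model.predict(scaler.transform(X_new))

print("silhouette score per k:", scores)
print("chosen k:", best_k)
print("labels for new samples:", new_labels)
```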