scikit-learn user guide
Release 0.19.1

scikit-learn developers

Nov 21, 2017

CONTENTS

1  Welcome to scikit-learn
   1.1  Installing scikit-learn
   1.2  Frequently Asked Questions
   1.3  Support
   1.4  Related Projects
   1.5  About us
   1.6  Who is using scikit-learn?
   1.7  Release history

2  scikit-learn Tutorials
   2.1  An introduction to machine learning with scikit-learn
   2.2  A tutorial on statistical-learning for scientific data processing
   2.3  Working With Text Data
   2.4  Choosing the right estimator
   2.5  External Resources, Videos and Talks

3  User Guide
   3.1  Supervised learning
   3.2  Unsupervised learning
   3.3  Model selection and evaluation
   3.4  Dataset transformations
   3.5  Dataset loading utilities
   3.6  Strategies to scale computationally: bigger data
   3.7  Computational Performance

4  Examples
   4.1  General examples
   4.2  Examples based on real world datasets
   4.3  Biclustering
   4.4  Calibration
   4.5  Classification
   4.6  Clustering
   4.7  Covariance estimation
   4.8  Cross decomposition
   4.9  Dataset examples
   4.10 Decomposition
   4.11 Ensemble methods
   4.12 Tutorial exercises
   4.13 Feature Selection
   4.14 Gaussian Process for Machine Learning
   4.15 Generalized Linear Models
   4.16 Manifold learning
   4.17 Gaussian Mixture Models
   4.18 Model Selection
   4.19 Multioutput methods
   4.20 Nearest Neighbors
   4.21 Neural Networks
   4.22 Preprocessing
   4.23 Semi Supervised Classification
   4.24 Support Vector Machines
   4.25 Working with text documents
   4.26 Decision Trees

5  API Reference
   5.1  sklearn.base: Base classes and utility functions
   5.2  sklearn.calibration: Probability Calibration
   5.3  sklearn.cluster: Clustering
   5.4  sklearn.cluster.bicluster: Biclustering
   5.5  sklearn.covariance: Covariance Estimators
   5.6  sklearn.cross_decomposition: Cross decomposition
   5.7  sklearn.datasets: Datasets
   5.8  sklearn.decomposition: Matrix Decomposition
   5.9  sklearn.discriminant_analysis: Discriminant Analysis
   5.10 sklearn.dummy: Dummy estimators
   5.11 sklearn.ensemble: Ensemble Methods
   5.12 sklearn.exceptions: Exceptions and warnings
   5.13 sklearn.feature_extraction: Feature Extraction
   5.14 sklearn.feature_selection: Feature Selection
   5.15 sklearn.gaussian_process: Gaussian Processes
   5.16 sklearn.isotonic: Isotonic regression
   5.17 sklearn.kernel_approximation: Kernel Approximation
   5.18 sklearn.kernel_ridge: Kernel Ridge Regression
   5.19 sklearn.linear_model: Generalized Linear Models
   5.20 sklearn.manifold: Manifold Learning
   5.21 sklearn.metrics: Metrics
   5.22 sklearn.mixture: Gaussian Mixture Models
   5.23 sklearn.model_selection: Model Selection
   5.24 sklearn.multiclass: Multiclass and multilabel classification
   5.25 sklearn.multioutput: Multioutput regression and classification
   5.26 sklearn.naive_bayes: Naive Bayes
   5.27 sklearn.neighbors: Nearest Neighbors
   5.28 sklearn.neural_network: Neural network models
   5.29 sklearn.pipeline: Pipeline
   5.30 sklearn.preprocessing: Preprocessing and Normalization
   5.31 sklearn.random_projection: Random projection
   5.32 sklearn.semi_supervised: Semi-Supervised Learning
   5.33 sklearn.svm: Support Vector Machines
   5.34 sklearn.tree: Decision Trees
   5.35 sklearn.utils: Utilities
   5.36 Recently deprecated

6  Developer's Guide
   6.1  Contributing
   6.2  Developers' Tips and Tricks
   6.3  Utilities for Developers
   6.4  How to optimize for speed
   6.5  Advanced installation instructions
   6.6  Maintainer / core-developer information

Bibliography

Index

CHAPTER

ONE

WELCOME TO SCIKIT-LEARN

1.1 Installing scikit-learn
Note: If you wish to contribute to the project, it’s recommended you install the latest development version.

1.1.1 Installing the latest release
Scikit-learn requires:
• Python (>= 2.7 or >= 3.3),
• NumPy (>= 1.8.2),
• SciPy (>= 0.13.3).
If you already have a working installation of numpy and scipy, the easiest way to install scikit-learn is using pip:
pip install -U scikit-learn

or conda:
conda install scikit-learn

If you have not installed NumPy or SciPy yet, you can also install these using conda or pip. When using pip, please
ensure that binary wheels are used, and NumPy and SciPy are not recompiled from source, which can happen when
using particular configurations of operating system and hardware (such as Linux on a Raspberry Pi). Building numpy
and scipy from source can be complex (especially on Windows) and requires careful configuration to ensure that they
link against an optimized implementation of linear algebra routines. Instead, use a third-party distribution as described
below.
If you must install scikit-learn and its dependencies with pip, you can install it as scikit-learn[alldeps]. The
most common use case for this is in a requirements.txt file used as part of an automated build process for a
PaaS application or a Docker image. This option is not intended for manual installation from the command line.
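For example, a requirements.txt consumed by such an automated build might contain just the following line (the version pin here is illustrative, not a recommendation):
scikit-learn[alldeps]==0.19.1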

1.1.2 Third-party Distributions
If you don't already have a Python installation with numpy and scipy, we recommend installing either via your package
manager or via a Python bundle. These come with numpy, scipy, scikit-learn, matplotlib and many other helpful
scientific and data processing libraries.
Available options are:
Canopy and Anaconda for all supported platforms
Canopy and Anaconda both ship a recent version of scikit-learn, in addition to a large set of scientific Python libraries,
for Windows, Mac OSX and Linux.
Anaconda offers scikit-learn as part of its free distribution.
Warning: To upgrade or uninstall scikit-learn installed with Anaconda or conda you should not use the pip
command. Instead:
To upgrade scikit-learn:
conda update scikit-learn

To uninstall scikit-learn:
conda remove scikit-learn

Upgrading with pip install -U scikit-learn or uninstalling with pip uninstall scikit-learn is
likely to fail to properly remove files installed by the conda command.
pip upgrade and uninstall operations only work on packages installed via pip install.

WinPython for Windows
The WinPython project distributes scikit-learn as an additional plugin.
For installation instructions for particular operating systems or for compiling the bleeding edge version, see the Advanced installation instructions.

1.2 Frequently Asked Questions
Here we try to give some answers to questions that regularly pop up on the mailing list.

1.2.1 What is the project name (a lot of people get it wrong)?
scikit-learn, but not scikit or SciKit nor sci-kit learn. Also not scikits.learn or scikits-learn, which were previously
used.

1.2.2 How do you pronounce the project name?
sy-kit learn. sci stands for science!

1.2.3 Why scikit?
There are multiple scikits, which are scientific toolboxes built around SciPy. You can find a list at
https://scikits.appspot.com/scikits. Apart from scikit-learn, another popular one is scikit-image.

1.2.4 How can I contribute to scikit-learn?
See Contributing. Adding a new algorithm is usually a major and lengthy undertaking, so it is recommended to start
with known issues instead. Please do not contact the contributors of scikit-learn directly regarding contributing to
scikit-learn.

1.2.5 What’s the best way to get help on scikit-learn usage?
For general machine learning questions, please use Cross Validated with the [machine-learning] tag.
For scikit-learn usage questions, please use Stack Overflow with the [scikit-learn] and [python] tags. You
can alternatively use the mailing list.
Please make sure to include a minimal reproduction code snippet (ideally shorter than 10 lines) that highlights your
problem on a toy dataset (for instance from sklearn.datasets or randomly generated with functions of
numpy.random with a fixed random seed). Please remove any line of code that is not necessary to reproduce your
problem. The problem should be reproducible by simply copy-pasting your code snippet in a Python shell with
scikit-learn installed. Do not forget to include the import statements.
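As an illustration, a self-contained reproduction script could look like the following sketch (the estimator and the toy
dataset here are placeholders; substitute whatever code actually triggers your problem):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy data generated with a fixed random seed so the problem is reproducible
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict(X[:5]))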
More guidance to write good reproduction code snippets can be found at:
http://stackoverflow.com/help/mcve
If your problem raises an exception that you do not understand (even after googling it), please make sure to include
the full traceback that you obtain when running the reproduction script.
For bug reports or feature requests, please make use of the issue tracker on GitHub.
There is also a scikit-learn Gitter channel where some users and developers might be found.
Please do not email any authors directly to ask for assistance, report bugs, or for any other issue related to
scikit-learn.

1.2.6 How can I create a bunch object?
Don’t make a bunch object! They are not part of the scikit-learn API. Bunch objects are just a way to package some
numpy arrays. As a scikit-learn user you only ever need numpy arrays to feed your model with data.
For instance to train a classifier, all you need is a 2D array X for the input variables and a 1D array y for the target
variables. The array X holds the features as columns and samples as rows. The array y contains integer values to
encode the class membership of each sample in X.
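For example, a minimal sketch (the toy data and the choice of classifier are arbitrary):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X: one row per sample, one column per feature; y: integer class labels
X = np.array([[0.0, 1.2], [1.5, 0.3], [0.2, 0.8], [1.1, 0.1]])
y = np.array([0, 1, 0, 1])

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)
print(clf.predict([[0.5, 0.9]]))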

1.2.7 How can I load my own datasets into a format usable by scikit-learn?
Generally, scikit-learn works on any numeric data stored as numpy arrays or scipy sparse matrices. Other types that
are convertible to numeric arrays such as pandas DataFrame are also acceptable.
For more information on loading your data files into these usable data structures, please refer to loading external
datasets.
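As an illustration, numeric columns of a pandas DataFrame can be converted to numpy arrays and handed to an
estimator (the CSV file and column names below are hypothetical):

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("my_data.csv")              # hypothetical file
X = df[["feature_a", "feature_b"]].values    # 2D numpy array of input variables
y = df["target"].values                      # 1D numpy array of targets

reg = LinearRegression().fit(X, y)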

1.2.8 What are the inclusion criteria for new algorithms?
We only consider well-established algorithms for inclusion. A rule of thumb is at least 3 years since publication, 200+
citations and wide use and usefulness. A technique that provides a clear-cut improvement (e.g. an enhanced data
structure or a more efficient approximation technique) on a widely-used method will also be considered for inclusion.
From the algorithms or techniques that meet the above criteria, only those which fit well within the current API of
scikit-learn, that is a fit, predict/transform interface and ordinarily having input/output that is a numpy array
or sparse matrix, are accepted.
The contributor should support the importance of the proposed addition with research papers and/or implementations
in other similar packages, demonstrate its usefulness via common use-cases/applications and corroborate performance
improvements, if any, with benchmarks and/or plots. It is expected that the proposed algorithm should outperform the
methods that are already implemented in scikit-learn at least in some areas.
Also note that your implementation need not be in scikit-learn to be used together with scikit-learn tools. You can
implement your favorite algorithm in a scikit-learn compatible way, upload it to github and let us know. We will list it
under Related Projects.
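As a rough sketch of what a scikit-learn compatible estimator can look like, the following toy classifier exposes the
usual fit/predict interface on numpy arrays (the nearest-class-mean logic is purely illustrative):

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class NearestMeanClassifier(BaseEstimator, ClassifierMixin):
    """Toy classifier: predict the class whose mean feature vector is closest."""

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        X = np.asarray(X)
        # squared Euclidean distance of each sample to each class mean
        dists = ((X[:, None, :] - self.means_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[dists.argmin(axis=1)]

An estimator written this way can be used with scikit-learn tools such as model_selection.cross_val_score without
being part of scikit-learn itself.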

1.2.9 Why are you so selective on what algorithms you include in scikit-learn?
Code comes with a maintenance cost, and we need to balance the amount of code we have with the size of the team (and
add to this the fact that complexity scales non-linearly with the number of features). The package relies on core developers
using their free time to fix bugs, maintain code and review contributions. Any algorithm that is added needs future
attention by the developers, at which point the original author might long have lost interest. Also see this thread on the
mailing list.

1.2.10 Why did you remove HMMs from scikit-learn?
See Will you add graphical models or sequence prediction to scikit-learn?.

1.2.11 Will you add graphical models or sequence prediction to scikit-learn?
Not in the foreseeable future. scikit-learn tries to provide a unified API for the basic tasks in machine learning, with
pipelines and meta-algorithms like grid search to tie everything together. The required concepts, APIs, algorithms
and expertise required for structured learning are different from what scikit-learn has to offer. If we started doing
arbitrary structured learning, we’d need to redesign the whole package and the project would likely collapse under its
own weight.
There are two projects with APIs similar to scikit-learn that do structured prediction:
• pystruct handles general structured learning (focuses on SSVMs on arbitrary graph structures with approximate
inference; defines the notion of sample as an instance of the graph structure)
• seqlearn handles sequences only (focuses on exact inference; has HMMs, but mostly for the sake of completeness; treats a feature vector as a sample and uses an offset encoding for the dependencies between feature
vectors)

1.2.12 Will you add GPU support?
No, or at least not in the near future. The main reason is that GPU support would introduce many software dependencies
and platform-specific issues. scikit-learn is designed to be easy to install on a wide variety of platforms.
Outside of neural networks, GPUs don’t play a large role in machine learning today, and much larger gains in speed
can often be achieved by a careful choice of algorithms.

1.2.13 Do you support PyPy?
In case you didn’t know, PyPy is the new, fast, just-in-time compiling Python implementation. We don’t support it.
When the NumPy support in PyPy is complete or near-complete, and SciPy is ported over as well, we can start thinking
of a port. We use too much of NumPy to work with a partial implementation.

1.2.14 How do I deal with string data (or trees, graphs...)?
scikit-learn estimators assume you’ll feed them real-valued feature vectors. This assumption is hard-coded in pretty
much all of the library. However, you can feed non-numerical inputs to estimators in several ways.
If you have text documents, you can use term frequency features; see Text feature extraction for the built-in text
vectorizers. For more general feature extraction from any kind of data, see Loading features from dicts and Feature
hashing.
Another common case is when you have non-numerical data and a custom distance (or similarity) metric on these data.
Examples include strings with edit distance (aka. Levenshtein distance; e.g., DNA or RNA sequences). These can be
encoded as numbers, but doing so is painful and error-prone. Working with distance metrics on arbitrary data can be
done in two ways.
Firstly, many estimators take precomputed distance/similarity matrices, so if the dataset is not too large, you can
compute distances for all pairs of inputs. If the dataset is large, you can use feature vectors with only one “feature”,
which is an index into a separate data structure, and supply a custom metric function that looks up the actual data in
this data structure. E.g., to use DBSCAN with Levenshtein distances:
>>> from leven import levenshtein
>>> import numpy as np
>>> from sklearn.cluster import dbscan
>>> data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]
>>> def lev_metric(x, y):
...     i, j = int(x[0]), int(y[0])  # extract indices
...     return levenshtein(data[i], data[j])
...
>>> X = np.arange(len(data)).reshape(-1, 1)
>>> X
array([[0],
       [1],
       [2]])
>>> dbscan(X, metric=lev_metric, eps=5, min_samples=2)
([0, 1], array([ 0,  0, -1]))

(This uses the third-party edit distance package leven.)
Similar tricks can be used, with some care, for tree kernels, graph kernels, etc.

1.2.15 Why do I sometimes get a crash/freeze with n_jobs > 1 under OSX or Linux?
Several scikit-learn tools such as GridSearchCV and cross_val_score rely internally on Python’s multiprocessing module to parallelize execution onto several Python processes by passing n_jobs > 1 as argument.
The problem is that Python multiprocessing does a fork system call without following it with an exec system
call for performance reasons. Many libraries like (some versions of) Accelerate / vecLib under OSX, (some versions
of) MKL, the OpenMP runtime of GCC, nvidia’s Cuda (and probably many others), manage their own internal thread
pool. Upon a call to fork, the thread pool state in the child process is corrupted: the thread pool believes it has many
threads while only the main thread state has been forked. It is possible to change the libraries to make them detect
when a fork happens and reinitialize the thread pool in that case: we did that for OpenBLAS (merged upstream in
master since 0.2.10) and we contributed a patch to GCC’s OpenMP runtime (not yet reviewed).
But in the end the real culprit is Python’s multiprocessing that does fork without exec to reduce the overhead
of starting and using new Python processes for parallel computing. Unfortunately this is a violation of the POSIX
standard and therefore some software editors like Apple refuse to consider the lack of fork-safety in Accelerate /
vecLib as a bug.
In Python 3.4+ it is now possible to configure multiprocessing to use the ‘forkserver’ or ‘spawn’ start methods
(instead of the default ‘fork’) to manage the process pools. To work around this issue when using scikit-learn, you
can set the JOBLIB_START_METHOD environment variable to ‘forkserver’. However the user should be aware that
using the 'forkserver' method prevents joblib.Parallel from calling functions interactively defined in a shell session.
If you have custom code that uses multiprocessing directly instead of using it via joblib you can enable the
‘forkserver’ mode globally for your program: Insert the following instructions in your main script:
import multiprocessing

# other imports, custom code, load data, define model...

if __name__ == '__main__':
    multiprocessing.set_start_method('forkserver')

    # call scikit-learn utils with n_jobs > 1 here

You can find more details on the new start methods in the multiprocessing documentation.

1.2.16 Why is there no support for deep or reinforcement learning / Will there be
support for deep or reinforcement learning in scikit-learn?
Deep learning and reinforcement learning both require a rich vocabulary to define an architecture, with deep learning
additionally requiring GPUs for efficient computing. However, neither of these fit within the design constraints of
scikit-learn; as a result, deep learning and reinforcement learning are currently out of scope for what scikit-learn seeks
to achieve.
You can find more information about the addition of GPU support at Will you add GPU support?

1.2.17 Why is my pull request not getting any attention?
The scikit-learn review process takes a significant amount of time, and contributors should not be discouraged by a
lack of activity or review on their pull request. We care a lot about getting things right the first time, as maintenance
and later change comes at a high cost. We rarely release any “experimental” code, so all of our contributions will be
subject to high use immediately and should be of the highest quality possible initially.
Beyond that, scikit-learn is limited in its reviewing bandwidth; many of the reviewers and core developers are working
on scikit-learn on their own time. If a review of your pull request comes slowly, it is likely because the reviewers are
busy. We ask for your understanding and request that you not close your pull request or discontinue your work solely
because of this reason.

1.2.18 How do I set a random_state for an entire execution?
For testing and replicability, it is often important to have the entire execution controlled by a single seed for the pseudorandom number generator used in algorithms that have a randomized component. Scikit-learn does not use its own
global random state; whenever a RandomState instance or an integer random seed is not provided as an argument, it
relies on the numpy global random state, which can be set using numpy.random.seed. For example, to set an
execution's numpy global random state to 42, one could execute the following in their script:
import numpy as np
np.random.seed(42)

However, a global random state is prone to modification by other code during execution. Thus, the only way to ensure
replicability is to pass RandomState instances everywhere and ensure that both estimators and cross-validation
splitters have their random_state parameter set.
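A sketch of that approach (the estimator, splitter and toy dataset are arbitrary examples):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(42)
X, y = make_classification(n_samples=100, random_state=rng)

# pass the RandomState (or a fixed integer) to everything that has a randomized component
clf = RandomForestClassifier(n_estimators=10, random_state=rng)
cv = KFold(n_splits=5, shuffle=True, random_state=rng)
scores = cross_val_score(clf, X, y, cv=cv)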

1.3 Support
There are several ways to get in touch with the developers.

1.3.1 Mailing List
• The main mailing list is scikit-learn.
• There is also a commit list scikit-learn-commits, where updates to the main repository and test failures get
notified.

1.3.2 User questions
• Some scikit-learn developers support users on StackOverflow using the [scikit-learn] tag.
• For general theoretical or methodological Machine Learning questions, Stack Exchange is probably a more suitable venue.
In both cases please use a descriptive question in the title field (e.g. no “Please help with scikit-learn!” as this is not a
question) and put details on what you tried to achieve, what were the expected results and what you observed instead
in the details field.
Code and data snippets are welcome. A minimalistic (up to ~20 lines long) reproduction script is very helpful.
Please describe the nature of your data and how you preprocessed it: what is the number of samples, what is the
number and type of features (i.e. categorical or numerical), and for supervised learning tasks, what target you are
trying to predict: binary, multiclass (1 out of n_classes) or multilabel (k out of n_classes) classification, or
continuous variable regression.

1.3.3 Bug tracker
If you think you’ve encountered a bug, please report it to the issue tracker:
https://github.com/scikit-learn/scikit-learn/issues
Don’t forget to include:
• steps (or better script) to reproduce,
• expected outcome,
• observed outcome or python (or gdb) tracebacks

To help developers fix your bug faster, please link to a https://gist.github.com holding a standalone minimalistic python
script that reproduces your bug and optionally a minimalistic subsample of your dataset (for instance exported as CSV
files using numpy.savetxt).
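For example, a small subsample could be exported like this (the random array is just a stand-in for a slice of your real data):

import numpy as np

X_sub = np.random.RandomState(0).rand(20, 4)   # stand-in for a small subsample of your data
np.savetxt("data_subsample.csv", X_sub, delimiter=",")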
Note: gists are git cloneable repositories and thus you can use git to push datafiles to them.

1.3.4 IRC
Some developers like to hang out on channel #scikit-learn on irc.freenode.net.
If you do not have an IRC client or are behind a firewall this web client works fine: http://webchat.freenode.net

1.3.5 Documentation resources
This documentation is relative to 0.19.1. Documentation for other versions can be found here.
Printable pdf documentation for old versions can be found here.

1.4 Related Projects
Projects implementing the scikit-learn estimator API are encouraged to use the scikit-learn-contrib template which
facilitates best practices for testing and documenting estimators. The scikit-learn-contrib GitHub organisation also
accepts high-quality contributions of repositories conforming to this template.
Below is a list of sister-projects, extensions and domain specific packages.

1.4.1 Interoperability and framework enhancements
These tools adapt scikit-learn for use with other technologies or otherwise enhance the functionality of scikit-learn’s
estimators.
Data formats
• sklearn_pandas bridge for scikit-learn pipelines and pandas data frame with dedicated transformers.
Auto-ML
• auto_ml Automated machine learning for production and analytics, built on scikit-learn and related projects.
Trains a pipeline with all the standard machine learning steps. Tuned for prediction speed and ease of transfer to
production environments.
• auto-sklearn An automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator
• TPOT An automated machine learning toolkit that optimizes a series of scikit-learn operators to design a machine learning pipeline, including data and feature preprocessors as well as the estimators. Works as a drop-in
replacement for a scikit-learn estimator.
Experimentation frameworks
• REP Environment for conducting data-driven research in a consistent and reproducible way
• ML Frontend provides dataset management and SVM fitting/prediction through web-based and programmatic
interfaces.
• Scikit-Learn Laboratory A command-line wrapper around scikit-learn that makes it easy to run machine learning
experiments with multiple learners and large feature sets.

• Xcessiv is a notebook-like application for quick, scalable, and automated hyperparameter tuning and stacked
ensembling. Provides a framework for keeping track of model-hyperparameter combinations.
Model inspection and visualisation
• eli5 A library for debugging/inspecting machine learning models and explaining their predictions.
• mlxtend Includes model visualization utilities.
• scikit-plot A visualization library for quick and easy generation of common plots in data analysis and machine
learning.
• yellowbrick A suite of custom matplotlib visualizers for scikit-learn estimators to support visual feature analysis,
model selection, evaluation, and diagnostics.
Model export for production
• sklearn-pmml Serialization of (some) scikit-learn estimators into PMML.
• sklearn2pmml Serialization of a wide variety of scikit-learn estimators and transformers into PMML with the
help of JPMML-SkLearn library.
• sklearn-porter Transpile trained scikit-learn models to C, Java, Javascript and others.
• sklearn-compiledtrees Generate a C++ implementation of the predict function for decision trees (and ensembles)
trained by sklearn. Useful for latency-sensitive production environments.

1.4.2 Other estimators and tasks
Not everything belongs or is mature enough for the central scikit-learn project. The following are projects providing
interfaces similar to scikit-learn for additional learning algorithms, infrastructures and tasks.
Structured learning
• Seqlearn Sequence classification using HMMs or structured perceptron.
• HMMLearn Implementation of hidden markov models that was previously part of scikit-learn.
• PyStruct General conditional random fields and structured prediction.
• pomegranate Probabilistic modelling for Python, with an emphasis on hidden Markov models.
• sklearn-crfsuite Linear-chain conditional random fields (CRFsuite wrapper with sklearn-like API).
Deep neural networks etc.
• pylearn2 A deep learning and neural network library built on Theano with a scikit-learn-like interface.
• sklearn_theano scikit-learn compatible estimators, transformers, and datasets which use Theano internally
• nolearn A number of wrappers and abstractions around existing neural network libraries
• keras Deep Learning library capable of running on top of either TensorFlow or Theano.
• lasagne A lightweight library to build and train neural networks in Theano.
Broad scope
• mlxtend Includes a number of additional estimators as well as model visualization utilities.
• sparkit-learn Scikit-learn API and functionality for PySpark’s distributed modelling.
Other regression and classification
• xgboost Optimised gradient boosted decision tree library.
• lightning Fast state-of-the-art linear model solvers (SDCA, AdaGrad, SVRG, SAG, etc. . . ).

• py-earth Multivariate adaptive regression splines
• Kernel Regression Implementation of Nadaraya-Watson kernel regression with automatic bandwidth selection
• gplearn Genetic Programming for symbolic regression tasks.
• multiisotonic Isotonic regression on multidimensional features.
Decomposition and clustering
• lda: Fast implementation of latent Dirichlet allocation in Cython which uses Gibbs sampling to sample from the
true posterior distribution. (scikit-learn's sklearn.decomposition.LatentDirichletAllocation implementation uses
variational inference to sample from a tractable approximation of a topic model's posterior distribution.)
• Sparse Filtering Unsupervised feature learning based on sparse-filtering
• kmodes k-modes clustering algorithm for categorical data, and several of its variations.
• hdbscan HDBSCAN and Robust Single Linkage clustering algorithms for robust variable density clustering.
• spherecluster Spherical K-means and mixture of von Mises Fisher clustering routines for data on the unit hypersphere.
Pre-processing
• categorical-encoding A library of sklearn compatible categorical variable encoders.
• imbalanced-learn Various methods to under- and over-sample datasets.

1.4.3 Statistical learning with Python
Other packages useful for data analysis and machine learning.
• Pandas Tools for working with heterogeneous and columnar data, relational queries, time series and basic statistics.
• theano A CPU/GPU array processing framework geared towards deep learning research.
• statsmodels Estimating and analysing statistical models. More focused on statistical tests and less on prediction
than scikit-learn.
• PyMC Bayesian statistical models and fitting algorithms.
• Sacred Tool to help you configure, organize, log and reproduce experiments
• Seaborn Visualization library based on matplotlib. It provides a high-level interface for drawing attractive
statistical graphics.
• Deep Learning A curated list of deep learning software libraries.
Domain specific packages
• scikit-image Image processing and computer vision in python.
• Natural language toolkit (nltk) Natural language processing and some machine learning.
• gensim A library for topic modelling, document indexing and similarity retrieval
• NiLearn Machine learning for neuro-imaging.
• AstroML Machine learning for astronomy.
• MSMBuilder Machine learning for protein conformational dynamics time series.

1.4.4 Snippets and tidbits
The wiki has more!

1.5 About us
This is a community effort, and as such many people have contributed to it over the years.

1.5.1 History
This project was started in 2007 as a Google Summer of Code project by David Cournapeau. Later that year, Matthieu
Brucher started work on this project as part of his thesis.
In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel of INRIA took leadership of the
project and made the first public release, February the 1st 2010. Since then, several releases have appeared following
a ~3 month cycle, and a thriving international community has been leading the development.

1.5.2 People
The following people have been core contributors to scikit-learn’s development and maintenance:
• Mathieu Blondel
• Matthieu Brucher
• Lars Buitinck
• David Cournapeau
• Noel Dawe
• Vincent Dubourg
• Edouard Duchesnay
• Tom Dupré la Tour
• Alexander Fabisch
• Virgile Fritsch
• Satra Ghosh
• Angel Soler Gollonet
• Chris Filo Gorgolewski
• Alexandre Gramfort
• Olivier Grisel
• Jaques Grobler
• Yaroslav Halchenko
• Brian Holt
• Arnaud Joly
• Thouis (Ray) Jones
• Kyle Kastner
• Manoj Kumar
• Robert Layton
• Wei Li
• Paolo Losi
• Gilles Louppe
• Jan Hendrik Metzen
• Vincent Michel
• Jarrod Millman
• Andreas Müller (release manager)
• Vlad Niculae
• Joel Nothman
• Alexandre Passos
• Fabian Pedregosa
• Peter Prettenhofer
• Bertrand Thirion
• Jake VanderPlas
• Nelle Varoquaux
• Gael Varoquaux
• Ron Weiss

Please do not email the authors directly to ask for assistance or report issues. Instead, please see What's the best way
to get help on scikit-learn usage? in the FAQ.
See also:
How you can contribute to the project

1.5.3 Citing scikit-learn
If you use scikit-learn in a scientific publication, we would appreciate citations to the following paper:
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
Bibtex entry:
@article{scikit-learn,
  title={Scikit-learn: Machine Learning in {P}ython},
  author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
          and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
          and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
          Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
  journal={Journal of Machine Learning Research},
  volume={12},
  pages={2825--2830},
  year={2011}
}

If you want to cite scikit-learn for its API or design, you may also want to consider the following paper:
API design for machine learning software: experiences from the scikit-learn project, Buitinck et al., 2013.
Bibtex entry:
@inproceedings{sklearn_api,
  author    = {Lars Buitinck and Gilles Louppe and Mathieu Blondel and
               Fabian Pedregosa and Andreas Mueller and Olivier Grisel and
               Vlad Niculae and Peter Prettenhofer and Alexandre Gramfort
               and Jaques Grobler and Robert Layton and Jake VanderPlas and
               Arnaud Joly and Brian Holt and Ga{\"{e}}l Varoquaux},
  title     = {{API} design for machine learning software: experiences from
               the scikit-learn project},
  booktitle = {ECML PKDD Workshop: Languages for Data Mining and Machine Learning},
  year      = {2013},
  pages     = {108--122},
}

1.5.4 Artwork
High quality PNG and SVG logos are available in the doc/logos/ source directory.

1.5.5 Funding
INRIA actively supports this project. It has provided funding for Fabian Pedregosa (2010-2012), Jaques Grobler
(2012-2013) and Olivier Grisel (2013-2017) to work on this project full-time. It also hosts coding sprints and other
events.

Paris-Saclay Center for Data Science funded one year for a developer to work on the project full-time (2014-2015)
and 50% of the time of Guillaume Lemaitre (2016-2017).

NYU Moore-Sloan Data Science Environment funded Andreas Mueller (2014-2016) to work on this project. The
Moore-Sloan Data Science Environment also funds several students to work on the project part-time.

Télécom Paristech funded Manoj Kumar (2014), Tom Dupré la Tour (2015), Raghav RV (2015-2017), Thierry
Guillemot (2016-2017) and Albert Thomas (2017) to work on scikit-learn.

Columbia University funds Andreas Müller since 2016.

Andreas Müller also received a grant to improve scikit-learn from the Alfred P. Sloan Foundation in 2017.

The University of Sydney funds Joel Nothman since July 2017.
The following students were sponsored by Google
to work on scikit-learn through the Google Summer of Code program.
• 2007 - David Cournapeau
• 2011 - Vlad Niculae
• 2012 - Vlad Niculae, Immanuel Bayer.
• 2013 - Kemal Eren, Nicolas Trésegnie
• 2014 - Hamzeh Alsalhi, Issam Laradji, Maheshakya Wijewardena, Manoj Kumar.
• 2015 - Raghav RV, Wei Xue
• 2016 - Nelson Liu, YenChen Lin
It also provided funding for sprints and events around scikit-learn. If you would like to participate in the next Google
Summer of code program, please see this page.
The NeuroDebian project providing Debian packaging and contributions is supported by Dr. James V. Haxby (Dartmouth College).
The PSF helped find and manage funding for our 2011 Granada sprint. More information can be found here
tinyclues funded the 2011 international Granada sprint.
Donating to the project
If you are interested in donating to the project or to one of our code-sprints, you can use the Paypal button below or the
NumFOCUS Donations Page (if you use the latter, please indicate that you are donating for the scikit-learn project).
All donations will be handled by NumFOCUS, a non-profit-organization which is managed by a board of Scipy
community members. NumFOCUS's mission is to foster scientific computing software, in particular in Python. As
a fiscal home of scikit-learn, it ensures that money is available when needed to keep the project funded and available
while in compliance with tax regulations.
The received donations for the scikit-learn project mostly will go towards covering travel expenses for code sprints, as
well as towards the organization budget of the project [1].
Notes
The 2013 Paris international sprint

Fig. 1.1: IAP VII/19 - DYSCO
For more information on this sprint, see here

1.5.6 Infrastructure support
• We would like to thank Rackspace for providing us with a free Rackspace Cloud account to automatically build
the documentation and the example gallery from for the development version of scikit-learn using this tool.
• We would also like to thank Shining Panda for free CPU time on their Continuous Integration server.
[1] Regarding the organization budget in particular, we might use some of the donated funds to pay for other project
expenses such as DNS, hosting or continuous integration services.

1.6 Who is using scikit-learn?
1.6.1 Spotify

Scikit-learn provides a toolbox with solid implementations of a bunch of state-of-the-art models and makes it easy to
plug them into existing applications. We’ve been using it quite a lot for music recommendations at Spotify and I think
it’s the most well-designed ML package I’ve seen so far.
Erik Bernhardsson, Engineering Manager Music Discovery & Machine Learning, Spotify

1.6.2 Inria

At INRIA, we use scikit-learn to support leading-edge basic research in many teams: Parietal for neuroimaging, Lear
for computer vision, Visages for medical image analysis, Privatics for security. The project is a fantastic tool to
address difficult applications of machine learning in an academic environment as it is performant and versatile, but all
easy-to-use and well documented, which makes it well suited to grad students.
Gaël Varoquaux, research at Parietal

1.6.3 betaworks

Betaworks is a NYC-based startup studio that builds new products, grows companies, and invests in others. Over
the past 8 years we’ve launched a handful of social data analytics-driven services, such as Bitly, Chartbeat, digg and
Scale Model. Consistently the betaworks data science team uses Scikit-learn for a variety of tasks. From exploratory
analysis, to product development, it is an essential part of our toolkit. Recent uses are included in digg’s new video
recommender system, and Poncho’s dynamic heuristic subspace clustering.
Gilad Lotan, Chief Data Scientist

1.6.4 Evernote

Building a classifier is typically an iterative process of exploring the data, selecting the features (the attributes of the
data believed to be predictive in some way), training the models, and finally evaluating them. For many of these tasks,
we relied on the excellent scikit-learn package for Python.
Read more
Mark Ayzenshtat, VP, Augmented Intelligence

1.6.5 Télécom ParisTech

At Telecom ParisTech, scikit-learn is used for hands-on sessions and home assignments in introductory and advanced
machine learning courses. The classes are for undergrads and masters students. The great benefit of scikit-learn is its
fast learning curve that allows students to quickly start working on interesting and motivating problems.
Alexandre Gramfort, Assistant Professor

1.6.6 Booking.com

At Booking.com, we use machine learning algorithms for many different applications, such as recommending hotels and destinations to our customers, detecting fraudulent reservations, or scheduling our customer service agents.
Scikit-learn is one of the tools we use when implementing standard algorithms for prediction tasks. Its API and documentations are excellent and make it easy to use. The scikit-learn developers do a great job of incorporating state of
the art implementations and new algorithms into the package. Thus, scikit-learn provides convenient access to a wide
spectrum of algorithms, and allows us to readily find the right tool for the right job.
Melanie Mueller, Data Scientist

1.6.7 AWeber

The scikit-learn toolkit is indispensable for the Data Analysis and Management team at AWeber. It allows us to do
AWesome stuff we would not otherwise have the time or resources to accomplish. The documentation is excellent,
allowing new engineers to quickly evaluate and apply many different algorithms to our data. The text feature extraction
utilities are useful when working with the large volume of email content we have at AWeber. The RandomizedPCA
implementation, along with Pipelining and FeatureUnions, allows us to develop complex machine learning algorithms
efficiently and reliably.
Anyone interested in learning more about how AWeber deploys scikit-learn in a production environment should check
out talks from PyData Boston by AWeber’s Michael Becker available at https://github.com/mdbecker/pydata_2013
Michael Becker, Software Engineer, Data Analysis and Management Ninjas

1.6.8 Yhat

The combination of consistent APIs, thorough documentation, and top notch implementation make scikit-learn our
favorite machine learning package in Python. scikit-learn makes doing advanced analysis in Python accessible to
anyone. At Yhat, we make it easy to integrate these models into your production applications. Thus eliminating the
unnecessary dev time encountered productionizing analytical work.
Greg Lamp, Co-founder Yhat

1.6.9 Rangespan

The Python scikit-learn toolkit is a core tool in the data science group at Rangespan. Its large collection of well
documented models and algorithms allow our team of data scientists to prototype fast and quickly iterate to find the
right solution to our learning problems. We find that scikit-learn is not only the right tool for prototyping, but its
careful and well tested implementation give us the confidence to run scikit-learn models in production.
Jurgen Van Gael, Data Science Director at Rangespan Ltd

1.6.10 Birchbox

At Birchbox, we face a range of machine learning problems typical to E-commerce: product recommendation, user
clustering, inventory prediction, trends detection, etc. Scikit-learn lets us experiment with many models, especially in
the exploration phase of a new project: the data can be passed around in a consistent way; models are easy to save and
reuse; updates keep us informed of new developments from the pattern discovery research community. Scikit-learn is
an important tool for our team, built the right way in the right language.
Thierry Bertin-Mahieux, Birchbox, Data Scientist

1.6.11 Bestofmedia Group

Scikit-learn is our #1 toolkit for all things machine learning at Bestofmedia. We use it for a variety of tasks (e.g. spam
fighting, ad click prediction, various ranking models) thanks to the varied, state-of-the-art algorithm implementations
packaged into it. In the lab it accelerates prototyping of complex pipelines. In production I can say it has proven to be
robust and efficient enough to be deployed for business critical components.
Eustache Diemert, Lead Scientist Bestofmedia Group

1.6.12 Change.org

At change.org we automate the use of scikit-learn’s RandomForestClassifier in our production systems to drive email
targeting that reaches millions of users across the world each week. In the lab, scikit-learn’s ease-of-use, performance,
and overall variety of algorithms implemented has proved invaluable in giving us a single reliable source to turn to for
our machine-learning needs.
Vijay Ramesh, Software Engineer in Data/science at Change.org

1.6.13 PHIMECA Engineering

At PHIMECA Engineering, we use scikit-learn estimators as surrogates for expensive-to-evaluate numerical models
(mostly but not exclusively finite-element mechanical models) for speeding up the intensive post-processing operations
involved in our simulation-based decision making framework. Scikit-learn’s fit/predict API together with its efficient
cross-validation tools considerably eases the task of selecting the best-fit estimator. We are also using scikit-learn for
illustrating concepts in our training sessions. Trainees are always impressed by the ease-of-use of scikit-learn despite
the apparent theoretical complexity of machine learning.
Vincent Dubourg, PHIMECA Engineering, PhD Engineer

1.6.14 HowAboutWe

At HowAboutWe, scikit-learn lets us implement a wide array of machine learning techniques in analysis and in production, despite having a small team. We use scikit-learn’s classification algorithms to predict user behavior, enabling
us to (for example) estimate the value of leads from a given traffic source early in the lead’s tenure on our site. Also, our
users’ profiles consist of primarily unstructured data (answers to open-ended questions), so we use scikit-learn’s feature extraction and dimensionality reduction tools to translate these unstructured data into inputs for our matchmaking
system.
Daniel Weitzenfeld, Senior Data Scientist at HowAboutWe

1.6.15 PeerIndex

At PeerIndex we use scientific methodology to build the Influence Graph - a unique dataset that allows us to identify
who’s really influential and in which context. To do this, we have to tackle a range of machine learning and predictive modeling problems. Scikit-learn has emerged as our primary tool for developing prototypes and making quick
progress. From predicting missing data and classifying tweets to clustering communities of social media users,
scikit-learn proved useful in a variety of applications. Its very intuitive interface and excellent compatibility with
other python tools makes it an indispensable tool in our daily research efforts.
Ferenc Huszar - Senior Data Scientist at Peerindex

1.6.16 DataRobot

DataRobot is building next generation predictive analytics software to make data scientists more productive, and
scikit-learn is an integral part of our system. The variety of machine learning techniques in combination with the
solid implementations that scikit-learn offers makes it a one-stop-shopping library for machine learning in Python.
Moreover, its consistent API, well-tested code and permissive licensing allow us to use it in a production environment.
Scikit-learn has literally saved us years of work we would have had to do ourselves to bring our product to market.
Jeremy Achin, CEO & Co-founder DataRobot Inc.

1.6.17 OkCupid

We’re using scikit-learn at OkCupid to evaluate and improve our matchmaking system. The range of features it has,
especially preprocessing utilities, means we can use it for a wide variety of projects, and it’s performant enough to
handle the volume of data that we need to sort through. The documentation is really thorough, as well, which makes
the library quite easy to use.
David Koh - Senior Data Scientist at OkCupid

1.6.18 Lovely

At Lovely, we strive to deliver the best apartment marketplace, with respect to our users and our listings. From
understanding user behavior, improving data quality, and detecting fraud, scikit-learn is a regular tool for gathering
insights, predictive modeling and improving our product. The easy-to-read documentation and intuitive architecture of
the API makes machine learning both explorable and accessible to a wide range of python developers. I’m constantly
recommending that more developers and scientists try scikit-learn.
Simon Frid - Data Scientist, Lead at Lovely
1.6.19 Data Publica

Data Publica builds a new predictive sales tool for commercial and marketing teams called C-Radar. We extensively
use scikit-learn to build segmentations of customers through clustering, and to predict future customers based on past
partnerships success or failure. We also categorize companies using their website communication thanks to scikit-learn
and its machine learning algorithm implementations. Eventually, machine learning makes it possible to detect weak
signals that traditional tools cannot see. All these complex tasks are performed in an easy and straightforward way
thanks to the great quality of the scikit-learn framework.
Guillaume Lebourgeois & Samuel Charron - Data Scientists at Data Publica

1.6.20 Machinalis

Scikit-learn is the cornerstone of all the machine learning projects carried out at Machinalis. It has a consistent API, a
wide selection of algorithms and lots of auxiliary tools to deal with the boilerplate. We have used it in production environments on a variety of projects including click-through rate prediction, information extraction, and even counting
sheep!
In fact, we use it so much that we’ve started to freeze our common use cases into Python packages, some of them
open-sourced, like FeatureForge. Scikit-learn in one word: Awesome.
Rafael Carrascosa, Lead developer

1.6.21 solido

Scikit-learn is helping to drive Moore’s Law, via Solido. Solido creates computer-aided design tools used by the
majority of top-20 semiconductor companies and fabs, to design the bleeding-edge chips inside smartphones, automobiles, and more. Scikit-learn helps to power Solido’s algorithms for rare-event estimation, worst-case verification,
optimization, and more. At Solido, we are particularly fond of scikit-learn’s libraries for Gaussian Process models,
large-scale regularized linear regression, and classification. Scikit-learn has increased our productivity, because for
many ML problems we no longer need to “roll our own” code. This PyData 2014 talk has details.
Trent McConaghy, founder, Solido Design Automation Inc.

1.6.22 INFONEA

We employ scikit-learn for rapid prototyping and custom-made Data Science solutions within our in-memory based
Business Intelligence Software INFONEA®. As a well-documented and comprehensive collection of state-of-the-art
algorithms and pipelining methods, scikit-learn enables us to provide flexible and scalable scientific analysis solutions.
Thus, scikit-learn is immensely valuable in realizing a powerful integration of Data Science technology within self-service business analytics.
Thorsten Kranz, Data Scientist, Coma Soft AG.

1.6.23 Dataiku

Our software, Data Science Studio (DSS), enables users to create data services that combine ETL with Machine
Learning. Our Machine Learning module integrates many scikit-learn algorithms. The scikit-learn library is a perfect
integration with DSS because it offers algorithms for virtually all business cases. Our goal is to offer a transparent and
flexible tool that makes it easier to optimize time consuming aspects of building a data service, preparing data, and
training machine learning algorithms on all types of data.
Florian Douetteau, CEO, Dataiku

1.6.24 Otto Group

Here at Otto Group, one of the global Big Five B2C online retailers, we are using scikit-learn in all aspects of our daily work, from data exploration to the development of machine learning applications to the productive deployment of those services. It helps us to tackle machine learning problems ranging from e-commerce to logistics. Its consistent APIs enabled us to build the Palladium REST-API framework around it and continuously deliver scikit-learn based services.
Christian Rammig, Head of Data Science, Otto Group

1.6.25 Zopa

At Zopa, the first ever Peer-to-Peer lending platform, we extensively use scikit-learn to run the business and optimize
our users’ experience. It powers our Machine Learning models involved in credit risk, fraud risk, marketing, and
pricing, and has been used for originating at least 1 billion GBP worth of Zopa loans. It is very well documented,
powerful, and simple to use. We are grateful for the capabilities it has provided, and for allowing us to deliver on our
mission of making money simple and fair.

Vlasios Vasileiou, Head of Data Science, Zopa

1.7 Release history
1.7.1 Version 0.19.1
October, 2017
This is a bug-fix release with some minor documentation improvements and enhancements to features released in
0.19.0.
Note there may be minor differences in TSNE output in this release (due to #9623), in the case where multiple samples
have equal distance to some sample.
Changelog
API changes
• Reverted the addition of metrics.ndcg_score and metrics.dcg_score which had been merged into
version 0.19.0 by error. The implementations were broken and undocumented.
• return_train_score, which was added to model_selection.GridSearchCV, model_selection.RandomizedSearchCV and model_selection.cross_validate in version 0.19.0, will be changing its default value from True to False in version 0.21. We found that calculating training scores could have a great effect on cross-validation runtime in some cases. Users should explicitly set return_train_score to False if prediction or scoring functions are slow, resulting in a deleterious effect on CV runtime, or to True if they wish to use the calculated scores (see the sketch after this list). #9677 by Kumar Ashutosh and Joel Nothman.
• correlation_models and regression_models from the legacy gaussian processes implementation
have been belatedly deprecated. #9717 by Kumar Ashutosh.
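As an illustration (not part of the changelog), a minimal sketch with toy data and a made-up parameter grid, setting return_train_score explicitly so behaviour stays the same across the 0.19/0.21 default change:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# Passing return_train_score explicitly avoids relying on the changing default;
# False skips the potentially slow training-score computation.
search = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]},
                      return_train_score=False)
search.fit(X, y)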
Bug fixes
• Avoid integer overflows in metrics.matthews_corrcoef. #9693 by Sam Steingold.
• Fix ValueError in preprocessing.LabelEncoder when using inverse_transform on unseen labels. #9816 by Charlie Newey.
• Fixed a bug in the objective function for manifold.TSNE (both exact and with the Barnes-Hut approximation)
when n_components >= 3. #9711 by @goncalo-rodrigues.
• Fix regression in model_selection.cross_val_predict where it raised an error with
method='predict_proba' for some probabilistic classifiers. #9641 by James Bourbeau.
• Fixed a bug where datasets.make_classification modified its input weights. #9865 by Sachin
Kelkar.
• model_selection.StratifiedShuffleSplit now works with multioutput multiclass or multilabel
data with more than 1000 columns. #9922 by Charlie Brummitt.
• Fixed a bug with nested and conditional parameter setting, e.g. setting a pipeline step and its parameter at the
same time. #9945 by Andreas Müller and Joel Nothman.
Regressions in 0.19.0 fixed in 0.19.1:

• Fixed a bug where parallelised prediction in random forests was not thread-safe and could (rarely) result in
arbitrary errors. #9830 by Joel Nothman.
• Fix regression in model_selection.cross_val_predict where it no longer accepted X as a list.
#9600 by Rasul Kerimov.
• Fixed handling of model_selection.cross_val_predict for binary classification with method='decision_function'. #9593 by Reiichiro Nakano and core devs.
• Fix regression in pipeline.Pipeline where it no longer accepted steps as a tuple. #9604 by Joris Van
den Bossche.
• Fix bug where n_iter was not properly deprecated, leaving n_iter unavailable for interim use in linear_model.SGDClassifier, linear_model.SGDRegressor, linear_model.PassiveAggressiveClassifier, linear_model.PassiveAggressiveRegressor and linear_model.Perceptron. #9558 by Andreas Müller.
• Dataset fetchers now make sure temporary files are closed before removing them; leaving them open previously caused errors on Windows. #9847 by Joan Massich.
• Fixed a regression in manifold.TSNE where it no longer supported metrics other than ‘euclidean’ and ‘precomputed’. #9623 by Oli Blum.
Enhancements
• Our test suite and utils.estimator_checks.check_estimators can now be run without Nose installed. #9697 by Joan Massich.
• To improve usability of version 0.19's pipeline.Pipeline caching, memory now allows joblib.Memory instances. This makes use of the new utils.validation.check_memory helper. #9584 by Kumar Ashutosh.
• Some fixes to examples: #9750, #9788, #9815
• Made a FutureWarning in SGD-based estimators less verbose. #9802 by Vrishank Bhardwaj.
Code and Documentation Contributors
With thanks to:
Joel Nothman, Loic Esteve, Andreas Mueller, Kumar Ashutosh, Vrishank Bhardwaj, Hanmin Qin, Rasul Kerimov,
James Bourbeau, Nagarjuna Kumar, Nathaniel Saul, Olivier Grisel, Roman Yurchak, Reiichiro Nakano, Sachin Kelkar,
Sam Steingold, Yaroslav Halchenko, diegodlh, felix, goncalo-rodrigues, jkleint, oliblum90, pasbi, Anthony Gitter, Ben
Lawson, Charlie Brummitt, Didi Bar-Zev, Gael Varoquaux, Joan Massich, Joris Van den Bossche, nielsenmarkus11

1.7.2 Version 0.19
August 12, 2017
Highlights
We are excited to release a number of great new features including neighbors.LocalOutlierFactor
for anomaly detection, preprocessing.QuantileTransformer for robust feature transformation, and
the multioutput.ClassifierChain meta-estimator to simply account for dependencies between classes
in multilabel problems. We have some new algorithms in existing estimators, such as multiplicative update in decomposition.NMF and multinomial linear_model.LogisticRegression with L1 penalty (use solver='saga').
Cross validation is now able to return the results from multiple metric evaluations. The new model_selection.
cross_validate can return many scores on the test data as well as training set performance and timings, and we
have extended the scoring and refit parameters for grid/randomized search to handle multiple metrics.
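For instance, a minimal sketch (toy data and arbitrary metric choices, not taken from the changelog) of multiple-metric evaluation with the new function:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
# Two metrics evaluated in one pass; the result is a dict of arrays holding
# test scores per metric, fit times, score times and (optionally) train scores.
results = cross_validate(LogisticRegression(), X, y,
                         scoring=['accuracy', 'f1_macro'], cv=5,
                         return_train_score=True)
print(sorted(results.keys()))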
You can also learn faster. For instance, the new option to cache transformations in pipeline.Pipeline makes
grid search over pipelines including slow transformations much more efficient. And you can predict faster: if you’re
sure you know what you’re doing, you can turn off validating that the input is finite using config_context.
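A minimal sketch of both options (the cache directory name is made up; caching pays off mainly when a grid search repeatedly refits a pipeline with slow transformers):

from sklearn import config_context
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# memory= caches fitted transformers on disk between identical calls.
pipe = Pipeline([('reduce', PCA(n_components=2)), ('clf', SVC())],
                memory='./sklearn_cache')
pipe.fit(X, y)
# Skip the finite-input check at prediction time, at your own risk.
with config_context(assume_finite=True):
    pipe.predict(X)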
We’ve made some important fixes too. We’ve fixed a longstanding implementation error in metrics.
average_precision_score, so please be cautious with prior results reported from that function. A number
of errors in the manifold.TSNE implementation have been fixed, particularly in the default Barnes-Hut approximation. semi_supervised.LabelSpreading and semi_supervised.LabelPropagation have had
substantial fixes. LabelPropagation was previously broken. LabelSpreading should now correctly respect its alpha
parameter.
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• cluster.KMeans with sparse X and initial centroids given (bug fix)
• cross_decomposition.PLSRegression with scale=True (bug fix)
• ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor
where min_impurity_split is used (bug fix)
• gradient boosting loss='quantile' (bug fix)
• ensemble.IsolationForest (bug fix)
• feature_selection.SelectFdr (bug fix)
• linear_model.RANSACRegressor (bug fix)
• linear_model.LassoLars (bug fix)
• linear_model.LassoLarsIC (bug fix)
• manifold.TSNE (bug fix)
• neighbors.NearestCentroid (bug fix)
• semi_supervised.LabelSpreading (bug fix)
• semi_supervised.LabelPropagation (bug fix)
• tree based models where min_weight_fraction_leaf is used (enhancement)
Details are listed in the changelog below.
(While we are trying to better inform users by providing this information, we cannot assure that this list is complete.)

Changelog
New features
Classifiers and regressors
• Added multioutput.ClassifierChain for multi-label classification. By Adam Kleczewski.
• Added solver 'saga' that implements the improved version of Stochastic Average Gradient, in
linear_model.LogisticRegression and linear_model.Ridge. It allows the use of L1 penalty
with multinomial logistic loss, and behaves marginally better than ‘sag’ during the first epochs of ridge and
logistic regression. #8446 by Arthur Mensch.
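For example, a minimal sketch (toy data and an arbitrary regularization strength, not from the changelog) of L1-penalized multinomial logistic regression with the new solver:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
# 'saga' is the solver that combines an L1 penalty with the multinomial loss.
clf = LogisticRegression(solver='saga', penalty='l1',
                         multi_class='multinomial', C=1.0, max_iter=200)
clf.fit(X, y)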
Other estimators
• Added the neighbors.LocalOutlierFactor class for anomaly detection based on nearest neighbors.
#5279 by Nicolas Goix and Alexandre Gramfort.
• Added the preprocessing.QuantileTransformer class and the preprocessing.quantile_transform function for feature normalization based on quantiles (see the sketch after this list). #8363 by Denis Engemann, Guillaume Lemaitre, Olivier Grisel, Raghav RV, Thierry Guillemot, and Gael Varoquaux.
• The new solver 'mu' implements a Multiplicative Update in decomposition.NMF, allowing the optimization
of all beta-divergences, including the Frobenius norm, the generalized Kullback-Leibler divergence and the
Itakura-Saito divergence. #5295 by Tom Dupre la Tour.
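As referenced in the QuantileTransformer entry above, a minimal sketch (synthetic skewed data and an arbitrary n_quantiles, not from the changelog) of quantile-based feature normalization:

import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(100, 2))  # heavy-tailed toy features
# Map each feature onto a uniform distribution using its empirical quantiles;
# output_distribution='normal' would map onto a Gaussian instead.
qt = QuantileTransformer(n_quantiles=50, output_distribution='uniform',
                         random_state=0)
X_trans = qt.fit_transform(X)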
Model selection and evaluation
• model_selection.GridSearchCV and model_selection.RandomizedSearchCV now support
simultaneous evaluation of multiple metrics. Refer to the Specifying multiple metrics for evaluation section of
the user guide for more information. #7388 by Raghav RV
• Added model_selection.cross_validate, which allows evaluation of multiple metrics. This function returns a dict with more useful information from cross-validation, such as the train scores, fit times and score times. Refer to The cross_validate function and multiple metric evaluation section of the user guide for more information. #7388 by Raghav RV.
• Added metrics.mean_squared_log_error, which computes the mean square error of the logarithmic
transformation of targets, particularly useful for targets with an exponential trend. #7655 by Karan Desai.
• Added metrics.dcg_score and metrics.ndcg_score, which compute Discounted cumulative gain
(DCG) and Normalized discounted cumulative gain (NDCG). #7739 by David Gasquez.
• Added the model_selection.RepeatedKFold and model_selection.RepeatedStratifiedKFold (see the sketch after this list). #8120 by Neeraj Gangwar.
• Added a scorer based on metrics.explained_variance_score. #9259 by Hanmin Qin.
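As referenced in the RepeatedKFold entry above, a minimal sketch (toy data and an arbitrary number of repeats, not from the changelog) of the new repeated splitters:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
# 5-fold cross-validation repeated 3 times with different shuffles: 15 scores.
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)
print(len(scores))  # 15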
Miscellaneous
• Validation that input data contains no NaN or inf can now be suppressed using config_context, at your
own risk. This will save on runtime, and may be particularly useful for prediction time. #7548 by Joel Nothman.
• Added a test to ensure that the parameter listing in docstrings matches the function/class signature. #9206 by Alexandre Gramfort and Raghav RV.
Enhancements
Trees and ensembles

• The min_weight_fraction_leaf constraint in tree construction is now more efficient, taking a fast path
to declare a node a leaf if its weight is less than 2 * the minimum. Note that the constructed tree will be different
from previous versions where min_weight_fraction_leaf is used. #7441 by Nelson Liu.
• ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor
now support sparse input for prediction. #6101 by Ibraim Ganiev.
• ensemble.VotingClassifier now allows changing estimators by using ensemble.VotingClassifier.set_params. An estimator can also be removed by setting it to None. #7674 by Yichuan Liu.
• tree.export_graphviz now shows configurable number of decimal places. #8698 by Guillaume Lemaitre.
• Added flatten_transform parameter to ensemble.VotingClassifier to change output shape of
transform method to 2 dimensional. #7794 by Ibraim Ganiev and Herilalaina Rakotoarison.
Linear, kernelized and related models
• linear_model.SGDClassifier, linear_model.SGDRegressor, linear_model.PassiveAggressiveClassifier, linear_model.PassiveAggressiveRegressor and linear_model.Perceptron now expose max_iter and tol parameters, to handle convergence more precisely. The n_iter parameter is deprecated, and the fitted estimator exposes an n_iter_ attribute with the actual number of iterations before convergence (see the sketch after this list). #5036 by Tom Dupre la Tour.
• Added average parameter to perform weight averaging in linear_model.PassiveAggressiveClassifier. #4939 by Andrea Esuli.
• linear_model.RANSACRegressor no longer throws an error when calling fit if no inliers are found in
its first iteration. Furthermore, causes of skipped iterations are tracked in newly added attributes, n_skips_*.
#7914 by Michael Horrell.
• In gaussian_process.GaussianProcessRegressor, method predict is a lot faster with
return_std=True. #8591 by Hadrien Bertrand.
• Added return_std to predict method of linear_model.ARDRegression and linear_model.
BayesianRidge. #7838 by Sergey Feldman.
• Memory usage enhancements: Prevent cast from float32 to float64 in: linear_model.MultiTaskElasticNet; linear_model.LogisticRegression when using newton-cg solver; and linear_model.Ridge when using svd, sparse_cg, cholesky or lsqr solvers. #8835, #8061 by Joan Massich and Nicolas Cordier and Thierry Guillemot.
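As referenced in the first item of this list, a minimal sketch (toy data and an arbitrary tolerance, not from the changelog) of the new convergence controls:

from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)
# max_iter and tol replace the deprecated n_iter; after fitting, n_iter_ holds
# the number of epochs actually run before convergence.
clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X, y)
print(clf.n_iter_)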
Other predictors
• Custom metrics for the neighbors binary trees now have fewer constraints: they must take two 1d-arrays and
return a float. #6288 by Jake Vanderplas.
• algorithm='auto' in neighbors estimators now chooses the most appropriate algorithm for all input types and metrics. #9145 by Herilalaina Rakotoarison and Reddy Chinthala.
Decomposition, manifold learning and clustering
• cluster.MiniBatchKMeans and cluster.KMeans now use significantly less memory when assigning
data points to their nearest cluster center. #7721 by Jon Crall.
• decomposition.PCA, decomposition.IncrementalPCA and decomposition.TruncatedSVD now expose the singular values from the underlying SVD. They are stored in the attribute singular_values_, like in decomposition.IncrementalPCA. #7685 by Tommy Löfstedt.
• Fixed the implementation of noise_variance_ in decomposition.PCA. #9108 by Hanmin Qin.
• decomposition.NMF now faster when beta_loss=0. #9277 by @hongkahjun.

• Memory improvements for method barnes_hut in manifold.TSNE #7089 by Thomas Moreau and Olivier
Grisel.
• Optimization schedule improvements for Barnes-Hut manifold.TSNE so the results are closer to the one
from the reference implementation lvdmaaten/bhtsne by Thomas Moreau and Olivier Grisel.
• Memory usage enhancements: Prevent cast from float32 to float64 in decomposition.PCA and
decomposition.randomized_svd_low_rank. #9067 by Raghav RV.
Preprocessing and feature selection
• Added norm_order parameter to feature_selection.SelectFromModel to enable selection of the
norm order when coef_ is more than 1D. #6181 by Antoine Wendlinger.
• Added ability to use sparse matrices in feature_selection.f_regression with center=True.
#8065 by Daniel LeJeune.
• Small performance improvement to n-gram creation in feature_extraction.text by binding methods
for loops and special-casing unigrams. #7567 by Jaye Doepke
• Relax assumption on the data for the kernel_approximation.SkewedChi2Sampler. Since the Skewed-Chi2 kernel is defined on the open interval (-skewedness, +∞)^d, the transform function should not check whether X < 0 but whether X < -self.skewedness. #7573 by Romain Brault.
• Made default kernel parameters kernel-dependent in kernel_approximation.Nystroem. #5229 by
Saurabh Bansod and Andreas Müller.
Model evaluation and meta-estimators
• pipeline.Pipeline is now able to cache transformers within a pipeline by using the memory constructor
parameter. #7990 by Guillaume Lemaitre.
• pipeline.Pipeline steps can now be accessed as attributes of its named_steps attribute (see the sketch after this list). #8586 by Herilalaina Rakotoarison.
• Added sample_weight parameter to pipeline.Pipeline.score. #7723 by Mikhail Korobov.
• Added ability to set n_jobs parameter to pipeline.make_union. A TypeError will be raised for any
other kwargs. #8028 by Alexander Booth.
• model_selection.GridSearchCV, model_selection.RandomizedSearchCV and model_selection.cross_val_score now allow estimators with callable kernels which were previously prohibited. #8005 by Andreas Müller.
• model_selection.cross_val_predict now returns output of the correct shape for all values of the
argument method. #7863 by Aman Dalmia.
• Added shuffle and random_state parameters to shuffle training data before taking prefixes of it based on
training sizes in model_selection.learning_curve. #7506 by Narine Kokhlikyan.
• model_selection.StratifiedShuffleSplit now works with multioutput multiclass (or multilabel)
data. #9044 by Vlad Niculae.
• Speed improvements to model_selection.StratifiedShuffleSplit. #5991 by Arthur Mensch and
Joel Nothman.
• Add shuffle parameter to model_selection.train_test_split. #8845 by themrmax
• multioutput.MultiOutputRegressor and multioutput.MultiOutputClassifier now support online learning using partial_fit. #8053 by Peng Yu.
• Add max_train_size parameter to model_selection.TimeSeriesSplit #8282 by Aman Dalmia.
• More clustering metrics are now available through metrics.get_scorer and scoring parameters. #8117
by Raghav RV.
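As referenced in the named_steps item above, a minimal sketch (hypothetical step names, not from the changelog) of attribute-style access to pipeline steps:

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('scale', StandardScaler()), ('clf', SVC())])
pipe.fit(X, y)
# Steps are reachable both as dict entries and, from 0.19 on, as attributes.
assert pipe.named_steps['scale'] is pipe.named_steps.scale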

Metrics
• metrics.matthews_corrcoef now supports multiclass classification. #8094 by Jon Crall.
• Add sample_weight parameter to metrics.cohen_kappa_score. #8335 by Victor Poughon.
Miscellaneous
• utils.check_estimator now attempts to ensure that methods transform, predict, etc. do not set attributes
on the estimator. #7533 by Ekaterina Krivich.
• Added type checking to the accept_sparse parameter in utils.validation methods. This parameter
now accepts only boolean, string, or list/tuple of strings. accept_sparse=None is deprecated and should
be replaced by accept_sparse=False. #7880 by Josh Karnofsky.
• Make it possible to load a chunk of an svmlight formatted file by passing a range of bytes to datasets.
load_svmlight_file. #935 by Olivier Grisel.
• dummy.DummyClassifier and dummy.DummyRegressor now accept non-finite features. #8931 by
@Attractadore.
Bug fixes
Trees and ensembles
• Fixed a memory leak in trees when using trees with criterion='mae'. #8002 by Raghav RV.
• Fixed a bug where ensemble.IsolationForest uses an incorrect formula for the average path length. #8549 by Peter Wang.
• Fixed a bug where ensemble.AdaBoostClassifier throws ZeroDivisionError while fitting data
with single class labels. #7501 by Dominik Krzeminski.
• Fixed a bug in ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor where a float being compared to 0.0 using == caused a divide by zero error. #7970 by He Chen.
• Fix a bug where ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor ignored the min_impurity_split parameter. #8006 by Sebastian Pölsterl.
• Fixed oob_score in ensemble.BaggingClassifier. #8936 by Michael Lewis
• Fixed excessive memory usage in prediction for random forests estimators. #8672 by Mike Benfield.
• Fixed a bug where sample_weight as a list broke random forests in Python 2 #8068 by @xor.
• Fixed a bug where ensemble.IsolationForest fails when max_features is less than 1. #5732 by
Ishank Gulati.
• Fix a bug where gradient boosting with loss='quantile' computed negative errors for negative values of
ytrue - ypred leading to wrong values when calling __call__. #8087 by Alexis Mignon
• Fix a bug where ensemble.VotingClassifier raises an error when a numpy array is passed in for
weights. #7983 by Vincent Pham.
• Fixed a bug where tree.export_graphviz raised an error when the length of features_names does not
match n_features in the decision tree. #8512 by Li Li.
Linear, kernelized and related models
• Fixed a bug where linear_model.RANSACRegressor.fit may run until max_iter if it finds a large
inlier group early. #8251 by @aivision2020.

• Fixed a bug where naive_bayes.MultinomialNB and naive_bayes.BernoulliNB failed when
alpha=0. #5814 by Yichuan Liu and Herilalaina Rakotoarison.
• Fixed a bug where linear_model.LassoLars does not give the same result as the LassoLars implementation available in R (lars library). #7849 by Jair Montoya Martinez.
• Fixed a bug in linear_model.RandomizedLasso, linear_model.Lars, linear_model.LassoLars, linear_model.LarsCV and linear_model.LassoLarsCV, where the parameter precompute was not used consistently across classes, and some values proposed in the docstring could raise errors. #5359 by Tom Dupre la Tour.
• Fix inconsistent results between linear_model.RidgeCV and linear_model.Ridge when using
normalize=True. #9302 by Alexandre Gramfort.
• Fix a bug where linear_model.LassoLars.fit sometimes left coef_ as a list, rather than an ndarray.
#8160 by CJ Carey.
• Fix linear_model.BayesianRidge.fit to return ridge parameter alpha_ and lambda_ consistent
with calculated coefficients coef_ and intercept_. #8224 by Peter Gedeck.
• Fixed a bug in svm.OneClassSVM where it returned floats instead of integer classes. #8676 by Vathsala
Achar.
• Fix AIC/BIC criterion computation in linear_model.LassoLarsIC. #9022 by Alexandre Gramfort and
Mehmet Basbug.
• Fixed a memory leak in our LibLinear implementation. #9024 by Sergei Lebedev
• Fix bug where stratified CV splitters did not work with linear_model.LassoCV . #8973 by Paulo Haddad.
• Fixed a bug in gaussian_process.GaussianProcessRegressor where the standard deviation and covariance predicted without fit would fail with an uninformative error by default. #6573 by Quazi Marufur Rahman and Manoj Kumar.
Other predictors
• Fix semi_supervised.BaseLabelPropagation to correctly implement LabelPropagation and
LabelSpreading as done in the referenced papers. #9239 by Andre Ambrosio Boechat, Utkarsh Upadhyay,
and Joel Nothman.
Decomposition, manifold learning and clustering
• Fixed the implementation of manifold.TSNE:
– The early_exaggeration parameter had no effect and is now used for the first 250 optimization iterations.
– Fixed the AssertionError: Tree consistency failed exception reported in #8992.
– Improved the learning schedule to match the one from the reference implementation lvdmaaten/bhtsne.
by Thomas Moreau and Olivier Grisel.
• Fix a bug in decomposition.LatentDirichletAllocation where the perplexity method was
returning incorrect results because the transform method returns normalized document topic distributions as
of version 0.18. #7954 by Gary Foreman.
• Fix output shape and bugs with n_jobs > 1 in decomposition.SparseCoder transform and
decomposition.sparse_encode for one-dimensional data and one component. This also impacts the
output shape of decomposition.DictionaryLearning. #8086 by Andreas Müller.
• Fixed the implementation of explained_variance_ in decomposition.PCA, decomposition.
RandomizedPCA and decomposition.IncrementalPCA. #9105 by Hanmin Qin.
• Fixed the implementation of noise_variance_ in decomposition.PCA. #9108 by Hanmin Qin.

• Fixed a bug where cluster.DBSCAN gives incorrect result when input is a precomputed sparse matrix with
initial rows all zero. #8306 by Akshay Gupta
• Fix a bug regarding fitting cluster.KMeans with a sparse array X and initial centroids, where X’s means
were unnecessarily being subtracted from the centroids. #7872 by Josh Karnofsky.
• Fixes to the input validation in covariance.EllipticEnvelope. #8086 by Andreas Müller.
• Fixed a bug in covariance.MinCovDet where inputting data that produced a singular covariance matrix
would cause the helper method _c_step to throw an exception. #3367 by Jeremy Steward
• Fixed a bug in manifold.TSNE affecting convergence of the gradient descent. #8768 by David DeTomaso.
• Fixed a bug in manifold.TSNE where it stored the incorrect kl_divergence_. #6507 by Sebastian
Saeger.
• Fixed improper scaling in cross_decomposition.PLSRegression with scale=True. #7819 by
jayzed82.
• cluster.bicluster.SpectralCoclustering and cluster.bicluster.SpectralBiclustering fit method conforms with API by accepting y and returning the object. #6126, #7814 by Laurent Direr and Maniteja Nandana.
• Fix bug where mixture sample methods did not return as many samples as requested. #7702 by Levi John
Wolf.
• Fixed the shrinkage implementation in neighbors.NearestCentroid. #9219 by Hanmin Qin.
Preprocessing and feature selection
• For sparse matrices, preprocessing.normalize with return_norm=True will now raise a
NotImplementedError with ‘l1’ or ‘l2’ norm and with norm ‘max’ the norms returned will be the same as
for dense matrices. #7771 by Ang Lu.
• Fix a bug where feature_selection.SelectFdr did not exactly implement Benjamini-Hochberg procedure. It formerly may have selected fewer features than it should. #7490 by Peng Meng.
• Fixed a bug where linear_model.RandomizedLasso and linear_model.RandomizedLogisticRegression broke for sparse input. #8259 by Aman Dalmia.
• Fix a bug where feature_extraction.FeatureHasher mandatorily applied a sparse random projection to the hashed features, preventing the use of feature_extraction.text.HashingVectorizer
in a pipeline with feature_extraction.text.TfidfTransformer. #7565 by Roman Yurchak.
• Fix a bug where feature_selection.mutual_info_regression did not correctly use
n_neighbors. #8181 by Guillaume Lemaitre.
Model evaluation and meta-estimators
• Fixed a bug where model_selection.BaseSearchCV.inverse_transform returns self.best_estimator_.transform() instead of self.best_estimator_.inverse_transform(). #8344 by Akshay Gupta and Rasmus Eriksson.
• Added classes_ attribute to model_selection.GridSearchCV, model_selection.RandomizedSearchCV, grid_search.GridSearchCV, and grid_search.RandomizedSearchCV that matches the classes_ attribute of best_estimator_. #7661 and #8295 by Alyssa Batula, Dylan Werner-Meier, and Stephen Hoover.
• Fixed a bug where model_selection.validation_curve reused the same estimator for each parameter value. #7365 by Aleksandr Sandrovskii.
• model_selection.permutation_test_score now works with Pandas types. #5697 by Stijn Tonk.
• Several fixes to input validation in multiclass.OutputCodeClassifier #8086 by Andreas Müller.

• multiclass.OneVsOneClassifier’s partial_fit now ensures all classes are provided up-front.
#6250 by Asish Panda.
• Fix multioutput.MultiOutputClassifier.predict_proba to return a list of 2d arrays, rather
than a 3d array. In the case where different target columns had different numbers of classes, a ValueError
would be raised on trying to stack matrices with different dimensions. #8093 by Peter Bull.
• Cross validation now works with Pandas datatypes that have a read-only index. #9507 by Loic Esteve.
Metrics
• metrics.average_precision_score no longer linearly interpolates between operating points, and instead weighs precisions by the change in recall since the last operating point, as per the Wikipedia entry. (#7356).
By Nick Dingwall and Gael Varoquaux.
• Fix a bug in metrics.classification._check_targets which would return 'binary' if y_true
and y_pred were both 'binary' but the union of y_true and y_pred was 'multiclass'. #8377 by
Loic Esteve.
• Fixed an integer overflow bug in metrics.confusion_matrix and hence metrics.cohen_kappa_score. #8354, #7929 by Joel Nothman and Jon Crall.
• Fixed passing of gamma parameter to the chi2 kernel in metrics.pairwise.pairwise_kernels. #5211 by Nick Rhinehart, Saurabh Bansod and Andreas Müller.
Miscellaneous
• Fixed a bug where datasets.make_classification failed when generating more than 30 features. #8159 by Herilalaina Rakotoarison.
• Fixed a bug where datasets.make_moons gives an incorrect result when n_samples is odd. #8198 by
Josh Levy.
• Some fetch_ functions in datasets were ignoring the download_if_missing keyword. #7944 by
Ralf Gommers.
• Fix estimators to accept a sample_weight parameter of type pandas.Series in their fit function.
#7825 by Kathleen Chen.
• Fix a bug in cases where numpy.cumsum may be numerically unstable, raising an exception if instability is
identified. #7376 and #7331 by Joel Nothman and @yangarbiter.
• Fix a bug where base.BaseEstimator.__getstate__ obstructed pickling customizations of child classes, when used in a multiple inheritance context. #8316 by Holger Peters.
• Update Sphinx-Gallery from 0.1.4 to 0.1.7 for resolving links in documentation build with Sphinx>1.5 #8010,
#7986 by Oscar Najera
• Add data_home parameter to sklearn.datasets.fetch_kddcup99. #9289 by Loic Esteve.
• Fix dataset loaders using Python 3 version of makedirs to also work in Python 2. #9284 by Sebastin Santy.
• Several minor issues were fixed with thanks to the alerts of lgtm.com. #9278 by Jean Helie, among others.
API changes summary
Trees and ensembles
• Gradient boosting base models are no longer estimators. By Andreas Müller.
• All tree based estimators now accept a min_impurity_decrease parameter in lieu of the min_impurity_split, which is now deprecated. The min_impurity_decrease helps stop splitting the nodes in which the weighted impurity decrease from splitting is no longer at least min_impurity_decrease. #8449 by Raghav RV.
Linear, kernelized and related models
• n_iter parameter is deprecated in linear_model.SGDClassifier, linear_model.SGDRegressor, linear_model.PassiveAggressiveClassifier, linear_model.PassiveAggressiveRegressor and linear_model.Perceptron. By Tom Dupre la Tour.
Other predictors
• neighbors.LSHForest has been deprecated and will be removed in 0.21 due to poor performance. #9078
by Laurent Direr.
• neighbors.NearestCentroid no longer purports to support metric='precomputed' which now
raises an error. #8515 by Sergul Aydore.
• The alpha parameter of semi_supervised.LabelPropagation now has no effect and is deprecated
to be removed in 0.21. #9239 by Andre Ambrosio Boechat, Utkarsh Upadhyay, and Joel Nothman.
Decomposition, manifold learning and clustering
• Deprecate the doc_topic_distr argument of the perplexity method in decomposition.
LatentDirichletAllocation because the user no longer has access to the unnormalized document topic
distribution needed for the perplexity calculation. #7954 by Gary Foreman.
• The n_topics parameter of decomposition.LatentDirichletAllocation has been renamed to
n_components and will be removed in version 0.21. #8922 by @Attractadore.
• decomposition.SparsePCA.transform’s ridge_alpha parameter is deprecated in preference for
class parameter. #8137 by Naoya Kanai.
• cluster.DBSCAN now has a metric_params parameter. #8139 by Naoya Kanai.
Preprocessing and feature selection
• feature_selection.SelectFromModel now has a partial_fit method only if the underlying estimator does. By Andreas Müller.
• feature_selection.SelectFromModel now validates the threshold parameter and sets the threshold_ attribute during the call to fit, and no longer during the call to transform. By Andreas Müller.
• The non_negative parameter in feature_extraction.FeatureHasher has been deprecated, and
replaced with a more principled alternative, alternate_sign. #7565 by Roman Yurchak.
• linear_model.RandomizedLogisticRegression, and linear_model.RandomizedLasso
have been deprecated and will be removed in version 0.21. #8995 by Ramana.S.
Model evaluation and meta-estimators
• Deprecate the fit_params constructor input to the model_selection.GridSearchCV and
model_selection.RandomizedSearchCV in favor of passing keyword parameters to the fit methods
of those classes. Data-dependent parameters needed for model training should be passed as keyword arguments
to fit, and conforming to this convention will allow the hyperparameter selection classes to be used with tools
such as model_selection.cross_val_predict. #2879 by Stephen Hoover.
• In version 0.21, the default behavior of splitters that use the test_size and train_size parameter will
change, such that specifying train_size alone will cause test_size to be the remainder. #7459 by Nelson
Liu.
• multiclass.OneVsRestClassifier now has partial_fit, decision_function and
predict_proba methods only when the underlying estimator does. #7812 by Andreas Müller and Mikhail
Korobov.

• multiclass.OneVsRestClassifier now has a partial_fit method only if the underlying estimator does. By Andreas Müller.
• The decision_function output shape for binary classification in multiclass.OneVsRestClassifier and multiclass.OneVsOneClassifier is now (n_samples,) to conform to scikit-learn conventions. #9100 by Andreas Müller.
• The multioutput.MultiOutputClassifier.predict_proba function used to return a 3d array
(n_samples, n_classes, n_outputs). In the case where different target columns had different numbers
of classes, a ValueError would be raised on trying to stack matrices with different dimensions. This function now returns a list of arrays where the length of the list is n_outputs, and each array is (n_samples,
n_classes) for that particular output. #8093 by Peter Bull.
• Replaced the named_steps dict attribute of pipeline.Pipeline with a utils.Bunch to enable tab completion in interactive environments. In the case of a conflict between a named_steps value and a dict attribute, the dict behavior will be prioritized. #8481 by Herilalaina Rakotoarison.
Miscellaneous
• Deprecate the y parameter in transform and inverse_transform. The method should not accept y
parameter, as it’s used at the prediction time. #8174 by Tahar Zanouda, Alexandre Gramfort and Raghav RV.
• SciPy >= 0.13.3 and NumPy >= 1.8.2 are now the minimum supported versions for scikit-learn. The following
backported functions in utils have been removed or deprecated accordingly. #8854 and #8874 by Naoya
Kanai
• The store_covariances and covariances_ parameters of discriminant_analysis.QuadraticDiscriminantAnalysis have been renamed to store_covariance and covariance_ to be consistent with the corresponding parameter names of discriminant_analysis.LinearDiscriminantAnalysis. They will be removed in version 0.21. #7998 by Jiacheng
Removed in 0.19:
– utils.fixes.argpartition
– utils.fixes.array_equal
– utils.fixes.astype
– utils.fixes.bincount
– utils.fixes.expit
– utils.fixes.frombuffer_empty
– utils.fixes.in1d
– utils.fixes.norm
– utils.fixes.rankdata
– utils.fixes.safe_copy
Deprecated in 0.19, to be removed in 0.21:
– utils.arpack.eigs
– utils.arpack.eigsh
– utils.arpack.svds
– utils.extmath.fast_dot
– utils.extmath.logsumexp
– utils.extmath.norm

– utils.extmath.pinvh
– utils.graph.graph_laplacian
– utils.random.choice
– utils.sparsetools.connected_components
– utils.stats.rankdata
• Estimators with both methods decision_function and predict_proba are now required to have a
monotonic relation between them. The method check_decision_proba_consistency has been added
in utils.estimator_checks to check their consistency. #7578 by Shubham Bhardwaj
• All checks in utils.estimator_checks, in particular utils.estimator_checks.check_estimator, now accept estimator instances. Most other checks do not accept estimator classes any more. #9019 by Andreas Müller.
• Ensure that estimators’ attributes ending with _ are not set in the constructor but only in the fit method.
Most notably, ensemble estimators (deriving from ensemble.BaseEnsemble) now only have self.
estimators_ available after fit. #7464 by Lars Buitinck and Loic Esteve.
Code and Documentation Contributors
Thanks to everyone who has contributed to the maintenance and improvement of the project since version 0.18, including:
Joel Nothman, Loic Esteve, Andreas Mueller, Guillaume Lemaitre, Olivier Grisel, Hanmin Qin, Raghav RV, Alexandre
Gramfort, themrmax, Aman Dalmia, Gael Varoquaux, Naoya Kanai, Tom Dupré la Tour, Rishikesh, Nelson Liu, Taehoon Lee, Nelle Varoquaux, Aashil, Mikhail Korobov, Sebastin Santy, Joan Massich, Roman Yurchak, RAKOTOARISON Herilalaina, Thierry Guillemot, Alexandre Abadie, Carol Willing, Balakumaran Manoharan, Josh Karnofsky,
Vlad Niculae, Utkarsh Upadhyay, Dmitry Petrov, Minghui Liu, Srivatsan, Vincent Pham, Albert Thomas, Jake VanderPlas, Attractadore, JC Liu, alexandercbooth, chkoar, Óscar Nájera, Aarshay Jain, Kyle Gilliam, Ramana Subramanyam, CJ Carey, Clement Joudet, David Robles, He Chen, Joris Van den Bossche, Karan Desai, Katie Luangkote,
Leland McInnes, Maniteja Nandana, Michele Lacchia, Sergei Lebedev, Shubham Bhardwaj, akshay0724, omtcyfz,
rickiepark, waterponey, Vathsala Achar, jbDelafosse, Ralf Gommers, Ekaterina Krivich, Vivek Kumar, Ishank Gulati,
Dave Elliott, ldirer, Reiichiro Nakano, Levi John Wolf, Mathieu Blondel, Sid Kapur, Dougal J. Sutherland, midinas,
mikebenfield, Sourav Singh, Aseem Bansal, Ibraim Ganiev, Stephen Hoover, AishwaryaRK, Steven C. Howell, Gary
Foreman, Neeraj Gangwar, Tahar, Jon Crall, dokato, Kathy Chen, ferria, Thomas Moreau, Charlie Brummitt, Nicolas
Goix, Adam Kleczewski, Sam Shleifer, Nikita Singh, Basil Beirouti, Giorgio Patrini, Manoj Kumar, Rafael Possas,
James Bourbeau, James A. Bednar, Janine Harper, Jaye, Jean Helie, Jeremy Steward, Artsiom, John Wei, Jonathan
LIgo, Jonathan Rahn, seanpwilliams, Arthur Mensch, Josh Levy, Julian Kuhlmann, Julien Aubert, Jörn Hees, Kai,
shivamgargsya, Kat Hempstalk, Kaushik Lakshmikanth, Kennedy, Kenneth Lyons, Kenneth Myers, Kevin Yap, Kirill Bobyrev, Konstantin Podshumok, Arthur Imbert, Lee Murray, toastedcornflakes, Lera, Li Li, Arthur Douillard,
Mainak Jas, tobycheese, Manraj Singh, Manvendra Singh, Marc Meketon, MarcoFalke, Matthew Brett, Matthias
Gilch, Mehul Ahuja, Melanie Goetz, Meng, Peng, Michael Dezube, Michal Baumgartner, vibrantabhi19, Artem Golubin, Milen Paskov, Antonin Carette, Morikko, MrMjauh, NALEPA Emmanuel, Namiya, Antoine Wendlinger, Narine
Kokhlikyan, NarineK, Nate Guerin, Angus Williams, Ang Lu, Nicole Vavrova, Nitish Pandey, Okhlopkov Daniil
Olegovich, Andy Craze, Om Prakash, Parminder Singh, Patrick Carlson, Patrick Pei, Paul Ganssle, Paulo Haddad,
Paweł Lorek, Peng Yu, Pete Bachant, Peter Bull, Peter Csizsek, Peter Wang, Pieter Arthur de Jong, Ping-Yao, Chang,
Preston Parry, Puneet Mathur, Quentin Hibon, Andrew Smith, Andrew Jackson, 1kastner, Rameshwar Bhaskaran, Rebecca Bilbro, Remi Rampin, Andrea Esuli, Rob Hall, Robert Bradshaw, Romain Brault, Aman Pratik, Ruifeng Zheng,
Russell Smith, Sachin Agarwal, Sailesh Choyal, Samson Tan, Samuël Weber, Sarah Brown, Sebastian Pölsterl, Sebastian Raschka, Sebastian Saeger, Alyssa Batula, Abhyuday Pratap Singh, Sergey Feldman, Sergul Aydore, Sharan
Yalburgi, willduan, Siddharth Gupta, Sri Krishna, Almer, Stijn Tonk, Allen Riddell, Theofilos Papapanagiotou, Alison,
Alexis Mignon, Tommy Boucher, Tommy Löfstedt, Toshihiro Kamishima, Tyler Folkman, Tyler Lanigan, Alexander
Junge, Varun Shenoy, Victor Poughon, Vilhelm von Ehrenheim, Aleksandr Sandrovskii, Alan Yee, Vlasios Vasileiou,

Warut Vijitbenjaronk, Yang Zhang, Yaroslav Halchenko, Yichuan Liu, Yuichi Fujikawa, affanv14, aivision2020, xor,
andreh7, brady salz, campustrampus, Agamemnon Krasoulis, ditenberg, elena-sharova, filipj8, fukatani, gedeck, guiniol, guoci, hakaa1, hongkahjun, i-am-xhy, jakirkham, jaroslaw-weber, jayzed82, jeroko, jmontoyam, jonathan.striebel,
josephsalmon, jschendel, leereeves, martin-hahn, mathurinm, mehak-sachdeva, mlewis1729, mlliou112, mthorrell,
ndingwall, nuffe, yangarbiter, plagree, pldtc325, Breno Freitas, Brett Olsen, Brian A. Alfano, Brian Burns, polmauri,
Brandon Carter, Charlton Austin, Chayant T15h, Chinmaya Pancholi, Christian Danielsen, Chung Yen, Chyi-Kwei
Yau, pravarmahajan, DOHMATOB Elvis, Daniel LeJeune, Daniel Hnyk, Darius Morawiec, David DeTomaso, David
Gasquez, David Haberthür, David Heryanto, David Kirkby, David Nicholson, rashchedrin, Deborah Gertrude Digges,
Denis Engemann, Devansh D, Dickson, Bob Baxley, Don86, E. Lynch-Klarup, Ed Rogers, Elizabeth Ferriss, EllenCo2, Fabian Egli, Fang-Chieh Chou, Bing Tian Dai, Greg Stupp, Grzegorz Szpak, Bertrand Thirion, Hadrien Bertrand,
Harizo Rajaona, zxcvbnius, Henry Lin, Holger Peters, Icyblade Dai, Igor Andriushchenko, Ilya, Isaac Laughlin, Iván
Vallés, Aurélien Bellet, JPFrancoia, Jacob Schreiber, Asish Mahapatra

1.7.3 Version 0.18.2
June 20, 2017
Last release with Python 2.6 support
Scikit-learn 0.18 is the last major release of scikit-learn to support Python 2.6. Later versions of scikit-learn will
require Python 2.7 or above.

Changelog
• Fixes for compatibility with NumPy 1.13.0: #7946 #8355 by Loic Esteve.
• Minor compatibility changes in the examples #9010 #8040 #9149.
Code Contributors
Aman Dalmia, Loic Esteve, Nate Guerin, Sergei Lebedev

1.7.4 Version 0.18.1
November 11, 2016
Changelog
Enhancements
• Improved sample_without_replacement speed by utilizing numpy.random.permutation for most cases.
As a result, samples may differ in this release for a fixed random state. Affected estimators:
– ensemble.BaggingClassifier
– ensemble.BaggingRegressor
– linear_model.RANSACRegressor
– model_selection.RandomizedSearchCV

– random_projection.SparseRandomProjection
This also affects the datasets.make_classification method.
Bug fixes
• Fix issue where min_grad_norm and n_iter_without_progress parameters were not being utilised
by manifold.TSNE. #6497 by Sebastian Säger
• Fix bug for svm’s decision values when decision_function_shape is ovr in svm.SVC. svm.SVC’s
decision_function was incorrect from versions 0.17.0 through 0.18.0. #7724 by Bing Tian Dai
• Attribute explained_variance_ratio of discriminant_analysis.LinearDiscriminantAnalysis calculated with SVD and Eigen solver are now of the same length. #7632 by JPFrancoia
• Fixes issue in Univariate feature selection where score functions were not accepting multi-label targets. #7676
by Mohammed Affan
• Fixed setting parameters when calling fit multiple times on feature_selection.SelectFromModel.
#7756 by Andreas Müller
• Fixes issue in partial_fit method of multiclass.OneVsRestClassifier when number of classes
used in partial_fit was less than the total number of classes in the data. #7786 by Srivatsan Ramesh
• Fixes issue in calibration.CalibratedClassifierCV where the sum of probabilities of each class
for a data was not 1, and CalibratedClassifierCV now handles the case where the training set has less
number of classes than the total data. #7799 by Srivatsan Ramesh
• Fix a bug where sklearn.feature_selection.SelectFdr did not exactly implement BenjaminiHochberg procedure. It formerly may have selected fewer features than it should. #7490 by Peng Meng.
• sklearn.manifold.LocallyLinearEmbedding now correctly handles integer inputs. #6282 by Jake
Vanderplas.
• The min_weight_fraction_leaf parameter of tree-based classifiers and regressors now assumes uniform
sample weights by default if the sample_weight argument is not passed to the fit function. Previously, the
parameter was silently ignored. #7301 by Nelson Liu.
• Numerical issue with linear_model.RidgeCV on centered data when n_features > n_samples. #6178 by
Bertrand Thirion
• Tree splitting criterion classes’ cloning/pickling is now memory safe #7680 by Ibraim Ganiev.
• Fixed a bug where decomposition.NMF sets its n_iters_ attribute in transform(). #7553 by Ekaterina
Krivich.
• sklearn.linear_model.LogisticRegressionCV now correctly handles string labels. #5874 by
Raghav RV.
• Fixed a bug where sklearn.model_selection.train_test_split raised an error when
stratify is a list of string labels. #7593 by Raghav RV.
• Fixed a bug where sklearn.model_selection.GridSearchCV and sklearn.model_selection.RandomizedSearchCV were not pickleable because of a pickling bug in np.ma.MaskedArray. #7594 by Raghav RV.
• All cross-validation utilities in sklearn.model_selection now permit one time cross-validation splitters
for the cv parameter. Also non-deterministic cross-validation splitters (where multiple calls to split produce
dissimilar splits) can be used as cv parameter. The sklearn.model_selection.GridSearchCV will
cross-validate each parameter setting on the split produced by the first split call to the cross-validation splitter.
#7660 by Raghav RV.
• Fix bug where preprocessing.MultiLabelBinarizer.fit_transform returned an invalid CSR
matrix. #7750 by CJ Carey.
• Fixed a bug where metrics.pairwise.cosine_distances could return a small negative distance.
#7732 by Artsion.
API changes summary
Trees and forests
• The min_weight_fraction_leaf parameter of tree-based classifiers and regressors now assumes uniform
sample weights by default if the sample_weight argument is not passed to the fit function. Previously, the
parameter was silently ignored. #7301 by Nelson Liu.
• Tree splitting criterion classes’ cloning/pickling is now memory safe. #7680 by Ibraim Ganiev.
Linear, kernelized and related models
• Length of explained_variance_ratio of discriminant_analysis.LinearDiscriminantAnalysis changed for both Eigen and SVD solvers. The attribute now has a length of min(n_components, n_classes - 1). #7632 by JPFrancoia
• Numerical issue with linear_model.RidgeCV on centered data when n_features > n_samples. #6178 by Bertrand Thirion

1.7.5 Version 0.18
September 28, 2016
Last release with Python 2.6 support
Scikit-learn 0.18 will be the last version of scikit-learn to support Python 2.6. Later versions of scikit-learn will
require Python 2.7 or above.

Model Selection Enhancements and API Changes
• The model_selection module
The new module sklearn.model_selection, which groups together the functionalities of formerly
sklearn.cross_validation, sklearn.grid_search and sklearn.learning_curve, introduces new possibilities such as nested cross-validation and better manipulation of parameter searches with Pandas.
Many things will stay the same but there are some key differences. Read below to know more about the changes.
• Data-independent CV splitters enabling nested cross-validation
The new cross-validation splitters, defined in the sklearn.model_selection, are no longer initialized
with any data-dependent parameters such as y. Instead they expose a split method that takes in the data and
yields a generator for the different splits.
This change makes it possible to use the cross-validation splitters to perform nested cross-validation, facilitated
by model_selection.GridSearchCV and model_selection.RandomizedSearchCV utilities.

• The enhanced cv_results_ attribute
The new cv_results_ attribute (of model_selection.GridSearchCV and model_selection.
RandomizedSearchCV ) introduced in lieu of the grid_scores_ attribute is a dict of 1D arrays with
elements in each array corresponding to the parameter settings (i.e. search candidates).
The cv_results_ dict can be easily imported into pandas as a DataFrame for exploring the search results (see the sketch after this overview).
The cv_results_ arrays include scores for each cross-validation split (with keys such as
'split0_test_score'), as well as their mean ('mean_test_score') and standard deviation
('std_test_score').
The ranks for the search candidates (based on their mean cross-validation score) are available at cv_results_['rank_test_score'].
The parameter values for each parameter are stored separately as numpy masked object arrays. The value for that search candidate is masked if the corresponding parameter is not applicable. Additionally, a list of all the parameter dicts is stored at cv_results_['params'].
• Parameters n_folds and n_iter renamed to n_splits
Some parameter names have changed: The n_folds parameter in new model_selection.KFold,
model_selection.GroupKFold (see below for the name change), and model_selection.
StratifiedKFold is now renamed to n_splits. The n_iter parameter in model_selection.
ShuffleSplit, the new class model_selection.GroupShuffleSplit and model_selection.
StratifiedShuffleSplit is now renamed to n_splits.
• Rename of splitter classes which accepts group labels along with data
The cross-validation splitters LabelKFold, LabelShuffleSplit, LeaveOneLabelOut and
LeavePLabelOut have been renamed to model_selection.GroupKFold, model_selection.
GroupShuffleSplit, model_selection.LeaveOneGroupOut and model_selection.
LeavePGroupsOut respectively.
Note the change from singular to plural form in model_selection.LeavePGroupsOut.
• Fit parameter labels renamed to groups
The labels parameter in the split method of the newly renamed splitters model_selection.
GroupKFold,
model_selection.LeaveOneGroupOut,
model_selection.
LeavePGroupsOut, model_selection.GroupShuffleSplit is renamed to groups following the
new nomenclature of their class names.
• Parameter n_labels renamed to n_groups
The parameter n_labels in the newly renamed model_selection.LeavePGroupsOut is changed to
n_groups.
• Training scores and Timing information
cv_results_ also includes the training scores for each cross-validation split (with keys such
as 'split0_train_score'), as well as their mean ('mean_train_score') and standard deviation ('std_train_score').
To avoid the cost of evaluating training score, set
return_train_score=False.
Additionally, the mean and standard deviation of the times taken to split, train and score the model across all the cross-validation splits are available at the keys 'mean_time' and 'std_time' respectively.
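As referenced in the cv_results_ item above, a minimal sketch (toy data, a made-up parameter grid, and pandas assumed to be installed) of exploring the search results as a DataFrame:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]})
search.fit(X, y)
# One row per candidate; columns include per-split/mean/std test scores,
# timings, ranks and the masked parameter arrays described above.
df = pd.DataFrame(search.cv_results_)
print(df[['param_C', 'mean_test_score', 'rank_test_score']])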

Changelog
New features
Classifiers and Regressors
• The Gaussian Process module has been reimplemented and now offers classification and regression estimators through gaussian_process.GaussianProcessClassifier and gaussian_process.GaussianProcessRegressor. Among other things, the new implementation supports kernel engineering, gradient-based hyperparameter optimization or sampling of functions from GP prior and GP posterior. Extensive documentation and examples are provided. By Jan Hendrik Metzen.
• Added a new supervised learning algorithm: Multi-layer Perceptron (see the sketch after this list). #3204 by Issam H. Laradji
• Added linear_model.HuberRegressor, a linear model robust to outliers. #5291 by Manoj Kumar.
• Added the multioutput.MultiOutputRegressor meta-estimator. It converts single output regressors
to multi-output regressors by fitting one regressor per output. By Tim Head.
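As referenced in the Multi-layer Perceptron item above, a minimal sketch (toy data and an arbitrary single hidden layer, not from the changelog) of the new estimator:

from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)  # MLPs are sensitive to feature scale
clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))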
Other estimators
• New mixture.GaussianMixture and mixture.BayesianGaussianMixture replace former mixture models, employing faster inference for sounder results. #7295 by Wei Xue and Thierry Guillemot.
• Class decomposition.RandomizedPCA is now factored into decomposition.PCA and is available by calling it with the parameter svd_solver='randomized'. The default number of n_iter for 'randomized' has changed to 4. The old behavior of PCA is recovered by svd_solver='full'. An additional solver calls arpack and performs truncated (non-randomized) SVD. By default, the best solver is selected depending on the size of the input and the number of components requested. #5299 by Giorgio Patrini.
• Added two functions for mutual information estimation: feature_selection.mutual_info_classif and feature_selection.mutual_info_regression. These functions can be used in feature_selection.SelectKBest and feature_selection.SelectPercentile as score functions. By Andrea Bravi and Nikolay Mayorov.
• Added the ensemble.IsolationForest class for anomaly detection based on random forests. By Nicolas
Goix.
• Added algorithm="elkan" to cluster.KMeans implementing Elkan’s fast K-Means algorithm. By
Andreas Müller.
Model selection and evaluation
• Added metrics.cluster.fowlkes_mallows_score, the Fowlkes-Mallows Index which measures the similarity of two clusterings of a set of points. By Arnaud Fouchet and Thierry Guillemot.
• Added metrics.calinski_harabaz_score, which computes the Calinski and Harabaz score to evaluate the resulting clustering of a set of points. By Arnaud Fouchet and Thierry Guillemot.
• Added new cross-validation splitter model_selection.TimeSeriesSplit to handle time series data.
#6586 by YenChen Lin
• The cross-validation iterators are replaced by cross-validation splitters available from sklearn.
model_selection, allowing for nested cross-validation. See Model Selection Enhancements and API
Changes for more information. #4294 by Raghav RV.
Enhancements
Trees and ensembles

• Added a new splitting criterion for tree.DecisionTreeRegressor, the mean absolute error. This criterion can also be used in ensemble.ExtraTreesRegressor, ensemble.RandomForestRegressor, and the gradient boosting estimators. #6667 by Nelson Liu.
• Added weighted impurity-based early stopping criterion for decision tree growth. #6954 by Nelson Liu
• The random forest, extra trees and decision tree estimators now have a method decision_path which returns the decision path of samples in the tree (see the sketch after this list). By Arnaud Joly.
• A new example has been added unveiling the decision tree structure. By Arnaud Joly.
• Random forest, extra trees, decision trees and gradient boosting estimators accept the parameters min_samples_split and min_samples_leaf provided as a percentage of the training samples. By yelite and Arnaud Joly.
• Gradient boosting estimators accept the parameter criterion to specify the splitting criterion used in the built decision trees. #6667 by Nelson Liu.
• The memory footprint is reduced (sometimes greatly) for ensemble.bagging.BaseBagging and classes
that inherit from it, i.e, ensemble.BaggingClassifier, ensemble.BaggingRegressor, and
ensemble.IsolationForest, by dynamically generating attribute estimators_samples_ only
when it is needed. By David Staub.
• Added n_jobs and sample_weight parameters for ensemble.VotingClassifier to fit underlying
estimators in parallel. #5805 by Ibraim Ganiev.
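As referenced in the decision_path item above, a minimal sketch (toy data and an arbitrary tree depth, not from the changelog) of retrieving the decision path of a few samples:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# The returned sparse indicator matrix has shape (n_samples, n_nodes);
# entry (i, j) is 1 if sample i passes through node j.
node_indicator = tree.decision_path(X[:5])
print(node_indicator.toarray())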
Linear, kernelized and related models
• In linear_model.LogisticRegression, the SAG solver is now available in the multinomial case.
#5251 by Tom Dupre la Tour.
• linear_model.RANSACRegressor, svm.LinearSVC and svm.LinearSVR now support sample_weight. By Imaculate.
• Add parameter loss to linear_model.RANSACRegressor to measure the error on the samples for every
trial. By Manoj Kumar.
• Prediction of out-of-sample events with Isotonic Regression (isotonic.IsotonicRegression) is now
much faster (over 1000x in tests with synthetic data). By Jonathan Arfa.
• Isotonic regression (isotonic.IsotonicRegression) now uses a better algorithm to avoid O(n^2) behavior in pathological cases, and is also generally faster (#6691). By Antony Lee.
• naive_bayes.GaussianNB now accepts data-independent class-priors through the parameter priors.
By Guillaume Lemaitre.
• linear_model.ElasticNet and linear_model.Lasso now work with np.float32 input data without converting it into np.float64. This reduces memory consumption. #6913 by YenChen Lin.
• semi_supervised.LabelPropagation and semi_supervised.LabelSpreading now accept
arbitrary kernel functions in addition to strings knn and rbf. #5762 by Utkarsh Upadhyay.
Decomposition, manifold learning and clustering
• Added an inverse_transform function to decomposition.NMF to compute the data matrix of the original shape (see the sketch after this list). By Anish Shah.
• cluster.KMeans and cluster.MiniBatchKMeans now work with np.float32 and np.float64 input data without converting it, which reduces memory consumption when using np.float32. #6846 by Sebastian Säger and YenChen Lin.
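A minimal sketch of NMF's new inverse_transform, which maps the reduced representation W back to a matrix of the original shape (toy non-negative data):

    import numpy as np
    from sklearn.decomposition import NMF

    X = np.random.RandomState(0).rand(6, 4)

    nmf = NMF(n_components=2, random_state=0)
    W = nmf.fit_transform(X)                     # shape (6, 2)
    X_reconstructed = nmf.inverse_transform(W)   # shape (6, 4), roughly W times components_
    print(X_reconstructed.shape)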
Preprocessing and feature selection


• preprocessing.RobustScaler now accepts a quantile_range parameter (see the sketch after this list). #5929 by Konstantin Podshumok.
• feature_extraction.FeatureHasher now accepts string values. #6173 by Ryad Zenine and Devashish Deshpande.

• Keyword arguments can now be supplied to func in preprocessing.FunctionTransformer by
means of the kw_args parameter. By Brian McFee.
• feature_selection.SelectKBest and feature_selection.SelectPercentile now accept
score functions that take X, y as input and return only the scores. By Nikolay Mayorov.
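A minimal sketch of RobustScaler's new quantile_range parameter (toy data; the 10th-90th percentile range is used instead of the default interquartile range):

    import numpy as np
    from sklearn.preprocessing import RobustScaler

    X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # contains an outlier

    # scale by the 10th-90th percentile range instead of the default (25.0, 75.0)
    scaler = RobustScaler(quantile_range=(10.0, 90.0))
    X_scaled = scaler.fit_transform(X)
    print(scaler.center_, scaler.scale_)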
Model evaluation and meta-estimators
• multiclass.OneVsOneClassifier and multiclass.OneVsRestClassifier now support
partial_fit. By Asish Panda and Philipp Dowling.
• Added support for substituting or disabling pipeline.Pipeline and pipeline.FeatureUnion components using the set_params interface that powers sklearn.grid_search. See Selecting dimensionality reduction with Pipeline and GridSearchCV. By Joel Nothman and Robert McGibbon.
• The new cv_results_ attribute of model_selection.GridSearchCV (and model_selection.RandomizedSearchCV) can be easily imported into pandas as a DataFrame (see the sketch after this list). See Model Selection Enhancements and API Changes for more information. #6697 by Raghav RV.
• Generalization of model_selection.cross_val_predict: one can pass method names such as predict_proba to be used in the cross-validation framework instead of the default predict. By Ori Ziv and Sears Merritt.
• The training scores and the time taken for training followed by scoring for each search candidate are now available in the cv_results_ dict. See Model Selection Enhancements and API Changes for more information. #7325 by Eugene Chen and Raghav RV.
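A minimal sketch of inspecting cv_results_ as a pandas DataFrame and of cross_val_predict with method='predict_proba' (iris data; pandas is assumed to be installed):

    import pandas as pd
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, cross_val_predict

    X, y = load_iris(return_X_y=True)

    grid = GridSearchCV(LogisticRegression(), {'C': [0.1, 1.0, 10.0]}, cv=3)
    grid.fit(X, y)
    results = pd.DataFrame(grid.cv_results_)        # one row per parameter candidate
    print(results[['param_C', 'mean_test_score']])

    # cross-validated class probabilities instead of the default predict output
    proba = cross_val_predict(LogisticRegression(), X, y, cv=3, method='predict_proba')
    print(proba.shape)   # (n_samples, n_classes)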
Metrics
• Added a labels flag to metrics.log_loss to explicitly provide the labels when the number of classes in y_true and y_pred differ (see the sketch after this list). #7239 by Hong Guangguo with help from Mads Jensen and Nelson Liu.
• Support sparse contingency matrices in cluster evaluation (metrics.cluster.supervised) to scale to a
large number of clusters. #7419 by Gregory Stupp and Joel Nothman.
• Add sample_weight parameter to metrics.matthews_corrcoef. By Jatin Shah and Raghav RV.
• Speed up metrics.silhouette_score by using vectorized operations. By Manoj Kumar.
• Add sample_weight parameter to metrics.confusion_matrix. By Bernardo Stein.
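A minimal sketch of the new labels argument of metrics.log_loss, useful when y_true does not contain every class covered by the probability columns (toy values):

    from sklearn.metrics import log_loss

    # y_true only contains classes 0 and 2, but the probabilities cover classes 0, 1 and 2
    y_true = [0, 2, 2]
    y_proba = [[0.7, 0.2, 0.1],
               [0.1, 0.3, 0.6],
               [0.2, 0.2, 0.6]]

    print(log_loss(y_true, y_proba, labels=[0, 1, 2]))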
Miscellaneous
• Added n_jobs parameter to feature_selection.RFECV to compute the score on the test folds in parallel. By Manoj Kumar.
• Codebase does not contain C/C++ cython generated files: they are generated during build. Distribution packages
will still contain generated C/C++ files. By Arthur Mensch.
• Reduce the memory usage for 32-bit float input arrays of utils.sparse_func.mean_variance_axis
and utils.sparse_func.incr_mean_variance_axis by supporting cython fused types. By
YenChen Lin.
• ignore_warnings now accepts a category argument to ignore only the warnings of a specified type. By Thierry Guillemot.


• Added a return_X_y parameter and a (data, target) tuple return option to load_iris (#7049), load_breast_cancer (#7152), load_digits, load_diabetes, load_linnerud and load_boston (#7154); see the sketch after this list. By Manvendra Singh.
• Simplification of the clone function; deprecated support for estimators that modify parameters in __init__. #5540 by Andreas Müller.
• When unpickling a scikit-learn estimator in a different version than the one the estimator was trained with, a
UserWarning is raised, see the documentation on model persistence for more details. (#7248) By Andreas
Müller.
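A minimal sketch of the new return_X_y option of the dataset loaders (shown with load_iris; the same flag applies to the other loaders listed above):

    from sklearn.datasets import load_iris

    bunch = load_iris()                  # classic behaviour: a Bunch with .data and .target
    X, y = load_iris(return_X_y=True)    # new behaviour: a (data, target) tuple
    print(X.shape, y.shape)              # (150, 4) (150,)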
Bug fixes
Trees and ensembles
• Random forest, extra trees, decision trees and gradient boosting will no longer accept min_samples_split=1, as at least 2 samples are required to split a decision tree node. By Arnaud Joly.
• ensemble.VotingClassifier now raises NotFittedError if predict, transform or predict_proba are called on the non-fitted estimator. By Sebastian Raschka.
• Fix bug where ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor would perform poorly if the random_state was fixed (#7411). By Joel Nothman.
• Fix bug in ensembles with randomization where the ensemble would not set random_state on base estimators in a pipeline or similar nesting (#7411). Note that results for ensemble.BaggingClassifier, ensemble.BaggingRegressor, ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor will now differ from previous versions. By Joel Nothman.
Linear, kernelized and related models
• Fixed incorrect gradient computation for loss='squared_epsilon_insensitive' in linear_model.SGDClassifier and linear_model.SGDRegressor (#6764). By Wenhua Yang.
• Fix bug in linear_model.LogisticRegressionCV where solver='liblinear' did not accept class_weights='balanced' (#6817). By Tom Dupre la Tour.
• Fix bug in neighbors.RadiusNeighborsClassifier where an error occurred when there were outliers being labelled and a weight function specified (#6902). By LeonieBorne.
• Fix linear_model.ElasticNet sparse decision function to match output with dense in the multioutput
case.
Decomposition, manifold learning and clustering
• decomposition.RandomizedPCA default number of iterated_power is 4 instead of 3. #5141 by Giorgio
Patrini.
• utils.extmath.randomized_svd performs 4 power iterations by default, instead of 0. In practice this is enough for obtaining a good approximation of the true eigenvalues/vectors in the presence of noise. When n_components is small (< .1 * min(X.shape)) n_iter is set to 7, unless the user specifies a higher number. This improves precision with few components. #5299 by Giorgio Patrini.
• Whiten/non-whiten inconsistency between components of decomposition.PCA and decomposition.
RandomizedPCA (now factored into PCA, see the New features) is fixed. components_ are stored with no
whitening. #5299 by Giorgio Patrini.
• Fixed bug in manifold.spectral_embedding where diagonal of unnormalized Laplacian matrix was
incorrectly set to 1. #4995 by Peter Fischer.


• Fixed incorrect initialization of utils.arpack.eigsh on all occurrences. Affects cluster.bicluster.SpectralBiclustering, decomposition.KernelPCA, manifold.LocallyLinearEmbedding, and manifold.SpectralEmbedding (#5012). By Peter Fischer.
• Attribute explained_variance_ratio_ calculated with the SVD solver of discriminant_analysis.LinearDiscriminantAnalysis now returns correct results. By JPFrancoia.

Preprocessing and feature selection
• preprocessing.data._transform_selected now always passes a copy of X to transform function
when copy=True (#7194). By Caio Oliveira.
Model evaluation and meta-estimators
• model_selection.StratifiedKFold now raises error if all n_labels for individual classes is less than
n_folds. #6182 by Devashish Deshpande.
• Fixed bug in model_selection.StratifiedShuffleSplit where train and test sample could overlap
in some edge cases, see #6121 for more details. By Loic Esteve.
• Fix in sklearn.model_selection.StratifiedShuffleSplit to return splits of size train_size and test_size in all cases (#6472). By Andreas Müller.

• Cross-validation of OneVsOneClassifier and OneVsRestClassifier now works with precomputed
kernels. #7350 by Russell Smith.
• Fix incomplete predict_proba method delegation from model_selection.GridSearchCV to
linear_model.SGDClassifier (#7159) by Yichuan Liu.
Metrics
• Fix bug in metrics.silhouette_score in which clusters of size 1 were incorrectly scored. They should
get a score of 0. By Joel Nothman.
• Fix bug in metrics.silhouette_samples so that it now works with arbitrary labels, not just those
ranging from 0 to n_clusters - 1.
• Fix bug where expected and adjusted mutual information were incorrect if cluster contingency cells exceeded
2**16. By Joel Nothman.
• metrics.pairwise.pairwise_distances now converts arrays to boolean arrays when required in
scipy.spatial.distance. #5460 by Tom Dupre la Tour.
• Fix sparse input support in metrics.silhouette_score as well as example examples/text/document_clustering.py. By YenChen Lin.

• metrics.roc_curve and metrics.precision_recall_curve no longer round y_score values
when creating ROC curves; this was causing problems for users with very small differences in scores (#7353).
Miscellaneous
• model_selection.tests._search._check_param_grid now works correctly with all types that extend/implement Sequence (except strings), including range (Python 3.x) and xrange (Python 2.x). #7323 by Viacheslav Kovalevskyi.
• utils.extmath.randomized_range_finder is more numerically stable when many power iterations
are requested, since it applies LU normalization by default. If n_iter<2 numerical issues are unlikely, thus
no normalization is applied. Other normalization options are available: 'none', 'LU' and 'QR'. #5141 by
Giorgio Patrini.
• Fix a bug where some formats of scipy.sparse matrix, and estimators with them as parameters, could not
be passed to base.clone. By Loic Esteve.


• datasets.load_svmlight_file now is able to read long int QID values. #7101 by Ibraim Ganiev.
API changes summary
Linear, kernelized and related models
• residual_metric has been deprecated in linear_model.RANSACRegressor. Use loss instead.
By Manoj Kumar.
• Access to public attributes .X_ and .y_ has been deprecated in isotonic.IsotonicRegression. By
Jonathan Arfa.
Decomposition, manifold learning and clustering
• The old mixture.DPGMM is deprecated in favor of the new mixture.BayesianGaussianMixture
(with the parameter weight_concentration_prior_type='dirichlet_process'). The new
class solves the computational problems of the old class and computes the Gaussian mixture with a Dirichlet process prior faster than before. #7295 by Wei Xue and Thierry Guillemot.
• The old mixture.VBGMM is deprecated in favor of the new mixture.BayesianGaussianMixture
(with the parameter weight_concentration_prior_type='dirichlet_distribution'). The
new class solves the computational problems of the old class and computes the Variational Bayesian Gaussian
mixture faster than before. #6651 by Wei Xue and Thierry Guillemot.
• The old mixture.GMM is deprecated in favor of the new mixture.GaussianMixture. The new class computes the Gaussian mixture faster than before, and some of the computational problems of the old class have been solved. #6666 by Wei Xue and Thierry Guillemot.
Model evaluation and meta-estimators
• The sklearn.cross_validation, sklearn.grid_search and sklearn.learning_curve modules have been deprecated and their classes and functions have been reorganized into the sklearn.model_selection module. See Model Selection Enhancements and API Changes for more information. #4294 by Raghav RV.
• The grid_scores_ attribute of model_selection.GridSearchCV and model_selection.RandomizedSearchCV is deprecated in favor of the attribute cv_results_. See Model Selection Enhancements and API Changes for more information. #6697 by Raghav RV.
• The parameters n_iter or n_folds in old CV splitters are replaced by the new parameter n_splits since
it can provide a consistent and unambiguous interface to represent the number of train-test splits. #7187 by
YenChen Lin.
• classes parameter was renamed to labels in metrics.hamming_loss. #7260 by Sebastián Vanrell.
• The splitter classes LabelKFold, LabelShuffleSplit, LeaveOneLabelOut and LeavePLabelsOut are renamed to model_selection.GroupKFold, model_selection.GroupShuffleSplit, model_selection.LeaveOneGroupOut and model_selection.LeavePGroupsOut respectively (see the sketch after this list). Also the parameter labels in the split method of the newly renamed splitters model_selection.LeaveOneGroupOut and model_selection.LeavePGroupsOut is renamed to groups. Additionally in model_selection.LeavePGroupsOut, the parameter n_labels is renamed to n_groups. #6660 by Raghav RV.
• Error and loss names for scoring parameters are now prefixed by 'neg_', such as
neg_mean_squared_error. The unprefixed versions are deprecated and will be removed in version 0.20.
#7261 by Tim Head.
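A minimal sketch of the renamed group-aware splitters together with the new 'neg_' scoring names (toy data; groups marks which samples belong together and never appear in both train and test):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GroupKFold, cross_val_score

    rng = np.random.RandomState(0)
    X = rng.rand(12, 3)
    y = rng.rand(12)
    groups = np.repeat([0, 1, 2, 3], 3)   # four groups of three samples each

    cv = GroupKFold(n_splits=4)
    # error metrics are exposed with a 'neg_' prefix so that greater is always better
    scores = cross_val_score(Ridge(), X, y, groups=groups, cv=cv,
                             scoring='neg_mean_squared_error')
    print(scores)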


Code Contributors
Aditya Joshi, Alejandro, Alexander Fabisch, Alexander Loginov, Alexander Minyushkin, Alexander Rudy, Alexandre Abadie, Alexandre Abraham, Alexandre Gramfort, Alexandre Saint, alexfields, Alvaro Ulloa, alyssaq, Amlan
Kar, Andreas Mueller, andrew giessel, Andrew Jackson, Andrew McCulloh, Andrew Murray, Anish Shah, Arafat,
Archit Sharma, Ariel Rokem, Arnaud Joly, Arnaud Rachez, Arthur Mensch, Ash Hoover, asnt, b0noI, Behzad Tabibian, Bernardo, Bernhard Kratzwald, Bhargav Mangipudi, blakeflei, Boyuan Deng, Brandon Carter, Brett Naul, Brian
McFee, Caio Oliveira, Camilo Lamus, Carol Willing, Cass, CeShine Lee, Charles Truong, Chyi-Kwei Yau, CJ Carey,
codevig, Colin Ni, Dan Shiebler, Daniel, Daniel Hnyk, David Ellis, David Nicholson, David Staub, David Thaler,
David Warshaw, Davide Lasagna, Deborah, definitelyuncertain, Didi Bar-Zev, djipey, dsquareindia, edwinENSAE,
Elias Kuthe, Elvis DOHMATOB, Ethan White, Fabian Pedregosa, Fabio Ticconi, fisache, Florian Wilhelm, Francis,
Francis O’Donovan, Gael Varoquaux, Ganiev Ibraim, ghg, Gilles Louppe, Giorgio Patrini, Giovanni Cherubin, Giovanni Lanzani, Glenn Qian, Gordon Mohr, govin-vatsan, Graham Clenaghan, Greg Reda, Greg Stupp, Guillaume
Lemaitre, Gustav Mörtberg, halwai, Harizo Rajaona, Harry Mavroforakis, hashcode55, hdmetor, Henry Lin, Hobson Lane, Hugo Bowne-Anderson, Igor Andriushchenko, Imaculate, Inki Hwang, Isaac Sijaranamual, Ishank Gulati,
Issam Laradji, Iver Jordal, jackmartin, Jacob Schreiber, Jake Vanderplas, James Fiedler, James Routley, Jan Zikes,
Janna Brettingen, jarfa, Jason Laska, jblackburne, jeff levesque, Jeffrey Blackburne, Jeffrey04, Jeremy Hintz, jeremynixon, Jeroen, Jessica Yung, Jill-Jênn Vie, Jimmy Jia, Jiyuan Qian, Joel Nothman, johannah, John, John Boersma,
John Kirkham, John Moeller, jonathan.striebel, joncrall, Jordi, Joseph Munoz, Joshua Cook, JPFrancoia, jrfiedler,
JulianKahnert, juliathebrave, kaichogami, KamalakerDadi, Kenneth Lyons, Kevin Wang, kingjr, kjell, Konstantin
Podshumok, Kornel Kielczewski, Krishna Kalyan, krishnakalyan3, Kvle Putnam, Kyle Jackson, Lars Buitinck, ldavid,
LeiG, LeightonZhang, Leland McInnes, Liang-Chi Hsieh, Lilian Besson, lizsz, Loic Esteve, Louis Tiao, Léonie Borne,
Mads Jensen, Maniteja Nandana, Manoj Kumar, Manvendra Singh, Marco, Mario Krell, Mark Bao, Mark Szepieniec,
Martin Madsen, MartinBpr, MaryanMorel, Massil, Matheus, Mathieu Blondel, Mathieu Dubois, Matteo, Matthias Ekman, Max Moroz, Michael Scherer, michiaki ariga, Mikhail Korobov, Moussa Taifi, mrandrewandrade, Mridul Seth,
nadya-p, Naoya Kanai, Nate George, Nelle Varoquaux, Nelson Liu, Nick James, NickleDave, Nico, Nicolas Goix,
Nikolay Mayorov, ningchi, nlathia, okbalefthanded, Okhlopkov, Olivier Grisel, Panos Louridas, Paul Strickland, Perrine Letellier, pestrickland, Peter Fischer, Pieter, Ping-Yao, Chang, practicalswift, Preston Parry, Qimu Zheng, Rachit
Kansal, Raghav RV, Ralf Gommers, Ramana.S, Rammig, Randy Olson, Rob Alexander, Robert Lutz, Robin Schucker,
Rohan Jain, Ruifeng Zheng, Ryan Yu, Rémy Léone, saihttam, Saiwing Yeung, Sam Shleifer, Samuel St-Jean, Sartaj Singh, Sasank Chilamkurthy, saurabh.bansod, Scott Andrews, Scott Lowe, seales, Sebastian Raschka, Sebastian
Saeger, Sebastián Vanrell, Sergei Lebedev, shagun Sodhani, shanmuga cv, Shashank Shekhar, shawpan, shengxiduan, Shota, shuckle16, Skipper Seabold, sklearn-ci, SmedbergM, srvanrell, Sébastien Lerique, Taranjeet, themrmax,
Thierry, Thierry Guillemot, Thomas, Thomas Hallock, Thomas Moreau, Tim Head, tKammy, toastedcornflakes, Tom,
TomDLT, Toshihiro Kamishima, tracer0tong, Trent Hauck, trevorstephens, Tue Vo, Varun, Varun Jewalikar, Viacheslav, Vighnesh Birodkar, Vikram, Villu Ruusmann, Vinayak Mehta, walter, waterponey, Wenhua Yang, Wenjian
Huang, Will Welch, wyseguy7, xyguo, yanlend, Yaroslav Halchenko, yelite, Yen, YenChenLin, Yichuan Liu, Yoav
Ram, Yoshiki, Zheng RuiFeng, zivori, Óscar Nájera

1.7.6 Version 0.17.1
February 18, 2016
Changelog
Bug fixes
• Upgrade vendored joblib to version 0.9.4, which fixes an important bug in joblib.Parallel that can silently yield wrong results when working on datasets larger than 1MB: https://github.com/joblib/joblib/blob/0.9.4/
CHANGES.rst


• Fixed reading of Bunch pickles generated with scikit-learn version <= 0.16. This can affect users who have
already downloaded a dataset with scikit-learn 0.16 and are loading it with scikit-learn 0.17. See #6196 for how
this affected datasets.fetch_20newsgroups. By Loic Esteve.
• Fixed a bug that prevented using ROC AUC score to perform grid search on several CPU / cores on large arrays.
See #6147 By Olivier Grisel.
• Fixed a bug that prevented properly setting the presort parameter in ensemble.GradientBoostingRegressor. See #5857. By Andrew McCulloh.
• Fixed a joblib error when evaluating the perplexity of a decomposition.LatentDirichletAllocation model. See #6258. By Chyi-Kwei Yau.

1.7.7 Version 0.17
November 5, 2015
Changelog
New features
• All Scaler classes except preprocessing.RobustScaler can now be fitted online by calling partial_fit. By Giorgio Patrini.
• The new class ensemble.VotingClassifier implements a “majority rule” / “soft voting” ensemble classifier to combine estimators for classification (see the sketch after this list). By Sebastian Raschka.
• The new class preprocessing.RobustScaler provides an alternative to preprocessing.
StandardScaler for feature-wise centering and range normalization that is robust to outliers. By Thomas
Unterthiner.
• The new class preprocessing.MaxAbsScaler provides an alternative to preprocessing.
MinMaxScaler for feature-wise range normalization when the data is already centered or sparse. By Thomas
Unterthiner.
• The new class preprocessing.FunctionTransformer turns a Python function into a Pipeline-compatible transformer object. By Joe Jevnik.
• The new classes cross_validation.LabelKFold and cross_validation.LabelShuffleSplit generate train-test folds, respectively similar to cross_validation.KFold and cross_validation.ShuffleSplit, except that the folds are conditioned on a label array. By Brian McFee, Jean Kossaifi and Gilles Louppe.
• decomposition.LatentDirichletAllocation implements the Latent Dirichlet Allocation topic
model with online variational inference. By Chyi-Kwei Yau, with code based on an implementation by Matt
Hoffman. (#3659)
• The new solver sag implements a Stochastic Average Gradient descent and is available in both
linear_model.LogisticRegression and linear_model.Ridge. This solver is very efficient for
large datasets. By Danny Sullivan and Tom Dupre la Tour. (#4738)
• The new solver cd implements Coordinate Descent in decomposition.NMF. The previous solver based on Projected Gradient is still available by setting the new parameter solver to pg, but it is deprecated and will be removed in 0.19, along with decomposition.ProjectedGradientNMF and the parameters sparseness, eta, beta and nls_max_iter. New parameters alpha and l1_ratio control L1 and L2 regularization, and shuffle adds a shuffling step in the cd solver. By Tom Dupre la Tour and Mathieu Blondel.
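A minimal sketch of the new VotingClassifier combining two heterogeneous estimators (iris data; 'soft' voting averages the predicted class probabilities, 'hard' uses majority rule):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression

    iris = load_iris()
    X, y = iris.data, iris.target

    clf = VotingClassifier(
        estimators=[('lr', LogisticRegression()),
                    ('rf', RandomForestClassifier(n_estimators=50, random_state=0))],
        voting='soft')
    clf.fit(X, y)
    print(clf.predict(X[:5]))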


Enhancements
• manifold.TSNE now supports approximate optimization via the Barnes-Hut method, leading to much faster
fitting. By Christopher Erick Moody. (#4025)
• cluster.mean_shift_.MeanShift now supports parallel execution, as implemented in the
mean_shift function. By Martino Sorbaro.
• naive_bayes.GaussianNB now supports fitting with sample_weight. By Jan Hendrik Metzen.
• dummy.DummyClassifier now supports a prior fitting strategy. By Arnaud Joly.
• Added a fit_predict method for mixture.GMM and subclasses. By Cory Lorenz.
• Added the metrics.label_ranking_loss metric. By Arnaud Joly.
• Added the metrics.cohen_kappa_score metric.
• Added a warm_start constructor parameter to the bagging ensemble models to increase the size of the ensemble. By Tim Head.
• Added option to use multi-output regression metrics without averaging. By Konstantin Shmelkov and Michael
Eickenberg.
• Added stratify option to cross_validation.train_test_split for stratified splitting. By Miroslav Batchkarov.

• The tree.export_graphviz function now supports aesthetic improvements for tree.
DecisionTreeClassifier and tree.DecisionTreeRegressor, including options for coloring
nodes by their majority class or impurity, showing variable names, and using node proportions instead of raw
sample counts. By Trevor Stephens.
• Improved speed of newton-cg solver in linear_model.LogisticRegression, by avoiding loss computation. By Mathieu Blondel and Tom Dupre la Tour.
• The class_weight="auto" heuristic in classifiers supporting class_weight was deprecated and replaced by the class_weight="balanced" option, which has a simpler formula and interpretation. By
Hanna Wallach and Andreas Müller.
• Add class_weight parameter to automatically weight samples by class frequency for linear_model.PassiveAggressiveClassifier. By Trevor Stephens.
• Added backlinks from the API reference pages to the user guide. By Andreas Müller.
• The labels parameter to sklearn.metrics.f1_score, sklearn.metrics.fbeta_score,
sklearn.metrics.recall_score and sklearn.metrics.precision_score has been extended. It is now possible to ignore one or more labels, such as where a multiclass problem has a majority
class to ignore. By Joel Nothman.
• Add sample_weight support to linear_model.RidgeClassifier. By Trevor Stephens.
• Provide an option for sparse output from sklearn.metrics.pairwise.cosine_similarity. By
Jaidev Deshpande.
• Add minmax_scale to provide a function interface for MinMaxScaler. By Thomas Unterthiner.
• dump_svmlight_file now handles multi-label datasets. By Chih-Wei Chang.
• RCV1 dataset loader (sklearn.datasets.fetch_rcv1). By Tom Dupre la Tour.
• The “Wisconsin Breast Cancer” classical two-class classification dataset is now included in scikit-learn, available with sklearn.datasets.load_breast_cancer.


• Upgraded to joblib 0.9.3 to benefit from the new automatic batching of short tasks. This makes it possible for
scikit-learn to benefit from parallelism when many very short tasks are executed in parallel, for instance by the
grid_search.GridSearchCV meta-estimator with n_jobs > 1 used with a large grid of parameters
on a small dataset. By Vlad Niculae, Olivier Grisel and Loic Esteve.
• For more details about changes in joblib 0.9.3 see the release notes: https://github.com/joblib/joblib/blob/master/
CHANGES.rst#release-093
• Improved speed (3 times per iteration) of decomposition.DictLearning with coordinate descent
method from linear_model.Lasso. By Arthur Mensch.
• Parallel processing (threaded) for queries of nearest neighbors (using the ball-tree) by Nikolay Mayorov.
• Allow datasets.make_multilabel_classification to output a sparse y. By Kashif Rasul.
• cluster.DBSCAN now accepts a sparse matrix of precomputed distances, allowing memory-efficient distance
precomputation. By Joel Nothman.
• tree.DecisionTreeClassifier now exposes an apply method for retrieving the leaf indices samples
are predicted as. By Daniel Galvez and Gilles Louppe.
• Speed up decision tree regressors, random forest regressors, extra trees regressors and gradient boosting estimators by computing a proxy of the impurity improvement during the tree growth. The proxy quantity is such that
the split that maximizes this value also maximizes the impurity improvement. By Arnaud Joly, Jacob Schreiber
and Gilles Louppe.
• Speed up tree based methods by reducing the number of computations needed when computing the impurity
measure taking into account linear relationship of the computed statistics. The effect is particularly visible with
extra trees and on datasets with categorical or sparse features. By Arnaud Joly.
• ensemble.GradientBoostingRegressor and ensemble.GradientBoostingClassifier now expose an apply method for retrieving the leaf indices each sample ends up in under each tree. By Jacob Schreiber.
• Add sample_weight support to linear_model.LinearRegression. By Sonny Hu. (#4881)
• Add n_iter_without_progress to manifold.TSNE to control the stopping criterion. By Santi Villalba. (#5186)
• Added optional parameter random_state in linear_model.Ridge , to set the seed of the pseudo random
generator used in sag solver. By Tom Dupre la Tour.
• Added optional parameter warm_start in linear_model.LogisticRegression. If set to True, the
solvers lbfgs, newton-cg and sag will be initialized with the coefficients computed in the previous fit. By
Tom Dupre la Tour.
• Added sample_weight support to linear_model.LogisticRegression for the lbfgs,
newton-cg, and sag solvers. By Valentin Stolbunov. Support added to the liblinear solver. By Manoj
Kumar.
• Added optional parameter presort to ensemble.GradientBoostingRegressor and ensemble.
GradientBoostingClassifier, keeping default behavior the same. This allows gradient boosters to
turn off presorting when building deep trees or using sparse data. By Jacob Schreiber.
• Altered metrics.roc_curve to drop unnecessary thresholds by default. By Graham Clenaghan.
• Added feature_selection.SelectFromModel meta-transformer which can be used along with estimators that have a coef_ or feature_importances_ attribute to select important features of the input data (see the sketch after this list). By Maheshakya Wijewardena, Joel Nothman and Manoj Kumar.
• Added metrics.pairwise.laplacian_kernel. By Clyde Fare.


• covariance.GraphLasso allows separate control of the convergence criterion for the Elastic-Net subproblem via the enet_tol parameter.
• Improved verbosity in decomposition.DictionaryLearning.
• ensemble.RandomForestClassifier and ensemble.RandomForestRegressor no longer explicitly store the samples used in bagging, resulting in a much reduced memory footprint for storing random
forest models.
• Added positive option to linear_model.Lars and linear_model.lars_path to force coefficients to be positive. (#5131)
• Added the X_norm_squared parameter to metrics.pairwise.euclidean_distances to provide
precomputed squared norms for X.
• Added the fit_predict method to pipeline.Pipeline.
• Added the preprocessing.min_max_scale function.
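A minimal sketch of the new SelectFromModel meta-transformer wrapping an estimator that exposes feature_importances_ (toy data; features with importance below the median are dropped):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    rng = np.random.RandomState(0)
    X = rng.rand(100, 10)
    y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # only the first two features matter

    selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0),
                               threshold='median')
    X_reduced = selector.fit_transform(X, y)
    print(X_reduced.shape)   # roughly half of the ten features are kept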
Bug fixes
• Fixed non-determinism in dummy.DummyClassifier with sparse multi-label output. By Andreas Müller.
• Fixed the output shape of linear_model.RANSACRegressor to (n_samples, ). By Andreas Müller.
• Fixed bug in decomposition.DictLearning when n_jobs < 0. By Andreas Müller.
• Fixed bug where grid_search.RandomizedSearchCV could consume a lot of memory for large discrete
grids. By Joel Nothman.
• Fixed bug in linear_model.LogisticRegressionCV where penalty was ignored in the final fit. By
Manoj Kumar.
• Fixed bug in ensemble.forest.ForestClassifier while computing oob_score and X is a
sparse.csc_matrix. By Ankur Ankan.
• All regressors now consistently handle and warn when given y that is of shape (n_samples, 1). By Andreas
Müller and Henry Lin. (#5431)
• Fix in cluster.KMeans cluster reassignment for sparse input by Lars Buitinck.
• Fixed a bug in lda.LDA that could cause asymmetric covariance matrices when using shrinkage. By Martin
Billinger.
• Fixed cross_validation.cross_val_predict for estimators with sparse predictions. By Buddha
Prakash.
• Fixed the predict_proba method of linear_model.LogisticRegression to use soft-max instead
of one-vs-rest normalization. By Manoj Kumar. (#5182)
• Fixed the partial_fit method of linear_model.SGDClassifier when called with average=True. By Andrew Lamb. (#5282)

• Dataset fetchers use different filenames under Python 2 and Python 3 to avoid pickling compatibility issues. By
Olivier Grisel. (#5355)
• Fixed a bug in naive_bayes.GaussianNB which caused classification results to depend on scale. By Jake
Vanderplas.
• Fixed temporarily linear_model.Ridge, which was incorrect when fitting the intercept in the case of
sparse data. The fix automatically changes the solver to ‘sag’ in this case. #5360 by Tom Dupre la Tour.


• Fixed a performance bug in decomposition.RandomizedPCA on data with a large number of features
and fewer samples. (#4478) By Andreas Müller, Loic Esteve and Giorgio Patrini.
• Fixed bug in cross_decomposition.PLS that yielded unstable and platform dependent output, and failed
on fit_transform. By Arthur Mensch.
• Fixes to the Bunch class used to store datasets.
• Fixed ensemble.plot_partial_dependence ignoring the percentiles parameter.
• Providing a set as vocabulary in CountVectorizer no longer leads to inconsistent results when pickling.
• Fixed the conditions on when a precomputed Gram matrix needs to be recomputed in linear_model.
LinearRegression, linear_model.OrthogonalMatchingPursuit, linear_model.Lasso
and linear_model.ElasticNet.
• Fixed inconsistent memory layout in the coordinate descent solver that affected linear_model.
DictionaryLearning and covariance.GraphLasso. (#5337) By Olivier Grisel.
• manifold.LocallyLinearEmbedding no longer ignores the reg parameter.
• Nearest Neighbor estimators with custom distance metrics can now be pickled. (#4362)
• Fixed a bug in pipeline.FeatureUnion where transformer_weights were not properly handled
when performing grid-searches.
• Fixed a bug in linear_model.LogisticRegression and linear_model.LogisticRegressionCV when using class_weight='balanced' or class_weight='auto'. By Tom Dupre la Tour.
• Fixed bug #5495 when doing OVR(SVC(decision_function_shape=”ovr”)). Fixed by Elvis Dohmatob.
API changes summary
• Attributes data_min, data_max and data_range in preprocessing.MinMaxScaler are deprecated and won’t be available from 0.19. Instead, the class now exposes data_min_, data_max_ and data_range_. By Giorgio Patrini.
• All Scaler classes now have a scale_ attribute, the feature-wise rescaling applied by their transform methods. The old attribute std_ in preprocessing.StandardScaler is deprecated and superseded by scale_; it won’t be available in 0.19. By Giorgio Patrini.
• svm.SVC and svm.NuSVC now have a decision_function_shape parameter to make their decision function of shape (n_samples, n_classes) by setting decision_function_shape='ovr'. This will be the default behavior starting in 0.19. By Andreas Müller.
• Passing 1D data arrays as input to estimators is now deprecated as it caused confusion in how the array elements should be interpreted as features or as samples. All data arrays are now expected to be explicitly shaped
(n_samples, n_features). By Vighnesh Birodkar.
• lda.LDA and qda.QDA have been moved to discriminant_analysis.LinearDiscriminantAnalysis and discriminant_analysis.QuadraticDiscriminantAnalysis.

• The store_covariance and tol parameters have been moved from the fit method to the constructor in
discriminant_analysis.LinearDiscriminantAnalysis and the store_covariances and
tol parameters have been moved from the fit method to the constructor in discriminant_analysis.
QuadraticDiscriminantAnalysis.
• Models inheriting from _LearntSelectorMixin will no longer support the transform methods (i.e., RandomForests, GradientBoosting, LogisticRegression, DecisionTrees, SVMs and SGD related models). Wrap


these models around the meta-transformer feature_selection.SelectFromModel to remove features (according to coefs_ or feature_importances_) which are below a certain threshold value instead.
• cluster.KMeans re-runs cluster-assignments in case of non-convergence, to ensure consistency of
predict(X) and labels_. By Vighnesh Birodkar.
• Classifier and Regressor models are now tagged as such using the _estimator_type attribute.
• Cross-validation iterators always provide indices into training and test set, not boolean masks.
• The decision_function on all regressors was deprecated and will be removed in 0.19. Use predict
instead.
• datasets.load_lfw_pairs is deprecated and will be removed in 0.19. Use datasets.fetch_lfw_pairs instead.

• The deprecated hmm module was removed.
• The deprecated Bootstrap cross-validation iterator was removed.
• The deprecated Ward and WardAgglomerative classes have been removed. Use clustering.AgglomerativeClustering instead.

• cross_validation.check_cv is now a public function.
• The property residues_ of linear_model.LinearRegression is deprecated and will be removed in
0.19.
• The deprecated n_jobs parameter of linear_model.LinearRegression has been moved to the constructor.
• Removed deprecated class_weight parameter from linear_model.SGDClassifier’s fit method.
Use the construction parameter instead.
• The deprecated support for the sequence of sequences (or list of lists) multilabel format was removed. To convert to and from the supported binary indicator matrix format, use MultiLabelBinarizer (see the sketch after this list).
• The behavior of calling the inverse_transform method of Pipeline.pipeline will change in 0.19.
It will no longer reshape one-dimensional input to two-dimensional input.
• The deprecated attributes indicator_matrix_, multilabel_ and classes_ of preprocessing.
LabelBinarizer were removed.
• Using gamma=0 in svm.SVC and svm.SVR to automatically set the gamma to 1. / n_features is deprecated and will be removed in 0.19. Use gamma="auto" instead.
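A minimal sketch of converting the removed sequence-of-sequences multilabel format to the supported binary indicator matrix with MultiLabelBinarizer (toy labels):

    from sklearn.preprocessing import MultiLabelBinarizer

    y_sequences = [[1, 3], [2], [1, 2, 3]]      # old "list of lists" multilabel format

    mlb = MultiLabelBinarizer()
    Y_indicator = mlb.fit_transform(y_sequences)
    print(mlb.classes_)                          # [1 2 3]
    print(Y_indicator)                           # one indicator column per label
    print(mlb.inverse_transform(Y_indicator))    # back to tuples of labels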
Code Contributors
Aaron Schumacher, Adithya Ganesh, akitty, Alexandre Gramfort, Alexey Grigorev, Ali Baharev, Allen Riddell, Ando
Saabas, Andreas Mueller, Andrew Lamb, Anish Shah, Ankur Ankan, Anthony Erlinger, Ari Rouvinen, Arnaud Joly,
Arnaud Rachez, Arthur Mensch, banilo, Barmaley.exe, benjaminirving, Boyuan Deng, Brett Naul, Brian McFee,
Buddha Prakash, Chi Zhang, Chih-Wei Chang, Christof Angermueller, Christoph Gohlke, Christophe Bourguignat,
Christopher Erick Moody, Chyi-Kwei Yau, Cindy Sridharan, CJ Carey, Clyde-fare, Cory Lorenz, Dan Blanchard,
Daniel Galvez, Daniel Kronovet, Danny Sullivan, Data1010, David, David D Lowe, David Dotson, djipey, Dmitry
Spikhalskiy, Donne Martin, Dougal J. Sutherland, Dougal Sutherland, edson duarte, Eduardo Caro, Eric Larson, Eric
Martin, Erich Schubert, Fernando Carrillo, Frank C. Eckert, Frank Zalkow, Gael Varoquaux, Ganiev Ibraim, Gilles
Louppe, Giorgio Patrini, giorgiop, Graham Clenaghan, Gryllos Prokopis, gwulfs, Henry Lin, Hsuan-Tien Lin, Immanuel Bayer, Ishank Gulati, Jack Martin, Jacob Schreiber, Jaidev Deshpande, Jake Vanderplas, Jan Hendrik Metzen,
Jean Kossaifi, Jeffrey04, Jeremy, jfraj, Jiali Mei, Joe Jevnik, Joel Nothman, John Kirkham, John Wittenauer, Joseph,
Joshua Loyal, Jungkook Park, KamalakerDadi, Kashif Rasul, Keith Goodman, Kian Ho, Konstantin Shmelkov, Kyler


Brown, Lars Buitinck, Lilian Besson, Loic Esteve, Louis Tiao, maheshakya, Maheshakya Wijewardena, Manoj Kumar, MarkTab marktab.net, Martin Ku, Martin Spacek, MartinBpr, martinosorb, MaryanMorel, Masafumi Oyamada,
Mathieu Blondel, Matt Krump, Matti Lyra, Maxim Kolganov, mbillinger, mhg, Michael Heilman, Michael Patterson,
Miroslav Batchkarov, Nelle Varoquaux, Nicolas, Nikolay Mayorov, Olivier Grisel, Omer Katz, Óscar Nájera, Pauli
Virtanen, Peter Fischer, Peter Prettenhofer, Phil Roth, pianomania, Preston Parry, Raghav RV, Rob Zinkov, Robert
Layton, Rohan Ramanath, Saket Choudhary, Sam Zhang, santi, saurabh.bansod, scls19fr, Sebastian Raschka, Sebastian Saeger, Shivan Sornarajah, SimonPL, sinhrks, Skipper Seabold, Sonny Hu, sseg, Stephen Hoover, Steven De
Gryze, Steven Seguin, Theodore Vasiloudis, Thomas Unterthiner, Tiago Freitas Pereira, Tian Wang, Tim Head, Timothy Hopper, tokoroten, Tom Dupré la Tour, Trevor Stephens, Valentin Stolbunov, Vighnesh Birodkar, Vinayak Mehta,
Vincent, Vincent Michel, vstolbunov, wangz10, Wei Xue, Yucheng Low, Yury Zhauniarovich, Zac Stewart, zhai_pro,
Zichen Wang

1.7.8 Version 0.16.1
April 14, 2015
Changelog
Bug fixes
• Allow input data larger than block_size in covariance.LedoitWolf by Andreas Müller.
• Fix a bug in isotonic.IsotonicRegression deduplication that caused unstable result in
calibration.CalibratedClassifierCV by Jan Hendrik Metzen.
• Fix sorting of labels in preprocessing.label_binarize by Michael Heilman.
• Fix several stability and convergence issues in cross_decomposition.CCA and cross_decomposition.PLSCanonical by Andreas Müller.

• Fix a bug in cluster.KMeans when precompute_distances=False on fortran-ordered data.
• Fix a speed regression in ensemble.RandomForestClassifier’s predict and predict_proba
by Andreas Müller.
• Fix a regression where utils.shuffle converted lists and dataframes to arrays, by Olivier Grisel

1.7.9 Version 0.16
March 26, 2015
Highlights
• Speed improvements (notably in cluster.DBSCAN ), reduced memory requirements, bug-fixes and better
default settings.
• Multinomial Logistic regression and a path algorithm in linear_model.LogisticRegressionCV .
• Out-of core learning of PCA via decomposition.IncrementalPCA.
• Probability calibration of classifiers using calibration.CalibratedClassifierCV.
• cluster.Birch clustering method for large-scale datasets.
• Scalable approximate nearest neighbors search with Locality-sensitive hashing forests in neighbors.
LSHForest.


• Improved error messages and better validation when using malformed input data.
• More robust integration with pandas dataframes.
Changelog
New features
• The new neighbors.LSHForest implements locality-sensitive hashing for approximate nearest neighbors
search. By Maheshakya Wijewardena.
• Added svm.LinearSVR. This class uses the liblinear implementation of Support Vector Regression which is
much faster for large sample sizes than svm.SVR with linear kernel. By Fabian Pedregosa and Qiang Luo.
• Incremental fit for GaussianNB.
• Added sample_weight support to dummy.DummyClassifier and dummy.DummyRegressor. By
Arnaud Joly.
• Added the metrics.label_ranking_average_precision_score metrics. By Arnaud Joly.
• Add the metrics.coverage_error metrics. By Arnaud Joly.
• Added linear_model.LogisticRegressionCV . By Manoj Kumar, Fabian Pedregosa, Gael Varoquaux
and Alexandre Gramfort.
• Added warm_start constructor parameter to make it possible for any trained forest model to grow additional
trees incrementally. By Laurent Direr.
• Added sample_weight support to ensemble.GradientBoostingClassifier and ensemble.
GradientBoostingRegressor. By Peter Prettenhofer.
• Added decomposition.IncrementalPCA, an implementation of the PCA algorithm that supports out-of-core learning with a partial_fit method (see the sketch after this list). By Kyle Kastner.
• Averaged SGD for SGDClassifier and SGDRegressor By Danny Sullivan.
• Added cross_val_predict function which computes cross-validated estimates. By Luis Pedro Coelho
• Added linear_model.TheilSenRegressor, a robust generalized-median-based estimator. By Florian
Wilhelm.
• Added metrics.median_absolute_error, a robust metric. By Gael Varoquaux and Florian Wilhelm.
• Add cluster.Birch, an online clustering algorithm. By Manoj Kumar, Alexandre Gramfort and Joel Nothman.
• Added shrinkage support to discriminant_analysis.LinearDiscriminantAnalysis using two
new solvers. By Clemens Brunner and Martin Billinger.
• Added kernel_ridge.KernelRidge, an implementation of kernelized ridge regression. By Mathieu
Blondel and Jan Hendrik Metzen.
• All solvers in linear_model.Ridge now support sample_weight. By Mathieu Blondel.
• Added cross_validation.PredefinedSplit cross-validation for fixed user-provided cross-validation
folds. By Thomas Unterthiner.
• Added calibration.CalibratedClassifierCV , an approach for calibrating the predicted probabilities of a classifier. By Alexandre Gramfort, Jan Hendrik Metzen, Mathieu Blondel and Balazs Kegl.
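A minimal sketch of out-of-core learning with IncrementalPCA via partial_fit on successive chunks (toy data standing in for batches that would not fit in memory at once):

    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    rng = np.random.RandomState(0)
    X = rng.rand(300, 20)

    ipca = IncrementalPCA(n_components=5)
    for chunk in np.array_split(X, 10):   # feed the data batch by batch
        ipca.partial_fit(chunk)

    X_reduced = ipca.transform(X)
    print(X_reduced.shape)   # (300, 5)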


Enhancements
• Add option return_distance in hierarchical.ward_tree to return distances between nodes for
both structured and unstructured versions of the algorithm. By Matteo Visconti di Oleggio Castello. The same
option was added in hierarchical.linkage_tree. By Manoj Kumar
• Add support for sample weights in scorer objects. Metrics with sample weight support will automatically benefit
from it. By Noel Dawe and Vlad Niculae.
• Added newton-cg and lbfgs solver support in linear_model.LogisticRegression. By Manoj Kumar.
• Add selection="random" parameter to implement stochastic coordinate descent for linear_model.
Lasso, linear_model.ElasticNet and related. By Manoj Kumar.
• Add sample_weight parameter to metrics.jaccard_similarity_score and metrics.
log_loss. By Jatin Shah.
• Support sparse multilabel indicator representation in preprocessing.LabelBinarizer and
multiclass.OneVsRestClassifier (by Hamzeh Alsalhi with thanks to Rohit Sivaprasad), as
well as evaluation metrics (by Joel Nothman).
• Add sample_weight parameter to metrics.jaccard_similarity_score. By Jatin Shah.
• Add support for multiclass in metrics.hinge_loss. Added labels=None as optional parameter. By Saurabh
Jha.
• Add sample_weight parameter to metrics.hinge_loss. By Saurabh Jha.
• Add multi_class="multinomial" option in linear_model.LogisticRegression to implement a Logistic Regression solver that minimizes the cross-entropy or multinomial loss instead of the default One-vs-Rest setting (see the sketch after this list). Supports the lbfgs and newton-cg solvers. By Lars Buitinck and Manoj Kumar. Solver option newton-cg by Simon Wu.
• DictVectorizer can now perform fit_transform on an iterable in a single pass, when giving the option
sort=False. By Dan Blanchard.
• GridSearchCV and RandomizedSearchCV can now be configured to work with estimators that may fail
and raise errors on individual folds. This option is controlled by the error_score parameter. This does not affect
errors raised on re-fit. By Michal Romaniuk.
• Add digits parameter to metrics.classification_report to allow report to show different precision of floating
point numbers. By Ian Gilmore.
• Add a quantile prediction strategy to the dummy.DummyRegressor. By Aaron Staple.
• Add handle_unknown option to preprocessing.OneHotEncoder to handle unknown categorical features more gracefully during transform. By Manoj Kumar.
• Added support for sparse input data to decision trees and their ensembles. By Fares Hedyati and Arnaud Joly.
• Optimized cluster.AffinityPropagation by reducing the number of memory allocations of large
temporary data-structures. By Antony Lee.
• Parallelization of the computation of feature importances in random forest. By Olivier Grisel and Arnaud Joly.
• Add n_iter_ attribute to estimators that accept a max_iter attribute in their constructor. By Manoj Kumar.
• Added decision function for multiclass.OneVsOneClassifier By Raghav RV and Kyle Beauchamp.
• neighbors.kneighbors_graph and radius_neighbors_graph support non-Euclidean metrics.
By Manoj Kumar


• Parameter connectivity in cluster.AgglomerativeClustering and family now accept callables
that return a connectivity matrix. By Manoj Kumar.
• Sparse support for paired_distances. By Joel Nothman.
• cluster.DBSCAN now supports sparse input and sample weights and has been optimized: the inner loop has
been rewritten in Cython and radius neighbors queries are now computed in batch. By Joel Nothman and Lars
Buitinck.
• Add class_weight parameter to automatically weight samples by class frequency for ensemble.RandomForestClassifier, tree.DecisionTreeClassifier, ensemble.ExtraTreesClassifier and tree.ExtraTreeClassifier. By Trevor Stephens.
• grid_search.RandomizedSearchCV now does sampling without replacement if all parameters are
given as lists. By Andreas Müller.
• Parallelized calculation of pairwise_distances is now supported for scipy metrics and custom callables.
By Joel Nothman.
• Allow the fitting and scoring of all clustering algorithms in pipeline.Pipeline. By Andreas Müller.
• More robust seeding and improved error messages in cluster.MeanShift by Andreas Müller.
• Make the stopping criterion for mixture.GMM , mixture.DPGMM and mixture.VBGMM less dependent
on the number of samples by thresholding the average log-likelihood change instead of its sum over all samples.
By Hervé Bredin.
• The outcome of manifold.spectral_embedding was made deterministic by flipping the sign of eigenvectors. By Hasil Sharma.
• Significant performance and memory usage improvements in preprocessing.PolynomialFeatures.
By Eric Martin.
• Numerical stability improvements for preprocessing.StandardScaler and preprocessing.
scale. By Nicolas Goix
• svm.SVC fitted on sparse input now implements decision_function. By Rob Zinkov and Andreas
Müller.
• cross_validation.train_test_split now preserves the input type, instead of converting to numpy
arrays.
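A minimal sketch of the multinomial option in LogisticRegression (iris data; assuming a release where the explicit multi_class flag is accepted, as it is here, and using the lbfgs solver it requires):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    iris = load_iris()
    X, y = iris.data, iris.target

    # minimize the multinomial (cross-entropy) loss instead of fitting one-vs-rest models
    clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
    clf.fit(X, y)
    print(clf.predict_proba(X[:3]).sum(axis=1))   # each row sums to 1 across the classes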
Documentation improvements
• Added example of using FeatureUnion for heterogeneous input. By Matt Terry
• Documentation on scorers was improved, to highlight the handling of loss functions. By Matt Pico.
• A discrepancy between liblinear output and scikit-learn’s wrappers is now noted. By Manoj Kumar.
• Improved documentation generation: examples referring to a class or function are now shown in a gallery on
the class/function’s API reference page. By Joel Nothman.
• More explicit documentation of sample generators and of data transformation. By Joel Nothman.
• sklearn.neighbors.BallTree and sklearn.neighbors.KDTree used to point to empty pages
stating that they are aliases of BinaryTree. This has been fixed to show the correct class docs. By Manoj Kumar.
• Added silhouette plots for analysis of KMeans clustering using metrics.silhouette_samples and
metrics.silhouette_score. See Selecting the number of clusters with silhouette analysis on KMeans
clustering


Bug fixes
• Metaestimators now support ducktyping for the presence of decision_function,
predict_proba and other methods.
This fixes behavior of grid_search.GridSearchCV ,
grid_search.RandomizedSearchCV , pipeline.Pipeline, feature_selection.RFE,
feature_selection.RFECV when nested. By Joel Nothman
• The scoring attribute of grid-search and cross-validation methods is no longer ignored when a
grid_search.GridSearchCV is given as a base estimator or the base estimator doesn’t have predict.
• The function hierarchical.ward_tree now returns the children in the same order for both the structured
and unstructured versions. By Matteo Visconti di Oleggio Castello.
• feature_selection.RFECV now correctly handles cases when step is not equal to 1. By Nikolay
Mayorov
• The decomposition.PCA now undoes whitening in its inverse_transform. Also, its components_
now always have unit length. By Michael Eickenberg.
• Fix incomplete download of the dataset when datasets.download_20newsgroups is called. By Manoj
Kumar.
• Various fixes to the Gaussian processes subpackage by Vincent Dubourg and Jan Hendrik Metzen.
• Calling partial_fit with class_weight=='auto' throws an appropriate error message and suggests
a work around. By Danny Sullivan.
• RBFSampler with gamma=g formerly approximated rbf_kernel with gamma=g/2.; the definition of
gamma is now consistent, which may substantially change your results if you use a fixed value. (If you crossvalidated over gamma, it probably doesn’t matter too much.) By Dougal Sutherland.
• Pipeline object delegate the classes_ attribute to the underlying estimator. It allows, for instance, to make
bagging of a pipeline object. By Arnaud Joly
• neighbors.NearestCentroid now uses the median as the centroid when metric is set to manhattan.
It was using the mean before. By Manoj Kumar
• Fix numerical stability issues in linear_model.SGDClassifier and linear_model.
SGDRegressor by clipping large gradients and ensuring that weight decay rescaling is always positive (for
large l2 regularization and large learning rate values). By Olivier Grisel
• When compute_full_tree is set to “auto”, the full tree is built when n_clusters is high and is early stopped when
n_clusters is low, while the behavior should be vice-versa in cluster.AgglomerativeClustering (and
friends). This has been fixed By Manoj Kumar
• Fix lazy centering of data in linear_model.enet_path and linear_model.lasso_path. It was
centered around one. It has been changed to be centered around the origin. By Manoj Kumar
• Fix handling of precomputed affinity matrices in cluster.AgglomerativeClustering when using
connectivity constraints. By Cathy Deng
• Correct partial_fit handling of class_prior for sklearn.naive_bayes.MultinomialNB and
sklearn.naive_bayes.BernoulliNB. By Trevor Stephens.
• Fixed a crash in metrics.precision_recall_fscore_support when using unsorted labels in the
multi-label setting. By Andreas Müller.
• Avoid skipping the first nearest neighbor in the methods radius_neighbors, kneighbors, kneighbors_graph and radius_neighbors_graph in sklearn.neighbors.NearestNeighbors and family, when the query data is not the same as fit data. By Manoj Kumar.
• Fix log-density calculation in the mixture.GMM with tied covariance. By Will Dawson


• Fixed a scaling error in feature_selection.SelectFdr where a factor n_features was missing. By
Andrew Tulloch
• Fix zero division in neighbors.KNeighborsRegressor and related classes when using distance weighting and having identical data points. By Garret-R.
• Fixed round off errors with non positive-definite covariance matrices in GMM. By Alexis Mignon.
• Fixed an error in the computation of conditional probabilities in naive_bayes.BernoulliNB. By Hanna Wallach.
• Make the method radius_neighbors of neighbors.NearestNeighbors return the samples lying
on the boundary for algorithm='brute'. By Yan Yi.
• Flip sign of dual_coef_ of svm.SVC to make it consistent with the documentation and
decision_function. By Artem Sobolev.
• Fixed handling of ties in isotonic.IsotonicRegression. We now use the weighted average of targets
(secondary method). By Andreas Müller and Michael Bommarito.
API changes summary
• GridSearchCV and cross_val_score and other meta-estimators don’t convert pandas DataFrames into
arrays any more, allowing DataFrame specific operations in custom estimators.
• multiclass.fit_ovr, multiclass.predict_ovr, predict_proba_ovr, multiclass.fit_ovo, multiclass.predict_ovo, multiclass.fit_ecoc and multiclass.predict_ecoc are deprecated. Use the underlying estimators instead.
• Nearest neighbors estimators used to take arbitrary keyword arguments and pass these to their distance metric.
This will no longer be supported in scikit-learn 0.18; use the metric_params argument instead.
• n_jobs parameter of the fit method shifted to the constructor of the LinearRegression class.
• The predict_proba method of multiclass.OneVsRestClassifier now returns two probabilities
per sample in the multiclass case; this is consistent with other estimators and with the method’s documentation, but previous versions accidentally returned only the positive probability. Fixed by Will Lamond and Lars
Buitinck.
• Change default value of precompute in ElasticNet and Lasso to False. Setting precompute to “auto” was
found to be slower when n_samples > n_features since the computation of the Gram matrix is computationally
expensive and outweighs the benefit of fitting the Gram for just one alpha. precompute="auto" is now
deprecated and will be removed in 0.18 By Manoj Kumar.
• Expose positive option in linear_model.enet_path and linear_model.lasso_path which constrains coefficients to be positive. By Manoj Kumar.
• Users should now supply an explicit average parameter to sklearn.metrics.f1_score, sklearn.metrics.fbeta_score, sklearn.metrics.recall_score and sklearn.metrics.precision_score when performing multiclass or multilabel (i.e. not binary) classification (see the sketch after this list). By Joel Nothman.
• scoring parameter for cross validation now accepts ‘f1_micro’, ‘f1_macro’ or ‘f1_weighted’. ‘f1’ is now for
binary classification only. Similar changes apply to ‘precision’ and ‘recall’. By Joel Nothman.
• The fit_intercept, normalize and return_models parameters in linear_model.enet_path
and linear_model.lasso_path have been removed. They were deprecated since 0.14
• From now onwards, all estimators will uniformly raise NotFittedError (utils.validation.
NotFittedError), when any of the predict like methods are called before the model is fit. By Raghav
RV.


• Input data validation was refactored for more consistent input validation. The check_arrays function was
replaced by check_array and check_X_y. By Andreas Müller.
• Allow X=None in the methods radius_neighbors, kneighbors, kneighbors_graph and
radius_neighbors_graph in sklearn.neighbors.NearestNeighbors and family. If set to
None, then for every sample this avoids setting the sample itself as the first nearest neighbor. By Manoj Kumar.
• Add parameter include_self in neighbors.kneighbors_graph and neighbors.
radius_neighbors_graph which has to be explicitly set by the user. If set to True, then the
sample itself is considered as the first nearest neighbor.
• thresh parameter is deprecated in favor of new tol parameter in GMM, DPGMM and VBGMM. See Enhancements
section for details. By Hervé Bredin.
• Estimators will treat input with dtype object as numeric when possible. By Andreas Müller
• Estimators now raise ValueError consistently when fitted on empty data (less than 1 sample or less than 1 feature
for 2D input). By Olivier Grisel.
• The shuffle option of linear_model.SGDClassifier, linear_model.SGDRegressor, linear_model.Perceptron, linear_model.PassiveAggressiveClassifier and linear_model.PassiveAggressiveRegressor now defaults to True.
• cluster.DBSCAN now uses a deterministic initialization. The random_state parameter is deprecated. By
Erich Schubert.
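A minimal sketch of supplying an explicit average parameter to the precision/recall/F-score metrics in the multiclass setting (toy labels):

    from sklearn.metrics import f1_score

    y_true = [0, 1, 2, 2, 1, 0]
    y_pred = [0, 2, 2, 2, 1, 0]

    print(f1_score(y_true, y_pred, average='macro'))   # unweighted mean over classes
    print(f1_score(y_true, y_pred, average='micro'))   # global counts of TP, FP and FN
    print(f1_score(y_true, y_pred, average=None))      # per-class scores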
Code Contributors
A. Flaxman, Aaron Schumacher, Aaron Staple, abhishek thakur, Akshay, akshayah3, Aldrian Obaja, Alexander
Fabisch, Alexandre Gramfort, Alexis Mignon, Anders Aagaard, Andreas Mueller, Andreas van Cranenburgh, Andrew Tulloch, Andrew Walker, Antony Lee, Arnaud Joly, banilo, Barmaley.exe, Ben Davies, Benedikt Koehler, bhsu,
Boris Feld, Borja Ayerdi, Boyuan Deng, Brent Pedersen, Brian Wignall, Brooke Osborn, Calvin Giles, Cathy Deng,
Celeo, cgohlke, chebee7i, Christian Stade-Schuldt, Christof Angermueller, Chyi-Kwei Yau, CJ Carey, Clemens Brunner, Daiki Aminaka, Dan Blanchard, danfrankj, Danny Sullivan, David Fletcher, Dmitrijs Milajevs, Dougal J. Sutherland, Erich Schubert, Fabian Pedregosa, Florian Wilhelm, floydsoft, Félix-Antoine Fortin, Gael Varoquaux, Garrett-R,
Gilles Louppe, gpassino, gwulfs, Hampus Bengtsson, Hamzeh Alsalhi, Hanna Wallach, Harry Mavroforakis, Hasil
Sharma, Helder, Herve Bredin, Hsiang-Fu Yu, Hugues SALAMIN, Ian Gilmore, Ilambharathi Kanniah, Imran Haque,
isms, Jake VanderPlas, Jan Dlabal, Jan Hendrik Metzen, Jatin Shah, Javier López Peña, jdcaballero, Jean Kossaifi, Jeff
Hammerbacher, Joel Nothman, Jonathan Helmus, Joseph, Kaicheng Zhang, Kevin Markham, Kyle Beauchamp, Kyle
Kastner, Lagacherie Matthieu, Lars Buitinck, Laurent Direr, leepei, Loic Esteve, Luis Pedro Coelho, Lukas Michelbacher, maheshakya, Manoj Kumar, Manuel, Mario Michael Krell, Martin, Martin Billinger, Martin Ku, Mateusz
Susik, Mathieu Blondel, Matt Pico, Matt Terry, Matteo Visconti dOC, Matti Lyra, Max Linke, Mehdi Cherti, Michael
Bommarito, Michael Eickenberg, Michal Romaniuk, MLG, mr.Shu, Nelle Varoquaux, Nicola Montecchio, Nicolas,
Nikolay Mayorov, Noel Dawe, Okal Billy, Olivier Grisel, Óscar Nájera, Paolo Puggioni, Peter Prettenhofer, Pratap
Vardhan, pvnguyen, queqichao, Rafael Carrascosa, Raghav R V, Rahiel Kasim, Randall Mason, Rob Zinkov, Robert
Bradshaw, Saket Choudhary, Sam Nicholls, Samuel Charron, Saurabh Jha, sethdandridge, sinhrks, snuderl, Stefan
Otte, Stefan van der Walt, Steve Tjoa, swu, Sylvain Zimmer, tejesh95, terrycojones, Thomas Delteil, Thomas Unterthiner, Tomas Kazmar, trevorstephens, tttthomasssss, Tzu-Ming Kuo, ugurcaliskan, ugurthemaster, Vinayak Mehta,
Vincent Dubourg, Vjacheslav Murashkin, Vlad Niculae, wadawson, Wei Xue, Will Lamond, Wu Jiang, x0l, Xinfan
Meng, Yan Yi, Yu-Chin

1.7.10 Version 0.15.2
September 4, 2014

Bug fixes
• Fixed handling of the p parameter of the Minkowski distance that was previously ignored in nearest neighbors
models. By Nikolay Mayorov.
• Fixed duplicated alphas in linear_model.LassoLars with early stopping on 32 bit Python. By Olivier
Grisel and Fabian Pedregosa.
• Fixed the build under Windows when scikit-learn is built with MSVC while NumPy is built with MinGW. By
Olivier Grisel and Federico Vaggi.
• Fixed an array index overflow bug in the coordinate descent solver. By Gael Varoquaux.
• Better handling of numpy 1.9 deprecation warnings. By Gael Varoquaux.
• Removed unnecessary data copy in cluster.KMeans. By Gael Varoquaux.
• Explicitly close open files to avoid ResourceWarnings under Python 3. By Calvin Giles.
• The transform of discriminant_analysis.LinearDiscriminantAnalysis now projects the
input on the most discriminant directions. By Martin Billinger.
• Fixed potential overflow in _tree.safe_realloc by Lars Buitinck.
• Performance optimization in isotonic.IsotonicRegression. By Robert Bradshaw.
• nose is no longer a runtime dependency to import sklearn; it is only needed for running the tests. By Joel Nothman.
• Many documentation and website fixes by Joel Nothman, Lars Buitinck, Matt Pico, and others.

1.7.11 Version 0.15.1
August 1, 2014
Bug fixes
• Made cross_validation.cross_val_score use cross_validation.KFold instead of
cross_validation.StratifiedKFold on multi-output classification problems. By Nikolay Mayorov.
• Support unseen labels in preprocessing.LabelBinarizer to restore the default behavior of 0.14.1 for backward compatibility. By Hamzeh Alsalhi.
• Fixed the cluster.KMeans stopping criterion that prevented early convergence detection. By Edward Raff
and Gael Varoquaux.
• Fixed the behavior of multiclass.OneVsOneClassifier in case of ties at the per-class vote level by computing the correct per-class sum of prediction scores. By Andreas Müller.
• Made cross_validation.cross_val_score and grid_search.GridSearchCV accept Python
lists as input data. This is especially useful for cross-validation and model selection of text processing pipelines.
By Andreas Müller.
• Fixed data input checks of most estimators to accept input data that implements the NumPy __array__ protocol. This is the case for pandas.Series and pandas.DataFrame in recent versions of pandas. By Gael Varoquaux.
• Fixed a regression for linear_model.SGDClassifier with class_weight="auto" on data with
non-contiguous labels. By Olivier Grisel.

1.7.12 Version 0.15
July 15, 2014
Highlights
• Many speed and memory improvements all across the code
• Huge speed and memory improvements to random forests (and extra trees) that also benefit better from parallel
computing.
• Incremental fit to BernoulliRBM
• Added cluster.AgglomerativeClustering for hierarchical agglomerative clustering with average
linkage, complete linkage and ward strategies.
• Added linear_model.RANSACRegressor for robust regression models.
• Added dimensionality reduction with manifold.TSNE which can be used to visualize high-dimensional data.
Changelog
New features
• Added ensemble.BaggingClassifier and ensemble.BaggingRegressor meta-estimators for
ensembling any kind of base estimator. See the Bagging section of the user guide for details and examples.
By Gilles Louppe.
• New unsupervised feature selection algorithm feature_selection.VarianceThreshold, by Lars
Buitinck.
• Added linear_model.RANSACRegressor meta-estimator for the robust fitting of regression models. By
Johannes Schönberger.
• Added cluster.AgglomerativeClustering for hierarchical agglomerative clustering with average
linkage, complete linkage and ward strategies, by Nelle Varoquaux and Gael Varoquaux.
• Shorthand constructors pipeline.make_pipeline and pipeline.make_union were added by Lars
Buitinck.
• Shuffle option for cross_validation.StratifiedKFold. By Jeffrey Blackburne.
• Incremental learning (partial_fit) for Gaussian Naive Bayes by Imran Haque.
• Added partial_fit to BernoulliRBM By Danny Sullivan.
• Added learning_curve utility to chart performance with respect to training size. See Plotting Learning
Curves. By Alexander Fabisch.
• Add positive option in LassoCV and ElasticNetCV. By Brian Wignall and Alexandre Gramfort.
• Added linear_model.MultiTaskElasticNetCV and linear_model.MultiTaskLassoCV. By Manoj Kumar.
• Added manifold.TSNE. By Alexander Fabisch.

Enhancements
• Add sparse input support to ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor meta-estimators. By Hamzeh Alsalhi.
• Memory improvements of decision trees, by Arnaud Joly.
• Decision trees can now be built in best-first manner by using max_leaf_nodes as the stopping criteria.
Refactored the tree code to use either a stack or a priority queue for tree building. By Peter Prettenhofer and
Gilles Louppe.
• Decision trees can now be fitted on fortran- and c-style arrays, and non-contiguous arrays without the need to make a copy. If the input array has a different dtype than np.float32, a fortran-style copy will be made since fortran-style memory layout has speed advantages. By Peter Prettenhofer and Gilles Louppe.
• Speed improvement of regression trees by optimizing the computation of the mean square error criterion. This led to speed improvements of the tree, forest and gradient boosting tree modules. By Arnaud Joly.
• The img_to_graph and grid_to_graph functions in sklearn.feature_extraction.image now return np.ndarray instead of np.matrix when return_as=np.ndarray. See the Notes section for more information on compatibility.
• Changed the internal storage of decision trees to use a struct array. This fixed some small bugs, while improving
code and providing a small speed gain. By Joel Nothman.
• Reduce memory usage and overhead when fitting and predicting with forests of randomized trees in parallel
with n_jobs != 1 by leveraging new threading backend of joblib 0.8 and releasing the GIL in the tree fitting
Cython code. By Olivier Grisel and Gilles Louppe.
• Speed improvement of the sklearn.ensemble.gradient_boosting module. By Gilles Louppe and
Peter Prettenhofer.
• Various enhancements to the sklearn.ensemble.gradient_boosting module: a warm_start argument to fit additional trees, a max_leaf_nodes argument to fit GBM style trees, a monitor fit argument
to inspect the estimator during training, and refactoring of the verbose code. By Peter Prettenhofer.
• Faster sklearn.ensemble.ExtraTrees by caching feature values. By Arnaud Joly.
• Faster depth-based tree building algorithm such as decision tree, random forest, extra trees or gradient tree
boosting (with depth based growing strategy) by avoiding trying to split on found constant features in the sample
subset. By Arnaud Joly.
• Add min_weight_fraction_leaf pre-pruning parameter to tree-based methods: the minimum weighted
fraction of the input samples required to be at a leaf node. By Noel Dawe.
• Added metrics.pairwise_distances_argmin_min, by Philippe Gervais.
• Added predict method to cluster.AffinityPropagation and cluster.MeanShift, by Mathieu
Blondel.
• Vector and matrix multiplications have been optimised throughout the library by Denis Engemann and Alexandre Gramfort. In particular, they should take less memory with older NumPy versions (prior to 1.7.2).
• Precision-recall and ROC examples now use train_test_split, and have more explanation of why these metrics
are useful. By Kyle Kastner
• The training algorithm for decomposition.NMF is faster for sparse matrices and has much lower memory
complexity, meaning it will scale up gracefully to large datasets. By Lars Buitinck.
• Added an svd_method option, with default value "randomized", to decomposition.FactorAnalysis to save memory and significantly speed up computation, by Denis Engemann and Alexandre Gramfort.

• Changed cross_validation.StratifiedKFold to try to preserve as much of the original ordering of samples as possible so as not to hide overfitting on datasets with a non-negligible level of sample dependency. By Daniel Nouri and Olivier Grisel.
• Add multi-output support to gaussian_process.GaussianProcess by John Novak.
• Support for precomputed distance matrices in nearest neighbor estimators by Robert Layton and Joel Nothman.
• Norm computations optimized for NumPy 1.6 and later versions by Lars Buitinck. In particular, the k-means
algorithm no longer needs a temporary data structure the size of its input.
• dummy.DummyClassifier can now be used to predict a constant output value. By Manoj Kumar.
• dummy.DummyRegressor now has a strategy parameter which allows predicting the mean or the median of the training set, or a constant output value. By Maheshakya Wijewardena.
• Multi-label classification output in multilabel indicator format is now supported by metrics.
roc_auc_score and metrics.average_precision_score by Arnaud Joly.
• Significant performance improvements (more than 100x speedup for large problems) in isotonic.
IsotonicRegression by Andrew Tulloch.
• Speed and memory usage improvements to the SGD algorithm for linear models: it now uses threads, not
separate processes, when n_jobs>1. By Lars Buitinck.
• Grid search and cross validation allow NaNs in the input arrays so that preprocessors such as
preprocessing.Imputer can be trained within the cross validation loop, avoiding potentially skewed
results.
• Ridge regression can now deal with sample weights in feature space (only sample space until then). By Michael
Eickenberg. Both solutions are provided by the Cholesky solver.
• Several classification and regression metrics now support weighted samples with the new sample_weight argument: metrics.accuracy_score, metrics.zero_one_loss, metrics.precision_score, metrics.average_precision_score, metrics.f1_score, metrics.fbeta_score, metrics.recall_score, metrics.roc_auc_score, metrics.explained_variance_score, metrics.mean_squared_error, metrics.mean_absolute_error, metrics.r2_score. By Noel Dawe. (A short usage sketch follows this list.)
• Speed up of the sample generator datasets.make_multilabel_classification. By Joel Nothman.
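As a hedged illustration of the sample_weight support listed above, the following sketch uses made-up labels and weights with metrics.accuracy_score; it is not part of the original changelog:

    from sklearn.metrics import accuracy_score

    y_true = [0, 1, 1, 0]
    y_pred = [0, 1, 0, 0]

    # Unweighted: 3 of the 4 predictions are correct.
    print(accuracy_score(y_true, y_pred))                                 # 0.75

    # Down-weighting the misclassified third sample raises the weighted
    # accuracy to 3 / 3.5, i.e. roughly 0.857.
    print(accuracy_score(y_true, y_pred, sample_weight=[1, 1, 0.5, 1]))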
Documentation improvements
• The Working With Text Data tutorial has now been worked in to the main documentation’s tutorial section.
Includes exercises and skeletons for tutorial presentation. Original tutorial created by several authors including
Olivier Grisel, Lars Buitinck and many others. Tutorial integration into the scikit-learn documentation by Jaques
Grobler
• Added Computational Performance documentation. Discussion and examples of prediction latency / throughput
and different factors that have influence over speed. Additional tips for building faster models and choosing a
relevant compromise between speed and predictive power. By Eustache Diemert.
Bug fixes
• Fixed bug in decomposition.MiniBatchDictionaryLearning: partial_fit was not working properly.
• Fixed bug in linear_model.stochastic_gradient: l1_ratio was used as (1.0 - l1_ratio).

• Fixed bug in multiclass.OneVsOneClassifier with string labels
• Fixed a bug in LassoCV and ElasticNetCV: they would not pre-compute the Gram matrix with precompute=True or precompute="auto" and n_samples > n_features. By Manoj Kumar.
• Fixed incorrect estimation of the degrees of freedom in feature_selection.f_regression when variates are not centered. By Virgile Fritsch.
• Fixed a race condition in parallel processing with pre_dispatch != "all" (for instance, in
cross_val_score). By Olivier Grisel.
• Raise error in cluster.FeatureAgglomeration and cluster.WardAgglomeration when no
samples are given, rather than returning meaningless clustering.
• Fixed bug in gradient_boosting.GradientBoostingRegressor with loss='huber': gamma
might have not been initialized.
• Fixed feature importances as computed with a forest of randomized trees when fit with sample_weight !=
None and/or with bootstrap=True. By Gilles Louppe.
API changes summary
• sklearn.hmm is deprecated. Its removal is planned for the 0.17 release.
• Use of covariance.EllipticEnvelop has now been removed after deprecation. Please use covariance.EllipticEnvelope instead.
• cluster.Ward is deprecated. Use cluster.AgglomerativeClustering instead.
• cluster.WardClustering is deprecated. Use cluster.AgglomerativeClustering instead.
• cross_validation.Bootstrap is deprecated. cross_validation.KFold or cross_validation.ShuffleSplit are recommended instead.
• Direct support for the sequence of sequences (or list of lists) multilabel format is deprecated. To convert to and from the supported binary indicator matrix format, use MultiLabelBinarizer. By Joel Nothman. (A short conversion sketch follows this list.)
• Add score method to PCA following the model of probabilistic PCA and deprecate ProbabilisticPCA
model whose score implementation is not correct. The computation now also exploits the matrix inversion
lemma for faster computation. By Alexandre Gramfort.
• The score method of FactorAnalysis now returns the average log-likelihood of the samples. Use score_samples to get the log-likelihood of each sample. By Alexandre Gramfort.
• Generating boolean masks (the setting indices=False) from cross-validation generators is deprecated. Support for masks will be removed in 0.17. The generators have produced arrays of indices by default since 0.10.
By Joel Nothman.
• 1-d arrays containing strings with dtype=object (as used in Pandas) are now considered valid classification
targets. This fixes a regression from version 0.13 in some classifiers. By Joel Nothman.
• Fix wrong explained_variance_ratio_ attribute in RandomizedPCA. By Alexandre Gramfort.
• Fit alphas for each l1_ratio instead of mean_l1_ratio in linear_model.ElasticNetCV and linear_model.LassoCV. This changes the shape of alphas_ from (n_alphas,) to (n_l1_ratio, n_alphas) if the l1_ratio provided is a 1-D array-like object of length greater than one. By Manoj Kumar.
• Fix linear_model.ElasticNetCV and linear_model.LassoCV when fitting intercept and input
data is sparse. The automatic grid of alphas was not computed correctly and the scaling with normalize was
wrong. By Manoj Kumar.

• Fix wrong maximal number of features drawn (max_features) at each split for decision trees, random forests and gradient tree boosting. Previously, the count of drawn features started only after one non-constant feature had been found in the split. This bug fix will affect computational and generalization performance of those algorithms in the presence of constant features. To get back previous generalization performance, you should modify the value of max_features. By Arnaud Joly.
• Fix wrong maximal number of features drawn (max_features) at each split for ensemble.ExtraTreesClassifier and ensemble.ExtraTreesRegressor. Previously, only non-constant features in the split were counted as drawn. Now constant features are counted as drawn. Furthermore, at least one feature must be non-constant in order to make a valid split. This bug fix will affect computational and generalization performance of extra trees in the presence of constant features. To get back previous generalization performance, you should modify the value of max_features. By Arnaud Joly.
• Fix utils.compute_class_weight when class_weight=="auto". Previously it was broken for
input of non-integer dtype and the weighted array that was returned was wrong. By Manoj Kumar.
• Fix cross_validation.Bootstrap to return ValueError when n_train + n_test > n. By
Ronald Phlypo.
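A short sketch of the MultiLabelBinarizer conversion mentioned in the multilabel-format item above; the label sequences are invented for illustration and are not part of the release notes:

    from sklearn.preprocessing import MultiLabelBinarizer

    sequences = [["news", "sports"], ["news"], ["weather", "sports"]]

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(sequences)      # binary indicator matrix, one column per label
    print(mlb.classes_)                   # ['news' 'sports' 'weather']
    print(Y)
    print(mlb.inverse_transform(Y))       # back to tuples of labels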
People
List of contributors for release 0.15 by number of commits.
• 312 Olivier Grisel
• 275 Lars Buitinck
• 221 Gael Varoquaux
• 148 Arnaud Joly
• 134 Johannes Schönberger
• 119 Gilles Louppe
• 113 Joel Nothman
• 111 Alexandre Gramfort
• 95 Jaques Grobler
• 89 Denis Engemann
• 83 Peter Prettenhofer
• 83 Alexander Fabisch
• 62 Mathieu Blondel
• 60 Eustache Diemert
• 60 Nelle Varoquaux
• 49 Michael Bommarito
• 45 Manoj-Kumar-S
• 28 Kyle Kastner
• 26 Andreas Mueller
• 22 Noel Dawe
• 21 Maheshakya Wijewardena
• 21 Brooke Osborn
• 21 Hamzeh Alsalhi
• 21 Jake VanderPlas
• 21 Philippe Gervais
• 19 Bala Subrahmanyam Varanasi
• 12 Ronald Phlypo
• 10 Mikhail Korobov
• 8 Thomas Unterthiner
• 8 Jeffrey Blackburne
• 8 eltermann
• 8 bwignall
• 7 Ankit Agrawal
• 7 CJ Carey
• 6 Daniel Nouri
• 6 Chen Liu
• 6 Michael Eickenberg
• 6 ugurthemaster
• 5 Aaron Schumacher
• 5 Baptiste Lagarde
• 5 Rajat Khanduja
• 5 Robert McGibbon
• 5 Sergio Pascual
• 4 Alexis Metaireau
• 4 Ignacio Rossi
• 4 Virgile Fritsch
• 4 Sebastian Säger
• 4 Ilambharathi Kanniah
• 4 sdenton4
• 4 Robert Layton
• 4 Alyssa
• 4 Amos Waterland
• 3 Andrew Tulloch
• 3 murad
• 3 Steven Maude
• 3 Karol Pysniak
• 3 Jacques Kvam
• 3 cgohlke

• 3 cjlin
• 3 Michael Becker
• 3 hamzeh
• 3 Eric Jacobsen
• 3 john collins
• 3 kaushik94
• 3 Erwin Marsi
• 2 csytracy
• 2 LK
• 2 Vlad Niculae
• 2 Laurent Direr
• 2 Erik Shilts
• 2 Raul Garreta
• 2 Yoshiki Vázquez Baeza
• 2 Yung Siang Liau
• 2 abhishek thakur
• 2 James Yu
• 2 Rohit Sivaprasad
• 2 Roland Szabo
• 2 amormachine
• 2 Alexis Mignon
• 2 Oscar Carlsson
• 2 Nantas Nardelli
• 2 jess010
• 2 kowalski87
• 2 Andrew Clegg
• 2 Federico Vaggi
• 2 Simon Frid
• 2 Félix-Antoine Fortin
• 1 Ralf Gommers
• 1 t-aft
• 1 Ronan Amicel
• 1 Rupesh Kumar Srivastava
• 1 Ryan Wang
• 1 Samuel Charron
• 1 Samuel St-Jean

• 1 Fabian Pedregosa
• 1 Skipper Seabold
• 1 Stefan Walk
• 1 Stefan van der Walt
• 1 Stephan Hoyer
• 1 Allen Riddell
• 1 Valentin Haenel
• 1 Vijay Ramesh
• 1 Will Myers
• 1 Yaroslav Halchenko
• 1 Yoni Ben-Meshulam
• 1 Yury V. Zaytsev
• 1 adrinjalali
• 1 ai8rahim
• 1 alemagnani
• 1 alex
• 1 benjamin wilson
• 1 chalmerlowe
• 1 dzikie drożdże
• 1 jamestwebber
• 1 matrixorz
• 1 popo
• 1 samuela
• 1 François Boulogne
• 1 Alexander Measure
• 1 Ethan White
• 1 Guilherme Trein
• 1 Hendrik Heuer
• 1 IvicaJovic
• 1 Jan Hendrik Metzen
• 1 Jean Michel Rouly
• 1 Eduardo Ariño de la Rubia
• 1 Jelle Zijlstra
• 1 Eddy L O Jansson
• 1 Denis
• 1 John

• 1 John Schmidt
• 1 Jorge Cañardo Alastuey
• 1 Joseph Perla
• 1 Joshua Vredevoogd
• 1 José Ricardo
• 1 Julien Miotte
• 1 Kemal Eren
• 1 Kenta Sato
• 1 David Cournapeau
• 1 Kyle Kelley
• 1 Daniele Medri
• 1 Laurent Luce
• 1 Laurent Pierron
• 1 Luis Pedro Coelho
• 1 DanielWeitzenfeld
• 1 Craig Thompson
• 1 Chyi-Kwei Yau
• 1 Matthew Brett
• 1 Matthias Feurer
• 1 Max Linke
• 1 Chris Filo Gorgolewski
• 1 Charles Earl
• 1 Michael Hanke
• 1 Michele Orrù
• 1 Bryan Lunt
• 1 Brian Kearns
• 1 Paul Butler
• 1 Paweł Mandera
• 1 Peter
• 1 Andrew Ash
• 1 Pietro Zambelli
• 1 staubda

1.7.13 Version 0.14
August 7, 2013

Changelog
• Missing values with sparse and dense matrices can be imputed with the transformer preprocessing.
Imputer by Nicolas Trésegnie.
• The core implementation of decision trees has been rewritten from scratch, allowing for faster tree induction and lower memory consumption in all tree-based estimators. By Gilles Louppe.
• Added ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor, by Noel Dawe and
Gilles Louppe. See the AdaBoost section of the user guide for details and examples.
• Added grid_search.RandomizedSearchCV and grid_search.ParameterSampler for randomized hyperparameter optimization. By Andreas Müller.
• Added biclustering algorithms (sklearn.cluster.bicluster.SpectralCoclustering and
sklearn.cluster.bicluster.SpectralBiclustering), data generation methods (sklearn.
datasets.make_biclusters and sklearn.datasets.make_checkerboard), and scoring metrics (sklearn.metrics.consensus_score). By Kemal Eren.
• Added Restricted Boltzmann Machines (neural_network.BernoulliRBM ). By Yann Dauphin.
• Python 3 support by Justin Vincent, Lars Buitinck, Subhodeep Moitra and Olivier Grisel. All tests now pass
under Python 3.3.
• Ability to pass one penalty (alpha value) per target in linear_model.Ridge, by @eickenberg and Mathieu
Blondel.
• Fixed sklearn.linear_model.stochastic_gradient.py L2 regularization issue (minor practical significance). By Norbert Crombach and Mathieu Blondel.
• Added an interactive version of Andreas Müller’s Machine Learning Cheat Sheet (for scikit-learn) to the documentation. See Choosing the right estimator. By Jaques Grobler.
• grid_search.GridSearchCV and cross_validation.cross_val_score now support the use
of advanced scoring function such as area under the ROC curve and f-beta scores. See The scoring parameter:
defining model evaluation rules for details. By Andreas Müller and Lars Buitinck. Passing a function from
sklearn.metrics as score_func is deprecated.
• Multi-label classification output is now supported by metrics.accuracy_score, metrics.zero_one_loss, metrics.f1_score, metrics.fbeta_score, metrics.classification_report, metrics.precision_score and metrics.recall_score by Arnaud Joly.
• Two new metrics metrics.hamming_loss and metrics.jaccard_similarity_score are added
with multi-label support by Arnaud Joly.
• Speed and memory usage improvements in feature_extraction.text.CountVectorizer and
feature_extraction.text.TfidfVectorizer, by Jochen Wersdörfer and Roman Sinayev.
• The min_df parameter in feature_extraction.text.CountVectorizer and feature_extraction.text.TfidfVectorizer, which used to be 2, has been reset to 1 to avoid unpleasant surprises (empty vocabularies) for novice users who try it out on tiny document collections. A value of at least 2 is still recommended for practical use.
• svm.LinearSVC, linear_model.SGDClassifier and linear_model.SGDRegressor now
have a sparsify method that converts their coef_ into a sparse matrix, meaning stored models trained
using these estimators can be made much more compact.
• linear_model.SGDClassifier now produces multiclass probability estimates when trained under log
loss or modified Huber loss.
• Hyperlinks to documentation in example code on the website by Martin Luessi.

• Fixed bug in preprocessing.MinMaxScaler causing incorrect scaling of the features for non-default
feature_range settings. By Andreas Müller.
• max_features in tree.DecisionTreeClassifier, tree.DecisionTreeRegressor and all
derived ensemble estimators now supports percentage values. By Gilles Louppe.
• Performance improvements in isotonic.IsotonicRegression by Nelle Varoquaux.
• metrics.accuracy_score has an option normalize to return the fraction or the number of correctly classified samples, by Arnaud Joly.
• Added metrics.log_loss that computes log loss, aka cross-entropy loss. By Jochen Wersdörfer and Lars
Buitinck.
• A bug that caused ensemble.AdaBoostClassifier to output incorrect probabilities has been fixed.
• Feature selectors now share a mixin providing consistent transform, inverse_transform and
get_support methods. By Joel Nothman.
• A fitted grid_search.GridSearchCV or grid_search.RandomizedSearchCV can now generally
be pickled. By Joel Nothman.
• Refactored and vectorized implementation of metrics.roc_curve and metrics.precision_recall_curve. By Joel Nothman.
• The new estimator sklearn.decomposition.TruncatedSVD performs dimensionality reduction using
SVD on sparse matrices, and can be used for latent semantic analysis (LSA). By Lars Buitinck.
• Added self-contained example of out-of-core learning on text data Out-of-core classification of text documents.
By Eustache Diemert.
• The default number of components for sklearn.decomposition.RandomizedPCA is now correctly
documented to be n_features. This was the default behavior, so programs using it will continue to work as
they did.
• sklearn.cluster.KMeans now fits several orders of magnitude faster on sparse data (the speedup depends
on the sparsity). By Lars Buitinck.
• Reduce memory footprint of FastICA by Denis Engemann and Alexandre Gramfort.
• Verbose output in sklearn.ensemble.gradient_boosting now uses a column format and prints
progress in decreasing frequency. It also shows the remaining time. By Peter Prettenhofer.
• sklearn.ensemble.gradient_boosting provides out-of-bag improvement oob_improvement_
rather than the OOB score for model selection. An example that shows how to use OOB estimates to select the
number of trees was added. By Peter Prettenhofer.
• Most metrics now support string labels for multiclass classification by Arnaud Joly and Lars Buitinck.
• New OrthogonalMatchingPursuitCV class by Alexandre Gramfort and Vlad Niculae.
• Fixed a bug in sklearn.covariance.GraphLassoCV: the alphas parameter now works as expected when given a list of values. By Philippe Gervais.
• Fixed an important bug in sklearn.covariance.GraphLassoCV that prevented all folds provided by a CV object from being used (only the first 3 were used). When providing a CV object, execution time may thus increase significantly compared to the previous version (the results are now correct). By Philippe Gervais.
• cross_validation.cross_val_score and the grid_search module are now tested with multi-output data by Arnaud Joly.
• datasets.make_multilabel_classification can now return the output in label indicator multilabel format by Arnaud Joly.

• K-nearest neighbors (neighbors.KNeighborsClassifier and neighbors.KNeighborsRegressor) and radius neighbors (neighbors.RadiusNeighborsClassifier and neighbors.RadiusNeighborsRegressor) support multi-output data by Arnaud Joly.
• Random state in LibSVM-based estimators (svm.SVC, NuSVC, OneClassSVM, svm.SVR, svm.NuSVR)
can now be controlled. This is useful to ensure consistency in the probability estimates for the classifiers trained
with probability=True. By Vlad Niculae.
• Out-of-core learning support for discrete naive Bayes classifiers sklearn.naive_bayes.MultinomialNB and sklearn.naive_bayes.BernoulliNB by adding the partial_fit method, by Olivier Grisel. (A short usage sketch follows this list.)
• New website design and navigation by Gilles Louppe, Nelle Varoquaux, Vincent Michel and Andreas Müller.
• Improved documentation on multi-class, multi-label and multi-output classification by Yannick Schwartz and
Arnaud Joly.
• Better input and error handling in the metrics module by Arnaud Joly and Joel Nothman.
• Speed optimization of the hmm module by Mikhail Korobov
• Significant speed improvements for sklearn.cluster.DBSCAN by cleverless
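A minimal, non-authoritative sketch of the out-of-core partial_fit workflow mentioned above for the discrete naive Bayes classifiers; the random mini-batches below are placeholders for data that would normally be streamed from disk:

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    rng = np.random.RandomState(0)
    classes = np.array([0, 1])

    clf = MultinomialNB()
    X1, y1 = rng.randint(5, size=(20, 10)), rng.randint(2, size=20)
    X2, y2 = rng.randint(5, size=(20, 10)), rng.randint(2, size=20)

    # The first call must declare every class that will ever be seen.
    clf.partial_fit(X1, y1, classes=classes)
    # Later mini-batches only update the fitted counts.
    clf.partial_fit(X2, y2)
    print(clf.predict(X2[:3]))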
API changes summary
• The auc_score was renamed roc_auc_score.
• Testing scikit-learn with sklearn.test() is deprecated. Use nosetests sklearn from the command
line.
• Feature importances in tree.DecisionTreeClassifier, tree.DecisionTreeRegressor and all
derived ensemble estimators are now computed on the fly when accessing the feature_importances_
attribute. Setting compute_importances=True is no longer required. By Gilles Louppe.
• linear_model.lasso_path and linear_model.enet_path can return their results in the same format as that of linear_model.lars_path. This is done by setting the return_models parameter to False. By Jaques Grobler and Alexandre Gramfort.
• grid_search.IterGrid was renamed to grid_search.ParameterGrid.
• Fixed bug in KFold causing imperfect class balance in some cases. By Alexandre Gramfort and Tadej Janež.
• sklearn.neighbors.BallTree has been refactored, and a sklearn.neighbors.KDTree has been
added which shares the same interface. The Ball Tree now works with a wide variety of distance metrics.
Both classes have many new methods, including single-tree and dual-tree queries, breadth-first and depth-first
searching, and more advanced queries such as kernel density estimation and 2-point correlation functions. By
Jake Vanderplas
• Support for scipy.spatial.cKDTree within neighbors queries has been removed, and the functionality replaced
with the new KDTree class.
• sklearn.neighbors.KernelDensity has been added, which performs efficient kernel density estimation with a variety of kernels. (A short sketch follows this list.)
• sklearn.decomposition.KernelPCA now always returns output with n_components components,
unless the new parameter remove_zero_eig is set to True. This new behavior is consistent with the way
kernel PCA was always documented; previously, the removal of components with zero eigenvalues was tacitly
performed on all data.
• gcv_mode="auto" no longer tries to perform SVD on a densified sparse matrix in sklearn.linear_model.RidgeCV.

• Sparse matrix support in sklearn.decomposition.RandomizedPCA is now deprecated in favor of the
new TruncatedSVD.
• cross_validation.KFold and cross_validation.StratifiedKFold now enforce n_folds >=
2 otherwise a ValueError is raised. By Olivier Grisel.
• datasets.load_files’s charset and charset_errors parameters were renamed encoding and
decode_errors.
• Attribute oob_score_ in sklearn.ensemble.GradientBoostingRegressor and sklearn.ensemble.GradientBoostingClassifier is deprecated and has been replaced by oob_improvement_.
• Attributes in OrthogonalMatchingPursuit have been deprecated (copy_X, Gram, ...) and precompute_gram renamed precompute for consistency. See #2224.
• sklearn.preprocessing.StandardScaler now converts integer input to float, and raises a warning.
Previously it rounded for dense integer input.
• sklearn.multiclass.OneVsRestClassifier now has a decision_function method. This
will return the distance of each sample from the decision boundary for each class, as long as the underlying
estimators implement the decision_function method. By Kyle Kastner.
• Better input validation, warning on unexpected shapes for y.
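A small sketch of the KernelDensity estimator referenced above; the Gaussian kernel, the bandwidth of 0.5 and the toy sample are illustrative choices, not prescribed by the release notes:

    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.RandomState(0)
    X = rng.normal(loc=0.0, scale=1.0, size=(200, 1))   # 1-D toy sample

    kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X)
    grid = np.linspace(-3, 3, 7).reshape(-1, 1)
    print(np.exp(kde.score_samples(grid)))              # density estimates on the grid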
People
List of contributors for release 0.14 by number of commits.
• 277 Gilles Louppe
• 245 Lars Buitinck
• 187 Andreas Mueller
• 124 Arnaud Joly
• 112 Jaques Grobler
• 109 Gael Varoquaux
• 107 Olivier Grisel
• 102 Noel Dawe
• 99 Kemal Eren
• 79 Joel Nothman
• 75 Jake VanderPlas
• 73 Nelle Varoquaux
• 71 Vlad Niculae
• 65 Peter Prettenhofer
• 64 Alexandre Gramfort
• 54 Mathieu Blondel
• 38 Nicolas Trésegnie
• 35 eustache
• 27 Denis Engemann

• 25 Yann N. Dauphin
• 19 Justin Vincent
• 17 Robert Layton
• 15 Doug Coleman
• 14 Michael Eickenberg
• 13 Robert Marchman
• 11 Fabian Pedregosa
• 11 Philippe Gervais
• 10 Jim Holmström
• 10 Tadej Janež
• 10 syhw
• 9 Mikhail Korobov
• 9 Steven De Gryze
• 8 sergeyf
• 7 Ben Root
• 7 Hrishikesh Huilgolkar
• 6 Kyle Kastner
• 6 Martin Luessi
• 6 Rob Speer
• 5 Federico Vaggi
• 5 Raul Garreta
• 5 Rob Zinkov
• 4 Ken Geis
• 3 A. Flaxman
• 3 Denton Cockburn
• 3 Dougal Sutherland
• 3 Ian Ozsvald
• 3 Johannes Schönberger
• 3 Robert McGibbon
• 3 Roman Sinayev
• 3 Szabo Roland
• 2 Diego Molla
• 2 Imran Haque
• 2 Jochen Wersdörfer
• 2 Sergey Karayev
• 2 Yannick Schwartz

• 2 jamestwebber
• 1 Abhijeet Kolhe
• 1 Alexander Fabisch
• 1 Bastiaan van den Berg
• 1 Benjamin Peterson
• 1 Daniel Velkov
• 1 Fazlul Shahriar
• 1 Felix Brockherde
• 1 Félix-Antoine Fortin
• 1 Harikrishnan S
• 1 Jack Hale
• 1 JakeMick
• 1 James McDermott
• 1 John Benediktsson
• 1 John Zwinck
• 1 Joshua Vredevoogd
• 1 Justin Pati
• 1 Kevin Hughes
• 1 Kyle Kelley
• 1 Matthias Ekman
• 1 Miroslav Shubernetskiy
• 1 Naoki Orii
• 1 Norbert Crombach
• 1 Rafael Cunha de Almeida
• 1 Rolando Espinoza La fuente
• 1 Seamus Abshere
• 1 Sergey Feldman
• 1 Sergio Medina
• 1 Stefano Lattarini
• 1 Steve Koch
• 1 Sturla Molden
• 1 Thomas Jarosch
• 1 Yaroslav Halchenko

1.7.14 Version 0.13.1
February 23, 2013
The 0.13.1 release only fixes some bugs and does not add any new functionality.
Changelog
• Fixed a testing error caused by the function cross_validation.train_test_split being interpreted
as a test by Yaroslav Halchenko.
• Fixed a bug in the reassignment of small clusters in the cluster.MiniBatchKMeans by Gael Varoquaux.
• Fixed default value of gamma in decomposition.KernelPCA by Lars Buitinck.
• Updated joblib to 0.7.0d by Gael Varoquaux.
• Fixed scaling of the deviance in ensemble.GradientBoostingClassifier by Peter Prettenhofer.
• Better tie-breaking in multiclass.OneVsOneClassifier by Andreas Müller.
• Other small improvements to tests and documentation.
People
List of contributors for release 0.13.1 by number of commits.
• 16 Lars Buitinck
• 12 Andreas Müller
• 8 Gael Varoquaux
• 5 Robert Marchman
• 3 Peter Prettenhofer
• 2 Hrishikesh Huilgolkar
• 1 Bastiaan van den Berg
• 1 Diego Molla
• 1 Gilles Louppe
• 1 Mathieu Blondel
• 1 Nelle Varoquaux
• 1 Rafael Cunha de Almeida
• 1 Rolando Espinoza La fuente
• 1 Vlad Niculae
• 1 Yaroslav Halchenko

1.7.15 Version 0.13
January 21, 2013

New Estimator Classes
• dummy.DummyClassifier and dummy.DummyRegressor, two data-independent predictors by Mathieu
Blondel. Useful to sanity-check your estimators. See Dummy estimators in the user guide. Multioutput support
added by Arnaud Joly.
• decomposition.FactorAnalysis, a transformer implementing the classical factor analysis, by Christian Osendorfer and Alexandre Gramfort. See Factor Analysis in the user guide.
• feature_extraction.FeatureHasher, a transformer implementing the “hashing trick” for fast,
low-memory feature extraction from string fields by Lars Buitinck and feature_extraction.text.
HashingVectorizer for text documents by Olivier Grisel See Feature hashing and Vectorizing a large
text corpus with the hashing trick for the documentation and sample usage.
• pipeline.FeatureUnion, a transformer that concatenates results of several other transformers, by Andreas Müller. See FeatureUnion: composite feature spaces in the user guide. (A short sketch follows this list.)
• random_projection.GaussianRandomProjection, random_projection.SparseRandomProjection and the function random_projection.johnson_lindenstrauss_min_dim. The first two are transformers implementing Gaussian and sparse random projection matrices, by Olivier Grisel and Arnaud Joly. See Random Projection in the user guide.
• kernel_approximation.Nystroem, a transformer for approximating arbitrary kernels by Andreas
Müller. See Nystroem Method for Kernel Approximation in the user guide.
• preprocessing.OneHotEncoder, a transformer that computes binary encodings of categorical features
by Andreas Müller. See Encoding categorical features in the user guide.
• linear_model.PassiveAggressiveClassifier and linear_model.PassiveAggressiveRegressor, predictors implementing an efficient stochastic optimization for linear models by Rob Zinkov and Mathieu Blondel. See Passive Aggressive Algorithms in the user guide.
• ensemble.RandomTreesEmbedding, a transformer for creating high-dimensional sparse representations
using ensembles of totally random trees by Andreas Müller. See Totally Random Trees Embedding in the user
guide.
• manifold.SpectralEmbedding and function manifold.spectral_embedding, implementing
the “laplacian eigenmaps” transformation for non-linear dimensionality reduction by Wei Li. See Spectral
Embedding in the user guide.
• isotonic.IsotonicRegression by Fabian Pedregosa, Alexandre Gramfort and Nelle Varoquaux.
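As a hedged sketch of the pipeline.FeatureUnion transformer listed above, the following combines two arbitrary feature extractors on the iris data; the component choices (PCA and SelectKBest) are illustrative only:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest
    from sklearn.pipeline import FeatureUnion

    iris = load_iris()
    X, y = iris.data, iris.target

    combined = FeatureUnion([
        ("pca", PCA(n_components=2)),    # two principal components
        ("kbest", SelectKBest(k=1)),     # one univariately selected feature
    ])
    X_new = combined.fit(X, y).transform(X)
    print(X_new.shape)                   # (150, 3): the feature blocks are concatenated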
Changelog
• metrics.zero_one_loss (formerly metrics.zero_one) now has an option for normalized output that reports the fraction of misclassifications rather than the raw number of misclassifications. By Kyle Beauchamp. (A short sketch follows this list.)
• tree.DecisionTreeClassifier and all derived ensemble models now support sample weighting, by
Noel Dawe and Gilles Louppe.
• Speedup improvement when using bootstrap samples in forests of randomized trees, by Peter Prettenhofer and
Gilles Louppe.
• Partial dependence plots for Gradient Tree Boosting in ensemble.partial_dependence.
partial_dependence by Peter Prettenhofer. See Partial Dependence Plots for an example.
• The table of contents on the website has now been made expandable by Jaques Grobler.
• feature_selection.SelectPercentile now breaks ties deterministically instead of returning all
equally ranked features.

• feature_selection.SelectKBest and feature_selection.SelectPercentile are more
numerically stable since they use scores, rather than p-values, to rank results. This means that they might
sometimes select different features than they did previously.
• Ridge regression and ridge classification fitting with sparse_cg solver no longer has quadratic memory complexity, by Lars Buitinck and Fabian Pedregosa.
• Ridge regression and ridge classification now support a new fast solver called lsqr, by Mathieu Blondel.
• Speed up of metrics.precision_recall_curve by Conrad Lee.
• Added support for reading/writing svmlight files with pairwise preference attribute (qid in svmlight file format)
in datasets.dump_svmlight_file and datasets.load_svmlight_file by Fabian Pedregosa.
• Faster and more robust metrics.confusion_matrix and Clustering performance evaluation by Wei Li.
• cross_validation.cross_val_score now works with precomputed kernels and affinity matrices, by
Andreas Müller.
• LARS algorithm made more numerically stable with heuristics to drop regressors too correlated as well as to
stop the path when numerical noise becomes predominant, by Gael Varoquaux.
• Faster implementation of metrics.precision_recall_curve by Conrad Lee.
• New kernel metrics.chi2_kernel by Andreas Müller, often used in computer vision applications.
• Fix of longstanding bug in naive_bayes.BernoulliNB fixed by Shaun Jackman.
• Implemented predict_proba in multiclass.OneVsRestClassifier, by Andrew Winterman.
• Improve consistency in gradient boosting: estimators ensemble.GradientBoostingRegressor and
ensemble.GradientBoostingClassifier use the estimator tree.DecisionTreeRegressor
instead of the tree._tree.Tree data structure by Arnaud Joly.
• Fixed a floating point exception in the decision trees module, by Seberg.
• Fix metrics.roc_curve fails when y_true has only one class by Wei Li.
• Add the metrics.mean_absolute_error function which computes the mean absolute error. The
metrics.mean_squared_error, metrics.mean_absolute_error and metrics.r2_score
metrics support multioutput by Arnaud Joly.
• Fixed class_weight support in svm.LinearSVC and linear_model.LogisticRegression by Andreas Müller. The meaning of class_weight was reversed, as erroneously a higher weight meant fewer positives of a given class in earlier releases.
• Improve narrative documentation and consistency in sklearn.metrics for regression and classification
metrics by Arnaud Joly.
• Fixed a bug in sklearn.svm.SVC when using csr-matrices with unsorted indices by Xinfan Meng and Andreas Müller.
• MiniBatchKMeans: Add random reassignment of cluster centers with few observations attached to them, by Gael Varoquaux.
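A tiny sketch of the normalized versus raw zero_one_loss output described at the top of this changelog; the labels below are invented for illustration:

    from sklearn.metrics import zero_one_loss

    y_true = [0, 1, 1, 0]
    y_pred = [0, 0, 1, 0]

    print(zero_one_loss(y_true, y_pred))                   # 0.25, fraction misclassified (default)
    print(zero_one_loss(y_true, y_pred, normalize=False))  # 1, raw number of misclassifications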
API changes summary
• Renamed all occurrences of n_atoms to n_components for consistency. This applies to decomposition.DictionaryLearning, decomposition.MiniBatchDictionaryLearning, decomposition.dict_learning, decomposition.dict_learning_online.
• Renamed all occurrences of max_iters to max_iter for consistency. This applies to semi_supervised.LabelPropagation and semi_supervised.label_propagation.LabelSpreading.
• Renamed all occurrences of learn_rate to learning_rate for consistency in ensemble.
BaseGradientBoosting and ensemble.GradientBoostingRegressor.
• The module sklearn.linear_model.sparse is gone. Sparse matrix support was already integrated into
the “regular” linear models.
• sklearn.metrics.mean_square_error, which incorrectly returned the accumulated error, was removed. Use mean_squared_error instead.
• Passing class_weight parameters to fit methods is no longer supported. Pass them to estimator constructors instead.
• GMMs no longer have decode and rvs methods. Use the score, predict or sample methods instead.
• The solver fit option in Ridge regression and classification is now deprecated and will be removed in v0.14.
Use the constructor option instead.
• feature_extraction.text.DictVectorizer now returns sparse matrices in the CSR format, instead of COO.
• Renamed k in cross_validation.KFold and cross_validation.StratifiedKFold to
n_folds, renamed n_bootstraps to n_iter in cross_validation.Bootstrap.
• Renamed all occurrences of n_iterations to n_iter for consistency.
This applies to
cross_validation.ShuffleSplit,
cross_validation.StratifiedShuffleSplit,
utils.randomized_range_finder and utils.randomized_svd.
• Replaced rho in linear_model.ElasticNet and linear_model.SGDClassifier by
l1_ratio. The rho parameter had different meanings; l1_ratio was introduced to avoid confusion. It has the same meaning as previously rho in linear_model.ElasticNet and (1-rho) in
linear_model.SGDClassifier.
• linear_model.LassoLars and linear_model.Lars now store a list of paths in the case of multiple
targets, rather than an array of paths.
• The attribute gmm of hmm.GMMHMM was renamed to gmm_ to adhere more strictly with the API.
• cluster.spectral_embedding was moved to manifold.spectral_embedding.
• Renamed eig_tol in manifold.spectral_embedding, cluster.SpectralClustering to
eigen_tol, renamed mode to eigen_solver.
• Renamed mode in manifold.spectral_embedding and cluster.SpectralClustering to
eigen_solver.
• classes_ and n_classes_ attributes of tree.DecisionTreeClassifier and all derived ensemble
models are now flat in case of single output problems and nested in case of multi-output problems.
• The estimators_ attribute of ensemble.gradient_boosting.GradientBoostingRegressor and ensemble.gradient_boosting.GradientBoostingClassifier is now an array of tree.DecisionTreeRegressor.
• Renamed chunk_size to batch_size in decomposition.MiniBatchDictionaryLearning
and decomposition.MiniBatchSparsePCA for consistency.
• svm.SVC and svm.NuSVC now provide a classes_ attribute and support arbitrary dtypes for labels y.
Also, the dtype returned by predict now reflects the dtype of y during fit (used to be np.float).

• Changed default test_size in cross_validation.train_test_split to None, added possibility to infer test_size from train_size in cross_validation.ShuffleSplit and
cross_validation.StratifiedShuffleSplit.
• Renamed function sklearn.metrics.zero_one to sklearn.metrics.zero_one_loss. Be
aware that the default behavior in sklearn.metrics.zero_one_loss is different from sklearn.
metrics.zero_one: normalize=False is changed to normalize=True.
• Renamed function metrics.zero_one_score to metrics.accuracy_score.
• datasets.make_circles now has the same number of inner and outer points.
• In the Naive Bayes classifiers, the class_prior parameter was moved from fit to __init__.
People
List of contributors for release 0.13 by number of commits.
• 364 Andreas Müller
• 143 Arnaud Joly
• 137 Peter Prettenhofer
• 131 Gael Varoquaux
• 117 Mathieu Blondel
• 108 Lars Buitinck
• 106 Wei Li
• 101 Olivier Grisel
• 65 Vlad Niculae
• 54 Gilles Louppe
• 40 Jaques Grobler
• 38 Alexandre Gramfort
• 30 Rob Zinkov
• 19 Aymeric Masurelle
• 18 Andrew Winterman
• 17 Fabian Pedregosa
• 17 Nelle Varoquaux
• 16 Christian Osendorfer
• 14 Daniel Nouri
• 13 Virgile Fritsch
• 13 syhw
• 12 Satrajit Ghosh
• 10 Corey Lynch
• 10 Kyle Beauchamp
• 9 Brian Cheung

• 9 Immanuel Bayer
• 9 mr.Shu
• 8 Conrad Lee
• 8 James Bergstra
• 7 Tadej Janež
• 6 Brian Cajes
• 6 Jake Vanderplas
• 6 Michael
• 6 Noel Dawe
• 6 Tiago Nunes
• 6 cow
• 5 Anze
• 5 Shiqiao Du
• 4 Christian Jauvin
• 4 Jacques Kvam
• 4 Richard T. Guy
• 4 Robert Layton
• 3 Alexandre Abraham
• 3 Doug Coleman
• 3 Scott Dickerson
• 2 ApproximateIdentity
• 2 John Benediktsson
• 2 Mark Veronda
• 2 Matti Lyra
• 2 Mikhail Korobov
• 2 Xinfan Meng
• 1 Alejandro Weinstein
• 1 Alexandre Passos
• 1 Christoph Deil
• 1 Eugene Nizhibitsky
• 1 Kenneth C. Arnold
• 1 Luis Pedro Coelho
• 1 Miroslav Batchkarov
• 1 Pavel
• 1 Sebastian Berg
• 1 Shaun Jackman

• 1 Subhodeep Moitra
• 1 bob
• 1 dengemann
• 1 emanuele
• 1 x006

1.7.16 Version 0.12.1
October 8, 2012
The 0.12.1 release is a bug-fix release with no additional features; it consists entirely of bug fixes.
Changelog
• Improved numerical stability in spectral embedding by Gael Varoquaux
• Doctest under windows 64bit by Gael Varoquaux
• Documentation fixes for elastic net by Andreas Müller and Alexandre Gramfort
• Proper behavior with fortran-ordered NumPy arrays by Gael Varoquaux
• Make GridSearchCV work with non-CSR sparse matrix by Lars Buitinck
• Fix parallel computing in MDS by Gael Varoquaux
• Fix Unicode support in count vectorizer by Andreas Müller
• Fix MinCovDet breaking with X.shape = (3, 1) by Virgile Fritsch
• Fix clone of SGD objects by Peter Prettenhofer
• Stabilize GMM by Virgile Fritsch
People
• 14 Peter Prettenhofer
• 12 Gael Varoquaux
• 10 Andreas Müller
• 5 Lars Buitinck
• 3 Virgile Fritsch
• 1 Alexandre Gramfort
• 1 Gilles Louppe
• 1 Mathieu Blondel

1.7.17 Version 0.12
September 4, 2012

Changelog
• Various speed improvements of the decision trees module, by Gilles Louppe.
• ensemble.GradientBoostingRegressor and ensemble.GradientBoostingClassifier
now support feature subsampling via the max_features argument, by Peter Prettenhofer.
• Added Huber and Quantile loss functions to ensemble.GradientBoostingRegressor, by Peter Prettenhofer.
• Decision trees and forests of randomized trees now support multi-output classification and regression problems,
by Gilles Louppe.
• Added preprocessing.LabelEncoder, a simple utility class to normalize labels or transform non-numerical labels, by Mathieu Blondel. (A short sketch follows this list.)
• Added the epsilon-insensitive loss and the ability to make probabilistic predictions with the modified huber loss
in Stochastic Gradient Descent, by Mathieu Blondel.
• Added Multi-dimensional Scaling (MDS), by Nelle Varoquaux.
• SVMlight file format loader now detects compressed (gzip/bzip2) files and decompresses them on the fly, by
Lars Buitinck.
• SVMlight file format serializer now preserves double precision floating point values, by Olivier Grisel.
• A common testing framework for all estimators was added, by Andreas Müller.
• Understandable error messages for estimators that do not accept sparse input by Gael Varoquaux
• Speedups in hierarchical clustering by Gael Varoquaux. In particular building the tree now supports early
stopping. This is useful when the number of clusters is not small compared to the number of samples.
• Add MultiTaskLasso and MultiTaskElasticNet for joint feature selection, by Alexandre Gramfort.
• Added metrics.auc_score and metrics.average_precision_score convenience functions by
Andreas Müller.
• Improved sparse matrix support in the Feature selection module by Andreas Müller.
• New word boundaries-aware character n-gram analyzer for the Text feature extraction module by @kernc.
• Fixed bug in spectral clustering that led to single point clusters by Andreas Müller.
• In feature_extraction.text.CountVectorizer, added an option to ignore infrequent words,
min_df by Andreas Müller.
• Add support for multiple targets in some linear models (ElasticNet, Lasso and OrthogonalMatchingPursuit) by
Vlad Niculae and Alexandre Gramfort.
• Fixes in decomposition.ProbabilisticPCA score function by Wei Li.
• Fixed feature importance computation in Gradient Tree Boosting.
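A brief sketch of the preprocessing.LabelEncoder utility mentioned above; the city names are an arbitrary example, not from the release notes:

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    codes = le.fit_transform(["paris", "tokyo", "paris", "amsterdam"])
    print(le.classes_)                  # ['amsterdam' 'paris' 'tokyo']
    print(codes)                        # [1 2 1 0]
    print(le.inverse_transform(codes))  # back to the original strings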
API changes summary
• The old scikits.learn package has disappeared; all code should import from sklearn instead, which
was introduced in 0.9.
• In metrics.roc_curve, the thresholds array is now returned with its order reversed, in order to keep it consistent with the order of the returned fpr and tpr.
• In hmm objects, like hmm.GaussianHMM, hmm.MultinomialHMM, etc., all parameters must be passed to
the object when initialising it and not through fit. Now fit will only accept the data as an input parameter.

• For all SVM classes, a faulty behavior of gamma was fixed. Previously, the default gamma value was only
computed the first time fit was called and then stored. It is now recalculated on every call to fit.
• All Base classes are now abstract meta classes so that they can not be instantiated.
• cluster.ward_tree now also returns the parent array. This is necessary for early-stopping in which case
the tree is not completely built.
• In feature_extraction.text.CountVectorizer the parameters min_n and max_n were joined to the parameter ngram_range to enable grid-searching both at once.
• In feature_extraction.text.CountVectorizer, words that appear only in one document are now
ignored by default. To reproduce the previous behavior, set min_df=1.
• Fixed API inconsistency: linear_model.SGDClassifier.predict_proba now returns 2d array
when fit on two classes.
• Fixed API inconsistency: discriminant_analysis.QuadraticDiscriminantAnalysis.decision_function and discriminant_analysis.LinearDiscriminantAnalysis.decision_function now return 1d arrays when fit on two classes.
• Grid of alphas used for fitting linear_model.LassoCV and linear_model.ElasticNetCV is now
stored in the attribute alphas_ rather than overriding the init parameter alphas.
• Linear models when alpha is estimated by cross-validation store the estimated value in the alpha_ attribute
rather than just alpha or best_alpha.
• ensemble.GradientBoostingClassifier now supports ensemble.GradientBoostingClassifier.staged_predict_proba and ensemble.GradientBoostingClassifier.staged_predict.
• svm.sparse.SVC and other sparse SVM classes are now deprecated. All classes in the Support Vector Machines module now automatically select the sparse or dense representation based on the input.
• All clustering algorithms now interpret the array X given to fit as input data, in particular cluster.
SpectralClustering and cluster.AffinityPropagation which previously expected affinity matrices.
• For clustering algorithms that take the desired number of clusters as a parameter, this parameter is now called
n_clusters.
People
• 267 Andreas Müller
• 94 Gilles Louppe
• 89 Gael Varoquaux
• 79 Peter Prettenhofer
• 60 Mathieu Blondel
• 57 Alexandre Gramfort
• 52 Vlad Niculae
• 45 Lars Buitinck
• 44 Nelle Varoquaux
• 37 Jaques Grobler
• 30 Alexis Mignon

• 30 Immanuel Bayer
• 27 Olivier Grisel
• 16 Subhodeep Moitra
• 13 Yannick Schwartz
• 12 @kernc
• 11 Virgile Fritsch
• 9 Daniel Duckworth
• 9 Fabian Pedregosa
• 9 Robert Layton
• 8 John Benediktsson
• 7 Marko Burjek
• 5 Nicolas Pinto
• 4 Alexandre Abraham
• 4 Jake Vanderplas
• 3 Brian Holt
• 3 Edouard Duchesnay
• 3 Florian Hoenig
• 3 flyingimmidev
• 2 Francois Savard
• 2 Hannes Schulz
• 2 Peter Welinder
• 2 Yaroslav Halchenko
• 2 Wei Li
• 1 Alex Companioni
• 1 Brandyn A. White
• 1 Bussonnier Matthias
• 1 Charles-Pierre Astolfi
• 1 Dan O’Huiginn
• 1 David Cournapeau
• 1 Keith Goodman
• 1 Ludwig Schwardt
• 1 Olivier Hervieu
• 1 Sergio Medina
• 1 Shiqiao Du
• 1 Tim Sheerman-Chase
• 1 buguen

1.7.18 Version 0.11
May 7, 2012
Changelog
Highlights
• Gradient boosted regression trees (Gradient Tree Boosting) for classification and regression by Peter Prettenhofer and Scott White.
• Simple dict-based feature loader with support for categorical variables (feature_extraction.
DictVectorizer) by Lars Buitinck.
• Added Matthews correlation coefficient (metrics.matthews_corrcoef) and added macro and micro average options to metrics.precision_score, metrics.recall_score and metrics.f1_score
by Satrajit Ghosh.
• Out of Bag Estimates of generalization error for Ensemble methods by Andreas Müller.
• Randomized sparse linear models for feature selection, by Alexandre Gramfort and Gael Varoquaux
• Label Propagation for semi-supervised learning, by Clay Woolam. Note the semi-supervised API is still work
in progress, and may change.
• Added BIC/AIC model selection to classical Gaussian mixture models and unified the API with the remainder
of scikit-learn, by Bertrand Thirion
• Added sklearn.cross_validation.StratifiedShuffleSplit, which is a sklearn.
cross_validation.ShuffleSplit with balanced splits, by Yannick Schwartz.
• sklearn.neighbors.NearestCentroid classifier added, along with a shrink_threshold parameter, which implements shrunken centroid classification, by Robert Layton.
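A minimal sketch of the shrunken-centroid classification added above; the toy data and the shrink_threshold value of 0.2 are illustrative assumptions, not part of the release notes:

    import numpy as np
    from sklearn.neighbors import NearestCentroid

    X = np.array([[-2, -1], [-1, -1], [-1, -2], [1, 1], [2, 1], [1, 2]])
    y = np.array([0, 0, 0, 1, 1, 1])

    # shrink_threshold shrinks each class centroid towards the overall mean
    # before classification (nearest shrunken centroid).
    clf = NearestCentroid(shrink_threshold=0.2).fit(X, y)
    print(clf.predict([[-0.8, -1.0]]))   # [0]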
Other changes
• Merged dense and sparse implementations of Stochastic Gradient Descent module and exposed utility extension
types for sequential datasets seq_dataset and weight vectors weight_vector by Peter Prettenhofer.
• Added partial_fit (support for online/minibatch learning) and warm_start to the Stochastic Gradient Descent module by Mathieu Blondel.
• Dense and sparse implementations of Support Vector Machines classes and linear_model.
LogisticRegression merged by Lars Buitinck.
• Regressors can now be used as base estimator in the Multiclass and multilabel algorithms module by Mathieu
Blondel.
• Added n_jobs option to metrics.pairwise.pairwise_distances and metrics.pairwise.
pairwise_kernels for parallel computation, by Mathieu Blondel.
• K-means can now be run in parallel, using the n_jobs argument to either K-means or KMeans, by Robert
Layton.
• Improved Cross-validation: evaluating estimator performance and Tuning the hyper-parameters of an estimator documentation and introduced the new cross_validation.train_test_split helper function by
Olivier Grisel

• svm.SVC members coef_ and intercept_ changed sign for consistency with decision_function;
for kernel==linear, coef_ was fixed in the one-vs-one case, by Andreas Müller.
• Performance improvements to efficient leave-one-out cross-validated Ridge regression, esp. for the n_samples > n_features case, in linear_model.RidgeCV, by Reuben Fletcher-Costin.
• Refactoring and simplification of the Text feature extraction API and fixed a bug that caused possible negative
IDF, by Olivier Grisel.
• Beam pruning option in _BaseHMM module has been removed since it is difficult to Cythonize. If you are
interested in contributing a Cython version, you can use the python version in the git history as a reference.
• Classes in Nearest Neighbors now support arbitrary Minkowski metric for nearest neighbors searches. The
metric can be specified by argument p.
API changes summary
• covariance.EllipticEnvelop is now deprecated - Please use covariance.EllipticEnvelope
instead.
• NeighborsClassifier and NeighborsRegressor are gone in the module Nearest Neighbors. Use
the classes KNeighborsClassifier, RadiusNeighborsClassifier, KNeighborsRegressor
and/or RadiusNeighborsRegressor instead.
• Sparse classes in the Stochastic Gradient Descent module are now deprecated.
• In mixture.GMM, mixture.DPGMM and mixture.VBGMM, parameters must be passed to an object when initialising it and not through fit. Now fit will only accept the data as an input parameter.
• methods rvs and decode in GMM module are now deprecated. sample and score or predict should be
used instead.
• attributes _scores and _pvalues in univariate feature selection objects are now deprecated. scores_ or pvalues_ should be used instead.
• In LogisticRegression, LinearSVC, SVC and NuSVC, the class_weight parameter is now an initialization parameter, not a parameter to fit. This makes grid searches over this parameter possible.
• LFW data is now always shape (n_samples, n_features) to be consistent with the Olivetti faces
dataset. Use images and pairs attribute to access the natural images shapes instead.
• In svm.LinearSVC, the meaning of the multi_class parameter changed. Options now are 'ovr' and
'crammer_singer', with 'ovr' being the default. This does not change the default behavior but hopefully
is less confusing.
• Class feature_extraction.text.Vectorizer is deprecated and replaced by feature_extraction.text.TfidfVectorizer.
• The preprocessor / analyzer nested structure for text feature extraction has been removed. All those features are now directly passed as flat constructor arguments to feature_extraction.text.TfidfVectorizer and feature_extraction.text.CountVectorizer; in particular, the following parameters are now used:
• analyzer can be 'word' or 'char' to switch the default analysis scheme, or use a specific python callable
(as previously).
• tokenizer and preprocessor have been introduced to make it still possible to customize those steps with
the new API.
• input explicitly controls how to interpret the sequence passed to fit and predict: filenames, file objects or direct (byte or Unicode) strings.


• charset decoding is explicit and strict by default.
• the vocabulary, fitted or not is now stored in the vocabulary_ attribute to be consistent with the project
conventions.
• Class feature_extraction.text.TfidfVectorizer now derives directly from feature_extraction.text.CountVectorizer to make grid search trivial.
• Method rvs in the _BaseHMM module is now deprecated; sample should be used instead.
• The beam pruning option in the _BaseHMM module has been removed since it is difficult to Cythonize. If you are interested, the Python version is available in the git history.
• The SVMlight format loader now supports files with both zero-based and one-based column indices, since both
occur “in the wild”.
• Arguments in class ShuffleSplit are now consistent with StratifiedShuffleSplit. Arguments
test_fraction and train_fraction are deprecated and renamed to test_size and train_size
and can accept both float and int.
• Arguments in class Bootstrap are now consistent with StratifiedShuffleSplit. Arguments
n_test and n_train are deprecated and renamed to test_size and train_size and can accept both
float and int.
• Argument p added to classes in Nearest Neighbors to specify an arbitrary Minkowski metric for nearest neighbors searches.
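To make the flat constructor arguments listed above concrete, here is a minimal sketch (not part of the original changelog) using the current sklearn.feature_extraction.text API; the toy corpus and the specific argument values are illustrative only.

from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative toy corpus, not from the changelog.
corpus = ["the cat sat on the mat", "the dog ate my homework"]

vectorizer = TfidfVectorizer(
    input='content',    # how to interpret what is passed to fit/transform
    analyzer='word',    # or 'char', or any Python callable
    preprocessor=None,  # hook to customize string preprocessing
    tokenizer=None,     # hook to customize tokenization
)
X = vectorizer.fit_transform(corpus)

# The fitted vocabulary is stored with a trailing underscore, per project conventions.
print(sorted(vectorizer.vocabulary_))
print(X.shape)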
People
• 282 Andreas Müller
• 239 Peter Prettenhofer
• 198 Gael Varoquaux
• 129 Olivier Grisel
• 114 Mathieu Blondel
• 103 Clay Woolam
• 96 Lars Buitinck
• 88 Jaques Grobler
• 82 Alexandre Gramfort
• 50 Bertrand Thirion
• 42 Robert Layton
• 28 flyingimmidev
• 26 Jake Vanderplas
• 26 Shiqiao Du
• 21 Satrajit Ghosh
• 17 David Marek
• 17 Gilles Louppe
• 14 Vlad Niculae
• 11 Yannick Schwartz


• 10 Fabian Pedregosa
• 9 fcostin
• 7 Nick Wilson
• 5 Adrien Gaidon
• 5 Nicolas Pinto
• 4 David Warde-Farley
• 5 Nelle Varoquaux
• 5 Emmanuelle Gouillart
• 3 Joonas Sillanpää
• 3 Paolo Losi
• 2 Charles McCarthy
• 2 Roy Hyunjin Han
• 2 Scott White
• 2 ibayer
• 1 Brandyn White
• 1 Carlos Scheidegger
• 1 Claire Revillet
• 1 Conrad Lee
• 1 Edouard Duchesnay
• 1 Jan Hendrik Metzen
• 1 Meng Xinfan
• 1 Rob Zinkov
• 1 Shiqiao
• 1 Udi Weinsberg
• 1 Virgile Fritsch
• 1 Xinfan Meng
• 1 Yaroslav Halchenko
• 1 jansoe
• 1 Leon Palafox

1.7.19 Version 0.10
January 11, 2012


Changelog
• Python 2.5 compatibility was dropped; the minimum Python version needed to use scikit-learn is now 2.6.
• Sparse inverse covariance estimation using the graph Lasso, with associated cross-validated estimator, by Gael
Varoquaux
• New Tree module by Brian Holt, Peter Prettenhofer, Satrajit Ghosh and Gilles Louppe. The module comes with
complete documentation and examples.
• Fixed a bug in the RFE module by Gilles Louppe (issue #378).
• Fixed a memory leak in Support Vector Machines module by Brian Holt (issue #367).
• Faster tests by Fabian Pedregosa and others.
• Silhouette Coefficient cluster analysis evaluation metric added as sklearn.metrics.silhouette_score by Robert Layton.

• Fixed a bug in K-means in the handling of the n_init parameter: the clustering algorithm used to be run
n_init times but the last solution was retained instead of the best solution by Olivier Grisel.
• Minor refactoring in the Stochastic Gradient Descent module; consolidated dense and sparse predict methods; enhanced test time performance by converting model parameters to Fortran-style arrays after fitting (multiclass only).
• Adjusted Mutual Information metric added as sklearn.metrics.adjusted_mutual_info_score
by Robert Layton.
• Models like SVC/SVR/LinearSVC/LogisticRegression from libsvm/liblinear now support scaling of C regularization parameter by the number of samples by Alexandre Gramfort.
• New Ensemble Methods module by Gilles Louppe and Brian Holt. The module comes with the random forest
algorithm and the extra-trees method, along with documentation and examples.
• Novelty and Outlier Detection: outlier and novelty detection, by Virgile Fritsch.
• Kernel Approximation: a transform implementing kernel approximation for fast SGD on non-linear kernels by
Andreas Müller.
• Fixed a bug due to atom swapping in Orthogonal Matching Pursuit (OMP) by Vlad Niculae.
• Sparse coding with a precomputed dictionary by Vlad Niculae.
• Mini Batch K-Means performance improvements by Olivier Grisel.
• K-means support for sparse matrices by Mathieu Blondel.
• Improved documentation for developers and for the sklearn.utils module, by Jake Vanderplas.
• Vectorized 20newsgroups dataset loader (sklearn.datasets.fetch_20newsgroups_vectorized)
by Mathieu Blondel.
• Multiclass and multilabel algorithms by Lars Buitinck.
• Utilities for fast computation of mean and variance for sparse matrices by Mathieu Blondel.
• Make sklearn.preprocessing.scale and sklearn.preprocessing.Scaler work on sparse
matrices by Olivier Grisel
• Feature importances using decision trees and/or forest of trees, by Gilles Louppe.
• Parallel implementation of forests of randomized trees by Gilles Louppe.
• sklearn.cross_validation.ShuffleSplit can subsample the train sets as well as the test sets by
Olivier Grisel.


• Errors in the build of the documentation fixed by Andreas Müller.
API changes summary
Here are the code migration instructions when upgrading from scikit-learn version 0.9:
• Some estimators that may overwrite their inputs to save memory previously had overwrite_ parameters;
these have been replaced with copy_ parameters with exactly the opposite meaning.
This particularly affects some of the estimators in linear_model. The default behavior is still to copy
everything passed in.
• The SVMlight dataset loader sklearn.datasets.load_svmlight_file no longer supports loading
two files at once; use load_svmlight_files instead. Also, the (unused) buffer_mb parameter is gone.
• Sparse estimators in the Stochastic Gradient Descent module use dense parameter vector coef_ instead of
sparse_coef_. This significantly improves test time performance.
• The Covariance estimation module now has a robust estimator of covariance, the Minimum Covariance Determinant estimator.
• Cluster evaluation metrics in metrics.cluster have been refactored but the changes are backwards compatible. They have been moved to metrics.cluster.supervised, along with metrics.cluster.unsupervised, which contains the Silhouette Coefficient.
• The permutation_test_score function now behaves the same way as cross_val_score (i.e. uses
the mean score across the folds.)
• Cross Validation generators now use integer indices (indices=True) by default instead of boolean masks.
This makes it more intuitive to use with sparse matrix data.
• The functions used for sparse coding, sparse_encode and sparse_encode_parallel have been combined into sklearn.decomposition.sparse_encode, and the shapes of the arrays have been transposed for consistency with the matrix factorization setting, as opposed to the regression setting.
• Fixed an off-by-one error in the SVMlight/LibSVM file format handling; files generated using sklearn.
datasets.dump_svmlight_file should be re-generated. (They should continue to work, but accidentally had one extra column of zeros prepended.)
• BaseDictionaryLearning class replaced by SparseCodingMixin.
• sklearn.utils.extmath.fast_svd has been renamed sklearn.utils.extmath.randomized_svd and the default oversampling is now fixed to 10 additional random vectors instead of doubling the number of components to extract. The new behavior follows the reference paper.
People
The following people contributed to scikit-learn since last release:
• 246 Andreas Müller
• 242 Olivier Grisel
• 220 Gilles Louppe
• 183 Brian Holt
• 166 Gael Varoquaux
• 144 Lars Buitinck
• 73 Vlad Niculae


• 65 Peter Prettenhofer
• 64 Fabian Pedregosa
• 60 Robert Layton
• 55 Mathieu Blondel
• 52 Jake Vanderplas
• 44 Noel Dawe
• 38 Alexandre Gramfort
• 24 Virgile Fritsch
• 23 Satrajit Ghosh
• 3 Jan Hendrik Metzen
• 3 Kenneth C. Arnold
• 3 Shiqiao Du
• 3 Tim Sheerman-Chase
• 3 Yaroslav Halchenko
• 2 Bala Subrahmanyam Varanasi
• 2 DraXus
• 2 Michael Eickenberg
• 1 Bogdan Trach
• 1 Félix-Antoine Fortin
• 1 Juan Manuel Caicedo Carvajal
• 1 Nelle Varoquaux
• 1 Nicolas Pinto
• 1 Tiziano Zito
• 1 Xinfan Meng

1.7.20 Version 0.9
September 21, 2011
scikit-learn 0.9 was released in September 2011, three months after the 0.8 release, and includes the new modules Manifold learning and The Dirichlet Process, as well as several new algorithms and documentation improvements.
This release also includes the dictionary-learning work developed by Vlad Niculae as part of the Google Summer of
Code program.


Changelog
• New Manifold learning module by Jake Vanderplas and Fabian Pedregosa.
• New Dirichlet Process Gaussian Mixture Model by Alexandre Passos
• Nearest Neighbors module refactoring by Jake Vanderplas : general refactoring, support for sparse matrices in
input, speed and documentation improvements. See the next section for a full list of API changes.
• Improvements on the Feature selection module by Gilles Louppe : refactoring of the RFE classes, documentation rewrite, increased efficiency and minor API changes.
• Sparse principal components analysis (SparsePCA and MiniBatchSparsePCA) by Vlad Niculae, Gael Varoquaux and Alexandre Gramfort
• Printing an estimator now behaves independently of architectures and Python version thanks to Jean Kossaifi.
• Loader for libsvm/svmlight format by Mathieu Blondel and Lars Buitinck
• Documentation improvements: thumbnails in example gallery by Fabian Pedregosa.
• Important bugfixes in Support Vector Machines module (segfaults, bad performance) by Fabian Pedregosa.
• Added Multinomial Naive Bayes and Bernoulli Naive Bayes by Lars Buitinck
• Text feature extraction optimizations by Lars Buitinck
• Chi-Square feature selection (feature_selection.univariate_selection.chi2) by Lars Buitinck.
• Sample generators module refactoring by Gilles Louppe
• Multiclass and multilabel algorithms by Mathieu Blondel
• Ball tree rewrite by Jake Vanderplas


• Implementation of DBSCAN algorithm by Robert Layton
• Kmeans predict and transform by Robert Layton
• Preprocessing module refactoring by Olivier Grisel
• Faster mean shift by Conrad Lee
• New Bootstrap, Random permutations cross-validation a.k.a. Shuffle & Split and various other improvements in cross validation schemes by Olivier Grisel and Gael Varoquaux
• Adjusted Rand index and V-Measure clustering evaluation metrics by Olivier Grisel
• Added Orthogonal Matching Pursuit by Vlad Niculae
• Added 2D-patch extractor utilities in the Feature extraction module by Vlad Niculae
• Implementation of linear_model.LassoLarsCV (cross-validated Lasso solver using the Lars algorithm)
and linear_model.LassoLarsIC (BIC/AIC model selection in Lars) by Gael Varoquaux and Alexandre
Gramfort
• Scalability improvements to metrics.roc_curve by Olivier Hervieu
• Distance helper functions metrics.pairwise.pairwise_distances and metrics.pairwise.
pairwise_kernels by Robert Layton
• Mini-Batch K-Means by Nelle Varoquaux and Peter Prettenhofer.
• Downloading datasets from the mldata.org repository utilities by Pietro Berkes.
• The Olivetti faces dataset by David Warde-Farley.
API changes summary
Here are the code migration instructions when upgrading from scikit-learn version 0.8:
• The scikits.learn package was renamed sklearn. There is still a scikits.learn package alias for
backward compatibility.
Third-party projects with a dependency on scikit-learn 0.9+ should upgrade their codebase. For instance, under
Linux / MacOSX just run (make a backup first!):
find -name "*.py" | xargs sed -i 's/\bscikits.learn\b/sklearn/g'

• Estimators no longer accept model parameters as fit arguments: instead all parameters must only be passed as constructor arguments or using the now public set_params method inherited from base.BaseEstimator.
Some estimators can still accept keyword arguments on fit, but this is restricted to data-dependent values (e.g. a Gram matrix or an affinity matrix that is precomputed from the X data matrix).
• The cross_val package has been renamed to cross_validation although there is also a cross_val
package alias in place for backward compatibility.
Third-party projects with a dependency on scikit-learn 0.9+ should upgrade their codebase. For instance, under
Linux / MacOSX just run (make a backup first!):
find -name "*.py" | xargs sed -i 's/\bcross_val\b/cross_validation/g'

• The score_func argument of the sklearn.cross_validation.cross_val_score function is
now expected to accept y_test and y_predicted as only arguments for classification and regression tasks
or X_test for unsupervised estimators.


• gamma parameter for support vector machine algorithms is set to 1 / n_features by default, instead of 1
/ n_samples.
• The sklearn.hmm has been marked as orphaned: it will be removed from scikit-learn in version 0.11 unless
someone steps up to contribute documentation, examples and fix lurking numerical stability issues.
• sklearn.neighbors has been made into a submodule. The two previously available estimators,
NeighborsClassifier and NeighborsRegressor have been marked as deprecated. Their functionality has been divided among five new classes: NearestNeighbors for unsupervised neighbors searches,
KNeighborsClassifier & RadiusNeighborsClassifier for supervised classification problems,
and KNeighborsRegressor & RadiusNeighborsRegressor for supervised regression problems.
• sklearn.ball_tree.BallTree has been moved to sklearn.neighbors.BallTree. Using the
former will generate a warning.
• sklearn.linear_model.LARS() and related classes (LassoLARS, LassoLARSCV, etc.) have been renamed to sklearn.linear_model.Lars().
• All distance metrics and kernels in sklearn.metrics.pairwise now have a Y parameter, which by default is None. If not given, the result is the pairwise distance (or kernel similarity) between the samples in X. If given, the result is the pairwise distance (or kernel similarity) between samples in X and Y.
• sklearn.metrics.pairwise.l1_distance is now called manhattan_distance, and by default
returns the pairwise distance. For the component wise distance, set the parameter sum_over_features to
False.
Backward compatibility package aliases and other deprecated classes and functions will be removed in version 0.11.
People
38 people contributed to this release.
• 387 Vlad Niculae
• 320 Olivier Grisel
• 192 Lars Buitinck
• 179 Gael Varoquaux
• 168 Fabian Pedregosa (INRIA, Parietal Team)
• 127 Jake Vanderplas
• 120 Mathieu Blondel
• 85 Alexandre Passos
• 67 Alexandre Gramfort
• 57 Peter Prettenhofer
• 56 Gilles Louppe
• 42 Robert Layton
• 38 Nelle Varoquaux
• 32 Jean Kossaifi
• 30 Conrad Lee
• 22 Pietro Berkes
• 18 andy


• 17 David Warde-Farley
• 12 Brian Holt
• 11 Robert
• 8 Amit Aides
• 8 Virgile Fritsch
• 7 Yaroslav Halchenko
• 6 Salvatore Masecchia
• 5 Paolo Losi
• 4 Vincent Schut
• 3 Alexis Metaireau
• 3 Bryan Silverthorn
• 3 Andreas Müller
• 2 Minwoo Jake Lee
• 1 Emmanuelle Gouillart
• 1 Keith Goodman
• 1 Lucas Wiman
• 1 Nicolas Pinto
• 1 Thouis (Ray) Jones
• 1 Tim Sheerman-Chase

1.7.21 Version 0.8
May 11, 2011
scikit-learn 0.8 was released in May 2011, one month after the first “international” scikit-learn coding sprint, and is marked by the inclusion of important modules: Hierarchical clustering, Cross decomposition, Non-negative matrix factorization (NMF or NNMF), initial support for Python 3, and by important enhancements and bug fixes.
Changelog
Several new modules were introduced during this release:
• New Hierarchical clustering module by Vincent Michel, Bertrand Thirion, Alexandre Gramfort and Gael Varoquaux.
• Kernel PCA implementation by Mathieu Blondel
• The Labeled Faces in the Wild face recognition dataset by Olivier Grisel.
• New Cross decomposition module by Edouard Duchesnay.
• Non-negative matrix factorization (NMF or NNMF) module by Vlad Niculae.
• Implementation of the Oracle Approximating Shrinkage algorithm by Virgile Fritsch in the Covariance estimation module.
Some other modules benefited from significant improvements or cleanups.


• Initial support for Python 3: builds and imports cleanly, some modules are usable while others have failing tests
by Fabian Pedregosa.
• decomposition.PCA is now usable from the Pipeline object by Olivier Grisel.
• Guide How to optimize for speed by Olivier Grisel.
• Fixes for memory leaks in libsvm bindings, 64-bit safer BallTree by Lars Buitinck.
• Bug and style fixes in the K-means algorithm by Jan Schlüter.
• Add attribute converged to Gaussian Mixture Models by Vincent Schut.
• Implemented transform, predict_log_proba in discriminant_analysis.LinearDiscriminantAnalysis by Mathieu Blondel.
• Refactoring in the Support Vector Machines module and bug fixes by Fabian Pedregosa, Gael Varoquaux and
Amit Aides.
• Refactored SGD module (removed code duplication, better variable naming), added interface for sample weight
by Peter Prettenhofer.
• Wrapped BallTree with Cython by Thouis (Ray) Jones.
• Added function svm.l1_min_c by Paolo Losi.
• Typos, doc style, etc. by Yaroslav Halchenko, Gael Varoquaux, Olivier Grisel, Yann Malet, Nicolas Pinto, Lars
Buitinck and Fabian Pedregosa.
People
People that made this release possible preceded by number of commits:
• 159 Olivier Grisel
• 96 Gael Varoquaux
• 96 Vlad Niculae
• 94 Fabian Pedregosa
• 36 Alexandre Gramfort
• 32 Paolo Losi
• 31 Edouard Duchesnay
• 30 Mathieu Blondel
• 25 Peter Prettenhofer
• 22 Nicolas Pinto
• 11 Virgile Fritsch
• 7 Lars Buitinck
• 6 Vincent Michel
• 5 Bertrand Thirion
• 4 Thouis (Ray) Jones
• 4 Vincent Schut
• 3 Jan Schlüter
• 2 Julien Miotte

• 2 Matthieu Perrot
• 2 Yann Malet
• 2 Yaroslav Halchenko
• 1 Amit Aides
• 1 Andreas Müller
• 1 Feth Arezki
• 1 Meng Xinfan

1.7.22 Version 0.7
March 2, 2011
scikit-learn 0.7 was released in March 2011, roughly three months after the 0.6 release. This release is marked by the
speed improvements in existing algorithms like k-Nearest Neighbors and K-Means algorithm and by the inclusion of
an efficient algorithm for computing the Ridge Generalized Cross Validation solution. Unlike the preceding release,
no new modules were added to this release.
Changelog
• Performance improvements for Gaussian Mixture Model sampling [Jan Schlüter].
• Implementation of efficient leave-one-out cross-validated Ridge in linear_model.RidgeCV [Mathieu
Blondel]
• Better handling of collinearity and early stopping in linear_model.lars_path [Alexandre Gramfort and
Fabian Pedregosa].
• Fixes for liblinear ordering of labels and sign of coefficients [Dan Yamins, Paolo Losi, Mathieu Blondel and
Fabian Pedregosa].
• Performance improvements for Nearest Neighbors algorithm in high-dimensional spaces [Fabian Pedregosa].
• Performance improvements for cluster.KMeans [Gael Varoquaux and James Bergstra].
• Sanity checks for SVM-based classes [Mathieu Blondel].
• Refactoring of neighbors.NeighborsClassifier and neighbors.kneighbors_graph: added
different algorithms for the k-Nearest Neighbor Search and implemented a more stable algorithm for finding
barycenter weights. Also added some developer documentation for this module, see notes_neighbors for more
information [Fabian Pedregosa].
• Documentation improvements: Added pca.RandomizedPCA and linear_model.LogisticRegression to the class reference. Also added references of matrices used for clustering and other fixes [Gael Varoquaux, Fabian Pedregosa, Mathieu Blondel, Olivier Grisel, Virgile Fritsch, Emmanuelle Gouillart].
• Bound decision_function in classes that make use of liblinear, dense and sparse variants, like svm.LinearSVC or linear_model.LogisticRegression [Fabian Pedregosa].
• Performance and API improvements to metrics.euclidean_distances and to pca.RandomizedPCA [James Bergstra].
• Fix compilation issues under NetBSD [Kamel Ibn Hassen Derouiche]
• Allow input sequences of different lengths in hmm.GaussianHMM [Ron Weiss].


• Fix bug in affinity propagation caused by incorrect indexing [Xinfan Meng]
People
People that made this release possible preceded by number of commits:
• 85 Fabian Pedregosa
• 67 Mathieu Blondel
• 20 Alexandre Gramfort
• 19 James Bergstra
• 14 Dan Yamins
• 13 Olivier Grisel
• 12 Gael Varoquaux
• 4 Edouard Duchesnay
• 4 Ron Weiss
• 2 Satrajit Ghosh
• 2 Vincent Dubourg
• 1 Emmanuelle Gouillart
• 1 Kamel Ibn Hassen Derouiche
• 1 Paolo Losi
• 1 VirgileFritsch
• 1 Yaroslav Halchenko
• 1 Xinfan Meng

1.7.23 Version 0.6
December 21, 2010
scikit-learn 0.6 was released in December 2010. It is marked by the inclusion of several new modules and a general renaming of old ones. It is also marked by the inclusion of new examples, including applications to real-world datasets.
Changelog
• New stochastic gradient descent module by Peter Prettenhofer. The module comes with complete documentation
and examples.
• Improved svm module: memory consumption has been reduced by 50%, heuristic to automatically set class
weights, possibility to assign weights to samples (see SVM: Weighted samples for an example).
• New Gaussian Processes module by Vincent Dubourg.
This module also has great documentation and some very neat examples.
See example_gaussian_process_plot_gp_regression.py or example_gaussian_process_plot_gp_probabilistic_classification_after_regression.py for a taste of what can be done.
• It is now possible to use liblinear’s Multi-class SVC (option multi_class in svm.LinearSVC)
• New features and performance improvements of text feature extraction.


• Improved sparse matrix support, both in main classes (grid_search.GridSearchCV ) as in modules
sklearn.svm.sparse and sklearn.linear_model.sparse.
• Lots of cool new examples and a new section that uses real-world datasets was created. These include: Faces
recognition example using eigenfaces and SVMs, Species distribution modeling, Libsvm GUI, Wikipedia principal eigenvector and others.
• Faster Least Angle Regression algorithm. It is now 2x faster than the R version in the worst case and up to 10x faster in some cases.
• Faster coordinate descent algorithm. In particular, the full path version of lasso (linear_model.lasso_path) is more than 200x faster than before.
• It is now possible to get probability estimates from a linear_model.LogisticRegression model.
• Module renaming: the glm module has been renamed to linear_model, the gmm module has been merged into the more general mixture module, and the sgd module has been included in linear_model.
• Lots of bug fixes and documentation improvements.
People
People that made this release possible preceded by number of commits:
• 207 Olivier Grisel
• 167 Fabian Pedregosa
• 97 Peter Prettenhofer
• 68 Alexandre Gramfort
• 59 Mathieu Blondel
• 55 Gael Varoquaux
• 33 Vincent Dubourg
• 21 Ron Weiss
• 9 Bertrand Thirion
• 3 Alexandre Passos
• 3 Anne-Laure Fouque
• 2 Ronan Amicel
• 1 Christian Osendorfer

1.7.24 Version 0.5
October 11, 2010
Changelog
New classes
• Support for sparse matrices in some classifiers of modules svm and linear_model (see svm.
sparse.SVC, svm.sparse.SVR, svm.sparse.LinearSVC, linear_model.sparse.Lasso,
linear_model.sparse.ElasticNet)


• New pipeline.Pipeline object to compose different estimators.
• Recursive Feature Elimination routines in module Feature selection.
• Addition of various classes capable of cross validation in the linear_model module (linear_model.
LassoCV , linear_model.ElasticNetCV , etc.).
• New, more efficient LARS algorithm implementation. The Lasso variant of the algorithm is also implemented.
See linear_model.lars_path, linear_model.Lars and linear_model.LassoLars.
• New Hidden Markov Models module (see classes hmm.GaussianHMM, hmm.MultinomialHMM, hmm.
GMMHMM)
• New module feature_extraction (see class reference)
• New FastICA algorithm in module sklearn.fastica
Documentation
• Improved documentation for many modules, now separating narrative documentation from the class reference.
As an example, see documentation for the SVM module and the complete class reference.
Fixes
• API changes: adhere variable names to PEP-8, give more meaningful names.
• Fixes for svm module to run on a shared memory context (multiprocessing).
• It is again possible to generate latex (and thus PDF) from the sphinx docs.
Examples

• new examples using some of the mlcomp datasets: sphx_glr_auto_examples_mlcomp_sparse_document_classification.py (since removed) and Classification of text documents using sparse features
• Many more examples. See here the full list of examples.
External dependencies
• Joblib is now a dependency of this package, although a copy is shipped with the package (sklearn.externals.joblib).
Removed modules
• Module ann (Artificial Neural Networks) has been removed from the distribution. Users wanting this sort of algorithm should take a look at pybrain.
Misc
• New sphinx theme for the web page.


Authors
The following is a list of authors for this release, preceded by number of commits:
• 262 Fabian Pedregosa
• 240 Gael Varoquaux
• 149 Alexandre Gramfort
• 116 Olivier Grisel
• 40 Vincent Michel
• 38 Ron Weiss
• 23 Matthieu Perrot
• 10 Bertrand Thirion
• 7 Yaroslav Halchenko
• 9 VirgileFritsch
• 6 Edouard Duchesnay
• 4 Mathieu Blondel
• 1 Ariel Rokem
• 1 Matthieu Brucher

1.7.25 Version 0.4
August 26, 2010
Changelog
Major changes in this release include:
• Coordinate Descent algorithm (Lasso, ElasticNet) refactoring & speed improvements (roughly 100x faster).
• Coordinate Descent Refactoring (and bug fixing) for consistency with R’s package GLMNET.
• New metrics module.
• New GMM module contributed by Ron Weiss.
• Implementation of the LARS algorithm (without Lasso variant for now).
• feature_selection module redesign.
• Migration to GIT as version control system.
• Removal of obsolete attrselect module.
• Rename of private compiled extensions (added underscore).
• Removal of legacy unmaintained code.
• Documentation improvements (both docstring and rst).
• Improvement of the build system to (optionally) link with MKL. Also, provide a lite BLAS implementation in
case no system-wide BLAS is found.


• Lots of new examples.
• Many, many bug fixes . . .
Authors
The committer list for this release is the following (preceded by number of commits):
• 143 Fabian Pedregosa
• 35 Alexandre Gramfort
• 34 Olivier Grisel
• 11 Gael Varoquaux
• 5 Yaroslav Halchenko
• 2 Vincent Michel
• 1 Chris Filo Gorgolewski

1.7.26 Earlier versions
Earlier versions included contributions by Fred Mailhot, David Cooke, David Huard, Dave Morrill, Ed Schofield,
Travis Oliphant, Pearu Peterson.


CHAPTER TWO: SCIKIT-LEARN TUTORIALS

2.1 An introduction to machine learning with scikit-learn
Section contents
In this section, we introduce the machine learning vocabulary that we use throughout scikit-learn and give a simple
learning example.

2.1.1 Machine learning: the problem setting
In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data.
If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is
said to have several attributes or features.
We can separate learning problems in a few large categories:
• supervised learning, in which the data comes with additional attributes that we want to predict (Click here to go
to the scikit-learn supervised learning page). This problem can be either:
– classification: samples belong to two or more classes and we want to learn from already labeled data how
to predict the class of unlabeled data. An example of a classification problem would be the handwritten digit
recognition example, in which the aim is to assign each input vector to one of a finite number of discrete
categories. Another way to think of classification is as a discrete (as opposed to continuous) form of
supervised learning where one has a limited number of categories and for each of the n samples provided,
one is to try to label them with the correct category or class.
– regression: if the desired output consists of one or more continuous variables, then the task is called
regression. An example of a regression problem would be the prediction of the length of a salmon as a
function of its age and weight.
• unsupervised learning, in which the training data consists of a set of input vectors x without any corresponding
target values. The goal in such problems may be to discover groups of similar examples within the data, where
it is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of
visualization (Click here to go to the Scikit-Learn unsupervised learning page).


Training set and testing set
Machine learning is about learning some properties of a data set and applying them to new data. This is why a
common practice in machine learning to evaluate an algorithm is to split the data at hand into two sets, one that we
call the training set on which we learn data properties and one that we call the testing set on which we test these
properties.
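As a small, hedged illustration of this practice (not part of the original tutorial text), the train_test_split helper can perform such a split; the 25% test fraction below is an arbitrary choice.

from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# Hold out 25% of the samples as a testing set (illustrative value).
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)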

2.1.2 Loading an example dataset
scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for classification and the boston
house prices dataset for regression.
In the following, we start a Python interpreter from our shell and then load the iris and digits datasets. Our
notational convention is that $ denotes the shell prompt while >>> denotes the Python interpreter prompt:
$ python
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> digits = datasets.load_digits()

A dataset is a dictionary-like object that holds all the data and some metadata about the data. This data is stored in the .data member, which is an (n_samples, n_features) array. In the case of supervised problems, one or more response variables are stored in the .target member. More details on the different datasets can be found in the dedicated section.
For instance, in the case of the digits dataset, digits.data gives access to the features that can be used to classify
the digits samples:
>>> print(digits.data)
[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ...,
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]

and digits.target gives the ground truth for the digit dataset, that is the number corresponding to each digit
image that we are trying to learn:
>>> digits.target
array([0, 1, 2, ..., 8, 9, 8])

Shape of the data arrays
The data is always a 2D array, shape (n_samples, n_features), although the original data may have had a
different shape. In the case of the digits, each original sample is an image of shape (8, 8) and can be accessed
using:
>>> digits.images[0]
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
       [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
       [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
       [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
       [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
       [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
       [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
       [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])

The simple example on this dataset illustrates how starting from the original problem one can shape the data for
consumption in scikit-learn.

Loading from external datasets
To load from an external dataset, please refer to loading external datasets.

2.1.3 Learning and predicting
In the case of the digits dataset, the task is to predict, given an image, which digit it represents. We are given samples
of each of the 10 possible classes (the digits zero through nine) on which we fit an estimator to be able to predict the
classes to which unseen samples belong.
In scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and
predict(T).
An example of an estimator is the class sklearn.svm.SVC that implements support vector classification. The
constructor of an estimator takes as arguments the parameters of the model, but for the time being, we will consider
the estimator as a black box:
>>> from sklearn import svm
>>> clf = svm.SVC(gamma=0.001, C=100.)

Choosing the parameters of the model
In this example we set the value of gamma manually. It is possible to automatically find good values for the
parameters by using tools such as grid search and cross validation.
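For illustration only, here is a rough sketch of the grid search idea mentioned in the note; the parameter grid below is a made-up example, not a recommendation.

from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
# Illustrative parameter grid; real grids are usually wider.
param_grid = {'gamma': [1e-3, 1e-2], 'C': [1., 10., 100.]}
search = GridSearchCV(svm.SVC(), param_grid, cv=3)
search.fit(digits.data, digits.target)
print(search.best_params_)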
We call our estimator instance clf, as it is a classifier. It now must be fitted to the data, that is, it must learn from
the data. This is done by passing our training set to the fit method. As a training set, let us use all the images of
our dataset apart from the last one. We select this training set with the [:-1] Python syntax, which produces a new
array that contains all but the last entry of digits.data:
>>> clf.fit(digits.data[:-1], digits.target[:-1])
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

Now you can predict new values, in particular, we can ask to the classifier what is the digit of our last image in the
digits dataset, which we have not used to train the classifier:
>>> clf.predict(digits.data[-1:])
array([8])


The corresponding image is the following (see the rendered documentation for the figure). As you can see, it is a challenging task: the images are of poor resolution. Do you agree with the classifier?

A complete example of this classification problem is available as an example that you can run and study: Recognizing
hand-written digits.

2.1.4 Model persistence
It is possible to save a model in the scikit by using Python’s built-in persistence model, namely pickle:
>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0

In the specific case of the scikit, it may be more interesting to use joblib’s replacement of pickle (joblib.dump &
joblib.load), which is more efficient on big data, but can only pickle to the disk and not to a string:
>>> from sklearn.externals import joblib
>>> joblib.dump(clf, 'filename.pkl')

Later you can load back the pickled model (possibly in another Python process) with:
>>> clf = joblib.load('filename.pkl')

Note: joblib.dump and joblib.load functions also accept file-like object instead of filenames. More information on data persistence with Joblib is available here.
Note that pickle has some security and maintainability issues. Please refer to section Model persistence for more
detailed information about model persistence with scikit-learn.


2.1.5 Conventions
scikit-learn estimators follow certain rules to make their behavior more predictable.
Type casting
Unless otherwise specified, input will be cast to float64:
>>> import numpy as np
>>> from sklearn import random_projection
>>> rng = np.random.RandomState(0)
>>> X = rng.rand(10, 2000)
>>> X = np.array(X, dtype='float32')
>>> X.dtype
dtype('float32')
>>> transformer = random_projection.GaussianRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.dtype
dtype('float64')

In this example, X is float32, which is cast to float64 by fit_transform(X).
Regression targets are cast to float64, classification targets are maintained:
>>> from sklearn import datasets
>>> from sklearn.svm import SVC
>>> iris = datasets.load_iris()
>>> clf = SVC()
>>> clf.fit(iris.data, iris.target)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> list(clf.predict(iris.data[:3]))
[0, 0, 0]
>>> clf.fit(iris.data, iris.target_names[iris.target])
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> list(clf.predict(iris.data[:3]))
['setosa', 'setosa', 'setosa']

Here, the first predict() returns an integer array, since iris.target (an integer array) was used in fit. The
second predict() returns a string array, since iris.target_names was used for fitting.
Refitting and updating parameters
Hyper-parameters of an estimator can be updated after it has been constructed via the sklearn.pipeline.
Pipeline.set_params method. Calling fit() more than once will overwrite what was learned by any previous
fit():


>>> import numpy as np
>>> from sklearn.svm import SVC
>>> rng = np.random.RandomState(0)
>>> X = rng.rand(100, 10)
>>> y = rng.binomial(1, 0.5, 100)
>>> X_test = rng.rand(5, 10)

>>> clf = SVC()
>>> clf.set_params(kernel='linear').fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> clf.predict(X_test)
array([1, 0, 1, 1, 0])
>>> clf.set_params(kernel='rbf').fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> clf.predict(X_test)
array([0, 0, 0, 1, 0])

Here, the default kernel rbf is first changed to linear after the estimator has been constructed via SVC(), and
changed back to rbf to refit the estimator and to make a second prediction.
Multiclass vs. multilabel fitting
When using multiclass classifiers, the learning and prediction task that is performed is dependent on the
format of the target data fit upon:
>>> from sklearn.svm import SVC
>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn.preprocessing import LabelBinarizer
>>> X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
>>> y = [0, 0, 1, 1, 2]
>>> classif = OneVsRestClassifier(estimator=SVC(random_state=0))
>>> classif.fit(X, y).predict(X)
array([0, 0, 1, 1, 2])

In the above case, the classifier is fit on a 1d array of multiclass labels and the predict() method therefore provides
corresponding multiclass predictions. It is also possible to fit upon a 2d array of binary label indicators:
>>> y = LabelBinarizer().fit_transform(y)
>>> classif.fit(X, y).predict(X)
array([[1, 0, 0],
[1, 0, 0],
[0, 1, 0],
[0, 0, 0],
[0, 0, 0]])

Here, the classifier is fit() on a 2d binary label representation of y, using the LabelBinarizer. In this case
predict() returns a 2d array representing the corresponding multilabel predictions.

Note that the fourth and fifth instances returned all zeroes, indicating that they matched none of the three labels fit
upon. With multilabel outputs, it is similarly possible for an instance to be assigned multiple labels:
>> from sklearn.preprocessing import MultiLabelBinarizer
>> y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
>> y = MultiLabelBinarizer().fit_transform(y)
>> classif.fit(X, y).predict(X)
array([[1, 1, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[1, 0, 1, 1, 0],
[0, 0, 1, 0, 1]])

In this case, the classifier is fit upon instances each assigned multiple labels. The MultiLabelBinarizer is
used to binarize the 2d array of multilabels to fit upon. As a result, predict() returns a 2d array with multiple
predicted labels for each instance.

2.2 A tutorial on statistical-learning for scientific data processing
Statistical learning
Machine learning is a technique with a growing importance, as the size of the datasets experimental sciences are facing is rapidly growing. Problems it tackles range from building a prediction function linking different observations,
to classifying observations, or learning the structure in an unlabeled dataset.
This tutorial will explore statistical learning, the use of machine learning techniques with the goal of statistical
inference: drawing conclusions on the data at hand.
Scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific
Python packages (NumPy, SciPy, matplotlib).

2.2.1 Statistical learning: the setting and the estimator object in scikit-learn
Datasets
Scikit-learn deals with learning information from one or more datasets that are represented as 2D arrays. They can be
understood as a list of multi-dimensional observations. We say that the first axis of these arrays is the samples axis,
while the second is the features axis.
A simple example shipped with the scikit: iris dataset
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> data = iris.data
>>> data.shape
(150, 4)

It is made of 150 observations of irises, each described by 4 features: their sepal and petal length and width, as
detailed in iris.DESCR.


When the data is not initially in the (n_samples, n_features) shape, it needs to be preprocessed in order to
be used by scikit-learn.
An example of reshaping data would be the digits dataset

The digits dataset is made of 1797 8x8 images of hand-written digits
>>> digits = datasets.load_digits()
>>> digits.images.shape
(1797, 8, 8)
>>> import matplotlib.pyplot as plt
>>> plt.imshow(digits.images[-1], cmap=plt.cm.gray_r)


To use this dataset with the scikit, we transform each 8x8 image into a feature vector of length 64
>>> data = digits.images.reshape((digits.images.shape[0], -1))

Estimators objects
Fitting data: the main API implemented by scikit-learn is that of the estimator. An estimator is any object that learns
from data; it may be a classification, regression or clustering algorithm or a transformer that extracts/filters useful
features from raw data.
All estimator objects expose a fit method that takes a dataset (usually a 2-d array):
>>> estimator.fit(data)

Estimator parameters: All the parameters of an estimator can be set when it is instantiated or by modifying the
corresponding attribute:
>>> estimator = Estimator(param1=1, param2=2)
>>> estimator.param1
1

Estimated parameters: When data is fitted with an estimator, parameters are estimated from the data at hand. All the
estimated parameters are attributes of the estimator object ending by an underscore:
>>> estimator.estimated_param_
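As a concrete (illustrative) instance of these conventions, any scikit-learn estimator would do; the sketch below picks KMeans with an arbitrary n_clusters=3.

from sklearn import datasets
from sklearn.cluster import KMeans

data = datasets.load_iris().data

estimator = KMeans(n_clusters=3, random_state=0)  # parameters set at instantiation
estimator.fit(data)                               # parameters estimated from the data

print(estimator.n_clusters)         # a constructor parameter
print(estimator.cluster_centers_)   # an estimated parameter (trailing underscore)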


2.2.2 Supervised learning: predicting an output variable from high-dimensional observations
The problem solved in supervised learning
Supervised learning consists in learning the link between two datasets: the observed data X and an external variable
y that we are trying to predict, usually called “target” or “labels”. Most often, y is a 1D array of length n_samples.
All supervised estimators in scikit-learn implement a fit(X, y) method to fit the model and a predict(X)
method that, given unlabeled observations X, returns the predicted labels y.

Vocabulary: classification and regression
If the prediction task is to classify the observations in a set of finite labels, in other words to “name” the objects
observed, the task is said to be a classification task. On the other hand, if the goal is to predict a continuous target
variable, it is said to be a regression task.
When doing classification in scikit-learn, y is a vector of integers or strings.
Note: See the Introduction to machine learning with scikit-learn Tutorial for a quick run-through on the basic
machine learning vocabulary used within scikit-learn.

Nearest neighbor and the curse of dimensionality

Classifying irises:


The iris dataset is a
classification task consisting in identifying 3 different types of irises (Setosa, Versicolour, and Virginica) from their
petal and sepal length and width:
>>> import numpy as np
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> iris_X = iris.data
>>> iris_y = iris.target
>>> np.unique(iris_y)
array([0, 1, 2])

k-Nearest neighbors classifier
The simplest possible classifier is the nearest neighbor: given a new observation X_test, find in the training set (i.e.
the data used to train the estimator) the observation with the closest feature vector. (Please see the Nearest Neighbors
section of the online Scikit-learn documentation for more information about this type of classifier.)
Training set and testing set
While experimenting with any learning algorithm, it is important not to test the prediction of an estimator on the
data used to fit the estimator as this would not be evaluating the performance of the estimator on new data. This is
why datasets are often split into train and test data.


KNN (k nearest neighbors) classification example:
>>> # Split iris data in train and test data
>>> # A random permutation, to split the data randomly
>>> np.random.seed(0)
>>> indices = np.random.permutation(len(iris_X))
>>> iris_X_train = iris_X[indices[:-10]]
>>> iris_y_train = iris_y[indices[:-10]]
>>> iris_X_test = iris_X[indices[-10:]]
>>> iris_y_test = iris_y[indices[-10:]]
>>> # Create and fit a nearest-neighbor classifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier()
>>> knn.fit(iris_X_train, iris_y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform')
>>> knn.predict(iris_X_test)
array([1, 2, 1, 0, 0, 0, 2, 1, 2, 0])
>>> iris_y_test
array([1, 1, 1, 0, 0, 0, 2, 1, 2, 0])

The curse of dimensionality
For an estimator to be effective, you need the distance between neighboring points to be less than some value 𝑑, which
depends on the problem. In one dimension, this requires on average 𝑛 ∼ 1/𝑑 points. In the context of the above 𝑘-NN
example, if the data is described by just one feature with values ranging from 0 to 1 and with 𝑛 training observations,
then new data will be no further away than 1/𝑛. Therefore, the nearest neighbor decision rule will be efficient as soon
as 1/𝑛 is small compared to the scale of between-class feature variations.
If the number of features is 𝑝, you now require 𝑛 ∼ 1/𝑑𝑝 points. Let’s say that we require 10 points in one dimension:
now 10𝑝 points are required in 𝑝 dimensions to pave the [0, 1] space. As 𝑝 becomes large, the number of training points
required for a good estimator grows exponentially.
For example, if each point is just a single number (8 bytes), then an effective 𝑘-NN estimator in a paltry 𝑝 ∼ 20
dimensions would require more training data than the current estimated size of the entire internet (±1000 Exabytes or so).
This is called the curse of dimensionality and is a core problem that machine learning addresses.
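A quick back-of-the-envelope check of the figure quoted above (added here purely for illustration): 10 points per dimension in p = 20 dimensions, at 8 bytes per point, already reaches hundreds of exabytes.

# Rough arithmetic only; no scikit-learn involved.
points = 10 ** 20          # 10**p samples needed to pave [0, 1]**p for p = 20
total_bytes = points * 8   # one 8-byte number per sample
print(total_bytes / 1e18)  # 800.0 exabytes, the same order of magnitude as ~1000 EB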
Linear model: from regression to sparsity

Diabetes dataset
The diabetes dataset consists of 10 physiological variables (age, sex, weight, blood pressure) measured on 442 patients, and an indication of disease progression after one year:
>>> diabetes = datasets.load_diabetes()
>>> diabetes_X_train = diabetes.data[:-20]
>>> diabetes_X_test = diabetes.data[-20:]
>>> diabetes_y_train = diabetes.target[:-20]
>>> diabetes_y_test = diabetes.target[-20:]

The task at hand is to predict disease progression from physiological variables.

Linear regression
LinearRegression, in its simplest form, fits a linear model to the data set by adjusting a set
of parameters in order to make the sum of the squared residuals of the model as small as possible.

Linear models: 𝑦 = 𝑋𝛽 + 𝜖
• 𝑋: data
• 𝑦: target variable
• 𝛽: Coefficients
• 𝜖: Observation noise
>>> from sklearn import linear_model
>>> regr = linear_model.LinearRegression()
>>> regr.fit(diabetes_X_train, diabetes_y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> print(regr.coef_)
[   0.30349955 -237.63931533  510.53060544  327.73698041 -814.13170937
  492.81458798  102.84845219  184.60648906  743.51961675   76.09517222]
>>> # The mean square error
>>> np.mean((regr.predict(diabetes_X_test)-diabetes_y_test)**2)
2004.56760268...


>>> # Explained variance score: 1 is perfect prediction
>>> # and 0 means that there is no linear relationship
>>> # between X and y.
>>> regr.score(diabetes_X_test, diabetes_y_test)
0.5850753022690...

Shrinkage
If there are few data points per dimension, noise in the observations induces high variance:

>>> X = np.c_[ .5, 1].T
>>> y = [.5, 1]
>>> test = np.c_[ 0, 2].T
>>> regr = linear_model.LinearRegression()

>>> import matplotlib.pyplot as plt
>>> plt.figure()
>>> np.random.seed(0)
>>> for _ in range(6):
...     this_X = .1*np.random.normal(size=(2, 1)) + X
...     regr.fit(this_X, y)
...     plt.plot(test, regr.predict(test))
...     plt.scatter(this_X, y, s=3)

A solution in high-dimensional statistical learning is to shrink the regression coefficients to zero: any two randomly chosen sets of observations are likely to be uncorrelated. This is called Ridge regression:


>>> regr = linear_model.Ridge(alpha=.1)
>>> plt.figure()
>>> np.random.seed(0)
>>> for _ in range(6):
...     this_X = .1*np.random.normal(size=(2, 1)) + X
...     regr.fit(this_X, y)
...     plt.plot(test, regr.predict(test))
...     plt.scatter(this_X, y, s=3)

This is an example of bias/variance tradeoff: the larger the ridge alpha parameter, the higher the bias and the lower
the variance.
We can choose alpha to minimize left out error, this time using the diabetes dataset rather than our synthetic data:
>>> alphas = np.logspace(-4, -1, 6)
>>> from __future__ import print_function
>>> print([regr.set_params(alpha=alpha
...            ).fit(diabetes_X_train, diabetes_y_train,
...            ).score(diabetes_X_test, diabetes_y_test) for alpha in alphas])
[0.5851110683883..., 0.5852073015444..., 0.5854677540698..., 0.5855512036503..., 0.5830717085554..., 0.57058999437...]

Note: Capturing in the fitted parameters noise that prevents the model from generalizing to new data is called overfitting.
The bias introduced by the ridge regression is called a regularization.

Sparsity
Fitting only features 1 and 2


Note: A representation of the full diabetes dataset would involve 11 dimensions (10 feature dimensions and one of
the target variable). It is hard to develop an intuition on such representation, but it may be useful to keep in mind that
it would be a fairly empty space.
We can see that, although feature 2 has a strong coefficient on the full model, it conveys little information on y when
considered with feature 1.
To improve the conditioning of the problem (i.e. mitigating the curse of dimensionality), it would be interesting to select only the informative features and set non-informative ones, like feature 2, to 0. Ridge regression will decrease their contribution, but not set them to zero. Another penalization approach, called Lasso (least absolute shrinkage and selection operator), can set some coefficients to zero. Such methods are called sparse methods and sparsity can be seen as an application of Occam’s razor: prefer simpler models.
>>> regr = linear_model.Lasso()
>>> scores = [regr.set_params(alpha=alpha
...               ).fit(diabetes_X_train, diabetes_y_train
...               ).score(diabetes_X_test, diabetes_y_test)
...           for alpha in alphas]
>>> best_alpha = alphas[scores.index(max(scores))]
>>> regr.alpha = best_alpha
>>> regr.fit(diabetes_X_train, diabetes_y_train)
Lasso(alpha=0.025118864315095794, copy_X=True, fit_intercept=True,
max_iter=1000, normalize=False, positive=False, precompute=False,
random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
>>> print(regr.coef_)
[   0.         -212.43764548  517.19478111  313.77959962 -160.8303982    -0.
 -187.19554705   69.38229038  508.66011217   71.84239008]

Different algorithms for the same problem
Different algorithms can be used to solve the same mathematical problem. For instance the Lasso object in scikit-learn solves the lasso regression problem using a coordinate descent method, which is efficient on large datasets. However, scikit-learn also provides the LassoLars object using the LARS algorithm, which is very efficient for problems in which the estimated weight vector is very sparse (i.e. problems with very few observations).
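A minimal sketch (not from the original tutorial) comparing the two solvers on the diabetes data used above; alpha=0.1 is an arbitrary illustrative value.

from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X, y = diabetes.data[:-20], diabetes.target[:-20]

cd = linear_model.Lasso(alpha=0.1).fit(X, y)         # coordinate descent solver
lars = linear_model.LassoLars(alpha=0.1).fit(X, y)   # LARS-based solver

# Both objects solve the same lasso problem, so their coefficients are close.
print(cd.coef_)
print(lars.coef_)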

Classification

For classification, as in the labeling iris task, linear regression is not
the right approach as it will give too much weight to data far from the decision frontier. A linear approach is to fit a
sigmoid function or logistic function:
y = \mathrm{sigmoid}(X\beta - \mathrm{offset}) + \epsilon = \frac{1}{1 + \exp(-X\beta + \mathrm{offset})} + \epsilon

>>> logistic = linear_model.LogisticRegression(C=1e5)
>>> logistic.fit(iris_X_train, iris_y_train)
LogisticRegression(C=100000.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1, max_iter=100,
multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

This is known as LogisticRegression.


Multiclass classification
If you have several classes to predict, an option often used is to fit one-versus-all classifiers and then use a voting
heuristic for the final decision.
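Purely as an illustration of the one-versus-all strategy described above (not part of the original text), OneVsRestClassifier can wrap any binary classifier; the choice of LinearSVC below is arbitrary.

from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
ovr = OneVsRestClassifier(LinearSVC(random_state=0))
ovr.fit(iris.data, iris.target)

print(len(ovr.estimators_))        # one binary classifier per class (3 for iris)
print(ovr.predict(iris.data[:3]))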

Shrinkage and sparsity with logistic regression
The C parameter controls the amount of regularization in the LogisticRegression object: a large value
for C results in less regularization. penalty="l2" gives Shrinkage (i.e. non-sparse coefficients), while
penalty="l1" gives Sparsity.
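A quick sketch of the effect just described, assuming the liblinear solver (which supports both penalties); C=0.1 is an arbitrary illustrative value.

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear').fit(iris.data, iris.target)
l2 = LogisticRegression(penalty='l2', C=0.1, solver='liblinear').fit(iris.data, iris.target)

print(np.sum(l1.coef_ == 0))  # the l1 penalty drives some coefficients exactly to zero
print(np.sum(l2.coef_ == 0))  # the l2 penalty only shrinks them; typically none are zero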

Exercise
Try classifying the digits dataset with nearest neighbors and a linear model. Leave out the last 10% and test
prediction performance on these observations.
from sklearn import datasets, neighbors, linear_model
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

Solution: ../../auto_examples/exercises/plot_digits_classification_exercise.py

Support vector machines (SVMs)
Linear SVMs
Support Vector Machines belong to the discriminant model family: they try to find a combination of samples to build
a plane maximizing the margin between the two classes. Regularization is set by the C parameter: a small value for C
means the margin is calculated using many or all of the observations around the separating line (more regularization);
a large value for C means the margin is calculated on observations close to the separating line (less regularization).
Figures: Unregularized SVM; Regularized SVM (default).

Example:
• Plot different SVM classifiers in the iris dataset
SVMs can be used in regression (SVR, Support Vector Regression) or in classification (SVC, Support Vector Classification).
>>> from sklearn import svm
>>> svc = svm.SVC(kernel='linear')
>>> svc.fit(iris_X_train, iris_y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

Warning: Normalizing data
For many estimators, including SVMs, scaling each feature of the dataset to unit standard deviation is important to obtain good predictions.
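For example, a minimal sketch using StandardScaler (not part of the original tutorial; it assumes the iris train/test splits defined earlier):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(iris_X_train)
iris_X_train_scaled = scaler.transform(iris_X_train)
# reuse the statistics learned on the training set to scale the test set
iris_X_test_scaled = scaler.transform(iris_X_test)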

Using kernels
Classes are not always linearly separable in feature space. The solution is to build a decision function that is not linear
but may be polynomial instead. This is done using the kernel trick that can be seen as creating a decision energy by
positioning kernels on observations:
Linear kernel

>>> svc = svm.SVC(kernel='linear')

Polynomial kernel

>>> svc = svm.SVC(kernel='poly',
...               degree=3)
>>> # degree: polynomial degree

RBF kernel (Radial Basis Function)

>>> svc = svm.SVC(kernel='rbf')
>>> # gamma: inverse of size of
>>> # radial kernel

Interactive example
See the SVM GUI to download svm_gui.py; add data points of both classes with right and left button, fit the
model and change parameters and data.


Exercise
Try classifying classes 1 and 2 from the iris dataset with SVMs, using only the first 2 features. Leave out 10% of each
class and test prediction performance on these observations.
Warning: the classes are ordered; do not leave out the last 10%, or you would be testing on only one class.
Hint: You can use the decision_function method on a grid to get intuitions.
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0, :2]
y = y[y != 0]

Solution: ../../auto_examples/exercises/plot_iris_exercise.py
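A possible sketch of a solution (the linked solution may differ), shuffling the data so that the held-out 10% contains both classes:

import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0, :2]
y = y[y != 0]

# shuffle so that the last 10% is not a single class
rng = np.random.RandomState(0)
order = rng.permutation(len(X))
X, y = X[order], y[order]
n_train = int(0.9 * len(X))

clf = svm.SVC(kernel='linear')
clf.fit(X[:n_train], y[:n_train])
print(clf.score(X[n_train:], y[n_train:]))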

2.2.3 Model selection: choosing estimators and their parameters
Score, and cross-validated scores
As we have seen, every estimator exposes a score method that can judge the quality of the fit (or the prediction) on
new data. Bigger is better.
>>> from sklearn import datasets, svm
>>> digits = datasets.load_digits()
>>> X_digits = digits.data
>>> y_digits = digits.target
>>> svc = svm.SVC(C=1, kernel='linear')
>>> svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
0.97999999999999998

To get a better measure of prediction accuracy (which we can use as a proxy for goodness of fit of the model), we can
successively split the data in folds that we use for training and testing:
>>> import numpy as np
>>> X_folds = np.array_split(X_digits, 3)
>>> y_folds = np.array_split(y_digits, 3)
>>> scores = list()
>>> for k in range(3):
...     # We use 'list' to copy, in order to 'pop' later on
...     X_train = list(X_folds)
...     X_test = X_train.pop(k)
...     X_train = np.concatenate(X_train)
...     y_train = list(y_folds)
...     y_test = y_train.pop(k)
...     y_train = np.concatenate(y_train)
...     scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
>>> print(scores)
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]

This is called a KFold cross-validation.


Cross-validation generators
Scikit-learn has a collection of classes which can be used to generate lists of train/test indices for popular cross-validation strategies.
They expose a split method which accepts the input dataset to be split and yields the train/test set indices for each
iteration of the chosen cross-validation strategy.
The following example shows how to use the split method.
>>> from sklearn.model_selection import KFold, cross_val_score
>>> X = ["a", "a", "b", "c", "c", "c"]
>>> k_fold = KFold(n_splits=3)
>>> for train_indices, test_indices in k_fold.split(X):
...     print('Train: %s | test: %s' % (train_indices, test_indices))
Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]

The cross-validation can then be performed easily:
>>> [svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
...  for train, test in k_fold.split(X_digits)]
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]

The cross-validation score can be directly calculated using the cross_val_score helper. Given an estimator, the
cross-validation object and the input dataset, cross_val_score repeatedly splits the data into a training and a
testing set, trains the estimator on the training set and computes a score on the testing set for each iteration
of the cross-validation.
By default the estimator’s score method is used to compute the individual scores.
Refer to the metrics module to learn more about the available scoring methods.
>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)
array([ 0.93489149, 0.95659432, 0.93989983])

n_jobs=-1 means that the computation will be dispatched on all the CPUs of the computer.
Alternatively, the scoring argument can be provided to specify an alternative scoring method.
>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold,
...                 scoring='precision_macro')
array([ 0.93969761, 0.95911415, 0.94041254])

Cross-validation generators

KFold (n_splits, shuffle, random_state)
    Splits the data into K folds, trains on K-1 folds and then tests on the left-out fold.
StratifiedKFold (n_splits, shuffle, random_state)
    Same as K-Fold but preserves the class distribution within each fold.
GroupKFold (n_splits)
    Ensures that the same group is not in both testing and training sets.
ShuffleSplit (n_splits, test_size, train_size, random_state)
    Generates train/test indices based on random permutation.
StratifiedShuffleSplit
    Same as shuffle split but preserves the class distribution within each iteration.
GroupShuffleSplit
    Ensures that the same group is not in both testing and training sets.
LeaveOneGroupOut ()
    Takes a group array to group observations.
LeavePGroupsOut (n_groups)
    Leave P groups out.
LeaveOneOut ()
    Leave one observation out.
LeavePOut (p)
    Leave P observations out.
PredefinedSplit
    Generates train/test indices based on predefined splits.
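For instance, a minimal sketch of GroupKFold on hypothetical toy data (each test fold contains the samples of exactly one group):

from sklearn.model_selection import GroupKFold

X = [0.1, 0.2, 2.2, 2.4, 4.1, 4.3]
y = [0, 0, 1, 1, 0, 1]
groups = [1, 1, 2, 2, 3, 3]

for train, test in GroupKFold(n_splits=3).split(X, y, groups):
    print('Train: %s | test: %s' % (train, test))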

Exercise

On the digits dataset, plot the cross-validation score of an SVC estimator with a linear kernel as a function of the
parameter C (use a logarithmic grid of 10 points, from 1e-10 to 1e0, as in the snippet below).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm
digits = datasets.load_digits()
X = digits.data
y = digits.target
svc = svm.SVC(kernel='linear')
C_s = np.logspace(-10, 0, 10)

Solution: Cross-validation on Digits Dataset Exercise
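A possible sketch of the plotting loop, reusing svc, C_s, X, y and cross_val_score from the skeleton above (the linked solution may differ in details):

import matplotlib.pyplot as plt

scores = [cross_val_score(svc.set_params(C=C), X, y).mean() for C in C_s]

plt.semilogx(C_s, scores)
plt.xlabel('Parameter C')
plt.ylabel('Cross-validation score')
plt.show()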

Grid-search and cross-validated estimators
Grid-search
scikit-learn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and
chooses the parameters to maximize the cross-validation score. This object takes an estimator during construction
and exposes the estimator API:
>>> from sklearn.model_selection import GridSearchCV, cross_val_score
>>> Cs = np.logspace(-6, -1, 10)


>>> clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
...                    n_jobs=-1)
>>> clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None,...
>>> clf.best_score_
0.925...
>>> clf.best_estimator_.C
0.0077...
>>> # Prediction performance on test set is not as good as on train set
>>> clf.score(X_digits[1000:], y_digits[1000:])
0.943...

By default, the GridSearchCV uses a 3-fold cross-validation. However, if it detects that a classifier is passed, rather
than a regressor, it uses a stratified 3-fold.
Nested cross-validation
>>> cross_val_score(clf, X_digits, y_digits)
...
array([ 0.938..., 0.963..., 0.944...])

Two cross-validation loops are performed, one nested inside the other: an inner loop by the GridSearchCV estimator to set C and an
outer loop by cross_val_score to measure the prediction performance of the estimator. The resulting scores
are unbiased estimates of the prediction score on new data.

Warning: You cannot nest objects with parallel computing (n_jobs different than 1).

Cross-validated estimators
Cross-validation to set a parameter can be done more efficiently on an algorithm-by-algorithm basis. This is why, for
certain estimators, scikit-learn exposes estimators (see Cross-validation: evaluating estimator performance) that set their
parameter automatically by cross-validation:
>>> from sklearn import linear_model, datasets
>>> lasso = linear_model.LassoCV()
>>> diabetes = datasets.load_diabetes()
>>> X_diabetes = diabetes.data
>>> y_diabetes = diabetes.target
>>> lasso.fit(X_diabetes, y_diabetes)
LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
verbose=False)
>>> # The estimator chose its lambda (alpha) automatically:
>>> lasso.alpha_
0.01229...

These estimators are called similarly to their counterparts, with ‘CV’ appended to their name.


Exercise
On the diabetes dataset, find the optimal regularization parameter alpha.
Bonus: How much can you trust the selection of alpha?
from sklearn import datasets
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

diabetes = datasets.load_diabetes()

Solution: Cross-validation on diabetes Dataset Exercise
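A minimal sketch for the bonus question, assuming the imports and the diabetes dataset from the skeleton above (it refits LassoCV on different subsets of the data to see how stable the selected alpha is):

import numpy as np

X = diabetes.data
y = diabetes.target

lasso_cv = LassoCV(alphas=np.logspace(-4, -0.5, 30))
for k, (train, test) in enumerate(KFold(n_splits=3).split(X)):
    lasso_cv.fit(X[train], y[train])
    print('Fold %d: alpha = %.5f, score = %.3f'
          % (k, lasso_cv.alpha_, lasso_cv.score(X[test], y[test])))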

2.2.4 Unsupervised learning: seeking representations of the data
Clustering: grouping observations together

The problem solved in clustering
Given the iris dataset, if we knew that there were 3 types of iris but did not have access to a taxonomist to label
them, we could try a clustering task: split the observations into well-separated groups called clusters.

K-means clustering
Note that there exist a lot of different clustering criteria and associated algorithms. The simplest clustering algorithm
is K-means.

>>> from sklearn import cluster, datasets
>>> iris = datasets.load_iris()
>>> X_iris = iris.data
>>> y_iris = iris.target

>>> k_means = cluster.KMeans(n_clusters=3)
>>> k_means.fit(X_iris)
KMeans(algorithm='auto', copy_x=True, init='k-means++', ...
>>> print(k_means.labels_[::10])
[1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
>>> print(y_iris[::10])
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]

Warning: There is absolutely no guarantee of recovering a ground truth. First, choosing the right number of
clusters is hard. Second, the algorithm is sensitive to initialization and can fall into local minima, although
scikit-learn employs several tricks to mitigate this issue.

Figures: Bad initialization; 8 clusters; Ground truth. Don’t over-interpret clustering results.

Application example: vector quantization
Clustering in general and KMeans, in particular, can be seen as a way of choosing a small number of exemplars to
compress the information. The problem is sometimes known as vector quantization. For instance, this can be used
to posterize an image:
>>> import scipy as sp
>>> try:
...     face = sp.face(gray=True)
... except AttributeError:
...     from scipy import misc
...     face = misc.face(gray=True)
>>> X = face.reshape((-1, 1)) # We need an (n_sample, n_feature) array
>>> k_means = cluster.KMeans(n_clusters=5, n_init=1)
>>> k_means.fit(X)
KMeans(algorithm='auto', copy_x=True, init='k-means++', ...
>>> values = k_means.cluster_centers_.squeeze()
>>> labels = k_means.labels_
>>> face_compressed = np.choose(labels, values)
>>> face_compressed.shape = face.shape

Figures: Raw image; K-means quantization; Equal bins; Image histogram.

Hierarchical agglomerative clustering: Ward
A Hierarchical clustering method is a type of cluster analysis that aims to build a hierarchy of clusters. In general, the
various approaches of this technique are either:
• Agglomerative - bottom-up approaches: each observation starts in its own cluster, and clusters are iteratively
merged so as to minimize a linkage criterion. This approach is particularly interesting when the clusters of interest are made of only a few observations. When the number of clusters is large, it is much more
computationally efficient than k-means (a minimal usage sketch follows below).
• Divisive - top-down approaches: all observations start in one cluster, which is iteratively split as one moves
down the hierarchy. For estimating large numbers of clusters, this approach is both slow (due to all observations
starting as one cluster, which it splits recursively) and statistically ill-posed.
Connectivity-constrained clustering
With agglomerative clustering, it is possible to specify which samples can be clustered together by giving a connectivity graph. Graphs in scikit-learn are represented by their adjacency matrix; often, a sparse matrix is used. This
can be useful, for instance, to retrieve connected regions (sometimes also referred to as connected components) when
clustering an image:
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
from sklearn.feature_extraction.image import grid_to_graph
from sklearn.cluster import AgglomerativeClustering

# #############################################################################
# Generate data
try:  # SciPy >= 0.16 has face in misc
    from scipy.misc import face
    face = face(gray=True)
except ImportError:
    face = sp.face(gray=True)
# Resize it to 10% of the original size to speed up the processing
face = sp.misc.imresize(face, 0.10) / 255.
X = np.reshape(face, (-1, 1))

# #############################################################################
# Define the structure A of the data. Pixels connected to their neighbors.
connectivity = grid_to_graph(*face.shape)


Feature agglomeration
We have seen that sparsity could be used to mitigate the curse of dimensionality, i.e. an insufficient number of observations compared to the number of features. Another approach is to merge together similar features: feature
agglomeration. This approach can be implemented by clustering in the feature direction, in other words, clustering
the transposed data.
>>> digits = datasets.load_digits()
>>> images = digits.images
>>> X = np.reshape(images, (len(images), -1))
>>> connectivity = grid_to_graph(*images[0].shape)

>>> agglo = cluster.FeatureAgglomeration(connectivity=connectivity,
...                                      n_clusters=32)
>>> agglo.fit(X)
FeatureAgglomeration(affinity='euclidean', compute_full_tree='auto',...
>>> X_reduced = agglo.transform(X)
>>> X_approx = agglo.inverse_transform(X_reduced)
>>> images_approx = np.reshape(X_approx, images.shape)

transform and inverse_transform methods
Some estimators expose a transform method, for instance to reduce the dimensionality of the dataset.

Decompositions: from a signal to components and loadings

Components and loadings
If X is our multivariate data, then the problem that we are trying to solve is to rewrite it on a different observational
basis: we want to learn loadings L and a set of components C such that X = L C. Different criteria exist to choose
the components.


Principal component analysis: PCA
Principal component analysis (PCA) selects the successive components that explain the maximum variance in the
signal.

The point cloud spanned by the observations above is very flat in one direction: one of the three univariate features
can almost be exactly computed using the other two. PCA finds the directions in which the data is not flat.
When used to transform data, PCA can reduce the dimensionality of the data by projecting it on a principal subspace.
>>> # Create a signal with only 2 useful dimensions
>>> x1 = np.random.normal(size=100)
>>> x2 = np.random.normal(size=100)
>>> x3 = x1 + x2
>>> X = np.c_[x1, x2, x3]

>>> from sklearn import decomposition
>>> pca = decomposition.PCA()
>>> pca.fit(X)
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
>>> print(pca.explained_variance_)
[  2.18565811e+00   1.19346747e+00   8.43026679e-32]
>>> # As we can see, only the 2 first components are useful
>>> pca.n_components = 2
>>> X_reduced = pca.fit_transform(X)
>>> X_reduced.shape
(100, 2)

Independent Component Analysis: ICA
Independent component analysis (ICA) selects components so that the distribution of their loadings carries
a maximum amount of independent information. It is able to recover non-Gaussian independent signals:


>>> # Generate sample data
>>> import numpy as np
>>> from scipy import signal
>>> time = np.linspace(0, 10, 2000)
>>> s1 = np.sin(2 * time)  # Signal 1 : sinusoidal signal
>>> s2 = np.sign(np.sin(3 * time))  # Signal 2 : square signal
>>> s3 = signal.sawtooth(2 * np.pi * time)  # Signal 3: saw tooth signal
>>> S = np.c_[s1, s2, s3]
>>> S += 0.2 * np.random.normal(size=S.shape)  # Add noise
>>> S /= S.std(axis=0)  # Standardize data
>>> # Mix data
>>> A = np.array([[1, 1, 1], [0.5, 2, 1], [1.5, 1, 2]])  # Mixing matrix
>>> X = np.dot(S, A.T)  # Generate observations

>>> # Compute ICA
>>> ica = decomposition.FastICA()
>>> S_ = ica.fit_transform(X) # Get the estimated sources
>>> A_ = ica.mixing_.T
>>> np.allclose(X, np.dot(S_, A_) + ica.mean_)
True


2.2.5 Putting it all together
Pipelining
We have seen that some estimators can transform data and that some estimators can predict variables. We can also
create combined estimators:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
# Plot the PCA spectrum
pca.fit(X_digits)
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')
# Prediction
n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)
# Parameters of pipelines can be set using '__' separated parameter names:
estimator = GridSearchCV(pipe,
                         dict(pca__n_components=n_components,
                              logistic__C=Cs))
estimator.fit(X_digits, y_digits)
plt.axvline(estimator.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen')


plt.legend(prop=dict(size=12))
plt.show()

Face recognition with eigenfaces
The dataset used in this example is a preprocessed excerpt of the “Labeled Faces in the Wild”, also known as LFW:
http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)
"""
===================================================
Faces recognition example using eigenfaces and SVMs
===================================================
The dataset used in this example is a preprocessed excerpt of the
"Labeled Faces in the Wild", aka LFW_:
http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)
.. _LFW: http://vis-www.cs.umass.edu/lfw/
Expected results for the top 5 most represented people in the dataset:
================== ============ ======= ========== =======
                   precision    recall  f1-score   support
================== ============ ======= ========== =======
Ariel Sharon       0.67         0.92    0.77       13
Colin Powell       0.75         0.78    0.76       60
Donald Rumsfeld    0.78         0.67    0.72       27
George W Bush      0.86         0.86    0.86       146
Gerhard Schroeder  0.76         0.76    0.76       25
Hugo Chavez        0.67         0.67    0.67       15
Tony Blair         0.81         0.69    0.75       36
avg / total        0.80         0.80    0.80       322
================== ============ ======= ========== =======
"""
from __future__ import print_function
from time import time
import logging
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
from sklearn.svm import SVC

print(__doc__)
# Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')


# #############################################################################
# Download the data, if not already on disk and load it as numpy arrays
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
# introspect the images arrays to find the shapes (for plotting)
n_samples, h, w = lfw_people.images.shape
# for machine learning we use the data directly (as relative pixel
# positions info is ignored by this model)
X = lfw_people.data
n_features = X.shape[1]
# the label to predict is the id of the person
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]
print("Total dataset size:")
print("n_samples: %d" % n_samples)
print("n_features: %d" % n_features)
print("n_classes: %d" % n_classes)

# #############################################################################
# Split into a training set and a test set using a stratified k fold
# split into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# #############################################################################
# Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled
# dataset): unsupervised feature extraction / dimensionality reduction
n_components = 150
print("Extracting the top %d eigenfaces from %d faces"
% (n_components, X_train.shape[0]))
t0 = time()
pca = PCA(n_components=n_components, svd_solver='randomized',
whiten=True).fit(X_train)
print("done in %0.3fs" % (time() - t0))
eigenfaces = pca.components_.reshape((n_components, h, w))
print("Projecting the input data on the eigenfaces orthonormal basis")
t0 = time()
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print("done in %0.3fs" % (time() - t0))

# #############################################################################
# Train a SVM classification model


print("Fitting the classifier to the training set")
t0 = time()
param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1], }
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf = clf.fit(X_train_pca, y_train)
print("done in %0.3fs" % (time() - t0))
print("Best estimator found by grid search:")
print(clf.best_estimator_)

# #############################################################################
# Quantitative evaluation of the model quality on the test set
print("Predicting people's names on the test set")
t0 = time()
y_pred = clf.predict(X_test_pca)
print("done in %0.3fs" % (time() - t0))
print(classification_report(y_test, y_pred, target_names=target_names))
print(confusion_matrix(y_test, y_pred, labels=range(n_classes)))

# #############################################################################
# Qualitative evaluation of the predictions using matplotlib
def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
    """Helper function to plot a gallery of portraits"""
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())


# plot the result of the prediction on a portion of the test set
def title(y_pred, y_test, target_names, i):
    pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
    true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
    return 'predicted: %s\ntrue:      %s' % (pred_name, true_name)


prediction_titles = [title(y_pred, y_test, target_names, i)
                     for i in range(y_pred.shape[0])]
plot_gallery(X_test, prediction_titles, h, w)

# plot the gallery of the most significative eigenfaces
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)
plt.show()


Figures: Prediction gallery; Eigenfaces gallery.

Expected results for the top 5 most represented people in the dataset:

                   precision    recall  f1-score   support

Gerhard_Schroeder       0.91      0.75      0.82        28
Donald_Rumsfeld         0.84      0.82      0.83        33
Tony_Blair              0.65      0.82      0.73        34
Colin_Powell            0.78      0.88      0.83        58
George_W_Bush           0.93      0.86      0.90       129

avg / total             0.86      0.84      0.85       282

Open problem: Stock Market Structure
Can we predict the variation in stock prices for Google over a given time frame?
Learning a graph structure

2.2.6 Finding help
The project mailing list
If you encounter a bug with scikit-learn or something that needs clarification in the docstring or the online
documentation, please feel free to ask on the Mailing List
Q&A communities with Machine Learning practitioners
Quora.com Quora has a topic for Machine Learning related questions that also features some
interesting discussions: https://www.quora.com/topic/Machine-Learning


Stack Exchange The Stack Exchange family of sites hosts multiple subdomains for Machine
Learning questions.
• An excellent free online course for Machine Learning taught by Professor Andrew Ng of Stanford: https://www.coursera.org/learn/machine-learning
• Another excellent free online course that takes a more general approach to Artificial Intelligence: https://www.udacity.com/course/intro-to-artificial-intelligence–cs271

2.3 Working With Text Data
The goal of this guide is to explore some of the main scikit-learn tools on a single practical task: analysing a
collection of text documents (newsgroups posts) on twenty different topics.
In this section we will see how to:
• load the file contents and the categories
• extract feature vectors suitable for machine learning
• train a linear model to perform categorization
• use a grid search strategy to find a good configuration of both the feature extraction components and the classifier

2.3.1 Tutorial setup
To get started with this tutorial, you must first have scikit-learn and all of its required dependencies installed.
Please refer to the installation instructions page for more information and for per-system instructions.
The source of this tutorial can be found within your scikit-learn folder:
scikit-learn/doc/tutorial/text_analytics/

The tutorial folder should contain the following folders:
• *.rst files - the source of the tutorial document written with sphinx
• data - folder to put the datasets used during the tutorial
• skeletons - sample incomplete scripts for the exercises
• solutions - solutions of the exercises
You can already copy the skeletons into a new folder somewhere on your hard-drive named
sklearn_tut_workspace where you will edit your own files for the exercises while keeping the original
skeletons intact:
% cp -r skeletons work_directory/sklearn_tut_workspace

Machine Learning algorithms need data. Go to each $TUTORIAL_HOME/data sub-folder and run the
fetch_data.py script from there (after having read them first).
For instance:
% cd $TUTORIAL_HOME/data/languages
% less fetch_data.py
% python fetch_data.py


2.3.2 Loading the 20 newsgroups dataset
The dataset is called “Twenty Newsgroups”. Here is the official description, quoted from the website:
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned
(nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected
by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments
in text applications of machine learning techniques, such as text classification and text clustering.
In the following we will use the built-in dataset loader for 20 newsgroups from scikit-learn. Alternatively, it is possible
to download the dataset manually from the web-site and use the sklearn.datasets.load_files function by
pointing it to the 20news-bydate-train subfolder of the uncompressed archive folder.
In order to get faster execution times for this first example we will work on a partial dataset with only 4 categories out
of the 20 available in the dataset:
>>> categories = ['alt.atheism', 'soc.religion.christian',
...               'comp.graphics', 'sci.med']

We can now load the list of files matching those categories as follows:
>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty_train = fetch_20newsgroups(subset='train',
...     categories=categories, shuffle=True, random_state=42)

The returned dataset is a scikit-learn “bunch”: a simple holder object with fields that can be accessed both
as python dict keys and as object attributes for convenience; for instance, the target_names attribute holds the list of the
requested category names:
>>> twenty_train.target_names
['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

The files themselves are loaded in memory in the data attribute. For reference the filenames are also available:
>>> len(twenty_train.data)
2257
>>> len(twenty_train.filenames)
2257

Let’s print the first lines of the first loaded file:
>>> print("\n".join(twenty_train.data[0].split("\n")[:3]))
From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
>>> print(twenty_train.target_names[twenty_train.target[0]])
comp.graphics

Supervised learning algorithms will require a category label for each document in the training set. In this case the category is the name of the newsgroup which also happens to be the name of the folder holding the individual documents.
For speed and space efficiency reasons scikit-learn loads the target attribute as an array of integers that corresponds to the index of the category name in the target_names list. The category integer id of each sample is stored
in the target attribute:


>>> twenty_train.target[:10]
array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])

It is possible to get back the category names as follows:
>>> for t in twenty_train.target[:10]:
...     print(twenty_train.target_names[t])
...
comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med

You can notice that the samples have been shuffled randomly (with a fixed RNG seed): this is useful if you select only
the first samples to quickly train a model and get a first idea of the results before re-training on the complete dataset
later.

2.3.3 Extracting features from text files
In order to perform machine learning on text documents, we first need to turn the text content into numerical feature
vectors.
Bags of words
The most intuitive way to do so is the bags of words representation:
1. assign a fixed integer id to each word occurring in any document of the training set (for instance by building a
dictionary from words to integer indices).
2. for each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value
of feature #j where j is the index of word w in the dictionary
The bags of words representation implies that n_features is the number of distinct words in the corpus: this
number is typically larger than 100,000.
If n_samples == 10000, storing X as a numpy array of type float32 would require 10000 x 100000 x 4 bytes =
4GB in RAM which is barely manageable on today’s computers.
Fortunately, most values in X will be zeros, since for a given document fewer than a couple of thousand distinct words
will be used. For this reason, we say that bags of words are typically high-dimensional sparse datasets. We can save
a lot of memory by only storing the non-zero parts of the feature vectors in memory.
scipy.sparse matrices are data structures that do exactly this, and scikit-learn has built-in support for these
structures.
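As a rough illustrative sketch (toy numbers, not from the tutorial), a sparse matrix only stores the non-zero entries:

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> dense = np.zeros((3, 10))
>>> dense[0, 2] = 1
>>> dense[2, 7] = 4
>>> sparse = csr_matrix(dense)
>>> sparse.nnz  # only the two non-zero entries are stored
2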
Tokenizing text with scikit-learn
Text preprocessing, tokenizing and filtering of stopwords are included in a high level component that is able to build a
dictionary of features and transform documents to feature vectors:


>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count_vect = CountVectorizer()
>>> X_train_counts = count_vect.fit_transform(twenty_train.data)
>>> X_train_counts.shape
(2257, 35788)

CountVectorizer supports counts of N-grams of words or consecutive characters. Once fitted, the vectorizer has
built a dictionary of feature indices:
>>> count_vect.vocabulary_.get(u'algorithm')
4690

The index value of a word in the vocabulary is linked to its frequency in the whole training corpus.
From occurrences to frequencies
Occurrence count is a good start but there is an issue: longer documents will have higher average count values than
shorter documents, even though they might talk about the same topics.
To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by
the total number of words in the document: these new features are called tf for Term Frequencies.
Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are
therefore less informative than those that occur only in a smaller portion of the corpus.
This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”.
Both tf and tf–idf can be computed as follows:
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
>>> X_train_tf = tf_transformer.transform(X_train_counts)
>>> X_train_tf.shape
(2257, 35788)

In the above example code, we first use the fit(..) method to fit our estimator to the data and then the
transform(..) method to transform our count matrix to a tf-idf representation. These two steps can be combined to achieve the same end result faster by skipping redundant processing. This is done by using the
fit_transform(..) method as shown below, and as mentioned in the note in the previous section:
>>> tfidf_transformer = TfidfTransformer()
>>> X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
>>> X_train_tfidf.shape
(2257, 35788)

2.3.4 Training a classifier
Now that we have our features, we can train a classifier to try to predict the category of a post. Let’s start with a
naïve Bayes classifier, which provides a nice baseline for this task. scikit-learn includes several variants of this
classifier; the one most suitable for word counts is the multinomial variant:
>>> from sklearn.naive_bayes import MultinomialNB
>>> clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)


To try to predict the outcome on a new document we need to extract the features using almost the same feature extracting chain as before. The difference is that we call transform instead of fit_transform on the transformers,
since they have already been fit to the training set:
>>> docs_new = ['God is love', 'OpenGL on the GPU is fast']
>>> X_new_counts = count_vect.transform(docs_new)
>>> X_new_tfidf = tfidf_transformer.transform(X_new_counts)
>>> predicted = clf.predict(X_new_tfidf)
>>> for doc, category in zip(docs_new, predicted):
...     print('%r => %s' % (doc, twenty_train.target_names[category]))
...
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics

2.3.5 Building a pipeline
In order to make the vectorizer => transformer => classifier easier to work with, scikit-learn provides a
Pipeline class that behaves like a compound classifier:
>>> from sklearn.pipeline import Pipeline
>>> text_clf = Pipeline([('vect', CountVectorizer()),
...                      ('tfidf', TfidfTransformer()),
...                      ('clf', MultinomialNB()),
... ])

The names vect, tfidf and clf (classifier) are arbitrary. We shall see their use in the section on grid search, below.
We can now train the model with a single command:
>>> text_clf.fit(twenty_train.data, twenty_train.target)
Pipeline(...)

2.3.6 Evaluation of the performance on the test set
Evaluating the predictive accuracy of the model is equally easy:
>>> import numpy as np
>>> twenty_test = fetch_20newsgroups(subset='test',
...     categories=categories, shuffle=True, random_state=42)
>>> docs_test = twenty_test.data
>>> predicted = text_clf.predict(docs_test)
>>> np.mean(predicted == twenty_test.target)
0.834...

I.e., we achieved 83.4% accuracy. Let’s see if we can do better with a linear support vector machine (SVM), which is
widely regarded as one of the best text classification algorithms (although it’s also a bit slower than naïve Bayes). We
can change the learner by just plugging a different classifier object into our pipeline:
>>> from sklearn.linear_model import SGDClassifier
>>> text_clf = Pipeline([('vect', CountVectorizer()),
...                      ('tfidf', TfidfTransformer()),
...                      ('clf', SGDClassifier(loss='hinge', penalty='l2',
...                                            alpha=1e-3, random_state=42,
...                                            max_iter=5, tol=None)),
... ])
>>> text_clf.fit(twenty_train.data, twenty_train.target)
Pipeline(...)
>>> predicted = text_clf.predict(docs_test)
>>> np.mean(predicted == twenty_test.target)
0.912...

scikit-learn further provides utilities for more detailed performance analysis of the results:
>>> from sklearn import metrics
>>> print(metrics.classification_report(twenty_test.target, predicted,
...     target_names=twenty_test.target_names))
...
                        precision    recall  f1-score   support

           alt.atheism       0.95      0.81      0.87       319
         comp.graphics       0.88      0.97      0.92       389
               sci.med       0.94      0.90      0.92       396
soc.religion.christian       0.90      0.95      0.93       398

           avg / total       0.92      0.91      0.91      1502


>>> metrics.confusion_matrix(twenty_test.target, predicted)
array([[258,  11,  15,  35],
       [  4, 379,   3,   3],
       [  5,  33, 355,   3],
       [  5,  10,   4, 379]])

As expected the confusion matrix shows that posts from the newsgroups on atheism and christian are more often
confused for one another than with computer graphics.

2.3.7 Parameter tuning using grid search
We’ve already encountered some parameters such as use_idf in the TfidfTransformer. Classifiers tend to have
many parameters as well; e.g., MultinomialNB includes a smoothing parameter alpha and SGDClassifier
has a penalty parameter alpha and configurable loss and penalty terms in the objective function (see the module
documentation, or use the Python help function, to get a description of these).
Instead of tweaking the parameters of the various components of the chain, it is possible to run an exhaustive search of
the best parameters on a grid of possible values. We try out all classifiers on either words or bigrams, with or without
idf, and with a penalty parameter of either 0.01 or 0.001 for the linear SVM:
>>> from sklearn.model_selection import GridSearchCV
>>> parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
...               'tfidf__use_idf': (True, False),
...               'clf__alpha': (1e-2, 1e-3),
... }

Obviously, such an exhaustive search can be expensive. If we have multiple CPU cores at our disposal, we can tell
the grid searcher to try these eight parameter combinations in parallel with the n_jobs parameter. If we give this
parameter a value of -1, grid search will detect how many cores are installed and use them all:
>>> gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)


The grid search instance behaves like a normal scikit-learn model. Let’s perform the search on a smaller subset
of the training data to speed up the computation:
>>> gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])

The result of calling fit on a GridSearchCV object is a classifier that we can use to predict:
>>> twenty_train.target_names[gs_clf.predict(['God is love'])[0]]
'soc.religion.christian'

The object’s best_score_ and best_params_ attributes store the best mean score and the parameters setting
corresponding to that score:
>>> gs_clf.best_score_
0.900...
>>> for param_name in sorted(parameters.keys()):
...     print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
...
clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)

A more detailed summary of the search is available at gs_clf.cv_results_.
The cv_results_ attribute can easily be imported into pandas as a DataFrame for further inspection.
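For instance, a minimal sketch (assuming pandas is installed and gs_clf has been fitted as above; the column names follow the param_<name> convention used by cv_results_):

import pandas as pd

results = pd.DataFrame(gs_clf.cv_results_)
print(results[['param_vect__ngram_range', 'param_tfidf__use_idf',
               'param_clf__alpha', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False))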
Exercises
To do the exercises, copy the content of the ‘skeletons’ folder as a new folder named ‘workspace’:
% cp -r skeletons workspace

You can then edit the content of the workspace without fear of losing the original exercise instructions.
Then fire an ipython shell and run the work-in-progress script with:
[1] %run workspace/exercise_XX_script.py arg1 arg2 arg3

If an exception is triggered, use %debug to fire-up a post mortem ipdb session.
Refine the implementation and iterate until the exercise is solved.
For each exercise, the skeleton file provides all the necessary import statements, boilerplate code to load the
data and sample code to evaluate the predictive accuracy of the model.

2.3.8 Exercise 1: Language identification
• Write a text classification pipeline using a custom preprocessor and CharNGramAnalyzer using data from
Wikipedia articles as training set.
• Evaluate the performance on some held out test set.
ipython command line:
%run workspace/exercise_01_language_train_model.py data/languages/paragraphs/


2.3.9 Exercise 2: Sentiment Analysis on movie reviews
• Write a text classification pipeline to classify movie reviews as either positive or negative.
• Find a good set of parameters using grid search.
• Evaluate the performance on a held out test set.
ipython command line:
%run workspace/exercise_02_sentiment.py data/movie_reviews/txt_sentoken/

2.3.10 Exercise 3: CLI text classification utility
Using the results of the previous exercises and the cPickle module of the standard library, write a command line
utility that detects the language of some text provided on stdin and estimates the polarity (positive or negative) if the
text is written in English.
Bonus point if the utility is able to give a confidence level for its predictions.

2.3.11 Where to from here
Here are a few suggestions to help further your scikit-learn intuition upon the completion of this tutorial:
• Try playing around with the analyzer and token normalisation under CountVectorizer
• If you don’t have labels, try using Clustering on your problem.
• If you have multiple labels per document, e.g. categories, have a look at the Multiclass and multilabel section
• Try using Truncated SVD for latent semantic analysis.
• Have a look at using Out-of-core Classification to learn from data that would not fit into the computer main
memory.
• Have a look at the Hashing Vectorizer as a memory efficient alternative to CountVectorizer.

2.4 Choosing the right estimator
Often the hardest part of solving a machine learning problem can be finding the right estimator for the job.
Different estimators are better suited for different types of data and different problems.
The flowchart below is designed to give users a bit of a rough guide on how to approach problems with regard to which
estimators to try on your data.
Click on any estimator in the chart below to see its documentation.

2.5 External Resources, Videos and Talks
For written tutorials, see the Tutorial section of the documentation.


2.5.1 New to Scientific Python?
For those that are still new to the scientific Python ecosystem, we highly recommend the Python Scientific Lecture
Notes. This will help you find your footing a bit and will definitely improve your scikit-learn experience. A basic
understanding of NumPy arrays is recommended to make the most of scikit-learn.

2.5.2 External Tutorials
There are several online tutorials available which are geared toward specific subject areas:
• Machine Learning for NeuroImaging in Python
• Machine Learning for Astronomical Data Analysis

2.5.3 Videos
• An introduction to scikit-learn Part I and Part II at Scipy 2013 by Gael Varoquaux, Jake Vanderplas and Olivier
Grisel. Notebooks on github.
• Introduction to scikit-learn by Gael Varoquaux at ICML 2010
A three minute video from a very early stage of the scikit, explaining the basic idea and approach we
are following.
• Introduction to statistical learning with scikit-learn by Gael Varoquaux at SciPy 2011
An extensive tutorial, consisting of four sessions of one hour. The tutorial covers the basics of machine learning, many algorithms and how to apply them using scikit-learn. The corresponding material is now in the scikit-learn documentation section A tutorial on statistical-learning for scientific
data processing.
• Statistical Learning for Text Classification with scikit-learn and NLTK (and slides) by Olivier Grisel at PyCon
2011
Thirty minute introduction to text classification. Explains how to use NLTK and scikit-learn to solve
real-world text classification tasks and compares against cloud-based solutions.
• Introduction to Interactive Predictive Analytics in Python with scikit-learn by Olivier Grisel at PyCon 2012
3-hours long introduction to prediction tasks using scikit-learn.
• scikit-learn - Machine Learning in Python by Jake Vanderplas at the 2012 PyData workshop at Google
Interactive demonstration of some scikit-learn features. 75 minutes.
• scikit-learn tutorial by Jake Vanderplas at PyData NYC 2012
Presentation using the online tutorial, 45 minutes.

Note: Doctest Mode
The code-examples in the above tutorials are written in a python-console format. If you wish to easily execute these
examples in IPython, use:


%doctest_mode

in the IPython-console. You can then simply copy and paste the examples directly into IPython without having to
worry about removing the >>> manually.


CHAPTER THREE: USER GUIDE

3.1 Supervised learning
3.1.1 Generalized Linear Models
The following are a set of methods intended for regression in which the target value is expected to be a linear combination of the input variables. In mathematical notation, if 𝑦ˆ is the predicted value:
𝑦ˆ(𝑤, 𝑥) = 𝑤0 + 𝑤1 𝑥1 + ... + 𝑤𝑝 𝑥𝑝
Across the module, we designate the vector 𝑤 = (𝑤1 , ..., 𝑤𝑝 ) as coef_ and 𝑤0 as intercept_.
To perform classification with generalized linear models, see Logistic regression.
Ordinary Least Squares
LinearRegression fits a linear model with coefficients 𝑤 = (𝑤1 , ..., 𝑤𝑝 ) to minimize the residual sum of squares
between the observed responses in the dataset, and the responses predicted by the linear approximation. Mathematically it solves a problem of the form:
min_𝑤 ||𝑋𝑤 − 𝑦||₂²

LinearRegression will take in its fit method arrays X, y and will store the coefficients 𝑤 of the linear model
in its coef_ member:

>>> from sklearn import linear_model
>>> reg = linear_model.LinearRegression()
>>> reg.fit ([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> reg.coef_
array([ 0.5, 0.5])

However, coefficient estimates for Ordinary Least Squares rely on the independence of the model terms. When terms
are correlated and the columns of the design matrix 𝑋 have an approximate linear dependence, the design matrix
becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the
observed response, producing a large variance. This situation of multicollinearity can arise, for example, when data
are collected without an experimental design.
Examples:
• Linear Regression Example

Ordinary Least Squares Complexity
This method computes the least squares solution using a singular value decomposition of X. If X is a matrix of size (n,
p) this method has a cost of 𝑂(𝑛𝑝²), assuming that 𝑛 ≥ 𝑝.
Ridge Regression
Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of
coefficients. The ridge coefficients minimize a penalized residual sum of squares:

min_𝑤 ||𝑋𝑤 − 𝑦||₂² + 𝛼||𝑤||₂²

Here, 𝛼 ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of 𝛼, the greater the
amount of shrinkage and thus the coefficients become more robust to collinearity.

As with other linear models, Ridge will take in its fit method arrays X, y and will store the coefficients 𝑤 of the
linear model in its coef_ member:


>>> from sklearn import linear_model
>>> reg = linear_model.Ridge (alpha = .5)
>>> reg.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, random_state=None, solver='auto', tol=0.001)
>>> reg.coef_
array([ 0.34545455, 0.34545455])
>>> reg.intercept_
0.13636...

Examples:
• Plot Ridge coefficients as a function of the regularization
• Classification of text documents using sparse features

Ridge Complexity
This method has the same order of complexity as Ordinary Least Squares.
Setting the regularization parameter: generalized Cross-Validation
RidgeCV implements ridge regression with built-in cross-validation of the alpha parameter. The object works in
the same way as GridSearchCV except that it defaults to Generalized Cross-Validation (GCV), an efficient form of
leave-one-out cross-validation:
>>> from sklearn import linear_model
>>> reg = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0])
>>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
RidgeCV(alphas=[0.1, 1.0, 10.0], cv=None, fit_intercept=True, scoring=None,
normalize=False)
>>> reg.alpha_
0.1

References
• “Notes on Regularized Least Squares”, Rifkin & Lippert (technical report, course slides).

Lasso
The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency
to prefer solutions with fewer parameter values, effectively reducing the number of variables upon which the given
solution is dependent. For this reason, the Lasso and its variants are fundamental to the field of compressed sensing.
Under certain conditions, it can recover the exact set of non-zero weights (see Compressive sensing: tomography
reconstruction with L1 prior (Lasso)).
Mathematically, it consists of a linear model trained with ℓ1 prior as regularizer. The objective function to minimize is:

min_𝑤 (1 / (2n_samples)) ||𝑋𝑤 − 𝑦||₂² + 𝛼||𝑤||₁

The lasso estimate thus solves the minimization of the least-squares penalty with 𝛼||𝑤||1 added, where 𝛼 is a constant
and ||𝑤||1 is the ℓ1 -norm of the parameter vector.
The implementation in the class Lasso uses coordinate descent as the algorithm to fit the coefficients. See Least
Angle Regression for another implementation:
>>> from sklearn import linear_model
>>> reg = linear_model.Lasso(alpha = 0.1)
>>> reg.fit([[0, 0], [1, 1]], [0, 1])
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)
>>> reg.predict([[1, 1]])
array([ 0.8])

Also useful for lower-level tasks is the function lasso_path that computes the coefficients along the full path of
possible values.
Examples:
• Lasso and Elastic Net for Sparse Signals
• Compressive sensing: tomography reconstruction with L1 prior (Lasso)

Note: Feature selection with Lasso
As the Lasso regression yields sparse models, it can thus be used to perform feature selection, as detailed in L1-based
feature selection.

Setting regularization parameter
The alpha parameter controls the degree of sparsity of the coefficients estimated.
Using cross-validation
scikit-learn exposes objects that set the Lasso alpha parameter by cross-validation: LassoCV and LassoLarsCV .
LassoLarsCV is based on the Least Angle Regression algorithm explained below.
For high-dimensional datasets with many collinear regressors, LassoCV is most often preferable. However,
LassoLarsCV has the advantage of exploring more relevant values of alpha parameter, and if the number of samples
is very small compared to the number of features, it is often faster than LassoCV .


Information-criteria based model selection
Alternatively, the estimator LassoLarsIC proposes to use the Akaike information criterion (AIC) and the Bayes
Information criterion (BIC). It is a computationally cheaper alternative to find the optimal value of alpha as the regularization path is computed only once instead of k+1 times when using k-fold cross-validation. However, such criteria
need a proper estimation of the degrees of freedom of the solution, are derived for large samples (asymptotic results)
and assume the model is correct, i.e. that the data are actually generated by this model. They also tend to break when
the problem is badly conditioned (more features than samples).

Examples:
• Lasso model selection: Cross-Validation / AIC / BIC

Comparison with the regularization parameter of SVM
The equivalence between alpha and the regularization parameter of SVM, C is given by alpha = 1 / C or
alpha = 1 / (n_samples * C), depending on the estimator and the exact objective function optimized by
the model.


Multi-task Lasso
The MultiTaskLasso is a linear model that estimates sparse coefficients for multiple regression problems jointly:
y is a 2D array, of shape (n_samples, n_tasks). The constraint is that the selected features are the same for all
the regression problems, also called tasks.
The following figure compares the location of the non-zeros in W obtained with a simple Lasso or a MultiTaskLasso.
The Lasso estimates yield scattered non-zeros while the non-zeros of the MultiTaskLasso are full columns.

Fitting a time-series model, imposing that any active feature be active at all times.
Examples:
• Joint feature selection with multi-task Lasso
Mathematically, it consists of a linear model trained with a mixed ℓ1 ℓ2 prior as regularizer. The objective function to
minimize is:

min_𝑊 (1 / (2n_samples)) ||𝑋𝑊 − 𝑌||_Fro² + 𝛼||𝑊||₂₁

where Fro indicates the Frobenius norm:

||𝐴||_Fro = sqrt( Σᵢⱼ 𝑎ᵢⱼ² )


and ℓ1 ℓ2 reads:
||𝐴||21 =

∑︁ √︃∑︁
𝑖

𝑎2𝑖𝑗

𝑗

The implementation in the class MultiTaskLasso uses coordinate descent as the algorithm to fit the coefficients.
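A minimal usage sketch (with a tiny, made-up two-task problem) looks as follows; coef_ has one row per task:
>>> from sklearn.linear_model import MultiTaskLasso
>>> X = [[0., 0.], [1., 1.], [2., 2.]]
>>> Y = [[0., 0.], [1., 1.], [2., 2.]]    # two tasks sharing the same relevant features
>>> clf = MultiTaskLasso(alpha=0.1).fit(X, Y)
>>> clf.coef_.shape    # (n_tasks, n_features)
(2, 2)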
Elastic Net
ElasticNet is a linear regression model trained with L1 and L2 prior as regularizer. This combination allows for
learning a sparse model where few of the weights are non-zero like Lasso, while still maintaining the regularization
properties of Ridge. We control the convex combination of L1 and L2 using the l1_ratio parameter.
Elastic-net is useful when there are multiple features which are correlated with one another. Lasso is likely to pick one
of these at random, while elastic-net is likely to pick both.
A practical advantage of trading-off between Lasso and Ridge is that it allows Elastic-Net to inherit some of Ridge’s
stability under rotation.
The objective function to minimize is in this case
$$\min_{w} \frac{1}{2 n_{\text{samples}}} ||X w - y||_2^2 + \alpha \rho ||w||_1 + \frac{\alpha (1 - \rho)}{2} ||w||_2^2$$

The class ElasticNetCV can be used to set the parameters alpha (𝛼) and l1_ratio (𝜌) by cross-validation.
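For illustration only (toy data; the grid of l1_ratio values is an arbitrary choice), the two classes can be used as follows:
>>> from sklearn.linear_model import ElasticNet, ElasticNetCV
>>> X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]    # made-up data
>>> y = [0., 1., 2., 3.]
>>> reg = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X, y)
>>> cv_reg = ElasticNetCV(l1_ratio=[.1, .5, .9], cv=3).fit(X, y)
>>> chosen = (cv_reg.alpha_, cv_reg.l1_ratio_)      # values selected by cross-validation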
Examples:
• Lasso and Elastic Net for Sparse Signals
• Lasso and Elastic Net

Multi-task Elastic Net
The MultiTaskElasticNet is an elastic-net model that estimates sparse coefficients for multiple regression problems jointly: Y is a 2D array, of shape (n_samples, n_tasks). The constraint is that the selected features are
the same for all the regression problems, also called tasks.

Mathematically, it consists of a linear model trained with a mixed ℓ1 ℓ2 prior and ℓ2 prior as regularizer. The objective
function to minimize is:
$$\min_{W} \frac{1}{2 n_{\text{samples}}} ||X W - Y||_{\text{Fro}}^2 + \alpha \rho ||W||_{21} + \frac{\alpha (1 - \rho)}{2} ||W||_{\text{Fro}}^2$$

The implementation in the class MultiTaskElasticNet uses coordinate descent as the algorithm to fit the coefficients.
The class MultiTaskElasticNetCV can be used to set the parameters alpha (𝛼) and l1_ratio (𝜌) by cross-validation.
Least Angle Regression
Least-angle regression (LARS) is a regression algorithm for high-dimensional data, developed by Bradley Efron,
Trevor Hastie, Iain Johnstone and Robert Tibshirani. LARS is similar to forward stepwise regression. At each step,
it finds the predictor most correlated with the response. When there are multiple predictors having equal correlation,
instead of continuing along the same predictor, it proceeds in a direction equiangular between the predictors.
The advantages of LARS are:
• It is numerically efficient in contexts where p >> n (i.e., when the number of dimensions is significantly greater
than the number of points)
• It is computationally just as fast as forward selection and has the same order of complexity as an ordinary least
squares.
• It produces a full piecewise linear solution path, which is useful in cross-validation or similar attempts to tune
the model.
• If two variables are almost equally correlated with the response, then their coefficients should increase at approximately the same rate. The algorithm thus behaves as intuition would expect, and also is more stable.
• It is easily modified to produce solutions for other estimators, like the Lasso.
The disadvantages of the LARS method include:
• Because LARS is based upon an iterative refitting of the residuals, it would appear to be especially sensitive to
the effects of noise. This problem is discussed in detail by Weisberg in the discussion section of the Efron et al.
(2004) Annals of Statistics article.
The LARS model can be used via the estimator Lars, or its low-level implementation lars_path.
LARS Lasso
LassoLars is a lasso model implemented using the LARS algorithm, and unlike the implementation based on
coordinate_descent, this yields the exact solution, which is piecewise linear as a function of the norm of its coefficients.
>>> from sklearn import linear_model
>>> reg = linear_model.LassoLars(alpha=.1)
>>> reg.fit([[0, 0], [1, 1]], [0, 1])
LassoLars(alpha=0.1, copy_X=True, eps=..., fit_intercept=True,
fit_path=True, max_iter=500, normalize=True, positive=False,
precompute='auto', verbose=False)
>>> reg.coef_
array([ 0.717157...,  0.        ])


Examples:
• Lasso path using LARS
The Lars algorithm provides the full path of the coefficients along the regularization parameter almost for free, thus a
common operation consists of retrieving the path with the function lars_path.
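As a small sketch (made-up data), lars_path returns the alphas at each breakpoint of the path, the indices of the active variables, and the coefficients along the path:
>>> import numpy as np
>>> from sklearn.linear_model import lars_path
>>> X = np.array([[0., 0.], [1., 1.], [2., 3.], [3., 2.]])   # toy data for illustration
>>> y = np.array([0., 1., 2.5, 2.])
>>> alphas, active, coefs = lars_path(X, y, method='lasso')
>>> coefs.shape[0]     # one row of coefficients per feature
2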
Mathematical formulation
The algorithm is similar to forward stepwise regression, but instead of including variables at each step, the estimated
parameters are increased in a direction equiangular to each one’s correlations with the residual.
Instead of giving a vector result, the LARS solution consists of a curve denoting the solution for each value of the
L1 norm of the parameter vector. The full coefficients path is stored in the array coef_path_, which has size
(n_features, max_features+1). The first column is always zero.
References:
• Original Algorithm is detailed in the paper Least Angle Regression by Hastie et al.

Orthogonal Matching Pursuit (OMP)
OrthogonalMatchingPursuit and orthogonal_mp implement the OMP algorithm for approximating the
fit of a linear model with constraints imposed on the number of non-zero coefficients (i.e. the ℓ0 pseudo-norm).
Being a forward feature selection method like Least Angle Regression, orthogonal matching pursuit can approximate
the optimum solution vector with a fixed number of non-zero elements:
$$\arg\min_{\gamma} ||y - X\gamma||_2^2 \text{ subject to } ||\gamma||_0 \leq n_{\text{nonzero\_coefs}}$$
Alternatively, orthogonal matching pursuit can target a specific error instead of a specific number of non-zero coefficients. This can be expressed as:
$$\arg\min_{\gamma} ||\gamma||_0 \text{ subject to } ||y - X\gamma||_2^2 \leq \text{tol}$$


OMP is based on a greedy algorithm that includes at each step the atom most highly correlated with the current
residual. It is similar to the simpler matching pursuit (MP) method, but better in that at each iteration, the residual is
recomputed using an orthogonal projection on the space of the previously chosen dictionary elements.
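The following is a minimal sketch on synthetic data (the sparse target vector and random dictionary are made up for illustration); with n_nonzero_coefs=2, at most two coefficients of the recovered vector are non-zero:
>>> import numpy as np
>>> from sklearn.linear_model import OrthogonalMatchingPursuit
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(30, 10)                 # random dictionary
>>> w = np.zeros(10)
>>> w[2], w[5] = 1., -1.                  # a 2-sparse target vector
>>> y = X.dot(w)                          # noiseless observations
>>> omp = OrthogonalMatchingPursuit(n_nonzero_coefs=2).fit(X, y)
>>> int(np.count_nonzero(omp.coef_)) <= 2
True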
Examples:
• Orthogonal Matching Pursuit

References:
• http://www.cs.technion.ac.il/~ronrubin/Publications/KSVD-OMP-v2.pdf
• Matching pursuits with time-frequency dictionaries, S. G. Mallat, Z. Zhang,

Bayesian Regression
Bayesian regression techniques can be used to include regularization parameters in the estimation procedure: the
regularization parameter is not set in a hard sense but tuned to the data at hand.
This can be done by introducing uninformative priors over the hyperparameters of the model. The $\ell_2$ regularization
used in Ridge Regression is equivalent to finding a maximum a posteriori estimation under a Gaussian prior over the
parameters $w$ with precision $\lambda$ (i.e. variance $\lambda^{-1}$). Instead of setting lambda manually, it is possible to treat it as a random variable to
be estimated from the data.
To obtain a fully probabilistic model, the output 𝑦 is assumed to be Gaussian distributed around 𝑋𝑤:
$$p(y|X, w, \alpha) = \mathcal{N}(y|Xw, \alpha)$$
Alpha is again treated as a random variable that is to be estimated from the data.
The advantages of Bayesian Regression are:
• It adapts to the data at hand.
• It can be used to include regularization parameters in the estimation procedure.
The disadvantages of Bayesian regression include:
• Inference of the model can be time consuming.
References
• A good introduction to Bayesian methods is given in C. Bishop: Pattern Recognition and Machine learning
• Original Algorithm is detailed in the book Bayesian learning for neural networks by Radford M. Neal

Bayesian Ridge Regression
BayesianRidge estimates a probabilistic model of the regression problem as described above. The prior for the
parameter 𝑤 is given by a spherical Gaussian:
$$p(w|\lambda) = \mathcal{N}(w|0, \lambda^{-1} \mathbf{I}_p)$$


The priors over 𝛼 and 𝜆 are chosen to be gamma distributions, the conjugate prior for the precision of the Gaussian.
The resulting model is called Bayesian Ridge Regression, and is similar to the classical Ridge. The parameters
𝑤, 𝛼 and 𝜆 are estimated jointly during the fit of the model. The remaining hyperparameters are the parameters of
the gamma priors over 𝛼 and 𝜆. These are usually chosen to be non-informative. The parameters are estimated by
maximizing the marginal log likelihood.
By default $\alpha_1 = \alpha_2 = \lambda_1 = \lambda_2 = 10^{-6}$.

Bayesian Ridge Regression is used for regression:
>>> from sklearn import linear_model
>>> X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
>>> Y = [0., 1., 2., 3.]
>>> reg = linear_model.BayesianRidge()
>>> reg.fit(X, Y)
BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True,
fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=300,
normalize=False, tol=0.001, verbose=False)

After being fitted, the model can then be used to predict new values:
>>> reg.predict([[1, 0.]])
array([ 0.50000013])

The weights $w$ of the model can be accessed:
>>> reg.coef_
array([ 0.49999993,  0.49999993])

Due to the Bayesian framework, the weights found are slightly different from those found by Ordinary Least Squares.
However, Bayesian Ridge Regression is more robust to ill-posed problems.
Examples:
• Bayesian Ridge Regression


References
• More details can be found in the article Bayesian Interpolation by MacKay, David J. C.

Automatic Relevance Determination - ARD
ARDRegression is very similar to Bayesian Ridge Regression, but can lead to sparser weights $w$ [1][2].
ARDRegression poses a different prior over 𝑤, by dropping the assumption of the Gaussian being spherical.
Instead, the distribution over 𝑤 is assumed to be an axis-parallel, elliptical Gaussian distribution.
This means each weight 𝑤𝑖 is drawn from a Gaussian distribution, centered on zero and with a precision 𝜆𝑖 :
$$p(w|\lambda) = \mathcal{N}(w|0, A^{-1})$$
with $\text{diag}(A) = \lambda = \{\lambda_1, ..., \lambda_p\}$.
In contrast to Bayesian Ridge Regression, each coordinate $w_i$ has its own precision $\lambda_i$. The prior over all
$\lambda_i$ is chosen to be the same gamma distribution given by the hyperparameters $\lambda_1$ and $\lambda_2$.

ARD is also known in the literature as Sparse Bayesian Learning and Relevance Vector Machine [3][4].
Examples:
• Automatic Relevance Determination Regression (ARD)

References:
[1] Christopher M. Bishop: Pattern Recognition and Machine Learning, Chapter 7.2.1
[2] David Wipf and Srikantan Nagarajan: A new view of automatic relevance determination
[3] Michael E. Tipping: Sparse Bayesian Learning and the Relevance Vector Machine
[4] Tristan Fletcher: Relevance Vector Machines explained


Logistic regression
Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is
also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier.
In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
The implementation of logistic regression in scikit-learn can be accessed from the class LogisticRegression. This
implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.
As an optimization problem, binary class L2 penalized logistic regression minimizes the following cost function:
$$\min_{w, c} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log(\exp(-y_i (X_i^T w + c)) + 1)$$

Similarly, L1 regularized logistic regression solves the following optimization problem:

$$\min_{w, c} ||w||_1 + C \sum_{i=1}^{n} \log(\exp(-y_i (X_i^T w + c)) + 1)$$

The solvers implemented in the class LogisticRegression are “liblinear”, “newton-cg”, “lbfgs”, “sag” and
“saga”:
The solver “liblinear” uses a coordinate descent (CD) algorithm, and relies on the excellent C++ LIBLINEAR library,
which is shipped with scikit-learn. However, the CD algorithm implemented in liblinear cannot learn a true multinomial (multiclass) model; instead, the optimization problem is decomposed in a “one-vs-rest” fashion so separate binary
classifiers are trained for all classes. This happens under the hood, so LogisticRegression instances using this
solver behave as multiclass classifiers. For L1 penalization, sklearn.svm.l1_min_c can be used to calculate the lower
bound for C in order to get a non “null” (all feature weights to zero) model.
The “lbfgs”, “sag” and “newton-cg” solvers only support L2 penalization and are found to converge faster for some
high-dimensional data. Setting multi_class to “multinomial” with these solvers learns a true multinomial logistic
regression model [5], which means that its probability estimates should be better calibrated than the default “one-vs-rest” setting.
The “sag” solver uses Stochastic Average Gradient descent [6]. It is faster than other solvers for large datasets, when
both the number of samples and the number of features are large.
The “saga” solver [7] is a variant of “sag” that also supports the non-smooth penalty="l1" option. This is therefore the
solver of choice for sparse multinomial logistic regression.
In a nutshell, one may choose the solver with the following rules:
Case                               Solver
L1 penalty                         “liblinear” or “saga”
Multinomial loss                   “lbfgs”, “sag”, “saga” or “newton-cg”
Very large dataset (n_samples)     “sag” or “saga”

The “saga” solver is often the best choice. The “liblinear” solver is used by default for historical reasons.
For large datasets, you may also consider using SGDClassifier with ‘log’ loss.
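As a rough sketch of these rules on the iris data (an arbitrary dataset choice; on real problems, scaling the features usually speeds up the convergence of “sag”/“saga”), a sparse multinomial model can be requested as follows:
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(solver='saga', multi_class='multinomial',
...                          penalty='l1', C=1.0, max_iter=10000)
>>> clf = clf.fit(X, y)
>>> clf.coef_.shape    # one row of coefficients per class
(3, 4)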
Examples:
• L1 Penalty and Sparsity in Logistic Regression
• Path with L1-Logistic Regression
• Plot multinomial and One-vs-Rest Logistic Regression
• Multiclass sparse logistic regression on 20newsgroups
• MNIST classification using multinomial logistic + L1

References:
[5] Christopher M. Bishop: Pattern Recognition and Machine Learning, Chapter 4.3.4
[6] Mark Schmidt, Nicolas Le Roux, and Francis Bach: Minimizing Finite Sums with the Stochastic Average Gradient.
[7] Aaron Defazio, Francis Bach, Simon Lacoste-Julien: SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives.

Differences from liblinear:
There might be a difference in the scores obtained between LogisticRegression with solver=liblinear
or LinearSVC and the external liblinear library directly, when fit_intercept=False and the fitted coef_
(or the data to be predicted) are zeroes. This is because for the sample(s) with decision_function zero,
LogisticRegression and LinearSVC predict the negative class, while liblinear predicts the positive class.
Note that a model with fit_intercept=False that has many samples with decision_function zero
is likely to be an underfit, bad model; you are advised to set fit_intercept=True and increase intercept_scaling.

Note: Feature selection with sparse logistic regression
A logistic regression with L1 penalty yields sparse models, and can thus be used to perform feature selection, as
detailed in L1-based feature selection.
LogisticRegressionCV implements Logistic Regression with builtin cross-validation to find out the optimal C
parameter. “newton-cg”, “sag”, “saga” and “lbfgs” solvers are found to be faster for high-dimensional dense data, due
to warm-starting. For the multiclass case, if multi_class option is set to “ovr”, an optimal C is obtained for each class
and if the multi_class option is set to “multinomial”, an optimal C is obtained by minimizing the cross-entropy loss.
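A minimal sketch on the iris data (an arbitrary dataset; solver and grid size are illustrative choices):
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegressionCV
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegressionCV(cv=5, solver='lbfgs', multi_class='multinomial',
...                            max_iter=1000)
>>> clf = clf.fit(X, y)
>>> best_C = clf.C_    # regularization strength(s) selected by cross-validation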
References:

Stochastic Gradient Descent - SGD
Stochastic gradient descent is a simple yet very efficient approach to fit linear models. It is particularly useful when the
number of samples (and the number of features) is very large. The partial_fit method allows online/out-of-core
learning.
The classes SGDClassifier and SGDRegressor provide functionality to fit linear models for classification and regression using different (convex) loss functions and different penalties. E.g., with loss="log",
SGDClassifier fits a logistic regression model, while with loss="hinge" it fits a linear support vector machine (SVM).
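As a minimal sketch (toy data; max_iter and tol are given explicitly because their defaults changed around this release):
>>> from sklearn.linear_model import SGDClassifier
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = SGDClassifier(loss="log", penalty="l2", max_iter=1000, tol=1e-3,
...                     random_state=0)
>>> clf = clf.fit(X, y)
>>> clf.predict([[2., 2.]])
array([1])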
References
• Stochastic Gradient Descent


Perceptron
The Perceptron is another simple algorithm suitable for large scale learning. By default:
• It does not require a learning rate.
• It is not regularized (penalized).
• It updates its model only on mistakes.
The last characteristic implies that the Perceptron is slightly faster to train than SGD with the hinge loss and that the
resulting models are sparser.
Passive Aggressive Algorithms
The passive-aggressive algorithms are a family of algorithms for large-scale learning. They are similar to the Perceptron in that they do not require a learning rate. However, contrary to the Perceptron, they include a regularization
parameter C.
For classification, PassiveAggressiveClassifier can be used with loss='hinge' (PA-I) or
loss='squared_hinge' (PA-II). For regression, PassiveAggressiveRegressor can be used with
loss='epsilon_insensitive' (PA-I) or loss='squared_epsilon_insensitive' (PA-II).
References:
• “Online Passive-Aggressive Algorithms” K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, Y. Singer - JMLR 7 (2006)

Robustness regression: outliers and modeling errors
Robust regression aims at fitting a regression model in the presence of corrupt data: either outliers, or errors in
the model.

Different scenario and useful concepts
There are different things to keep in mind when dealing with data corrupted by outliers:


• Outliers in X or in y?
  [Figures: outliers in the y direction vs. outliers in the X direction]
• Fraction of outliers versus amplitude of error
  The number of outlying points matters, but also how much they are outliers.
  [Figures: small outliers vs. large outliers]

An important notion of robust fitting is that of breakdown point: the fraction of data that can be outlying for the fit to
start missing the inlying data.
Note that in general, robust fitting in a high-dimensional setting (large n_features) is very hard. The robust models here
will probably not work in these settings.
Trade-offs: which estimator?
Scikit-learn provides 3 robust regression estimators: RANSAC, Theil Sen and HuberRegressor
• HuberRegressor should be faster than RANSAC and Theil Sen unless the number of samples is
very large, i.e. n_samples >> n_features. This is because RANSAC and Theil Sen fit on
smaller subsets of the data. However, both Theil Sen and RANSAC are unlikely to be as robust as
HuberRegressor for the default parameters.


• RANSAC is faster than Theil Sen and scales much better with the number of samples
• RANSAC will deal better with large outliers in the y direction (most common situation)
• Theil Sen will cope better with medium-size outliers in the X direction, but this property will
disappear in high-dimensional settings.
When in doubt, use RANSAC

RANSAC: RANdom SAmple Consensus
RANSAC (RANdom SAmple Consensus) fits a model from random subsets of inliers from the complete data set.
RANSAC is a non-deterministic algorithm producing only a reasonable result with a certain probability, which is dependent on the number of iterations (see max_trials parameter). It is typically used for linear and non-linear regression
problems and is especially popular in the fields of photogrammetric computer vision.
The algorithm splits the complete input sample data into a set of inliers, which may be subject to noise, and outliers,
which are e.g. caused by erroneous measurements or invalid hypotheses about the data. The resulting model is then
estimated only from the determined inliers.

Details of the algorithm
Each iteration performs the following steps:
1. Select min_samples random samples from the original data and check whether the set of data is valid (see
is_data_valid).
2. Fit a model to the random subset (base_estimator.fit) and check whether the estimated model is valid
(see is_model_valid).
3. Classify all data as inliers or outliers by calculating the residuals to the estimated model (base_estimator.
predict(X) - y) - all data samples with absolute residuals smaller than the residual_threshold are
considered as inliers.
4. Save fitted model as best model if number of inlier samples is maximal. In case the current estimated model has
the same number of inliers, it is only considered as the best model if it has better score.


These steps are performed either a maximum number of times (max_trials) or until one of the special stop criteria
are met (see stop_n_inliers and stop_score). The final model is estimated using all inlier samples (consensus
set) of the previously determined best model.
The is_data_valid and is_model_valid functions allow identifying and rejecting degenerate combinations of
random sub-samples. If the estimated model is not needed for identifying degenerate cases, is_data_valid should
be used as it is called prior to fitting the model, thus leading to better computational performance.
Examples:
• Robust linear model estimation using RANSAC
• Robust linear estimator fitting

References:
• https://en.wikipedia.org/wiki/RANSAC
• “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography” Martin A. Fischler and Robert C. Bolles - SRI International (1981)
• “Performance Evaluation of RANSAC Family” Sunglok Choi, Taemin Kim and Wonpil Yu - BMVC (2009)

Theil-Sen estimator: generalized-median-based estimator
The TheilSenRegressor estimator uses a generalization of the median in multiple dimensions. It is thus robust
to multivariate outliers. Note however that the robustness of the estimator decreases quickly with the dimensionality of
the problem. It loses its robustness properties and becomes no better than ordinary least squares in high dimension.
Examples:
• Theil-Sen Regression
• Robust linear estimator fitting

References:
• https://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator

Theoretical considerations
TheilSenRegressor is comparable to the Ordinary Least Squares (OLS) in terms of asymptotic efficiency and as
an unbiased estimator. In contrast to OLS, Theil-Sen is a non-parametric method which means it makes no assumption
about the underlying distribution of the data. Since Theil-Sen is a median-based estimator, it is more robust against
corrupted data, aka outliers. In a univariate setting, Theil-Sen has a breakdown point of about 29.3% in the case of a simple
linear regression, which means that it can tolerate up to 29.3% of arbitrarily corrupted data.


The implementation of TheilSenRegressor in scikit-learn follows a generalization to a multivariate linear regression model [8] using the spatial median, which is a generalization of the median to multiple dimensions [9].
In terms of time and space complexity, Theil-Sen scales according to
$$\binom{n_{\text{samples}}}{n_{\text{subsamples}}}$$
which makes it infeasible to be applied exhaustively to problems with a large number of samples and features. Therefore, the magnitude of a subpopulation can be chosen to limit the time and space complexity by considering only a
random subset of all possible combinations.
Examples:
• Theil-Sen Regression

References:
[8] Xin Dang, Hanxiang Peng, Xueqin Wang and Heping Zhang: Theil-Sen Estimators in a Multiple Linear Regression Model.
[9] T. Kärkkäinen and S. Äyrämö: On Computation of Spatial Median for Robust Data Mining.

Huber Regression
The HuberRegressor is different from Ridge because it applies a linear loss to samples that are classified as outliers.
A sample is classified as an inlier if the absolute error of that sample is less than a certain threshold. It differs from
TheilSenRegressor and RANSACRegressor because it does not ignore the effect of the outliers but gives them a
lesser weight.
The loss function that HuberRegressor minimizes is given by
$$\min_{w, \sigma} \sum_{i=1}^{n} \left( \sigma + H_m\left(\frac{X_i w - y_i}{\sigma}\right) \sigma \right) + \alpha ||w||_2^2$$

where

$$H_m(z) = \begin{cases} z^2, & \text{if } |z| < \epsilon \\ 2\epsilon |z| - \epsilon^2, & \text{otherwise} \end{cases}$$

It is advised to set the parameter epsilon to 1.35 to achieve 95% statistical efficiency.
Notes
The HuberRegressor differs from using SGDRegressor with loss set to huber in the following ways.
• HuberRegressor is scaling invariant. Once epsilon is set, scaling X and y down or up by different values
would produce the same robustness to outliers as before, whereas with SGDRegressor epsilon has to be set
again when X and y are scaled.
• HuberRegressor should be more efficient to use on data with a small number of samples, while
SGDRegressor needs a number of passes on the training data to produce the same robustness.
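As a small sketch of HuberRegressor in use (toy 1-D data with a single gross outlier; epsilon left at the recommended 1.35):
>>> import numpy as np
>>> from sklearn.linear_model import HuberRegressor
>>> X = np.arange(10, dtype=float).reshape(-1, 1)
>>> y = 3. * X.ravel()
>>> y[-1] += 100.                      # one large outlier
>>> huber = HuberRegressor(epsilon=1.35).fit(X, y)
>>> coef = huber.coef_                 # the outlier only enters through the linear part of the loss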
Examples:
• HuberRegressor vs Ridge on dataset with strong outliers

References:
• Peter J. Huber, Elvezio M. Ronchetti: Robust Statistics, Concomitant scale estimates, pg 172
Also, this estimator is different from the R implementation of Robust Regression (http://www.ats.ucla.edu/stat/r/dae/
rreg.htm) because the R implementation does a weighted least squares fit, with weights given to each
sample on the basis of how much the residual exceeds a certain threshold.
Polynomial regression: extending linear models with basis functions
One common pattern within machine learning is to use linear models trained on nonlinear functions of the data. This
approach maintains the generally fast performance of linear methods, while allowing them to fit a much wider range
of data.

For example, a simple linear regression can be extended by constructing polynomial features from the coefficients.
In the standard linear regression case, you might have a model that looks like this for two-dimensional data:
$$\hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2$$
If we want to fit a paraboloid to the data instead of a plane, we can combine the features in second-order polynomials,
so that the model looks like this:
$$\hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2$$
The (sometimes surprising) observation is that this is still a linear model: to see this, imagine creating a new variable
$$z = [x_1, x_2, x_1 x_2, x_1^2, x_2^2]$$
With this re-labeling of the data, our problem can be written
$$\hat{y}(w, x) = w_0 + w_1 z_1 + w_2 z_2 + w_3 z_3 + w_4 z_4 + w_5 z_5$$
We see that the resulting polynomial regression is in the same class of linear models we’d considered above (i.e. the
model is linear in 𝑤) and can be solved by the same techniques. By considering linear fits within a higher-dimensional
space built with these basis functions, the model has the flexibility to fit a much broader range of data.
Here is an example of applying this idea to one-dimensional data, using polynomial features of varying degrees:

This figure is created using the PolynomialFeatures preprocessor. This preprocessor transforms an input data
matrix into a new data matrix of a given degree. It can be used as follows:
>>> from sklearn.preprocessing import PolynomialFeatures
>>> import numpy as np
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(degree=2)
>>> poly.fit_transform(X)
array([[  1.,   0.,   1.,   0.,   0.,   1.],
       [  1.,   2.,   3.,   4.,   6.,   9.],
       [  1.,   4.,   5.,  16.,  20.,  25.]])


The features of X have been transformed from $[x_1, x_2]$ to $[1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]$, and can now be used within any
linear model.
This sort of preprocessing can be streamlined with the Pipeline tools. A single object representing a simple polynomial
regression can be created and used as follows:
>>> from sklearn.preprocessing import PolynomialFeatures
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.pipeline import Pipeline
>>> import numpy as np
>>> model = Pipeline([('poly', PolynomialFeatures(degree=3)),
...                   ('linear', LinearRegression(fit_intercept=False))])
>>> # fit to an order-3 polynomial data
>>> x = np.arange(5)
>>> y = 3 - 2 * x + x ** 2 - x ** 3
>>> model = model.fit(x[:, np.newaxis], y)
>>> model.named_steps['linear'].coef_
array([ 3., -2., 1., -1.])

The linear model trained on polynomial features is able to exactly recover the input polynomial coefficients.
In some cases it’s not necessary to include higher powers of any single feature, but only the so-called interaction
features that multiply together at most $d$ distinct features. These can be obtained from PolynomialFeatures with
the setting interaction_only=True.
For example, when dealing with boolean features, $x_i^n = x_i$ for all $n$ and is therefore useless; but $x_i x_j$ represents the
conjunction of two booleans. This way, we can solve the XOR problem with a linear classifier:
>>> from sklearn.linear_model import Perceptron
>>> from sklearn.preprocessing import PolynomialFeatures
>>> import numpy as np
>>> X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
>>> y = X[:, 0] ^ X[:, 1]
>>> y
array([0, 1, 1, 0])
>>> X = PolynomialFeatures(interaction_only=True).fit_transform(X).astype(int)
>>> X
array([[1, 0, 0, 0],
[1, 0, 1, 0],
[1, 1, 0, 0],
[1, 1, 1, 1]])
>>> clf = Perceptron(fit_intercept=False, max_iter=10, tol=None,
...                  shuffle=False).fit(X, y)

And the classifier “predictions” are perfect:
>>> clf.predict(X)
array([0, 1, 1, 0])
>>> clf.score(X, y)
1.0

3.1.2 Linear and Quadratic Discriminant Analysis
Linear Discriminant Analysis (discriminant_analysis.LinearDiscriminantAnalysis) and
Quadratic Discriminant Analysis (discriminant_analysis.QuadraticDiscriminantAnalysis)
are two classic classifiers, with, as their names suggest, a linear and a quadratic decision surface, respectively.


These classifiers are attractive because they have closed-form solutions that can be easily computed, are inherently
multiclass, have proven to work well in practice and have no hyperparameters to tune.

The plot shows decision boundaries for Linear Discriminant Analysis and Quadratic Discriminant Analysis. The
bottom row demonstrates that Linear Discriminant Analysis can only learn linear boundaries, while Quadratic Discriminant Analysis can learn quadratic boundaries and is therefore more flexible.
Examples:
Linear and Quadratic Discriminant Analysis with covariance ellipsoid: Comparison of LDA and QDA on synthetic
data.

Dimensionality reduction using Linear Discriminant Analysis
discriminant_analysis.LinearDiscriminantAnalysis can be used to perform supervised dimensionality reduction, by projecting the input data to a linear subspace consisting of the directions which maximize
the separation between classes (in a precise sense discussed in the mathematics section below). The dimension of the
output is necessarily less than the number of classes, so this is in general a rather strong dimensionality reduction,
and only makes sense in a multiclass setting.
This is implemented in discriminant_analysis.LinearDiscriminantAnalysis.transform. The
desired dimensionality can be set using the n_components constructor parameter. This parameter has no influence
on discriminant_analysis.LinearDiscriminantAnalysis.fit or discriminant_analysis.
LinearDiscriminantAnalysis.predict.
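A minimal sketch on the iris data (an arbitrary example dataset): with three classes the projection has at most two components.
>>> from sklearn.datasets import load_iris
>>> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
>>> X, y = load_iris(return_X_y=True)
>>> lda = LinearDiscriminantAnalysis(n_components=2)
>>> X_r = lda.fit(X, y).transform(X)
>>> X_r.shape    # at most n_classes - 1 components
(150, 2)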
Examples:


Comparison of LDA and PCA 2D projection of Iris dataset: Comparison of LDA and PCA for dimensionality
reduction of the Iris dataset

Mathematical formulation of the LDA and QDA classifiers
Both LDA and QDA can be derived from simple probabilistic models which model the class conditional distribution
of the data 𝑃 (𝑋|𝑦 = 𝑘) for each class 𝑘. Predictions can then be obtained by using Bayes’ rule:
$$P(y = k | X) = \frac{P(X | y = k) P(y = k)}{P(X)} = \frac{P(X | y = k) P(y = k)}{\sum_{l} P(X | y = l) \cdot P(y = l)}$$
and we select the class 𝑘 which maximizes this conditional probability.
More specifically, for linear and quadratic discriminant analysis, 𝑃 (𝑋|𝑦) is modelled as a multivariate Gaussian
distribution with density:
$$p(X | y = k) = \frac{1}{(2\pi)^{n} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (X - \mu_k)^t \Sigma_k^{-1} (X - \mu_k) \right)$$
To use this model as a classifier, we just need to estimate from the training data the class priors 𝑃 (𝑦 = 𝑘) (by the
proportion of instances of class 𝑘), the class means 𝜇𝑘 (by the empirical sample class means) and the covariance
matrices (either by the empirical sample class covariance matrices, or by a regularized estimator: see the section on
shrinkage below).
In the case of LDA, the Gaussians for each class are assumed to share the same covariance matrix: $\Sigma_k = \Sigma$ for all
$k$. This leads to linear decision surfaces between classes, as can be seen by comparing the log-probability ratios
$\log[P(y = k | X) / P(y = l | X)]$:

$$\log\left(\frac{P(y = k | X)}{P(y = l | X)}\right) = 0 \Leftrightarrow (\mu_k - \mu_l) \Sigma^{-1} X = \frac{1}{2} (\mu_k^t \Sigma^{-1} \mu_k - \mu_l^t \Sigma^{-1} \mu_l)$$
In the case of QDA, there are no assumptions on the covariance matrices $\Sigma_k$ of the Gaussians, leading to quadratic
decision surfaces. See [3] for more details.
Note: Relation with Gaussian Naive Bayes
If in the QDA model one assumes that the covariance matrices are diagonal, then the inputs are assumed to be conditionally independent in each class, and the resulting classifier is equivalent to the Gaussian Naive Bayes classifier
naive_bayes.GaussianNB.

Mathematical formulation of LDA dimensionality reduction
To understand the use of LDA in dimensionality reduction, it is useful to start with a geometric reformulation of the
LDA classification rule explained above. We write 𝐾 for the total number of target classes. Since in LDA we assume
that all classes have the same estimated covariance Σ, we can rescale the data so that this covariance is the identity:
$$X^* = D^{-1/2} U^t X \quad \text{with} \quad \Sigma = U D U^t$$
Then one can show that to classify a data point after scaling is equivalent to finding the estimated class mean 𝜇*𝑘 which
is closest to the data point in the Euclidean distance. But this can be done just as well after projecting on the 𝐾 − 1
affine subspace 𝐻𝐾 generated by all the 𝜇*𝑘 for all classes. This shows that, implicit in the LDA classifier, there is a
dimensionality reduction by linear projection onto a 𝐾 − 1 dimensional space.
[3] “The Elements of Statistical Learning”, Hastie T., Tibshirani R., Friedman J., Section 4.3, p. 106-119, 2008.


We can reduce the dimension even more, to a chosen $L$, by projecting onto the linear subspace $H_L$ which maximizes the variance of the $\mu^*_k$ after projection (in effect, we are doing a form of PCA for the transformed class
means $\mu^*_k$). This $L$ corresponds to the n_components parameter used in the discriminant_analysis.
LinearDiscriminantAnalysis.transform method. See [3] for more details.
Shrinkage
Shrinkage is a tool to improve estimation of covariance matrices in situations where the number of training samples is small compared to the number of features. In this scenario, the empirical sample covariance is a poor estimator. Shrinkage LDA can be used by setting the shrinkage parameter of the discriminant_analysis.
LinearDiscriminantAnalysis class to ‘auto’. This automatically determines the optimal shrinkage parameter
in an analytic way following the lemma introduced by Ledoit and Wolf [4]. Note that currently shrinkage only works
when setting the solver parameter to ‘lsqr’ or ‘eigen’.
The shrinkage parameter can also be manually set between 0 and 1. In particular, a value of 0 corresponds to
no shrinkage (which means the empirical covariance matrix will be used) and a value of 1 corresponds to complete
shrinkage (which means that the diagonal matrix of variances will be used as an estimate for the covariance matrix).
Setting this parameter to a value between these two extrema will estimate a shrunk version of the covariance matrix.
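A small sketch (toy data) showing how the shrinkage parameter is passed; note the solver must be ‘lsqr’ or ‘eigen’:
>>> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
>>> X = [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]
>>> y = [0, 0, 0, 1, 1, 1]
>>> clf = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto').fit(X, y)
>>> clf.predict([[-0.8, -1]])
array([0])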

Estimation algorithms
The default solver is ‘svd’. It can perform both classification and transform, and it does not rely on the calculation
of the covariance matrix. This can be an advantage in situations where the number of features is large. However, the
‘svd’ solver cannot be used with shrinkage.
The ‘lsqr’ solver is an efficient algorithm that only works for classification. It supports shrinkage.
The ‘eigen’ solver is based on the optimization of the between class scatter to within class scatter ratio. It can be used
for both classification and transform, and it supports shrinkage. However, the ‘eigen’ solver needs to compute the
covariance matrix, so it might not be suitable for situations with a high number of features.
[4] Ledoit O, Wolf M: Honey, I Shrunk the Sample Covariance Matrix. The Journal of Portfolio Management 30(4), 110-119, 2004.


Examples:
Normal and Shrinkage Linear Discriminant Analysis for classification: Comparison of LDA classifiers with and
without shrinkage.

References:

3.1.3 Kernel ridge regression
Kernel ridge regression (KRR) [M2012] combines Ridge Regression (linear least squares with l2-norm regularization)
with the kernel trick. It thus learns a linear function in the space induced by the respective kernel and the data. For
non-linear kernels, this corresponds to a non-linear function in the original space.
The form of the model learned by KernelRidge is identical to support vector regression (SVR). However, different
loss functions are used: KRR uses squared error loss while support vector regression uses 𝜖-insensitive loss, both
combined with l2 regularization. In contrast to SVR, fitting KernelRidge can be done in closed-form and is typically
faster for medium-sized datasets. On the other hand, the learned model is non-sparse and thus slower than SVR, which
learns a sparse model for 𝜖 > 0, at prediction-time.
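As a minimal sketch (synthetic sinusoidal data with arbitrary hyperparameters, for illustration only):
>>> import numpy as np
>>> from sklearn.kernel_ridge import KernelRidge
>>> rng = np.random.RandomState(0)
>>> X = 5 * rng.rand(40, 1)
>>> y = np.sin(X).ravel() + 0.1 * rng.randn(40)
>>> krr = KernelRidge(kernel='rbf', alpha=1.0, gamma=0.5).fit(X, y)
>>> y_pred = krr.predict(X)    # a non-linear fit in the original space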
The following figure compares KernelRidge and SVR on an artificial dataset, which consists of a sinusoidal target
function and strong noise added to every fifth datapoint. The learned model of KernelRidge and SVR is plotted,
where both complexity/regularization and bandwidth of the RBF kernel have been optimized using grid-search. The
learned functions are very similar; however, fitting KernelRidge is approx. seven times faster than fitting SVR
(both with grid-search). However, prediction of 100000 target values is more than three times faster with SVR since it
has learned a sparse model using only approx. 1/3 of the 100 training datapoints as support vectors.
The next figure compares the time for fitting and prediction of KernelRidge and SVR for different sizes of the
training set. Fitting KernelRidge is faster than SVR for medium-sized training sets (less than 1000 samples);
however, for larger training sets SVR scales better. With regard to prediction time, SVR is faster than KernelRidge
for all sizes of the training set because of the learned sparse solution. Note that the degree of sparsity and thus the
prediction time depends on the parameters 𝜖 and 𝐶 of the SVR; 𝜖 = 0 would correspond to a dense model.
References:

3.1.4 Support Vector Machines
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and
outliers detection.
The advantages of support vector machines are:
• Effective in high dimensional spaces.
• Still effective in cases where number of dimensions is greater than the number of samples.
• Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
• Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided,
but it is also possible to specify custom kernels.
The disadvantages of support vector machines include:


• If the number of features is much greater than the number of samples, avoiding over-fitting when choosing Kernel
functions and the regularization term is crucial.
• SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).
The support vector machines in scikit-learn support both dense (numpy.ndarray and convertible to that by numpy.
asarray) and sparse (any scipy.sparse) sample vectors as input. However, to use an SVM to make predictions
for sparse data, it must have been fit on such data. For optimal performance, use C-ordered numpy.ndarray (dense)
or scipy.sparse.csr_matrix (sparse) with dtype=float64.
Classification
SVC, NuSVC and LinearSVC are classes capable of performing multi-class classification on a dataset.

SVC and NuSVC are similar methods, but accept slightly different sets of parameters and have different mathematical
formulations (see section Mathematical formulation). On the other hand, LinearSVC is another implementation
of Support Vector Classification for the case of a linear kernel. Note that LinearSVC does not accept keyword
kernel, as this is assumed to be linear. It also lacks some of the members of SVC and NuSVC, like support_.
As other classifiers, SVC, NuSVC and LinearSVC take as input two arrays: an array X of size [n_samples,
n_features] holding the training samples, and an array y of class labels (strings or integers), size [n_samples]:
>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]

3.1. Supervised learning

179

scikit-learn user guide, Release 0.19.1

>>> y = [0, 1]
>>> clf = svm.SVC()
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

After being fitted, the model can then be used to predict new values:
>>> clf.predict([[2., 2.]])
array([1])

SVMs decision function depends on some subset of the training data, called the support vectors. Some properties of
these support vectors can be found in members support_vectors_, support_ and n_support:
>>> # get support vectors
>>> clf.support_vectors_
array([[ 0.,  0.],
       [ 1.,  1.]])
>>> # get indices of support vectors
>>> clf.support_
array([0, 1]...)
>>> # get number of support vectors for each class
>>> clf.n_support_
array([1, 1]...)

Multi-class classification
SVC and NuSVC implement the “one-against-one” approach (Knerr et al., 1990) for multi- class classification. If n_class is the number of classes, then n_class * (n_class - 1) / 2 classifiers are constructed and each one trains data from two classes. To provide a consistent interface with other classifiers, the
decision_function_shape option allows to aggregate the results of the “one-against-one” classifiers to a decision function of shape (n_samples, n_classes):
>>> X = [[0], [1], [2], [3]]
>>> Y = [0, 1, 2, 3]
>>> clf = svm.SVC(decision_function_shape='ovo')
>>> clf.fit(X, Y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovo', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes: 4*3/2 = 6
6
>>> clf.decision_function_shape = "ovr"
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes
4

On the other hand, LinearSVC implements “one-vs-the-rest” multi-class strategy, thus training n_class models. If
there are only two classes, only one model is trained:


>>> lin_clf = svm.LinearSVC()
>>> lin_clf.fit(X, Y)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='squared_hinge', max_iter=1000,
multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
verbose=0)
>>> dec = lin_clf.decision_function([[1]])
>>> dec.shape[1]
4

See Mathematical formulation for a complete description of the decision function.
Note that the LinearSVC also implements an alternative multi-class strategy, the so-called multi-class SVM formulated by Crammer and Singer, by using the option multi_class='crammer_singer'. This method is consistent, which is not true for one-vs-rest classification. In practice, one-vs-rest classification is usually preferred, since
the results are mostly similar, but the runtime is significantly less.
For “one-vs-rest” LinearSVC the attributes coef_ and intercept_ have the shape [n_class,
n_features] and [n_class] respectively. Each row of the coefficients corresponds to one of the n_class
many “one-vs-rest” classifiers and similar for the intercepts, in the order of the “one” class.
In the case of “one-vs-one” SVC, the layout of the attributes is a little more involved. In the case of a linear
kernel, the layout of coef_ and intercept_ is similar to the one described for LinearSVC above,
except that the shape of coef_ is [n_class * (n_class - 1) / 2, n_features], corresponding to as
many binary classifiers. The order for classes 0 to n is “0 vs 1”, “0 vs 2”, . . . “0 vs n”, “1 vs 2”, “1 vs 3”, . . . “1 vs n”, . . . “n-1 vs n”.
The shape of dual_coef_ is [n_class-1, n_SV] with a somewhat hard to grasp layout. The columns correspond to the support vectors involved in any of the n_class * (n_class - 1) / 2 “one-vs-one” classifiers.
Each of the support vectors is used in n_class - 1 classifiers. The n_class - 1 entries in each row correspond
to the dual coefficients for these classifiers.
This might be made more clear by an example:
Consider a three class problem with class 0 having three support vectors $v^0_0, v^1_0, v^2_0$ and classes 1 and 2 having two
support vectors $v^0_1, v^1_1$ and $v^0_2, v^1_2$ respectively. For each support vector $v^j_i$, there are two dual coefficients. Let's call
$\alpha^j_{i,k}$ the coefficient of support vector $v^j_i$ in the classifier between classes $i$ and $k$. Then dual_coef_ looks like this:

$\alpha^0_{0,1}$  $\alpha^0_{0,2}$    Coefficients for SVs of class 0
$\alpha^1_{0,1}$  $\alpha^1_{0,2}$
$\alpha^2_{0,1}$  $\alpha^2_{0,2}$
$\alpha^0_{1,0}$  $\alpha^0_{1,2}$    Coefficients for SVs of class 1
$\alpha^1_{1,0}$  $\alpha^1_{1,2}$
$\alpha^0_{2,0}$  $\alpha^0_{2,1}$    Coefficients for SVs of class 2
$\alpha^1_{2,0}$  $\alpha^1_{2,1}$

Scores and probabilities
The SVC method decision_function gives per-class scores for each sample (or a single score per sample in the
binary case). When the constructor option probability is set to True, class membership probability estimates
(from the methods predict_proba and predict_log_proba) are enabled. In the binary case, the probabilities
are calibrated using Platt scaling: logistic regression on the SVM’s scores, fit by an additional cross-validation on the
training data. In the multiclass case, this is extended as per Wu et al. (2004).
Needless to say, the cross-validation involved in Platt scaling is an expensive operation for large datasets. In addition,
the probability estimates may be inconsistent with the scores, in the sense that the “argmax” of the scores may not be
3.1. Supervised learning

181

scikit-learn user guide, Release 0.19.1

the argmax of the probabilities. (E.g., in binary classification, a sample may be labeled by predict as belonging
to a class that has probability <½ according to predict_proba.) Platt’s method is also known to have theoretical issues. If confidence scores are required, but these do not have to be probabilities, then it is advisable to set
probability=False and use decision_function instead of predict_proba.
References:
• Wu, Lin and Weng, “Probability estimates for multi-class classification by pairwise coupling”, JMLR 5:975-1005, 2004.
• Platt, “Probabilistic outputs for SVMs and comparisons to regularized likelihood methods”.

Unbalanced problems
In problems where it is desired to give more importance to certain classes or certain individual samples, the keywords
class_weight and sample_weight can be used.
SVC (but not NuSVC) implements the keyword class_weight (set in the constructor). It’s a dictionary of the form
{class_label : value}, where value is a floating point number > 0 that sets the parameter C of class
class_label to C * value.

SVC, NuSVC, SVR, NuSVR and OneClassSVM implement also weights for individual samples in method fit
through keyword sample_weight. Similar to class_weight, these set the parameter C for the i-th example to
C * sample_weight[i].
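A brief sketch of both keywords (toy data; the weight values are arbitrary illustrative choices):
>>> from sklearn import svm
>>> X = [[0, 0], [1, 1], [1, 0], [0, 1]]
>>> y = [0, 0, 1, 1]
>>> wclf = svm.SVC(class_weight={1: 10})          # C for class 1 becomes C * 10
>>> wclf = wclf.fit(X, y)
>>> swclf = svm.SVC()
>>> swclf = swclf.fit(X, y, sample_weight=[1., 1., 10., 1.])   # up-weight the third sample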
Examples:


• Plot different SVM classifiers in the iris dataset,
• SVM: Maximum margin separating hyperplane,
• SVM: Separating hyperplane for unbalanced classes
• SVM-Anova: SVM with univariate feature selection,
• Non-linear SVM
• SVM: Weighted samples,

Regression
The method of Support Vector Classification can be extended to solve regression problems. This method is called
Support Vector Regression.
The model produced by support vector classification (as described above) depends only on a subset of the training
data, because the cost function for building the model does not care about training points that lie beyond the margin.
Analogously, the model produced by Support Vector Regression depends only on a subset of the training data, because
the cost function for building the model ignores any training data close to the model prediction.
There are three different implementations of Support Vector Regression: SVR, NuSVR and LinearSVR.
LinearSVR provides a faster implementation than SVR but only considers linear kernels, while NuSVR implements
a slightly different formulation than SVR and LinearSVR. See Implementation details for further details.
As with classification classes, the fit method will take as argument vectors X, y, only that in this case y is expected to
have floating point values instead of integer values:
>>> from sklearn import svm
>>> X = [[0, 0], [2, 2]]
>>> y = [0.5, 2.5]
>>> clf = svm.SVR()
>>> clf.fit(X, y)
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
>>> clf.predict([[1, 1]])
array([ 1.5])


Examples:
• Support Vector Regression (SVR) using linear and non-linear kernels

Density estimation, novelty detection
One-class SVM is used for novelty detection, that is, given a set of samples, it will detect the soft boundary of that set
so as to classify new points as belonging to that set or not. The class that implements this is called OneClassSVM .
In this case, as it is a type of unsupervised learning, the fit method will only take as input an array X, as there are no
class labels.
See, section Novelty and Outlier Detection for more details on this usage.
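A minimal sketch (training points drawn from a made-up “normal” regime; nu and gamma are arbitrary illustrative values):
>>> import numpy as np
>>> from sklearn import svm
>>> rng = np.random.RandomState(0)
>>> X_train = 0.3 * rng.randn(100, 2)             # samples from the "normal" regime
>>> clf = svm.OneClassSVM(nu=0.1, kernel='rbf', gamma=0.1).fit(X_train)
>>> pred = clf.predict([[0., 0.], [4., 4.]])      # +1 for points inside the boundary, -1 outside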

Examples:
• One-class SVM with non-linear kernel (RBF)
• Species distribution modeling

Complexity
Support Vector Machines are powerful tools, but their compute and storage requirements increase rapidly with the
number of training vectors. The core of an SVM is a quadratic programming problem (QP), separating support
vectors from the rest of the training data. The QP solver used by this libsvm-based implementation scales between
$O(n_{\text{features}} \times n_{\text{samples}}^2)$ and $O(n_{\text{features}} \times n_{\text{samples}}^3)$ depending on how efficiently the libsvm cache is used in


practice (dataset dependent). If the data is very sparse $n_{\text{features}}$ should be replaced by the average number of non-zero features in a sample vector.
Also note that for the linear case, the algorithm used in LinearSVC by the liblinear implementation is much more
efficient than its libsvm-based SVC counterpart and can scale almost linearly to millions of samples and/or features.
Tips on Practical Use
• Avoiding data copy: For SVC, SVR, NuSVC and NuSVR, if the data passed to certain methods is not C-ordered
contiguous, and double precision, it will be copied before calling the underlying C implementation. You can
check whether a given numpy array is C-contiguous by inspecting its flags attribute.
For LinearSVC (and LogisticRegression) any input passed as a numpy array will be copied and converted to the liblinear internal sparse data representation (double precision floats and int32 indices of non-zero
components). If you want to fit a large-scale linear classifier without copying a dense numpy C-contiguous
double precision array as input we suggest to use the SGDClassifier class instead. The objective function
can be configured to be almost the same as the LinearSVC model.
• Kernel cache size: For SVC, SVR, nuSVC and NuSVR, the size of the kernel cache has a strong impact on run
times for larger problems. If you have enough RAM available, it is recommended to set cache_size to a
higher value than the default of 200(MB), such as 500(MB) or 1000(MB).
• Setting C: C is 1 by default and it’s a reasonable default choice. If you have a lot of noisy observations you
should decrease it: decreasing C corresponds to more regularization.
• Support Vector Machine algorithms are not scale invariant, so it is highly recommended to scale your data.
For example, scale each attribute on the input vector X to [0,1] or [-1,+1], or standardize it to have mean 0
and variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results. See
section Preprocessing data for more details on scaling and normalization.
• The parameter nu in NuSVC/OneClassSVM/NuSVR approximates the fraction of training errors and support vectors.
• In SVC, if data for classification are unbalanced (e.g.
many positive and few negative), set
class_weight='balanced' and/or try different penalty parameters C.
• The underlying LinearSVC implementation uses a random number generator to select features when fitting
the model. It is thus not uncommon to have slightly different results for the same input data. If that happens,
try with a smaller tol parameter.
• Using L1 penalization as provided by LinearSVC(loss='l2', penalty='l1', dual=False)
yields a sparse solution, i.e. only a subset of feature weights is different from zero and contributes to the decision function. Increasing C yields a more complex model (more features are selected). The C value that yields
a “null” model (all weights equal to zero) can be calculated using l1_min_c.
Kernel functions
The kernel function can be any of the following:
• linear: $\langle x, x' \rangle$.
• polynomial: $(\gamma \langle x, x' \rangle + r)^d$. $d$ is specified by keyword degree, $r$ by coef0.
• rbf: $\exp(-\gamma \|x - x'\|^2)$. $\gamma$ is specified by keyword gamma, must be greater than 0.
• sigmoid: $\tanh(\gamma \langle x, x' \rangle + r)$, where $r$ is specified by coef0.
Different kernels are specified by keyword kernel at initialization:


>>> linear_svc = svm.SVC(kernel='linear')
>>> linear_svc.kernel
'linear'
>>> rbf_svc = svm.SVC(kernel='rbf')
>>> rbf_svc.kernel
'rbf'

Custom Kernels
You can define your own kernels by either giving the kernel as a python function or by precomputing the Gram matrix.
Classifiers with custom kernels behave the same way as any other classifiers, except that:
• Field support_vectors_ is now empty, only indices of support vectors are stored in support_
• A reference (and not a copy) of the first argument in the fit() method is stored for future reference. If that
array changes between the use of fit() and predict() you will have unexpected results.
Using Python functions as kernels
You can also use your own defined kernels by passing a function to the keyword kernel in the constructor.
Your kernel must take as arguments two matrices of shape (n_samples_1, n_features), (n_samples_2,
n_features) and return a kernel matrix of shape (n_samples_1, n_samples_2).
The following code defines a linear kernel and creates a classifier instance that will use that kernel:
>>> import numpy as np
>>> from sklearn import svm
>>> def my_kernel(X, Y):
...     return np.dot(X, Y.T)
...
>>> clf = svm.SVC(kernel=my_kernel)

Examples:
• SVM with custom kernel.

Using the Gram matrix
Set kernel='precomputed' and pass the Gram matrix instead of X in the fit method. At the moment, the kernel
values between all training vectors and the test vectors must be provided.
>>> import numpy as np
>>> from sklearn import svm
>>> X = np.array([[0, 0], [1, 1]])
>>> y = [0, 1]
>>> clf = svm.SVC(kernel='precomputed')
>>> # linear kernel computation
>>> gram = np.dot(X, X.T)
>>> clf.fit(gram, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto',

186

Chapter 3. User Guide

scikit-learn user guide, Release 0.19.1

kernel='precomputed', max_iter=-1, probability=False,
random_state=None, shrinking=True, tol=0.001, verbose=False)
>>> # predict on training examples
>>> clf.predict(gram)
array([0, 1])
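As a hedged follow-up sketch (not from the guide): to predict on new samples with kernel='precomputed', the kernel matrix between the test samples and the training samples must be passed to predict. Reusing X and clf from the example above:
X_test = np.array([[0.5, 0.7]])          # hypothetical new sample
gram_test = np.dot(X_test, X.T)          # shape (n_test_samples, n_train_samples)
print(clf.predict(gram_test))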

Parameters of the RBF Kernel
When training an SVM with the Radial Basis Function (RBF) kernel, two parameters must be considered: C and
gamma. The parameter C, common to all SVM kernels, trades off misclassification of training examples against
simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all
training examples correctly. gamma defines how much influence a single training example has. The larger gamma is,
the closer other examples must be to be affected.
Proper choice of C and gamma is critical to the SVM’s performance. One is advised to use sklearn.model_selection.GridSearchCV with C and gamma spaced exponentially far apart to choose good values.
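A minimal sketch of such a search (an illustration, not from the guide), assuming hypothetical training arrays X_train and y_train:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# exponentially spaced grids for C and gamma
param_grid = {'C': 10.0 ** np.arange(-2, 4),
              'gamma': 10.0 ** np.arange(-4, 2)}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
# search.fit(X_train, y_train)   # X_train, y_train: your own data (hypothetical)
# print(search.best_params_)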
Examples:
• RBF SVM parameters

Mathematical formulation
A support vector machine constructs a hyper-plane or set of hyper-planes in a high or infinite dimensional space, which
can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane
that has the largest distance to the nearest training data points of any class (so-called functional margin), since in
general the larger the margin the lower the generalization error of the classifier.

SVC
Given training vectors $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, in two classes, and a vector $y \in \{1, -1\}^n$, SVC solves the following primal problem:
$$\min_{w, b, \zeta} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i$$
$$\text{subject to } y_i (w^T \phi(x_i) + b) \ge 1 - \zeta_i, \quad \zeta_i \ge 0, \ i = 1, \ldots, n$$
Its dual is
$$\min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha$$
$$\text{subject to } y^T \alpha = 0, \quad 0 \le \alpha_i \le C, \ i = 1, \ldots, n$$
where $e$ is the vector of all ones, $C > 0$ is the upper bound, $Q$ is an $n$ by $n$ positive semidefinite matrix, $Q_{ij} \equiv y_i y_j K(x_i, x_j)$, where $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is the kernel. Here training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function $\phi$.
The decision function is:
$$\operatorname{sgn}\left( \sum_{i=1}^{n} y_i \alpha_i K(x_i, x) + \rho \right)$$

Note: While SVM models derived from libsvm and liblinear use C as regularization parameter, most other estimators
use alpha. The exact equivalence between the amount of regularization of two models depends on the exact objective
function optimized by the model. For example, when the estimator used is sklearn.linear_model.Ridge
regression, the relation between them is given as $C = \frac{1}{alpha}$.
These parameters can be accessed through the members dual_coef_ which holds the product $y_i \alpha_i$,
support_vectors_ which holds the support vectors, and intercept_ which holds the independent term $\rho$:
References:
• “Automatic Capacity Tuning of Very Large VC-dimension Classifiers”, I. Guyon, B. Boser, V. Vapnik Advances in neural information processing 1993.
• “Support-vector networks”, C. Cortes, V. Vapnik - Machine Learning, 20, 273-297 (1995).

NuSVC
We introduce a new parameter 𝜈 which controls the number of support vectors and training errors. The parameter
𝜈 ∈ (0, 1] is an upper bound on the fraction of training errors and a lower bound of the fraction of support vectors.
It can be shown that the 𝜈-SVC formulation is a reparameterization of the 𝐶-SVC and therefore mathematically
equivalent.

SVR
Given training vectors $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, and a vector $y \in \mathbb{R}^n$, $\varepsilon$-SVR solves the following primal problem:
$$\min_{w, b, \zeta, \zeta^*} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*)$$
$$\text{subject to } y_i - w^T \phi(x_i) - b \le \varepsilon + \zeta_i,$$
$$\qquad\qquad w^T \phi(x_i) + b - y_i \le \varepsilon + \zeta_i^*,$$
$$\qquad\qquad \zeta_i, \zeta_i^* \ge 0, \ i = 1, \ldots, n$$
Its dual is
$$\min_{\alpha, \alpha^*} \ \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + \varepsilon e^T (\alpha + \alpha^*) - y^T (\alpha - \alpha^*)$$
$$\text{subject to } e^T (\alpha - \alpha^*) = 0, \quad 0 \le \alpha_i, \alpha_i^* \le C, \ i = 1, \ldots, n$$
where $e$ is the vector of all ones, $C > 0$ is the upper bound, $Q$ is an $n$ by $n$ positive semidefinite matrix, and $Q_{ij} \equiv K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is the kernel. Here training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function $\phi$.
The decision function is:
$$\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) K(x_i, x) + \rho$$

These parameters can be accessed through the members dual_coef_ which holds the difference $\alpha_i - \alpha_i^*$, support_vectors_ which holds the support vectors, and intercept_ which holds the independent term $\rho$.
References:
• “A Tutorial on Support Vector Regression”, Alex J. Smola, Bernhard Schölkopf - Statistics and Computing
archive Volume 14 Issue 3, August 2004, p. 199-222.

Implementation details
Internally, we use libsvm and liblinear to handle all computations. These libraries are wrapped using C and Cython.
References:
For a description of the implementation and details of the algorithms used, please refer to
• LIBSVM: A Library for Support Vector Machines.
• LIBLINEAR – A Library for Large Linear Classification.

3.1.5 Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. Even though
SGD has been around in the machine learning community for a long time, it has received a considerable amount of
attention just recently in the context of large-scale learning.
SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text
classification and natural language processing. Given that the data is sparse, the classifiers in this module easily scale
to problems with more than 10^5 training examples and more than 10^5 features.
The advantages of Stochastic Gradient Descent are:
• Efficiency.
• Ease of implementation (lots of opportunities for code tuning).
The disadvantages of Stochastic Gradient Descent include:
• SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations.
• SGD is sensitive to feature scaling.
Classification

Warning: Make sure you permute (shuffle) your training data before fitting the model or use shuffle=True
to shuffle after each iteration.
The class SGDClassifier implements a plain stochastic gradient descent learning routine which supports different
loss functions and penalties for classification.

As with other classifiers, SGD has to be fitted with two arrays: an array X of size [n_samples, n_features] holding the
training samples, and an array Y of size [n_samples] holding the target values (class labels) for the training samples:
>>> from sklearn.linear_model import SGDClassifier
>>> X = [[0., 0.], [1., 1.]]

>>> y = [0, 1]
>>> clf = SGDClassifier(loss="hinge", penalty="l2")
>>> clf.fit(X, y)
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
eta0=0.0, fit_intercept=True, l1_ratio=0.15,
learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None,
n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
shuffle=True, tol=None, verbose=0, warm_start=False)

After being fitted, the model can then be used to predict new values:
>>> clf.predict([[2., 2.]])
array([1])

SGD fits a linear model to the training data. The member coef_ holds the model parameters:
>>> clf.coef_
array([[ 9.9...,  9.9...]])

Member intercept_ holds the intercept (aka offset or bias):
>>> clf.intercept_
array([-9.9...])

Whether or not the model should use an intercept, i.e. a biased hyperplane, is controlled by the parameter fit_intercept.
To get the signed distance to the hyperplane use SGDClassifier.decision_function:
>>> clf.decision_function([[2., 2.]])
array([ 29.6...])

The concrete loss function can be set via the loss parameter. SGDClassifier supports the following loss functions:
• loss="hinge": (soft-margin) linear Support Vector Machine,
• loss="modified_huber": smoothed hinge loss,
• loss="log": logistic regression,
• and all regression losses below.
The first two loss functions are lazy: they only update the model parameters if an example violates the margin constraint, which makes training very efficient and may result in sparser models, even when the L2 penalty is used.
Using loss="log" or loss="modified_huber" enables the predict_proba method, which gives a vector
of probability estimates 𝑃 (𝑦|𝑥) per sample 𝑥:
>>> clf = SGDClassifier(loss="log").fit(X, y)
>>> clf.predict_proba([[1., 1.]])
array([[ 0.00..., 0.99...]])

The concrete penalty can be set via the penalty parameter. SGD supports the following penalties:
• penalty="l2": L2 norm penalty on coef_.
• penalty="l1": L1 norm penalty on coef_.
• penalty="elasticnet": Convex combination of L2 and L1; (1 - l1_ratio) * L2 + l1_ratio * L1.
The default setting is penalty="l2". The L1 penalty leads to sparse solutions, driving most coefficients to zero.
The Elastic Net solves some deficiencies of the L1 penalty in the presence of highly correlated attributes. The parameter l1_ratio controls the convex combination of L1 and L2 penalty.
SGDClassifier supports multi-class classification by combining multiple binary classifiers in a “one versus all”
(OVA) scheme. For each of the 𝐾 classes, a binary classifier is learned that discriminates between that and all other
𝐾 − 1 classes. At testing time, we compute the confidence score (i.e. the signed distances to the hyperplane) for each
classifier and choose the class with the highest confidence. The Figure below illustrates the OVA approach on the iris
dataset. The dashed lines represent the three OVA classifiers; the background colors show the decision surface induced
by the three classifiers.

In the case of multi-class classification coef_ is a two-dimensional array of shape=[n_classes,
n_features] and intercept_ is a one-dimensional array of shape=[n_classes]. The i-th row of coef_
holds the weight vector of the OVA classifier for the i-th class; classes are indexed in ascending order (see attribute classes_). Note that, in principle, since they allow the creation of a probability model, loss="log" and
loss="modified_huber" are more suitable for one-vs-all classification.
SGDClassifier supports both weighted classes and weighted instances via the fit parameters class_weight
and sample_weight. See the examples below and the doc string of SGDClassifier.fit for further information.
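A brief sketch of both mechanisms on a toy dataset (an illustration, not from the guide):
import numpy as np
from sklearn.linear_model import SGDClassifier

X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
y = [0, 0, 1, 1]
# class 1 counts five times as much as class 0; the last sample is down-weighted
clf = SGDClassifier(loss="hinge", class_weight={0: 1, 1: 5}, max_iter=5)
clf.fit(X, y, sample_weight=np.array([1.0, 1.0, 2.0, 0.5]))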
Examples:
• SGD: Maximum margin separating hyperplane,
• Plot multi-class SGD on the iris dataset
• SGD: Weighted samples
• Comparing various online solvers
• SVM: Separating hyperplane for unbalanced classes (See the Note)

SGDClassifier supports averaged SGD (ASGD). Averaging can be enabled by setting `average=True`.
ASGD works by averaging the coefficients of the plain SGD over each iteration over a sample. When using ASGD,
the learning rate can be larger and even constant, leading on some datasets to a speed-up in training time.
For classification with a logistic loss, another variant of SGD with an averaging strategy is available with Stochastic
Average Gradient (SAG) algorithm, available as a solver in LogisticRegression.
Regression
The class SGDRegressor implements a plain stochastic gradient descent learning routine which supports different
loss functions and penalties to fit linear regression models. SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000); for other problems we recommend Ridge, Lasso, or
ElasticNet.
The concrete loss function can be set via the loss parameter. SGDRegressor supports the following loss functions:
• loss="squared_loss": Ordinary least squares,
• loss="huber": Huber loss for robust regression,
• loss="epsilon_insensitive": linear Support Vector Regression.
The Huber and epsilon-insensitive loss functions can be used for robust regression. The width of the insensitive region
has to be specified via the parameter epsilon. This parameter depends on the scale of the target variables.
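For illustration, a short sketch (not from the guide) of a robust SGD regressor with an explicit insensitive-region width; the value 0.1 is an arbitrary assumption:
from sklearn.linear_model import SGDRegressor

# epsilon is on the scale of the targets; 0.1 here is only an illustrative choice
reg = SGDRegressor(loss="huber", epsilon=0.1, penalty="l2", max_iter=5)
# reg.fit(X_train, y_train)   # hypothetical training data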
SGDRegressor supports averaged SGD as SGDClassifier does. Averaging can be enabled by setting `average=True`.
For regression with a squared loss and an l2 penalty, another variant of SGD with an averaging strategy is available with the Stochastic Average Gradient (SAG) algorithm, available as a solver in Ridge.
Stochastic Gradient Descent for sparse data

Note: The sparse implementation produces slightly different results than the dense implementation due to a shrunk
learning rate for the intercept.
There is built-in support for sparse data given in any matrix in a format supported by scipy.sparse. For maximum
efficiency, however, use the CSR matrix format as defined in scipy.sparse.csr_matrix.
Examples:
• Classification of text documents using sparse features

Complexity
The major advantage of SGD is its efficiency, which is basically linear in the number of training examples. If X is a
matrix of size (n, p), training has a cost of $O(k n \bar{p})$, where $k$ is the number of iterations (epochs) and $\bar{p}$ is the average
number of non-zero attributes per sample.
Recent theoretical results, however, show that the runtime to get some desired optimization accuracy does not increase
as the training set size increases.

Tips on Practical Use
• Stochastic Gradient Descent is sensitive to feature scaling, so it is highly recommended to scale your data. For
example, scale each attribute on the input vector X to [0,1] or [-1,+1], or standardize it to have mean 0 and
variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results. This can
be easily done using StandardScaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train) # Don't cheat - fit only on training data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test) # apply same transformation to test data

If your attributes have an intrinsic scale (e.g. word frequencies or indicator features) scaling is not needed.
• Finding a reasonable regularization term $\alpha$ is best done using GridSearchCV, usually in the range 10.0**-np.arange(1,7) (see the sketch after this list).
• Empirically, we found that SGD converges after observing approx. 10^6 training samples. Thus, a reasonable
first guess for the number of iterations is n_iter = np.ceil(10**6 / n), where n is the size of the
training set.
• If you apply SGD to features extracted using PCA we found that it is often wise to scale the feature values by
some constant c such that the average L2 norm of the training data equals one.
• We found that Averaged SGD works best with a larger number of features and a higher eta0.
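A sketch combining the scaling and regularization tips above (an illustration under the assumption of hypothetical X_train, y_train arrays, not part of the guide):
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), SGDClassifier(max_iter=5))
param_grid = {'sgdclassifier__alpha': 10.0 ** -np.arange(1, 7)}
search = GridSearchCV(pipe, param_grid, cv=5)
# search.fit(X_train, y_train)   # hypothetical training data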
References:
• “Efficient BackProp” Y. LeCun, L. Bottou, G. Orr, K. Müller - In Neural Networks: Tricks of the Trade 1998.

Mathematical formulation
Given a set of training examples (𝑥1 , 𝑦1 ), . . . , (𝑥𝑛 , 𝑦𝑛 ) where 𝑥𝑖 ∈ R𝑚 and 𝑦𝑖 ∈ {−1, 1}, our goal is to learn a linear
scoring function 𝑓 (𝑥) = 𝑤𝑇 𝑥 + 𝑏 with model parameters 𝑤 ∈ R𝑚 and intercept 𝑏 ∈ R. In order to make predictions,
we simply look at the sign of 𝑓 (𝑥). A common choice to find the model parameters is by minimizing the regularized
training error given by
$$E(w, b) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w)$$

where 𝐿 is a loss function that measures model (mis)fit and 𝑅 is a regularization term (aka penalty) that penalizes
model complexity; 𝛼 > 0 is a non-negative hyperparameter.
Different choices for 𝐿 entail different classifiers such as
• Hinge: (soft-margin) Support Vector Machines.
• Log: Logistic Regression.
• Least-Squares: Ridge Regression.
• Epsilon-Insensitive: (soft-margin) Support Vector Regression.
All of the above loss functions can be regarded as an upper bound on the misclassification error (Zero-one loss) as
shown in the Figure below.
Popular choices for the regularization term 𝑅 include:
• L2 norm: $R(w) := \frac{1}{2} \sum_{i=1}^{n} w_i^2$,
• L1 norm: $R(w) := \sum_{i=1}^{n} |w_i|$, which leads to sparse solutions.
• Elastic Net: $R(w) := \frac{\rho}{2} \sum_{i=1}^{n} w_i^2 + (1 - \rho) \sum_{i=1}^{n} |w_i|$, a convex combination of L2 and L1, where $\rho$ is given by 1 - l1_ratio.
The Figure below shows the contours of the different regularization terms in the parameter space when 𝑅(𝑤) = 1.
SGD
Stochastic gradient descent is an optimization method for unconstrained optimization problems. In contrast to (batch)
gradient descent, SGD approximates the true gradient of 𝐸(𝑤, 𝑏) by considering a single training example at a time.
The class SGDClassifier implements a first-order SGD learning routine. The algorithm iterates over the training
examples and for each example updates the model parameters according to the update rule given by
$$w \leftarrow w - \eta \left( \alpha \frac{\partial R(w)}{\partial w} + \frac{\partial L(w^T x_i + b, y_i)}{\partial w} \right)$$

where 𝜂 is the learning rate which controls the step-size in the parameter space. The intercept 𝑏 is updated similarly
but without regularization.
The learning rate 𝜂 can be either constant or gradually decaying. For classification, the default learning rate schedule
(learning_rate='optimal') is given by
$$\eta^{(t)} = \frac{1}{\alpha (t_0 + t)}$$

where 𝑡 is the time step (there are a total of n_samples * n_iter time steps), 𝑡0 is determined based on a heuristic
proposed by Léon Bottou such that the expected initial updates are comparable with the expected size of the weights
(this assuming that the norm of the training samples is approx. 1). The exact definition can be found in _init_t in
BaseSGD.

For regression the default learning rate schedule is inverse scaling (learning_rate='invscaling'), given by
$$\eta^{(t)} = \frac{eta0}{t^{power\_t}}$$

where 𝑒𝑡𝑎0 and 𝑝𝑜𝑤𝑒𝑟_𝑡 are hyperparameters chosen by the user via eta0 and power_t, resp.
For a constant learning rate use learning_rate='constant' and use eta0 to specify the learning rate.
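A small sketch of the three schedules (an assumption-level illustration, not from the guide):
from sklearn.linear_model import SGDClassifier, SGDRegressor

clf = SGDClassifier(learning_rate='optimal', alpha=1e-4)                  # default schedule for classification
reg = SGDRegressor(learning_rate='invscaling', eta0=0.01, power_t=0.25)   # default schedule for regression
clf_const = SGDClassifier(learning_rate='constant', eta0=0.1)             # fixed step size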
The model parameters can be accessed through the members coef_ and intercept_:
• Member coef_ holds the weights 𝑤
• Member intercept_ holds 𝑏
References:
• “Solving large scale linear prediction problems using stochastic gradient descent algorithms” T. Zhang - In
Proceedings of ICML ‘04.
• “Regularization and variable selection via the elastic net” H. Zou, T. Hastie - Journal of the Royal Statistical
Society Series B, 67 (2), 301-320.
• “Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent” Xu, Wei

Implementation details
The implementation of SGD is influenced by the Stochastic Gradient SVM of Léon Bottou. Similar to SvmSGD,
the weight vector is represented as the product of a scalar and a vector which allows an efficient weight update in
the case of L2 regularization. In the case of sparse feature vectors, the intercept is updated with a smaller learning
rate (multiplied by 0.01) to account for the fact that it is updated more frequently. Training examples are picked up
sequentially and the learning rate is lowered after each observed example. We adopted the learning rate schedule from

Shalev-Shwartz et al. 2007. For multi-class classification, a “one versus all” approach is used. We use the truncated
gradient algorithm proposed by Tsuruoka et al. 2009 for L1 regularization (and the Elastic Net). The code is written
in Cython.
References:
• “Stochastic Gradient Descent” L. Bottou - Website, 2010.
• “The Tradeoffs of Large Scale Machine Learning” L. Bottou - Website, 2011.
• “Pegasos: Primal estimated sub-gradient solver for svm” S. Shalev-Shwartz, Y. Singer, N. Srebro - In Proceedings of ICML ‘07.
• “Stochastic gradient descent training for l1-regularized log-linear models with cumulative penalty” Y. Tsuruoka, J. Tsujii, S. Ananiadou - In Proceedings of the AFNLP/ACL ‘09.

3.1.6 Nearest Neighbors
sklearn.neighbors provides functionality for unsupervised and supervised neighbors-based learning methods.
Unsupervised nearest neighbors is the foundation of many other learning methods, notably manifold learning and
spectral clustering. Supervised neighbors-based learning comes in two flavors: classification for data with discrete
labels, and regression for data with continuous labels.
The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance
to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest
neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can,
in general, be any metric measure: standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply “remember” all of their training data
(possibly transformed into a fast indexing structure such as a Ball Tree or KD Tree).
Despite its simplicity, nearest neighbors has been successful in a large number of classification and regression problems, including handwritten digits or satellite image scenes. Being a non-parametric method, it is often successful in
classification situations where the decision boundary is very irregular.
The classes in sklearn.neighbors can handle either Numpy arrays or scipy.sparse matrices as input. For dense
matrices, a large number of possible distance metrics are supported. For sparse matrices, arbitrary Minkowski metrics
are supported for searches.
There are many learning routines which rely on nearest neighbors at their core. One example is kernel density estimation, discussed in the density estimation section.
Unsupervised Nearest Neighbors
NearestNeighbors implements unsupervised nearest neighbors learning. It acts as a uniform interface to three
different nearest neighbors algorithms: BallTree, KDTree, and a brute-force algorithm based on routines in
sklearn.metrics.pairwise. The choice of neighbors search algorithm is controlled through the keyword
'algorithm', which must be one of ['auto', 'ball_tree', 'kd_tree', 'brute']. When the default value 'auto' is passed, the algorithm attempts to determine the best approach from the training data. For a
discussion of the strengths and weaknesses of each option, see Nearest Neighbor Algorithms.
Warning: Regarding the Nearest Neighbors algorithms, if two neighbors, neighbor 𝑘 + 1 and 𝑘, have
identical distances but different labels, the results will depend on the ordering of the training data.

Finding the Nearest Neighbors
For the simple task of finding the nearest neighbors between two sets of data, the unsupervised algorithms within
sklearn.neighbors can be used:
>>> from sklearn.neighbors import NearestNeighbors
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
>>> distances, indices = nbrs.kneighbors(X)
>>> indices
array([[0, 1],
[1, 0],
[2, 1],
[3, 4],
[4, 3],
[5, 4]]...)
>>> distances
array([[ 0.        ,  1.        ],
       [ 0.        ,  1.        ],
       [ 0.        ,  1.41421356],
       [ 0.        ,  1.        ],
       [ 0.        ,  1.        ],
       [ 0.        ,  1.41421356]])

Because the query set matches the training set, the nearest neighbor of each point is the point itself, at a distance of
zero.
It is also possible to efficiently produce a sparse graph showing the connections between neighboring points:
>>> nbrs.kneighbors_graph(X).toarray()
array([[ 1., 1., 0., 0., 0., 0.],
[ 1., 1., 0., 0., 0., 0.],
[ 0., 1., 1., 0., 0., 0.],
[ 0., 0., 0., 1., 1., 0.],
[ 0., 0., 0., 1., 1., 0.],
[ 0., 0., 0., 0., 1., 1.]])

Our dataset is structured such that points nearby in index order are nearby in parameter space, leading to an approximately block-diagonal matrix of K-nearest neighbors. Such a sparse graph is useful in a variety of circumstances which make use of spatial relationships between points for unsupervised learning: in particular,
see sklearn.manifold.Isomap, sklearn.manifold.LocallyLinearEmbedding, and sklearn.cluster.SpectralClustering.
KDTree and BallTree Classes
Alternatively, one can use the KDTree or BallTree classes directly to find nearest neighbors. This is the functionality wrapped by the NearestNeighbors class used above. The Ball Tree and KD Tree have the same interface;
we’ll show an example of using the KD Tree here:
>>> from sklearn.neighbors import KDTree
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> kdt = KDTree(X, leaf_size=30, metric='euclidean')
>>> kdt.query(X, k=2, return_distance=False)
array([[0, 1],
       [1, 0],
       [2, 1],
       [3, 4],
       [4, 3],
       [5, 4]]...)

Refer to the KDTree and BallTree class documentation for more information on the options available for neighbors
searches, including specification of query strategies, of various distance metrics, etc. For a list of available metrics,
see the documentation of the DistanceMetric class.
Nearest Neighbors Classification
Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt
to construct a general internal model, but simply stores instances of the training data. Classification is computed from
a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the
most representatives within the nearest neighbors of the point.
scikit-learn implements two different nearest neighbors classifiers: KNeighborsClassifier implements learning based on the 𝑘 nearest neighbors of each query point, where 𝑘 is an integer value specified by the user.
RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius
𝑟 of each training point, where 𝑟 is a floating-point value specified by the user.
The 𝑘-neighbors classification in KNeighborsClassifier is the more commonly used of the two techniques.
The optimal choice of the value 𝑘 is highly data-dependent: in general a larger 𝑘 suppresses the effects of noise, but
makes the classification boundaries less distinct.
In cases where the data is not uniformly sampled, radius-based neighbors classification in
RadiusNeighborsClassifier can be a better choice. The user specifies a fixed radius 𝑟, such that
points in sparser neighborhoods use fewer nearest neighbors for the classification. For high-dimensional parameter
spaces, this method becomes less effective due to the so-called “curse of dimensionality”.
The basic nearest neighbors classification uses uniform weights: that is, the value assigned to a query point is computed
from a simple majority vote of the nearest neighbors. Under some circumstances, it is better to weight the neighbors
such that nearer neighbors contribute more to the fit. This can be accomplished through the weights keyword. The
default value, weights = 'uniform', assigns uniform weights to each neighbor. weights = 'distance'
assigns weights proportional to the inverse of the distance from the query point. Alternatively, a user-defined function
of the distance can be supplied which is used to compute the weights.
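A minimal sketch (not from the guide) comparing the two built-in weightings on a toy problem:
from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X, y)
distance = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y)
print(uniform.predict([[1.1]]), distance.predict([[1.1]]))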

Examples:
• Nearest Neighbors Classification: an example of classification using nearest neighbors.

Nearest Neighbors Regression
Neighbors-based regression can be used in cases where the data labels are continuous rather than discrete variables.
The label assigned to a query point is computed based on the mean of the labels of its nearest neighbors.
scikit-learn implements two different neighbors regressors: KNeighborsRegressor implements learning
based on the 𝑘 nearest neighbors of each query point, where 𝑘 is an integer value specified by the user.
RadiusNeighborsRegressor implements learning based on the neighbors within a fixed radius 𝑟 of the query
point, where 𝑟 is a floating-point value specified by the user.
The basic nearest neighbors regression uses uniform weights: that is, each point in the local neighborhood contributes
uniformly to the prediction for a query point. Under some circumstances, it can be advantageous to weight points
such that nearby points contribute more to the regression than faraway points. This can be accomplished through the
weights keyword. The default value, weights = 'uniform', assigns equal weights to all points. weights
= 'distance' assigns weights proportional to the inverse of the distance from the query point. Alternatively, a
user-defined function of the distance can be supplied, which will be used to compute the weights.
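A short sketch (an assumption-level illustration, not from the guide):
from sklearn.neighbors import KNeighborsRegressor

X = [[0], [1], [2], [3]]
y = [0.0, 0.0, 1.0, 1.0]
reg = KNeighborsRegressor(n_neighbors=2, weights='distance').fit(X, y)
print(reg.predict([[1.5]]))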

The use of multi-output nearest neighbors for regression is demonstrated in Face completion with a multi-output
estimators. In this example, the inputs X are the pixels of the upper half of faces and the outputs Y are the pixels of
the lower half of those faces.
Examples:
• Nearest Neighbors regression: an example of regression using nearest neighbors.

• Face completion with a multi-output estimators: an example of multi-output regression using nearest neighbors.

Nearest Neighbor Algorithms
Brute Force
Fast computation of nearest neighbors is an active area of research in machine learning. The most naive neighbor
search implementation involves the brute-force computation of distances between all pairs of points in the dataset: for
𝑁 samples in 𝐷 dimensions, this approach scales as 𝑂[𝐷𝑁 2 ]. Efficient brute-force neighbors searches can be very
competitive for small data samples. However, as the number of samples 𝑁 grows, the brute-force approach quickly
becomes infeasible. In the classes within sklearn.neighbors, brute-force neighbors searches are specified using
the keyword algorithm = 'brute', and are computed using the routines available in sklearn.metrics.pairwise.
K-D Tree
To address the computational inefficiencies of the brute-force approach, a variety of tree-based data structures have
been invented. In general, these structures attempt to reduce the required number of distance calculations by efficiently
encoding aggregate distance information for the sample. The basic idea is that if point 𝐴 is very distant from point
𝐵, and point 𝐵 is very close to point 𝐶, then we know that points 𝐴 and 𝐶 are very distant, without having to
explicitly calculate their distance. In this way, the computational cost of a nearest neighbors search can be reduced to
𝑂[𝐷𝑁 log(𝑁 )] or better. This is a significant improvement over brute-force for large 𝑁 .
An early approach to taking advantage of this aggregate information was the KD tree data structure (short for K-dimensional tree), which generalizes two-dimensional Quad-trees and 3-dimensional Oct-trees to an arbitrary number
of dimensions. The KD tree is a binary tree structure which recursively partitions the parameter space along the data
axes, dividing it into nested orthotropic regions into which data points are filed. The construction of a KD tree is very
fast: because partitioning is performed only along the data axes, no 𝐷-dimensional distances need to be computed.
Once constructed, the nearest neighbor of a query point can be determined with only 𝑂[log(𝑁 )] distance computations.
Though the KD tree approach is very fast for low-dimensional (𝐷 < 20) neighbors searches, it becomes inefficient
as 𝐷 grows very large: this is one manifestation of the so-called “curse of dimensionality”. In scikit-learn, KD tree
neighbors searches are specified using the keyword algorithm = 'kd_tree', and are computed using the class
KDTree.
References:
• “Multidimensional binary search trees used for associative searching”, Bentley, J.L., Communications of the
ACM (1975)

Ball Tree
To address the inefficiencies of KD Trees in higher dimensions, the ball tree data structure was developed. Where
KD trees partition data along Cartesian axes, ball trees partition data in a series of nesting hyper-spheres. This makes
tree construction more costly than that of the KD tree, but results in a data structure which can be very efficient on
highly-structured data, even in very high dimensions.
A ball tree recursively divides the data into nodes defined by a centroid 𝐶 and radius 𝑟, such that each point in the
node lies within the hyper-sphere defined by 𝑟 and 𝐶. The number of candidate points for a neighbor search is reduced

through use of the triangle inequality:
|𝑥 + 𝑦| ≤ |𝑥| + |𝑦|
With this setup, a single distance calculation between a test point and the centroid is sufficient to determine a lower
and upper bound on the distance to all points within the node. Because of the spherical geometry of the ball tree nodes,
it can out-perform a KD-tree in high dimensions, though the actual performance is highly dependent on the structure
of the training data. In scikit-learn, ball-tree-based neighbors searches are specified using the keyword algorithm
= 'ball_tree', and are computed using the class sklearn.neighbors.BallTree. Alternatively, the user
can work with the BallTree class directly.
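A brief sketch (not from the guide), mirroring the KDTree example above but with a BallTree:
import numpy as np
from sklearn.neighbors import BallTree

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
tree = BallTree(X, leaf_size=30, metric='euclidean')
dist, ind = tree.query(X[:1], k=2)   # distances and indices of the 2 nearest neighbors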
References:
• “Five balltree construction algorithms”, Omohundro, S.M., International Computer Science Institute Technical Report (1989)

Choice of Nearest Neighbors Algorithm
The optimal algorithm for a given dataset is a complicated choice, and depends on a number of factors:
• number of samples 𝑁 (i.e. n_samples) and dimensionality 𝐷 (i.e. n_features).
– Brute force query time grows as 𝑂[𝐷𝑁 ]
– Ball tree query time grows as approximately 𝑂[𝐷 log(𝑁 )]
– KD tree query time changes with 𝐷 in a way that is difficult to precisely characterise. For small 𝐷 (less
than 20 or so) the cost is approximately 𝑂[𝐷 log(𝑁 )], and the KD tree query can be very efficient. For
larger 𝐷, the cost increases to nearly 𝑂[𝐷𝑁 ], and the overhead due to the tree structure can lead to queries
which are slower than brute force.
For small data sets (𝑁 less than 30 or so), log(𝑁 ) is comparable to 𝑁 , and brute force algorithms can be more
efficient than a tree-based approach. Both KDTree and BallTree address this through providing a leaf size
parameter: this controls the number of samples at which a query switches to brute-force. This allows both
algorithms to approach the efficiency of a brute-force computation for small 𝑁 .
• data structure: intrinsic dimensionality of the data and/or sparsity of the data. Intrinsic dimensionality refers
to the dimension 𝑑 ≤ 𝐷 of a manifold on which the data lies, which can be linearly or non-linearly embedded
in the parameter space. Sparsity refers to the degree to which the data fills the parameter space (this is to be
distinguished from the concept as used in “sparse” matrices. The data matrix may have no zero entries, but the
structure can still be “sparse” in this sense).
– Brute force query time is unchanged by data structure.
– Ball tree and KD tree query times can be greatly influenced by data structure. In general, sparser data with a
smaller intrinsic dimensionality leads to faster query times. Because the KD tree internal representation is
aligned with the parameter axes, it will not generally show as much improvement as ball tree for arbitrarily
structured data.
Datasets used in machine learning tend to be very structured, and are very well-suited for tree-based queries.
• number of neighbors 𝑘 requested for a query point.
– Brute force query time is largely unaffected by the value of 𝑘
– Ball tree and KD tree query time will become slower as 𝑘 increases. This is due to two effects: first, a
larger 𝑘 leads to the necessity to search a larger portion of the parameter space. Second, using 𝑘 > 1
requires internal queueing of results as the tree is traversed.
As 𝑘 becomes large compared to 𝑁 , the ability to prune branches in a tree-based query is reduced. In this
situation, Brute force queries can be more efficient.
• number of query points. Both the ball tree and the KD Tree require a construction phase. The cost of this
construction becomes negligible when amortized over many queries. If only a small number of queries will
be performed, however, the construction can make up a significant fraction of the total cost. If very few query
points will be required, brute force is better than a tree-based method.
Currently, algorithm = 'auto' selects 'kd_tree' if 𝑘 < 𝑁/2 and the 'effective_metric_'
is in the 'VALID_METRICS' list of 'kd_tree'.
It selects 'ball_tree' if 𝑘 < 𝑁/2 and the
'effective_metric_' is in the 'VALID_METRICS' list of 'ball_tree'. It selects 'brute' if 𝑘 < 𝑁/2
and the 'effective_metric_' is not in the 'VALID_METRICS' list of 'kd_tree' or 'ball_tree'. It
selects 'brute' if 𝑘 >= 𝑁/2. This choice is based on the assumption that the number of query points is at least the
same order as the number of training points, and that leaf_size is close to its default value of 30.
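For illustration (a sketch, not from the guide), the choice can also be fixed explicitly instead of relying on the heuristic above:
from sklearn.neighbors import NearestNeighbors

nn_auto = NearestNeighbors(n_neighbors=3, algorithm='auto')       # heuristic described above
nn_kd = NearestNeighbors(n_neighbors=3, algorithm='kd_tree')      # low-dimensional dense data
nn_ball = NearestNeighbors(n_neighbors=3, algorithm='ball_tree')  # higher-dimensional or structured data
nn_brute = NearestNeighbors(n_neighbors=3, algorithm='brute')     # small N or very few queries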
Effect of leaf_size
As noted above, for small sample sizes a brute force search can be more efficient than a tree-based query. This fact is
accounted for in the ball tree and KD tree by internally switching to brute force searches within leaf nodes. The level
of this switch can be specified with the parameter leaf_size. This parameter choice has many effects:
• construction time: a larger leaf_size leads to a faster tree construction time, because fewer nodes need to be created.
• query time: both a large or small leaf_size can lead to suboptimal query cost. For leaf_size approaching 1, the overhead involved in traversing nodes can significantly slow query times. For leaf_size approaching the size of the training set, queries become essentially brute force. A good compromise between these is leaf_size = 30, the default value of the parameter.
• memory: as leaf_size increases, the memory required to store a tree structure decreases. This is especially important in the case of ball tree, which stores a 𝐷-dimensional centroid for each node. The required storage space for BallTree is approximately 1 / leaf_size times the size of the training set.
leaf_size is not referenced for brute force queries.
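A tiny sketch (assumption, not from the guide) showing that leaf_size is fixed at construction time:
import numpy as np
from sklearn.neighbors import KDTree

X = np.random.RandomState(0).rand(100, 3)
tree_deep = KDTree(X, leaf_size=2)       # many small nodes: more traversal overhead
tree_shallow = KDTree(X, leaf_size=100)  # one big node: essentially brute force per query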
Nearest Centroid Classifier
The NearestCentroid classifier is a simple algorithm that represents each class by the centroid of its members.
In effect, this makes it similar to the label updating phase of the sklearn.cluster.KMeans algorithm. It also has no parameters to choose, making it a good baseline classifier. It does, however, suffer on non-convex classes, as well as when
classes have drastically different variances, as equal variance in all dimensions is assumed. See Linear Discriminant Analysis (sklearn.discriminant_analysis.LinearDiscriminantAnalysis) and Quadratic
Discriminant Analysis (sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis) for
more complex methods that do not make this assumption. Usage of the default NearestCentroid is simple:
>>> from sklearn.neighbors.nearest_centroid import NearestCentroid
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = NearestCentroid()
>>> clf.fit(X, y)
NearestCentroid(metric='euclidean', shrink_threshold=None)
>>> print(clf.predict([[-0.8, -1]]))
[1]

Nearest Shrunken Centroid
The NearestCentroid classifier has a shrink_threshold parameter, which implements the nearest shrunken
centroid classifier. In effect, the value of each feature for each centroid is divided by the within-class variance of that
feature. The feature values are then reduced by shrink_threshold. Most notably, if a particular feature value
crosses zero, it is set to zero. In effect, this removes the feature from affecting the classification. This is useful, for
example, for removing noisy features.
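As a hedged sketch (not from the guide), reusing the toy data from the example above with a shrink threshold:
import numpy as np
from sklearn.neighbors.nearest_centroid import NearestCentroid

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
clf_shrunken = NearestCentroid(shrink_threshold=0.2).fit(X, y)
print(clf_shrunken.predict([[-0.8, -1]]))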
In the example below, using a small shrink threshold increases the accuracy of the model from 0.81 to 0.82.

Examples:
• Nearest Centroid Classification: an example of classification using nearest centroid with different shrink
thresholds.

3.1.7 Gaussian Processes
Gaussian Processes (GP) are a generic supervised learning method designed to solve regression and probabilistic
classification problems.
The advantages of Gaussian processes are:
• The prediction interpolates the observations (at least for regular kernels).
• The prediction is probabilistic (Gaussian) so that one can compute empirical confidence intervals and decide
based on those if one should refit (online fitting, adaptive fitting) the prediction in some region of interest.
• Versatile: different kernels can be specified. Common kernels are provided, but it is also possible to specify
custom kernels.
The disadvantages of Gaussian processes include:
• They are not sparse, i.e., they use the whole samples/features information to perform the prediction.
• They lose efficiency in high dimensional spaces – namely when the number of features exceeds a few dozen.
Gaussian Process Regression (GPR)
The GaussianProcessRegressor implements Gaussian processes (GP) for regression purposes. For this, the
prior of the GP needs to be specified. The prior mean is assumed to be constant and zero (for normalize_y=False)
or the training data’s mean (for normalize_y=True). The prior’s covariance is specified by passing a kernel
object. The hyperparameters of the kernel are optimized during fitting of GaussianProcessRegressor by maximizing
the log-marginal-likelihood (LML) based on the passed optimizer. As the LML may have multiple local optima,
the optimizer can be started repeatedly by specifying n_restarts_optimizer. The first run is always conducted
starting from the initial hyperparameter values of the kernel; subsequent runs are conducted from hyperparameter
values that have been chosen randomly from the range of allowed values. If the initial hyperparameters should be kept
fixed, None can be passed as optimizer.
The noise level in the targets can be specified by passing it via the parameter alpha, either globally as a scalar or
per datapoint. Note that a moderate noise level can also be helpful for dealing with numeric issues during fitting as
it is effectively implemented as Tikhonov regularization, i.e., by adding it to the diagonal of the kernel matrix. An
alternative to specifying the noise level explicitly is to include a WhiteKernel component into the kernel, which can
estimate the global noise level from the data (see example below).
The implementation is based on Algorithm 2.1 of [RW2006]. In addition to the API of standard scikit-learn estimators,
GaussianProcessRegressor:
• allows prediction without prior fitting (based on the GP prior)
• provides an additional method sample_y(X), which evaluates samples drawn from the GPR (prior or posterior) at given inputs
• exposes a method log_marginal_likelihood(theta), which can be used externally for other ways of
selecting hyperparameters, e.g., via Markov chain Monte Carlo.
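A compact sketch (an assumption-laden illustration, not from the guide) touching the pieces described above: a WhiteKernel for noise estimation, sample_y, and log_marginal_likelihood, on a hypothetical noisy 1-D dataset:
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, 20)[:, np.newaxis]          # toy 1-D inputs
y = np.sin(X).ravel() + rng.normal(0, 0.1, 20)    # noisy targets

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=3).fit(X, y)
y_mean, y_std = gpr.predict(X, return_std=True)   # predictive mean and std
y_samples = gpr.sample_y(X, n_samples=3)          # draws from the posterior
print(gpr.log_marginal_likelihood(gpr.kernel_.theta))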
GPR examples
GPR with noise-level estimation
This example illustrates that GPR with a sum-kernel including a WhiteKernel can estimate the noise level of data. An
illustration of the log-marginal-likelihood (LML) landscape shows that there exist two local maxima of LML.
The first corresponds to a model with a high noise level and a large length scale, which explains all variations in the
data by noise.
The second one has a smaller noise level and shorter length scale, which explains most of the variation by the noise-free functional relationship. The second model has a higher likelihood; however, depending on the initial value for the
hyperparameters, the gradient-based optimization might also converge to the high-noise solution. It is thus important
to repeat the optimization several times for different initializations.
Comparison of GPR and Kernel Ridge Regression
Both kernel ridge regression (KRR) and GPR learn a target function by employing internally the “kernel trick”. KRR
learns a linear function in the space induced by the respective kernel which corresponds to a non-linear function in
the original space. The linear function in the kernel space is chosen based on the mean-squared error loss with ridge
regularization. GPR uses the kernel to define the covariance of a prior distribution over the target functions and uses
the observed training data to define a likelihood function. Based on Bayes theorem, a (Gaussian) posterior distribution
over target functions is defined, whose mean is used for prediction.
A major difference is that GPR can choose the kernel’s hyperparameters based on gradient-ascent on the marginal
likelihood function while KRR needs to perform a grid search on a cross-validated loss function (mean-squared error
loss). A further difference is that GPR learns a generative, probabilistic model of the target function and can thus
provide meaningful confidence intervals and posterior samples along with the predictions while KRR only provides
predictions.

The following figure illustrates both methods on an artificial dataset, which consists of a sinusoidal target function
and strong noise. The figure compares the learned model of KRR and GPR based on a ExpSineSquared kernel,
which is suited for learning periodic functions. The kernel’s hyperparameters control the smoothness (length_scale)
and periodicity of the kernel (periodicity). Moreover, the noise level of the data is learned explicitly by GPR by an
additional WhiteKernel component in the kernel and by the regularization parameter alpha of KRR.

The figure shows that both methods learn reasonable models of the target function. GPR correctly identifies the periodicity of the function to be roughly 2 * 𝜋 (6.28), while KRR chooses the doubled periodicity 4 * 𝜋 . Besides that, GPR
provides reasonable confidence bounds on the prediction which are not available for KRR. A major difference between
the two methods is the time required for fitting and predicting: while fitting KRR is fast in principle, the grid-search
for hyperparameter optimization scales exponentially with the number of hyperparameters (“curse of dimensionality”). The gradient-based optimization of the parameters in GPR does not suffer from this exponential scaling and is
thus considerably faster on this example with a 3-dimensional hyperparameter space. The time for predicting is similar;
however, generating the variance of the predictive distribution of GPR takes considerably longer than just predicting
the mean.
GPR on Mauna Loa CO2 data
This example is based on Section 5.4.3 of [RW2006]. It illustrates an example of complex kernel engineering and
hyperparameter optimization using gradient ascent on the log-marginal-likelihood. The data consists of the monthly
average atmospheric CO2 concentrations (in parts per million by volume (ppmv)) collected at the Mauna Loa Observatory in Hawaii, between 1958 and 1997. The objective is to model the CO2 concentration as a function of the time
t.
The kernel is composed of several terms that are responsible for explaining different properties of the signal:
• a long term, smooth rising trend is to be explained by an RBF kernel. The RBF kernel with a large length-scale
enforces this component to be smooth; it is not enforced that the trend is rising which leaves this choice to the
GP. The specific length-scale and the amplitude are free hyperparameters.
• a seasonal component, which is to be explained by the periodic ExpSineSquared kernel with a fixed periodicity
of 1 year. The length-scale of this periodic component, controlling its smoothness, is a free parameter. In order
to allow decaying away from exact periodicity, the product with an RBF kernel is taken. The length-scale of this
RBF component controls the decay time and is a further free parameter.
• smaller, medium-term irregularities are to be explained by a RationalQuadratic kernel component, whose length-scale and alpha parameter, which determines the diffuseness of the length-scales, are to be determined. According to [RW2006], these irregularities can better be explained by a RationalQuadratic than an RBF kernel
component, probably because it can accommodate several length-scales.
• a “noise” term, consisting of an RBF kernel contribution, which shall explain the correlated noise components
such as local weather phenomena, and a WhiteKernel contribution for the white noise. The relative amplitudes
and the RBF’s length scale are further free parameters.
Maximizing the log-marginal-likelihood after subtracting the target’s mean yields the following kernel with an LML
of -83.214:
34.4**2 * RBF(length_scale=41.8)
+ 3.27**2 * RBF(length_scale=180) * ExpSineSquared(length_scale=1.44,
periodicity=1)
+ 0.446**2 * RationalQuadratic(alpha=17.7, length_scale=0.957)
+ 0.197**2 * RBF(length_scale=0.138) + WhiteKernel(noise_level=0.0336)

Thus, most of the target signal (34.4ppm) is explained by a long-term rising trend (length-scale 41.8 years). The
periodic component has an amplitude of 3.27ppm, a decay time of 180 years and a length-scale of 1.44. The long
decay time indicates that we have a locally very close to periodic seasonal component. The correlated noise has an
amplitude of 0.197ppm with a length scale of 0.138 years and a white-noise contribution of 0.197ppm. Thus, the
overall noise level is very small, indicating that the data can be very well explained by the model. The figure shows
also that the model makes very confident predictions until around 2015.

Gaussian Process Classification (GPC)
The GaussianProcessClassifier implements Gaussian processes (GP) for classification purposes, more
specifically for probabilistic classification, where test predictions take the form of class probabilities. GaussianProcessClassifier places a GP prior on a latent function 𝑓 , which is then squashed through a link function to obtain the
probabilistic classification. The latent function 𝑓 is a so-called nuisance function, whose values are not observed and
are not relevant by themselves. Its purpose is to allow a convenient formulation of the model, and 𝑓 is removed (integrated out) during prediction. GaussianProcessClassifier implements the logistic link function, for which the integral
cannot be computed analytically but is easily approximated in the binary case.
In contrast to the regression setting, the posterior of the latent function 𝑓 is not Gaussian even for a GP prior since
a Gaussian likelihood is inappropriate for discrete class labels. Rather, a non-Gaussian likelihood corresponding to
the logistic link function (logit) is used. GaussianProcessClassifier approximates the non-Gaussian posterior with a
Gaussian based on the Laplace approximation. More details can be found in Chapter 3 of [RW2006].
The GP prior mean is assumed to be zero. The prior’s covariance is specified by passing a kernel object. The hyperparameters of the kernel are optimized during fitting of GaussianProcessClassifier by maximizing the log-marginal-likelihood (LML) based on the passed optimizer. As the LML may have multiple local optima, the optimizer can
be started repeatedly by specifying n_restarts_optimizer. The first run is always conducted starting from the
initial hyperparameter values of the kernel; subsequent runs are conducted from hyperparameter values that have been
chosen randomly from the range of allowed values. If the initial hyperparameters should be kept fixed, None can be
passed as optimizer.
GaussianProcessClassifier supports multi-class classification by performing either one-versus-rest or oneversus-one based training and prediction. In one-versus-rest, one binary Gaussian process classifier is fitted for each
class, which is trained to separate this class from the rest. In “one_vs_one”, one binary Gaussian process classifier is
fitted for each pair of classes, which is trained to separate these two classes. The predictions of these binary predictors
are combined into multi-class predictions. See the section on multi-class classification for more details.
In the case of Gaussian process classification, “one_vs_one” might be computationally cheaper since it has to solve
many problems involving only a subset of the whole training set rather than fewer problems on the whole dataset. Since
Gaussian process classification scales cubically with the size of the dataset, this might be considerably faster. However, note that “one_vs_one” does not support predicting probability estimates but only plain predictions. Moreover,
note that GaussianProcessClassifier does not (yet) implement a true multi-class Laplace approximation internally, but as discussed above is based on solving several binary classification tasks internally, which are combined
using one-versus-rest or one-versus-one.
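A short sketch (not from the guide) of multi-class GPC on the iris dataset, using the default one-versus-rest strategy:
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

iris = load_iris()
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0),
                                multi_class='one_vs_rest')
gpc.fit(iris.data, iris.target)
print(gpc.predict_proba(iris.data[:2]))   # class probabilities for two samples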
GPC examples
Probabilistic predictions with GPC
This example illustrates the predicted probability of GPC for an RBF kernel with different choices of the hyperparameters. The first figure shows the predicted probability of GPC with arbitrarily chosen hyperparameters and with the
hyperparameters corresponding to the maximum log-marginal-likelihood (LML).
While the hyperparameters chosen by optimizing the LML have a considerably larger LML, they perform slightly worse
according to the log-loss on test data. The figure shows that this is because they exhibit a steep change of the class
probabilities at the class boundaries (which is good) but have predicted probabilities close to 0.5 far away from the
class boundaries (which is bad). This undesirable effect is caused by the Laplace approximation used internally by
GPC.
The second figure shows the log-marginal-likelihood for different choices of the kernel’s hyperparameters, highlighting
the two choices of the hyperparameters used in the first figure by black dots.

Illustration of GPC on the XOR dataset
This example illustrates GPC on XOR data. Compared are a stationary, isotropic kernel (RBF) and a non-stationary
kernel (DotProduct). On this particular dataset, the DotProduct kernel obtains considerably better results because
the class-boundaries are linear and coincide with the coordinate axes. In practice, however, stationary kernels such as
RBF often obtain better results.

Gaussian process classification (GPC) on iris dataset
This example illustrates the predicted probability of GPC for an isotropic and anisotropic RBF kernel on a two-dimensional version of the iris dataset. This illustrates the applicability of GPC to non-binary classification. The
anisotropic RBF kernel obtains slightly higher log-marginal-likelihood by assigning different length-scales to the two
feature dimensions.
Kernels for Gaussian Processes
Kernels (also called “covariance functions” in the context of GPs) are a crucial ingredient of GPs which determine
the shape of prior and posterior of the GP. They encode the assumptions on the function being learned by defining the
“similarity” of two datapoints combined with the assumption that similar datapoints should have similar target values.
Two categories of kernels can be distinguished: stationary kernels depend only on the distance of two datapoints
and not on their absolute values 𝑘(𝑥𝑖 , 𝑥𝑗 ) = 𝑘(𝑑(𝑥𝑖 , 𝑥𝑗 )) and are thus invariant to translations in the input space,
while non-stationary kernels depend also on the specific values of the datapoints. Stationary kernels can further be
subdivided into isotropic and anisotropic kernels, where isotropic kernels are also invariant to rotations in the input
space. For more details, we refer to Chapter 4 of [RW2006].
Gaussian Process Kernel API
The main usage of a Kernel is to compute the GP’s covariance between datapoints. For this, the method __call__
of the kernel can be called. This method can either be used to compute the “auto-covariance” of all pairs of datapoints

in a 2d array X, or the “cross-covariance” of all combinations of datapoints of a 2d array X with datapoints in a 2d
array Y. The following identity holds true for all kernels k (except for the WhiteKernel): k(X) == k(X, Y=X)
If only the diagonal of the auto-covariance is being used, the method diag() of a kernel can be called, which is more
computationally efficient than the equivalent call to __call__: np.diag(k(X, X)) == k.diag(X)
Kernels are parameterized by a vector 𝜃 of hyperparameters. These hyperparameters can for instance control the length-scales or periodicity of a kernel (see below). All kernels support computing analytic gradients of the kernel’s
auto-covariance with respect to 𝜃 via setting eval_gradient=True in the __call__ method. This gradient is
used by the Gaussian process (both regressor and classifier) in computing the gradient of the log-marginal-likelihood,
which in turn is used to determine the value of 𝜃, which maximizes the log-marginal-likelihood, via gradient ascent.
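As a short illustration of these calls (toy data; an RBF kernel is chosen arbitrarily):
>>> import numpy as np
>>> from sklearn.gaussian_process.kernels import RBF
>>> X = np.array([[0.0], [1.0], [2.0]])
>>> kernel = RBF(length_scale=1.0)
>>> K = kernel(X)                                   # auto-covariance of all pairs in X
>>> np.allclose(np.diag(K), kernel.diag(X))         # diag() is the cheaper equivalent
True
>>> K, K_gradient = kernel(X, eval_gradient=True)   # gradient w.r.t. the log-transformed theta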
For each hyperparameter, the initial value and the bounds need to be specified when creating an instance of the kernel.
The current value of 𝜃 can be retrieved and set via the property theta of the kernel object. Moreover, the bounds of the
hyperparameters can be accessed by the property bounds of the kernel. Note that both properties (theta and bounds)
return log-transformed values of the internally used values since those are typically more amenable to gradient-based
optimization. The specification of each hyperparameter is stored in the form of an instance of Hyperparameter
in the respective kernel. Note that a kernel using a hyperparameter with name “x” must have the attributes self.x and
self.x_bounds.
The abstract base class for all kernels is Kernel. Kernel implements a similar interface as Estimator, providing
the methods get_params(), set_params(), and clone(). This allows setting kernel values also via meta-estimators such as Pipeline or GridSearch. Note that due to the nested structure of kernels (by applying kernel
operators, see below), the names of kernel parameters might become relatively complicated. In general, for a binary
kernel operator, parameters of the left operand are prefixed with k1__ and parameters of the right operand with k2__.
An additional convenience method is clone_with_theta(theta), which returns a cloned version of the kernel
but with the hyperparameters set to theta. An illustrative example:
>>> from sklearn.gaussian_process.kernels import ConstantKernel, RBF
>>> kernel = ConstantKernel(constant_value=1.0, constant_value_bounds=(0.0, 10.0)) * RBF(length_scale=0.5, length_scale_bounds=(0.0, 10.0)) + RBF(length_scale=2.0, length_scale_bounds=(0.0, 10.0))
>>> for hyperparameter in kernel.hyperparameters: print(hyperparameter)
Hyperparameter(name='k1__k1__constant_value', value_type='numeric', bounds=array([[  0.,  10.]]), n_elements=1, fixed=False)
Hyperparameter(name='k1__k2__length_scale', value_type='numeric', bounds=array([[  0.,  10.]]), n_elements=1, fixed=False)
Hyperparameter(name='k2__length_scale', value_type='numeric', bounds=array([[  0.,  10.]]), n_elements=1, fixed=False)
>>> params = kernel.get_params()
>>> for key in sorted(params): print("%s : %s" % (key, params[key]))
k1 : 1**2 * RBF(length_scale=0.5)
k1__k1 : 1**2
k1__k1__constant_value : 1.0
k1__k1__constant_value_bounds : (0.0, 10.0)
k1__k2 : RBF(length_scale=0.5)
k1__k2__length_scale : 0.5
k1__k2__length_scale_bounds : (0.0, 10.0)
k2 : RBF(length_scale=2)
k2__length_scale : 2.0
k2__length_scale_bounds : (0.0, 10.0)
>>> print(kernel.theta)  # Note: log-transformed
[ 0.         -0.69314718  0.69314718]
>>> print(kernel.bounds)  # Note: log-transformed
[[       -inf  2.30258509]
 [       -inf  2.30258509]
 [       -inf  2.30258509]]

All Gaussian process kernels are interoperable with sklearn.metrics.pairwise and vice versa: instances
of subclasses of Kernel can be passed as metric to pairwise_kernels from sklearn.metrics.pairwise.
Moreover, kernel functions from pairwise can be used as GP kernels by using the wrapper class PairwiseKernel.
The only caveat is that the gradient of the hyperparameters is not analytic but numeric and all those kernels support
only isotropic distances. The parameter gamma is considered to be a hyperparameter and may be optimized. The other
kernel parameters are set directly at initialization and are kept fixed.
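A minimal sketch of this interoperability (toy data; the laplacian metric is chosen arbitrarily):
>>> import numpy as np
>>> from sklearn.gaussian_process.kernels import RBF, PairwiseKernel
>>> from sklearn.metrics.pairwise import pairwise_kernels
>>> X = np.array([[0.0], [1.0], [2.0]])
>>> K = pairwise_kernels(X, metric=RBF(length_scale=1.0))       # GP kernel used as metric
>>> gp_kernel = PairwiseKernel(metric='laplacian', gamma=1.0)   # pairwise kernel wrapped as GP kernel
>>> K2 = gp_kernel(X)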
Basic kernels
The ConstantKernel kernel can be used as part of a Product kernel where it scales the magnitude of the other
factor (kernel) or as part of a Sum kernel, where it modifies the mean of the Gaussian process. It depends on a
parameter 𝑐𝑜𝑛𝑠𝑡𝑎𝑛𝑡_𝑣𝑎𝑙𝑢𝑒. It is defined as:
k(x_i, x_j) = constant_value \quad \forall\, x_i, x_j
The main use-case of the WhiteKernel kernel is as part of a sum-kernel where it explains the noise-component of
the signal. Tuning its parameter noise_level corresponds to estimating the noise-level. It is defined as:
𝑘(𝑥𝑖 , 𝑥𝑗 ) = 𝑛𝑜𝑖𝑠𝑒_𝑙𝑒𝑣𝑒𝑙 if 𝑥𝑖 == 𝑥𝑗 else 0
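As an illustrative sketch (synthetic data; hyperparameter values arbitrary), both kernels are typically used as parts of a composite kernel passed to a Gaussian process estimator:
>>> import numpy as np
>>> from sklearn.gaussian_process import GaussianProcessRegressor
>>> from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel
>>> rng = np.random.RandomState(0)
>>> X = rng.uniform(0, 5, 20)[:, np.newaxis]
>>> y = np.sin(X).ravel() + rng.normal(0, 0.1, X.shape[0])
>>> # ConstantKernel scales the RBF term; WhiteKernel accounts for i.i.d. noise on the targets
>>> kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
>>> gpr = GaussianProcessRegressor(kernel=kernel).fit(X, y)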
Kernel operators
Kernel operators take one or two base kernels and combine them into a new kernel. The Sum kernel takes two kernels
𝑘1 and 𝑘2 and combines them via 𝑘𝑠𝑢𝑚 (𝑋, 𝑌 ) = 𝑘1(𝑋, 𝑌 ) + 𝑘2(𝑋, 𝑌 ). The Product kernel takes two kernels 𝑘1
and 𝑘2 and combines them via 𝑘𝑝𝑟𝑜𝑑𝑢𝑐𝑡 (𝑋, 𝑌 ) = 𝑘1(𝑋, 𝑌 ) * 𝑘2(𝑋, 𝑌 ). The Exponentiation kernel takes one
base kernel and a scalar parameter exponent and combines them via k_exp(X, Y) = k(X, Y)^{exponent}.
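In code, these operators are available directly as Python operators on kernel objects; a minimal sketch:
>>> from sklearn.gaussian_process.kernels import ConstantKernel, RBF
>>> k1 = ConstantKernel(constant_value=2.0)
>>> k2 = RBF(length_scale=1.0)
>>> sum_kernel = k1 + k2        # Sum kernel
>>> product_kernel = k1 * k2    # Product kernel
>>> exp_kernel = k2 ** 2        # Exponentiation kernel with exponent 2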
Radial-basis function (RBF) kernel
The RBF kernel is a stationary kernel. It is also known as the “squared exponential” kernel. It is parameterized by a
length-scale parameter 𝑙 > 0, which can either be a scalar (isotropic variant of the kernel) or a vector with the same

number of dimensions as the inputs 𝑥 (anisotropic variant of the kernel). The kernel is given by:
k(x_i, x_j) = \exp\left(-\frac{1}{2} d(x_i / l, x_j / l)^2\right)
This kernel is infinitely differentiable, which implies that GPs with this kernel as covariance function have mean square
derivatives of all orders, and are thus very smooth. The prior and posterior of a GP resulting from an RBF kernel are
shown in the following figure:
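As a short sketch, the two variants differ only in how length_scale is specified (values arbitrary):
>>> from sklearn.gaussian_process.kernels import RBF
>>> iso = RBF(length_scale=1.0)            # one length-scale shared by all dimensions
>>> aniso = RBF(length_scale=[1.0, 10.0])  # one length-scale per feature dimension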

Matérn kernel
The Matern kernel is a stationary kernel and a generalization of the RBF kernel. It has an additional parameter 𝜈
which controls the smoothness of the resulting function. It is parameterized by a length-scale parameter 𝑙 > 0, which
can either be a scalar (isotropic variant of the kernel) or a vector with the same number of dimensions as the inputs 𝑥
(anisotropic variant of the kernel). The kernel is given by:
k(x_i, x_j) = \sigma^2 \frac{1}{\Gamma(\nu) 2^{\nu-1}} \Bigl(\gamma \sqrt{2\nu}\, d(x_i / l, x_j / l)\Bigr)^{\nu} K_\nu\Bigl(\gamma \sqrt{2\nu}\, d(x_i / l, x_j / l)\Bigr),
As ν → ∞, the Matérn kernel converges to the RBF kernel. When ν = 1/2, the Matérn kernel becomes identical to
the absolute exponential kernel, i.e.,
k(x_i, x_j) = \sigma^2 \exp\bigl(-\gamma\, d(x_i / l, x_j / l)\bigr) \qquad \nu = \tfrac{1}{2}
In particular, ν = 3/2:
k(x_i, x_j) = \sigma^2 \bigl(1 + \gamma \sqrt{3}\, d(x_i / l, x_j / l)\bigr) \exp\bigl(-\gamma \sqrt{3}\, d(x_i / l, x_j / l)\bigr) \qquad \nu = \tfrac{3}{2}
and ν = 5/2:
k(x_i, x_j) = \sigma^2 \Bigl(1 + \gamma \sqrt{5}\, d(x_i / l, x_j / l) + \tfrac{5}{3} \gamma^2 d(x_i / l, x_j / l)^2\Bigr) \exp\bigl(-\gamma \sqrt{5}\, d(x_i / l, x_j / l)\bigr) \qquad \nu = \tfrac{5}{2}
are popular choices for learning functions that are not infinitely differentiable (as assumed by the RBF kernel) but at
least once (𝜈 = 3/2) or twice differentiable (𝜈 = 5/2).
The flexibility of controlling the smoothness of the learned function via 𝜈 allows adapting to the properties of the
true underlying functional relation. The prior and posterior of a GP resulting from a Matérn kernel are shown in the
following figure:
See [RW2006], pp. 84 for further details regarding the different variants of the Matérn kernel.
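A brief sketch of selecting the smoothness (length-scale arbitrary):
>>> from sklearn.gaussian_process.kernels import Matern
>>> k_rough = Matern(length_scale=1.0, nu=0.5)   # absolute exponential kernel
>>> k_once = Matern(length_scale=1.0, nu=1.5)    # once differentiable functions
>>> k_twice = Matern(length_scale=1.0, nu=2.5)   # twice differentiable functions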
Rational quadratic kernel
The RationalQuadratic kernel can be seen as a scale mixture (an infinite sum) of RBF kernels with different
characteristic length-scales. It is parameterized by a length-scale parameter l > 0 and a scale mixture parameter α > 0.
Only the isotropic variant where 𝑙 is a scalar is supported at the moment. The kernel is given by:
k(x_i, x_j) = \left(1 + \frac{d(x_i, x_j)^2}{2 \alpha l^2}\right)^{-\alpha}
The prior and posterior of a GP resulting from a RationalQuadratic kernel are shown in the following figure:
Exp-Sine-Squared kernel
The ExpSineSquared kernel allows modeling periodic functions. It is parameterized by a length-scale parameter
𝑙 > 0 and a periodicity parameter 𝑝 > 0. Only the isotropic variant where 𝑙 is a scalar is supported at the moment.
The kernel is given by:
k(x_i, x_j) = \exp\left(-2 \bigl(\sin(\pi / p \cdot d(x_i, x_j)) / l\bigr)^2\right)
The prior and posterior of a GP resulting from an ExpSineSquared kernel are shown in the following figure:

Dot-Product kernel
The DotProduct kernel is non-stationary and can be obtained from linear regression by putting 𝑁 (0, 1) priors on
the coefficients of 𝑥𝑑 (𝑑 = 1, ..., 𝐷) and a prior of 𝑁 (0, 𝜎02 ) on the bias. The DotProduct kernel is invariant to a
rotation of the coordinates about the origin, but not translations. It is parameterized by a parameter 𝜎02 . For 𝜎02 = 0,
the kernel is called the homogeneous linear kernel, otherwise it is inhomogeneous. The kernel is given by
𝑘(𝑥𝑖 , 𝑥𝑗 ) = 𝜎02 + 𝑥𝑖 · 𝑥𝑗
The DotProduct kernel is commonly combined with exponentiation. An example with exponent 2 is shown in the
following figure:
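In code, such a kernel can be constructed as follows (a brief sketch; sigma_0 is arbitrary):
>>> from sklearn.gaussian_process.kernels import DotProduct
>>> kernel = DotProduct(sigma_0=1.0) ** 2   # inhomogeneous polynomial kernel of degree 2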

3.1.8 Cross decomposition
The cross decomposition module contains two main families of algorithms: the partial least squares (PLS) and the
canonical correlation analysis (CCA).
These families of algorithms are useful to find linear relations between two multivariate datasets: the X and Y arguments of the fit method are 2D arrays.

Cross decomposition algorithms find the fundamental relations between two matrices (X and Y). They are latent
variable approaches to modeling the covariance structures in these two spaces. They will try to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space.
PLS-regression is particularly suited when the matrix of predictors has more variables than observations, and when
there is multicollinearity among X values. By contrast, standard regression will fail in these cases.
Classes included in this module are PLSRegression, PLSCanonical, CCA and PLSSVD.
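A minimal sketch of fitting PLSRegression on synthetic data (shapes and number of components chosen arbitrarily):
>>> import numpy as np
>>> from sklearn.cross_decomposition import PLSRegression
>>> rng = np.random.RandomState(0)
>>> X = rng.normal(size=(20, 10))
>>> Y = np.dot(X[:, :2], rng.normal(size=(2, 3))) + 0.1 * rng.normal(size=(20, 3))
>>> pls = PLSRegression(n_components=2).fit(X, Y)
>>> Y_pred = pls.predict(X)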
Reference:
• J.A. Wegelin. A survey of Partial Least Squares (PLS) methods, with emphasis on the two-block case.

Examples:

• Compare cross decomposition methods

3.1.9 Naive Bayes
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive”
assumption of independence between every pair of features. Given a class variable 𝑦 and a dependent feature vector
𝑥1 through 𝑥𝑛 , Bayes’ theorem states the following relationship:
P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
Using the naive independence assumption that
P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y),
for all i, this relationship is simplified to
P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}
Since P(x_1, \dots, x_n) is constant given the input, we can use the following classification rule:
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
\Downarrow
\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y),

and we can use Maximum A Posteriori (MAP) estimation to estimate 𝑃 (𝑦) and 𝑃 (𝑥𝑖 | 𝑦); the former is then the
relative frequency of class 𝑦 in the training set.
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of 𝑃 (𝑥𝑖 |
𝑦).
In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering. They require a small amount of training data to
estimate the necessary parameters. (For theoretical reasons why naive Bayes works well, and on which types of data
it does, see the references below.)
Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling
of the class conditional feature distributions means that each distribution can be independently estimated as a one
dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.
On the flip side, although naive Bayes is known as a decent classifier, it is known to be a bad estimator, so the
probability outputs from predict_proba are not to be taken too seriously.
References:
• H. Zhang (2004). The optimality of Naive Bayes. Proc. FLAIRS.

Gaussian Naive Bayes
GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is
assumed to be Gaussian:
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)
The parameters 𝜎𝑦 and 𝜇𝑦 are estimated using maximum likelihood.
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> from sklearn.naive_bayes import GaussianNB
>>> gnb = GaussianNB()
>>> y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
>>> print("Number of mislabeled points out of a total %d points : %d"
...
% (iris.data.shape[0],(iris.target != y_pred).sum()))
Number of mislabeled points out of a total 150 points : 6

Multinomial Naive Bayes
MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two
classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts,
although tf-idf vectors are also known to work well in practice). The distribution is parametrized by vectors 𝜃𝑦 =
(𝜃𝑦1 , . . . , 𝜃𝑦𝑛 ) for each class 𝑦, where 𝑛 is the number of features (in text classification, the size of the vocabulary)
and 𝜃𝑦𝑖 is the probability 𝑃 (𝑥𝑖 | 𝑦) of feature 𝑖 appearing in a sample belonging to class 𝑦.
The parameters θ_y are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:
\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}
where N_{yi} = \sum_{x \in T} x_i is the number of times feature i appears in a sample of class y in the training set T, and N_y = \sum_{i=1}^{|T|} N_{yi} is the total count of all features for class y.
The smoothing prior α ≥ 0 accounts for features not present in the learning samples and prevents zero probabilities
in further computations. Setting 𝛼 = 1 is called Laplace smoothing, while 𝛼 < 1 is called Lidstone smoothing.
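A minimal sketch with random count data (values arbitrary):
>>> import numpy as np
>>> from sklearn.naive_bayes import MultinomialNB
>>> rng = np.random.RandomState(1)
>>> X = rng.randint(5, size=(6, 100))         # e.g. word counts for six documents
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> clf = MultinomialNB(alpha=1.0).fit(X, y)  # alpha=1.0 corresponds to Laplace smoothing
>>> pred = clf.predict(X[2:3])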
Bernoulli Naive Bayes
BernoulliNB implements the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features but each one is assumed to be a
binary-valued (Bernoulli, boolean) variable. Therefore, this class requires samples to be represented as binary-valued
feature vectors; if handed any other kind of data, a BernoulliNB instance may binarize its input (depending on the
binarize parameter).
The decision rule for Bernoulli naive Bayes is based on
𝑃 (𝑥𝑖 | 𝑦) = 𝑃 (𝑖 | 𝑦)𝑥𝑖 + (1 − 𝑃 (𝑖 | 𝑦))(1 − 𝑥𝑖 )
which differs from multinomial NB’s rule in that it explicitly penalizes the non-occurrence of a feature 𝑖 that is an
indicator for class 𝑦, where the multinomial variant would simply ignore a non-occurring feature.
In the case of text classification, word occurrence vectors (rather than word count vectors) may be used to train and
use this classifier. BernoulliNB might perform better on some datasets, especially those with shorter documents.
It is advisable to evaluate both models, if time permits.
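A minimal sketch with random binary data (values arbitrary):
>>> import numpy as np
>>> from sklearn.naive_bayes import BernoulliNB
>>> rng = np.random.RandomState(1)
>>> X = rng.randint(2, size=(6, 100))          # binary word-occurrence features
>>> y = np.array([1, 2, 3, 4, 4, 5])
>>> clf = BernoulliNB(binarize=0.0).fit(X, y)  # values above 0.0 are mapped to 1, others to 0
>>> pred = clf.predict(X[2:3])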
References:
• C.D. Manning, P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge University Press, pp. 234-265.
• A. McCallum and K. Nigam (1998). A comparison of event models for Naive Bayes text classification. Proc.
AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.
• V. Metsis, I. Androutsopoulos and G. Paliouras (2006). Spam filtering with Naive Bayes – Which Naive
Bayes? 3rd Conf. on Email and Anti-Spam (CEAS).

Out-of-core naive Bayes model fitting
Naive Bayes models can be used to tackle large scale classification problems for which the full training set might not fit
in memory. To handle this case, MultinomialNB, BernoulliNB, and GaussianNB expose a partial_fit
method that can be used incrementally as done with other classifiers as demonstrated in Out-of-core classification of
text documents. All naive Bayes classifiers support sample weighting.
Contrary to the fit method, the first call to partial_fit needs to be passed the list of all the expected class labels.
For an overview of available strategies in scikit-learn, see also the out-of-core learning documentation.
Note: The partial_fit method call of naive Bayes models introduces some computational overhead. It is
recommended to use data chunk sizes that are as large as possible, that is, as large as the available RAM allows.
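A minimal sketch of incremental fitting with MultinomialNB (toy chunks; the counts are arbitrary):
>>> import numpy as np
>>> from sklearn.naive_bayes import MultinomialNB
>>> clf = MultinomialNB()
>>> all_classes = np.array([0, 1, 2])                  # every label that may ever appear
>>> X1 = np.array([[1, 0, 2], [0, 3, 1], [2, 1, 0]])
>>> y1 = np.array([0, 1, 2])
>>> _ = clf.partial_fit(X1, y1, classes=all_classes)   # first chunk: classes is required
>>> X2 = np.array([[0, 1, 1], [3, 0, 2]])
>>> y2 = np.array([1, 0])
>>> _ = clf.partial_fit(X2, y2)                        # later chunks: classes can be omitted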

3.1.10 Decision Trees
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The
goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the
data features.
For instance, in the example below, decision trees learn from data to approximate a sine curve with a set of if-then-else
decision rules. The deeper the tree, the more complex the decision rules and the fitter the model.
Some advantages of decision trees are:
• Simple to understand and to interpret. Trees can be visualised.
• Requires little data preparation. Other techniques often require data normalisation, dummy variables need to be
created and blank values to be removed. Note however that this module does not support missing values.
• The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
• Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing
datasets that have only one type of variable. See algorithms for more information.
• Able to handle multi-output problems.
• Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily
explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may
be more difficult to interpret.
• Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the
model.

• Performs well even if its assumptions are somewhat violated by the true model from which the data were
generated.
The disadvantages of decision trees include:
• Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Mechanisms such as pruning (not currently supported), setting the minimum number of samples required
at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
• Decision trees can be unstable because small variations in the data might result in a completely different tree
being generated. This problem is mitigated by using decision trees within an ensemble.
• The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality
and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic
algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms
cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in
an ensemble learner, where the features and samples are randomly sampled with replacement.
• There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity
or multiplexer problems.
• Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the
dataset prior to fitting with the decision tree.
Classification
DecisionTreeClassifier is a class capable of performing multi-class classification on a dataset.
As with other classifiers, DecisionTreeClassifier takes as input two arrays: an array X, sparse or dense,
of size [n_samples, n_features] holding the training samples, and an array Y of integer values, size
[n_samples], holding the class labels for the training samples:

>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)

After being fitted, the model can then be used to predict the class of samples:
>>> clf.predict([[2., 2.]])
array([1])

Alternatively, the probability of each class can be predicted, which is the fraction of training samples of the same class
in a leaf:
>>> clf.predict_proba([[2., 2.]])
array([[ 0., 1.]])

DecisionTreeClassifier is capable of both binary (where the labels are [-1, 1]) classification and multiclass
(where the labels are [0, . . . , K-1]) classification.
Using the Iris dataset, we can construct a tree as follows:
>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(iris.data, iris.target)

Once trained, we can export the tree in Graphviz format using the export_graphviz exporter. If you use the
conda package manager, the graphviz binaries and the python package can be installed with
conda install python-graphviz
Alternatively binaries for graphviz can be downloaded from the graphviz project homepage, and the Python wrapper
installed from pypi with pip install graphviz.
Below is an example graphviz export of the above tree trained on the entire iris dataset; the results are saved in an
output file iris.pdf :
>>> import graphviz
>>> dot_data = tree.export_graphviz(clf, out_file=None)
>>> graph = graphviz.Source(dot_data)
>>> graph.render("iris")

The export_graphviz exporter also supports a variety of aesthetic options, including coloring nodes by their class
(or value for regression) and using explicit variable and class names if desired. Jupyter notebooks also render these
plots inline automatically:
>>> dot_data = tree.export_graphviz(clf, out_file=None,
...                      feature_names=iris.feature_names,
...                      class_names=iris.target_names,
...                      filled=True, rounded=True,
...                      special_characters=True)
>>> graph = graphviz.Source(dot_data)
>>> graph

[Figure: graphviz export of the decision tree fitted on the iris data. The root node splits on petal length (cm) <= 2.45; deeper nodes split on petal width (cm), petal length (cm) and sepal length (cm), and each node is annotated with its gini impurity, number of samples, per-class value counts and majority class (setosa, versicolor or virginica).]
After being fitted, the model can then be used to predict the class of samples:
>>> clf.predict(iris.data[:1, :])
array([0])

Alternatively, the probability of each class can be predicted, which is the fraction of training samples of the same class
in a leaf:
>>> clf.predict_proba(iris.data[:1, :])
array([[ 1., 0., 0.]])

Examples:
• Plot the decision surface of a decision tree on the iris dataset

Regression
Decision trees can also be applied to regression problems, using the DecisionTreeRegressor class.
As in the classification setting, the fit method will take as argument arrays X and y, only that in this case y is expected
to have floating point values instead of integer values:
>>> from sklearn import tree
>>> X = [[0, 0], [2, 2]]
>>> y = [0.5, 2.5]
>>> clf = tree.DecisionTreeRegressor()
>>> clf = clf.fit(X, y)
>>> clf.predict([[1, 1]])
array([ 0.5])

Examples:
• Decision Tree Regression

Multi-output problems
A multi-output problem is a supervised learning problem with several outputs to predict, that is when Y is a 2d array
of size [n_samples, n_outputs].
When there is no correlation between the outputs, a very simple way to solve this kind of problem is to build n
independent models, i.e. one for each output, and then to use those models to independently predict each one of the
n outputs. However, because it is likely that the output values related to the same input are themselves correlated, an
often better way is to build a single model capable of predicting simultaneously all n outputs. First, it requires lower
training time since only a single estimator is built. Second, the generalization accuracy of the resulting estimator may
often be increased.
With regard to decision trees, this strategy can readily be used to support multi-output problems. This requires the
following changes:
• Store n output values in leaves, instead of 1;
• Use splitting criteria that compute the average reduction across all n outputs.
This module offers support for multi-output problems by implementing this strategy in both
DecisionTreeClassifier and DecisionTreeRegressor. If a decision tree is fit on an output
array Y of size [n_samples, n_outputs] then the resulting estimator will:
• Output n_output values upon predict;
• Output a list of n_output arrays of class probabilities upon predict_proba.
The use of multi-output trees for regression is demonstrated in Multi-output Decision Tree Regression. In this example,
the input X is a single real value and the outputs Y are the sine and cosine of X.
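A condensed sketch of that setting (synthetic data; the tree depth is chosen arbitrarily):
>>> import numpy as np
>>> from sklearn.tree import DecisionTreeRegressor
>>> rng = np.random.RandomState(0)
>>> X = np.sort(5 * rng.rand(80, 1), axis=0)
>>> y = np.column_stack([np.sin(X).ravel(), np.cos(X).ravel()])  # two outputs per sample
>>> reg = DecisionTreeRegressor(max_depth=5).fit(X, y)
>>> reg.predict([[1.0]]).shape        # one predicted value per output
(1, 2)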
The use of multi-output trees for classification is demonstrated in Face completion with a multi-output estimators. In
this example, the inputs X are the pixels of the upper half of faces and the outputs Y are the pixels of the lower half of
those faces.
Examples:
• Multi-output Decision Tree Regression
• Face completion with a multi-output estimators

References:
• M. Dumont et al, Fast multi-class image annotation with random subwindows and multiple output randomized
trees, International Conference on Computer Vision Theory and Applications 2009

Complexity
In general, the run time cost to construct a balanced binary tree is O(n_samples * n_features * log(n_samples)) and query
time O(log(n_samples)). Although the tree construction algorithm attempts to generate balanced trees, they will not
always be balanced. Assuming that the subtrees remain approximately balanced, the cost at each node consists of
searching through O(n_features) features to find the one that offers the largest reduction in entropy. This has a cost of
O(n_features * n_samples * log(n_samples)) at each node, leading to a total cost over the entire tree (by summing the cost at
each node) of O(n_features * n_samples^2 * log(n_samples)).
Scikit-learn offers a more efficient implementation for the construction of decision trees. A naive implementation
(as above) would recompute the class label histograms (for classification) or the means (for regression) at each
new split point along a given feature. Presorting the feature over all relevant samples, and retaining a running label count, will reduce the complexity at each node to O(n_features * log(n_samples)), which results in a total cost of
O(n_features * n_samples * log(n_samples)). This is an option for all tree based algorithms. By default it is turned on for
gradient boosting, where in general it makes training faster, but turned off for all other algorithms as it tends to slow
down training when training deep trees.
Tips on practical use
• Decision trees tend to overfit on data with a large number of features. Getting the right ratio of samples to
number of features is important, since a tree with few samples in high dimensional space is very likely to
overfit.
• Consider performing dimensionality reduction (PCA, ICA, or Feature selection) beforehand to give your tree a
better chance of finding features that are discriminative.
• Visualise your tree as you are training by using the export function. Use max_depth=3 as an initial tree
depth to get a feel for how the tree is fitting to your data, and then increase the depth.
• Remember that the number of samples required to populate the tree doubles for each additional level the tree
grows to. Use max_depth to control the size of the tree to prevent overfitting.
• Use min_samples_split or min_samples_leaf to control the number of samples at a leaf node. A
very small number will usually mean the tree will overfit, whereas a large number will prevent the tree from
learning the data. Try min_samples_leaf=5 as an initial value. If the sample size varies greatly, a float
number can be used as percentage in these two parameters. The main difference between the two is that
min_samples_leaf guarantees a minimum number of samples in a leaf, while min_samples_split
can create arbitrary small leaves, though min_samples_split is more common in the literature.
• Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant.
Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value. Also note that
weight-based pre-pruning criteria, such as min_weight_fraction_leaf, will then be less biased toward
dominant classes than criteria that are not aware of the sample weights, like min_samples_leaf.
• If the samples are weighted, it will be easier to optimize the tree structure using weight-based pre-pruning
criteria such as min_weight_fraction_leaf, which ensure that leaf nodes contain at least a fraction of
the overall sum of the sample weights.
• All decision trees use np.float32 arrays internally. If training data is not in this format, a copy of the dataset
will be made.
• If the input matrix X is very sparse, it is recommended to convert to sparse csc_matrix before calling fit and
sparse csr_matrix before calling predict. Training time can be orders of magnitude faster for a sparse matrix
input compared to a dense matrix when features have zero values in most of the samples.
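A minimal sketch of this pattern on toy data:
>>> from scipy.sparse import csc_matrix, csr_matrix
>>> from sklearn.tree import DecisionTreeClassifier
>>> X = [[0, 0], [0, 1], [1, 0], [1, 1]]
>>> y = [0, 1, 1, 0]
>>> clf = DecisionTreeClassifier().fit(csc_matrix(X), y)   # csc format is efficient for fit
>>> pred = clf.predict(csr_matrix([[1, 1]]))               # csr format is efficient for predict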
Tree algorithms: ID3, C4.5, C5.0 and CART
What are all the various decision tree algorithms and how do they differ from each other? Which one is implemented
in scikit-learn?
ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. The algorithm creates a multiway tree, finding
for each node (i.e. in a greedy manner) the categorical feature that will yield the largest information gain for categorical
targets. Trees are grown to their maximum size and then a pruning step is usually applied to improve the ability of the
tree to generalise to unseen data.
C4.5 is the successor to ID3 and removed the restriction that features must be categorical by dynamically defining
a discrete attribute (based on numerical variables) that partitions the continuous attribute value into a discrete set of
intervals. C4.5 converts the trained trees (i.e. the output of the ID3 algorithm) into sets of if-then rules. The accuracy
of each rule is then evaluated to determine the order in which they should be applied. Pruning is done by removing a
rule’s precondition if the accuracy of the rule improves without it.
C5.0 is Quinlan’s latest version, released under a proprietary license. It uses less memory and builds smaller rulesets
than C4.5 while being more accurate.
CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it supports numerical target
variables (regression) and does not compute rule sets. CART constructs binary trees using the feature and threshold
that yield the largest information gain at each node.
scikit-learn uses an optimised version of the CART algorithm.
Mathematical formulation
Given training vectors x_i ∈ R^n, i = 1, ..., l and a label vector y ∈ R^l, a decision tree recursively partitions the space
such that the samples with the same labels are grouped together.
Let the data at node 𝑚 be represented by 𝑄. For each candidate split 𝜃 = (𝑗, 𝑡𝑚 ) consisting of a feature 𝑗 and threshold
𝑡𝑚 , partition the data into 𝑄𝑙𝑒𝑓 𝑡 (𝜃) and 𝑄𝑟𝑖𝑔ℎ𝑡 (𝜃) subsets
Q_{left}(\theta) = \{(x, y) \mid x_j \le t_m\}
Q_{right}(\theta) = Q \setminus Q_{left}(\theta)
The impurity at 𝑚 is computed using an impurity function 𝐻(), the choice of which depends on the task being solved
(classification or regression)
G(Q, \theta) = \frac{n_{left}}{N_m} H(Q_{left}(\theta)) + \frac{n_{right}}{N_m} H(Q_{right}(\theta))

Select the parameters that minimise the impurity
\theta^* = \operatorname{argmin}_\theta\, G(Q, \theta)
Recurse for subsets Q_{left}(\theta^*) and Q_{right}(\theta^*) until the maximum allowable depth is reached, N_m < \min_{samples} or N_m = 1.
Classification criteria
If a target is a classification outcome taking on values 0, 1, ..., K-1, for node m, representing a region R_m with N_m observations, let
p_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)
be the proportion of class k observations in node m.
Common measures of impurity are Gini
H(X_m) = \sum_k p_{mk} (1 - p_{mk})
Cross-Entropy
H(X_m) = -\sum_k p_{mk} \log(p_{mk})
and Misclassification
H(X_m) = 1 - \max(p_{mk})
where X_m is the training data in node m.
Regression criteria
If the target is a continuous value, then for node m, representing a region R_m with N_m observations, common criteria
to minimise when determining locations for future splits are Mean Squared Error, which minimizes the L2 error
using mean values at terminal nodes, and Mean Absolute Error, which minimizes the L1 error using median values at
terminal nodes.
Mean Squared Error:
c_m = \frac{1}{N_m} \sum_{i \in N_m} y_i
H(X_m) = \frac{1}{N_m} \sum_{i \in N_m} (y_i - c_m)^2
Mean Absolute Error:
\bar{y}_m = \frac{1}{N_m} \sum_{i \in N_m} y_i
H(X_m) = \frac{1}{N_m} \sum_{i \in N_m} |y_i - \bar{y}_m|
where X_m is the training data in node m.
References:
• https://en.wikipedia.org/wiki/Decision_tree_learning
• https://en.wikipedia.org/wiki/Predictive_analytics
• L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont,
CA, 1984.
• J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
• T. Hastie, R. Tibshirani and J. Friedman. Elements of Statistical Learning, Springer, 2009.

3.1.11 Ensemble methods
The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning
algorithm in order to improve generalizability / robustness over a single estimator.
Two families of ensemble methods are usually distinguished:
• In averaging methods, the driving principle is to build several estimators independently and then to average
their predictions. On average, the combined estimator is usually better than any single base estimator
because its variance is reduced.
Examples: Bagging methods, Forests of randomized trees, . . .
• By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the
combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.
Examples: AdaBoost, Gradient Tree Boosting, . . .

Bagging meta-estimator
In ensemble algorithms, bagging methods form a class of algorithms which build several instances of a black-box
estimator on random subsets of the original training set and then aggregate their individual predictions to form a final
prediction. These methods are used as a way to reduce the variance of a base estimator (e.g., a decision tree), by
introducing randomization into its construction procedure and then making an ensemble out of it. In many cases,
bagging methods constitute a very simple way to improve with respect to a single model, without making it necessary
to adapt the underlying base algorithm. As they provide a way to reduce overfitting, bagging methods work best with
strong and complex models (e.g., fully developed decision trees), in contrast with boosting methods which usually
work best with weak models (e.g., shallow decision trees).
Bagging methods come in many flavours but mostly differ from each other by the way they draw random subsets of
the training set:
• When random subsets of the dataset are drawn as random subsets of the samples, then this algorithm is known
as Pasting [B1999].
• When samples are drawn with replacement, then the method is known as Bagging [B1996].
• When random subsets of the dataset are drawn as random subsets of the features, then the method is known as
Random Subspaces [H1998].
• Finally, when base estimators are built on subsets of both samples and features, then the method is known as
Random Patches [LG2012].
In scikit-learn, bagging methods are offered as a unified BaggingClassifier meta-estimator (resp.
BaggingRegressor), taking as input a user-specified base estimator along with parameters specifying the strategy
to draw random subsets. In particular, max_samples and max_features control the size of the subsets (in terms
of samples and features), while bootstrap and bootstrap_features control whether samples and features
are drawn with or without replacement. When using a subset of the available samples the generalization accuracy can
be estimated with the out-of-bag samples by setting oob_score=True. As an example, the snippet below illustrates
how to instantiate a bagging ensemble of KNeighborsClassifier base estimators, each built on random subsets
of 50% of the samples and 50% of the features.
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> bagging = BaggingClassifier(KNeighborsClassifier(),
...                             max_samples=0.5, max_features=0.5)

Examples:
• Single estimator versus bagging: bias-variance decomposition

Forests of randomized trees
The sklearn.ensemble module includes two averaging algorithms based on randomized decision trees: the RandomForest algorithm and the Extra-Trees method. Both algorithms are perturb-and-combine techniques [B1998]
specifically designed for trees. This means a diverse set of classifiers is created by introducing randomness in the
classifier construction. The prediction of the ensemble is given as the averaged prediction of the individual classifiers.

As with other classifiers, forest classifiers have to be fitted with two arrays: a sparse or dense array X of size [n_samples,
n_features] holding the training samples, and an array Y of size [n_samples] holding the target values (class
labels) for the training samples:
>>> from sklearn.ensemble import RandomForestClassifier
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = RandomForestClassifier(n_estimators=10)
>>> clf = clf.fit(X, Y)

Like decision trees, forests of trees also extend to multi-output problems (if Y is an array of size [n_samples,
n_outputs]).
Random Forests
In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the
ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition,
when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all
features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this
randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but,
due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding
an overall better model.
In contrast to the original publication [B2001], the scikit-learn implementation combines classifiers by averaging their
probabilistic prediction, instead of letting each classifier vote for a single class.
Extremely Randomized Trees
In extremely randomized trees (see ExtraTreesClassifier and ExtraTreesRegressor classes), randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features
is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule. This usually allows
the variance of the model to be reduced a bit more, at the expense of a slightly greater increase in bias:
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.datasets import make_blobs
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.tree import DecisionTreeClassifier

>>> X, y = make_blobs(n_samples=10000, n_features=10, centers=100,
...     random_state=0)

>>> clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
...     random_state=0)
>>> scores = cross_val_score(clf, X, y)
>>> scores.mean()
0.97...
>>> clf = RandomForestClassifier(n_estimators=10, max_depth=None,
...     min_samples_split=2, random_state=0)
>>> scores = cross_val_score(clf, X, y)
>>> scores.mean()
0.999...
>>> clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
...     min_samples_split=2, random_state=0)
>>> scores = cross_val_score(clf, X, y)
>>> scores.mean() > 0.999
True

Parameters
The main parameters to adjust when using these methods are n_estimators and max_features. The former
is the number of trees in the forest. The larger the better, but also the longer it will take to compute. In addition,
note that results will stop getting significantly better beyond a critical number of trees. The latter is the size of
the random subsets of features to consider when splitting a node. The lower the greater the reduction of variance,
but also the greater the increase in bias. Empirical good default values are max_features=n_features for
regression problems, and max_features=sqrt(n_features) for classification tasks (where n_features is
the number of features in the data). Good results are often achieved when setting max_depth=None in combination
with min_samples_split=2 (i.e., when fully developing the trees). Bear in mind though that these values are
usually not optimal, and might result in models that consume a lot of RAM. The best parameter values should always be
cross-validated. In addition, note that in random forests, bootstrap samples are used by default (bootstrap=True)
while the default strategy for extra-trees is to use the whole dataset (bootstrap=False). When using bootstrap
sampling the generalization accuracy can be estimated on the left out or out-of-bag samples. This can be enabled by
setting oob_score=True.
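As an illustrative sketch (synthetic data; parameter values chosen for illustration, not tuned):
>>> from sklearn.datasets import make_classification
>>> from sklearn.ensemble import RandomForestClassifier
>>> X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
>>> clf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
...                              oob_score=True, random_state=0).fit(X, y)
>>> 0.0 <= clf.oob_score_ <= 1.0   # out-of-bag estimate of the generalization accuracy
True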
Note: The size of the model with the default parameters is 𝑂(𝑀 * 𝑁 * 𝑙𝑜𝑔(𝑁 )), where 𝑀 is the number of
trees and 𝑁 is the number of samples. In order to reduce the size of the model, you can change these parameters:
min_samples_split, min_samples_leaf, max_leaf_nodes and max_depth.

Parallelization
Finally, this module also features the parallel construction of the trees and the parallel computation of the predictions
through the n_jobs parameter. If n_jobs=k then computations are partitioned into k jobs, and run on k cores of
the machine. If n_jobs=-1 then all cores available on the machine are used. Note that because of inter-process
communication overhead, the speedup might not be linear (i.e., using k jobs will unfortunately not be k times as fast).
Significant speedup can still be achieved though when building a large number of trees, or when building a single tree
requires a fair amount of time (e.g., on large datasets).
Examples:
• Plot the decision surfaces of ensembles of trees on the iris dataset
• Pixel importances with a parallel forest of trees
• Face completion with a multi-output estimators

References
• P. Geurts, D. Ernst., and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006.

Feature importance evaluation
The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance
of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute
to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they
contribute to can thus be used as an estimate of the relative importance of the features.
By averaging those expected activity rates over several randomized trees one can reduce the variance of such an
estimate and use it for feature selection.
The following example shows a color-coded representation of the relative importances of each individual pixel for a
face recognition task using an ExtraTreesClassifier model.
In practice those estimates are stored as an attribute named feature_importances_ on the fitted model. This
is an array with shape (n_features,) whose values are positive and sum to 1.0. The higher the value, the more
important is the contribution of the matching feature to the prediction function.
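A short sketch of reading this attribute (the forest size is arbitrary):
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> iris = load_iris()
>>> clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(iris.data, iris.target)
>>> clf.feature_importances_.shape   # one non-negative value per feature, summing to 1.0
(4,)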
Examples:
• Pixel importances with a parallel forest of trees
• Feature importances with forests of trees

Totally Random Trees Embedding
RandomTreesEmbedding implements an unsupervised transformation of the data. Using a forest of completely
random trees, RandomTreesEmbedding encodes the data by the indices of the leaves a data point ends up in. This
index is then encoded in a one-of-K manner, leading to a high dimensional, sparse binary coding. This coding can be
computed very efficiently and can then be used as a basis for other learning tasks. The size and sparsity of the code

can be influenced by choosing the number of trees and the maximum depth per tree. For each tree in the ensemble, the
coding contains one entry of one. The size of the coding is at most n_estimators * 2 ** max_depth, the
maximum number of leaves in the forest.
As neighboring data points are more likely to lie within the same leaf of a tree, the transformation performs an implicit,
non-parametric density estimation.
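A minimal sketch (toy data; tree count and depth arbitrary):
>>> from sklearn.ensemble import RandomTreesEmbedding
>>> X = [[0, 0], [1, 0], [0.2, 1], [1, 1]]
>>> hasher = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
>>> X_transformed = hasher.fit_transform(X)    # sparse one-hot encoding of leaf indices
>>> X_transformed.shape[1] <= 10 * 2 ** 3      # at most n_estimators * 2 ** max_depth columns
True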
Examples:
• Hashing feature transformation using Totally Random Trees
• Manifold learning on handwritten digits: Locally Linear Embedding, Isomap. . . compares non-linear dimensionality reduction techniques on handwritten digits.
• Feature transformations with ensembles of trees compares supervised and unsupervised tree based feature
transformations.
See also:
Manifold learning techniques can also be useful to derive non-linear representations of feature space; these approaches also focus on dimensionality reduction.
AdaBoost
The module sklearn.ensemble includes the popular boosting algorithm AdaBoost, introduced in 1995 by Freund
and Schapire [FS1995].
The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than
random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from
all of them are then combined through a weighted majority vote (or sum) to produce the final prediction. The data
modifications at each so-called boosting iteration consist of applying weights 𝑤1 , 𝑤2 , . . . , 𝑤𝑁 to each of the training
samples. Initially, those weights are all set to 𝑤𝑖 = 1/𝑁 , so that the first step simply trains a weak learner on the
original data. For each successive iteration, the sample weights are individually modified and the learning algorithm is
reapplied to the reweighted data. At a given step, those training examples that were incorrectly predicted by the boosted
model induced at the previous step have their weights increased, whereas the weights are decreased for those that were
predicted correctly. As iterations proceed, examples that are difficult to predict receive ever-increasing influence. Each
subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the
sequence [HTF].

AdaBoost can be used both for classification and regression problems:
• For multi-class classification, AdaBoostClassifier implements AdaBoost-SAMME and AdaBoost-SAMME.R [ZZRH2009].
• For regression, AdaBoostRegressor implements AdaBoost.R2 [D1997].
Usage
The following example shows how to fit an AdaBoost classifier with 100 weak learners:
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import AdaBoostClassifier
>>> iris = load_iris()
>>> clf = AdaBoostClassifier(n_estimators=100)
>>> scores = cross_val_score(clf, iris.data, iris.target)
>>> scores.mean()
0.9...

The number of weak learners is controlled by the parameter n_estimators. The learning_rate parameter
controls the contribution of the weak learners in the final combination. By default, weak learners are decision stumps.
Different weak learners can be specified through the base_estimator parameter. The main parameters to tune to

obtain good results are n_estimators and the complexity of the base estimators (e.g., its depth max_depth or
minimum required number of samples at a leaf min_samples_leaf in case of decision trees).
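For instance, a sketch of swapping the default stump for a slightly deeper tree (parameter values arbitrary):
>>> from sklearn.ensemble import AdaBoostClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2),
...                          n_estimators=200, learning_rate=0.5)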
Examples:
• Discrete versus Real AdaBoost compares the classification error of a decision stump, decision tree, and a
boosted decision stump using AdaBoost-SAMME and AdaBoost-SAMME.R.
• Multi-class AdaBoosted Decision Trees shows the performance of AdaBoost-SAMME and AdaBoost-SAMME.R on a multi-class problem.
• Two-class AdaBoost shows the decision boundary and decision function values for a non-linearly separable
two-class problem using AdaBoost-SAMME.
• Decision Tree Regression with AdaBoost demonstrates regression with the AdaBoost.R2 algorithm.

Gradient Tree Boosting
Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT) is a generalization of boosting to arbitrary
differentiable loss functions. GBRT is an accurate and effective off-the-shelf procedure that can be used for both
regression and classification problems. Gradient Tree Boosting models are used in a variety of areas including Web
search ranking and ecology.
The advantages of GBRT are:
• Natural handling of data of mixed type (= heterogeneous features)
• Predictive power
• Robustness to outliers in output space (via robust loss functions)
The disadvantages of GBRT are:
• Scalability: due to the sequential nature of boosting, it can hardly be parallelized.
The module sklearn.ensemble provides methods for both classification and regression via gradient boosted
regression trees.
Classification
GradientBoostingClassifier supports both binary and multi-class classification. The following example
shows how to fit a gradient boosting classifier with 100 decision stumps as weak learners:
>>> from sklearn.datasets import make_hastie_10_2
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> X, y = make_hastie_10_2(random_state=0)
>>> X_train, X_test = X[:2000], X[2000:]
>>> y_train, y_test = y[:2000], y[2000:]
>>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
...     max_depth=1, random_state=0).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.913...

The number of weak learners (i.e. regression trees) is controlled by the parameter n_estimators; The size of each
tree can be controlled either by setting the tree depth via max_depth or by setting the number of leaf nodes via
max_leaf_nodes. The learning_rate is a hyper-parameter in the range (0.0, 1.0] that controls overfitting via
shrinkage .
Note: Classification with more than 2 classes requires the induction of n_classes regression trees at each
iteration, thus, the total number of induced trees equals n_classes * n_estimators. For datasets with
a large number of classes we strongly recommend using RandomForestClassifier as an alternative to
GradientBoostingClassifier .

Regression
GradientBoostingRegressor supports a number of different loss functions for regression which can be specified via the argument loss; the default loss function for regression is least squares ('ls').
>>> import numpy as np
>>> from sklearn.metrics import mean_squared_error
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.ensemble import GradientBoostingRegressor

>>> X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
>>> X_train, X_test = X[:200], X[200:]
>>> y_train, y_test = y[:200], y[200:]
>>> est = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
...     max_depth=1, random_state=0, loss='ls').fit(X_train, y_train)
>>> mean_squared_error(y_test, est.predict(X_test))
5.00...

The figure below shows the results of applying GradientBoostingRegressor with least squares loss and 500
base learners to the Boston house price dataset (sklearn.datasets.load_boston). The plot on the left shows
the train and test error at each iteration. The train error at each iteration is stored in the train_score_ attribute
of the gradient boosting model. The test error at each iteration can be obtained via the staged_predict method
which returns a generator that yields the predictions at each stage. Plots like these can be used to determine the optimal
number of trees (i.e. n_estimators) by early stopping. The plot on the right shows the feature importances which
can be obtained via the feature_importances_ property.
Examples:
• Gradient Boosting regression
• Gradient Boosting Out-of-Bag estimates

Fitting additional weak-learners
Both GradientBoostingRegressor and GradientBoostingClassifier support warm_start=True, which allows you to add more estimators to an already fitted model.

>>> _ = est.set_params(n_estimators=200, warm_start=True)  # set warm_start and new nr of trees
>>> _ = est.fit(X_train, y_train)  # fit additional 100 trees to est
>>> mean_squared_error(y_test, est.predict(X_test))
3.84...

Controlling the tree size
The size of the regression tree base learners defines the level of variable interactions that can be captured by the
gradient boosting model. In general, a tree of depth h can capture interactions of order h . There are two ways in
which the size of the individual regression trees can be controlled.
If you specify max_depth=h then complete binary trees of depth h will be grown. Such trees will have (at most)
2**h leaf nodes and 2**h - 1 split nodes.
Alternatively, you can control the tree size by specifying the number of leaf nodes via the parameter
max_leaf_nodes. In this case, trees will be grown using best-first search where nodes with the highest improvement in impurity will be expanded first. A tree with max_leaf_nodes=k has k - 1 split nodes and thus can
model interactions of up to order max_leaf_nodes - 1 .
We found that max_leaf_nodes=k gives comparable results to max_depth=k-1 but is significantly faster to
train at the expense of a slightly higher training error. The parameter max_leaf_nodes corresponds to the variable
J in the chapter on gradient boosting in [F2001] and is related to the parameter interaction.depth in R’s gbm
package where max_leaf_nodes == interaction.depth + 1 .
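A brief sketch of the two ways of limiting tree size (values arbitrary):
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> est_depth = GradientBoostingRegressor(max_depth=3)        # complete binary trees of depth 3
>>> est_leaves = GradientBoostingRegressor(max_leaf_nodes=4)  # best-first trees with 4 leaves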
Mathematical formulation
GBRT considers additive models of the following form:

F(x) = \sum_{m=1}^{M} \gamma_m h_m(x)

where ℎ𝑚 (𝑥) are the basis functions which are usually called weak learners in the context of boosting. Gradient Tree
Boosting uses decision trees of fixed size as weak learners. Decision trees have a number of abilities that make them
valuable for boosting, namely the ability to handle data of mixed type and the ability to model complex functions.
Similar to other boosting algorithms GBRT builds the additive model in a forward stagewise fashion:

𝐹𝑚 (𝑥) = 𝐹𝑚−1 (𝑥) + 𝛾𝑚 ℎ𝑚 (𝑥)
At each stage the decision tree ℎ𝑚 (𝑥) is chosen to minimize the loss function 𝐿 given the current model 𝐹𝑚−1 and its
fit 𝐹𝑚−1 (𝑥𝑖 )

F_m(x) = F_{m-1}(x) + \arg\min_h \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + h(x))

The initial model 𝐹0 is problem specific; for least-squares regression one usually chooses the mean of the target values.
Note: The initial model can also be specified via the init argument. The passed object has to implement fit and
predict.
Gradient Boosting attempts to solve this minimization problem numerically via steepest descent: The steepest descent
direction is the negative gradient of the loss function evaluated at the current model 𝐹𝑚−1 which can be calculated for
any differentiable loss function:

F_m(x) = F_{m-1}(x) - \gamma_m \sum_{i=1}^{n} \nabla_F L(y_i, F_{m-1}(x_i))

Where the step length 𝛾𝑚 is chosen using line search:

\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) - \gamma \frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)}\right)

The algorithms for regression and classification only differ in the concrete loss function used.
Loss Functions
The following loss functions are supported and can be specified using the parameter loss:
• Regression
– Least squares ('ls'): The natural choice for regression due to its superior computational properties. The
initial model is given by the mean of the target values.
– Least absolute deviation ('lad'): A robust loss function for regression. The initial model is given by the
median of the target values.
– Huber ('huber'): Another robust loss function that combines least squares and least absolute deviation;
use alpha to control the sensitivity with regards to outliers (see [F2001] for more details).
– Quantile ('quantile'): A loss function for quantile regression. Use 0 < alpha < 1 to specify the
quantile. This loss function can be used to create prediction intervals (see Prediction Intervals for Gradient
Boosting Regression and the short sketch after this list).
• Classification
– Binomial deviance ('deviance'): The negative binomial log-likelihood loss function for binary classification (provides probability estimates). The initial model is given by the log odds-ratio.
– Multinomial deviance ('deviance'): The negative multinomial log-likelihood loss function for multiclass classification with n_classes mutually exclusive classes. It provides probability estimates. The
initial model is given by the prior probability of each class. At each iteration n_classes regression trees
have to be constructed which makes GBRT rather inefficient for data sets with a large number of classes.
– Exponential loss ('exponential'): The same loss function as AdaBoostClassifier. Less robust
to mislabeled examples than 'deviance'; can only be used for binary classification.
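As a hedged sketch of the quantile loss mentioned above, two models fit on the earlier Friedman data can bracket the predictions into a rough prediction interval (the 0.9/0.1 quantiles are arbitrary choices):
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> upper = GradientBoostingRegressor(loss='quantile', alpha=0.9).fit(X_train, y_train)
>>> lower = GradientBoostingRegressor(loss='quantile', alpha=0.1).fit(X_train, y_train)
>>> intervals = list(zip(lower.predict(X_test), upper.predict(X_test)))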
Regularization
Shrinkage
[F2001] proposed a simple regularization strategy that scales the contribution of each weak learner by a factor 𝜈:
𝐹𝑚 (𝑥) = 𝐹𝑚−1 (𝑥) + 𝜈𝛾𝑚 ℎ𝑚 (𝑥)
The parameter 𝜈 is also called the learning rate because it scales the step length of the gradient descent procedure; it can
be set via the learning_rate parameter.
The parameter learning_rate strongly interacts with the parameter n_estimators, the number of weak learners to fit. Smaller values of learning_rate require larger numbers of weak learners to maintain a constant training
error. Empirical evidence suggests that small values of learning_rate favor better test error. [HTF2009] recommend setting the learning rate to a small constant (e.g. learning_rate <= 0.1) and choosing n_estimators by
early stopping. For a more detailed discussion of the interaction between learning_rate and n_estimators
see [R2007].
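A minimal sketch of this trade-off (the particular values are illustrative, not recommendations):
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> # a large learning rate needs relatively few boosting stages ...
>>> fast = GradientBoostingRegressor(learning_rate=1.0, n_estimators=100)
>>> # ... while a small learning rate usually needs many more stages
>>> # to reach a comparable training error
>>> slow = GradientBoostingRegressor(learning_rate=0.05, n_estimators=2000)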
Subsampling
[F1999] proposed stochastic gradient boosting, which combines gradient boosting with bootstrap averaging (bagging).
At each iteration the base classifier is trained on a fraction subsample of the available training data. The subsample
is drawn without replacement. A typical value of subsample is 0.5.
The figure below illustrates the effect of shrinkage and subsampling on the goodness-of-fit of the model. We can
clearly see that shrinkage outperforms no-shrinkage. Subsampling with shrinkage can further increase the accuracy of
the model. Subsampling without shrinkage, on the other hand, does poorly.
Another strategy to reduce the variance is by subsampling the features analogous to the random splits in
RandomForestClassifier . The number of subsampled features can be controlled via the max_features
parameter.
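A hedged sketch of stochastic gradient boosting combining row subsampling with feature subsampling (the values are illustrative only):
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> est = GradientBoostingRegressor(subsample=0.5,      # each tree sees half the rows
...                                 max_features=0.5,   # and half the features per split
...                                 n_estimators=100, learning_rate=0.1)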


Note: Using a small max_features value can significantly decrease the runtime.
Stochastic gradient boosting makes it possible to compute out-of-bag estimates of the test deviance by computing the improvement in deviance on the examples that are not included in the bootstrap sample (i.e. the out-of-bag examples). The
improvements are stored in the attribute oob_improvement_. oob_improvement_[i] holds the improvement
in terms of the loss on the OOB samples if you add the i-th stage to the current predictions. Out-of-bag estimates can
be used for model selection, for example to determine the optimal number of iterations. OOB estimates are usually
very pessimistic; thus we recommend using cross-validation instead and only using OOB estimates if cross-validation
is too time-consuming.
Examples:
• Gradient Boosting regularization
• Gradient Boosting Out-of-Bag estimates
• OOB Errors for Random Forests

Interpretation
Individual decision trees can be interpreted easily by simply visualizing the tree structure. Gradient boosting models,
however, comprise hundreds of regression trees, thus they cannot be easily interpreted by visual inspection of the
individual trees. Fortunately, a number of techniques have been proposed to summarize and interpret gradient boosting
models.


Feature importance
Often features do not contribute equally to predicting the target response; in many situations the majority of the features
are in fact irrelevant. When interpreting a model, the first question usually is: what are those important features and
how do they contribute to predicting the target response?
Individual decision trees intrinsically perform feature selection by selecting appropriate split points. This information
can be used to measure the importance of each feature; the basic idea is: the more often a feature is used in the split
points of a tree the more important that feature is. This notion of importance can be extended to decision tree ensembles
by simply averaging the feature importance of each tree (see Feature importance evaluation for more details).
The feature importance scores of a fit gradient boosting model can be accessed via the feature_importances_
property:
>>> from sklearn.datasets import make_hastie_10_2
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> X, y = make_hastie_10_2(random_state=0)
>>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
...     max_depth=1, random_state=0).fit(X, y)
>>> clf.feature_importances_
array([ 0.11, 0.1 , 0.11, ...

Examples:
• Gradient Boosting regression

Partial dependence
Partial dependence plots (PDP) show the dependence between the target response and a set of ‘target’ features,
marginalizing over the values of all other features (the ‘complement’ features). Intuitively, we can interpret the partial
dependence as the expected target response [1] as a function of the ‘target’ features [2].
Due to the limits of human perception the size of the target feature set must be small (usually, one or two) thus the
target features are usually chosen among the most important features.
The Figure below shows four one-way and one two-way partial dependence plots for the California housing dataset:
One-way PDPs tell us about the interaction between the target response and the target feature (e.g. linear, non-linear).
The upper left plot in the above Figure shows the effect of the median income in a district on the median house price;
we can clearly see a linear relationship among them.
PDPs with two target features show the interactions among the two features. For example, the two-variable PDP in
the above Figure shows the dependence of median house price on joint values of house age and avg. occupants per
household. We can clearly see an interaction between the two features: For an avg. occupancy greater than two, the
house price is nearly independent of the house age, whereas for values less than two there is a strong dependence on
age.
The module partial_dependence provides a convenience function plot_partial_dependence to create one-way and two-way partial dependence plots. In the below example we show how to create a grid of partial
dependence plots: two one-way PDPs for the features 0 and 1 and a two-way PDP between the two features:
[1] For classification with loss='deviance' the target response is logit(p).
[2] More precisely, it is the expectation of the target response after accounting for the initial model; partial dependence plots do not include the init model.


>>> from sklearn.datasets import make_hastie_10_2
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> from sklearn.ensemble.partial_dependence import plot_partial_dependence
>>> X, y = make_hastie_10_2(random_state=0)
>>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
...     max_depth=1, random_state=0).fit(X, y)
>>> features = [0, 1, (0, 1)]
>>> fig, axs = plot_partial_dependence(clf, X, features)

For multi-class models, you need to set the class label for which the PDPs should be created via the label argument:
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> mc_clf = GradientBoostingClassifier(n_estimators=10,
...     max_depth=1).fit(iris.data, iris.target)
>>> features = [3, 2, (3, 2)]
>>> fig, axs = plot_partial_dependence(mc_clf, iris.data, features, label=0)

If you need the raw values of the partial dependence function rather than the plots you can use the
partial_dependence function:
>>> from sklearn.ensemble.partial_dependence import partial_dependence
>>> pdp, axes = partial_dependence(clf, [0], X=X)
>>> pdp
array([[ 2.46643157, 2.46643157, ...
>>> axes
[array([-1.62497054, -1.59201391, ...

The function requires either the argument grid, which specifies the values of the target features on which the partial
dependence function should be evaluated, or the argument X, which is a convenience mode for automatically creating
the grid from the training data. If X is given, the axes value returned by the function gives the axis for each target
feature.

For each value of the ‘target’ features in the grid the partial dependence function needs to marginalize the predictions
of a tree over all possible values of the ‘complement’ features. In decision trees this function can be evaluated efficiently without reference to the training data. For each grid point a weighted tree traversal is performed: if a split node
involves a ‘target’ feature, the corresponding left or right branch is followed; otherwise both branches are followed,
each branch being weighted by the fraction of training samples that entered that branch. Finally, the partial dependence
is given by a weighted average of all visited leaves. For tree ensembles the results of each individual tree are again
averaged.
Examples:
• Partial Dependence Plots

References

Voting Classifier
The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use
a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be
useful for a set of equally well performing models in order to balance out their individual weaknesses.
Majority Class Labels (Majority/Hard Voting)
In majority voting, the predicted class label for a particular sample is the class label that represents the majority (mode)
of the class labels predicted by each individual classifier.
E.g., if the prediction for a given sample is
• classifier 1 -> class 1
• classifier 2 -> class 1
• classifier 3 -> class 2
the VotingClassifier (with voting='hard') would classify the sample as “class 1” based on the majority class label.
In the case of a tie, the VotingClassifier will select the class based on the ascending sort order. E.g., in the following
scenario
• classifier 1 -> class 2
• classifier 2 -> class 1
the class label 1 will be assigned to the sample.
Usage
The following example shows how to fit the majority rule classifier:
>>> from sklearn import datasets
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.ensemble import VotingClassifier
>>> iris = datasets.load_iris()
>>> X, y = iris.data[:, 1:3], iris.target
>>> clf1 = LogisticRegression(random_state=1)
>>> clf2 = RandomForestClassifier(random_state=1)
>>> clf3 = GaussianNB()
>>> eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
...                         voting='hard')
>>> for clf, label in zip([clf1, clf2, clf3, eclf],
...                       ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
...     scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
...     print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Accuracy: 0.90 (+/- 0.05) [Logistic Regression]
Accuracy: 0.93 (+/- 0.05) [Random Forest]
Accuracy: 0.91 (+/- 0.04) [naive Bayes]
Accuracy: 0.95 (+/- 0.05) [Ensemble]

Weighted Average Probabilities (Soft Voting)
In contrast to majority voting (hard voting), soft voting returns the class label as argmax of the sum of predicted
probabilities.
Specific weights can be assigned to each classifier via the weights parameter. When weights are provided, the
predicted class probabilities for each classifier are collected, multiplied by the classifier weight, and averaged. The
final class label is then the class label with the highest average probability.
To illustrate this with a simple example, let’s assume we have 3 classifiers and a 3-class classification problem where
we assign equal weights to all classifiers: w1=1, w2=1, w3=1.
The weighted average probabilities for a sample would then be calculated as follows:
classifier          class 1     class 2     class 3
classifier 1        w1 * 0.2    w1 * 0.5    w1 * 0.3
classifier 2        w2 * 0.6    w2 * 0.3    w2 * 0.1
classifier 3        w3 * 0.3    w3 * 0.4    w3 * 0.3
weighted average    0.37        0.4         0.23

Here, the predicted class label is 2, since it has the highest average probability.
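A minimal numpy sketch of this computation, using the hypothetical probabilities from the table above:
>>> import numpy as np
>>> probas = np.array([[0.2, 0.5, 0.3],   # classifier 1
...                    [0.6, 0.3, 0.1],   # classifier 2
...                    [0.3, 0.4, 0.3]])  # classifier 3
>>> weights = np.array([1, 1, 1])
>>> avg = np.average(probas, axis=0, weights=weights)  # approx. [0.37, 0.4, 0.23]
>>> np.argmax(avg)   # index 1, i.e. "class 2"
1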
The following example illustrates how the decision regions may change when a soft VotingClassifier is used based on
a linear Support Vector Machine, a Decision Tree, and a K-nearest neighbor classifier:
>>> from sklearn import datasets
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.svm import SVC
>>> from itertools import product
>>> from sklearn.ensemble import VotingClassifier

>>> # Loading some example data
>>> iris = datasets.load_iris()
>>> X = iris.data[:, [0, 2]]
>>> y = iris.target

>>> # Training classifiers
>>> clf1 = DecisionTreeClassifier(max_depth=4)
>>> clf2 = KNeighborsClassifier(n_neighbors=7)
>>> clf3 = SVC(kernel='rbf', probability=True)
>>> eclf = VotingClassifier(estimators=[('dt', clf1), ('knn', clf2), ('svc', clf3)],
...                         voting='soft', weights=[2, 1, 2])

>>> clf1 = clf1.fit(X, y)
>>> clf2 = clf2.fit(X, y)
>>> clf3 = clf3.fit(X, y)
>>> eclf = eclf.fit(X, y)

Using the VotingClassifier with GridSearch
The VotingClassifier can also be used together with GridSearch in order to tune the hyperparameters of the individual
estimators:


>>> from sklearn.model_selection import GridSearchCV
>>> clf1 = LogisticRegression(random_state=1)
>>> clf2 = RandomForestClassifier(random_state=1)
>>> clf3 = GaussianNB()
>>> eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
...                         voting='soft')
>>> params = {'lr__C': [1.0, 100.0], 'rf__n_estimators': [20, 200],}
>>> grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
>>> grid = grid.fit(iris.data, iris.target)

Usage
In order to predict the class labels based on the predicted class probabilities (the scikit-learn estimators in the VotingClassifier must support the predict_proba method), set voting='soft':
>>> eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
...                         voting='soft')

Optionally, weights can be provided for the individual classifiers:
>>> eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
...                         voting='soft', weights=[2, 5, 1])

3.1.12 Multiclass and multilabel algorithms
Warning: All classifiers in scikit-learn do multiclass classification out-of-the-box. You don’t need to use the
sklearn.multiclass module unless you want to experiment with different multiclass strategies.
The sklearn.multiclass module implements meta-estimators to solve multiclass and multilabel classification problems by decomposing such problems into binary classification problems. Multitarget regression is also
supported.
• Multiclass classification means a classification task with more than two classes; e.g., classify a set of images of
fruits which may be oranges, apples, or pears. Multiclass classification makes the assumption that each sample
is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time.
• Multilabel classification assigns to each sample a set of target labels. This can be thought of as predicting properties of a data-point that are not mutually exclusive, such as topics that are relevant for a document. A text might
be about any of religion, politics, finance or education at the same time or none of these.
• Multioutput regression assigns each sample a set of target values. This can be thought of as predicting several
properties for each data-point, such as wind direction and magnitude at a certain location.
• Multioutput-multiclass classification and multi-task classification mean that a single estimator has to handle
several joint classification tasks. This is both a generalization of the multi-label classification task, which only
considers binary classification, and a generalization of the multi-class classification task. The output
format is a 2d numpy array or sparse matrix.
The set of labels can be different for each output variable. For instance, a sample could be assigned “pear” for
an output variable that takes possible values in a finite set of species such as “pear”, “apple”; and “blue” or
“green” for a second output variable that takes possible values in a finite set of colors such as “green”, “red”,
“blue”, “yellow”. . .
This means that any classifiers handling multi-output multiclass or multi-task classification tasks support the
multi-label classification task as a special case. Multi-task classification is similar to the multi-output classification task with different model formulations. For more information, see the relevant estimator documentation.
All scikit-learn classifiers are capable of multiclass classification, but the meta-estimators offered by sklearn.
multiclass permit changing the way they handle more than two classes because this may have an effect on classifier
performance (either in terms of generalization error or required computational resources).
Below is a summary of the classifiers supported by scikit-learn grouped by strategy; you don’t need the meta-estimators
in this class if you’re using one of these, unless you want custom multiclass behavior:
• Inherently multiclass:
– sklearn.naive_bayes.BernoulliNB
– sklearn.tree.DecisionTreeClassifier
– sklearn.tree.ExtraTreeClassifier
– sklearn.ensemble.ExtraTreesClassifier
– sklearn.naive_bayes.GaussianNB
– sklearn.neighbors.KNeighborsClassifier
– sklearn.semi_supervised.LabelPropagation
– sklearn.semi_supervised.LabelSpreading
– sklearn.discriminant_analysis.LinearDiscriminantAnalysis
– sklearn.svm.LinearSVC (setting multi_class=”crammer_singer”)
– sklearn.linear_model.LogisticRegression (setting multi_class=”multinomial”)
– sklearn.linear_model.LogisticRegressionCV (setting multi_class=”multinomial”)
– sklearn.neural_network.MLPClassifier
– sklearn.neighbors.NearestCentroid
– sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis
– sklearn.neighbors.RadiusNeighborsClassifier
– sklearn.ensemble.RandomForestClassifier
– sklearn.linear_model.RidgeClassifier
– sklearn.linear_model.RidgeClassifierCV
• Multiclass as One-Vs-One:
– sklearn.svm.NuSVC
– sklearn.svm.SVC
– sklearn.gaussian_process.GaussianProcessClassifier (setting multi_class = “one_vs_one”)
• Multiclass as One-Vs-All:
– sklearn.ensemble.GradientBoostingClassifier
– sklearn.gaussian_process.GaussianProcessClassifier (setting multi_class = “one_vs_rest”)


– sklearn.svm.LinearSVC (setting multi_class=”ovr”)
– sklearn.linear_model.LogisticRegression (setting multi_class=”ovr”)
– sklearn.linear_model.LogisticRegressionCV (setting multi_class=”ovr”)
– sklearn.linear_model.SGDClassifier
– sklearn.linear_model.Perceptron
– sklearn.linear_model.PassiveAggressiveClassifier
• Support multilabel:
– sklearn.tree.DecisionTreeClassifier
– sklearn.tree.ExtraTreeClassifier
– sklearn.ensemble.ExtraTreesClassifier
– sklearn.neighbors.KNeighborsClassifier
– sklearn.neural_network.MLPClassifier
– sklearn.neighbors.RadiusNeighborsClassifier
– sklearn.ensemble.RandomForestClassifier
– sklearn.linear_model.RidgeClassifierCV
• Support multiclass-multioutput:
– sklearn.tree.DecisionTreeClassifier
– sklearn.tree.ExtraTreeClassifier
– sklearn.ensemble.ExtraTreesClassifier
– sklearn.neighbors.KNeighborsClassifier
– sklearn.neighbors.RadiusNeighborsClassifier
– sklearn.ensemble.RandomForestClassifier
Warning: At present, no metric in sklearn.metrics supports the multioutput-multiclass classification task.

Multilabel classification format
In multilabel learning, the joint set of binary classification tasks is expressed with a label binary indicator array: each
sample is one row of a 2d array of shape (n_samples, n_classes) with binary values, where the ones, i.e. the non-zero
elements, correspond to the subset of labels. An array such as np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]])
represents label 0 in the first sample, labels 1 and 2 in the second sample, and no labels in the third sample.
Producing multilabel data as a list of sets of labels may be more intuitive. The MultiLabelBinarizer transformer
can be used to convert between a collection of collections of labels and the indicator format.
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> y = [[2, 3, 4], [2], [0, 1, 3], [0, 1, 2, 3, 4], [0, 1, 2]]
>>> MultiLabelBinarizer().fit_transform(y)
array([[0, 0, 1, 1, 1],
       [0, 0, 1, 0, 0],
       [1, 1, 0, 1, 0],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 0, 0]])

One-Vs-The-Rest
This strategy, also known as one-vs-all, is implemented in OneVsRestClassifier. The strategy consists in
fitting one classifier per class. For each classifier, the class is fitted against all the other classes. In addition to its
computational efficiency (only n_classes classifiers are needed), one advantage of this approach is its interpretability.
Since each class is represented by one and only one classifier, it is possible to gain knowledge about the class by
inspecting its corresponding classifier. This is the most commonly used strategy and is a fair default choice.
Multiclass learning
Below is an example of multiclass learning using OvR:
>>> from sklearn import datasets
>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn.svm import LinearSVC
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, y).predict(X)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

Multilabel learning
OneVsRestClassifier also supports multilabel classification. To use this feature, feed the classifier an indicator
matrix, in which cell [i, j] indicates the presence of label j in sample i.
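A hedged sketch: the indicator matrix can be built with MultiLabelBinarizer and then passed to OneVsRestClassifier (the tiny dataset is made up for illustration):
>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> from sklearn.svm import LinearSVC
>>> X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
>>> y = [[0], [0, 2], [1, 2], [0, 1], [1]]
>>> Y = MultiLabelBinarizer().fit_transform(y)       # shape (5, 3) indicator matrix
>>> clf = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, Y)
>>> Y_pred = clf.predict(X)                          # also an indicator matrix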
Examples:
• Multilabel classification

One-Vs-One
OneVsOneClassifier constructs one classifier per pair of classes. At prediction time, the class which received
the most votes is selected. In the event of a tie (among two classes with an equal number of votes), it selects the class
with the highest aggregate classification confidence by summing over the pair-wise classification confidence levels
computed by the underlying binary classifiers.
Since it requires fitting n_classes * (n_classes - 1) / 2 classifiers, this method is usually slower than
one-vs-the-rest, due to its O(n_classes^2) complexity. However, this method may be advantageous for algorithms
such as kernel algorithms which don’t scale well with n_samples. This is because each individual learning problem
only involves a small subset of the data whereas, with one-vs-the-rest, the complete dataset is used n_classes times.


Multiclass learning
Below is an example of multiclass learning using OvO:
>>> from sklearn import datasets
>>> from sklearn.multiclass import OneVsOneClassifier
>>> from sklearn.svm import LinearSVC
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> OneVsOneClassifier(LinearSVC(random_state=0)).fit(X, y).predict(X)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

References:
• “Pattern Recognition and Machine Learning. Springer”, Christopher M. Bishop, page 183, (First Edition)


Error-Correcting Output-Codes
Output-code based strategies are fairly different from one-vs-the-rest and one-vs-one. With these strategies, each class
is represented in a Euclidean space, where each dimension can only be 0 or 1. Another way to put it is that each class
is represented by a binary code (an array of 0 and 1). The matrix which keeps track of the location/code of each class
is called the code book. The code size is the dimensionality of the aforementioned space. Intuitively, each class should
be represented by a code as unique as possible and a good code book should be designed to optimize classification
accuracy. In this implementation, we simply use a randomly-generated code book as advocated in [3], although more
elaborate methods may be added in the future.
At fitting time, one binary classifier per bit in the code book is fitted. At prediction time, the classifiers are used to
project new points in the class space and the class closest to the points is chosen.
In OutputCodeClassifier, the code_size attribute allows the user to control the number of classifiers which
will be used. It is a percentage of the total number of classes.
A number between 0 and 1 will require fewer classifiers than one-vs-the-rest. In theory, log2(n_classes) /
n_classes is sufficient to represent each class unambiguously. However, in practice, it may not lead to good
accuracy since log2(n_classes) is much smaller than n_classes.
A number greater than 1 will require more classifiers than one-vs-the-rest. In this case, some classifiers will in theory
correct for the mistakes made by other classifiers, hence the name “error-correcting”. In practice, however, this may
not happen as classifier mistakes will typically be correlated. The error-correcting output codes have a similar effect
to bagging.
Multiclass learning
Below is an example of multiclass learning using Output-Codes:
>>> from sklearn import datasets
>>> from sklearn.multiclass import OutputCodeClassifier
>>> from sklearn.svm import LinearSVC
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf = OutputCodeClassifier(LinearSVC(random_state=0),
...                            code_size=2, random_state=0)
>>> clf.fit(X, y).predict(X)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,
       1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

References:
• “Solving multiclass learning problems via error-correcting output codes”, Dietterich T., Bakiri G., Journal of
Artificial Intelligence Research 2, 1995.
• “The Elements of Statistical Learning”, Hastie T., Tibshirani R., Friedman J., page 606 (second-edition) 2008.
[3] “The error coding method and PICTs”, James G., Hastie T., Journal of Computational and Graphical Statistics 7, 1998.


Multioutput regression
Multioutput regression support can be added to any regressor with MultiOutputRegressor. This strategy consists of fitting one regressor per target. Since each target is represented by exactly one regressor it is possible to
gain knowledge about the target by inspecting its corresponding regressor. As MultiOutputRegressor fits one
regressor per target it cannot take advantage of correlations between targets.
Below is an example of multioutput regression:
>>> from sklearn.datasets import make_regression
>>> from sklearn.multioutput import MultiOutputRegressor
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> X, y = make_regression(n_samples=10, n_targets=3, random_state=1)
>>> MultiOutputRegressor(GradientBoostingRegressor(random_state=0)).fit(X, y).predict(X)
array([[-154.75474165, -147.03498585,  -50.03812219],
       [   7.12165031,    5.12914884,  -81.46081961],
       [-187.8948621 , -100.44373091,   13.88978285],
       [-141.62745778,   95.02891072, -191.48204257],
       [  97.03260883,  165.34867495,  139.52003279],
       [ 123.92529176,   21.25719016,   -7.84253   ],
       [-122.25193977,  -85.16443186, -107.12274212],
       [ -30.170388  ,  -94.80956739,   12.16979946],
       [ 140.72667194,  176.50941682,  -17.50447799],
       [ 149.37967282,  -81.15699552,   -5.72850319]])

Multioutput classification
Multioutput classification support can be added to any classifier with MultiOutputClassifier. This strategy
consists of fitting one classifier per target. This allows multiple target variable classifications. The purpose of this class
is to extend estimators to be able to estimate a series of target functions (f1, f2, f3, ..., fn) that are trained on a single X
predictor matrix to predict a series of responses (y1, y2, y3, ..., yn).
Below is an example of multioutput classification:
>>> from sklearn.datasets import make_classification
>>> from sklearn.multioutput import MultiOutputClassifier
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.utils import shuffle
>>> import numpy as np
>>> X, y1 = make_classification(n_samples=10, n_features=100,
...                             n_informative=30, n_classes=3, random_state=1)
>>> y2 = shuffle(y1, random_state=1)
>>> y3 = shuffle(y1, random_state=2)
>>> Y = np.vstack((y1, y2, y3)).T
>>> n_samples, n_features = X.shape # 10,100
>>> n_outputs = Y.shape[1] # 3
>>> n_classes = 3
>>> forest = RandomForestClassifier(n_estimators=100, random_state=1)
>>> multi_target_forest = MultiOutputClassifier(forest, n_jobs=-1)
>>> multi_target_forest.fit(X, Y).predict(X)
array([[2, 2, 0],
       [1, 2, 1],
       [2, 1, 0],
       [0, 0, 2],
       [0, 2, 1],
       [0, 0, 2],
       [1, 1, 0],
       [1, 1, 1],
       [0, 0, 2],
       [2, 0, 0]])

Classifier Chain
Classifier chains (see ClassifierChain) are a way of combining a number of binary classifiers into a single
multi-label model that is capable of exploiting correlations among targets.
For a multi-label classification problem with N classes, N binary classifiers are assigned an integer between 0 and N-1.
These integers define the order of models in the chain. Each classifier is then fit on the available training data plus the
true labels of the classes whose models were assigned a lower number.
When predicting, the true labels will not be available. Instead the predictions of each model are passed on to the
subsequent models in the chain to be used as features.
Clearly the order of the chain is important. The first model in the chain has no information about the other labels while
the last model in the chain has features indicating the presence of all of the other labels. In general one does not know
the optimal ordering of the models in the chain so typically many randomly ordered chains are fit and their predictions
are averaged together.
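A minimal sketch of fitting a single randomly ordered chain (the base estimator and dataset are arbitrary choices for illustration):
>>> from sklearn.datasets import make_multilabel_classification
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.multioutput import ClassifierChain
>>> X, Y = make_multilabel_classification(n_samples=100, n_classes=5, random_state=0)
>>> chain = ClassifierChain(LogisticRegression(), order='random', random_state=0)
>>> Y_pred = chain.fit(X, Y).predict(X)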
References:
Jesse Read, Bernhard Pfahringer, Geoff Holmes, Eibe Frank, “Classifier Chains for Multi-label Classification”, 2009.

3.1.13 Feature selection
The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets.
Removing features with low variance
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance
doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value
in all samples.
As an example, suppose that we have a dataset with boolean features, and we want to remove all features that are
either one or zero (on or off) in more than 80% of the samples. Boolean features are Bernoulli random variables, and
the variance of such variables is given by
Var[𝑋] = 𝑝(1 − 𝑝)
so we can select using the threshold .8 * (1 - .8):
>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

As expected, VarianceThreshold has removed the first column, which has a probability 𝑝 = 5/6 > .8 of
containing a zero.
Univariate feature selection
Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen
as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the
transform method:
• SelectKBest removes all but the 𝑘 highest scoring features
• SelectPercentile removes all but a user-specified highest scoring percentage of features
• using common univariate statistical tests for each feature: false positive rate SelectFpr, false discovery rate
SelectFdr, or family wise error SelectFwe.
• GenericUnivariateSelect allows performing univariate feature selection with a configurable strategy.
This allows selecting the best univariate selection strategy with a hyper-parameter search estimator.
For instance, we can perform a 𝜒2 test on the samples to retrieve only the two best features as follows:
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)

These objects take as input a scoring function that returns univariate scores and p-values (or only scores for
SelectKBest and SelectPercentile):
• For regression: f_regression, mutual_info_regression
• For classification: chi2, f_classif, mutual_info_classif
The methods based on F-test estimate the degree of linear dependency between two random variables. On the other
hand, mutual information methods can capture any kind of statistical dependency, but being nonparametric, they
require more samples for accurate estimation.
Feature selection with sparse data
If you use sparse data (i.e. data represented as sparse matrices), chi2, mutual_info_regression,
mutual_info_classif will deal with the data without making it dense.


Warning: Beware not to use a regression scoring function with a classification problem; you will get useless
results.

Examples:
• Univariate Feature Selection
• Comparison of F-test and mutual information

Recursive feature elimination
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature
elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the
estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_
attribute or through a feature_importances_ attribute. Then, the least important features are pruned from the
current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to
select is eventually reached.
RFECV performs RFE in a cross-validation loop to find the optimal number of features.
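A minimal sketch of RFE on the iris data (the linear SVC estimator and the choice of two selected features are arbitrary):
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import RFE
>>> from sklearn.svm import SVC
>>> iris = load_iris()
>>> rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=2, step=1)
>>> X_reduced = rfe.fit_transform(iris.data, iris.target)
>>> X_reduced.shape
(150, 2)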
Examples:
• Recursive feature elimination: A recursive feature elimination example showing the relevance of pixels in a
digit classification task.
• Recursive feature elimination with cross-validation: A recursive feature elimination example with automatic
tuning of the number of features selected with cross-validation.

Feature selection using SelectFromModel
SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or
feature_importances_ attribute after fitting. The features are considered unimportant and removed if the
corresponding coef_ or feature_importances_ values are below the provided threshold parameter. Apart
from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument.
Available heuristics are “mean”, “median” and float multiples of these like “0.1*mean”.
For examples on how it is to be used refer to the sections below.
Examples
• Feature selection using SelectFromModel and LassoCV: Selecting the two most important features from the
Boston dataset without knowing the threshold beforehand.

L1-based feature selection
Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero.
When the goal is to reduce the dimensionality of the data to use with another classifier, they can be used along
with feature_selection.SelectFromModel to select the non-zero coefficients. In particular, sparse
estimators useful for this purpose are the linear_model.Lasso for regression, and linear_model.
LogisticRegression and svm.LinearSVC for classification:
>>> from sklearn.svm import LinearSVC
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
>>> model = SelectFromModel(lsvc, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 3)

With SVMs and logistic-regression, the parameter C controls the sparsity: the smaller C the fewer features selected.
With Lasso, the higher the alpha parameter, the fewer features selected.
Examples:
• Classification of text documents using sparse features: Comparison of different algorithms for document
classification including L1-based feature selection.

L1-recovery and compressive sensing
For a good choice of alpha, the Lasso can fully recover the exact set of non-zero variables using only a few observations, provided certain specific conditions are met. In particular, the number of samples should be “sufficiently
large”, or L1 models will perform at random, where “sufficiently large” depends on the number of non-zero coefficients, the logarithm of the number of features, the amount of noise, the smallest absolute value of non-zero
coefficients, and the structure of the design matrix X. In addition, the design matrix must display certain specific
properties, such as not being too correlated.
There is no general rule to select an alpha parameter for recovery of non-zero coefficients. It can be set by cross-validation (LassoCV or LassoLarsCV), though this may lead to under-penalized models: including a small
number of non-relevant variables is not detrimental to prediction score. BIC (LassoLarsIC) tends, on the contrary, to set high values of alpha.
Reference: Richard G. Baraniuk “Compressive Sensing”, IEEE Signal Processing Magazine [120] July 2007 http://dsp.rice.edu/sites/dsp.rice.edu/files/cs/baraniukCSlecture07.pdf

Tree-based feature selection
Tree-based estimators (see the sklearn.tree module and forest of trees in the sklearn.ensemble module)
can be used to compute feature importances, which in turn can be used to discard irrelevant features (when coupled
with the sklearn.feature_selection.SelectFromModel meta-transformer):
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> clf = ExtraTreesClassifier()
>>> clf = clf.fit(X, y)
>>> clf.feature_importances_
array([ 0.04..., 0.05..., 0.4..., 0.4...])
>>> model = SelectFromModel(clf, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 2)

Examples:
• Feature importances with forests of trees: example on synthetic data showing the recovery of the actually
meaningful features.
• Pixel importances with a parallel forest of trees: example on face recognition data.

Feature selection as part of a pipeline
Feature selection is usually used as a pre-processing step before doing the actual learning. The recommended way to
do this in scikit-learn is to use a sklearn.pipeline.Pipeline:
clf = Pipeline([
('feature_selection', SelectFromModel(LinearSVC(penalty="l1"))),
('classification', RandomForestClassifier())
])
clf.fit(X, y)

In this snippet we make use of a sklearn.svm.LinearSVC coupled with sklearn.feature_selection.
SelectFromModel to evaluate feature importances and select the most relevant features. Then, a sklearn.
ensemble.RandomForestClassifier is trained on the transformed output, i.e. using only relevant features.
You can of course perform similar operations with the other feature selection methods and also with classifiers that provide a way to
evaluate feature importances. See the sklearn.pipeline.Pipeline examples for more details.

3.1.14 Semi-Supervised
Semi-supervised learning is a situation in which some of the samples in your training data are not labeled. The semi-supervised estimators in sklearn.semi_supervised are able to make use of this additional unlabeled data to
better capture the shape of the underlying data distribution and generalize better to new samples. These algorithms
can perform well when we have a very small amount of labeled points and a large amount of unlabeled points.
Unlabeled entries in y
It is important to assign an identifier to unlabeled points along with the labeled data when training the model with
the fit method. The identifier that this implementation uses is the integer value −1.
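A minimal sketch of how the -1 marker is used (the tiny dataset is made up for illustration):
>>> import numpy as np
>>> from sklearn.semi_supervised import LabelPropagation
>>> X = [[0.0, 0.0], [1.0, 1.0], [8.0, 8.0], [9.0, 9.0]]
>>> y = np.array([0, -1, -1, 1])        # -1 marks the unlabeled samples
>>> model = LabelPropagation().fit(X, y)
>>> inferred = model.transduction_      # labels inferred for the -1 entries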

Label Propagation
Label propagation denotes a few variations of semi-supervised graph inference algorithms.


A few features available in this model:
• Can be used for classification and regression tasks
• Kernel methods to project data into alternate dimensional spaces
scikit-learn provides two label propagation models: LabelPropagation and LabelSpreading. Both work by
constructing a similarity graph over all items in the input dataset.

Fig. 3.1: An illustration of label-propagation: the structure of unlabeled observations is consistent with the class
structure, and thus the class label can be propagated to the unlabeled observations of the training set.
LabelPropagation and LabelSpreading differ in the modifications they make to the similarity matrix of the graph and in the
clamping effect on the label distributions. Clamping allows the algorithm to change the weight of the true ground
labeled data to some degree. The LabelPropagation algorithm performs hard clamping of input labels, which
means 𝛼 = 0. This clamping factor can be relaxed, to say 𝛼 = 0.2, which means that we will always retain 80 percent
of our original label distribution, but the algorithm gets to change its confidence of the distribution within 20 percent.
LabelPropagation uses the raw similarity matrix constructed from the data with no modifications. In contrast,
LabelSpreading minimizes a loss function that has regularization properties, as such it is often more robust to
noise. The algorithm iterates on a modified version of the original graph and normalizes the edge weights by computing
the normalized graph Laplacian matrix. This procedure is also used in Spectral clustering.
Label propagation models have two built-in kernel methods. The choice of kernel affects both the scalability and performance
of the algorithms. The following are available:
• rbf (exp(−𝛾|𝑥 − 𝑦|2 ), 𝛾 > 0). 𝛾 is specified by keyword gamma.
• knn (1[𝑥′ ∈ 𝑘𝑁 𝑁 (𝑥)]). 𝑘 is specified by keyword n_neighbors.
The RBF kernel will produce a fully connected graph which is represented in memory by a dense matrix. This matrix
may be very large and combined with the cost of performing a full matrix multiplication calculation for each iteration
of the algorithm can lead to prohibitively long running times. On the other hand, the KNN kernel will produce a much
more memory-friendly sparse matrix which can drastically reduce running times.
Examples
• Decision boundary of label propagation versus SVM on the Iris dataset
• Label Propagation learning a complex structure
• Label Propagation digits active learning


References
[1] Yoshua Bengio, Olivier Delalleau, Nicolas Le Roux. In Semi-Supervised Learning (2006), pp. 193-216
[2] Olivier Delalleau, Yoshua Bengio, Nicolas Le Roux. Efficient Non-Parametric Function Induction in SemiSupervised Learning. AISTAT 2005 http://research.microsoft.com/en-us/people/nicolasl/efficient_ssl.pdf

3.1.15 Isotonic regression
The class IsotonicRegression fits a non-decreasing function to data. It solves the following problem:
minimize \sum_i w_i (y_i - \hat{y}_i)^2
subject to \hat{y}_{min} = \hat{y}_1 \le \hat{y}_2 \le \ldots \le \hat{y}_n = \hat{y}_{max}
where each 𝑤𝑖 is strictly positive and each 𝑦𝑖 is an arbitrary real number. It yields the vector of non-decreasing elements that is
closest to the targets in terms of mean squared error. In practice this list of elements forms a function
that is piecewise linear.
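A minimal sketch (with made-up numbers) of fitting a non-decreasing function to noisy targets:
>>> import numpy as np
>>> from sklearn.isotonic import IsotonicRegression
>>> x = np.arange(6)
>>> y = np.array([1.0, 3.0, 2.0, 4.0, 6.0, 5.0])   # not monotone
>>> ir = IsotonicRegression()
>>> y_fit = ir.fit_transform(x, y)   # non-decreasing; adjacent violators are pooled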


3.1.16 Probability calibration
When performing classification you often want not only to predict the class label, but also obtain a probability of the
respective label. This probability gives you some kind of confidence on the prediction. Some models can give you
poor estimates of the class probabilities and some even do not support probability prediction. The calibration module
allows you to better calibrate the probabilities of a given model, or to add support for probability prediction.
Well calibrated classifiers are probabilistic classifiers for which the output of the predict_proba method can be directly
interpreted as a confidence level. For instance, a well calibrated (binary) classifier should classify the samples such
that among the samples to which it gave a predict_proba value close to 0.8, approximately 80% actually belong to the
positive class. The following plot compares how well the probabilistic predictions of different classifiers are calibrated:

LogisticRegression returns well calibrated predictions by default as it directly optimizes log-loss. In contrast,
the other methods return biased probabilities, with different biases per method:
• GaussianNB tends to push probabilities to 0 or 1 (note the counts in the histograms). This is mainly because
it makes the assumption that features are conditionally independent given the class, which is not the case in this
dataset which contains 2 redundant features.
• RandomForestClassifier shows the opposite behavior: the histograms show peaks at approximately
0.2 and 0.9 probability, while probabilities close to 0 or 1 are very rare. An explanation for this is given by
Niculescu-Mizil and Caruana [4]: “Methods such as bagging and random forests that average predictions from a
base set of models can have difficulty making predictions near 0 and 1 because variance in the underlying base
models will bias predictions that should be near zero or one away from these values. Because predictions are
restricted to the interval [0,1], errors caused by variance tend to be one-sided near zero and one. For example,
if a model should predict p = 0 for a case, the only way bagging can achieve this is if all bagged trees predict
zero. If we add noise to the trees that bagging is averaging over, this noise will cause some trees to predict
values larger than 0 for this case, thus moving the average prediction of the bagged ensemble away from 0. We
observe this effect most strongly with random forests because the base-level trees trained with random forests
have relatively high variance due to feature subsetting.” As a result, the calibration curve, also referred to as the
reliability diagram (Wilks 1995 [5]), shows a characteristic sigmoid shape, indicating that the classifier could trust
its “intuition” more and typically return probabilities closer to 0 or 1.
• Linear Support Vector Classification (LinearSVC) shows an even more sigmoid curve than the RandomForestClassifier, which is typical for maximum-margin methods (compare Niculescu-Mizil and Caruana [4]), which
focus on hard samples that are close to the decision boundary (the support vectors).
Two approaches for performing calibration of probabilistic predictions are provided: a parametric approach based on
Platt’s sigmoid model and a non-parametric approach based on isotonic regression (sklearn.isotonic). Probability calibration should be done on new data not used for model fitting. The class CalibratedClassifierCV
uses a cross-validation generator and, for each split, estimates the model parameters on the train samples and the calibration on the test samples. The probabilities predicted for the folds are then averaged. Already fitted classifiers can
be calibrated by CalibratedClassifierCV via the parameter cv=”prefit”. In this case, the user has to take care
manually that data for model fitting and calibration are disjoint.
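A hedged sketch of calibrating an already fitted classifier with cv="prefit" on a held-out split (the dataset and split are arbitrary choices for illustration):
>>> from sklearn.calibration import CalibratedClassifierCV
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.svm import LinearSVC
>>> X, y = make_classification(n_samples=1000, random_state=0)
>>> X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, random_state=0)
>>> base = LinearSVC(random_state=0).fit(X_fit, y_fit)           # model fitting data
>>> calibrated = CalibratedClassifierCV(base, method='sigmoid', cv='prefit')
>>> calibrated = calibrated.fit(X_cal, y_cal)                    # disjoint calibration data
>>> probabilities = calibrated.predict_proba(X_cal[:5])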
The following images demonstrate the benefit of probability calibration. The first image presents a dataset with 2
classes and 3 blobs of data. The blob in the middle contains random samples of each class. The probability for the
samples in this blob should be 0.5.
The following image shows on the data above the estimated probability using a Gaussian naive Bayes classifier without
calibration, with a sigmoid calibration and with a non-parametric isotonic calibration. One can observe that the nonparametric model provides the most accurate probability estimates for samples in the middle, i.e., 0.5.
The following experiment is performed on an artificial dataset for binary classification with 100,000 samples (1,000
of them are used for model fitting) with 20 features. Of the 20 features, only 2 are informative and 10 are redundant.
The figure shows the estimated probabilities obtained with logistic regression, a linear support-vector classifier (SVC),
and linear SVC with both isotonic calibration and sigmoid calibration. The calibration performance is evaluated with
Brier score brier_score_loss, reported in the legend (the smaller the better).
One can observe here that logistic regression is well calibrated as its curve is nearly diagonal. Linear SVC’s calibration
curve or reliability diagram has a sigmoid curve, which is typical for an under-confident classifier. In the case of
LinearSVC, this is caused by the margin property of the hinge loss, which lets the model focus on hard samples that
are close to the decision boundary (the support vectors). Both kinds of calibration can fix this issue and yield nearly
identical results. The next figure shows the calibration curve of Gaussian naive Bayes on the same data, with both
kinds of calibration and also without calibration.
One can see that Gaussian naive Bayes performs very badly but does so in another way than linear SVC: while linear
SVC exhibited a sigmoid calibration curve, Gaussian naive Bayes’ calibration curve has a transposed-sigmoid shape.
[4] Predicting Good Probabilities with Supervised Learning, A. Niculescu-Mizil & R. Caruana, ICML 2005
[5] On the combination of forecast probabilities for consecutive precipitation periods. Wea. Forecasting, 5, 640–650., Wilks, D. S., 1990a


This is typical for an over-confident classifier. In this case, the classifier’s overconfidence is caused by the redundant
features which violate the naive Bayes assumption of feature-independence.
Calibration of the probabilities of Gaussian naive Bayes with isotonic regression can fix this issue as can be seen from
the nearly diagonal calibration curve. Sigmoid calibration also improves the brier score slightly, albeit not as strongly
as the non-parametric isotonic calibration. This is an intrinsic limitation of sigmoid calibration, whose parametric form
assumes a sigmoid rather than a transposed-sigmoid curve. The non-parametric isotonic calibration model, however,
makes no such strong assumptions and can deal with either shape, provided that there is sufficient calibration data. In
general, sigmoid calibration is preferable in cases where the calibration curve is sigmoid and where there is limited
calibration data, while isotonic calibration is preferable for non-sigmoid calibration curves and in situations where
large amounts of data are available for calibration.
CalibratedClassifierCV can also deal with classification tasks that involve more than two classes if the base
estimator can do so. In this case, the classifier is calibrated first for each class separately in a one-vs-rest fashion.
When predicting probabilities for unseen data, the calibrated probabilities for each class are predicted separately. As
those probabilities do not necessarily sum to one, a postprocessing is performed to normalize them.
The next image illustrates how sigmoid calibration changes predicted probabilities for a 3-class classification problem.
Illustrated is the standard 2-simplex, where the three corners correspond to the three classes. Arrows point from the
probability vectors predicted by an uncalibrated classifier to the probability vectors predicted by the same classifier
after sigmoid calibration on a hold-out validation set. Colors indicate the true class of an instance (red: class 1, green:
class 2, blue: class 3).

The base classifier is a random forest classifier with 25 base estimators (trees). If this classifier is trained on all 800
training datapoints, it is overly confident in its predictions and thus incurs a large log-loss. Calibrating an identical

classifier, which was trained on 600 datapoints, with method=’sigmoid’ on the remaining 200 datapoints reduces the
confidence of the predictions, i.e., moves the probability vectors from the edges of the simplex towards the center:

This calibration results in a lower log-loss. Note that an alternative would have been to increase the number of base
estimators which would have resulted in a similar decrease in log-loss.
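The following is a rough sketch of that procedure (the 600/200 split and the estimator settings follow the figure's setup; X_train, y_train, X_test and y_test are assumed to be pre-existing arrays):
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.calibration import CalibratedClassifierCV
>>> from sklearn.metrics import log_loss
>>> clf = RandomForestClassifier(n_estimators=25, random_state=0)
>>> clf = clf.fit(X_train[:600], y_train[:600])          # fit on 600 datapoints
>>> cal = CalibratedClassifierCV(clf, method='sigmoid', cv='prefit')
>>> cal = cal.fit(X_train[600:800], y_train[600:800])    # calibrate on the held-out 200
>>> loss = log_loss(y_test, cal.predict_proba(X_test))   # lower than the uncalibrated log-loss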
References:
• Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers, B. Zadrozny &
C. Elkan, ICML 2001
• Transforming Classifier Scores into Accurate Multiclass Probability Estimates, B. Zadrozny & C. Elkan,
(KDD 2002)
• Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, J.
Platt, (1999)

3.1.17 Neural network models (supervised)
Warning: This implementation is not intended for large-scale applications. In particular, scikit-learn offers no
GPU support. For much faster, GPU-based implementations, as well as frameworks offering much more flexibility
to build deep learning architectures, see Related Projects.
Multi-layer Perceptron
Multi-layer Perceptron (MLP) is a supervised learning algorithm that learns a function 𝑓 (·) : 𝑅𝑚 → 𝑅𝑜 by training
on a dataset, where 𝑚 is the number of dimensions for input and 𝑜 is the number of dimensions for output. Given a set
of features 𝑋 = 𝑥1 , 𝑥2 , ..., 𝑥𝑚 and a target 𝑦, it can learn a non-linear function approximator for either classification
or regression. It is different from logistic regression, in that between the input and the output layer, there can be one
or more non-linear layers, called hidden layers. Figure 1 shows a one hidden layer MLP with scalar output.

Fig. 3.2: Figure 1 : One hidden layer MLP.
The leftmost layer, known as the input layer, consists of a set of neurons {𝑥𝑖 |𝑥1 , 𝑥2 , ..., 𝑥𝑚 } representing the input
features. Each neuron in the hidden layer transforms the values from the previous layer with a weighted linear summation 𝑤1 𝑥1 + 𝑤2 𝑥2 + ... + 𝑤𝑚 𝑥𝑚 , followed by a non-linear activation function 𝑔(·) : 𝑅 → 𝑅 - like the hyperbolic
tan function. The output layer receives the values from the last hidden layer and transforms them into output values.
The module contains the public attributes coefs_ and intercepts_. coefs_ is a list of weight matrices, where
weight matrix at index 𝑖 represents the weights between layer 𝑖 and layer 𝑖+1. intercepts_ is a list of bias vectors,
where the vector at index 𝑖 represents the bias values added to layer 𝑖 + 1.
The advantages of Multi-layer Perceptron are:
• Capability to learn non-linear models.
• Capability to learn models in real-time (on-line learning) using partial_fit.
The disadvantages of Multi-layer Perceptron (MLP) include:
• MLP with hidden layers has a non-convex loss function where there exists more than one local minimum.
Therefore different random weight initializations can lead to different validation accuracy.
• MLP requires tuning a number of hyperparameters such as the number of hidden neurons, layers, and iterations.
• MLP is sensitive to feature scaling.

Please see the Tips on Practical Use section, which addresses some of these disadvantages.
Classification
Class MLPClassifier implements a multi-layer perceptron (MLP) algorithm that trains using Backpropagation.
MLP trains on two arrays: array X of size (n_samples, n_features), which holds the training samples represented as
floating point feature vectors; and array y of size (n_samples,), which holds the target values (class labels) for the
training samples:
>>> from sklearn.neural_network import MLPClassifier
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
...                     hidden_layer_sizes=(5, 2), random_state=1)
>>> clf.fit(X, y)
MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto',
beta_1=0.9, beta_2=0.999, early_stopping=False,
epsilon=1e-08, hidden_layer_sizes=(5, 2), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
warm_start=False)

After fitting (training), the model can predict labels for new samples:
>>> clf.predict([[2., 2.], [-1., -2.]])
array([1, 0])

MLP can fit a non-linear model to the training data. clf.coefs_ contains the weight matrices that constitute the
model parameters:
>>> [coef.shape for coef in clf.coefs_]
[(2, 5), (5, 2), (2, 1)]

Currently, MLPClassifier supports only the Cross-Entropy loss function, which allows probability estimates by
running the predict_proba method.
MLP trains using Backpropagation. More precisely, it trains using some form of gradient descent and the gradients
are calculated using Backpropagation. For classification, it minimizes the Cross-Entropy loss function, giving a vector
of probability estimates 𝑃 (𝑦|𝑥) per sample 𝑥:
>>> clf.predict_proba([[2., 2.], [1., 2.]])
array([[  1.967...e-04,   9.998...-01],
       [  1.967...e-04,   9.998...-01]])

MLPClassifier supports multi-class classification by applying Softmax as the output function.
Further, the model supports multi-label classification in which a sample can belong to more than one class. For each
class, the raw output passes through the logistic function. Values larger than or equal to 0.5 are rounded to 1, otherwise to
0. For a predicted output of a sample, the indices where the value is 1 represent the assigned classes of that sample:
>>> X = [[0., 0.], [1., 1.]]
>>> y = [[0, 1], [1, 1]]
>>> clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
...                     hidden_layer_sizes=(15,), random_state=1)
>>> clf.fit(X, y)
MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto',
beta_1=0.9, beta_2=0.999, early_stopping=False,
epsilon=1e-08, hidden_layer_sizes=(15,), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
warm_start=False)
>>> clf.predict([[1., 2.]])
array([[1, 1]])
>>> clf.predict([[0., 0.]])
array([[0, 1]])

See the examples below and the doc string of MLPClassifier.fit for further information.
Examples:
• Compare Stochastic learning strategies for MLPClassifier
• Visualization of MLP weights on MNIST

Regression
Class MLPRegressor implements a multi-layer perceptron (MLP) that trains using backpropagation with no activation function in the output layer, which can also be seen as using the identity function as activation function. Therefore,
it uses the square error as the loss function, and the output is a set of continuous values.
MLPRegressor also supports multi-output regression, in which a sample can have more than one target.
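A minimal usage sketch on toy data (the data and the hidden layer size are purely illustrative):
>>> from sklearn.neural_network import MLPRegressor
>>> X = [[0., 0.], [1., 1.], [2., 2.]]
>>> y = [0., 1., 2.]
>>> reg = MLPRegressor(solver='lbfgs', alpha=1e-5,
...                    hidden_layer_sizes=(10,), random_state=1)
>>> reg = reg.fit(X, y)
>>> pred = reg.predict([[1.5, 1.5]])    # continuous output; y may also be 2D for multi-output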
Regularization
Both MLPRegressor and MLPClassifier use the parameter alpha for the L2 regularization term, which
helps avoid overfitting by penalizing weights with large magnitudes. The following plot displays how the decision
function varies with the value of alpha.
See the examples below for further information.
Examples:
• Varying regularization in Multi-layer Perceptron

Algorithms
MLP trains using Stochastic Gradient Descent, Adam, or L-BFGS. Stochastic Gradient Descent (SGD) updates parameters using the gradient of the loss function with respect to a parameter that needs adaptation, i.e.
w \leftarrow w - \eta \left( \alpha \frac{\partial R(w)}{\partial w} + \frac{\partial Loss}{\partial w} \right)

where 𝜂 is the learning rate which controls the step-size in the parameter space search. 𝐿𝑜𝑠𝑠 is the loss function used
for the network.

More details can be found in the documentation of SGD.
Adam is similar to SGD in the sense that it is a stochastic optimizer, but it can automatically adjust the amount to update
parameters based on adaptive estimates of lower-order moments.
With SGD or Adam, training supports online and mini-batch learning.
L-BFGS is a solver that approximates the Hessian matrix, which represents the second-order partial derivative of a
function. Further, it approximates the inverse of the Hessian matrix to perform parameter updates. The implementation
uses the SciPy version of L-BFGS.
If the selected solver is ‘L-BFGS’, training does not support online or mini-batch learning.
Complexity
Suppose there are 𝑛 training samples, 𝑚 features, 𝑘 hidden layers, each containing ℎ neurons - for simplicity, and 𝑜
output neurons. The time complexity of backpropagation is 𝑂(𝑛 · 𝑚 · ℎ𝑘 · 𝑜 · 𝑖), where 𝑖 is the number of iterations.
Since backpropagation has a high time complexity, it is advisable to start with a smaller number of hidden neurons and
few hidden layers for training.
Mathematical formulation
Given a set of training examples (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), . . . , (𝑥𝑛 , 𝑦𝑛 ) where 𝑥𝑖 ∈ R𝑛 and 𝑦𝑖 ∈ {0, 1}, a one hidden layer
one hidden neuron MLP learns the function 𝑓 (𝑥) = 𝑊2 𝑔(𝑊1𝑇 𝑥 + 𝑏1 ) + 𝑏2 where 𝑊1 ∈ R𝑚 and 𝑊2 , 𝑏1 , 𝑏2 ∈ R are
model parameters. 𝑊1 , 𝑊2 represent the weights of the input layer and hidden layer, respectively; and 𝑏1 , 𝑏2 represent
the bias added to the hidden layer and the output layer, respectively. 𝑔(·) : 𝑅 → 𝑅 is the activation function, set by
default as the hyperbolic tan. It is given as,
g(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}
For binary classification, 𝑓 (𝑥) passes through the logistic function 𝑔(𝑧) = 1/(1+𝑒−𝑧 ) to obtain output values between
zero and one. A threshold, set to 0.5, would assign samples of outputs larger or equal 0.5 to the positive class, and the
rest to the negative class.
If there are more than two classes, 𝑓 (𝑥) itself would be a vector of size (n_classes,). Instead of passing through logistic
function, it passes through the softmax function, which is written as,
\text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{l=1}^{K} \exp(z_l)}
where 𝑧𝑖 represents the 𝑖-th element of the input to softmax, which corresponds to class 𝑖, and 𝐾 is the number of
classes. The result is a vector containing the probabilities that sample 𝑥 belongs to each class. The output is the class
with the highest probability.
In regression, the output remains as 𝑓 (𝑥); therefore, output activation function is just the identity function.
MLP uses different loss functions depending on the problem type. The loss function for classification is Cross-Entropy,
which in binary case is given as,
Loss(\hat{y}, y, W) = -y \ln \hat{y} - (1 - y) \ln(1 - \hat{y}) + \alpha ||W||_2^2
where 𝛼||𝑊 ||22 is an L2-regularization term (aka penalty) that penalizes complex models; and 𝛼 > 0 is a non-negative
hyperparameter that controls the magnitude of the penalty.
For regression, MLP uses the Square Error loss function; written as,
Loss(\hat{y}, y, W) = \frac{1}{2} ||\hat{y} - y||_2^2 + \frac{\alpha}{2} ||W||_2^2

Starting from initial random weights, multi-layer perceptron (MLP) minimizes the loss function by repeatedly updating
these weights. After computing the loss, a backward pass propagates it from the output layer to the previous layers,
providing each weight parameter with an update value meant to decrease the loss.
In gradient descent, the gradient ∇𝐿𝑜𝑠𝑠𝑊 of the loss with respect to the weights is computed and deducted from 𝑊 .
More formally, this is expressed as,
W^{i+1} = W^i - \epsilon \nabla Loss_{W}^{i}
where 𝑖 is the iteration step, and 𝜖 is the learning rate with a value larger than 0.
The algorithm stops when it reaches a preset maximum number of iterations; or when the improvement in loss is below
a certain, small number.
Tips on Practical Use
• Multi-layer Perceptron is sensitive to feature scaling, so it is highly recommended to scale your data. For
example, scale each attribute on the input vector X to [0, 1] or [-1, +1], or standardize it to have mean 0 and
variance 1. Note that you must apply the same scaling to the test set for meaningful results. You can use
StandardScaler for standardization.
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> # Don't cheat - fit only on training data
>>> scaler.fit(X_train)
>>> X_train = scaler.transform(X_train)
>>> # apply same transformation to test data
>>> X_test = scaler.transform(X_test)

An alternative and recommended approach is to use StandardScaler in a Pipeline.
• Finding a reasonable regularization parameter 𝛼 is best done using GridSearchCV, usually in the range 10.0
** -np.arange(1, 7) (see the sketch after this list).

• Empirically, we observed that L-BFGS converges faster and finds better solutions on small datasets. For relatively
large datasets, however, Adam is very robust: it usually converges quickly and gives good performance. SGD
with momentum or Nesterov’s momentum, on the other hand, can perform better than both of those algorithms
if the learning rate is correctly tuned.
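The sketch below combines both recommendations above, scaling inside a Pipeline and searching alpha with GridSearchCV (X_train and y_train are assumed to be pre-existing training arrays; the grid follows the range suggested above):
>>> import numpy as np
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.neural_network import MLPClassifier
>>> from sklearn.model_selection import GridSearchCV
>>> pipe = make_pipeline(StandardScaler(),
...                      MLPClassifier(solver='lbfgs', random_state=1))
>>> param_grid = {'mlpclassifier__alpha': 10.0 ** -np.arange(1, 7)}
>>> search = GridSearchCV(pipe, param_grid, cv=3)
>>> search = search.fit(X_train, y_train)
>>> best_alpha = search.best_params_['mlpclassifier__alpha']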
More control with warm_start
If you want more control over stopping criteria or learning rate in SGD, or want to do additional monitoring, using
warm_start=True and max_iter=1 and iterating yourself can be helpful:
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = MLPClassifier(hidden_layer_sizes=(15,), random_state=1, max_iter=1,
...                     warm_start=True)
>>> for i in range(10):
...     clf.fit(X, y)
...     # additional monitoring / inspection
MLPClassifier(...

References:
• “Learning representations by back-propagating errors.” Rumelhart, David E., Geoffrey E. Hinton, and Ronald
J. Williams.
• “Stochastic Gradient Descent” L. Bottou - Website, 2010.
• “Backpropagation” Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline Suen - Website, 2011.
• “Efficient BackProp” Y. LeCun, L. Bottou, G. Orr, K. Müller - In Neural Networks: Tricks of the Trade 1998.
• “Adam: A method for stochastic optimization.” Kingma, Diederik, and Jimmy Ba. arXiv preprint arXiv:1412.6980 (2014).

3.2 Unsupervised learning
3.2.1 Gaussian mixture models
sklearn.mixture is a package which enables one to learn Gaussian Mixture Models (diagonal, spherical, tied
and full covariance matrices supported), sample them, and estimate them from data. Facilities to help determine the
appropriate number of components are also provided.

A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a
finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing
k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the
latent Gaussians.
Scikit-learn implements different classes to estimate Gaussian mixture models, which correspond to different estimation
strategies, detailed below.

Fig. 3.3: Two-component Gaussian mixture model: data points, and equi-probability surfaces of the model.
Gaussian Mixture
The GaussianMixture object implements the expectation-maximization (EM) algorithm for fitting mixture-of-Gaussian
models. It can also draw confidence ellipsoids for multivariate models, and compute the Bayesian Information
Criterion to assess the number of clusters in the data. A GaussianMixture.fit method is provided that
learns a Gaussian Mixture Model from train data. Given test data, it can assign to each sample the Gaussian it most
probably belongs to using the GaussianMixture.predict method.
The GaussianMixture comes with different options to constrain the covariance of the different classes estimated:
spherical, diagonal, tied or full covariance.
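A minimal sketch of this workflow on toy data (the two synthetic blobs below are illustrative):
>>> import numpy as np
>>> from sklearn.mixture import GaussianMixture
>>> rng = np.random.RandomState(0)
>>> X = np.concatenate([rng.randn(100, 2), rng.randn(100, 2) + 5])
>>> gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
>>> gmm = gmm.fit(X)
>>> labels = gmm.predict(X)        # most probable component per sample
>>> resp = gmm.predict_proba(X)    # soft assignments (responsibilities)
>>> X_new, comp = gmm.sample(10)   # draw new samples from the fitted model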
Examples:
• See GMM covariances for an example of using the Gaussian mixture as clustering on the iris dataset.
• See Density Estimation for a Gaussian mixture for an example on plotting the density estimation.

Pros and cons of class GaussianMixture
Pros
Speed It is the fastest algorithm for learning mixture models
Agnostic As this algorithm maximizes only the likelihood, it will not bias the means towards zero, or
bias the cluster sizes to have specific structures that might or might not apply.
Cons
Singularities When one has insufficiently many points per mixture, estimating the covariance matrices
becomes difficult, and the algorithm is known to diverge and find solutions with infinite likelihood
unless one regularizes the covariances artificially.

Number of components This algorithm will always use all the components it has access to, needing
held-out data or information theoretical criteria to decide how many components to use in the absence of external cues.
Selecting the number of components in a classical Gaussian Mixture Model
The BIC criterion can be used to select the number of components in a Gaussian Mixture in an efficient way. In theory,
it recovers the true number of components only in the asymptotic regime (i.e. if much data is available and assuming
that the data was actually generated i.i.d. from a mixture of Gaussian distributions). Note that using a Variational
Bayesian Gaussian mixture avoids the specification of the number of components for a Gaussian mixture model.
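A minimal sketch of BIC-based selection (X is assumed to be an existing data array; the range of candidate component counts is illustrative):
>>> from sklearn.mixture import GaussianMixture
>>> scores = []
>>> for k in range(1, 7):
...     gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
...     scores.append((gmm.bic(X), k))
>>> best_bic, best_k = min(scores)   # the lowest BIC indicates the preferred model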
Examples:
• See Gaussian Mixture Model Selection for an example of model selection performed with classical Gaussian
mixture.

Estimation algorithm Expectation-maximization
The main difficulty in learning Gaussian mixture models from unlabeled data is that one usually doesn’t know
which points came from which latent component (if one has access to this information it gets very easy to fit a separate
Gaussian distribution to each set of points). Expectation-maximization is a well-founded statistical algorithm to get
around this problem by an iterative process. First one assumes random components (randomly centered on data points,
learned from k-means, or even just normally distributed around the origin) and computes for each point a probability
of being generated by each component of the model. Then, one tweaks the parameters to maximize the likelihood of
the data given those assignments. Repeating this process is guaranteed to always converge to a local optimum.
Variational Bayesian Gaussian Mixture
The BayesianGaussianMixture object implements a variant of the Gaussian mixture model with variational
inference algorithms. The API is similar to the one defined by GaussianMixture.
Estimation algorithm: variational inference
Variational inference is an extension of expectation-maximization that maximizes a lower bound on model evidence
(including priors) instead of data likelihood. The principle behind variational methods is the same as expectation-maximization
(that is, both are iterative algorithms that alternate between finding the probabilities for each point to
be generated by each mixture and fitting the mixture to these assigned points), but variational methods add regularization
by integrating information from prior distributions. This avoids the singularities often found in expectation-maximization
solutions but introduces some subtle biases to the model. Inference is often notably slower, but usually
not so much as to render usage impractical.
Due to its Bayesian nature, the variational algorithm needs more hyperparameters than expectation-maximization,
the most important of these being the concentration parameter weight_concentration_prior. Specifying a
low value for the concentration prior will make the model put most of the weight on a few components and set the
remaining components’ weights very close to zero. High values of the concentration prior will allow a larger number of
components to be active in the mixture.
The BayesianGaussianMixture class proposes two types of prior for the weights distribution: a finite
mixture model with a Dirichlet distribution and an infinite mixture model with the Dirichlet Process. In practice the
Dirichlet Process inference algorithm is approximated and uses a truncated distribution with a fixed maximum number
of components (called the stick-breaking representation). The number of components actually used
almost always depends on the data.
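As a hedged sketch of how these options are typically combined (X is an assumed data array and the concrete values are illustrative):
>>> from sklearn.mixture import BayesianGaussianMixture
>>> bgmm = BayesianGaussianMixture(
...     n_components=10,                                     # upper bound only
...     weight_concentration_prior_type='dirichlet_process',
...     weight_concentration_prior=0.01,                     # low value favours few active components
...     random_state=0)
>>> bgmm = bgmm.fit(X)
>>> n_active = (bgmm.weights_ > 1e-2).sum()   # effective number of components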
The next figure compares the results obtained for the different types of weight concentration prior (parameter
weight_concentration_prior_type) for different values of weight_concentration_prior. Here,
we can see the value of the weight_concentration_prior parameter has a strong impact on the effective
number of active components obtained. We can also notice that large values for the concentration weight prior lead
to more uniform weights when the type of prior is ‘dirichlet_distribution’ while this is not necessarily the case for the
‘dirichlet_process’ type (used by default).

The examples below compare Gaussian mixture models with a fixed number of components, to the variational Gaussian mixture models with a Dirichlet process prior. Here, a classical Gaussian mixture is fitted with 5 components on
a dataset composed of 2 clusters. We can see that the variational Gaussian mixture with a Dirichlet process prior is
able to limit itself to only 2 components whereas the Gaussian mixture fits the data with a fixed number of components
that has to be set a priori by the user. In this case the user has selected n_components=5 which does not match the
true generative distribution of this toy dataset. Note that with very few observations, the variational Gaussian mixture
models with a Dirichlet process prior can take a conservative stand, and fit only one component.

In the following figure we are fitting a dataset not well-depicted by a Gaussian mixture. Adjusting
weight_concentration_prior, a parameter of BayesianGaussianMixture, controls the number of components
used to fit this data. We also present on the last two plots a random sampling generated from the two resulting
mixtures.
Examples:
• See Gaussian Mixture Model Ellipsoids for an example on plotting the confidence ellipsoids for both
GaussianMixture and BayesianGaussianMixture.
• Gaussian Mixture Model Sine Curve shows using GaussianMixture and BayesianGaussianMixture to fit a sine wave.
• See Concentration Prior Type Analysis of Variation Bayesian Gaussian Mixture for an example plotting the confidence ellipsoids for the BayesianGaussianMixture with different weight_concentration_prior_type for different values of the parameter
weight_concentration_prior.

Pros and cons of variational inference with BayesianGaussianMixture
Pros
Automatic selection when weight_concentration_prior is small enough and n_components is larger
than what is found necessary by the model, the Variational Bayesian mixture model has a natural
tendency to set some mixture weights values close to zero. This makes it possible to let the model
choose a suitable number of effective components automatically. Only an upper bound of this number
needs to be provided. Note however that the “ideal” number of active components is very application
specific and is typically ill-defined in a data exploration setting.
Less sensitivity to the number of parameters unlike finite models, which will almost always use
all components as much as they can, and hence will produce wildly different solutions for
different numbers of components, the variational inference with a Dirichlet process prior
(weight_concentration_prior_type='dirichlet_process') won’t change much
with changes to the parameters, leading to more stability and less tuning.
Regularization due to the incorporation of prior information, variational solutions have less pathological
special cases than expectation-maximization solutions.
Cons
Speed the extra parametrization necessary for variational inference makes inference slower, although not
by much.
Hyperparameters this algorithm needs an extra hyperparameter that might need experimental tuning via
cross-validation.
Bias there are many implicit biases in the inference algorithms (and also in the Dirichlet process if used),
and whenever there is a mismatch between these biases and the data it might be possible to fit better
models using a finite mixture.
The Dirichlet Process
Here we describe variational inference algorithms on Dirichlet process mixture. The Dirichlet process is a prior
probability distribution on clusterings with an infinite, unbounded, number of partitions. Variational techniques let us
incorporate this prior structure on Gaussian mixture models at almost no penalty in inference time, compared with a
finite Gaussian mixture model.
An important question is how the Dirichlet process can use an infinite, unbounded number of clusters and still be
consistent. While a full explanation doesn’t fit this manual, one can think of its stick-breaking process analogy to help in
understanding it. The stick breaking process is a generative story for the Dirichlet process. We start with a unit-length
stick and in each step we break off a portion of the remaining stick. Each time, we associate the length of the piece of
the stick to the proportion of points that falls into a group of the mixture. At the end, to represent the infinite mixture,
we associate the last remaining piece of the stick to the proportion of points that don’t fall into any of the other groups. The
length of each piece is a random variable with probability proportional to the concentration parameter. A smaller value of
the concentration will divide the unit length into larger pieces of the stick (defining a more concentrated distribution).
Larger concentration values will create smaller pieces of the stick (increasing the number of components with non
zero weights).
Variational inference techniques for the Dirichlet process still work with a finite approximation to this infinite mixture
model, but instead of having to specify a priori how many components one wants to use, one just specifies the concentration parameter and an upper bound on the number of mixture components (this upper bound, assuming it is higher
than the “true” number of components, affects only algorithmic complexity, not the actual number of components
used).

3.2.2 Manifold learning
Look for the bare necessities
The simple bare necessities
Forget about your worries and your strife
I mean the bare necessities
Old Mother Nature’s recipes
That bring the bare necessities of life
– Baloo’s song [The Jungle Book]
Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the
idea that the dimensionality of many data sets is only artificially high.
Introduction
High-dimensional datasets can be very difficult to visualize. While data in two or three dimensions can be plotted to
show the inherent structure of the data, equivalent high-dimensional plots are much less intuitive. To aid visualization
of the structure of a dataset, the dimension must be reduced in some way.
The simplest way to accomplish this dimensionality reduction is by taking a random projection of the data. Though
this allows some degree of visualization of the data structure, the randomness of the choice leaves much to be desired.
In a random projection, it is likely that the more interesting structure within the data will be lost.

To address this concern, a number of supervised and unsupervised linear dimensionality reduction frameworks have
been designed, such as Principal Component Analysis (PCA), Independent Component Analysis, Linear Discriminant
Analysis, and others. These algorithms define specific rubrics to choose an “interesting” linear projection of the data.
These methods can be powerful, but often miss important non-linear structure in the data.

Manifold Learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to nonlinear structure in data. Though supervised variants exist, the typical manifold learning problem is unsupervised: it
learns the high-dimensional structure of the data from the data itself, without the use of predetermined classifications.
Examples:
• See Manifold learning on handwritten digits: Locally Linear Embedding, Isomap. . . for an example of
dimensionality reduction on handwritten digits.
• See Comparison of Manifold Learning methods for an example of dimensionality reduction on a toy “Scurve” dataset.
The manifold learning implementations available in scikit-learn are summarized below.
Isomap
One of the earliest approaches to manifold learning is the Isomap algorithm, short for Isometric Mapping. Isomap can
be viewed as an extension of Multi-dimensional Scaling (MDS) or Kernel PCA. Isomap seeks a lower-dimensional
embedding which maintains geodesic distances between all points. Isomap can be performed with the object Isomap.
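A minimal usage sketch on the digits dataset (parameter values are illustrative):
>>> from sklearn.datasets import load_digits
>>> from sklearn.manifold import Isomap
>>> X, _ = load_digits(return_X_y=True)
>>> X_iso = Isomap(n_neighbors=5, n_components=2).fit_transform(X)
>>> # X_iso has shape (n_samples, 2) and can be scatter-plotted for visualization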

Complexity
The Isomap algorithm comprises three stages:
1. Nearest neighbor search. Isomap uses sklearn.neighbors.BallTree for efficient neighbor search.
The cost is approximately 𝑂[𝐷 log(𝑘)𝑁 log(𝑁 )], for 𝑘 nearest neighbors of 𝑁 points in 𝐷 dimensions.
2. Shortest-path graph search. The most efficient known algorithms for this are Dijkstra’s Algorithm, which is
approximately 𝑂[𝑁 2 (𝑘 + log(𝑁 ))], or the Floyd-Warshall algorithm, which is 𝑂[𝑁 3 ]. The algorithm can be
selected by the user with the path_method keyword of Isomap. If unspecified, the code attempts to choose
the best algorithm for the input data.
3. Partial eigenvalue decomposition. The embedding is encoded in the eigenvectors corresponding to the 𝑑
largest eigenvalues of the 𝑁 × 𝑁 isomap kernel. For a dense solver, the cost is approximately 𝑂[𝑑𝑁 2 ]. This
cost can often be improved using the ARPACK solver. The eigensolver can be specified by the user with the
eigen_solver keyword of Isomap. If unspecified, the code attempts to choose the best option for the
input data.
The overall complexity of Isomap is 𝑂[𝐷 log(𝑘)𝑁 log(𝑁 )] + 𝑂[𝑁 2 (𝑘 + log(𝑁 ))] + 𝑂[𝑑𝑁 2 ].
• 𝑁 : number of training data points
• 𝐷 : input dimension
• 𝑘 : number of nearest neighbors
• 𝑑 : output dimension
References:
• “A global geometric framework for nonlinear dimensionality reduction” Tenenbaum, J.B.; De Silva, V.; &
Langford, J.C. Science 290 (5500)

Locally Linear Embedding
Locally linear embedding (LLE) seeks a lower-dimensional projection of the data which preserves distances within
local neighborhoods. It can be thought of as a series of local Principal Component Analyses which are globally
compared to find the best non-linear embedding.
Locally linear embedding can be performed with function locally_linear_embedding or its object-oriented
counterpart LocallyLinearEmbedding.
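A minimal sketch, reusing the digits array X loaded in the Isomap sketch above (parameter values are illustrative):
>>> from sklearn.manifold import LocallyLinearEmbedding
>>> lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, method='standard')
>>> X_lle = lle.fit_transform(X)
>>> err = lle.reconstruction_error_   # can help choose the output dimension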
Complexity
The standard LLE algorithm comprises three stages:
1. Nearest Neighbors Search. See discussion under Isomap above.
2. Weight Matrix Construction. 𝑂[𝐷𝑁 𝑘 3 ]. The construction of the LLE weight matrix involves the solution of
a 𝑘 × 𝑘 linear equation for each of the 𝑁 local neighborhoods
3. Partial Eigenvalue Decomposition. See discussion under Isomap above.
The overall complexity of standard LLE is 𝑂[𝐷 log(𝑘)𝑁 log(𝑁 )] + 𝑂[𝐷𝑁 𝑘 3 ] + 𝑂[𝑑𝑁 2 ].
• 𝑁 : number of training data points
• 𝐷 : input dimension
• 𝑘 : number of nearest neighbors
• 𝑑 : output dimension
References:
• “Nonlinear dimensionality reduction by locally linear embedding” Roweis, S. & Saul, L. Science 290:2323
(2000)

Modified Locally Linear Embedding
One well-known issue with LLE is the regularization problem. When the number of neighbors is greater than the
number of input dimensions, the matrix defining each local neighborhood is rank-deficient. To address this, standard
LLE applies an arbitrary regularization parameter 𝑟, which is chosen relative to the trace of the local weight matrix.
Though it can be shown formally that as 𝑟 → 0, the solution converges to the desired embedding, there is no guarantee
that the optimal solution will be found for 𝑟 > 0. This problem manifests itself in embeddings which distort the
underlying geometry of the manifold.
One method to address the regularization problem is to use multiple weight vectors in each neighborhood.
This is the essence of modified locally linear embedding (MLLE). MLLE can be performed with function
locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the keyword method = 'modified'. It requires n_neighbors > n_components.
Complexity
The MLLE algorithm comprises three stages:
1. Nearest Neighbors Search. Same as standard LLE
2. Weight Matrix Construction. Approximately 𝑂[𝐷𝑁 𝑘 3 ]+𝑂[𝑁 (𝑘 −𝐷)𝑘 2 ]. The first term is exactly equivalent
to that of standard LLE. The second term has to do with constructing the weight matrix from multiple weights.
In practice, the added cost of constructing the MLLE weight matrix is relatively small compared to the cost of
steps 1 and 3.
3. Partial Eigenvalue Decomposition. Same as standard LLE
The overall complexity of MLLE is 𝑂[𝐷 log(𝑘)𝑁 log(𝑁 )] + 𝑂[𝐷𝑁 𝑘 3 ] + 𝑂[𝑁 (𝑘 − 𝐷)𝑘 2 ] + 𝑂[𝑑𝑁 2 ].

• 𝑁 : number of training data points
• 𝐷 : input dimension
• 𝑘 : number of nearest neighbors
• 𝑑 : output dimension
References:
• “MLLE: Modified Locally Linear Embedding Using Multiple Weights” Zhang, Z. & Wang, J.

Hessian Eigenmapping
Hessian Eigenmapping (also known as Hessian-based LLE: HLLE) is another method of solving the regularization
problem of LLE. It revolves around a hessian-based quadratic form at each neighborhood which is used to recover
the locally linear structure. Though other implementations note its poor scaling with data size, sklearn implements some algorithmic improvements which make its cost comparable to that of other LLE variants for small output
dimension. HLLE can be performed with function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the keyword method = 'hessian'. It requires n_neighbors >
n_components * (n_components + 3) / 2.

Complexity
The HLLE algorithm comprises three stages:
1. Nearest Neighbors Search. Same as standard LLE
2. Weight Matrix Construction. Approximately 𝑂[𝐷𝑁 𝑘 3 ] + 𝑂[𝑁 𝑑6 ]. The first term reflects a similar cost to
that of standard LLE. The second term comes from a QR decomposition of the local hessian estimator.
3. Partial Eigenvalue Decomposition. Same as standard LLE
The overall complexity of standard HLLE is 𝑂[𝐷 log(𝑘)𝑁 log(𝑁 )] + 𝑂[𝐷𝑁 𝑘 3 ] + 𝑂[𝑁 𝑑6 ] + 𝑂[𝑑𝑁 2 ].
• 𝑁 : number of training data points
• 𝐷 : input dimension
• 𝑘 : number of nearest neighbors
• 𝑑 : output dimension
References:
• “Hessian Eigenmaps: Locally linear embedding techniques for high-dimensional data” Donoho, D. &
Grimes, C. Proc Natl Acad Sci USA. 100:5591 (2003)

Spectral Embedding
Spectral Embedding is an approach to calculating a non-linear embedding. Scikit-learn implements Laplacian Eigenmaps, which finds a low dimensional representation of the data using a spectral decomposition of the graph Laplacian.
The graph generated can be considered as a discrete approximation of the low dimensional manifold in the high dimensional space. Minimization of a cost function based on the graph ensures that points close to each other on the manifold
are mapped close to each other in the low dimensional space, preserving local distances. Spectral embedding can be
performed with the function spectral_embedding or its object-oriented counterpart SpectralEmbedding.
Complexity
The Spectral Embedding (Laplacian Eigenmaps) algorithm comprises three stages:
1. Weighted Graph Construction. Transform the raw input data into graph representation using affinity (adjacency) matrix representation.
2. Graph Laplacian Construction. The unnormalized graph Laplacian is constructed as L = D - A and the
normalized one as L = D^{-1/2} (D - A) D^{-1/2}.
3. Partial Eigenvalue Decomposition. Eigenvalue decomposition is done on the graph Laplacian.
The overall complexity of spectral embedding is 𝑂[𝐷 log(𝑘)𝑁 log(𝑁 )] + 𝑂[𝐷𝑁 𝑘 3 ] + 𝑂[𝑑𝑁 2 ].
• 𝑁 : number of training data points
• 𝐷 : input dimension
• 𝑘 : number of nearest neighbors
• 𝑑 : output dimension

References:
• “Laplacian Eigenmaps for Dimensionality Reduction and Data Representation” M. Belkin, P. Niyogi, Neural
Computation, June 2003; 15 (6):1373-1396

Local Tangent Space Alignment
Though not technically a variant of LLE, Local tangent space alignment (LTSA) is algorithmically similar enough
to LLE that it can be put in this category. Rather than focusing on preserving neighborhood distances as in LLE,
LTSA seeks to characterize the local geometry at each neighborhood via its tangent space, and performs a global
optimization to align these local tangent spaces to learn the embedding. LTSA can be performed with function
locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the keyword method = 'ltsa'.

Complexity
The LTSA algorithm comprises three stages:
1. Nearest Neighbors Search. Same as standard LLE
2. Weight Matrix Construction. Approximately 𝑂[𝐷𝑁 𝑘 3 ] + 𝑂[𝑘 2 𝑑]. The first term reflects a similar cost to that
of standard LLE.
3. Partial Eigenvalue Decomposition. Same as standard LLE
The overall complexity of standard LTSA is 𝑂[𝐷 log(𝑘)𝑁 log(𝑁 )] + 𝑂[𝐷𝑁 𝑘 3 ] + 𝑂[𝑘 2 𝑑] + 𝑂[𝑑𝑁 2 ].
• 𝑁 : number of training data points
• 𝐷 : input dimension
• 𝑘 : number of nearest neighbors
• 𝑑 : output dimension

References:
• “Principal manifolds and nonlinear dimensionality reduction via tangent space alignment” Zhang, Z. & Zha,
H. Journal of Shanghai Univ. 8:406 (2004)

Multi-dimensional Scaling (MDS)
Multidimensional scaling (MDS) seeks a low-dimensional representation of the data in which the distances respect well
the distances in the original high-dimensional space.
In general, MDS is a technique used for analyzing similarity or dissimilarity data. MDS attempts to model similarity or
dissimilarity data as distances in a geometric space. The data can be ratings of similarity between objects, interaction
frequencies of molecules, or trade indices between countries.
There exist two types of MDS algorithms: metric and non-metric. In scikit-learn, the class MDS implements
both. In metric MDS, the input similarity matrix arises from a metric (and thus respects the triangular inequality); the
distances between two output points are then set to be as close as possible to the similarity or dissimilarity data. In
the non-metric version, the algorithms will try to preserve the order of the distances, and hence seek a monotonic
relationship between the distances in the embedded space and the similarities/dissimilarities.
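A minimal sketch of both variants (X is an assumed data array; with dissimilarity='precomputed' a dissimilarity matrix would be passed instead):
>>> from sklearn.manifold import MDS
>>> mds = MDS(n_components=2, metric=True, dissimilarity='euclidean', random_state=0)
>>> X_mds = mds.fit_transform(X)
>>> stress = mds.stress_                        # value of the stress objective at convergence
>>> nmds = MDS(n_components=2, metric=False)    # non-metric variant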

Let 𝑆 be the similarity matrix, and 𝑋 the coordinates of the 𝑛 input points. Disparities 𝑑ˆ𝑖𝑗 are transformations of the
similarities chosen in some optimal way. The objective, called the stress, is then defined by \sum_{i < j} d_{ij}(X) - \hat{d}_{ij}(X).
Metric MDS
In the simplest metric MDS model, called absolute MDS, disparities are defined by 𝑑ˆ𝑖𝑗 = 𝑆𝑖𝑗 . With absolute MDS, the
value 𝑆𝑖𝑗 should then correspond exactly to the distance between points 𝑖 and 𝑗 in the embedding space.
Most commonly, disparities are set to 𝑑ˆ𝑖𝑗 = 𝑏𝑆𝑖𝑗 .
Nonmetric MDS
Non-metric MDS focuses on the ordination of the data. If 𝑆𝑖𝑗 < 𝑆𝑘𝑙 , then the embedding should enforce 𝑑𝑖𝑗 < 𝑑𝑗𝑘 .
A simple algorithm to enforce that is to use a monotonic regression of 𝑑𝑖𝑗 on 𝑆𝑖𝑗 , yielding disparities 𝑑ˆ𝑖𝑗 in the same
order as 𝑆𝑖𝑗 .

A trivial solution to this problem is to place all the points at the origin. In order to avoid that, the disparities 𝑑ˆ𝑖𝑗 are
normalized.

References:
• “Modern Multidimensional Scaling - Theory and Applications” Borg, I.; Groenen P. Springer Series in Statistics (1997)
• “Nonmetric multidimensional scaling: a numerical method” Kruskal, J. Psychometrika, 29 (1964)
• “Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis” Kruskal, J. Psychometrika, 29, (1964)

t-distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE (TSNE) converts affinities of data points to probabilities. The affinities in the original space are represented by
Gaussian joint probabilities and the affinities in the embedded space are represented by Student’s t-distributions. This
allows t-SNE to be particularly sensitive to local structure and has a few other advantages over existing techniques:
• Revealing the structure at many scales on a single map
• Revealing data that lie in multiple, different, manifolds or clusters
• Reducing the tendency to crowd points together at the center
While Isomap, LLE and variants are best suited to unfold a single continuous low dimensional manifold, t-SNE will
focus on the local structure of the data and will tend to extract clustered local groups of samples as highlighted on the
S-curve example. This ability to group samples based on the local structure might be beneficial to visually disentangle
a dataset that comprises several manifolds at once as is the case in the digits dataset.
The Kullback-Leibler (KL) divergence of the joint probabilities in the original space and the embedded space will
be minimized by gradient descent. Note that the KL divergence is not convex, i.e. multiple restarts with different
initializations will end up in local minima of the KL divergence. Hence, it is sometimes useful to try different seeds
and select the embedding with the lowest KL divergence.
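A minimal sketch (X is an assumed data array; the parameter values shown are illustrative, see the next section for how to tune them):
>>> from sklearn.manifold import TSNE
>>> tsne = TSNE(n_components=2, perplexity=30.0, learning_rate=200.0,
...             init='pca', random_state=0)
>>> X_tsne = tsne.fit_transform(X)    # expensive for large n_samples
>>> kl = tsne.kl_divergence_          # compare runs with different seeds, keep the lowest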
The disadvantages to using t-SNE are roughly:

• t-SNE is computationally expensive, and can take several hours on million-sample datasets where PCA will
finish in seconds or minutes
• The Barnes-Hut t-SNE method is limited to two or three dimensional embeddings.
• The algorithm is stochastic and multiple restarts with different seeds can yield different embeddings. However,
it is perfectly legitimate to pick the embedding with the least error.
• Global structure is not explicitly preserved. This problem is mitigated by initializing points with PCA (using
init='pca').

Optimizing t-SNE
The main purpose of t-SNE is visualization of high-dimensional data. Hence, it works best when the data will be
embedded in two or three dimensions.
Optimizing the KL divergence can be a little bit tricky sometimes. There are five parameters that control the optimization of t-SNE and therefore possibly the quality of the resulting embedding:
• perplexity
• early exaggeration factor
• learning rate
• maximum number of iterations
• angle (not used in the exact method)
The perplexity is defined as k = 2^S, where S is the Shannon entropy of the conditional probability distribution.
The perplexity of a 𝑘-sided die is 𝑘, so that 𝑘 is effectively the number of nearest neighbors t-SNE considers when
generating the conditional probabilities. Larger perplexities lead to more nearest neighbors and less sensitivity to small
structure. Conversely a lower perplexity considers a smaller number of neighbors, and thus ignores more global
information in favour of the local neighborhood. As dataset sizes get larger more points will be required to get a
reasonable sample of the local neighborhood, and hence larger perplexities may be required. Similarly noisier datasets
will require larger perplexity values to encompass enough local neighbors to see beyond the background noise.
The maximum number of iterations is usually high enough and does not need any tuning. The optimization consists of
two phases: the early exaggeration phase and the final optimization. During early exaggeration the joint probabilities
in the original space will be artificially increased by multiplication with a given factor. Larger factors result in larger
gaps between natural clusters in the data. If the factor is too high, the KL divergence could increase during this phase.
Usually it does not have to be tuned. A critical parameter is the learning rate. If it is too low gradient descent will get
stuck in a bad local minimum. If it is too high the KL divergence will increase during optimization. More tips can be
found in Laurens van der Maaten’s FAQ (see references). The last parameter, angle, is a tradeoff between performance
and accuracy. Larger angles imply that we can approximate larger regions by a single point, leading to better speed
but less accurate results.
“How to Use t-SNE Effectively” provides a good discussion of the effects of the various parameters, as well as
interactive plots to explore the effects of different parameters.
Barnes-Hut t-SNE
The Barnes-Hut t-SNE that has been implemented here is usually much slower than other manifold learning algorithms. The optimization is quite difficult and the computation of the gradient is 𝑂[𝑑𝑁 𝑙𝑜𝑔(𝑁 )], where 𝑑 is the number
of output dimensions and 𝑁 is the number of samples. The Barnes-Hut method improves on the exact method where
t-SNE complexity is 𝑂[𝑑𝑁 2 ], but has several other notable differences:
• The Barnes-Hut implementation only works when the target dimensionality is 3 or less. The 2D case is typical
when building visualizations.
• Barnes-Hut only works with dense input data. Sparse data matrices can only be embedded with the exact method
or can be approximated by a dense low rank projection for instance using sklearn.decomposition.
TruncatedSVD
• Barnes-Hut is an approximation of the exact method. The approximation is parameterized with the angle parameter, therefore the angle parameter is unused when method=”exact”
• Barnes-Hut is significantly more scalable. Barnes-Hut can be used to embed hundreds of thousands of data points
while the exact method can handle thousands of samples before becoming computationally intractable
For visualization purposes (which is the main use case of t-SNE), using the Barnes-Hut method is strongly recommended.
The exact t-SNE method is useful for checking the theoretical properties of the embedding, possibly in
higher dimensional space, but is limited to small datasets due to computational constraints.
Also note that the digits labels roughly match the natural grouping found by t-SNE while the linear 2D projection of
the PCA model yields a representation where label regions largely overlap. This is a strong clue that this data can be
well separated by non linear methods that focus on the local structure (e.g. an SVM with a Gaussian RBF kernel).
However, failing to visualize well separated homogeneously labeled groups with t-SNE in 2D does not necessarily
imply that the data cannot be correctly classified by a supervised model. It might be the case that 2 dimensions are
not low enough to accurately represent the internal structure of the data.
References:
• “Visualizing High-Dimensional Data Using t-SNE” van der Maaten, L.J.P.; Hinton, G. Journal of Machine
Learning Research (2008)
• “t-Distributed Stochastic Neighbor Embedding” van der Maaten, L.J.P.
• “Accelerating t-SNE using Tree-Based Algorithms.” L.J.P. van der Maaten. Journal of Machine Learning
Research 15(Oct):3221-3245, 2014.

Tips on practical use
• Make sure the same scale is used over all features. Because manifold learning methods are based on a nearest-neighbor search, the algorithm may perform poorly otherwise. See StandardScaler for convenient ways of
scaling heterogeneous data.

• The reconstruction error computed by each routine can be used to choose the optimal output dimension. For a
𝑑-dimensional manifold embedded in a 𝐷-dimensional parameter space, the reconstruction error will decrease
as n_components is increased until n_components == d.
• Note that noisy data can “short-circuit” the manifold, in essence acting as a bridge between parts of the manifold
that would otherwise be well-separated. Manifold learning on noisy and/or incomplete data is an active area of
research.
• Certain input configurations can lead to singular weight matrices, for example when more than two points in the
dataset are identical, or when the data is split into disjointed groups. In this case, solver='arpack' will
fail to find the null space. The easiest way to address this is to use solver='dense' which will work on a
singular matrix, though it may be very slow depending on the number of input points. Alternatively, one can
attempt to understand the source of the singularity: if it is due to disjoint sets, increasing n_neighbors may
help. If it is due to identical points in the dataset, removing these points may help.
See also:
Totally Random Trees Embedding can also be useful to derive non-linear representations of feature space, although it
does not perform dimensionality reduction.

3.2.3 Clustering
Clustering of unlabeled data can be performed with the module sklearn.cluster.
Each clustering algorithm comes in two variants: a class, that implements the fit method to learn the clusters on train
data, and a function, that, given train data, returns an array of integer labels corresponding to the different clusters. For
the class, the labels over the training data can be found in the labels_ attribute.
Input data
One important thing to note is that the algorithms implemented in this module can take different kinds of matrix as
input. All the methods accept standard data matrices of shape [n_samples, n_features]. These can be obtained from the classes in the sklearn.feature_extraction module. For AffinityPropagation,
SpectralClustering and DBSCAN one can also input similarity matrices of shape [n_samples,
n_samples]. These can be obtained from the functions in the sklearn.metrics.pairwise module.

Fig. 3.4: A comparison of the clustering algorithms in scikit-learn

Overview of clustering methods
Method name | Parameters | Scalability | Usecase | Geometry (metric used)
K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters | Distances between points
Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph)
Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Distances between points
Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph)
Ward hierarchical clustering | number of clusters | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints | Distances between points
Agglomerative clustering | number of clusters, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non Euclidean distances | Any pairwise distance
DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes | Distances between nearest points
Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation | Mahalanobis distances to centers
Birch | branching factor, threshold, optional global clusterer | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction | Euclidean distance between points

Non-flat geometry clustering is useful when the clusters have a specific shape, i.e. a non-flat manifold, and the standard
euclidean distance is not the right metric. This case arises in the two top rows of the figure above.
Gaussian mixture models, useful for clustering, are described in another chapter of the documentation dedicated
to mixture models. KMeans can be seen as a special case of Gaussian mixture model with equal covariance per
component.
K-means
The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion
known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified.
It scales well to a large number of samples and has been used across a large range of application areas in many
different fields.
The k-means algorithm divides a set of 𝑁 samples 𝑋 into 𝐾 disjoint clusters 𝐶, each described by the mean 𝜇𝑗 of
the samples in the cluster. The means are commonly called the cluster “centroids”; note that they are not, in general,
points from 𝑋, although they live in the same space. The K-means algorithm aims to choose centroids that minimise

the inertia, or within-cluster sum of squared criterion:
\sum_{i=0}^{n} \min_{\mu_j \in C} (||x_i - \mu_j||^2)

Inertia, or the within-cluster sum of squares criterion, can be recognized as a measure of how internally coherent
clusters are. It suffers from various drawbacks:
• Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds
poorly to elongated clusters, or manifolds with irregular shapes.
• Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very
high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse
of dimensionality”). Running a dimensionality reduction algorithm such as PCA prior to k-means clustering
can alleviate this problem and speed up the computations.

K-means is often referred to as Lloyd’s algorithm. In basic terms, the algorithm has three steps. The first step chooses
the initial centroids, with the most basic method being to choose 𝑘 samples from the dataset 𝑋. After initialization,
K-means consists of looping between the two other steps. The first step assigns each sample to its nearest centroid.

The second step creates new centroids by taking the mean value of all of the samples assigned to each previous
centroid. The difference between the old and the new centroids is computed and the algorithm repeats these last two
steps until this value is less than a threshold. In other words, it repeats until the centroids do not move significantly.

K-means is equivalent to the expectation-maximization algorithm with a
small, all-equal, diagonal covariance matrix.
The algorithm can also be understood through the concept of Voronoi diagrams. First the Voronoi diagram of the points
is calculated using the current centroids. Each segment in the Voronoi diagram becomes a separate cluster. Secondly,
the centroids are updated to the mean of each segment. The algorithm then repeats this until a stopping criterion is
fulfilled. Usually, the algorithm stops when the relative decrease in the objective function between iterations is less
than the given tolerance value. This is not the case in this implementation: iteration stops when centroids move less
than the tolerance.
Given enough time, K-means will always converge; however, this may be to a local minimum. This is highly dependent on the initialization of the centroids. As a result, the computation is often done several times, with different
initializations of the centroids. One method to help address this issue is the k-means++ initialization scheme, which
has been implemented in scikit-learn (use the init='k-means++' parameter). This initializes the centroids to
be (generally) distant from each other, leading to provably better results than random initialization, as shown in the
reference.
A parameter can be given to allow K-means to be run in parallel, called n_jobs. Giving this parameter a positive
value uses that many processors (default: 1). A value of -1 uses all available processors, with -2 using one less, and so
on. Parallelization generally speeds up computation at the cost of memory (in this case, multiple copies of centroids
need to be stored, one for each job).
Warning: The parallel version of K-Means is broken on OS X when numpy uses the Accelerate Framework. This
is expected behavior: Accelerate can be called after a fork but you need to execv the subprocess with the Python
binary (which multiprocessing does not do under posix).
K-means can be used for vector quantization. This is achieved using the transform method of a trained model of
KMeans.
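As a minimal sketch (not part of the original guide), the following code fits KMeans on a small toy dataset and uses transform to obtain the distance of each sample to every centroid, which is the basis of vector quantization; the toy data and parameter values are arbitrary illustrations:

import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated blobs in 2D (values chosen arbitrarily).
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.3, 7.7], [7.9, 8.4]])

# Fit K-means with the (default) k-means++ initialization and two clusters.
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster index assigned to each sample
print(kmeans.cluster_centers_)   # the learned centroids

# Vector quantization: transform gives the distance of each sample to every
# centroid; the index of the nearest centroid is the sample's quantized code.
codes = kmeans.transform(X).argmin(axis=1)
print(codes)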
Examples:
• Demonstration of k-means assumptions: Demonstrating when k-means performs intuitively and when it does
not
• A demo of K-Means clustering on the handwritten digits data: Clustering handwritten digits

References:
• “k-means++: The advantages of careful seeding” Arthur, David, and Sergei Vassilvitskii, Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, Society for Industrial and Applied Mathematics (2007)

Mini Batch K-Means
The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation
time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required
to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch
k-means produces results that are generally only slightly worse than the standard algorithm.
The algorithm iterates between two major steps, similar to vanilla k-means. In the first step, 𝑏 samples are drawn
randomly from the dataset, to form a mini-batch. These are then assigned to the nearest centroid. In the second step,
the centroids are updated. In contrast to k-means, this is done on a per-sample basis. For each sample in the mini-batch,
the assigned centroid is updated by taking the streaming average of the sample and all previous samples assigned to
that centroid. This has the effect of decreasing the rate of change for a centroid over time. These steps are performed
until convergence or a predetermined number of iterations is reached.
MiniBatchKMeans converges faster than KMeans, but the quality of the results is reduced. In practice this difference in quality can be quite small, as shown in the example and cited reference.
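As an illustrative sketch (not from the original text), the two estimators can be compared on synthetic data as follows; the blob parameters and batch_size are arbitrary choices:

import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data: 10,000 samples drawn around 5 centers (arbitrary parameters).
X, _ = make_blobs(n_samples=10000, centers=5, random_state=0)

# Full-batch K-means.
km = KMeans(n_clusters=5, random_state=0).fit(X)

# Mini-batch K-means: each iteration only uses batch_size randomly drawn samples.
mbk = MiniBatchKMeans(n_clusters=5, batch_size=100, random_state=0).fit(X)

# Both expose the same attributes; the inertia of MiniBatchKMeans is typically
# only slightly worse than that of the full-batch algorithm.
print(km.inertia_, mbk.inertia_)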

Examples:
• Comparison of the K-Means and MiniBatchKMeans clustering algorithms: Comparison of KMeans and
MiniBatchKMeans
• Clustering text documents using k-means: Document clustering using sparse MiniBatchKMeans
• Online learning of a dictionary of parts of faces

References:
• “Web Scale K-Means clustering” D. Sculley, Proceedings of the 19th international conference on World wide
web (2010)


Affinity Propagation
AffinityPropagation creates clusters by sending messages between pairs of samples until convergence. A
dataset is then described using a small number of exemplars, which are identified as those most representative of other
samples. The messages sent between pairs represent the suitability for one sample to be the exemplar of the other,
which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at
which point the final exemplars are chosen, and hence the final clustering is given.

Affinity Propagation can be interesting as it chooses the number of clusters based on the data provided. For this purpose, the two important parameters are the preference, which controls how many exemplars are used, and the damping
factor which damps the responsibility and availability messages to avoid numerical oscillations when updating these
messages.
The main drawback of Affinity Propagation is its complexity. The algorithm has a time complexity of the order
𝑂(𝑁 2 𝑇 ), where 𝑁 is the number of samples and 𝑇 is the number of iterations until convergence. Further, the memory
complexity is of the order 𝑂(𝑁 2 ) if a dense similarity matrix is used, but reducible if a sparse similarity matrix is
used. This makes Affinity Propagation most appropriate for small to medium sized datasets.
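For illustration (a sketch, not from the guide), AffinityPropagation can be fit as follows; the dataset is synthetic and the preference and damping values are arbitrary:

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Keep the dataset small: the algorithm is O(N^2 T) in time and O(N^2) in memory.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# preference controls how many exemplars are used; damping (in [0.5, 1)) damps
# the responsibility and availability messages. Both values here are arbitrary.
af = AffinityPropagation(preference=-50, damping=0.9).fit(X)

print(len(af.cluster_centers_indices_))  # number of clusters chosen from the data
print(af.labels_[:10])                   # cluster assignments of the first samples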
Examples:
• Demo of affinity propagation clustering algorithm: Affinity Propagation on a synthetic 2D dataset with 3 classes.
• Visualizing the stock market structure: Affinity Propagation on financial time series to find groups of companies
Algorithm description: The messages sent between points belong to one of two categories. The first is the responsibility 𝑟(𝑖, 𝑘), which is the accumulated evidence that sample 𝑘 should be the exemplar for sample 𝑖. The second is the
availability 𝑎(𝑖, 𝑘) which is the accumulated evidence that sample 𝑖 should choose sample 𝑘 to be its exemplar, and
considers the values for all other samples that 𝑘 should be an exemplar. In this way, exemplars are chosen by samples
if they are (1) similar enough to many samples and (2) chosen by many samples to be representative of themselves.
More formally, the responsibility of a sample 𝑘 to be the exemplar of sample 𝑖 is given by:
r(i, k) \leftarrow s(i, k) - \max [ a(i, k') + s(i, k') \; \forall k' \neq k ]
Where 𝑠(𝑖, 𝑘) is the similarity between samples 𝑖 and 𝑘. The availability of sample 𝑘 to be the exemplar of sample 𝑖 is given by:

a(i, k) \leftarrow \min \left[ 0, r(k, k) + \sum_{i' \text{ s.t. } i' \notin \{i, k\}} r(i', k) \right]

To begin with, all values for 𝑟 and 𝑎 are set to zero, and the calculation of each iterates until convergence. As discussed above, in order to avoid numerical oscillations when updating the messages, the damping factor 𝜆 is introduced to the iteration process:

r_{t+1}(i, k) = \lambda \cdot r_t(i, k) + (1 - \lambda) \cdot r_{t+1}(i, k)

a_{t+1}(i, k) = \lambda \cdot a_t(i, k) + (1 - \lambda) \cdot a_{t+1}(i, k)

where 𝑡 indicates the iteration number.
Mean Shift
MeanShift clustering aims to discover blobs in a smooth density of samples. It is a centroid based algorithm, which
works by updating candidates for centroids to be the mean of the points within a given region. These candidates are
then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids.
Given a candidate centroid 𝑥𝑖 for iteration 𝑡, the candidate is updated according to the following equation:
x_i^{t+1} = x_i^t + m(x_i^t)
Where 𝑁 (𝑥𝑖 ) is the neighborhood of samples within a given distance around 𝑥𝑖 and 𝑚 is the mean shift vector that
is computed for each centroid that points towards a region of the maximum increase in the density of points. This
is computed using the following equation, effectively updating a centroid to be the mean of the samples within its
neighborhood:
m(x_i) = \frac{\sum_{x_j \in N(x_i)} K(x_j - x_i) \, x_j}{\sum_{x_j \in N(x_i)} K(x_j - x_i)}
The algorithm automatically sets the number of clusters; instead it relies on a parameter, bandwidth, which dictates the size of the region to search through. This parameter can be set manually, but it can also be estimated using the provided estimate_bandwidth function, which is called if the bandwidth is not set.
The algorithm is not highly scalable, as it requires multiple nearest-neighbor searches during its execution. The algorithm is guaranteed to converge; however, it stops iterating when the change in centroids is small.
Labelling a new sample is performed by finding the nearest centroid for a given sample.
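As a brief sketch (not part of the original text), MeanShift can be used together with estimate_bandwidth as follows; the data and the quantile value are arbitrary illustrations:

from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=0)

# Estimate the bandwidth from the data (quantile chosen arbitrarily); if the
# bandwidth is not given, MeanShift calls estimate_bandwidth internally.
bandwidth = estimate_bandwidth(X, quantile=0.2)

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)
print(ms.cluster_centers_)  # final centroids after filtering near-duplicates
print(ms.predict(X[:5]))    # new samples are labelled by their nearest centroid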
Examples:
• A demo of the mean-shift clustering algorithm: Mean Shift clustering on a synthetic 2D dataset with 3 classes.

References:
• “Mean shift: A robust approach toward feature space analysis.” D. Comaniciu and P. Meer, IEEE Transactions
on Pattern Analysis and Machine Intelligence (2002)


Spectral clustering
SpectralClustering does a low-dimension embedding of the affinity matrix between samples, followed by a
KMeans in the low dimensional space. It is especially efficient if the affinity matrix is sparse and the pyamg module
is installed. SpectralClustering requires the number of clusters to be specified. It works well for a small number of
clusters but is not advised when using many clusters.
For two clusters, it solves a convex relaxation of the normalised cuts problem on the similarity graph: cutting the graph in two so that the weight of the edges cut is small compared to the weights of the edges inside each cluster. This criterion is especially interesting when working on images: graph vertices are pixels, and edges of the similarity graph are a function of the gradient of the image.
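To illustrate the API (a sketch, not from the guide), SpectralClustering can be applied to a non-convex toy dataset as follows; the nearest-neighbors affinity and the dataset are arbitrary choices:

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Two interleaved half-circles: a non-convex structure where plain k-means struggles.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# affinity='nearest_neighbors' builds a sparse affinity matrix, which keeps
# the spectral embedding cheap; 'kmeans' assigns labels in the embedded space.
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        assign_labels='kmeans', random_state=0)
labels = sc.fit_predict(X)
print(labels[:10])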

Warning: Transforming distance to well-behaved similarities
Note that if the values of your similarity matrix are not well distributed, e.g. with negative values or with a distance matrix rather than a similarity, the spectral problem will be singular and the problem not solvable. In that case it is advised to apply a transformation to the entries of the matrix. For instance, in the case of a signed distance matrix, it is common to apply a heat kernel:
similarity = np.exp(-beta * distance / distance.std())

See the examples for such an application.


Examples:
• Spectral clustering for image segmentation: Segmenting objects from a noisy background using spectral
clustering.
• Segmenting the picture of a raccoon face in regions: Spectral clustering to split the image of the raccoon face
in regions.

Different label assignment strategies
Different label assignment strategies can be used, corresponding to the assign_labels parameter of
SpectralClustering. The "kmeans" strategy can match finer details of the data, but it can be more unstable. In particular, unless you control the random_state, it may not be reproducible from run-to-run, as it depends
on a random initialization. On the other hand, the "discretize" strategy is 100% reproducible, but it tends to
create parcels of fairly even and geometrical shape.
assign_labels="kmeans"

assign_labels="discretize"

References:
• “A Tutorial on Spectral Clustering” Ulrike von Luxburg, 2007
• “Normalized cuts and image segmentation” Jianbo Shi, Jitendra Malik, 2000
• “A Random Walks View of Spectral Segmentation” Marina Meila, Jianbo Shi, 2001
• “On Spectral Clustering: Analysis and an algorithm” Andrew Y. Ng, Michael I. Jordan, Yair Weiss, 2001

Hierarchical clustering
Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting
them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique
cluster that gathers all the samples, the leaves being the clusters with only one sample. See the Wikipedia page for
more details.
The AgglomerativeClustering object performs a hierarchical clustering using a bottom up approach: each
observation starts in its own cluster, and clusters are successively merged together. The linkage criterion determines the metric used for the merge strategy:
• Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in
this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.
• Maximum or complete linkage minimizes the maximum distance between observations of pairs of clusters.
• Average linkage minimizes the average of the distances between all observations of pairs of clusters.
AgglomerativeClustering can also scale to a large number of samples when it is used jointly with a connectivity matrix, but it is computationally expensive when no connectivity constraints are added between samples: it considers all the possible merges at each step.
FeatureAgglomeration
The FeatureAgglomeration uses agglomerative clustering to group together features that look very similar,
thus decreasing the number of features. It is a dimensionality reduction tool, see Unsupervised dimensionality
reduction.

Different linkage type: Ward, complete and average linkage
AgglomerativeClustering supports Ward, average, and complete linkage strategies.

Agglomerative clustering has a “rich get richer” behavior that leads to uneven cluster sizes. In this regard, complete linkage is the worst strategy, and Ward gives the most regular sizes. However, the affinity (or distance used in clustering) cannot be varied with Ward, so for non-Euclidean metrics, average linkage is a good alternative.
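A minimal sketch (not from the original guide) of the three linkage strategies discussed above; the synthetic data and number of clusters are arbitrary:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# linkage can be 'ward' (the default), 'complete' or 'average';
# 'ward' only supports the Euclidean affinity.
for linkage in ('ward', 'complete', 'average'):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit(X)
    print(linkage, model.labels_[:10])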
Examples:
• Various Agglomerative Clustering on a 2D embedding of digits: exploration of the different linkage strategies
in a real dataset.

Adding connectivity constraints
An interesting aspect of AgglomerativeClustering is that connectivity constraints can be added to this algorithm (only adjacent clusters can be merged together), through a connectivity matrix that defines for each sample
the neighboring samples following a given structure of the data. For instance, in the swiss-roll example below, the connectivity constraints forbid the merging of points that are not adjacent on the swiss roll, and thus avoid forming
clusters that extend across overlapping folds of the roll.

These constraints are useful to impose a certain local structure, but they also make the algorithm faster, especially when the number of samples is high.
The connectivity constraints are imposed via a connectivity matrix: a scipy sparse matrix that has elements only
at the intersection of a row and a column with indices of the dataset that should be connected. This matrix can
be constructed from a-priori information: for instance, you may wish to cluster web pages by only merging pages
with a link pointing from one to another. It can also be learned from the data, for instance using sklearn.
neighbors.kneighbors_graph to restrict merging to nearest neighbors as in this example, or using sklearn.
feature_extraction.image.grid_to_graph to enable only merging of neighboring pixels on an image,
as in the raccoon face example.
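As a sketch (not in the original text), a k-nearest-neighbors connectivity matrix can be combined with structured Ward clustering as follows; the number of neighbors and clusters are arbitrary choices:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph

# Swiss-roll data: clusters should follow the roll rather than cut across folds.
X, _ = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)

# Connectivity matrix: each sample is connected to its 10 nearest neighbors
# (an arbitrary choice); only connected clusters may be merged together.
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)

ward = AgglomerativeClustering(n_clusters=6, connectivity=connectivity,
                               linkage='ward').fit(X)
print(ward.labels_[:10])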
Examples:
• A demo of structured Ward hierarchical clustering on a raccoon face image: Ward clustering to split the
image of a raccoon face in regions.
• Hierarchical clustering: structured vs unstructured ward: Example of Ward algorithm on a swiss-roll, comparison of structured approaches versus unstructured approaches.
• Feature agglomeration vs. univariate selection: Example of dimensionality reduction with feature agglomeration based on Ward hierarchical clustering.
• Agglomerative clustering with and without structure

Warning: Connectivity constraints with average and complete linkage
Connectivity constraints and complete or average linkage can enhance the ‘rich getting richer’ aspect of agglomerative clustering, particularly so if they are built with sklearn.neighbors.kneighbors_graph. In the
limit of a small number of clusters, they tend to give a few macroscopically occupied clusters and almost empty
ones. (see the discussion in Agglomerative clustering with and without structure).


Varying the metric
Average and complete linkage can be used with a variety of distances (or affinities), in particular Euclidean distance
(l2), Manhattan distance (or Cityblock, or l1), cosine distance, or any precomputed affinity matrix.
• l1 distance is often good for sparse features, or sparse noise: i.e. many of the features are zero, as in text mining
using occurrences of rare words.
• cosine distance is interesting because it is invariant to global scalings of the signal.
The guideline for choosing a metric is to use one that maximizes the distance between samples in different classes, and minimizes the distance within each class.

Examples:
• Agglomerative clustering with different metrics

DBSCAN
The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather
generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are
convex shaped. The central component to the DBSCAN is the concept of core samples, which are samples that are in
areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance
measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples). There
are two parameters to the algorithm, min_samples and eps, which define formally what we mean when we say
dense. Higher min_samples or lower eps indicate higher density necessary to form a cluster.
More formally, we define a core sample as being a sample in the dataset such that there exist min_samples other
samples within a distance of eps, which are defined as neighbors of the core sample. This tells us that the core sample
is in a dense area of the vector space. A cluster is a set of core samples that can be built by recursively taking a core
sample, finding all of its neighbors that are core samples, finding all of their neighbors that are core samples, and so
on. A cluster also has a set of non-core samples, which are samples that are neighbors of a core sample in the cluster
but are not themselves core samples. Intuitively, these samples are on the fringes of a cluster.

Any core sample is part of a cluster, by definition. Any sample that is not a core sample, and is at least eps in distance
from any core sample, is considered an outlier by the algorithm.
In the figure below, the color indicates cluster membership, with large circles indicating core samples found by the
algorithm. Smaller circles are non-core samples that are still part of a cluster. Moreover, the outliers are indicated by
black points below.
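For illustration (a sketch, not from the guide), the core samples and the outlier label -1 can be inspected as follows; eps, min_samples and the data are arbitrary:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.4, random_state=0)

# eps and min_samples define formally what "dense" means (values are arbitrary).
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

print(db.core_sample_indices_[:10])  # indices of the core samples
print(set(db.labels_))               # cluster labels; -1 marks outliers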

Examples:
• Demo of DBSCAN clustering algorithm

Implementation
The DBSCAN algorithm is deterministic, always generating the same clusters when given the same data in the
same order. However, the results can differ when data is provided in a different order. First, even though the core
samples will always be assigned to the same clusters, the labels of those clusters will depend on the order in which
those samples are encountered in the data. Second and more importantly, the clusters to which non-core samples
are assigned can differ depending on the data order. This would happen when a non-core sample has a distance
lower than eps to two core samples in different clusters. By the triangular inequality, those two core samples must
be more distant than eps from each other, or they would be in the same cluster. The non-core sample is assigned
to whichever cluster is generated first in a pass through the data, and so the results will depend on the data ordering.
The current implementation uses ball trees and kd-trees to determine the neighborhood of points, which avoids
calculating the full distance matrix (as was done in scikit-learn versions before 0.14). The possibility to use custom
metrics is retained; for details, see NearestNeighbors.

Memory consumption for large sample sizes
This implementation is by default not memory efficient because it constructs a full pairwise similarity matrix in the
case where kd-trees or ball-trees cannot be used (e.g. with sparse matrices). This matrix will consume n^2 floats.
A couple of mechanisms for getting around this are:
• A sparse radius neighborhood graph (where missing entries are presumed to be out of eps) can be precomputed in a memory-efficient way and dbscan can be run over this with metric='precomputed'.
• The dataset can be compressed, either by removing exact duplicates if these occur in your data, or by using
BIRCH. Then you only have a relatively small number of representatives for a large number of points. You can then provide a sample_weight when fitting DBSCAN.

References:
• “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise” Ester, M., H. P.
Kriegel, J. Sander, and X. Xu, In Proceedings of the 2nd International Conference on Knowledge Discovery
and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996

Birch
The Birch builds a tree called the Characteristic Feature Tree (CFT) for the given data. The data is essentially lossy
compressed to a set of Characteristic Feature nodes (CF Nodes). The CF Nodes have a number of subclusters called
Characteristic Feature subclusters (CF Subclusters) and these CF Subclusters located in the non-terminal CF Nodes
can have CF Nodes as children.
The CF Subclusters hold the necessary information for clustering which prevents the need to hold the entire input data
in memory. This information includes:
• Number of samples in a subcluster.
• Linear Sum - An n-dimensional vector holding the sum of all samples.
• Squared Sum - Sum of the squared L2 norm of all samples.
• Centroids - To avoid recalculation: linear sum / n_samples.
• Squared norm of the centroids.
The Birch algorithm has two parameters, the threshold and the branching factor. The branching factor limits the
number of subclusters in a node and the threshold limits the distance between the entering sample and the existing
subclusters.
This algorithm can be viewed as an instance or data reduction method, since it reduces the input data to a set of
subclusters which are obtained directly from the leaves of the CFT. This reduced data can be further processed by
feeding it into a global clusterer. This global clusterer can be set by n_clusters. If n_clusters is set to None,
the subclusters from the leaves are directly read off, otherwise a global clustering step labels these subclusters into
global clusters (labels) and the samples are mapped to the global label of the nearest subcluster.
Algorithm description:
• A new sample is inserted into the root of the CF Tree which is a CF Node. It is then merged with the subcluster of
the root, that has the smallest radius after merging, constrained by the threshold and branching factor conditions.
If the subcluster has any child node, then this is done repeatedly till it reaches a leaf. After finding the nearest
subcluster in the leaf, the properties of this subcluster and the parent subclusters are recursively updated.
• If the radius of the subcluster obtained by merging the new sample and the nearest subcluster is greater than
the square of the threshold and if the number of subclusters is greater than the branching factor, then a space is
temporarily allocated to this new sample. The two farthest subclusters are taken and the subclusters are divided
into two groups on the basis of the distance between these subclusters.
• If this split node has a parent subcluster and there is room for a new subcluster, then the parent is split into two.
If there is no room, then this node is again split into two and the process is continued recursively, till it reaches
the root.
Birch or MiniBatchKMeans?
• Birch does not scale very well to high dimensional data. As a rule of thumb if n_features is greater than
twenty, it is generally better to use MiniBatchKMeans.

• If the number of instances of data needs to be reduced, or if one wants a large number of subclusters either as a
preprocessing step or otherwise, Birch is more useful than MiniBatchKMeans.
How to use partial_fit?
To avoid the computation of the global clustering at every call of partial_fit, the user is advised (see the code sketch after this list):
1. To set n_clusters=None initially
2. Train all data by multiple calls to partial_fit.
3. Set n_clusters to a required value using brc.set_params(n_clusters=n_clusters).
4. Call partial_fit finally with no arguments, i.e brc.partial_fit() which performs the global clustering.
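The workflow above might look as follows (a sketch, not from the original text; the random data, chunking, threshold and n_clusters values are arbitrary):

import numpy as np
from sklearn.cluster import Birch

rng = np.random.RandomState(0)
X = rng.rand(10000, 2)  # stand-in for data arriving in chunks

# 1. Start without a global clusterer.
brc = Birch(n_clusters=None, threshold=0.05)

# 2. Train on all the data through multiple calls to partial_fit.
for chunk in np.array_split(X, 10):
    brc.partial_fit(chunk)

# 3. Choose the number of global clusters afterwards.
brc.set_params(n_clusters=10)

# 4. A final partial_fit() with no argument performs the global clustering.
brc.partial_fit()
labels = brc.predict(X)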

References:
• Tian Zhang, Raghu Ramakrishnan, Miron Livny BIRCH: An efficient data clustering method for large
databases. http://www.cs.sfu.ca/CourseCentral/459/han/papers/zhang96.pdf
• Roberto Perdisci JBirch - Java implementation of BIRCH clustering algorithm https://code.google.com/
archive/p/jbirch

Clustering performance evaluation
Evaluating the performance of a clustering algorithm is not as trivial as counting the number of errors or the precision and recall of a supervised classification algorithm. In particular, any evaluation metric should not take the absolute values of the cluster labels into account, but rather whether this clustering defines separations of the data similar to some ground truth set of classes, or satisfies some assumption such that members belonging to the same class are more similar than members of different classes according to some similarity metric.
Adjusted Rand index
Given the knowledge of the ground truth class assignments labels_true and our clustering algorithm assignments
of the same samples labels_pred, the adjusted Rand index is a function that measures the similarity of the two
assignments, ignoring permutations and with chance normalization:
>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> metrics.adjusted_rand_score(labels_true, labels_pred)
0.24...

One can permute 0 and 1 in the predicted labels, rename 2 to 3, and get the same score:
>>> labels_pred = [1, 1, 0, 0, 3, 3]
>>> metrics.adjusted_rand_score(labels_true, labels_pred)
0.24...

Furthermore, adjusted_rand_score is symmetric: swapping the argument does not change the score. It can
thus be used as a consensus measure:
>>> metrics.adjusted_rand_score(labels_pred, labels_true)
0.24...

Perfect labeling is scored 1.0:
>>> labels_pred = labels_true[:]
>>> metrics.adjusted_rand_score(labels_true, labels_pred)
1.0

Bad (e.g. independent labelings) have negative or close to 0.0 scores:
>>> labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
>>> labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
>>> metrics.adjusted_rand_score(labels_true, labels_pred)
-0.12...

Advantages
• Random (uniform) label assignments have an ARI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for the raw Rand index or the V-measure for instance).
• Bounded range [-1, 1]: negative values are bad (independent labelings), similar clusterings have a positive ARI, 1.0 is the perfect match score.
• No assumption is made on the cluster structure: it can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with results of spectral clustering algorithms, which can find clusters with “folded” shapes.
Drawbacks
• Contrary to inertia, ARI requires knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
However ARI can also be useful in a purely unsupervised setting as a building block for a Consensus Index that
can be used for clustering model selection (TODO).
Examples:
• Adjustment for chance in clustering performance evaluation: Analysis of the impact of the dataset size on the
value of clustering measures for random assignments.


Mathematical formulation
If C is a ground truth class assignment and K the clustering, let us define 𝑎 and 𝑏 as:
• 𝑎, the number of pairs of elements that are in the same set in C and in the same set in K
• 𝑏, the number of pairs of elements that are in different sets in C and in different sets in K
The raw (unadjusted) Rand index is then given by:

\text{RI} = \frac{a + b}{C_2^{n_{samples}}}

where C_2^{n_{samples}} is the total number of possible pairs in the dataset (without ordering).
However the RI score does not guarantee that random label assignments will get a value close to zero (esp. if the
number of clusters is in the same order of magnitude as the number of samples).
To counter this effect we can discount the expected RI 𝐸[RI] of random labelings by defining the adjusted Rand index
as follows:
\text{ARI} = \frac{\text{RI} - E[\text{RI}]}{\max(\text{RI}) - E[\text{RI}]}

References
• Comparing Partitions L. Hubert and P. Arabie, Journal of Classification 1985
• Wikipedia entry for the adjusted Rand index

Mutual Information based scores
Given the knowledge of the ground truth class assignments labels_true and our clustering algorithm assignments
of the same samples labels_pred, the Mutual Information is a function that measures the agreement of the two
assignments, ignoring permutations. Two different normalized versions of this measure are available, Normalized Mutual Information (NMI) and Adjusted Mutual Information (AMI). NMI is often used in the literature, while AMI was proposed more recently and is normalized against chance:
>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)
0.22504...

One can permute 0 and 1 in the predicted labels, rename 2 to 3 and get the same score:
>>> labels_pred = [1, 1, 0, 0, 3, 3]
>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)
0.22504...

All of mutual_info_score, adjusted_mutual_info_score and normalized_mutual_info_score are symmetric: swapping the arguments does not change the score. Thus they can be used as a consensus measure:
>>> metrics.adjusted_mutual_info_score(labels_pred, labels_true)
0.22504...


Perfect labeling is scored 1.0:
>>> labels_pred = labels_true[:]
>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)
1.0
>>> metrics.normalized_mutual_info_score(labels_true, labels_pred)
1.0

This is not true for mutual_info_score, which is therefore harder to judge:
>>> metrics.mutual_info_score(labels_true, labels_pred)
0.69...

Bad (e.g. independent labelings) have non-positive scores:
>>> labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
>>> labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)
-0.10526...

Advantages
• Random (uniform) label assignments have an AMI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for raw Mutual Information or the V-measure for instance).
• Bounded range [0, 1]: Values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement. Further, a value of exactly 0 indicates purely independent label assignments and an AMI of exactly 1 indicates that the two label assignments are equal (with or without permutation).
• No assumption is made on the cluster structure: they can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with results of spectral clustering algorithms, which can find clusters with “folded” shapes.
Drawbacks
• Contrary to inertia, MI-based measures require the knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
However, MI-based measures can also be useful in a purely unsupervised setting as a building block for a Consensus Index that can be used for clustering model selection.
• NMI and MI are not adjusted against chance.
Examples:
• Adjustment for chance in clustering performance evaluation: Analysis of the impact of the dataset size on the
value of clustering measures for random assignments. This example also includes the Adjusted Rand Index.


Mathematical formulation
Assume two label assignments (of the same N objects), 𝑈 and 𝑉 . Their entropy is the amount of uncertainty for a
partition set, defined by:
H(U) = - \sum_{i=1}^{|U|} P(i) \log(P(i))
where 𝑃 (𝑖) = |𝑈𝑖 |/𝑁 is the probability that an object picked at random from 𝑈 falls into class 𝑈𝑖 . Likewise for 𝑉 :
H(V) = - \sum_{j=1}^{|V|} P'(j) \log(P'(j))
With 𝑃 ′ (𝑗) = |𝑉𝑗 |/𝑁 . The mutual information (MI) between 𝑈 and 𝑉 is calculated by:
\text{MI}(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} P(i, j) \log\left(\frac{P(i, j)}{P(i) P'(j)}\right)
where 𝑃 (𝑖, 𝑗) = |𝑈𝑖 ∩ 𝑉𝑗 |/𝑁 is the probability that an object picked at random falls into both classes 𝑈𝑖 and 𝑉𝑗 .
It also can be expressed in set cardinality formulation:
\text{MI}(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i \cap V_j|}{N} \log\left(\frac{N \, |U_i \cap V_j|}{|U_i| \, |V_j|}\right)
The normalized mutual information is defined as
\text{NMI}(U, V) = \frac{\text{MI}(U, V)}{\sqrt{H(U) H(V)}}

This value of the mutual information and also the normalized variant is not adjusted for chance and will tend to increase
as the number of different labels (clusters) increases, regardless of the actual amount of “mutual information” between
the label assignments.
The expected value for the mutual information can be calculated using the following equation, from Vinh, Epps, and
Bailey, (2009). In this equation, 𝑎𝑖 = |𝑈𝑖 | (the number of elements in 𝑈𝑖 ) and 𝑏𝑗 = |𝑉𝑗 | (the number of elements in
𝑉𝑗 ).
E[\text{MI}(U, V)] = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \sum_{n_{ij}=(a_i+b_j-N)^+}^{\min(a_i, b_j)} \frac{n_{ij}}{N} \log\left(\frac{N \cdot n_{ij}}{a_i b_j}\right) \frac{a_i! \, b_j! \, (N-a_i)! \, (N-b_j)!}{N! \, n_{ij}! \, (a_i - n_{ij})! \, (b_j - n_{ij})! \, (N - a_i - b_j + n_{ij})!}

Using the expected value, the adjusted mutual information can then be calculated using a similar form to that of the
adjusted Rand index:
\text{AMI} = \frac{\text{MI} - E[\text{MI}]}{\max(H(U), H(V)) - E[\text{MI}]}

References
• Strehl, Alexander, and Joydeep Ghosh (2002). “Cluster ensembles – a knowledge reuse framework for combining multiple partitions”.
Journal of Machine Learning Research 3: 583–617.
doi:10.1162/153244303321897735.


• Vinh, Epps, and Bailey, (2009). “Information theoretic measures for clusterings comparison”. Proceedings of
the 26th Annual International Conference on Machine Learning - ICML ‘09. doi:10.1145/1553374.1553511.
ISBN 9781605585161.
• Vinh, Epps, and Bailey, (2010). Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance, JMLR http://jmlr.csail.mit.edu/papers/volume11/vinh10a/
vinh10a.pdf
• Wikipedia entry for the (normalized) Mutual Information
• Wikipedia entry for the Adjusted Mutual Information

Homogeneity, completeness and V-measure
Given the knowledge of the ground truth class assignments of the samples, it is possible to define some intuitive metric
using conditional entropy analysis.
In particular Rosenberg and Hirschberg (2007) define the following two desirable objectives for any cluster assignment:
• homogeneity: each cluster contains only members of a single class.
• completeness: all members of a given class are assigned to the same cluster.
We can turn those concepts into scores homogeneity_score and completeness_score. Both are bounded below by 0.0 and above by 1.0 (higher is better):
>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> metrics.homogeneity_score(labels_true, labels_pred)
0.66...
>>> metrics.completeness_score(labels_true, labels_pred)
0.42...

Their harmonic mean called V-measure is computed by v_measure_score:
>>> metrics.v_measure_score(labels_true, labels_pred)
0.51...

The V-measure is actually equivalent to the mutual information (NMI) discussed above normalized by the sum of the
label entropies [B2011].
Homogeneity, completeness and V-measure can be computed at once using homogeneity_completeness_v_measure as follows:
>>> metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)
...
(0.66..., 0.42..., 0.51...)

The following clustering assignment is slightly better, since it is homogeneous but not complete:
>>> labels_pred = [0, 0, 0, 1, 2, 2]
>>> metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)
...
(1.0, 0.68..., 0.81...)


Note: v_measure_score is symmetric: it can be used to evaluate the agreement of two independent assignments
on the same dataset.
This is not the case for completeness_score and homogeneity_score: both are bound by the relationship:
homogeneity_score(a, b) == completeness_score(b, a)

Advantages
• Bounded scores: 0.0 is as bad as it can be, 1.0 is a perfect score.
• Intuitive interpretation: clustering with a bad V-measure can be qualitatively analyzed in terms of homogeneity and completeness to get a better feel for what ‘kind’ of mistakes is made by the assignment.
• No assumption is made on the cluster structure: they can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with results of spectral clustering algorithms, which can find clusters with “folded” shapes.
Drawbacks
• The previously introduced metrics are not normalized with regards to random labeling: this means that
depending on the number of samples, clusters and ground truth classes, a completely random labeling will
not always yield the same values for homogeneity, completeness and hence v-measure. In particular random
labeling won’t yield zero scores especially when the number of clusters is large.
This problem can safely be ignored when the number of samples is more than a thousand and the number of
clusters is less than 10. For smaller sample sizes or larger number of clusters it is safer to use an adjusted
index such as the Adjusted Rand Index (ARI).
• These metrics require the knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
Examples:
• Adjustment for chance in clustering performance evaluation: Analysis of the impact of the dataset size on the
value of clustering measures for random assignments.

Mathematical formulation
Homogeneity and completeness scores are formally given by:
h = 1 - \frac{H(C|K)}{H(C)}

c = 1 - \frac{H(K|C)}{H(K)}

where 𝐻(𝐶|𝐾) is the conditional entropy of the classes given the cluster assignments and is given by:
H(C|K) = - \sum_{c=1}^{|C|} \sum_{k=1}^{|K|} \frac{n_{c,k}}{n} \cdot \log\left(\frac{n_{c,k}}{n_k}\right)

and 𝐻(𝐶) is the entropy of the classes and is given by:
H(C) = - \sum_{c=1}^{|C|} \frac{n_c}{n} \cdot \log\left(\frac{n_c}{n}\right)

with 𝑛 the total number of samples, 𝑛𝑐 and 𝑛𝑘 the number of samples respectively belonging to class 𝑐 and cluster 𝑘,
and finally 𝑛𝑐,𝑘 the number of samples from class 𝑐 assigned to cluster 𝑘.
The conditional entropy of clusters given class 𝐻(𝐾|𝐶) and the entropy of clusters 𝐻(𝐾) are defined in a symmetric manner.
Rosenberg and Hirschberg further define V-measure as the harmonic mean of homogeneity and completeness:
v = 2 \cdot \frac{h \cdot c}{h + c}

References
• V-Measure: A conditional entropy-based external cluster evaluation measure Andrew Rosenberg and Julia
Hirschberg, 2007

Fowlkes-Mallows scores
The Fowlkes-Mallows index (sklearn.metrics.fowlkes_mallows_score) can be used when the ground
truth class assignments of the samples is known. The Fowlkes-Mallows score FMI is defined as the geometric mean
of the pairwise precision and recall:
\text{FMI} = \frac{\text{TP}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})}}

where TP is the number of True Positives (i.e. the number of pairs of points that belong to the same clusters in both the true labels and the predicted labels), FP is the number of False Positives (i.e. the number of pairs of points that belong to the same clusters in the true labels and not in the predicted labels) and FN is the number of False Negatives (i.e. the number of pairs of points that belong to the same clusters in the predicted labels and not in the true labels).
The score ranges from 0 to 1. A high value indicates a good similarity between two clusters.
>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)
0.47140...

One can permute 0 and 1 in the predicted labels, rename 2 to 3 and get the same score:
>>> labels_pred = [1, 1, 0, 0, 3, 3]
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)
0.47140...

Perfect labeling is scored 1.0:


>>> labels_pred = labels_true[:]
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)
1.0

Bad (e.g. independent labelings) have zero scores:
>>> labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
>>> labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)
0.0

Advantages
• Random (uniform) label assignments have an FMI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for raw Mutual Information or the V-measure for instance).
• Bounded range [0, 1]: Values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement. Further, a value of exactly 0 indicates purely independent label assignments and an FMI of exactly 1 indicates that the two label assignments are equal (with or without permutation).
• No assumption is made on the cluster structure: it can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with results of spectral clustering algorithms, which can find clusters with “folded” shapes.
Drawbacks
• Contrary to inertia, FMI-based measures require the knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
References
• E. B. Fowlkes and C. L. Mallows, 1983. “A method for comparing two hierarchical clusterings”. Journal of the American Statistical Association. http://wildfire.stat.ucla.edu/pdflibrary/fowlkes.pdf
• Wikipedia entry for the Fowlkes-Mallows Index

Silhouette Coefficient
If the ground truth labels are not known, evaluation must be performed using the model itself. The Silhouette Coefficient (sklearn.metrics.silhouette_score) is an example of such an evaluation, where a higher Silhouette
Coefficient score relates to a model with better defined clusters. The Silhouette Coefficient is defined for each sample
and is composed of two scores:
• a: The mean distance between a sample and all other points in the same class.
• b: The mean distance between a sample and all other points in the next nearest cluster.
The Silhouette Coefficient s for a single sample is then given as:
s = \frac{b - a}{\max(a, b)}

The Silhouette Coefficient for a set of samples is given as the mean of the Silhouette Coefficient for each sample.
>>> from sklearn import metrics
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn import datasets
>>> dataset = datasets.load_iris()
>>> X = dataset.data
>>> y = dataset.target

In normal usage, the Silhouette Coefficient is applied to the results of a cluster analysis.
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans_model.labels_
>>> metrics.silhouette_score(X, labels, metric='euclidean')
...
0.55...

References
• Peter J. Rousseeuw (1987). “Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster
Analysis”. Computational and Applied Mathematics 20: 53–65. doi:10.1016/0377-0427(87)90125-7.

Advantages
• The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero
indicate overlapping clusters.
• The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
Drawbacks
• The Silhouette Coefficient is generally higher for convex clusters than other concepts of clusters, such as density
based clusters like those obtained through DBSCAN.
Examples:
• Selecting the number of clusters with silhouette analysis on KMeans clustering : In this example the silhouette
analysis is used to choose an optimal value for n_clusters.

Calinski-Harabaz Index
If the ground truth labels are not known, the Calinski-Harabaz index (sklearn.metrics.
calinski_harabaz_score) can be used to evaluate the model, where a higher Calinski-Harabaz score
relates to a model with better defined clusters.


For 𝑘 clusters, the Calinski-Harabaz score 𝑠 is given as the ratio of the between-clusters dispersion mean and the
within-cluster dispersion:
s(k) = \frac{\mathrm{Tr}(B_k)}{\mathrm{Tr}(W_k)} \times \frac{N - k}{k - 1}

where B_k is the between-group dispersion matrix and W_k is the within-cluster dispersion matrix, defined by:

W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^T

B_k = \sum_{q} n_q (c_q - c)(c_q - c)^T

with N the number of points in our data, C_q the set of points in cluster q, c_q the center of cluster q, c the center of E, and n_q the number of points in cluster q.
>>> from sklearn import metrics
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn import datasets
>>> dataset = datasets.load_iris()
>>> X = dataset.data
>>> y = dataset.target

In normal usage, the Calinski-Harabaz index is applied to the results of a cluster analysis.
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans_model.labels_
>>> metrics.calinski_harabaz_score(X, labels)
560.39...

Advantages
• The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
• The score is fast to compute
Drawbacks
• The Calinski-Harabaz index is generally higher for convex clusters than other concepts of clusters, such as
density based clusters like those obtained through DBSCAN.
References
• Caliński, T., & Harabasz, J. (1974). “A dendrite method for cluster analysis”. Communications in Statisticstheory and Methods 3: 1-27. doi:10.1080/03610926.2011.560741.


3.2.4 Biclustering
Biclustering can be performed with the module sklearn.cluster.bicluster. Biclustering algorithms simultaneously cluster rows and columns of a data matrix. These clusters of rows and columns are known as biclusters.
Each determines a submatrix of the original data matrix with some desired properties.
For instance, given a matrix of shape (10, 10), one possible bicluster with three rows and two columns induces a
submatrix of shape (3, 2):
>>> import numpy as np
>>> data = np.arange(100).reshape(10, 10)
>>> rows = np.array([0, 2, 3])[:, np.newaxis]
>>> columns = np.array([1, 2])
>>> data[rows, columns]
array([[ 1, 2],
[21, 22],
[31, 32]])

For visualization purposes, given a bicluster, the rows and columns of the data matrix may be rearranged to make the
bicluster contiguous.
Algorithms differ in how they define biclusters. Some of the common types include:
• constant values, constant rows, or constant columns
• unusually high or low values
• submatrices with low variance
• correlated rows or columns
Algorithms also differ in how rows and columns may be assigned to biclusters, which leads to different bicluster
structures. Block diagonal or checkerboard structures occur when rows and columns are divided into partitions.
If each row and each column belongs to exactly one bicluster, then rearranging the rows and columns of the data matrix
reveals the biclusters on the diagonal. Here is an example of this structure where biclusters have higher average values
than the other rows and columns:

Fig. 3.5: An example of biclusters formed by partitioning rows and columns.
In the checkerboard case, each row belongs to all column clusters, and each column belongs to all row clusters. Here
is an example of this structure where the variance of the values within each bicluster is small:


Fig. 3.6: An example of checkerboard biclusters.
After fitting a model, row and column cluster membership can be found in the rows_ and columns_ attributes.
rows_[i] is a binary vector with nonzero entries corresponding to rows that belong to bicluster i. Similarly,
columns_[i] indicates which columns belong to bicluster i.
Some models also have row_labels_ and column_labels_ attributes. These models partition the rows and
columns, such as in the block diagonal and checkerboard bicluster structures.
Note: Biclustering has many other names in different fields including co-clustering, two-mode clustering, two-way
clustering, block clustering, coupled two-way clustering, etc. The names of some algorithms, such as the Spectral
Co-Clustering algorithm, reflect these alternate names.

Spectral Co-Clustering
The SpectralCoclustering algorithm finds biclusters with values higher than those in the corresponding other
rows and columns. Each row and each column belongs to exactly one bicluster, so rearranging the rows and columns
to make partitions contiguous reveals these high values along the diagonal:
Note: The algorithm treats the input data matrix as a bipartite graph: the rows and columns of the matrix correspond
to the two sets of vertices, and each entry corresponds to an edge between a row and a column. The algorithm
approximates the normalized cut of this graph to find heavy subgraphs.

Mathematical formulation
An approximate solution to the optimal normalized cut may be found via the generalized eigenvalue decomposition of
the Laplacian of the graph. Usually this would mean working directly with the Laplacian matrix. If the original data
matrix 𝐴 has shape 𝑚 × 𝑛, the Laplacian matrix for the corresponding bipartite graph has shape (𝑚 + 𝑛) × (𝑚 + 𝑛).
However, in this case it is possible to work directly with 𝐴, which is smaller and more efficient.
The input matrix 𝐴 is preprocessed as follows:
A_n = R^{-1/2} A C^{-1/2}


where R is the diagonal matrix with entry i equal to \sum_j A_{ij} and C is the diagonal matrix with entry j equal to \sum_i A_{ij}.

The singular value decomposition, 𝐴𝑛 = 𝑈 Σ𝑉 ⊤ , provides the partitions of the rows and columns of 𝐴. A subset of
the left singular vectors gives the row partitions, and a subset of the right singular vectors gives the column partitions.
The ℓ = ⌈log2 𝑘⌉ singular vectors, starting from the second, provide the desired partitioning information. They are
used to form the matrix 𝑍:
Z = \begin{bmatrix} R^{-1/2} U \\ C^{-1/2} V \end{bmatrix}
where the columns of 𝑈 are 𝑢2 , . . . , 𝑢ℓ+1 , and similarly for 𝑉 .
Then the rows of 𝑍 are clustered using k-means. The first n_rows labels provide the row partitioning, and the
remaining n_columns labels provide the column partitioning.
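As a hedged sketch (not part of the original text), the estimator can be fit on synthetic block-diagonal data and the attributes described earlier (rows_, columns_, row_labels_, column_labels_) inspected; the shape, noise level and number of clusters are arbitrary:

from sklearn.cluster.bicluster import SpectralCoclustering
from sklearn.datasets import make_biclusters

# Synthetic data with 4 block-diagonal biclusters (parameters are arbitrary).
data, rows, columns = make_biclusters(shape=(300, 40), n_clusters=4,
                                      noise=5, random_state=0)

model = SpectralCoclustering(n_clusters=4, random_state=0).fit(data)

# rows_[i] and columns_[i] are boolean masks selecting bicluster i.
print(model.rows_[0].sum(), model.columns_[0].sum())

# This model partitions rows and columns, so the label attributes exist too.
print(model.row_labels_[:10])
print(model.column_labels_[:10])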
Examples:
• A demo of the Spectral Co-Clustering algorithm: A simple example showing how to generate a data matrix
with biclusters and apply this method to it.
• Biclustering documents with the Spectral Co-clustering algorithm: An example of finding biclusters in the
twenty newsgroup dataset.

References:
• Dhillon, Inderjit S, 2001. Co-clustering documents and words using bipartite spectral graph partitioning.

Spectral Biclustering
The SpectralBiclustering algorithm assumes that the input data matrix has a hidden checkerboard structure.
The rows and columns of a matrix with this structure may be partitioned so that the entries of any bicluster in the
Cartesian product of row clusters and column clusters are approximately constant. For instance, if there are two row
partitions and three column partitions, each row will belong to three biclusters, and each column will belong to two
biclusters.
The algorithm partitions the rows and columns of a matrix so that a corresponding blockwise-constant checkerboard
matrix provides a good approximation to the original matrix.
Mathematical formulation
The input matrix 𝐴 is first normalized to make the checkerboard pattern more obvious. There are three possible
methods:
1. Independent row and column normalization, as in Spectral Co-Clustering. This method makes the rows sum to
a constant and the columns sum to a different constant.
2. Bistochastization: repeated row and column normalization until convergence. This method makes both rows
and columns sum to the same constant.
3. Log normalization: the log of the data matrix is computed: 𝐿 = log 𝐴. Then the column mean 𝐿𝑖· , row mean
𝐿·𝑗 , and overall mean 𝐿·· of 𝐿 are computed. The final matrix is computed according to the formula

K_{ij} = L_{ij} - L_{i\cdot} - L_{\cdot j} + L_{\cdot\cdot}
After normalizing, the first few singular vectors are computed, just as in the Spectral Co-Clustering algorithm.
If log normalization was used, all the singular vectors are meaningful. However, if independent normalization or bistochastization was used, the first singular vectors, 𝑢1 and 𝑣1 , are discarded. From now on, the “first” singular vectors refer to 𝑢2 . . . 𝑢𝑝+1 and 𝑣2 . . . 𝑣𝑝+1 , except in the case of log normalization.
Given these singular vectors, they are ranked according to which can be best approximated by a piecewise-constant
vector. The approximations for each vector are found using one-dimensional k-means and scored using the Euclidean
distance. Some subset of the best left and right singular vectors is selected. Next, the data is projected to this best
subset of singular vectors and clustered.
For instance, if 𝑝 singular vectors were calculated, the 𝑞 best are found as described, where 𝑞 < 𝑝. Let 𝑈 be the matrix
with columns the 𝑞 best left singular vectors, and similarly 𝑉 for the right. To partition the rows, the rows of 𝐴 are
projected to a 𝑞 dimensional space: 𝐴 * 𝑉 . Treating the 𝑚 rows of this 𝑚 × 𝑞 matrix as samples and clustering using
k-means yields the row labels. Similarly, projecting the columns to 𝐴⊤ * 𝑈 and clustering this 𝑛 × 𝑞 matrix yields the
column labels.
Examples:
• A demo of the Spectral Biclustering algorithm: a simple example showing how to generate a checkerboard
matrix and bicluster it.

References:
• Kluger, Yuval, et. al., 2003. Spectral biclustering of microarray data: coclustering genes and conditions.

Biclustering evaluation
There are two ways of evaluating a biclustering result: internal and external. Internal measures, such as cluster stability, rely only on the data and the result themselves. Currently there are no internal bicluster measures in scikit-learn. External measures refer to an external source of information, such as the true solution. When working with real data the true solution is usually unknown, but biclustering artificial data may be useful for evaluating algorithms precisely because the true solution is known.
To compare a set of found biclusters to the set of true biclusters, two similarity measures are needed: a similarity
measure for individual biclusters, and a way to combine these individual similarities into an overall score.
To compare individual biclusters, several measures have been used. For now, only the Jaccard index is implemented:
J(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}

where 𝐴 and 𝐵 are biclusters, and |𝐴 ∩ 𝐵| is the number of elements in their intersection. The Jaccard index achieves its minimum of 0 when the biclusters do not overlap at all and its maximum of 1 when they are identical.
Several methods have been developed to compare two sets of biclusters. For now, only consensus_score (Hochreiter et. al., 2010) is available:
1. Compute bicluster similarities for pairs of biclusters, one in each set, using the Jaccard index or a similar
measure.
2. Assign biclusters from one set to another in a one-to-one fashion to maximize the sum of their similarities. This
step is performed using the Hungarian algorithm.


3. The final sum of similarities is divided by the size of the larger set.
The minimum consensus score, 0, occurs when all pairs of biclusters are totally dissimilar. The maximum score, 1,
occurs when both sets are identical.
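A short sketch (not from the guide) of how consensus_score can compare the biclusters found by an estimator with the true ones; the generator parameters are arbitrary:

from sklearn.cluster.bicluster import SpectralCoclustering
from sklearn.datasets import make_biclusters
from sklearn.metrics import consensus_score

data, true_rows, true_columns = make_biclusters(shape=(300, 40), n_clusters=4,
                                                noise=5, random_state=0)
model = SpectralCoclustering(n_clusters=4, random_state=0).fit(data)

# consensus_score matches found biclusters to true ones (Hungarian algorithm)
# and averages their Jaccard similarities; 1.0 means perfect recovery.
print(consensus_score(model.biclusters_, (true_rows, true_columns)))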
References:
• Hochreiter, Bodenhofer, et. al., 2010. FABIA: factor analysis for bicluster acquisition.

3.2.5 Decomposing signals in components (matrix factorization problems)
Principal component analysis (PCA)
Exact PCA and probabilistic interpretation
PCA is used to decompose a multivariate dataset in a set of successive orthogonal components that explain a maximum
amount of the variance. In scikit-learn, PCA is implemented as a transformer object that learns 𝑛 components in its
fit method, and can be used on new data to project it on these components.
The optional parameter whiten=True makes it possible to project the data onto the singular space while scaling
each component to unit variance. This is often useful if the models down-stream make strong assumptions on the
isotropy of the signal: this is for example the case for Support Vector Machines with the RBF kernel and the K-Means
clustering algorithm.
Below is an example of the iris dataset, which is comprised of 4 features, projected on the 2 dimensions that explain
most variance:

The PCA object also provides a probabilistic interpretation of the PCA that can give a likelihood of data based on the
amount of variance it explains. As such it implements a score method that can be used in cross-validation:
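A minimal sketch of this usage on the iris data (whitening and the number of components are illustrative choices):

>>> from sklearn.datasets import load_iris
>>> from sklearn.decomposition import PCA
>>> X = load_iris().data                          # 150 samples, 4 features
>>> pca = PCA(n_components=2, whiten=True).fit(X)
>>> X_2d = pca.transform(X)                       # projection on the 2 leading components
>>> var_ratio = pca.explained_variance_ratio_     # variance explained by each component
>>> avg_log_likelihood = pca.score(X)             # probabilistic PCA score, usable in cross-validation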


Examples:
• Comparison of LDA and PCA 2D projection of Iris dataset
• Model selection with Probabilistic PCA and Factor Analysis (FA)

Incremental PCA
The PCA object is very useful, but has certain limitations for large datasets. The biggest limitation is that PCA only supports batch processing, which means all of the data to be processed must fit in main memory. The IncrementalPCA
object uses a different form of processing and allows for partial computations which almost exactly match the results
of PCA while processing the data in a minibatch fashion. IncrementalPCA makes it possible to implement out-of-core Principal Component Analysis either by:
• Using its partial_fit method on chunks of data fetched sequentially from the local hard drive or a network
database.
• Calling its fit method on a memory mapped file using numpy.memmap.
IncrementalPCA only stores estimates of component and noise variances, in order to update
explained_variance_ratio_ incrementally. This is why memory usage depends on the number of
samples per batch, rather than the number of samples to be processed in the dataset.
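A minimal sketch of the partial_fit usage on synthetic data (the array sizes and number of components are arbitrary):

>>> import numpy as np
>>> from sklearn.decomposition import IncrementalPCA
>>> X = np.random.RandomState(0).rand(1000, 50)        # stand-in for data streamed from disk
>>> ipca = IncrementalPCA(n_components=10)
>>> for batch in np.array_split(X, 10):                # feed the data chunk by chunk
...     _ = ipca.partial_fit(batch)
>>> X_reduced = ipca.transform(X)                      # close to what batch PCA would give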
Examples:
• Incremental PCA


PCA using randomized SVD
It is often interesting to project data to a lower-dimensional space that preserves most of the variance, by dropping the
singular vectors associated with lower singular values.
For instance, if we work with 64x64 pixel gray-level pictures for face recognition, the dimensionality of the data is
4096 and it is slow to train an RBF support vector machine on such wide data. Furthermore we know that the intrinsic
dimensionality of the data is much lower than 4096 since all pictures of human faces look somewhat alike. The
samples lie on a manifold of much lower dimension (say around 200 for instance). The PCA algorithm can be used to
linearly transform the data while both reducing the dimensionality and preserve most of the explained variance at the
same time.
The class PCA used with the optional parameter svd_solver='randomized' is very useful in that case: since
we are going to drop most of the singular vectors it is much more efficient to limit the computation to an approximated
estimate of the singular vectors we will keep to actually perform the transform.
For instance, the following shows 16 sample portraits (centered around 0.0) from the Olivetti dataset. On the right
hand side are the first 16 singular vectors reshaped as portraits. Since we only require the top 16 singular vectors of a
dataset with size n_samples = 400 and n_features = 64 × 64 = 4096, the computation time is less than 1s:

Note: with the optional parameter svd_solver='randomized', we also need to give PCA the size of the lower-dimensional space n_components as a mandatory input parameter.
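A minimal sketch of this setting on the Olivetti faces (fetch_olivetti_faces downloads the data on first use; 16 components is the illustrative choice used in the figure above):

>>> from sklearn.datasets import fetch_olivetti_faces
>>> from sklearn.decomposition import PCA
>>> faces = fetch_olivetti_faces(shuffle=True, random_state=0)   # 400 x 4096 data matrix
>>> pca = PCA(n_components=16, svd_solver='randomized', whiten=True)
>>> eigenfaces = pca.fit(faces.data).components_                 # only the top 16 singular vectors are estimated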


If we note n_max = max(n_samples, n_features) and n_min = min(n_samples, n_features), the time complexity of the randomized PCA is O(n_max^2 · n_components) instead of O(n_max^2 · n_min) for the exact method implemented in PCA.
The memory footprint of randomized PCA is also proportional to 2 · n_max · n_components instead of n_max · n_min for the exact method.
Note: the implementation of inverse_transform in PCA with svd_solver='randomized' is not the exact
inverse transform of transform even when whiten=False (default).
Examples:
• Faces recognition example using eigenfaces and SVMs
• Faces dataset decompositions

References:
• “Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions” Halko, et al., 2009

Kernel PCA
KernelPCA is an extension of PCA which achieves non-linear dimensionality reduction through the use of
kernels (see Pairwise metrics, Affinities and Kernels). It has many applications including denoising, compression and structured prediction (kernel dependency estimation). KernelPCA supports both transform and
inverse_transform.
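A minimal sketch on synthetic concentric circles, where a linear PCA could not separate the structure (the kernel and gamma values are illustrative):

>>> from sklearn.datasets import make_circles
>>> from sklearn.decomposition import KernelPCA
>>> X, y = make_circles(n_samples=400, factor=.3, noise=.05, random_state=0)
>>> kpca = KernelPCA(kernel='rbf', gamma=10, fit_inverse_transform=True)
>>> X_kpca = kpca.fit_transform(X)             # non-linear projection
>>> X_back = kpca.inverse_transform(X_kpca)    # approximate reconstruction in the original space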


Examples:
• Kernel PCA

Sparse principal components analysis (SparsePCA and MiniBatchSparsePCA)
SparsePCA is a variant of PCA, with the goal of extracting the set of sparse components that best reconstruct the
data.
Mini-batch sparse PCA (MiniBatchSparsePCA) is a variant of SparsePCA that is faster but less accurate. The
increased speed is reached by iterating over small chunks of the set of features, for a given number of iterations.
Principal component analysis (PCA) has the disadvantage that the components extracted by this method have exclusively dense expressions, i.e. they have non-zero coefficients when expressed as linear combinations of the original
variables. This can make interpretation difficult. In many cases, the real underlying components can be more naturally
imagined as sparse vectors; for example in face recognition, components might naturally map to parts of faces.
Sparse principal components yield a more parsimonious, interpretable representation, clearly emphasizing which of
the original features contribute to the differences between samples.
The following example illustrates 16 components extracted using sparse PCA from the Olivetti faces dataset. It can
be seen how the regularization term induces many zeros. Furthermore, the natural structure of the data causes the
non-zero coefficients to be vertically adjacent. The model does not enforce this mathematically: each component is
a vector ℎ ∈ R4096 , and there is no notion of vertical adjacency except during the human-friendly visualization as
64x64 pixel images. The fact that the components shown below appear local is the effect of the inherent structure of
the data, which makes such local patterns minimize reconstruction error. There exist sparsity-inducing norms that take
into account adjacency and different kinds of structure; see [Jen09] for a review of such methods. For more details on
how to use Sparse PCA, see the Examples section, below.


Note that there are many different formulations for the Sparse PCA problem. The one implemented here is based
on [Mrl09] . The optimization problem solved is a PCA problem (dictionary learning) with an ℓ1 penalty on the
components:
(U^*, V^*) = \underset{U, V}{\operatorname{arg\,min}} \; \frac{1}{2} \|X - U V\|_2^2 + \alpha \|V\|_1
\text{subject to } \|U_k\|_2 = 1 \text{ for all } 0 \le k < n_{components}
The sparsity-inducing ℓ1 norm also prevents learning components from noise when few training samples are available.
The degree of penalization (and thus sparsity) can be adjusted through the hyperparameter alpha. Small values lead
to a gently regularized factorization, while larger values shrink many coefficients to zero.
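A minimal sketch on random data showing how the sparsity of the components can be inspected (all values are illustrative):

>>> import numpy as np
>>> from sklearn.decomposition import SparsePCA
>>> X = np.random.RandomState(0).randn(100, 30)
>>> spca = SparsePCA(n_components=5, alpha=1, random_state=0).fit(X)
>>> components = spca.components_
>>> sparsity = np.mean(components == 0)       # fraction of exactly-zero loadings; grows with alpha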
Note:
While in the spirit of an online algorithm, the class MiniBatchSparsePCA does not implement
partial_fit because the algorithm is online along the features direction, not the samples direction.

Examples:
• Faces dataset decompositions

References:

Truncated singular value decomposition and latent semantic analysis
TruncatedSVD implements a variant of singular value decomposition (SVD) that only computes the 𝑘 largest
singular values, where 𝑘 is a user-specified parameter.
When truncated SVD is applied to term-document matrices (as returned by CountVectorizer or
TfidfVectorizer), this transformation is known as latent semantic analysis (LSA), because it transforms such
matrices to a “semantic” space of low dimensionality. In particular, LSA is known to combat the effects of synonymy
and polysemy (both of which roughly mean there are multiple meanings per word), which cause term-document matrices to be overly sparse and exhibit poor similarity under measures such as cosine similarity.


Note: LSA is also known as latent semantic indexing, LSI, though strictly that refers to its use in persistent indexes
for information retrieval purposes.
Mathematically, truncated SVD applied to training samples 𝑋 produces a low-rank approximation 𝑋𝑘 :
X \approx X_k = U_k \Sigma_k V_k^\top
After this operation, U_k \Sigma_k^\top is the transformed training set with 𝑘 features (called n_components in the API).
To also transform a test set 𝑋, we multiply it by 𝑉𝑘 :
X' = X V_k

Note: Most treatments of LSA in the natural language processing (NLP) and information retrieval (IR) literature
swap the axes of the matrix 𝑋 so that it has shape n_features × n_samples. We present LSA in a different way
that matches the scikit-learn API better, but the singular values found are the same.
TruncatedSVD is very similar to PCA, but differs in that it works on sample matrices 𝑋 directly instead of their
covariance matrices. When the columnwise (per-feature) means of 𝑋 are subtracted from the feature values, truncated
SVD on the resulting matrix is equivalent to PCA. In practical terms, this means that the TruncatedSVD transformer
accepts scipy.sparse matrices without the need to densify them, as densifying may fill up memory even for
medium-sized document collections.
While the TruncatedSVD transformer works with any (sparse) feature matrix, using it on tf–idf matrices is recommended over raw frequency counts in an LSA/document processing setting. In particular, sublinear scaling and inverse
document frequency should be turned on (sublinear_tf=True, use_idf=True) to bring the feature values
closer to a Gaussian distribution, compensating for LSA’s erroneous assumptions about textual data.
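A minimal sketch of the tf–idf plus TruncatedSVD pipeline on a toy corpus (the corpus and the number of components are made up):

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.decomposition import TruncatedSVD
>>> corpus = ["the cat sat on the mat",
...           "a dog chased the cat",
...           "dogs and cats are common pets"]
>>> tfidf = TfidfVectorizer(sublinear_tf=True, use_idf=True)
>>> X_tfidf = tfidf.fit_transform(corpus)         # sparse term-document features
>>> svd = TruncatedSVD(n_components=2, random_state=0)
>>> X_lsa = svd.fit_transform(X_tfidf)            # dense low-dimensional "semantic" representation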
Examples:
• Clustering text documents using k-means

References:
• Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze (2008), Introduction to Information Retrieval, Cambridge University Press, chapter 18: Matrix decompositions & latent semantic indexing

Dictionary Learning
Sparse coding with a precomputed dictionary
The SparseCoder object is an estimator that can be used to transform signals into sparse linear combinations of
atoms from a fixed, precomputed dictionary such as a discrete wavelet basis. This object therefore does not implement
a fit method. The transformation amounts to a sparse coding problem: finding a representation of the data as a linear
combination of as few dictionary atoms as possible. All variations of dictionary learning implement the following
transform methods, controllable via the transform_method initialization parameter:
• Orthogonal matching pursuit (Orthogonal Matching Pursuit (OMP))


• Least-angle regression (Least Angle Regression)
• Lasso computed by least-angle regression
• Lasso using coordinate descent (Lasso)
• Thresholding
Thresholding is very fast but it does not yield accurate reconstructions. Thresholded representations have nevertheless been shown to be useful in the literature for
classification tasks. For image reconstruction tasks, orthogonal matching pursuit yields the most accurate, unbiased
reconstruction.
The dictionary learning objects offer, via the split_code parameter, the possibility to separate the positive and
negative values in the results of sparse coding. This is useful when dictionary learning is used for extracting features
that will be used for supervised learning, because it allows the learning algorithm to assign different weights to the negative
loadings of a particular atom than to the corresponding positive loading.
The split code for a single sample has length 2 * n_components and is constructed using the following rule:
First, the regular code of length n_components is computed. Then, the first n_components entries of the
split_code are filled with the positive part of the regular code vector. The second half of the split code is filled
with the negative part of the code vector, only with a positive sign. Therefore, the split_code is non-negative.
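A minimal sketch (with a made-up random dictionary) of encoding signals with SparseCoder; in the estimator API the options above correspond to the transform_algorithm and split_sign constructor keywords:

>>> import numpy as np
>>> from sklearn.decomposition import SparseCoder
>>> rng = np.random.RandomState(0)
>>> D = rng.randn(15, 64)                                    # fixed dictionary: 15 atoms of dimension 64
>>> D /= np.sqrt((D ** 2).sum(axis=1))[:, np.newaxis]        # normalize the atoms
>>> X = rng.randn(5, 64)                                     # 5 signals to encode
>>> coder = SparseCoder(dictionary=D, transform_algorithm='omp',
...                     transform_n_nonzero_coefs=3, split_sign=True)
>>> code = coder.transform(X)                                # shape (5, 30): positive and negative parts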
Examples:
• Sparse coding with a precomputed dictionary

Generic dictionary learning
Dictionary learning (DictionaryLearning) is a matrix factorization problem that amounts to finding a (usually
overcomplete) dictionary that will perform well at sparsely encoding the fitted data.
Representing data as sparse combinations of atoms from an overcomplete dictionary is suggested to be the way the
mammalian primary visual cortex works. Consequently, dictionary learning applied on image patches has been shown
to give good results in image processing tasks such as image completion, inpainting and denoising, as well as for
supervised recognition tasks.
Dictionary learning is an optimization problem solved by alternately updating the sparse code, as a solution to
multiple Lasso problems, considering the dictionary fixed, and then updating the dictionary to best fit the sparse code.
(U^*, V^*) = \underset{U, V}{\operatorname{arg\,min}} \; \frac{1}{2} \|X - U V\|_2^2 + \alpha \|U\|_1
\text{subject to } \|V_k\|_2 = 1 \text{ for all } 0 \le k < n_{atoms}


After using such a procedure to fit the dictionary, the transform is simply a sparse coding step that shares the same
implementation with all dictionary learning objects (see Sparse coding with a precomputed dictionary).
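A minimal sketch on random data standing in for flattened 4x4 patches (the number of atoms and the penalty are illustrative):

>>> import numpy as np
>>> from sklearn.decomposition import DictionaryLearning
>>> X = np.random.RandomState(0).randn(100, 16)              # e.g. 100 flattened 4x4 patches
>>> dico = DictionaryLearning(n_components=32, alpha=1, max_iter=50,
...                           transform_algorithm='omp', random_state=0)
>>> code = dico.fit(X).transform(X)                          # sparse codes w.r.t. the learned atoms
>>> atoms = dico.components_                                 # the learned, overcomplete dictionary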
The following image shows what a dictionary learned from 4x4 pixel image patches, extracted from part of the image
of a raccoon face, looks like.


Examples:
• Image denoising using dictionary learning

References:
• “Online dictionary learning for sparse coding” J. Mairal, F. Bach, J. Ponce, G. Sapiro, 2009

Mini-batch dictionary learning
MiniBatchDictionaryLearning implements a faster, but less accurate version of the dictionary learning algorithm that is better suited for large datasets.
By default, MiniBatchDictionaryLearning divides the data into mini-batches and optimizes in an online
manner by cycling over the mini-batches for the specified number of iterations. However, at the moment it does not
implement a stopping condition.
The estimator also implements partial_fit, which updates the dictionary by iterating only once over a mini-batch.
This can be used for online learning when the data is not readily available from the start, or when the data does not
fit into memory.
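A minimal sketch of the partial_fit usage on a stream of random batches (sizes and hyperparameters are illustrative):

>>> import numpy as np
>>> from sklearn.decomposition import MiniBatchDictionaryLearning
>>> rng = np.random.RandomState(0)
>>> dico = MiniBatchDictionaryLearning(n_components=32, alpha=1, random_state=0)
>>> for batch in np.array_split(rng.randn(1000, 16), 20):    # batches arriving one at a time
...     _ = dico.partial_fit(batch)
>>> atoms = dico.components_                                 # dictionary updated after every batch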
Clustering for dictionary learning
Note that when using dictionary learning to extract a representation (e.g. for sparse coding) clustering can be a
good proxy to learn the dictionary. For instance the MiniBatchKMeans estimator is computationally efficient
and implements on-line learning with a partial_fit method.
Example: Online learning of a dictionary of parts of faces

Factor Analysis
In unsupervised learning we only have a dataset 𝑋 = {𝑥1 , 𝑥2 , . . . , 𝑥𝑛 }. How can this dataset be described mathematically? A very simple continuous latent variable model for 𝑋 is
𝑥𝑖 = 𝑊 ℎ𝑖 + 𝜇 + 𝜖
The vector ℎ𝑖 is called “latent” because it is unobserved. 𝜖 is considered a noise term distributed according to a
Gaussian with mean 0 and covariance Ψ (i.e. 𝜖 ∼ 𝒩 (0, Ψ)), 𝜇 is some arbitrary offset vector. Such a model is called


“generative” as it describes how 𝑥𝑖 is generated from ℎ𝑖 . If we use all the 𝑥𝑖 's as columns to form a matrix X and all
the ℎ𝑖 's as columns of a matrix H, then we can write (with suitably defined M and E):
X = 𝑊H + M + E
In other words, we decomposed matrix X.
If ℎ𝑖 is given, the above equation automatically implies the following probabilistic interpretation:
𝑝(𝑥𝑖 |ℎ𝑖 ) = 𝒩 (𝑊 ℎ𝑖 + 𝜇, Ψ)
For a complete probabilistic model we also need a prior distribution for the latent variable ℎ. The most straightforward
assumption (based on the nice properties of the Gaussian distribution) is ℎ ∼ 𝒩 (0, I). This yields a Gaussian as the
marginal distribution of 𝑥:
𝑝(𝑥) = 𝒩 (𝜇, 𝑊 𝑊 𝑇 + Ψ)
Now, without any further assumptions the idea of having a latent variable ℎ would be superfluous – 𝑥 can be completely modelled with a mean and a covariance. We need to impose some more specific structure on one of these two
parameters. A simple additional assumption regards the structure of the error covariance Ψ:
• Ψ = 𝜎 2 I: This assumption leads to the probabilistic model of PCA.
• Ψ = diag(𝜓1 , 𝜓2 , . . . , 𝜓𝑛 ): This model is called FactorAnalysis, a classical statistical model. The matrix
W is sometimes called the “factor loading matrix”.
Both models essentially estimate a Gaussian with a low-rank covariance matrix. Because both models are probabilistic
they can be integrated in more complex models, e.g. Mixture of Factor Analysers. One gets very different models (e.g.
FastICA) if non-Gaussian priors on the latent variables are assumed.
Factor analysis can produce similar components (the columns of its loading matrix) to PCA. However, one can not
make any general statements about these components (e.g. whether they are orthogonal):


The main advantage of Factor Analysis over PCA is that it can model the variance in every direction of the input
space independently (heteroscedastic noise):

This allows better model selection than probabilistic PCA in the presence of heteroscedastic noise:
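A minimal sketch on synthetic data with heteroscedastic noise, where FactorAnalysis estimates one noise variance per feature (the latent dimension and noise scales are made up):

>>> import numpy as np
>>> from sklearn.decomposition import FactorAnalysis
>>> rng = np.random.RandomState(0)
>>> W = rng.randn(10, 3)                                        # true loading matrix
>>> X = rng.randn(500, 3).dot(W.T)                              # low-rank signal
>>> X += rng.randn(500, 10) * np.arange(1, 11)                  # heteroscedastic noise
>>> fa = FactorAnalysis(n_components=3).fit(X)
>>> loadings = fa.components_                                   # estimated factor loading matrix
>>> per_feature_noise = fa.noise_variance_                      # one noise variance per input feature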
Examples:
• Model selection with Probabilistic PCA and Factor Analysis (FA)

Independent component analysis (ICA)
Independent component analysis separates a multivariate signal into additive subcomponents that are maximally independent. It is implemented in scikit-learn using the Fast ICA algorithm. Typically, ICA is not used for reducing
dimensionality but for separating superimposed signals. Since the ICA model does not include a noise term, for the
model to be correct, whitening must be applied. This can be done internally using the whiten argument or manually
using one of the PCA variants.
It is classically used to separate mixed signals (a problem known as blind source separation), as in the example below:
ICA can also be used as yet another non-linear decomposition that finds components with some sparsity:
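A minimal sketch of blind source separation on two synthetic sources (the sources and mixing matrix are made up):

>>> import numpy as np
>>> from sklearn.decomposition import FastICA
>>> t = np.linspace(0, 8, 2000)
>>> S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]     # two independent sources
>>> A = np.array([[1.0, 0.5], [0.5, 2.0]])               # mixing matrix
>>> X = S.dot(A.T)                                       # observed mixed signals
>>> ica = FastICA(n_components=2, random_state=0)
>>> S_estimated = ica.fit_transform(X)                   # recovered sources (up to sign, scale and order)
>>> A_estimated = ica.mixing_                            # estimated mixing matrix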


Examples:
• Blind source separation using FastICA
• FastICA on 2D point clouds
• Faces dataset decompositions

Non-negative matrix factorization (NMF or NNMF)
NMF with the Frobenius norm
NMF [1] is an alternative approach to decomposition that assumes that the data and the components are non-negative.
NMF can be plugged in instead of PCA or its variants, in the cases where the data matrix does not contain negative
values. It finds a decomposition of samples 𝑋 into two matrices 𝑊 and 𝐻 of non-negative elements, by optimizing the
distance 𝑑 between 𝑋 and the matrix product 𝑊 𝐻. The most widely used distance function is the squared Frobenius
norm, which is an obvious extension of the Euclidean norm to matrices:

d_{\mathrm{Fro}}(X, Y) = \frac{1}{2} \|X - Y\|_{\mathrm{Fro}}^2 = \frac{1}{2} \sum_{i,j} (X_{ij} - Y_{ij})^2

Unlike PCA, the representation of a vector is obtained in an additive fashion, by superimposing the components,
without subtracting. Such additive models are efficient for representing images and text.
It has been observed in [Hoyer, 2004] [2] that, when carefully constrained, NMF can produce a parts-based representation
of the dataset, resulting in interpretable models. The following example displays 16 sparse components found by NMF
from the images in the Olivetti faces dataset, in comparison with the PCA eigenfaces.

The init attribute determines the initialization method applied, which has a great impact on the performance of the
method. NMF implements the method Nonnegative Double Singular Value Decomposition. NNDSVD [4] is based on
two SVD processes, one approximating the data matrix, the other approximating positive sections of the resulting
partial SVD factors utilizing an algebraic property of unit rank matrices. The basic NNDSVD algorithm is better fit
for sparse factorization. Its variants NNDSVDa (in which all zeros are set equal to the mean of all elements of the
data), and NNDSVDar (in which the zeros are set to random perturbations less than the mean of the data divided by
100) are recommended in the dense case.
[1] “Learning the parts of objects by non-negative matrix factorization” D. Lee, S. Seung, 1999
[2] “Non-negative Matrix Factorization with Sparseness Constraints” P. Hoyer, 2004
[4] “SVD based initialization: A head start for nonnegative matrix factorization” C. Boutsidis, E. Gallopoulos, 2008


Note that the Multiplicative Update (‘mu’) solver cannot update zeros present in the initialization, so it leads to poorer
results when used jointly with the basic NNDSVD algorithm which introduces a lot of zeros; in this case, NNDSVDa
or NNDSVDar should be preferred.
NMF can also be initialized with correctly scaled random non-negative matrices by setting init="random". An
integer seed or a RandomState can also be passed to random_state to control reproducibility.
In NMF, L1 and L2 priors can be added to the loss function in order to regularize the model. The L2 prior uses the
Frobenius norm, while the L1 prior uses an elementwise L1 norm. As in ElasticNet, we control the combination
of L1 and L2 with the l1_ratio (𝜌) parameter, and the intensity of the regularization with the alpha (𝛼) parameter.
Then the priors terms are:
\alpha \rho \|W\|_1 + \alpha \rho \|H\|_1 + \frac{\alpha(1 - \rho)}{2} \|W\|_{\mathrm{Fro}}^2 + \frac{\alpha(1 - \rho)}{2} \|H\|_{\mathrm{Fro}}^2

and the regularized objective function is:

d_{\mathrm{Fro}}(X, W H) + \alpha \rho \|W\|_1 + \alpha \rho \|H\|_1 + \frac{\alpha(1 - \rho)}{2} \|W\|_{\mathrm{Fro}}^2 + \frac{\alpha(1 - \rho)}{2} \|H\|_{\mathrm{Fro}}^2

NMF regularizes both W and H. The public function non_negative_factorization allows a finer control
through the regularization attribute, and may regularize only W, only H, or both.
NMF with a beta-divergence
As described previously, the most widely used distance function is the squared Frobenius norm, which is an obvious
extension of the Euclidean norm to matrices:
d_{\mathrm{Fro}}(X, Y) = \frac{1}{2} \|X - Y\|_{\mathrm{Fro}}^2 = \frac{1}{2} \sum_{i,j} (X_{ij} - Y_{ij})^2
Other distance functions can be used in NMF as, for example, the (generalized) Kullback-Leibler (KL) divergence,
also referred to as I-divergence:
d_{\mathrm{KL}}(X, Y) = \sum_{i,j} \left( X_{ij} \log\left(\frac{X_{ij}}{Y_{ij}}\right) - X_{ij} + Y_{ij} \right)

Or, the Itakura-Saito (IS) divergence:
d_{\mathrm{IS}}(X, Y) = \sum_{i,j} \left( \frac{X_{ij}}{Y_{ij}} - \log\left(\frac{X_{ij}}{Y_{ij}}\right) - 1 \right)

These three distances are special cases of the beta-divergence family, with 𝛽 = 2, 1, 0 respectively [6]. The beta-divergence is defined by:

d_{\beta}(X, Y) = \sum_{i,j} \frac{1}{\beta(\beta - 1)} \left( X_{ij}^{\beta} + (\beta - 1) Y_{ij}^{\beta} - \beta X_{ij} Y_{ij}^{\beta - 1} \right)

Note that this definition is not valid if 𝛽 ∈ (0; 1), yet it can be continuously extended to the definitions of 𝑑𝐾𝐿 and 𝑑𝐼𝑆
respectively.
NMF implements two solvers, using Coordinate Descent (‘cd’) [5] and Multiplicative Update (‘mu’) [6]. The ‘mu’ solver
can optimize every beta-divergence, including of course the Frobenius norm (𝛽 = 2), the (generalized) Kullback-Leibler divergence (𝛽 = 1) and the Itakura-Saito divergence (𝛽 = 0). Note that for 𝛽 ∈ (1; 2), the ‘mu’ solver is
[5] “Fast local algorithms for large scale nonnegative matrix and tensor factorizations.” A. Cichocki, P. Anh-Huy, 2009
[6] “Algorithms for nonnegative matrix factorization with the beta-divergence” C. Fevotte, J. Idier, 2011


significantly faster than for other values of 𝛽. Note also that with a negative (or 0, i.e. ‘itakura-saito’) 𝛽, the input
matrix cannot contain zero values.
The ‘cd’ solver can only optimize the Frobenius norm. Due to the underlying non-convexity of NMF, the different
solvers may converge to different minima, even when optimizing the same distance function.
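A minimal sketch of selecting a beta-divergence through the beta_loss parameter together with the ‘mu’ solver (the data and hyperparameters are made up):

>>> import numpy as np
>>> from sklearn.decomposition import NMF
>>> X = np.abs(np.random.RandomState(0).randn(20, 10))      # non-negative data
>>> nmf_kl = NMF(n_components=5, solver='mu', beta_loss='kullback-leibler',
...              max_iter=500, random_state=0)
>>> W = nmf_kl.fit_transform(X)
>>> H = nmf_kl.components_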
NMF is best used with the fit_transform method, which returns the matrix W. The matrix H is stored into the
fitted model in the components_ attribute; the method transform will decompose a new matrix X_new based on
these stored components:
>>> import numpy as np
>>> X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
>>> from sklearn.decomposition import NMF
>>> model = NMF(n_components=2, init='random', random_state=0)
>>> W = model.fit_transform(X)
>>> H = model.components_
>>> X_new = np.array([[1, 0], [1, 6.1], [1, 0], [1, 4], [3.2, 1], [0, 4]])
>>> W_new = model.transform(X_new)

Examples:
• Faces dataset decompositions
• Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation
• Beta-divergence loss functions

References:


Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation is a generative probabilistic model for collections of discrete datasets such as text corpora.
It is also a topic model that is used for discovering abstract topics from a collection of documents.
The graphical model of LDA is a three-level Bayesian model:

When modeling text corpora, the model assumes the following generative process for a corpus with 𝐷 documents and
𝐾 topics:
1. For each topic 𝑘, draw 𝛽𝑘 ∼ Dirichlet(𝜂), 𝑘 = 1...𝐾
2. For each document 𝑑, draw 𝜃𝑑 ∼ Dirichlet(𝛼), 𝑑 = 1...𝐷
3. For each word 𝑖 in document 𝑑:
1. Draw a topic index 𝑧𝑑𝑖 ∼ Multinomial(𝜃𝑑 )
2. Draw the observed word 𝑤𝑖𝑗 ∼ Multinomial(𝛽𝑧𝑑𝑖 )
For parameter estimation, the posterior distribution is:
p(z, \theta, \beta \mid w, \alpha, \eta) = \frac{p(z, \theta, \beta \mid \alpha, \eta)}{p(w \mid \alpha, \eta)}

Since the posterior is intractable, the variational Bayesian method uses a simpler distribution 𝑞(𝑧, 𝜃, 𝛽|𝜆, 𝜑, 𝛾) to approximate it, and those variational parameters 𝜆, 𝜑, 𝛾 are optimized to maximize the Evidence Lower Bound (ELBO):
\log P(w \mid \alpha, \eta) \ge L(w, \phi, \gamma, \lambda) \triangleq E_q[\log p(w, z, \theta, \beta \mid \alpha, \eta)] - E_q[\log q(z, \theta, \beta)]
Maximizing ELBO is equivalent to minimizing the Kullback-Leibler(KL) divergence between 𝑞(𝑧, 𝜃, 𝛽) and the true
posterior 𝑝(𝑧, 𝜃, 𝛽|𝑤, 𝛼, 𝜂).
LatentDirichletAllocation implements the online variational Bayes algorithm and supports both online and
batch update methods. While the batch method updates variational variables after each full pass through the data, the online
method updates variational variables from mini-batch data points.
Note: Although the online method is guaranteed to converge to a local optimum point, the quality of the optimum point
and the speed of convergence may depend on the mini-batch size and attributes related to the learning rate setting.


When LatentDirichletAllocation is applied on a “document-term” matrix, the matrix will be decomposed
into a “topic-term” matrix and a “document-topic” matrix. While the “topic-term” matrix is stored as components_ in
the model, the “document-topic” matrix can be calculated by the transform method.
LatentDirichletAllocation also implements a partial_fit method. This is used when data can be fetched
sequentially.
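A minimal sketch on a toy corpus (the documents, number of topics and learning method are illustrative):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.decomposition import LatentDirichletAllocation
>>> docs = ["machine learning discovers structure in data",
...         "topic models describe documents with topics",
...         "text documents are modelled as word counts"]
>>> tf = CountVectorizer().fit_transform(docs)                    # document-term matrix
>>> lda = LatentDirichletAllocation(n_components=2, learning_method='online',
...                                 random_state=0)
>>> doc_topic = lda.fit_transform(tf)                             # document-topic matrix
>>> topic_term = lda.components_                                  # topic-term matrix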
Examples:
• Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation

References:
• “Latent Dirichlet Allocation” D. Blei, A. Ng, M. Jordan, 2003
• “Online Learning for Latent Dirichlet Allocation” M. Hoffman, D. Blei, F. Bach, 2010
• “Stochastic Variational Inference” M. Hoffman, D. Blei, C. Wang, J. Paisley, 2013

3.2.6 Covariance estimation
Many statistical problems require at some point the estimation of a population’s covariance matrix, which can be seen
as an estimation of the data set’s scatter plot shape. Most of the time, such an estimation has to be done on a sample whose
properties (size, structure, homogeneity) have a large influence on the estimation’s quality. The sklearn.covariance
package aims at providing tools affording an accurate estimation of a population’s covariance matrix under various
settings.
We assume that the observations are independent and identically distributed (i.i.d.).
Empirical covariance
The covariance matrix of a data set is known to be well approximated with the classical maximum likelihood estimator
(or “empirical covariance”), provided the number of observations is large enough compared to the number of features
(the variables describing the observations). More precisely, the Maximum Likelihood Estimator of a sample is an
unbiased estimator of the corresponding population covariance matrix.
The empirical covariance matrix of a sample can be computed using the empirical_covariance function of the
package, or by fitting an EmpiricalCovariance object to the data sample with the EmpiricalCovariance.
fit method. Be careful that, depending on whether the data are centered or not, the result will be different, so one may want to use the assume_centered parameter carefully. More precisely, if one uses
assume_centered=False, then the test set is supposed to have the same mean vector as the training set. If
this is not the case, both should be centered by the user, and assume_centered=True should be used.
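A minimal sketch of both the function and the object interface on synthetic Gaussian data:

>>> import numpy as np
>>> from sklearn.covariance import EmpiricalCovariance, empirical_covariance
>>> rng = np.random.RandomState(0)
>>> X = rng.multivariate_normal(mean=[0, 0], cov=[[1, .5], [.5, 1]], size=500)
>>> emp_cov = empirical_covariance(X)                  # plain function
>>> estimator = EmpiricalCovariance().fit(X)           # estimator object
>>> Sigma = estimator.covariance_                      # same maximum likelihood estimate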
Examples:
• See Shrinkage covariance estimation: LedoitWolf vs OAS and max-likelihood for an example on how to fit an
EmpiricalCovariance object to data.


Shrunk Covariance
Basic shrinkage
Despite being an unbiased estimator of the covariance matrix, the Maximum Likelihood Estimator is not a good estimator of the eigenvalues of the covariance matrix, so the precision matrix obtained from its inversion is not accurate.
Sometimes, it even occurs that the empirical covariance matrix cannot be inverted for numerical reasons. To avoid
such an inversion problem, a transformation of the empirical covariance matrix has been introduced: the shrinkage.
In scikit-learn, this transformation (with a user-defined shrinkage coefficient) can be directly applied to a precomputed covariance with the shrunk_covariance method. Also, a shrunk estimator of the covariance can be
fitted to data with a ShrunkCovariance object and its ShrunkCovariance.fit method. Again, depending
whether the data are centered or not, the result will be different, so one may want to use the assume_centered
parameter accurately.
Mathematically, this shrinkage consists in reducing the ratio between the smallest and the largest eigenvalue of the
empirical covariance matrix. It can be done by simply shifting every eigenvalue according to a given offset, which is
equivalent to finding the l2-penalized Maximum Likelihood Estimator of the covariance matrix. In practice, shrinkage
boils down to a simple convex transformation: \hat{\Sigma}_{\rm shrunk} = (1 - \alpha)\hat{\Sigma} + \alpha \frac{{\rm Tr}\hat{\Sigma}}{p} {\rm Id}.
Choosing the amount of shrinkage, 𝛼, amounts to setting a bias/variance trade-off, and is discussed below.
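A minimal sketch of applying a fixed, user-chosen shrinkage coefficient (the value 0.1 is arbitrary):

>>> import numpy as np
>>> from sklearn.covariance import ShrunkCovariance, shrunk_covariance, empirical_covariance
>>> X = np.random.RandomState(0).randn(60, 20)          # n_samples only modestly larger than n_features
>>> estimator = ShrunkCovariance(shrinkage=0.1).fit(X)
>>> Sigma_shrunk = estimator.covariance_
>>> # equivalently, shrink a precomputed empirical covariance
>>> Sigma_shrunk_2 = shrunk_covariance(empirical_covariance(X), shrinkage=0.1)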
Examples:
• See Shrinkage covariance estimation: LedoitWolf vs OAS and max-likelihood for an example on how to fit a
ShrunkCovariance object to data.

Ledoit-Wolf shrinkage
In their 2004 paper [1], O. Ledoit and M. Wolf propose a formula to compute the optimal shrinkage coefficient 𝛼
that minimizes the Mean Squared Error between the estimated and the real covariance matrix.
The Ledoit-Wolf estimator of the covariance matrix can be computed on a sample with the ledoit_wolf function of
the sklearn.covariance package, or it can be otherwise obtained by fitting a LedoitWolf object to the same sample.
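A minimal sketch of both interfaces; the shrinkage coefficient is chosen automatically from the data:

>>> import numpy as np
>>> from sklearn.covariance import LedoitWolf, ledoit_wolf
>>> X = np.random.RandomState(0).randn(100, 20)
>>> lw = LedoitWolf().fit(X)
>>> alpha = lw.shrinkage_                    # coefficient chosen by the Ledoit-Wolf formula
>>> Sigma, alpha_again = ledoit_wolf(X)      # function interface returning the same estimate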
Note: Case when population covariance matrix is isotropic
It is important to note that when the number of samples is much larger than the number of features, one would expect
that no shrinkage would be necessary. The intuition behind this is that if the population covariance is full rank, when
the number of samples grows, the sample covariance will also become positive definite. As a result, no shrinkage would
be necessary and the method should automatically do this.
This, however, is not the case in the Ledoit-Wolf procedure when the population covariance happens to be a multiple of
the identity matrix. In this case, the Ledoit-Wolf shrinkage estimate approaches 1 as the number of samples increases.
This indicates that the optimal estimate of the covariance matrix in the Ledoit-Wolf sense is a multiple of the identity.
Since the population covariance is already a multiple of the identity matrix, the Ledoit-Wolf solution is indeed a
reasonable estimate.

[1] O. Ledoit and M. Wolf, “A Well-Conditioned Estimator for Large-Dimensional Covariance Matrices”, Journal of Multivariate Analysis, Volume 88, Issue 2, February 2004, pages 365-411.


Examples:
• See Shrinkage covariance estimation: LedoitWolf vs OAS and max-likelihood for an example on how to fit a
LedoitWolf object to data and for visualizing the performances of the Ledoit-Wolf estimator in terms of
likelihood.

References:

Oracle Approximating Shrinkage
Under the assumption that the data are Gaussian distributed, Chen et al. [2] derived a formula aimed at choosing a
shrinkage coefficient that yields a smaller Mean Squared Error than the one given by Ledoit and Wolf’s formula. The
resulting estimator is known as the Oracle Approximating Shrinkage estimator of the covariance.
The OAS estimator of the covariance matrix can be computed on a sample with the oas function of the
sklearn.covariance package, or it can be otherwise obtained by fitting an OAS object to the same sample.

Fig. 3.7: Bias-variance trade-off when setting the shrinkage: comparing the choices of Ledoit-Wolf and OAS estimators

References:

Examples:
• See Shrinkage covariance estimation: LedoitWolf vs OAS and max-likelihood for an example on how to fit an
OAS object to data.
[2] Chen et al., “Shrinkage Algorithms for MMSE Covariance Estimation”, IEEE Trans. on Sign. Proc., Volume 58, Issue 10, October 2010.


• See Ledoit-Wolf vs OAS estimation to visualize the Mean Squared Error difference between a LedoitWolf
and an OAS estimator of the covariance.

Sparse inverse covariance
The matrix inverse of the covariance matrix, often called the precision matrix, is proportional to the partial correlation
matrix. It gives the partial independence relationship. In other words, if two features are independent conditionally on
the others, the corresponding coefficient in the precision matrix will be zero. This is why it makes sense to estimate a
sparse precision matrix: by learning independence relations from the data, the estimation of the covariance matrix is
better conditioned. This is known as covariance selection.
In the small-samples situation, in which n_samples is on the order of n_features or smaller, sparse inverse
covariance estimators tend to work better than shrunk covariance estimators. However, in the opposite situation, or for
very correlated data, they can be numerically unstable. In addition, unlike shrinkage estimators, sparse estimators are
able to recover off-diagonal structure.
The GraphLasso estimator uses an l1 penalty to enforce sparsity on the precision matrix: the higher its alpha
parameter, the sparser the precision matrix. The corresponding GraphLassoCV object uses cross-validation to
automatically set the alpha parameter.
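A minimal sketch on standardized synthetic data (5 features with no true dependencies; in practice the recovered precision matrix reflects the conditional independence structure):

>>> import numpy as np
>>> from sklearn.covariance import GraphLassoCV
>>> rng = np.random.RandomState(0)
>>> X = rng.multivariate_normal(np.zeros(5), np.eye(5), size=100)
>>> X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize, as advised in the note below
>>> model = GraphLassoCV().fit(X)
>>> precision = model.precision_                 # sparse estimate of the precision matrix
>>> alpha = model.alpha_                         # penalty selected by cross-validation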
Note: Structure recovery
Recovering a graphical structure from correlations in the data is a challenging thing. If you are interested in such
recovery keep in mind that:
• Recovery is easier from a correlation matrix than a covariance matrix: standardize your observations before
running GraphLasso
• If the underlying graph has nodes with many more connections than the average node, the algorithm will miss
some of these connections.


Fig. 3.8: A comparison of maximum likelihood, shrinkage and sparse estimates of the covariance and precision matrix
in the very small samples settings.
• If your number of observations is not large compared to the number of edges in your underlying graph, you will
not recover it.
• Even if you are in favorable recovery conditions, the alpha parameter chosen by cross-validation (e.g. using the
GraphLassoCV object) will lead to selecting too many edges. However, the relevant edges will have heavier
weights than the irrelevant ones.
The mathematical formulation is the following:
\hat{K} = \operatorname{argmin}_{K} \left( \operatorname{tr}\, S K - \log \det K + \alpha \|K\|_1 \right)

where 𝐾 is the precision matrix to be estimated, and 𝑆 is the sample covariance matrix. ‖𝐾‖1 is the sum of the absolute values of off-diagonal coefficients of 𝐾. The algorithm employed to solve this problem is the GLasso algorithm,
from the Friedman 2008 Biostatistics paper. It is the same algorithm as in the R glasso package.
Examples:
• Sparse inverse covariance estimation: example on synthetic data showing some recovery of a structure, and
comparing to other covariance estimators.
• Visualizing the stock market structure: example on real stock market data, finding which symbols are most
linked.

References:
• Friedman et al, “Sparse inverse covariance estimation with the graphical lasso”, Biostatistics 9, pp 432, 2008


Robust Covariance Estimation
Real data sets are often subject to measurement or recording errors. Regular but uncommon observations may also
appear for a variety of reasons. Every observation which is very uncommon is called an outlier. The empirical covariance estimator and the shrunk covariance estimators presented above are very sensitive to the presence of outlying
observations in the data. Therefore, one should use robust covariance estimators to estimate the covariance of one's real
data sets. Alternatively, robust covariance estimators can be used to perform outlier detection and discard/downweight
some observations according to further processing of the data.
The sklearn.covariance package implements a robust estimator of covariance, the Minimum Covariance Determinant [3].
Minimum Covariance Determinant
The Minimum Covariance Determinant estimator is a robust estimator of a data set’s covariance introduced by P.J.
Rousseeuw in [3]. The idea is to find a given proportion (h) of “good” observations which are not outliers and compute
their empirical covariance matrix. This empirical covariance matrix is then rescaled to compensate for the performed
selection of observations (“consistency step”). Having computed the Minimum Covariance Determinant estimator,
one can give weights to observations according to their Mahalanobis distance, leading to a reweighted estimate of the
covariance matrix of the data set (“reweighting step”).
Rousseeuw and Van Driessen [4] developed the FastMCD algorithm in order to compute the Minimum Covariance
Determinant. This algorithm is used in scikit-learn when fitting an MCD object to data. The FastMCD algorithm also
computes a robust estimate of the data set location at the same time.
Raw estimates can be accessed as raw_location_ and raw_covariance_ attributes of a MinCovDet robust
covariance estimator object.
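A minimal sketch on synthetic data contaminated with a few outliers (the contamination scheme is made up):

>>> import numpy as np
>>> from sklearn.covariance import MinCovDet
>>> rng = np.random.RandomState(0)
>>> X = rng.multivariate_normal([0, 0], [[1, .3], [.3, 1]], size=100)
>>> X[:10] += 8                                    # contaminate 10% of the observations
>>> mcd = MinCovDet(random_state=0).fit(X)
>>> robust_cov = mcd.covariance_                   # reweighted robust estimate
>>> raw_cov = mcd.raw_covariance_                  # estimate before the reweighting step
>>> distances = mcd.mahalanobis(X)                 # squared Mahalanobis distances of the observations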
References:

Examples:
• See Robust vs Empirical covariance estimate for an example on how to fit a MinCovDet object to data and
see how the estimate remains accurate despite the presence of outliers.
• See Robust covariance estimation and Mahalanobis distances relevance to visualize the difference between
EmpiricalCovariance and MinCovDet covariance estimators in terms of Mahalanobis distance (so
we get a better estimate of the precision matrix too).

[3] P. J. Rousseeuw. Least median of squares regression. J. Am Stat Ass, 79:871, 1984.
[4] A Fast Algorithm for the Minimum Covariance Determinant Estimator, 1999, American Statistical Association and the American Society for Quality, TECHNOMETRICS.


Influence of outliers on location and covariance estimates

Separating inliers from outliers using a Mahalanobis distance

3.2.7 Novelty and Outlier Detection
Many applications require being able to decide whether a new observation belongs to the same distribution as existing
observations (it is an inlier), or should be considered as different (it is an outlier). Often, this ability is used to clean
real data sets. Two important distinctions must be made:
novelty detection The training data is not polluted by outliers, and we are interested in detecting anomalies in new observations.
outlier detection The training data contains outliers, and we need to fit the central mode of the training
data, ignoring the deviant observations.
The scikit-learn project provides a set of machine learning tools that can be used for both novelty and outlier detection.
This strategy is implemented with objects learning in an unsupervised way from the data:
estimator.fit(X_train)

new observations can then be sorted as inliers or outliers with a predict method:
estimator.predict(X_test)

Inliers are labeled 1, while outliers are labeled -1.
Novelty Detection
Consider a data set of 𝑛 observations from the same distribution described by 𝑝 features. Consider now that we add one
more observation to that data set. Is the new observation so different from the others that we can doubt it is regular?
(i.e. does it come from the same distribution?) Or on the contrary, is it so similar to the others that we cannot distinguish
it from the original observations? This is the question addressed by the novelty detection tools and methods.
In general, it is about learning a rough, close frontier delimiting the contour of the initial observations’ distribution,
plotted in the embedding 𝑝-dimensional space. Then, if further observations lie within the frontier-delimited subspace,
they are considered as coming from the same population as the initial observations. Otherwise, if they lie outside
the frontier, we can say that they are abnormal with a given confidence in our assessment.
The One-Class SVM has been introduced by Schölkopf et al. for that purpose and implemented in the Support Vector
Machines module in the svm.OneClassSVM object. It requires the choice of a kernel and a scalar parameter to

define a frontier. The RBF kernel is usually chosen although there exists no exact formula or algorithm to set its
bandwidth parameter. This is the default in the scikit-learn implementation. The 𝜈 parameter, also known as the
margin of the One-Class SVM, corresponds to the probability of finding a new, but regular, observation outside the
frontier.
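A minimal sketch of fitting svm.OneClassSVM on clean training data and labelling new observations (nu and gamma are illustrative values):

>>> import numpy as np
>>> from sklearn import svm
>>> rng = np.random.RandomState(0)
>>> X_train = 0.3 * rng.randn(100, 2)                              # training data without outliers
>>> clf = svm.OneClassSVM(nu=0.1, kernel='rbf', gamma=0.1).fit(X_train)
>>> X_test = np.r_[0.3 * rng.randn(20, 2),
...                rng.uniform(low=-4, high=4, size=(5, 2))]       # regular and abnormal points
>>> labels = clf.predict(X_test)                                   # +1 for inliers, -1 for outliers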
References:
• Estimating the support of a high-dimensional distribution Schölkopf, Bernhard, et al. Neural computation
13.7 (2001): 1443-1471.

Examples:
• See One-class SVM with non-linear kernel (RBF) for visualizing the frontier learned around some data by a
svm.OneClassSVM object.

Outlier Detection
Outlier detection is similar to novelty detection in the sense that the goal is to separate a core of regular observations
from some polluting ones, called “outliers”. Yet, in the case of outlier detection, we don’t have a clean data set
representing the population of regular observations that can be used to train any tool.
Fitting an elliptic envelope
One common way of performing outlier detection is to assume that the regular data come from a known distribution
(e.g. data are Gaussian distributed). From this assumption, we generally try to define the “shape” of the data, and can
define outlying observations as observations which stand far enough from the fit shape.


scikit-learn provides an object covariance.EllipticEnvelope that fits a robust covariance estimate to
the data, and thus fits an ellipse to the central data points, ignoring points outside the central mode.
For instance, assuming that the inlier data are Gaussian distributed, it will estimate the inlier location and covariance
in a robust way (i.e. without being influenced by outliers). The Mahalanobis distances obtained from this estimate are
used to derive a measure of outlyingness. This strategy is illustrated below.
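A minimal sketch on Gaussian data with a few uniform outliers (the contamination value is illustrative):

>>> import numpy as np
>>> from sklearn.covariance import EllipticEnvelope
>>> rng = np.random.RandomState(0)
>>> X = np.r_[rng.randn(95, 2),
...           rng.uniform(low=-6, high=6, size=(5, 2))]   # roughly 5% outliers
>>> envelope = EllipticEnvelope(contamination=0.05, random_state=0).fit(X)
>>> labels = envelope.predict(X)                           # +1 for inliers, -1 for outliers
>>> distances = envelope.mahalanobis(X)                    # outlyingness measure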

Examples:
• See Robust covariance estimation and Mahalanobis distances relevance for an illustration of the difference between using a standard (covariance.EmpiricalCovariance) or a robust estimate
(covariance.MinCovDet) of location and covariance to assess the degree of outlyingness of an observation.

References:
• Rousseeuw, P.J., Van Driessen, K. “A fast algorithm for the minimum covariance determinant estimator”
Technometrics 41(3), 212 (1999)

Isolation Forest
One efficient way of performing outlier detection in high-dimensional datasets is to use random forests. The
ensemble.IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample
is equivalent to the path length from the root node to the terminating node.


This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.
Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produce shorter path lengths for particular samples, they are highly likely to be anomalies.
This strategy is illustrated below.
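A minimal sketch of fitting ensemble.IsolationForest and scoring new observations (the contamination value is illustrative):

>>> import numpy as np
>>> from sklearn.ensemble import IsolationForest
>>> rng = np.random.RandomState(0)
>>> X_train = 0.3 * rng.randn(200, 2)
>>> clf = IsolationForest(n_estimators=100, contamination=0.1,
...                       random_state=0).fit(X_train)
>>> X_test = np.r_[0.3 * rng.randn(20, 2),
...                rng.uniform(low=-4, high=4, size=(5, 2))]
>>> labels = clf.predict(X_test)                   # +1 for inliers, -1 for outliers
>>> scores = clf.decision_function(X_test)         # the lower, the more abnormal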

Examples:
• See IsolationForest example for an illustration of the use of IsolationForest.
• See Outlier detection with several methods. for a comparison of ensemble.IsolationForest with
neighbors.LocalOutlierFactor, svm.OneClassSVM (tuned to perform like an outlier detection
method) and a covariance-based outlier detection with covariance.EllipticEnvelope.

References:
• Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. “Isolation forest.” Data Mining, 2008. ICDM‘08. Eighth
IEEE International Conference on.

Local Outlier Factor
Another efficient way to perform outlier detection on moderately high dimensional datasets is to use the Local Outlier
Factor (LOF) algorithm.
The neighbors.LocalOutlierFactor (LOF) algorithm computes a score (called local outlier factor) reflecting the degree of abnormality of the observations. It measures the local density deviation of a given data point with
respect to its neighbors. The idea is to detect the samples that have a substantially lower density than their neighbors.


In practice the local density is obtained from the k-nearest neighbors. The LOF score of an observation is equal to the
ratio of the average local density of its k-nearest neighbors to its own local density: a normal instance is expected
to have a local density similar to that of its neighbors, while abnormal data are expected to have much smaller local
density.
The number k of neighbors considered (alias parameter n_neighbors) is typically chosen 1) greater than the minimum
number of objects a cluster has to contain, so that other objects can be local outliers relative to this cluster, and 2)
smaller than the maximum number of close-by objects that can potentially be local outliers. In practice, such information is generally not available, and taking n_neighbors=20 appears to work well in general. When the proportion of
outliers is high (i.e. greater than 10 %, as in the example below), n_neighbors should be greater (n_neighbors=35 in
the example below).
The strength of the LOF algorithm is that it takes both local and global properties of datasets into consideration: it can
perform well even in datasets where abnormal samples have different underlying densities. The question is not, how
isolated the sample is, but how isolated it is with respect to the surrounding neighborhood.
This strategy is illustrated below.
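A minimal sketch with a high proportion of outliers, hence the larger n_neighbors as discussed above (all values are illustrative):

>>> import numpy as np
>>> from sklearn.neighbors import LocalOutlierFactor
>>> rng = np.random.RandomState(0)
>>> X = np.r_[0.3 * rng.randn(100, 2),
...           rng.uniform(low=-4, high=4, size=(20, 2))]   # roughly 17% outliers
>>> lof = LocalOutlierFactor(n_neighbors=35, contamination=0.15)
>>> labels = lof.fit_predict(X)                            # +1 for inliers, -1 for outliers
>>> factors = lof.negative_outlier_factor_                 # the lower, the more abnormal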

Examples:
• See Anomaly detection with Local Outlier Factor (LOF) for an illustration of the use of neighbors.
LocalOutlierFactor.
• See Outlier detection with several methods. for a comparison with other anomaly detection methods.

References:
• Breunig, Kriegel, Ng, and Sander (2000) LOF: identifying density-based local outliers. Proc. ACM SIGMOD


One-class SVM versus Elliptic Envelope versus Isolation Forest versus LOF
Strictly speaking, the One-class SVM is not an outlier-detection method, but a novelty-detection method: its training
set should not be contaminated by outliers as it may fit them. That said, outlier detection in high-dimension, or without
any assumptions on the distribution of the inlying data is very challenging, and a One-class SVM gives useful results
in these situations.
The examples below illustrate how the performance of the covariance.EllipticEnvelope degrades as the
data is less and less unimodal. The svm.OneClassSVM works better on data with multiple modes, and ensemble.
IsolationForest and neighbors.LocalOutlierFactor perform well in every case.


Table 3.1: Comparing One-class SVM, Isolation Forest, LOF, and Elliptic Envelope

• For an inlier mode that is well-centered and elliptic, the svm.OneClassSVM is not able to benefit from the rotational symmetry of the inlier population. In addition, it fits the outliers present in the training set a bit. By contrast, the decision rule based on fitting a covariance.EllipticEnvelope learns an ellipse, which fits the inlier distribution well. The ensemble.IsolationForest and neighbors.LocalOutlierFactor perform as well.

• As the inlier distribution becomes bimodal, the covariance.EllipticEnvelope does not fit the inliers well. However, we can see that ensemble.IsolationForest, svm.OneClassSVM and neighbors.LocalOutlierFactor have difficulties detecting the two modes, and that the svm.OneClassSVM tends to overfit: because it has no model of inliers, it interprets a region where, by chance, some outliers are clustered as inliers.

Examples:
• See Outlier detection with several methods.
for a comparison of the svm.OneClassSVM
(tuned to perform like an outlier detection method), the ensemble.IsolationForest, the
neighbors.LocalOutlierFactor and a covariance-based outlier detection covariance.
EllipticEnvelope.

3.2.8 Density Estimation
Density estimation walks the line between unsupervised learning, feature engineering, and data modeling. Some of
the most popular and useful density estimation techniques are mixture models such as Gaussian Mixtures (sklearn.
mixture.GaussianMixture), and neighbor-based approaches such as the kernel density estimate (sklearn.
neighbors.KernelDensity). Gaussian Mixtures are discussed more fully in the context of clustering, because
the technique is also useful as an unsupervised clustering scheme.
Density estimation is a very simple concept, and most people are already familiar with one common density estimation
technique: the histogram.
Density Estimation: Histograms
A histogram is a simple visualization of data where bins are defined, and the number of data points within each bin is
tallied. An example of a histogram can be seen in the upper-left panel of the following figure:

A major problem with histograms, however, is that the choice of binning can have a disproportionate effect on the
resulting visualization. Consider the upper-right panel of the above figure. It shows a histogram over the same data,
with the bins shifted right. The results of the two visualizations look entirely different, and might lead to different
interpretations of the data.


Intuitively, one can also think of a histogram as a stack of blocks, one block per point. By stacking the blocks in the
appropriate grid space, we recover the histogram. But what if, instead of stacking the blocks on a regular grid, we
center each block on the point it represents, and sum the total height at each location? This idea leads to the lower-left
visualization. It is perhaps not as clean as a histogram, but the fact that the data drive the block locations means that it
is a much better representation of the underlying data.
This visualization is an example of a kernel density estimation, in this case with a top-hat kernel (i.e. a square block
at each point). We can recover a smoother distribution by using a smoother kernel. The bottom-right plot shows a
Gaussian kernel density estimate, in which each point contributes a Gaussian curve to the total. The result is a smooth
density estimate which is derived from the data, and functions as a powerful non-parametric model of the distribution
of points.
Kernel Density Estimation
Kernel density estimation in scikit-learn is implemented in the sklearn.neighbors.KernelDensity estimator, which uses the Ball Tree or KD Tree for efficient queries (see Nearest Neighbors for a discussion of these).
Though the above example uses a 1D data set for simplicity, kernel density estimation can be performed in any number
of dimensions, though in practice the curse of dimensionality causes its performance to degrade in high dimensions.
In the following figure, 100 points are drawn from a bimodal distribution, and the kernel density estimates are shown
for three choices of kernels:

It’s clear how the kernel shape affects the smoothness of the resulting distribution. The scikit-learn kernel density
estimator can be used as follows:
>>> from sklearn.neighbors.kde import KernelDensity
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(X)
>>> kde.score_samples(X)
array([-0.41075698, -0.41075698, -0.41076071, -0.41075698, -0.41075698,
-0.41076071])


Here we have used kernel='gaussian', as seen above. Mathematically, a kernel is a positive function 𝐾(𝑥; ℎ)
which is controlled by the bandwidth parameter ℎ. Given this kernel form, the density estimate at a point 𝑦 within a
group of points 𝑥𝑖 ; 𝑖 = 1 · · · 𝑁 is given by:
\rho_K(y) = \sum_{i=1}^{N} K\left((y - x_i) / h\right)

The bandwidth here acts as a smoothing parameter, controlling the tradeoff between bias and variance in the result. A
large bandwidth leads to a very smooth (i.e. high-bias) density distribution. A small bandwidth leads to an unsmooth
(i.e. high-variance) density distribution.
sklearn.neighbors.KernelDensity implements several common kernel forms, which are shown in the
following figure:

The form of these kernels is as follows:
• Gaussian kernel (kernel = 'gaussian')
  K(x; h) \propto \exp\left(-\frac{x^2}{2h^2}\right)
• Tophat kernel (kernel = 'tophat')
  K(x; h) \propto 1 if x < h
• Epanechnikov kernel (kernel = 'epanechnikov')
  K(x; h) \propto 1 - \frac{x^2}{h^2}
• Exponential kernel (kernel = 'exponential')
  K(x; h) \propto \exp(-x/h)
• Linear kernel (kernel = 'linear')
  K(x; h) \propto 1 - x/h if x < h
• Cosine kernel (kernel = 'cosine')
  K(x; h) \propto \cos\left(\frac{\pi x}{2h}\right) if x < h
The kernel density estimator can be used with any of the valid distance metrics (see sklearn.neighbors.
DistanceMetric for a list of available metrics), though the results are properly normalized only for the Euclidean
metric. One particularly useful metric is the Haversine distance which measures the angular distance between points
on a sphere. Here is an example of using a kernel density estimate for a visualization of geospatial data, in this case
the distribution of observations of two different species on the South American continent:

One other useful application of kernel density estimation is to learn a non-parametric generative model of a dataset in
order to efficiently draw new samples from this generative model. Here is an example of using this process to create a
new set of hand-written digits, using a Gaussian kernel learned on a PCA projection of the data:


The “new” data consists of linear combinations of the input data, with weights probabilistically drawn given the KDE
model.
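A minimal sketch of this generative use, relying only on the sample method of a fitted KernelDensity estimator (the PCA step of the digits example is omitted here, and the data is a stand-in):

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))            # stand-in for the projected training data

kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X)
new_samples = kde.sample(n_samples=5, random_state=0)
print(new_samples.shape)                 # (5, 2): new points drawn from the estimate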
Examples:
• Simple 1D Kernel Density Estimation: computation of simple kernel density estimates in one dimension.
• Kernel Density Estimation: an example of using Kernel Density estimation to learn a generative model of the
hand-written digits data, and drawing new samples from this model.
• Kernel Density Estimate of Species Distributions: an example of Kernel Density estimation using the Haversine distance metric to visualize geospatial data.

3.2.9 Neural network models (unsupervised)
Restricted Boltzmann machines
Restricted Boltzmann machines (RBM) are unsupervised nonlinear feature learners based on a probabilistic model.
The features extracted by an RBM or a hierarchy of RBMs often give good results when fed into a linear classifier
such as a linear SVM or a perceptron.
The model makes assumptions regarding the distribution of inputs. At the moment, scikit-learn only provides
BernoulliRBM , which assumes the inputs are either binary values or values between 0 and 1, each encoding the
probability that the specific feature would be turned on.
The RBM tries to maximize the likelihood of the data using a particular graphical model. The parameter learning
algorithm used (Stochastic Maximum Likelihood) prevents the representations from straying far from the input data,
which makes them capture interesting regularities, but makes the model less useful for small datasets, and usually not
useful for density estimation.


The method gained popularity for initializing deep neural networks with the weights of independent RBMs. This
method is known as unsupervised pre-training.

Examples:
• Restricted Boltzmann Machine features for digit classification

Graphical model and parametrization
The graphical model of an RBM is a fully-connected bipartite graph.


The nodes are random variables whose states depend on the state of the other nodes they are connected to. The model
is therefore parameterized by the weights of the connections, as well as one intercept (bias) term for each visible and
hidden unit, omitted from the image for simplicity.
The energy function measures the quality of a joint assignment:
E(\mathbf{v}, \mathbf{h}) = -\sum_i \sum_j w_{ij} v_i h_j - \sum_i b_i v_i - \sum_j c_j h_j

In the formula above, b and c are the intercept vectors for the visible and hidden layers, respectively. The joint
probability of the model is defined in terms of the energy:
P(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z}

The word restricted refers to the bipartite structure of the model, which prohibits direct interaction between hidden
units, or between visible units. This means that the following conditional independencies are assumed:
h_i \perp h_j \mid \mathbf{v}

v_i \perp v_j \mid \mathbf{h}
The bipartite structure allows for the use of efficient block Gibbs sampling for inference.
Bernoulli Restricted Boltzmann machines
In the BernoulliRBM , all units are binary stochastic units. This means that the input data should either be binary, or
real-valued between 0 and 1 signifying the probability that the visible unit would turn on or off. This is a good model
for character recognition, where the interest is on which pixels are active and which aren’t. For images of natural
scenes it no longer fits because of background, depth and the tendency of neighbouring pixels to take the same values.
The conditional probability distribution of each unit is given by the logistic sigmoid activation function of the input it
receives:
P(v_i = 1 \mid \mathbf{h}) = \sigma\left(\sum_j w_{ij} h_j + b_i\right)

P(h_j = 1 \mid \mathbf{v}) = \sigma\left(\sum_i w_{ij} v_i + c_j\right)


where 𝜎 is the logistic sigmoid function:
\sigma(x) = \frac{1}{1 + e^{-x}}
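A minimal usage sketch of BernoulliRBM on binary inputs (toy data and illustrative hyperparameters only, not taken from the guide's examples):

import numpy as np
from sklearn.neural_network import BernoulliRBM

# Binary inputs, as required by the Bernoulli assumption.
X = np.array([[0, 0, 0],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])

rbm = BernoulliRBM(n_components=2, learning_rate=0.05,
                   n_iter=20, random_state=0)
rbm.fit(X)

# transform returns the latent representation P(h=1|v) for each sample.
print(rbm.transform(X))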

Stochastic Maximum Likelihood learning
The training algorithm implemented in BernoulliRBM is known as Stochastic Maximum Likelihood (SML) or
Persistent Contrastive Divergence (PCD). Optimizing maximum likelihood directly is infeasible because of the form
of the data likelihood:
\log P(v) = \log \sum_h e^{-E(v, h)} - \log \sum_{x, y} e^{-E(x, y)}

For simplicity the equation above is written for a single training example. The gradient with respect to the weights is
formed of two terms corresponding to the ones above. They are usually known as the positive gradient and the negative
gradient, because of their respective signs. In this implementation, the gradients are estimated over mini-batches of
samples.
In maximizing the log-likelihood, the positive gradient makes the model prefer hidden states that are compatible with
the observed training data. Because of the bipartite structure of RBMs, it can be computed efficiently. The negative
gradient, however, is intractable. Its goal is to lower the energy of joint states that the model prefers, therefore making
it stay true to the data. It can be approximated by Markov chain Monte Carlo using block Gibbs sampling by iteratively
sampling each of 𝑣 and ℎ given the other, until the chain mixes. Samples generated in this way are sometimes referred
to as fantasy particles. This is inefficient and it is difficult to determine whether the Markov chain mixes.
The Contrastive Divergence method suggests to stop the chain after a small number of iterations, 𝑘, usually even 1.
This method is fast and has low variance, but the samples are far from the model distribution.
Persistent Contrastive Divergence addresses this. Instead of starting a new chain each time the gradient is needed, and
performing only one Gibbs sampling step, in PCD we keep a number of chains (fantasy particles) that are updated 𝑘
Gibbs steps after each weight update. This allows the particles to explore the space more thoroughly.
References:
• “A fast learning algorithm for deep belief nets” G. Hinton, S. Osindero, Y.-W. Teh, 2006
• “Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient” T. Tieleman,
2008

3.3 Model selection and evaluation
3.3.1 Cross-validation: evaluating estimator performance
Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model
that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict
anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when
performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test,
y_test. Note that the word “experiment” is not intended to denote academic use only, because even in commercial
settings machine learning usually starts out experimentally.
In scikit-learn a random split into training and test sets can be quickly computed with the train_test_split
helper function. Let’s load the iris data set to fit a linear support vector machine on it:


>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> from sklearn import datasets
>>> from sklearn import svm

>>> iris = datasets.load_iris()
>>> iris.data.shape, iris.target.shape
((150, 4), (150,))

We can now quickly sample a training set while holding out 40% of the data for testing (evaluating) our classifier:
>>> X_train, X_test, y_train, y_test = train_test_split(
...     iris.data, iris.target, test_size=0.4, random_state=0)
>>> X_train.shape, y_train.shape
((90, 4), (90,))
>>> X_test.shape, y_test.shape
((60, 4), (60,))
>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.96...

When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set
for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator
performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer
report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and
when the experiment seems to be successful, final evaluation can be done on the test set.
However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be
used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation)
sets.
A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for
final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the
training set is split into k smaller sets (other approaches are described below, but generally follow the same principles).
The following procedure is followed for each of the k “folds”:
• A model is trained using 𝑘 − 1 of the folds as training data;
• the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a
performance measure such as accuracy).
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.
This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an
arbitrary test set), which is a major advantage in problems such as inverse inference where the number of samples is
very small.
Computing cross-validated metrics
The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the
dataset.
The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the
iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each
time):


>>> from sklearn.model_selection import cross_val_score
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_val_score(clf, iris.data, iris.target, cv=5)
>>> scores
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])

The mean score and the 95% confidence interval of the score estimate are hence given by:
>>> print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.03)

By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change
this by using the scoring parameter:
>>> from sklearn import metrics
>>> scores = cross_val_score(
...     clf, iris.data, iris.target, cv=5, scoring='f1_macro')
>>> scores
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])

See The scoring parameter: defining model evaluation rules for details. In the case of the Iris dataset, the samples are
balanced across target classes hence the accuracy and the F1-score are almost equal.
When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by
default, the latter being used if the estimator derives from ClassifierMixin.
It is also possible to use other cross validation strategies by passing a cross validation iterator instead, for instance:
>>> from sklearn.model_selection import ShuffleSplit
>>> n_samples = iris.data.shape[0]
>>> cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
>>> cross_val_score(clf, iris.data, iris.target, cv=cv)
array([ 0.97...,  0.97...,  1.        ])

Data transformation with held out data
Just as it is important to test a predictor on data held out from training, preprocessing steps (such as standardization,
feature selection, etc.) and similar data transformations should likewise be learnt from the training set and applied
to held-out data for prediction:
>>> from sklearn import preprocessing
>>> X_train, X_test, y_train, y_test = train_test_split(
...     iris.data, iris.target, test_size=0.4, random_state=0)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train_transformed = scaler.transform(X_train)
>>> clf = svm.SVC(C=1).fit(X_train_transformed, y_train)
>>> X_test_transformed = scaler.transform(X_test)
>>> clf.score(X_test_transformed, y_test)
0.9333...

A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation:
>>> from sklearn.pipeline import make_pipeline
>>> clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
>>> cross_val_score(clf, iris.data, iris.target, cv=cv)
array([ 0.97...,  0.93...,  0.95...])


See Pipeline and FeatureUnion: combining estimators.

The cross_validate function and multiple metric evaluation
The cross_validate function differs from cross_val_score in two ways:
• It allows specifying multiple metrics for evaluation.
• It returns a dict containing training scores, fit-times and score-times in addition to the test score.
For single metric evaluation, where the scoring parameter is a string, callable or None, the keys will be
['test_score', 'fit_time', 'score_time']
And for multiple metric evaluation, the return value is a dict with the following keys:
['test_<scorer1_name>', 'test_<scorer2_name>', 'test_<scorer...>', 'fit_time', 'score_time']
return_train_score is set to True by default. It adds train score keys for all the scorers. If train scores are not
needed, this should be set to False explicitly.
The multiple metrics can be specified either as a list, tuple or set of predefined scorer names:
>>> from sklearn.model_selection import cross_validate
>>> from sklearn.metrics import recall_score
>>> scoring = ['precision_macro', 'recall_macro']
>>> clf = svm.SVC(kernel='linear', C=1, random_state=0)
>>> scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
...                         cv=5, return_train_score=False)
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro']
>>> scores['test_recall_macro']
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])

Or as a dict mapping scorer name to a predefined or custom scoring function:
>>> from sklearn.metrics.scorer import make_scorer
>>> scoring = {'prec_macro': 'precision_macro',
...            'rec_micro': make_scorer(recall_score, average='macro')}
>>> scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
...                         cv=5, return_train_score=True)
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_prec_macro', 'test_rec_micro',
'train_prec_macro', 'train_rec_micro']
>>> scores['train_rec_micro']
array([ 0.97..., 0.97..., 0.99..., 0.98..., 0.98...])

Here is an example of cross_validate using a single metric:
>>> scores = cross_validate(clf, iris.data, iris.target,
...                         scoring='precision_macro')
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_score', 'train_score']


Obtaining predictions by cross-validation
The function cross_val_predict has a similar interface to cross_val_score, but returns, for each element
in the input, the prediction that was obtained for that element when it was in the test set. Only cross-validation
strategies that assign all elements to a test set exactly once can be used (otherwise, an exception is raised).
These predictions can then be used to evaluate the classifier:
>>> from sklearn.model_selection import cross_val_predict
>>> predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
>>> metrics.accuracy_score(iris.target, predicted)
0.973...

Note that the result of this computation may be slightly different from those obtained using cross_val_score as
the elements are grouped in different ways.
The available cross validation iterators are introduced in the following section.
Examples
• Receiver Operating Characteristic (ROC) with cross validation,
• Recursive feature elimination with cross-validation,
• Parameter estimation using grid search with cross-validation,
• Sample pipeline for text feature extraction and evaluation,
• Plotting Cross-Validated Predictions,
• Nested versus non-nested cross-validation.

Cross validation iterators
The following sections list utilities to generate indices that can be used to generate dataset splits according to different
cross validation strategies.
Cross-validation iterators for i.i.d. data
Assuming that some data is Independent and Identically Distributed (i.i.d.) is making the assumption that all samples
stem from the same generative process and that the generative process is assumed to have no memory of past generated
samples.
The following cross-validators can be used in such cases.
NOTE
While i.i.d. data is a common assumption in machine learning theory, it rarely holds in practice. If one knows that
the samples have been generated using a time-dependent process, it is safer to use a time-series aware cross-validation
scheme. Similarly, if we know that the generative process has a group structure (samples collected from different
subjects, experiments, measurement devices), it is safer to use group-wise cross-validation.
K-fold
KFold divides all the samples in 𝑘 groups of samples, called folds (if 𝑘 = 𝑛, this is equivalent to the Leave One Out


strategy), of equal sizes (if possible). The prediction function is learned using 𝑘 − 1 folds, and the fold left out is used
for test.
Example of 2-fold cross-validation on a dataset with 4 samples:
>>> import numpy as np
>>> from sklearn.model_selection import KFold
>>> X = ["a", "b", "c", "d"]
>>> kf = KFold(n_splits=2)
>>> for train, test in kf.split(X):
...     print("%s %s" % (train, test))
[2 3] [0 1]
[0 1] [2 3]

Each fold is constituted by two arrays: the first one is related to the training set, and the second one to the test set.
Thus, one can create the training/test sets using numpy indexing:
>>> X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
>>> y = np.array([0, 1, 0, 1])
>>> X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]

Repeated K-Fold
RepeatedKFold repeats K-Fold n times. It can be used when one needs to run KFold n times, producing
different splits in each repetition.
Example of 2-fold K-Fold repeated 2 times:
>>> import numpy as np
>>> from sklearn.model_selection import RepeatedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> random_state = 12883823
>>> rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state)
>>> for train, test in rkf.split(X):
...     print("%s %s" % (train, test))
...
[2 3] [0 1]
[0 1] [2 3]
[0 2] [1 3]
[1 3] [0 2]

Similarly, RepeatedStratifiedKFold repeats Stratified K-Fold n times with different randomization in each
repetition.
Leave One Out (LOO)
LeaveOneOut (or LOO) is a simple cross-validation. Each learning set is created by taking all the samples except
one, the test set being the sample left out. Thus, for 𝑛 samples, we have 𝑛 different training sets and 𝑛 different test
sets. This cross-validation procedure does not waste much data as only one sample is removed from the training set:
>>> from sklearn.model_selection import LeaveOneOut
>>> X = [1, 2, 3, 4]
>>> loo = LeaveOneOut()

>>> for train, test in loo.split(X):
...     print("%s %s" % (train, test))
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]

Potential users of LOO for model selection should weigh a few known caveats. When compared with 𝑘-fold cross
validation, one builds 𝑛 models from 𝑛 samples instead of 𝑘 models, where 𝑛 > 𝑘. Moreover, each is trained on 𝑛 − 1
samples rather than (𝑘 − 1)𝑛/𝑘. In both ways, assuming 𝑘 is not too large and 𝑘 < 𝑛, LOO is more computationally
expensive than 𝑘-fold cross validation.
In terms of accuracy, LOO often results in high variance as an estimator for the test error. Intuitively, since 𝑛 − 1 of
the 𝑛 samples are used to build each model, models constructed from folds are virtually identical to each other and to
the model built from the entire training set.
However, if the learning curve is steep for the training size in question, then 5- or 10- fold cross validation can
overestimate the generalization error.
As a general rule, most authors, and empirical evidence, suggest that 5- or 10- fold cross validation should be preferred
to LOO.
References:
• http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html;
• T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer 2009
• L. Breiman, P. Spector Submodel selection and evaluation in regression: The X-random case, International
Statistical Review 1992;
• R. Kohavi, A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, Intl.
Jnt. Conf. AI
• R. Bharat Rao, G. Fung, R. Rosales, On the Dangers of Cross-Validation. An Experimental Evaluation, SIAM
2008;
• G. James, D. Witten, T. Hastie, R Tibshirani, An Introduction to Statistical Learning, Springer 2013.

Leave P Out (LPO)
LeavePOut is very similar to LeaveOneOut as it creates all the possible training/test sets by removing 𝑝 samples
from the complete set. For 𝑛 samples, this produces \binom{n}{p} train-test pairs. Unlike LeaveOneOut and KFold, the test
sets will overlap for 𝑝 > 1.
Example of Leave-2-Out on a dataset with 4 samples:
>>> from sklearn.model_selection import LeavePOut
>>> X = np.ones(4)
>>> lpo = LeavePOut(p=2)
>>> for train, test in lpo.split(X):
...     print("%s %s" % (train, test))
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]

Random permutations cross-validation a.k.a. Shuffle & Split
The ShuffleSplit iterator will generate a user defined number of independent train / test dataset splits. Samples
are first shuffled and then split into a pair of train and test sets.
It is possible to control the randomness for reproducibility of the results by explicitly seeding the random_state
pseudo random number generator.
Here is a usage example:
>>> from sklearn.model_selection import ShuffleSplit
>>> X = np.arange(5)
>>> ss = ShuffleSplit(n_splits=3, test_size=0.25,
...                   random_state=0)
>>> for train_index, test_index in ss.split(X):
...     print("%s %s" % (train_index, test_index))
...
[1 3 4] [2 0]
[1 4 3] [0 2]
[4 0 2] [1 3]

ShuffleSplit is thus a good alternative to KFold cross validation that allows a finer control on the number of
iterations and the proportion of samples on each side of the train / test split.
Cross-validation iterators with stratification based on class labels.
Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance there
could be several times more negative samples than positive samples. In such cases it is recommended to use stratified
sampling as implemented in StratifiedKFold and StratifiedShuffleSplit to ensure that relative class
frequencies are approximately preserved in each train and validation fold.
Stratified k-fold
StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same
percentage of samples of each target class as the complete set.
Example of stratified 3-fold cross-validation on a dataset with 10 samples from two slightly unbalanced classes:
>>> from sklearn.model_selection import StratifiedKFold
>>> X = np.ones(10)
>>> y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
>>> skf = StratifiedKFold(n_splits=3)
>>> for train, test in skf.split(X, y):
...     print("%s %s" % (train, test))
[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]


RepeatedStratifiedKFold can be used to repeat Stratified K-Fold n times with different randomization in each
repetition.
Stratified Shuffle Split
StratifiedShuffleSplit is a variation of ShuffleSplit which returns stratified splits, i.e. it creates splits
by preserving the same percentage for each target class as in the complete set.
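A short sketch, analogous to the StratifiedKFold example above and using the same toy labels (the test_size value is an arbitrary choice):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.ones(10)
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
for train, test in sss.split(X, y):
    # Each test fold keeps roughly the same class ratio as y.
    print("%s %s" % (train, test))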
Cross-validation iterators for grouped data.
The i.i.d. assumption is broken if the underlying generative process yields groups of dependent samples.
Such a grouping of data is domain specific. An example would be medical data collected from multiple
patients, with multiple samples taken from each patient. Such data is likely to be dependent on the individual
group. In our example, the patient id for each sample will be its group identifier.
In this case we would like to know if a model trained on a particular set of groups generalizes well to the unseen
groups. To measure this, we need to ensure that all the samples in the validation fold come from groups that are not
represented at all in the paired training fold.
The following cross-validation splitters can be used to do that. The grouping identifier for the samples is specified via
the groups parameter.
Group k-fold
GroupKFold is a variation of k-fold which ensures that the same group is not represented in both testing and training
sets. For example if the data is obtained from different subjects with several samples per-subject and if the model is
flexible enough to learn from highly person specific features it could fail to generalize to new subjects. GroupKFold
makes it possible to detect this kind of overfitting situations.
Imagine you have three subjects, each with an associated number from 1 to 3:
>>> from sklearn.model_selection import GroupKFold
>>> X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
>>> y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
>>> groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
>>> gkf = GroupKFold(n_splits=3)
>>> for train, test in gkf.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]

Each subject is in a different testing fold, and the same subject is never in both testing and training. Notice that the
folds do not have exactly the same size due to the imbalance in the data.
Leave One Group Out
LeaveOneGroupOut is a cross-validation scheme which holds out the samples according to a third-party provided
array of integer groups. This group information can be used to encode arbitrary domain specific pre-defined cross-validation folds.


Each training set is thus constituted by all the samples except the ones related to a specific group.
For example, in the cases of multiple experiments, LeaveOneGroupOut can be used to create a cross-validation
based on the different experiments: we create a training set using the samples of all the experiments except one:
>>> from sklearn.model_selection import LeaveOneGroupOut
>>> X = [1, 5, 10, 50, 60, 70, 80]
>>> y = [0, 1, 1, 2, 2, 2, 2]
>>> groups = [1, 1, 2, 2, 3, 3, 3]
>>> logo = LeaveOneGroupOut()
>>> for train, test in logo.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[2 3 4 5 6] [0 1]
[0 1 4 5 6] [2 3]
[0 1 2 3] [4 5 6]

Another common application is to use time information: for instance the groups could be the year of collection of the
samples and thus allow for cross-validation against time-based splits.
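For instance, using hypothetical collection years as the groups array might look like this (values are purely illustrative):

from sklearn.model_selection import LeaveOneGroupOut

X = [10, 20, 30, 40, 50, 60]
y = [0, 1, 0, 1, 0, 1]
# Hypothetical year of collection for each sample.
years = [2014, 2014, 2015, 2015, 2016, 2016]

logo = LeaveOneGroupOut()
for train, test in logo.split(X, y, groups=years):
    # Each iteration holds out all samples from one year.
    print("%s %s" % (train, test))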
Leave P Groups Out
LeavePGroupsOut is similar to LeaveOneGroupOut, but removes samples related to 𝑃 groups for each training/test set.
Example of Leave-2-Group Out:
>>> from sklearn.model_selection import LeavePGroupsOut
>>> X = np.arange(6)
>>> y = [1, 1, 1, 2, 2, 2]
>>> groups = [1, 1, 2, 2, 3, 3]
>>> lpgo = LeavePGroupsOut(n_groups=2)
>>> for train, test in lpgo.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[4 5] [0 1 2 3]
[2 3] [0 1 4 5]
[0 1] [2 3 4 5]

Group Shuffle Split
The GroupShuffleSplit iterator behaves as a combination of ShuffleSplit and LeavePGroupsOut, and
generates a sequence of randomized partitions in which a subset of groups are held out for each split.
Here is a usage example:
>>> from sklearn.model_selection import GroupShuffleSplit
>>> X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 0.001]
>>> y = ["a", "b", "b", "b", "c", "c", "c", "a"]
>>> groups = [1, 1, 2, 2, 3, 3, 4, 4]
>>> gss = GroupShuffleSplit(n_splits=4, test_size=0.5, random_state=0)
>>> for train, test in gss.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
...
[0 1 2 3] [4 5 6 7]
[2 3 6 7] [0 1 4 5]
[2 3 4 5] [0 1 6 7]
[4 5 6 7] [0 1 2 3]

This class is useful when the behavior of LeavePGroupsOut is desired, but the number of groups is large enough
that generating all possible partitions with 𝑃 groups withheld would be prohibitively expensive. In such a scenario, GroupShuffleSplit provides a random sample (with replacement) of the train / test splits generated by
LeavePGroupsOut.
Predefined Fold-Splits / Validation-Sets
For some datasets, a pre-defined split of the data into training- and validation fold or into several cross-validation folds
already exists. Using PredefinedSplit it is possible to use these folds e.g. when searching for hyperparameters.
For example, when using a validation set, set the test_fold to 0 for all samples that are part of the validation set,
and to -1 for all other samples.
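A minimal sketch with four samples, the last two forming the predefined validation set (the data values are illustrative):

import numpy as np
from sklearn.model_selection import PredefinedSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
# -1 marks samples that are never used for testing; 0 marks the validation fold.
test_fold = [-1, -1, 0, 0]

ps = PredefinedSplit(test_fold)
for train, test in ps.split():
    print("%s %s" % (train, test))   # [0 1] [2 3]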
Cross validation of time series data
Time series data is characterised by the correlation between observations that are near in time (autocorrelation). However, classical cross-validation techniques such as KFold and ShuffleSplit assume the samples are independent
and identically distributed, and would result in unreasonable correlation between training and testing instances (yielding poor estimates of generalisation error) on time series data. Therefore, it is very important to evaluate our model
for time series data on the “future” observations least like those that are used to train the model. To achieve this, one
solution is provided by TimeSeriesSplit.
Time Series Split
TimeSeriesSplit is a variation of k-fold which returns the first 𝑘 folds as train set and the (𝑘 + 1)-th fold as test set.
Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before
them. Also, it adds all surplus data to the first training partition, which is always used to train the model.
This class can be used to cross-validate time series data samples that are observed at fixed time intervals.
Example of 3-split time series cross-validation on a dataset with 6 samples:
>>> from sklearn.model_selection import TimeSeriesSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> tscv = TimeSeriesSplit(n_splits=3)
>>> print(tscv)
TimeSeriesSplit(max_train_size=None, n_splits=3)
>>> for train, test in tscv.split(X):
...     print("%s %s" % (train, test))
[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]


A note on shuffling
If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may
be essential to get a meaningful cross-validation result. However, the opposite may be true if the samples are not
independently and identically distributed. For example, if samples correspond to news articles, and are ordered by
their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation
score: it will be tested on samples that are artificially similar (close in time) to training samples.
Some cross validation iterators, such as KFold, have an inbuilt option to shuffle the data indices before splitting them.
Note that:
• This consumes less memory than shuffling the data directly.
• By default no shuffling occurs, including for the (stratified) K fold cross-validation performed by specifying
cv=some_integer to cross_val_score, grid search, etc. Keep in mind that train_test_split
still returns a random split.
• The random_state parameter defaults to None, meaning that the shuffling will be different every time
KFold(..., shuffle=True) is iterated. However, GridSearchCV will use the same shuffling for
each set of parameters validated by a single call to its fit method.
• To get identical results for each split, set random_state to an integer.
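For example, shuffling with a fixed random_state gives reproducible folds (a minimal sketch reusing the toy data from the KFold example above):

from sklearn.model_selection import KFold

X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2, shuffle=True, random_state=0)
for train, test in kf.split(X):
    # The same indices are produced on every run because random_state is fixed.
    print("%s %s" % (train, test))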
Cross validation and model selection
Cross validation iterators can also be used to directly perform model selection using Grid Search for the optimal
hyperparameters of the model. This is the topic of the next section: Tuning the hyper-parameters of an estimator.

3.3.2 Tuning the hyper-parameters of an estimator
Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as
arguments to the constructor of the estimator classes. Typical examples include C, kernel and gamma for Support
Vector Classifier, alpha for Lasso, etc.
It is possible and recommended to search the hyper-parameter space for the best cross validation score.
Any parameter provided when constructing an estimator may be optimized in this manner. Specifically, to find the
names and current values for all parameters for a given estimator, use:
estimator.get_params()

A search consists of:
• an estimator (regressor or classifier such as sklearn.svm.SVC());
• a parameter space;
• a method for searching or sampling candidates;
• a cross-validation scheme; and
• a score function.
Some models allow for specialized, efficient parameter search strategies, outlined below. Two generic approaches to
sampling search candidates are provided in scikit-learn: for given values, GridSearchCV exhaustively considers all
parameter combinations, while RandomizedSearchCV can sample a given number of candidates from a parameter
space with a specified distribution. After describing these tools we detail best practice applicable to both approaches.
Note that it is common that a small subset of those parameters can have a large impact on the predictive or computation
performance of the model while others can be left to their default values. It is recommended to read the docstring of
the estimator class to get a finer understanding of their expected behavior, possibly by reading the enclosed reference
to the literature.
Exhaustive Grid Search
The grid search provided by GridSearchCV exhaustively generates candidates from a grid of parameter values
specified with the param_grid parameter. For instance, the following param_grid:
param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

specifies that two grids should be explored: one with a linear kernel and C values in [1, 10, 100, 1000], and the second
one with an RBF kernel, and the cross-product of C values ranging in [1, 10, 100, 1000] and gamma values in [0.001,
0.0001].
The GridSearchCV instance implements the usual estimator API: when “fitting” it on a dataset all the possible
combinations of parameter values are evaluated and the best combination is retained.
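For instance, a minimal sketch of such a search on the iris data, reusing the param_grid above (the cv value is an arbitrary choice here):

from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(iris.data, iris.target)
print(search.best_params_)     # the retained parameter combination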
Examples:
• See Parameter estimation using grid search with cross-validation for an example of Grid Search computation
on the digits dataset.
• See Sample pipeline for text feature extraction and evaluation for an example of Grid Search coupling parameters from a text documents feature extractor (n-gram count vectorizer and TF-IDF transformer) with a
classifier (here a linear SVM trained with SGD with either elastic net or L2 penalty) using a pipeline.Pipeline instance.
• See Nested versus non-nested cross-validation for an example of Grid Search within a cross validation loop
on the iris dataset. This is the best practice for evaluating the performance of a model with grid search.
• See Demonstration of multi-metric evaluation on cross_val_score and GridSearchCV for an example of
GridSearchCV being used to evaluate multiple metrics simultaneously.

Randomized Parameter Optimization
While using a grid of parameter settings is currently the most widely used method for parameter optimization, other
search methods have more favourable properties. RandomizedSearchCV implements a randomized search over
parameters, where each setting is sampled from a distribution over possible parameter values. This has two main
benefits over an exhaustive search:
• A budget can be chosen independent of the number of parameters and possible values.
• Adding parameters that do not influence the performance does not decrease efficiency.
Specifying how parameters should be sampled is done using a dictionary, very similar to specifying parameters for
GridSearchCV . Additionally, a computation budget, being the number of sampled candidates or sampling iterations, is specified using the n_iter parameter. For each parameter, either a distribution over possible values or a list
of discrete choices (which will be sampled uniformly) can be specified:
{'C': scipy.stats.expon(scale=100), 'gamma': scipy.stats.expon(scale=.1),
'kernel': ['rbf'], 'class_weight':['balanced', None]}


This example uses the scipy.stats module, which contains many useful distributions for sampling parameters,
such as expon, gamma, uniform or randint. In principle, any function can be passed that provides a rvs
(random variate sample) method to sample a value. A call to the rvs function should provide independent random
samples from possible parameter values on consecutive calls.
Warning: The distributions in scipy.stats prior to version scipy 0.16 do not allow specifying a
random state. Instead, they use the global numpy random state, that can be seeded via np.random.seed
or set using np.random.set_state. However, beginning with scikit-learn 0.18, the
sklearn.model_selection module sets the random state provided by the user if scipy >= 0.16 is also
available.
For continuous parameters, such as C above, it is important to specify a continuous distribution to take full advantage
of the randomization. This way, increasing n_iter will always lead to a finer search.
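A minimal sketch using the parameter distributions above on the iris data (the n_iter value and the dataset are arbitrary choices here):

import scipy.stats
from sklearn import datasets, svm
from sklearn.model_selection import RandomizedSearchCV

iris = datasets.load_iris()
param_distributions = {
    'C': scipy.stats.expon(scale=100),
    'gamma': scipy.stats.expon(scale=.1),
    'kernel': ['rbf'],
    'class_weight': ['balanced', None],
}

search = RandomizedSearchCV(svm.SVC(), param_distributions,
                            n_iter=20, cv=5, random_state=0)
search.fit(iris.data, iris.target)
print(search.best_params_)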
Examples:
• Comparing randomized search and grid search for hyperparameter estimation compares the usage and efficiency of randomized search and grid search.

References:
• Bergstra, J. and Bengio, Y., Random search for hyper-parameter optimization, The Journal of Machine Learning Research (2012)

Tips for parameter search
Specifying an objective metric
By default, parameter search uses the score function of the estimator to evaluate a parameter setting. These are
the sklearn.metrics.accuracy_score for classification and sklearn.metrics.r2_score for regression. For some applications, other scoring functions are better suited (for example in unbalanced classification, the
accuracy score is often uninformative). An alternative scoring function can be specified via the scoring parameter
to GridSearchCV , RandomizedSearchCV and many of the specialized cross-validation tools described below.
See The scoring parameter: defining model evaluation rules for more details.
Specifying multiple metrics for evaluation
GridSearchCV and RandomizedSearchCV allow specifying multiple metrics for the scoring parameter.
Multimetric scoring can either be specified as a list of strings of predefined score names or a dict mapping the scorer
name to the scorer function and/or the predefined scorer name(s). See Using multiple metric evaluation for more
details.
When specifying multiple metrics, the refit parameter must be set to the metric (string) for which the
best_params_ will be found and used to build the best_estimator_ on the whole dataset. If the search
should not be refit, set refit=False. Leaving refit to the default value None will result in an error when using
multiple metrics.
See Demonstration of multi-metric evaluation on cross_val_score and GridSearchCV for an example usage.


Composite estimators and parameter spaces
Pipeline: chaining estimators describes building composite estimators whose parameter space can be searched with
these tools.
Model selection: development and evaluation
Model selection by evaluating various parameter settings can be seen as a way to use the labeled data to “train” the
parameters of the grid.
When evaluating the resulting model it is important to do it on held-out samples that were not seen during the grid
search process: it is recommended to split the data into a development set (to be fed to the GridSearchCV instance)
and an evaluation set to compute performance metrics.
This can be done by using the train_test_split utility function.
Parallelism
GridSearchCV and RandomizedSearchCV evaluate each parameter setting independently. Computations can
be run in parallel if your OS supports it, by using the keyword n_jobs=-1. See function signature for more details.
Robustness to failure
Some parameter settings may result in a failure to fit one or more folds of the data. By default, this will cause
the entire search to fail, even if some parameter settings could be fully evaluated. Setting error_score=0 (or
=np.NaN) will make the procedure robust to such failure, issuing a warning and setting the score for that fold to 0 (or
NaN), but completing the search.
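For example (a minimal sketch; the grid below does not actually trigger a failure, it only shows where the parameter is set):

from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
param_grid = {'C': [0.1, 1, 10]}

# error_score=0 turns a failed fit on one candidate/fold into a warning
# plus a zero score, instead of aborting the whole search.
search = GridSearchCV(svm.SVC(kernel='linear'), param_grid,
                      cv=5, error_score=0)
search.fit(iris.data, iris.target)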
Alternatives to brute force parameter search
Model specific cross-validation
Some models can fit data for a range of values of some parameter almost as efficiently as fitting the estimator for
a single value of the parameter. This feature can be leveraged to perform a more efficient cross-validation used for
model selection of this parameter.
The most common parameter amenable to this strategy is the parameter encoding the strength of the regularizer. In
this case we say that we compute the regularization path of the estimator.
Here is the list of such models:
linear_model.ElasticNetCV([l1_ratio, eps, ...])
    Elastic Net model with iterative fitting along a regularization path
linear_model.LarsCV([fit_intercept, ...])
    Cross-validated Least Angle Regression model
linear_model.LassoCV([eps, n_alphas, ...])
    Lasso linear model with iterative fitting along a regularization path
linear_model.LassoLarsCV([fit_intercept, ...])
    Cross-validated Lasso, using the LARS algorithm
linear_model.LogisticRegressionCV([Cs, ...])
    Logistic Regression CV (aka logit, MaxEnt) classifier.
linear_model.MultiTaskElasticNetCV([...])
    Multi-task L1/L2 ElasticNet with built-in cross-validation.
linear_model.MultiTaskLassoCV([eps, ...])
    Multi-task L1/L2 Lasso with built-in cross-validation.
linear_model.OrthogonalMatchingPursuitCV([...])
    Cross-validated Orthogonal Matching Pursuit model (OMP)
linear_model.RidgeCV([alphas, ...])
    Ridge regression with built-in cross-validation.
linear_model.RidgeClassifierCV([alphas, ...])
    Ridge classifier with built-in cross-validation.

sklearn.linear_model.ElasticNetCV
class sklearn.linear_model.ElasticNetCV(l1_ratio=0.5, eps=0.001, n_alphas=100, alphas=None,
        fit_intercept=True, normalize=False, precompute='auto', max_iter=1000, tol=0.0001,
        cv=None, copy_X=True, verbose=0, n_jobs=1, positive=False, random_state=None,
        selection='cyclic')
Elastic Net model with iterative fitting along a regularization path
The best model is selected by cross-validation.
Read more in the User Guide.
Parameters l1_ratio : float or array of floats, optional
float between 0 and 1 passed to ElasticNet (scaling between l1 and l2 penalties). For
l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty.
For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2. This parameter
can be a list, in which case the different values are tested by cross-validation and the
one giving the best prediction score is used. Note that a good choice of list of values
for l1_ratio is often to put more values close to 1 (i.e. Lasso) and less close to 0 (i.e.
Ridge), as in [.1, .5, .7, .9, .95, .99, 1]
eps : float, optional
Length of the path. eps=1e-3 means that alpha_min / alpha_max = 1e-3.
n_alphas : int, optional
Number of alphas along the regularization path, used for each l1_ratio.
alphas : numpy array, optional
List of alphas where to compute the models. If None alphas are set automatically
fit_intercept : boolean
whether to calculate the intercept for this model. If set to false, no intercept will be used
in calculations (e.g. data is expected to be already centered).
normalize : boolean, optional, default False
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by
the l2-norm. If you wish to standardize, please use sklearn.preprocessing.StandardScaler
before calling fit on an estimator with normalize=False.
precompute : True | False | ‘auto’ | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto'
let us decide. The Gram matrix can also be passed as argument.
max_iter : int, optional

3.3. Model selection and evaluation

387

scikit-learn user guide, Release 0.19.1

The maximum number of iterations
tol : float, optional
The tolerance for the optimization: if the updates are smaller than tol, the optimization
code checks the dual gap for optimality and continues until it is smaller than tol.
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible inputs for cv are:
• None, to use the default 3-fold cross-validation,
• integer, to specify the number of folds.
• An object to be used as a cross-validation generator.
• An iterable yielding train/test splits.
For integer/None inputs, KFold is used.
Refer User Guide for the various cross-validation strategies that can be used here.
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
verbose : bool or integer
Amount of verbosity.
n_jobs : integer, optional
Number of CPUs to use during the cross validation. If -1, use all the CPUs.
positive : bool, optional
When set to True, forces the coefficients to be positive.
random_state : int, RandomState instance or None, optional, default None
The seed of the pseudo random number generator that selects a random feature to update. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when
selection == ‘random’.
selection : str, default ‘cyclic’
If set to ‘random’, a random coefficient is updated every iteration rather than looping
over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.
Attributes alpha_ : float
The amount of penalization chosen by cross validation
l1_ratio_ : float
The compromise between l1 and l2 penalization chosen by cross validation
coef_ : array, shape (n_features,) | (n_targets, n_features)
Parameter vector (w in the cost function formula),
intercept_ : float | array, shape (n_targets, n_features)
Independent term in the decision function.


mse_path_ : array, shape (n_l1_ratio, n_alpha, n_folds)
Mean square error for the test set on each fold, varying l1_ratio and alpha.
alphas_ : numpy array, shape (n_alphas,) or (n_l1_ratio, n_alphas)
The grid of alphas used for fitting, for each l1_ratio.
n_iter_ : int
number of iterations run by the coordinate descent solver to reach the specified tolerance
for the optimal alpha.
See also:
enet_path, ElasticNet
Notes
For an example, see examples/linear_model/plot_lasso_model_selection.py.
To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a
Fortran-contiguous numpy array.
The parameter l1_ratio corresponds to alpha in the glmnet R package while alpha corresponds to the lambda
parameter in glmnet. More specifically, the optimization objective is:
1 / (2 * n_samples) * ||y - Xw||^2_2
+ alpha * l1_ratio * ||w||_1
+ 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2

If you are interested in controlling the L1 and L2 penalty separately, keep in mind that this is equivalent to:
a * L1 + b * L2

for:
alpha = a + b and l1_ratio = a / (a + b).

Examples
>>> from sklearn.linear_model import ElasticNetCV
>>> from sklearn.datasets import make_regression
>>>
>>> X, y = make_regression(n_features=2, random_state=0)
>>> regr = ElasticNetCV(cv=5, random_state=0)
>>> regr.fit(X, y)
ElasticNetCV(alphas=None, copy_X=True, cv=5, eps=0.001, fit_intercept=True,
l1_ratio=0.5, max_iter=1000, n_alphas=100, n_jobs=1,
normalize=False, positive=False, precompute='auto', random_state=0,
selection='cyclic', tol=0.0001, verbose=0)
>>> print(regr.alpha_)
0.19947279427
>>> print(regr.intercept_)
0.398882965428
>>> print(regr.predict([[0, 0]]))
[ 0.39888297]


Methods

fit(X, y)
    Fit linear model with coordinate descent
get_params([deep])
    Get parameters for this estimator.
path(X, y[, l1_ratio, eps, n_alphas, ...])
    Compute elastic net path with coordinate descent
predict(X)
    Predict using the linear model
score(X, y[, sample_weight])
    Returns the coefficient of determination R^2 of the prediction.
set_params(**params)
    Set the parameters of this estimator.

__init__(l1_ratio=0.5, eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False, precompute='auto', max_iter=1000, tol=0.0001, cv=None, copy_X=True, verbose=0, n_jobs=1, positive=False, random_state=None, selection='cyclic')
fit(X, y)
Fit linear model with coordinate descent
Fit is on grid of alphas and best alpha estimated by cross-validation.
Parameters X : {array-like}, shape (n_samples, n_features)
Training data. Pass directly as Fortran-contiguous data to avoid unnecessary memory
duplication. If y is mono-output, X can be sparse.
y : array-like, shape (n_samples,) or (n_samples, n_targets)
Target values
get_params(deep=True)
Get parameters for this estimator.
Parameters deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
Returns params : mapping of string to any
Parameter names mapped to their values.
static path(X, y, l1_ratio=0.5, eps=0.001, n_alphas=100, alphas=None, precompute='auto', Xy=None, copy_X=True, coef_init=None, verbose=False, return_n_iter=False, positive=False, check_input=True, **params)
Compute elastic net path with coordinate descent
The elastic net optimization function varies for mono and multi-outputs.
For mono-output tasks it is:
1 / (2 * n_samples) * ||y - Xw||^2_2
+ alpha * l1_ratio * ||w||_1
+ 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2

For multi-output tasks it is:
(1 / (2 * n_samples)) * ||Y - XW||^Fro_2
+ alpha * l1_ratio * ||W||_21
+ 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2

Where:


||W||_21 = \sum_i \sqrt{\sum_j w_{ij}^2}

i.e. the sum of norm of each row.
Read more in the User Guide.
Parameters X : {array-like}, shape (n_samples, n_features)
Training data. Pass directly as Fortran-contiguous data to avoid unnecessary memory
duplication. If y is mono-output then X can be sparse.
y : ndarray, shape (n_samples,) or (n_samples, n_outputs)
Target values
l1_ratio : float, optional
float between 0 and 1 passed to elastic net (scaling between l1 and l2 penalties).
l1_ratio=1 corresponds to the Lasso
eps : float
Length of the path. eps=1e-3 means that alpha_min / alpha_max = 1e-3
n_alphas : int, optional
Number of alphas along the regularization path
alphas : ndarray, optional
List of alphas where to compute the models. If None alphas are set automatically
precompute : True | False | ‘auto’ | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto'
let us decide. The Gram matrix can also be passed as argument.
Xy : array-like, optional
Xy = np.dot(X.T, y) that can be precomputed. It is useful only when the Gram matrix is
precomputed.
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
coef_init : array, shape (n_features, ) | None
The initial values of the coefficients.
verbose : bool or integer
Amount of verbosity.
return_n_iter : bool
whether to return the number of iterations or not.
positive : bool, default False
If set to True, forces coefficients to be positive. (Only allowed when y.ndim == 1).
check_input : bool, default True
Skip input validation checks, including the Gram matrix when provided, assuming they
are handled by the caller when check_input=False.
**params : kwargs


keyword arguments passed to the coordinate descent solver.
Returns alphas : array, shape (n_alphas,)
The alphas along the path where models are computed.
coefs : array, shape (n_features, n_alphas) or (n_outputs, n_features, n_alphas)
Coefficients along the path.
dual_gaps : array, shape (n_alphas,)
The dual gaps at the end of the optimization for each alpha.
n_iters : array-like, shape (n_alphas,)
The number of iterations taken by the coordinate descent optimizer to reach the specified
tolerance for each alpha. (Is returned when return_n_iter is set to True).
See also:
MultiTaskElasticNet, MultiTaskElasticNetCV , ElasticNet, ElasticNetCV
Notes
For an example, see examples/linear_model/plot_lasso_coordinate_descent_path.py.
predict(X)
Predict using the linear model
Parameters X : {array-like, sparse matrix}, shape = (n_samples, n_features)
Samples.
Returns C : array, shape = (n_samples,)
Returns predicted values.
score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True values for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns score : float
R^2 of self.predict(X) wrt. y.
set_params(**params)
Set the parameters of this estimator.


The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Returns self :
sklearn.linear_model.LarsCV
class sklearn.linear_model.LarsCV(fit_intercept=True, verbose=False, max_iter=500, normalize=True,
        precompute='auto', cv=None, max_n_alphas=1000, n_jobs=1, eps=2.2204460492503131e-16,
        copy_X=True, positive=False)
Cross-validated Least Angle Regression model
Read more in the User Guide.
Parameters fit_intercept : boolean
whether to calculate the intercept for this model. If set to false, no intercept will be used
in calculations (e.g. data is expected to be already centered).
verbose : boolean or integer, optional
Sets the verbosity amount
max_iter : integer, optional
Maximum number of iterations to perform.
normalize : boolean, optional, default True
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by
the l2-norm. If you wish to standardize, please use sklearn.preprocessing.StandardScaler
before calling fit on an estimator with normalize=False.
precompute : True | False | ‘auto’ | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto'
let us decide. The Gram matrix cannot be passed as argument since we will use only
subsets of X.
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible inputs for cv are:
• None, to use the default 3-fold cross-validation,
• integer, to specify the number of folds.
• An object to be used as a cross-validation generator.
• An iterable yielding train/test splits.
For integer/None inputs, KFold is used.
Refer User Guide for the various cross-validation strategies that can be used here.
max_n_alphas : integer, optional
The maximum number of points on the path used to compute the residuals in the cross-validation.
n_jobs : integer, optional


Number of CPUs to use during the cross validation. If -1, use all the CPUs
eps : float, optional
The machine-precision regularization in the computation of the Cholesky diagonal factors. Increase this for very ill-conditioned systems.
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
positive : boolean (default=False)
Restrict coefficients to be >= 0. Be aware that you might want to remove fit_intercept
which is set True by default.
Attributes coef_ : array, shape (n_features,)
parameter vector (w in the formulation formula)
intercept_ : float
independent term in decision function
coef_path_ : array, shape (n_features, n_alphas)
the varying values of the coefficients along the path
alpha_ : float
the estimated regularization parameter alpha
alphas_ : array, shape (n_alphas,)
the different values of alpha along the path
cv_alphas_ : array, shape (n_cv_alphas,)
all the values of alpha along the path for the different folds
mse_path_ : array, shape (n_folds, n_cv_alphas)
the mean square error on left-out for each fold along the path (alpha values given by
cv_alphas)
n_iter_ : array-like or int
the number of iterations run by Lars with the optimal alpha.
See also:
lars_path, LassoLars, LassoLarsCV
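A minimal usage sketch; the synthetic dataset and the chosen sizes, noise level and number of folds are illustrative assumptions.

from sklearn.datasets import make_regression
from sklearn.linear_model import LarsCV

# synthetic regression problem (sizes are arbitrary)
X, y = make_regression(n_samples=200, n_features=30, noise=4.0, random_state=0)

reg = LarsCV(cv=5).fit(X, y)
print(reg.alpha_)         # regularization parameter selected by cross-validation
print(reg.coef_.shape)    # one coefficient per feature: (30,)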
Methods

fit(X, y)                         Fit the model using X, y as training data.
get_params([deep])                Get parameters for this estimator.
predict(X)                        Predict using the linear model.
score(X, y[, sample_weight])      Returns the coefficient of determination R^2 of the prediction.
set_params(**params)              Set the parameters of this estimator.

__init__(fit_intercept=True, verbose=False, max_iter=500, normalize=True, precompute=’auto’,
cv=None, max_n_alphas=1000, n_jobs=1, eps=2.2204460492503131e-16, copy_X=True,
positive=False)
alpha
DEPRECATED: Attribute alpha is deprecated in 0.19 and will be removed in 0.21. See alpha_ instead
cv_mse_path_
DEPRECATED: Attribute cv_mse_path_ is deprecated in 0.18 and will be removed in 0.20. Use
mse_path_ instead
fit(X, y)
Fit the model using X, y as training data.
Parameters X : array-like, shape (n_samples, n_features)
Training data.
y : array-like, shape (n_samples,)
Target values.
Returns self : object
returns an instance of self.
get_params(deep=True)
Get parameters for this estimator.
Parameters deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
Returns params : mapping of string to any
Parameter names mapped to their values.
predict(X)
Predict using the linear model
Parameters X : {array-like, sparse matrix}, shape = (n_samples, n_features)
Samples.
Returns C : array, shape = (n_samples,)
Returns predicted values.
score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True values for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.

Returns score : float
R^2 of self.predict(X) wrt. y.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Returns self :
sklearn.linear_model.LassoCV
class sklearn.linear_model.LassoCV(eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False, precompute='auto', max_iter=1000, tol=0.0001, copy_X=True, cv=None, verbose=False, n_jobs=1, positive=False, random_state=None, selection='cyclic')
Lasso linear model with iterative fitting along a regularization path
The best model is selected by cross-validation.
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1

Read more in the User Guide.
Parameters eps : float, optional
Length of the path. eps=1e-3 means that alpha_min / alpha_max = 1e-3.
n_alphas : int, optional
Number of alphas along the regularization path
alphas : numpy array, optional
List of alphas where to compute the models. If None alphas are set automatically
fit_intercept : boolean, default True
whether to calculate the intercept for this model. If set to false, no intercept will be used
in calculations (e.g. data is expected to be already centered).
normalize : boolean, optional, default False
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by
the l2-norm. If you wish to standardize, please use sklearn.preprocessing.
StandardScaler before calling fit on an estimator with normalize=False.
precompute : True | False | ‘auto’ | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto'
let us decide. The Gram matrix can also be passed as argument.
max_iter : int, optional
The maximum number of iterations
tol : float, optional

The tolerance for the optimization: if the updates are smaller than tol, the optimization
code checks the dual gap for optimality and continues until it is smaller than tol.
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible inputs for cv are:
• None, to use the default 3-fold cross-validation,
• integer, to specify the number of folds.
• An object to be used as a cross-validation generator.
• An iterable yielding train/test splits.
For integer/None inputs, KFold is used.
Refer User Guide for the various cross-validation strategies that can be used here.
verbose : bool or integer
Amount of verbosity.
n_jobs : integer, optional
Number of CPUs to use during the cross validation. If -1, use all the CPUs.
positive : bool, optional
If positive, restrict regression coefficients to be positive
random_state : int, RandomState instance or None, optional, default None
The seed of the pseudo random number generator that selects a random feature to update. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when
selection == ‘random’.
selection : str, default ‘cyclic’
If set to ‘random’, a random coefficient is updated every iteration rather than looping
over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.
Attributes alpha_ : float
The amount of penalization chosen by cross validation
coef_ : array, shape (n_features,) | (n_targets, n_features)
parameter vector (w in the cost function formula)
intercept_ : float | array, shape (n_targets,)
independent term in decision function.
mse_path_ : array, shape (n_alphas, n_folds)
mean square error for the test set on each fold, varying alpha
alphas_ : numpy array, shape (n_alphas,)
The grid of alphas used for fitting


dual_gap_ : ndarray, shape ()
The dual gap at the end of the optimization for the optimal alpha (alpha_).
n_iter_ : int
number of iterations run by the coordinate descent solver to reach the specified tolerance
for the optimal alpha.
See also:
lars_path, lasso_path, LassoLars, Lasso, LassoLarsCV
Notes
For an example, see examples/linear_model/plot_lasso_model_selection.py.
To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a
Fortran-contiguous numpy array.
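A minimal sketch of selecting alpha by cross-validation; the synthetic data and the parameter values are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=300, n_features=50, noise=2.0, random_state=0)
# pass X as Fortran-contiguous data to avoid the internal copy mentioned above
X = np.asfortranarray(X)

reg = LassoCV(cv=5, random_state=0).fit(X, y)
print(reg.alpha_)             # penalty chosen by cross-validation
print(reg.mse_path_.shape)    # (n_alphas, n_folds), i.e. (100, 5) with the defaults here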
Methods

fit(X, y)                                   Fit linear model with coordinate descent.
get_params([deep])                          Get parameters for this estimator.
path(X, y[, eps, n_alphas, alphas, ...])    Compute Lasso path with coordinate descent.
predict(X)                                  Predict using the linear model.
score(X, y[, sample_weight])                Returns the coefficient of determination R^2 of the prediction.
set_params(**params)                        Set the parameters of this estimator.

__init__(eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False, precompute=’auto’, max_iter=1000, tol=0.0001, copy_X=True, cv=None, verbose=False,
n_jobs=1, positive=False, random_state=None, selection=’cyclic’)
fit(X, y)
Fit linear model with coordinate descent
Fit is on grid of alphas and best alpha estimated by cross-validation.
Parameters X : {array-like}, shape (n_samples, n_features)
Training data. Pass directly as Fortran-contiguous data to avoid unnecessary memory
duplication. If y is mono-output, X can be sparse.
y : array-like, shape (n_samples,) or (n_samples, n_targets)
Target values
get_params(deep=True)
Get parameters for this estimator.
Parameters deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
Returns params : mapping of string to any
Parameter names mapped to their values.


static path(X, y, eps=0.001, n_alphas=100, alphas=None, precompute=’auto’, Xy=None,
copy_X=True, coef_init=None, verbose=False, return_n_iter=False, positive=False,
**params)
Compute Lasso path with coordinate descent
The Lasso optimization function varies for mono and multi-outputs.
For mono-output tasks it is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1

For multi-output tasks it is:
(1 / (2 * n_samples)) * ||Y - XW||^2_Fro + alpha * ||W||_21

Where:
||W||_21 = \sum_i \sqrt{\sum_j w_{ij}^2}

i.e. the sum of norm of each row.
Read more in the User Guide.
Parameters X : {array-like, sparse matrix}, shape (n_samples, n_features)
Training data. Pass directly as Fortran-contiguous data to avoid unnecessary memory
duplication. If y is mono-output then X can be sparse.
y : ndarray, shape (n_samples,), or (n_samples, n_outputs)
Target values
eps : float, optional
Length of the path. eps=1e-3 means that alpha_min / alpha_max = 1e-3
n_alphas : int, optional
Number of alphas along the regularization path
alphas : ndarray, optional
List of alphas where to compute the models. If None alphas are set automatically
precompute : True | False | ‘auto’ | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto'
let us decide. The Gram matrix can also be passed as argument.
Xy : array-like, optional
Xy = np.dot(X.T, y) that can be precomputed. It is useful only when the Gram matrix is
precomputed.
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
coef_init : array, shape (n_features, ) | None
The initial values of the coefficients.
verbose : bool or integer
Amount of verbosity.
return_n_iter : bool

whether to return the number of iterations or not.
positive : bool, default False
If set to True, forces coefficients to be positive. (Only allowed when y.ndim == 1).
**params : kwargs
keyword arguments passed to the coordinate descent solver.
Returns alphas : array, shape (n_alphas,)
The alphas along the path where models are computed.
coefs : array, shape (n_features, n_alphas) or (n_outputs, n_features, n_alphas)
Coefficients along the path.
dual_gaps : array, shape (n_alphas,)
The dual gaps at the end of the optimization for each alpha.
n_iters : array-like, shape (n_alphas,)
The number of iterations taken by the coordinate descent optimizer to reach the specified
tolerance for each alpha.
See also:
lars_path, Lasso, LassoLars, LassoCV , LassoLarsCV , sklearn.decomposition.
sparse_encode
Notes
For an example, see examples/linear_model/plot_lasso_coordinate_descent_path.py.
To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a
Fortran-contiguous numpy array.
Note that in certain cases the Lars solver may be significantly faster at implementing this functionality. In particular, linear interpolation can be used to retrieve model coefficients between the values output by lars_path.
Examples
Comparing lasso_path and lars_path with interpolation:
>>> X = np.array([[1, 2, 3.1], [2.3, 5.4, 4.3]]).T
>>> y = np.array([1, 2, 3.1])
>>> # Use lasso_path to compute a coefficient path
>>> _, coef_path, _ = lasso_path(X, y, alphas=[5., 1., .5])
>>> print(coef_path)
[[ 0.          0.          0.46874778]
 [ 0.2159048   0.4425765   0.23689075]]

>>> # Now use lars_path and 1D linear interpolation to compute the
>>> # same path
>>> from sklearn.linear_model import lars_path
>>> alphas, active, coef_path_lars = lars_path(X, y, method='lasso')
>>> from scipy import interpolate
>>> coef_path_continuous = interpolate.interp1d(alphas[::-1],
...                                             coef_path_lars[:, ::-1])
>>> print(coef_path_continuous([5., 1., .5]))
[[ 0.          0.          0.46915237]
 [ 0.2159048   0.4425765   0.23668876]]

predict(X)
Predict using the linear model
Parameters X : {array-like, sparse matrix}, shape = (n_samples, n_features)
Samples.
Returns C : array, shape = (n_samples,)
Returns predicted values.
score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True values for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns score : float
R^2 of self.predict(X) wrt. y.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Returns self :
Examples using sklearn.linear_model.LassoCV
• Cross-validation on diabetes Dataset Exercise
• Feature selection using SelectFromModel and LassoCV
• Lasso model selection: Cross-Validation / AIC / BIC


sklearn.linear_model.LassoLarsCV
class sklearn.linear_model.LassoLarsCV(fit_intercept=True, verbose=False, max_iter=500, normalize=True, precompute='auto', cv=None, max_n_alphas=1000, n_jobs=1, eps=2.2204460492503131e-16, copy_X=True, positive=False)
Cross-validated Lasso, using the LARS algorithm
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1

Read more in the User Guide.
Parameters fit_intercept : boolean
whether to calculate the intercept for this model. If set to false, no intercept will be used
in calculations (e.g. data is expected to be already centered).
verbose : boolean or integer, optional
Sets the verbosity amount
max_iter : integer, optional
Maximum number of iterations to perform.
normalize : boolean, optional, default True
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by
the l2-norm. If you wish to standardize, please use sklearn.preprocessing.
StandardScaler before calling fit on an estimator with normalize=False.
precompute : True | False | ‘auto’
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto'
let us decide. The Gram matrix cannot be passed as argument since we will use only
subsets of X.
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible inputs for cv are:
• None, to use the default 3-fold cross-validation,
• integer, to specify the number of folds.
• An object to be used as a cross-validation generator.
• An iterable yielding train/test splits.
For integer/None inputs, KFold is used.
Refer User Guide for the various cross-validation strategies that can be used here.
max_n_alphas : integer, optional
The maximum number of points on the path used to compute the residuals in the cross-validation.
n_jobs : integer, optional
Number of CPUs to use during the cross validation. If -1, use all the CPUs


eps : float, optional
The machine-precision regularization in the computation of the Cholesky diagonal factors. Increase this for very ill-conditioned systems.
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
positive : boolean (default=False)
Restrict coefficients to be >= 0. Be aware that you might want to remove fit_intercept
which is set True by default. Under the positive restriction the model coefficients do
not converge to the ordinary-least-squares solution for small values of alpha. Only coefficients up to the smallest alpha value (alphas_[alphas_ > 0.].min() when
fit_path=True) reached by the stepwise Lars-Lasso algorithm are typically in congruence with the solution of the coordinate descent Lasso estimator. As a consequence
using LassoLarsCV only makes sense for problems where a sparse solution is expected
and/or reached.
Attributes coef_ : array, shape (n_features,)
parameter vector (w in the formulation formula)
intercept_ : float
independent term in decision function.
coef_path_ : array, shape (n_features, n_alphas)
the varying values of the coefficients along the path
alpha_ : float
the estimated regularization parameter alpha
alphas_ : array, shape (n_alphas,)
the different values of alpha along the path
cv_alphas_ : array, shape (n_cv_alphas,)
all the values of alpha along the path for the different folds
mse_path_ : array, shape (n_folds, n_cv_alphas)
the mean square error on left-out for each fold along the path (alpha values given by
cv_alphas)
n_iter_ : array-like or int
the number of iterations run by Lars with the optimal alpha.
See also:
lars_path, LassoLars, LarsCV , LassoCV
Notes
The object solves the same problem as the LassoCV object. However, unlike the LassoCV, it finds the relevant alpha values by itself. In general, because of this property, it will be more stable. However, it is more fragile to heavily multicollinear datasets.
It is more efficient than the LassoCV if only a small number of features are selected compared to the total number, for instance if there are very few samples compared to the number of features.
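A minimal sketch contrasting LassoLarsCV with LassoCV on the same data; the synthetic, deliberately sparse problem below is an illustrative assumption.

from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LassoLarsCV

# sparse ground truth: many features, only a few of them informative
X, y = make_regression(n_samples=100, n_features=500, n_informative=5,
                       noise=1.0, random_state=0)

lars_cv = LassoLarsCV(cv=5).fit(X, y)
coord_cv = LassoCV(cv=5, random_state=0).fit(X, y)
# the two selected penalties are usually close but need not be identical
print(lars_cv.alpha_, coord_cv.alpha_)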


Methods

fit(X, y)                         Fit the model using X, y as training data.
get_params([deep])                Get parameters for this estimator.
predict(X)                        Predict using the linear model.
score(X, y[, sample_weight])      Returns the coefficient of determination R^2 of the prediction.
set_params(**params)              Set the parameters of this estimator.

__init__(fit_intercept=True, verbose=False, max_iter=500, normalize=True, precompute=’auto’,
cv=None, max_n_alphas=1000, n_jobs=1, eps=2.2204460492503131e-16, copy_X=True,
positive=False)
alpha
DEPRECATED: Attribute alpha is deprecated in 0.19 and will be removed in 0.21. See alpha_ instead
cv_mse_path_
DEPRECATED: Attribute cv_mse_path_ is deprecated in 0.18 and will be removed in 0.20. Use
mse_path_ instead
fit(X, y)
Fit the model using X, y as training data.
Parameters X : array-like, shape (n_samples, n_features)
Training data.
y : array-like, shape (n_samples,)
Target values.
Returns self : object
returns an instance of self.
get_params(deep=True)
Get parameters for this estimator.
Parameters deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
Returns params : mapping of string to any
Parameter names mapped to their values.
predict(X)
Predict using the linear model
Parameters X : {array-like, sparse matrix}, shape = (n_samples, n_features)
Samples.
Returns C : array, shape = (n_samples,)
Returns predicted values.
score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score


is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True values for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns score : float
R^2 of self.predict(X) wrt. y.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Returns self :
Examples using sklearn.linear_model.LassoLarsCV
• Lasso model selection: Cross-Validation / AIC / BIC
sklearn.linear_model.LogisticRegressionCV
class sklearn.linear_model.LogisticRegressionCV(Cs=10, fit_intercept=True, cv=None, dual=False, penalty='l2', scoring=None, solver='lbfgs', tol=0.0001, max_iter=100, class_weight=None, n_jobs=1, verbose=0, refit=True, intercept_scaling=1.0, multi_class='ovr', random_state=None)
Logistic Regression CV (aka logit, MaxEnt) classifier.
This class implements logistic regression using the liblinear, newton-cg, sag or lbfgs optimizers. The newton-cg, sag and lbfgs solvers support only L2 regularization with primal formulation. The liblinear solver supports both L1 and L2 regularization, with a dual formulation only for the L2 penalty.
For the grid of Cs values (set by default to ten values on a logarithmic scale between 1e-4 and 1e4), the best hyperparameter is selected by the cross-validator StratifiedKFold, but it can be changed using the cv parameter. In the case of the newton-cg and lbfgs solvers, we warm start along the path, i.e. the initial coefficients of the present fit are guessed to be the coefficients obtained after convergence in the previous fit, so it is expected to be faster for high-dimensional dense data.
For a multiclass problem, the hyperparameters for each class are computed using the best scores obtained by doing a one-vs-rest in parallel across all folds and classes. Hence this is not the true multinomial loss.
Read more in the User Guide.
Parameters Cs : list of floats | int


Each of the values in Cs describes the inverse of regularization strength. If Cs is an int, then a grid of Cs values is chosen on a logarithmic scale between 1e-4 and 1e4.
Like in support vector machines, smaller values specify stronger regularization.
fit_intercept : bool, default: True
Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.
cv : integer or cross-validation generator
The default cross-validation generator used is Stratified K-Folds. If an integer
is provided, then it is the number of folds used. See the module sklearn.
model_selection module for the list of possible cross-validation objects.
dual : bool
Dual or primal formulation. Dual formulation is only implemented for l2 penalty with
liblinear solver. Prefer dual=False when n_samples > n_features.
penalty : str, ‘l1’ or ‘l2’
Used to specify the norm used in the penalization. The ‘newton-cg’, ‘sag’ and ‘lbfgs’
solvers support only l2 penalties.
scoring : string, callable, or None
A string (see model evaluation documentation) or a scorer callable object / function with
signature scorer(estimator, X, y). For a list of scoring functions that can be
used, look at sklearn.metrics. The default scoring option used is ‘accuracy’.
solver : {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default: 'lbfgs'
Algorithm to use in the optimization problem.
• For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones.
• For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle multinomial loss; 'liblinear' is limited to one-versus-rest schemes.
• 'newton-cg', 'lbfgs' and 'sag' only handle L2 penalty, whereas 'liblinear' and 'saga' handle L1 penalty.
• 'liblinear' might be slower in LogisticRegressionCV because it does not handle warm-starting.
Note that 'sag' and 'saga' fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.
New in version 0.17: Stochastic Average Gradient descent solver.
New in version 0.19: SAGA solver.
tol : float, optional
Tolerance for stopping criteria.
max_iter : int, optional
Maximum number of iterations of the optimization algorithm.
class_weight : dict or ‘balanced’, optional
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.
The “balanced” mode uses the values of y to automatically adjust weights inversely
proportional to class frequencies in the input data as n_samples / (n_classes
* np.bincount(y)).
Note that these weights will be multiplied with sample_weight (passed through the fit
method) if sample_weight is specified.
New in version 0.17: class_weight == ‘balanced’
n_jobs : int, optional
Number of CPU cores used during the cross-validation loop. If given a value of -1, all
cores are used.
verbose : int
For the ‘liblinear’, ‘sag’ and ‘lbfgs’ solvers set verbose to any positive number for verbosity.
refit : bool
If set to True, the scores are averaged across all folds, and the coefs and the C that
corresponds to the best score is taken, and a final refit is done using these parameters.
Otherwise the coefs, intercepts and C that correspond to the best scores across folds are
averaged.
intercept_scaling : float, default 1.
Useful only when the solver ‘liblinear’ is used and self.fit_intercept is set to True. In this
case, x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value
equal to intercept_scaling is appended to the instance vector. The intercept becomes
intercept_scaling * synthetic_feature_weight.
Note! the synthetic feature weight is subject to l1/l2 regularization as all other features.
To lessen the effect of regularization on synthetic feature weight (and therefore on the
intercept) intercept_scaling has to be increased.
multi_class : str, {‘ovr’, ‘multinomial’}
Multiclass option can be either ‘ovr’ or ‘multinomial’. If the option chosen is ‘ovr’,
then a binary problem is fit for each label. Else the loss minimised is the multinomial
loss fit across the entire probability distribution. Works only for the ‘newton-cg’, ‘sag’,
‘saga’ and ‘lbfgs’ solver.
New in version 0.18: Stochastic Average Gradient descent solver for ‘multinomial’
case.
random_state : int, RandomState instance or None, optional, default None
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
Attributes coef_ : array, shape (1, n_features) or (n_classes, n_features)
Coefficient of the features in the decision function.
coef_ is of shape (1, n_features) when the given problem is binary.
intercept_ : array, shape (1,) or (n_classes,)
Intercept (a.k.a. bias) added to the decision function.
If fit_intercept is set to False, the intercept is set to zero. intercept_ is of shape(1,) when
the problem is binary.


Cs_ : array
Array of C i.e. inverse of regularization parameter values used for cross-validation.
coefs_paths_ : array, shape (n_folds, len(Cs_), n_features) or (n_folds,
len(Cs_), n_features + 1)
dict with classes as the keys, and the path of coefficients obtained during cross-validating across each fold and then across each Cs after doing an OvR for the corresponding class as values. If the ‘multi_class’ option is set to ‘multinomial’, then
the coefs_paths are the coefficients corresponding to each class. Each dict value has
shape (n_folds, len(Cs_), n_features) or (n_folds, len(Cs_),
n_features + 1) depending on whether the intercept is fit or not.
scores_ : dict
dict with classes as the keys, and the values as the grid of scores obtained during cross-validating each fold, after doing an OvR for the corresponding class. If the ‘multi_class’
option given is ‘multinomial’ then the same scores are repeated across all classes, since
this is the multinomial class. Each dict value has shape (n_folds, len(Cs))
C_ : array, shape (n_classes,) or (n_classes - 1,)
Array of C that maps to the best scores across every class. If refit is set to False, then
for each class, the best C is the average of the C’s that correspond to the best scores for
each fold. C_ is of shape(n_classes,) when the problem is binary.
n_iter_ : array, shape (n_classes, n_folds, n_cs) or (1, n_folds, n_cs)
Actual number of iterations for all classes, folds and Cs. In the binary or multinomial
cases, the first dimension is equal to 1.
See also:
LogisticRegression
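A minimal usage sketch; the iris dataset and the choices of Cs and cv are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegressionCV

X, y = load_iris(return_X_y=True)

clf = LogisticRegressionCV(Cs=10, cv=5, random_state=0).fit(X, y)
print(clf.C_)            # best C per class (one-vs-rest by default)
print(clf.score(X, y))   # mean accuracy on the training data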
Methods

decision_function(X)              Predict confidence scores for samples.
densify()                         Convert coefficient matrix to dense array format.
fit(X, y[, sample_weight])        Fit the model according to the given training data.
get_params([deep])                Get parameters for this estimator.
predict(X)                        Predict class labels for samples in X.
predict_log_proba(X)              Log of probability estimates.
predict_proba(X)                  Probability estimates.
score(X, y[, sample_weight])      Returns the mean accuracy on the given test data and labels.
set_params(**params)              Set the parameters of this estimator.
sparsify()                        Convert coefficient matrix to sparse format.

__init__(Cs=10, fit_intercept=True, cv=None, dual=False, penalty=’l2’, scoring=None,
solver=’lbfgs’, tol=0.0001, max_iter=100, class_weight=None, n_jobs=1, verbose=0,
refit=True, intercept_scaling=1.0, multi_class=’ovr’, random_state=None)
decision_function(X)
Predict confidence scores for samples.
The confidence score for a sample is the signed distance of that sample to the hyperplane.


Parameters X : {array-like, sparse matrix}, shape = (n_samples, n_features)
Samples.
Returns array, shape=(n_samples,) if n_classes == 2 else (n_samples, n_classes) :
Confidence scores per (sample, class) combination. In the binary case, confidence score
for self.classes_[1] where >0 means this class would be predicted.
densify()
Convert coefficient matrix to dense array format.
Converts the coef_ member (back) to a numpy.ndarray. This is the default format of coef_ and is
required for fitting, so calling this method is only required on models that have previously been sparsified;
otherwise, it is a no-op.
Returns self : estimator
fit(X, y, sample_weight=None)
Fit the model according to the given training data.
Parameters X : {array-like, sparse matrix}, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number
of features.
y : array-like, shape (n_samples,)
Target vector relative to X.
sample_weight : array-like, shape (n_samples,) optional
Array of weights that are assigned to individual samples. If not provided, then each
sample is given unit weight.
Returns self : object
Returns self.
get_params(deep=True)
Get parameters for this estimator.
Parameters deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
Returns params : mapping of string to any
Parameter names mapped to their values.
predict(X)
Predict class labels for samples in X.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Samples.
Returns C : array, shape = [n_samples]
Predicted class label per sample.
predict_log_proba(X)
Log of probability estimates.
The returned estimates for all classes are ordered by the label of classes.
Parameters X : array-like, shape = [n_samples, n_features]


Returns T : array-like, shape = [n_samples, n_classes]
Returns the log-probability of the sample for each class in the model, where classes are
ordered as they are in self.classes_.
predict_proba(X)
Probability estimates.
The returned estimates for all classes are ordered by the label of classes.
For a multi_class problem, if multi_class is set to be “multinomial” the softmax function is used to find
the predicted probability of each class. Else use a one-vs-rest approach, i.e calculate the probability of
each class assuming it to be positive using the logistic function. and normalize these values across all the
classes.
Parameters X : array-like, shape = [n_samples, n_features]
Returns T : array-like, shape = [n_samples, n_classes]
Returns the probability of the sample for each class in the model, where classes are
ordered as they are in self.classes_.
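A short sketch checking that the returned estimates behave as described; the iris data is only an illustrative assumption.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegressionCV

X, y = load_iris(return_X_y=True)
clf = LogisticRegressionCV(cv=3).fit(X, y)

proba = clf.predict_proba(X[:5])
print(proba.shape)                          # (5, n_classes), columns ordered as clf.classes_
print(np.allclose(proba.sum(axis=1), 1.0))  # each row is a normalized probability: True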
score(X, y, sample_weight=None)
Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns score : float
Mean accuracy of self.predict(X) wrt. y.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Returns self :
sparsify()
Convert coefficient matrix to sparse format.
Converts the coef_ member to a scipy.sparse matrix, which for L1-regularized models can be much more
memory- and storage-efficient than the usual numpy.ndarray representation.
The intercept_ member is not converted.
Returns self : estimator


Notes
For non-sparse models, i.e. when there are not many zeros in coef_, this may actually increase memory
usage, so use this method with care. A rule of thumb is that the number of zero elements, which can be
computed with (coef_ == 0).sum(), must be more than 50% for this to provide significant benefits.
After calling this method, further fitting with the partial_fit method (if any) will not work until you call
densify.
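A hedged sketch of the sparsify/densify round trip on an L1-penalized model; the dataset and solver choice are illustrative assumptions.

from scipy import sparse
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegressionCV

X, y = load_iris(return_X_y=True)
clf = LogisticRegressionCV(penalty='l1', solver='liblinear', cv=3).fit(X, y)

clf.sparsify()
print(sparse.issparse(clf.coef_))   # True: coef_ is now a scipy.sparse matrix

clf.densify()
print(sparse.issparse(clf.coef_))   # False: coef_ is back to a dense ndarray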
sklearn.linear_model.MultiTaskElasticNetCV
class sklearn.linear_model.MultiTaskElasticNetCV(l1_ratio=0.5, eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False, max_iter=1000, tol=0.0001, cv=None, copy_X=True, verbose=0, n_jobs=1, random_state=None, selection='cyclic')
Multi-task L1/L2 ElasticNet with built-in cross-validation.
The optimization objective for MultiTaskElasticNet is:
(1 / (2 * n_samples)) * ||Y - XW||^Fro_2
+ alpha * l1_ratio * ||W||_21
+ 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2

Where:
||W||_21 = \sum_i \sqrt{\sum_j w_{ij}^2}

i.e. the sum of norm of each row.
Read more in the User Guide.
Parameters l1_ratio : float or array of floats
The ElasticNet mixing parameter, with 0 < l1_ratio <= 1. For l1_ratio = 1 the penalty is
an L1/L2 penalty. For l1_ratio = 0 it is an L2 penalty. For 0 < l1_ratio < 1, the
penalty is a combination of L1/L2 and L2. This parameter can be a list, in which case
the different values are tested by cross-validation and the one giving the best prediction
score is used. Note that a good choice of list of values for l1_ratio is often to put more
values close to 1 (i.e. Lasso) and less close to 0 (i.e. Ridge), as in [.1, .5, .7,
.9, .95, .99, 1]
eps : float, optional
Length of the path. eps=1e-3 means that alpha_min / alpha_max = 1e-3.
n_alphas : int, optional
Number of alphas along the regularization path
alphas : array-like, optional
List of alphas where to compute the models. If not provided, set automatically.
fit_intercept : boolean
whether to calculate the intercept for this model. If set to false, no intercept will be used
in calculations (e.g. data is expected to be already centered).
normalize : boolean, optional, default False

This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by
the l2-norm. If you wish to standardize, please use sklearn.preprocessing.
StandardScaler before calling fit on an estimator with normalize=False.
max_iter : int, optional
The maximum number of iterations
tol : float, optional
The tolerance for the optimization: if the updates are smaller than tol, the optimization
code checks the dual gap for optimality and continues until it is smaller than tol.
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible inputs for cv are:
• None, to use the default 3-fold cross-validation,
• integer, to specify the number of folds.
• An object to be used as a cross-validation generator.
• An iterable yielding train/test splits.
For integer/None inputs, KFold is used.
Refer User Guide for the various cross-validation strategies that can be used here.
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
verbose : bool or integer
Amount of verbosity.
n_jobs : integer, optional
Number of CPUs to use during the cross validation. If -1, use all the CPUs. Note that
this is used only if multiple values for l1_ratio are given.
random_state : int, RandomState instance or None, optional, default None
The seed of the pseudo random number generator that selects a random feature to update. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when
selection == ‘random’.
selection : str, default ‘cyclic’
If set to ‘random’, a random coefficient is updated every iteration rather than looping
over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.
Attributes intercept_ : array, shape (n_tasks,)
Independent term in decision function.
coef_ : array, shape (n_tasks, n_features)
Parameter vector (W in the cost function formula). Note that coef_ stores the transpose of W, W.T.
alpha_ : float


The amount of penalization chosen by cross validation
mse_path_ : array, shape (n_alphas, n_folds) or (n_l1_ratio, n_alphas, n_folds)
mean square error for the test set on each fold, varying alpha
alphas_ : numpy array, shape (n_alphas,) or (n_l1_ratio, n_alphas)
The grid of alphas used for fitting, for each l1_ratio
l1_ratio_ : float
best l1_ratio obtained by cross-validation.
n_iter_ : int
number of iterations run by the coordinate descent solver to reach the specified tolerance
for the optimal alpha.
See also:
MultiTaskElasticNet, ElasticNetCV , MultiTaskLassoCV
Notes
The algorithm used to fit the model is coordinate descent.
To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a
Fortran-contiguous numpy array.
Examples
>>> from sklearn import linear_model
>>> clf = linear_model.MultiTaskElasticNetCV()
>>> clf.fit([[0,0], [1, 1], [2, 2]],
...         [[0, 0], [1, 1], [2, 2]])
MultiTaskElasticNetCV(alphas=None, copy_X=True, cv=None, eps=0.001,
       fit_intercept=True, l1_ratio=0.5, max_iter=1000, n_alphas=100,
       n_jobs=1, normalize=False, random_state=None, selection='cyclic',
       tol=0.0001, verbose=0)
>>> print(clf.coef_)
[[ 0.52875032  0.46958558]
 [ 0.52875032  0.46958558]]
>>> print(clf.intercept_)
[ 0.00166409  0.00166409]

Methods

fit(X, y)                                      Fit linear model with coordinate descent.
get_params([deep])                             Get parameters for this estimator.
path(X, y[, l1_ratio, eps, n_alphas, ...])     Compute elastic net path with coordinate descent.
predict(X)                                     Predict using the linear model.
score(X, y[, sample_weight])                   Returns the coefficient of determination R^2 of the prediction.
set_params(**params)                           Set the parameters of this estimator.

__init__(l1_ratio=0.5, eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False, max_iter=1000, tol=0.0001, cv=None, copy_X=True, verbose=0, n_jobs=1, random_state=None, selection=’cyclic’)
fit(X, y)
Fit linear model with coordinate descent
Fit is on grid of alphas and best alpha estimated by cross-validation.
Parameters X : {array-like}, shape (n_samples, n_features)
Training data. Pass directly as Fortran-contiguous data to avoid unnecessary memory
duplication. If y is mono-output, X can be sparse.
y : array-like, shape (n_samples,) or (n_samples, n_targets)
Target values
get_params(deep=True)
Get parameters for this estimator.
Parameters deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
Returns params : mapping of string to any
Parameter names mapped to their values.
static path(X, y, l1_ratio=0.5, eps=0.001, n_alphas=100, alphas=None, precompute=’auto’,
Xy=None, copy_X=True, coef_init=None, verbose=False, return_n_iter=False, positive=False, check_input=True, **params)
Compute elastic net path with coordinate descent
The elastic net optimization function varies for mono and multi-outputs.
For mono-output tasks it is:
1 / (2 * n_samples) * ||y - Xw||^2_2
+ alpha * l1_ratio * ||w||_1
+ 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2

For multi-output tasks it is:
(1 / (2 * n_samples)) * ||Y - XW||^Fro_2
+ alpha * l1_ratio * ||W||_21
+ 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2

Where:
||W||_21 = \sum_i \sqrt{\sum_j w_{ij}^2}

i.e. the sum of norm of each row.
Read more in the User Guide.
Parameters X : {array-like}, shape (n_samples, n_features)
Training data. Pass directly as Fortran-contiguous data to avoid unnecessary memory
duplication. If y is mono-output then X can be sparse.

y : ndarray, shape (n_samples,) or (n_samples, n_outputs)
Target values
l1_ratio : float, optional
float between 0 and 1 passed to elastic net (scaling between l1 and l2 penalties).
l1_ratio=1 corresponds to the Lasso
eps : float
Length of the path. eps=1e-3 means that alpha_min / alpha_max = 1e-3
n_alphas : int, optional
Number of alphas along the regularization path
alphas : ndarray, optional
List of alphas where to compute the models. If None alphas are set automatically
precompute : True | False | ‘auto’ | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto'
let us decide. The Gram matrix can also be passed as argument.
Xy : array-like, optional
Xy = np.dot(X.T, y) that can be precomputed. It is useful only when the Gram matrix is
precomputed.
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
coef_init : array, shape (n_features, ) | None
The initial values of the coefficients.
verbose : bool or integer
Amount of verbosity.
return_n_iter : bool
whether to return the number of iterations or not.
positive : bool, default False
If set to True, forces coefficients to be positive. (Only allowed when y.ndim == 1).
check_input : bool, default True
Skip input validation checks, including the Gram matrix when provided, assuming they are handled by the caller when check_input=False.
**params : kwargs
keyword arguments passed to the coordinate descent solver.
Returns alphas : array, shape (n_alphas,)
The alphas along the path where models are computed.
coefs : array, shape (n_features, n_alphas) or (n_outputs, n_features, n_alphas)
Coefficients along the path.
dual_gaps : array, shape (n_alphas,)


The dual gaps at the end of the optimization for each alpha.
n_iters : array-like, shape (n_alphas,)
The number of iterations taken by the coordinate descent optimizer to reach the specified tolerance for each alpha (returned only when return_n_iter is set to True).
See also:
MultiTaskElasticNet, MultiTaskElasticNetCV , ElasticNet, ElasticNetCV
Notes
For an example, see examples/linear_model/plot_lasso_coordinate_descent_path.py.
predict(X)
Predict using the linear model
Parameters X : {array-like, sparse matrix}, shape = (n_samples, n_features)
Samples.
Returns C : array, shape = (n_samples,)
Returns predicted values.
score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True values for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns score : float
R^2 of self.predict(X) wrt. y.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Returns self :


sklearn.linear_model.MultiTaskLassoCV
class sklearn.linear_model.MultiTaskLassoCV(eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False, max_iter=1000, tol=0.0001, copy_X=True, cv=None, verbose=False, n_jobs=1, random_state=None, selection='cyclic')
Multi-task L1/L2 Lasso with built-in cross-validation.
The optimization objective for MultiTaskLasso is:
(1 / (2 * n_samples)) * ||Y - XW||^Fro_2 + alpha * ||W||_21

Where:
||W||_21 = \sum_i \sqrt{\sum_j w_{ij}^2}

i.e. the sum of norm of each row.
Read more in the User Guide.
Parameters eps : float, optional
Length of the path. eps=1e-3 means that alpha_min / alpha_max = 1e-3.
n_alphas : int, optional
Number of alphas along the regularization path
alphas : array-like, optional
List of alphas where to compute the models. If not provided, set automatically.
fit_intercept : boolean
whether to calculate the intercept for this model. If set to false, no intercept will be used
in calculations (e.g. data is expected to be already centered).
normalize : boolean, optional, default False
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by
the l2-norm. If you wish to standardize, please use sklearn.preprocessing.
StandardScaler before calling fit on an estimator with normalize=False.
max_iter : int, optional
The maximum number of iterations.
tol : float, optional
The tolerance for the optimization: if the updates are smaller than tol, the optimization
code checks the dual gap for optimality and continues until it is smaller than tol.
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible inputs for cv are:
• None, to use the default 3-fold cross-validation,
• integer, to specify the number of folds.


• An object to be used as a cross-validation generator.
• An iterable yielding train/test splits.
For integer/None inputs, KFold is used.
Refer User Guide for the various cross-validation strategies that can be used here.
verbose : bool or integer
Amount of verbosity.
n_jobs : integer, optional
Number of CPUs to use during the cross validation. If -1, use all the CPUs. Note that
this is used only if multiple values for l1_ratio are given.
random_state : int, RandomState instance or None, optional, default None
The seed of the pseudo random number generator that selects a random feature to update. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when
selection == ‘random’
selection : str, default ‘cyclic’
If set to ‘random’, a random coefficient is updated every iteration rather than looping
over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.
Attributes intercept_ : array, shape (n_tasks,)
Independent term in decision function.
coef_ : array, shape (n_tasks, n_features)
Parameter vector (W in the cost function formula). Note that coef_ stores the transpose of W, W.T.
alpha_ : float
The amount of penalization chosen by cross validation
mse_path_ : array, shape (n_alphas, n_folds)
mean square error for the test set on each fold, varying alpha
alphas_ : numpy array, shape (n_alphas,)
The grid of alphas used for fitting.
n_iter_ : int
number of iterations run by the coordinate descent solver to reach the specified tolerance
for the optimal alpha.
See also:
MultiTaskElasticNet, ElasticNetCV , MultiTaskElasticNetCV
Notes
The algorithm used to fit the model is coordinate descent.


To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a
Fortran-contiguous numpy array.
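A minimal sketch on a synthetic multi-output problem; the shapes, noise level and number of folds are illustrative assumptions.

from sklearn.datasets import make_regression
from sklearn.linear_model import MultiTaskLassoCV

# two regression targets sharing the same informative features
X, Y = make_regression(n_samples=100, n_features=20, n_targets=2,
                       noise=1.0, random_state=0)

reg = MultiTaskLassoCV(cv=5, random_state=0).fit(X, Y)
print(reg.alpha_)          # penalty selected by cross-validation
print(reg.coef_.shape)     # (n_tasks, n_features) = (2, 20)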
Methods

fit(X, y)                                   Fit linear model with coordinate descent.
get_params([deep])                          Get parameters for this estimator.
path(X, y[, eps, n_alphas, alphas, ...])    Compute Lasso path with coordinate descent.
predict(X)                                  Predict using the linear model.
score(X, y[, sample_weight])                Returns the coefficient of determination R^2 of the prediction.
set_params(**params)                        Set the parameters of this estimator.

__init__(eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False,
max_iter=1000, tol=0.0001, copy_X=True, cv=None, verbose=False, n_jobs=1, random_state=None, selection=’cyclic’)
fit(X, y)
Fit linear model with coordinate descent
Fit is on grid of alphas and best alpha estimated by cross-validation.
Parameters X : {array-like}, shape (n_samples, n_features)
Training data. Pass directly as Fortran-contiguous data to avoid unnecessary memory
duplication. If y is mono-output, X can be sparse.
y : array-like, shape (n_samples,) or (n_samples, n_targets)
Target values
get_params(deep=True)
Get parameters for this estimator.
Parameters deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
Returns params : mapping of string to any
Parameter names mapped to their values.
static path(X, y, eps=0.001, n_alphas=100, alphas=None, precompute=’auto’, Xy=None,
copy_X=True, coef_init=None, verbose=False, return_n_iter=False, positive=False,
**params)
Compute Lasso path with coordinate descent
The Lasso optimization function varies for mono and multi-outputs.
For mono-output tasks it is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1

For multi-output tasks it is:
(1 / (2 * n_samples)) * ||Y - XW||^2_Fro + alpha * ||W||_21

Where:


||W||_21 = \sum_i \sqrt{\sum_j w_{ij}^2}

i.e. the sum of norm of each row.
Read more in the User Guide.
Parameters X : {array-like, sparse matrix}, shape (n_samples, n_features)
Training data. Pass directly as Fortran-contiguous data to avoid unnecessary memory
duplication. If y is mono-output then X can be sparse.
y : ndarray, shape (n_samples,), or (n_samples, n_outputs)
Target values
eps : float, optional
Length of the path. eps=1e-3 means that alpha_min / alpha_max = 1e-3
n_alphas : int, optional
Number of alphas along the regularization path
alphas : ndarray, optional
List of alphas where to compute the models. If None alphas are set automatically
precompute : True | False | ‘auto’ | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto'
let us decide. The Gram matrix can also be passed as argument.
Xy : array-like, optional
Xy = np.dot(X.T, y) that can be precomputed. It is useful only when the Gram matrix is
precomputed.
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
coef_init : array, shape (n_features, ) | None
The initial values of the coefficients.
verbose : bool or integer
Amount of verbosity.
return_n_iter : bool
whether to return the number of iterations or not.
positive : bool, default False
If set to True, forces coefficients to be positive. (Only allowed when y.ndim == 1).
**params : kwargs
keyword arguments passed to the coordinate descent solver.
Returns alphas : array, shape (n_alphas,)
The alphas along the path where models are computed.
coefs : array, shape (n_features, n_alphas) or (n_outputs, n_features, n_alphas)
Coefficients along the path.


dual_gaps : array, shape (n_alphas,)
The dual gaps at the end of the optimization for each alpha.
n_iters : array-like, shape (n_alphas,)
The number of iterations taken by the coordinate descent optimizer to reach the specified
tolerance for each alpha.
See also:
lars_path, Lasso, LassoLars, LassoCV , LassoLarsCV , sklearn.decomposition.
sparse_encode
Notes
For an example, see examples/linear_model/plot_lasso_coordinate_descent_path.py.
To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a
Fortran-contiguous numpy array.
Note that in certain cases the Lars solver may be significantly faster at implementing this functionality. In particular, linear interpolation can be used to retrieve model coefficients between the values output by lars_path.
Examples
Comparing lasso_path and lars_path with interpolation:
>>> X = np.array([[1, 2, 3.1], [2.3, 5.4, 4.3]]).T
>>> y = np.array([1, 2, 3.1])
>>> # Use lasso_path to compute a coefficient path
>>> _, coef_path, _ = lasso_path(X, y, alphas=[5., 1., .5])
>>> print(coef_path)
[[ 0.
0.
0.46874778]
[ 0.2159048
0.4425765
0.23689075]]
>>> # Now use lars_path and 1D linear interpolation to compute the
>>> # same path
>>> from sklearn.linear_model import lars_path
>>> alphas, active, coef_path_lars = lars_path(X, y, method='lasso')
>>> from scipy import interpolate
>>> coef_path_continuous = interpolate.interp1d(alphas[::-1],
...                                             coef_path_lars[:, ::-1])
>>> print(coef_path_continuous([5., 1., .5]))
[[ 0.          0.          0.46915237]
 [ 0.2159048   0.4425765   0.23668876]]
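A further sketch (not from the original documentation) of the precompute and Xy parameters described above; the data here is hypothetical and the exact path values depend on it:

import numpy as np
from sklearn.linear_model import lasso_path

# Hypothetical Fortran-contiguous design matrix and target.
X = np.asfortranarray([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# Precompute the Gram matrix and X^T y once and pass them to lasso_path,
# so the coordinate descent solver does not recompute them.
gram = np.dot(X.T, X)
Xy = np.dot(X.T, y)
alphas, coefs, dual_gaps = lasso_path(X, y, precompute=gram, Xy=Xy, n_alphas=5)
print(alphas.shape, coefs.shape)   # (5,) and (n_features, n_alphas) == (2, 5)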

predict(X)
Predict using the linear model
Parameters X : {array-like, sparse matrix}, shape = (n_samples, n_features)
Samples.
Returns C : array, shape = (n_samples,)
Returns predicted values.


score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True values for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns score : float
R^2 of self.predict(X) wrt. y.
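As an illustration of the R^2 definition above, the following sketch (using a hypothetical LinearRegression fit; any regressor exposing this score method behaves the same way) recomputes the score from the residual and total sums of squares:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 1.9, 3.2, 3.9])
model = LinearRegression().fit(X, y)

y_pred = model.predict(X)
u = ((y - y_pred) ** 2).sum()        # residual sum of squares
v = ((y - y.mean()) ** 2).sum()      # total sum of squares
print(np.isclose(model.score(X, y), 1 - u / v))   # True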
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
sklearn.linear_model.OrthogonalMatchingPursuitCV
class sklearn.linear_model.OrthogonalMatchingPursuitCV(copy=True, fit_intercept=True, normalize=True, max_iter=None, cv=None, n_jobs=1, verbose=False)
Cross-validated Orthogonal Matching Pursuit model (OMP)
Read more in the User Guide.
Parameters copy : bool, optional
Whether the design matrix X must be copied by the algorithm. A false value is only
helpful if X is already Fortran-ordered, otherwise a copy is made anyway.
fit_intercept : boolean, optional
whether to calculate the intercept for this model. If set to false, no intercept will be used
in calculations (e.g. data is expected to be already centered).
normalize : boolean, optional, default True
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by
the l2-norm. If you wish to standardize, please use sklearn.preprocessing.
StandardScaler before calling fit on an estimator with normalize=False.
max_iter : integer, optional
Maximum number of iterations to perform, and therefore maximum number of features to include. Defaults to 10% of n_features but at least 5, if available.

cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible inputs for cv are:
• None, to use the default 3-fold cross-validation,
• integer, to specify the number of folds.
• An object to be used as a cross-validation generator.
• An iterable yielding train/test splits.
For integer/None inputs, KFold is used.
Refer to the User Guide for the various cross-validation strategies that can be used here.
n_jobs : integer, optional
Number of CPUs to use during the cross validation. If -1, use all the CPUs
verbose : boolean or integer, optional
Sets the verbosity amount
Attributes intercept_ : float or array, shape (n_targets,)
Independent term in decision function.
coef_ : array, shape (n_features,) or (n_targets, n_features)
Parameter vector (w in the problem formulation).
n_nonzero_coefs_ : int
Estimated number of non-zero coefficients giving the best mean squared error over the
cross-validation folds.
n_iter_ : int or array-like
Number of active features across every target for the model refit with the best hyperparameters obtained by cross-validating across all folds.
See also:
orthogonal_mp, orthogonal_mp_gram, lars_path, Lars, LassoLars, OrthogonalMatchingPursuit, LarsCV, LassoLarsCV, decomposition.sparse_encode
Methods

fit(X, y)                        Fit the model using X, y as training data.
get_params([deep])               Get parameters for this estimator.
predict(X)                       Predict using the linear model
score(X, y[, sample_weight])     Returns the coefficient of determination R^2 of the prediction.
set_params(**params)             Set the parameters of this estimator.

__init__(copy=True, fit_intercept=True, normalize=True, max_iter=None, cv=None, n_jobs=1, verbose=False)
fit(X, y)
Fit the model using X, y as training data.
Parameters X : array-like, shape [n_samples, n_features]


Training data.
y : array-like, shape [n_samples]
Target values. Will be cast to X’s dtype if necessary
Returns self : object
returns an instance of self.
get_params(deep=True)
Get parameters for this estimator.
Parameters deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
Returns params : mapping of string to any
Parameter names mapped to their values.
predict(X)
Predict using the linear model
Parameters X : {array-like, sparse matrix}, shape = (n_samples, n_features)
Samples.
Returns C : array, shape = (n_samples,)
Returns predicted values.
score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True values for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns score : float
R^2 of self.predict(X) wrt. y.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :


Examples using sklearn.linear_model.OrthogonalMatchingPursuitCV
• Orthogonal Matching Pursuit
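A minimal usage sketch, assuming a synthetic sparse regression problem generated with make_regression (the number of selected coefficients depends on the data):

from sklearn.datasets import make_regression
from sklearn.linear_model import OrthogonalMatchingPursuitCV

# Hypothetical problem with only a few informative features.
X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=1.0, random_state=0)
omp_cv = OrthogonalMatchingPursuitCV(cv=5).fit(X, y)
print(omp_cv.n_nonzero_coefs_)   # number of non-zero coefficients chosen by CV
print(omp_cv.predict(X[:2]))     # predictions for the first two samples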
sklearn.linear_model.RidgeCV
class sklearn.linear_model.RidgeCV(alphas=(0.1, 1.0, 10.0), fit_intercept=True, normalize=False, scoring=None, cv=None, gcv_mode=None, store_cv_values=False)
Ridge regression with built-in cross-validation.

By default, it performs Generalized Cross-Validation, which is a form of efficient Leave-One-Out cross-validation.
Read more in the User Guide.
Parameters alphas : numpy array of shape [n_alphas]
Array of alpha values to try. Regularization strength; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the
estimates. Larger values specify stronger regularization. Alpha corresponds to C^-1 in
other linear models such as LogisticRegression or LinearSVC.
fit_intercept : boolean
Whether to calculate the intercept for this model. If set to false, no intercept will be
used in calculations (e.g. data is expected to be already centered).
normalize : boolean, optional, default False
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by
the l2-norm. If you wish to standardize, please use sklearn.preprocessing.
StandardScaler before calling fit on an estimator with normalize=False.
scoring : string, callable or None, optional, default: None
A string (see model evaluation documentation) or a scorer callable object / function with
signature scorer(estimator, X, y).
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible inputs for cv are:
• None, to use the efficient Leave-One-Out cross-validation
• integer, to specify the number of folds.
• An object to be used as a cross-validation generator.
• An iterable yielding train/test splits.
For integer/None inputs, if y is binary or multiclass, sklearn.model_selection.StratifiedKFold is used, else sklearn.model_selection.KFold is used.

Refer to the User Guide for the various cross-validation strategies that can be used here.
gcv_mode : {None, ‘auto’, ‘svd’, ‘eigen’}, optional
Flag indicating which strategy to use when performing Generalized Cross-Validation.
Options are:


'auto' : use svd if n_samples > n_features or when X is a sparse
matrix, otherwise use eigen
'svd' : force computation via singular value decomposition of X
(does not work for sparse matrices)
'eigen' : force computation via eigendecomposition of X^T X

The ‘auto’ mode is the default and is intended to pick the cheaper option of the two
depending upon the shape and format of the training data.
store_cv_values : boolean, default=False
Flag indicating if the cross-validation values corresponding to each alpha should be
stored in the cv_values_ attribute (see below). This flag is only compatible with
cv=None (i.e. using Generalized Cross-Validation).
Attributes cv_values_ : array, shape = [n_samples, n_alphas] or shape = [n_samples, n_targets,
n_alphas], optional
Cross-validation values for each alpha (if store_cv_values=True and cv=None). After
fit() has been called, this attribute will contain the mean squared errors (by default) or
the values of the {loss,score}_func function (if provided in the constructor).
coef_ : array, shape = [n_features] or [n_targets, n_features]
Weight vector(s).
intercept_ : float | array, shape = (n_targets,)
Independent term in decision function. Set to 0.0 if fit_intercept = False.
alpha_ : float
Estimated regularization parameter.
See also:
Ridge Ridge regression
RidgeClassifier Ridge classifier
RidgeClassifierCV Ridge classifier with built-in cross validation
Methods

fit(X, y[, sample_weight])       Fit Ridge regression model
get_params([deep])               Get parameters for this estimator.
predict(X)                       Predict using the linear model
score(X, y[, sample_weight])     Returns the coefficient of determination R^2 of the prediction.
set_params(**params)             Set the parameters of this estimator.

__init__(alphas=(0.1, 1.0, 10.0), fit_intercept=True, normalize=False, scoring=None, cv=None,
gcv_mode=None, store_cv_values=False)
fit(X, y, sample_weight=None)
Fit Ridge regression model
Parameters X : array-like, shape = [n_samples, n_features]
Training data

y : array-like, shape = [n_samples] or [n_samples, n_targets]
Target values. Will be cast to X’s dtype if necessary
sample_weight : float or array-like of shape [n_samples]
Sample weight
Returns self : Returns self.
get_params(deep=True)
Get parameters for this estimator.
Parameters deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
Returns params : mapping of string to any
Parameter names mapped to their values.
predict(X)
Predict using the linear model
Parameters X : {array-like, sparse matrix}, shape = (n_samples, n_features)
Samples.
Returns C : array, shape = (n_samples,)
Returns predicted values.
score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True values for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns score : float
R^2 of self.predict(X) wrt. y.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :


Examples using sklearn.linear_model.RidgeCV
• Face completion with multi-output estimators
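A minimal usage sketch, assuming synthetic data from make_regression; it shows the alpha_ attribute and the per-sample cross-validation values stored when store_cv_values=True:

from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
reg = RidgeCV(alphas=(0.1, 1.0, 10.0), store_cv_values=True).fit(X, y)
print(reg.alpha_)             # alpha selected by leave-one-out GCV
print(reg.cv_values_.shape)   # (n_samples, n_alphas) == (200, 3)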
sklearn.linear_model.RidgeClassifierCV
class sklearn.linear_model.RidgeClassifierCV(alphas=(0.1, 1.0, 10.0), fit_intercept=True, normalize=False, scoring=None, cv=None, class_weight=None)
Ridge classifier with built-in cross-validation.
By default, it performs Generalized Cross-Validation, which is a form of efficient Leave-One-Out cross-validation. Currently, only the n_features > n_samples case is handled efficiently.
Read more in the User Guide.
Parameters alphas : numpy array of shape [n_alphas]
Array of alpha values to try. Regularization strength; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the
estimates. Larger values specify stronger regularization. Alpha corresponds to C^-1 in
other linear models such as LogisticRegression or LinearSVC.
fit_intercept : boolean
Whether to calculate the intercept for this model. If set to false, no intercept will be
used in calculations (e.g. data is expected to be already centered).
normalize : boolean, optional, default False
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by
the l2-norm. If you wish to standardize, please use sklearn.preprocessing.
StandardScaler before calling fit on an estimator with normalize=False.
scoring : string, callable or None, optional, default: None
A string (see model evaluation documentation) or a scorer callable object / function with
signature scorer(estimator, X, y).
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible inputs for cv are:
• None, to use the efficient Leave-One-Out cross-validation
• integer, to specify the number of folds.
• An object to be used as a cross-validation generator.
• An iterable yielding train/test splits.
Refer to the User Guide for the various cross-validation strategies that can be used here.
class_weight : dict or ‘balanced’, optional
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.
The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
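The sketch below works through the “balanced” weighting formula above on a hypothetical imbalanced label vector; it only illustrates the arithmetic, not the estimator itself:

import numpy as np

y = np.array([0, 0, 0, 0, 0, 0, 1, 1])   # 6 samples of class 0, 2 of class 1
n_samples = y.shape[0]
n_classes = np.unique(y).shape[0]
weights = float(n_samples) / (n_classes * np.bincount(y))
print(weights)   # approximately [0.667, 2.0]: the rare class gets the larger weight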


Attributes cv_values_ : array, shape = [n_samples, n_alphas] or shape = [n_samples, n_responses, n_alphas], optional
Cross-validation values for each alpha (if store_cv_values=True and cv=None). After fit() has been called, this attribute will contain the mean squared errors (by default) or the values of the {loss,score}_func function (if provided in the constructor).
coef_ : array, shape = [n_features] or [n_targets, n_features]
Weight vector(s).
intercept_ : float | array, shape = (n_targets,)
Independent term in decision function. Set to 0.0 if fit_intercept = False.
alpha_ : float
Estimated regularization parameter
See also:
Ridge Ridge regression
RidgeClassifier Ridge classifier
RidgeCV Ridge regression with built-in cross validation
Notes
For multi-class classification, n_class classifiers are trained in a one-versus-all approach. Concretely, this is
implemented by taking advantage of the multi-variate response support in Ridge.
Methods

decision_function(X)             Predict confidence scores for samples.
fit(X, y[, sample_weight])       Fit the ridge classifier.
get_params([deep])               Get parameters for this estimator.
predict(X)                       Predict class labels for samples in X.
score(X, y[, sample_weight])     Returns the mean accuracy on the given test data and labels.
set_params(**params)             Set the parameters of this estimator.

__init__(alphas=(0.1, 1.0, 10.0), fit_intercept=True, normalize=False, scoring=None, cv=None,
class_weight=None)
decision_function(X)
Predict confidence scores for samples.
The confidence score for a sample is the signed distance of that sample to the hyperplane.
Parameters X : {array-like, sparse matrix}, shape = (n_samples, n_features)
Samples.
Returns array, shape=(n_samples,) if n_classes == 2 else (n_samples, n_classes) :
Confidence scores per (sample, class) combination. In the binary case, confidence score for self.classes_[1] where >0 means this class would be predicted.
fit(X, y, sample_weight=None)
Fit the ridge classifier.
Parameters X : array-like, shape (n_samples, n_features)
Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape (n_samples,)
Target values. Will be cast to X’s dtype if necessary
sample_weight : float or numpy array of shape (n_samples,)
Sample weight.
Returns self : object
Returns self.
get_params(deep=True)
Get parameters for this estimator.
Parameters deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
Returns params : mapping of string to any
Parameter names mapped to their values.
predict(X)
Predict class labels for samples in X.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Samples.
Returns C : array, shape = [n_samples]
Predicted class label per sample.
score(X, y, sample_weight=None)
Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns score : float
Mean accuracy of self.predict(X) wrt. y.


set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
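A minimal usage sketch of RidgeClassifierCV on the built-in breast cancer dataset (any binary classification data would do); it shows the selected alpha_ and the decision_function output described above:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import RidgeClassifierCV

X, y = load_breast_cancer(return_X_y=True)
clf = RidgeClassifierCV(alphas=(0.1, 1.0, 10.0)).fit(X, y)
print(clf.alpha_)                     # regularization strength chosen by GCV
print(clf.score(X, y))                # mean accuracy on the training data
print(clf.decision_function(X[:2]))   # signed distances to the separating hyperplane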
Information Criterion
Some models can offer an information-theoretic closed-form formula of the optimal estimate of the regularization
parameter by computing a single regularization path (instead of several when using cross-validation).
Here is the list of models benefiting from the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) for automated model selection:
linear_model.LassoLarsIC([criterion, . . . ])

Lasso model fit with Lars using BIC or AIC for model selection

sklearn.linear_model.LassoLarsIC
class sklearn.linear_model.LassoLarsIC(criterion='aic', fit_intercept=True, verbose=False, normalize=True, precompute='auto', max_iter=500, eps=2.2204460492503131e-16, copy_X=True, positive=False)
Lasso model fit with Lars using BIC or AIC for model selection
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1

AIC is the Akaike information criterion and BIC is the Bayes Information criterion. Such criteria are useful to select the value of the regularization parameter by making a trade-off between the goodness of fit and the complexity of the model. A good model should explain the data well while being simple.
Read more in the User Guide.
Parameters criterion : ‘bic’ | ‘aic’
The type of criterion to use.
fit_intercept : boolean
whether to calculate the intercept for this model. If set to false, no intercept will be used
in calculations (e.g. data is expected to be already centered).
verbose : boolean or integer, optional
Sets the verbosity amount
normalize : boolean, optional, default True
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by
the l2-norm. If you wish to standardize, please use sklearn.preprocessing.
StandardScaler before calling fit on an estimator with normalize=False.
precompute : True | False | ‘auto’ | array-like

Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto'
let us decide. The Gram matrix can also be passed as argument.
max_iter : integer, optional
Maximum number of iterations to perform. Can be used for early stopping.
eps : float, optional
The machine-precision regularization in the computation of the Cholesky diagonal factors. Increase this for very ill-conditioned systems. Unlike the tol parameter in some
iterative optimization-based algorithms, this parameter does not control the tolerance of
the optimization.
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
positive : boolean (default=False)
Restrict coefficients to be >= 0. Be aware that you might want to remove fit_intercept
which is set True by default. Under the positive restriction the model coefficients do
not converge to the ordinary-least-squares solution for small values of alpha. Only coefficients up to the smallest alpha value (alphas_[alphas_ > 0.].min() when
fit_path=True) reached by the stepwise Lars-Lasso algorithm are typically in congruence with the solution of the coordinate descent Lasso estimator. As a consequence
using LassoLarsIC only makes sense for problems where a sparse solution is expected
and/or reached.
Attributes coef_ : array, shape (n_features,)
parameter vector (w in the formulation formula)
intercept_ : float
independent term in decision function.
alpha_ : float
the alpha parameter chosen by the information criterion
n_iter_ : int
number of iterations run by lars_path to find the grid of alphas.
criterion_ : array, shape (n_alphas,)
The value of the information criteria (‘aic’, ‘bic’) across all alphas. The alpha which
has the smallest information criterion is chosen. This value is larger by a factor of
n_samples compared to Eqns. 2.15 and 2.16 in (Zou et al, 2007).
See also:
lars_path, LassoLars, LassoLarsCV
Notes
The estimation of the number of degrees of freedom is given by:
“On the degrees of freedom of the lasso” Hui Zou, Trevor Hastie, and Robert Tibshirani Ann. Statist. Volume
35, Number 5 (2007), 2173-2192.
https://en.wikipedia.org/wiki/Akaike_information_criterion
https://en.wikipedia.org/wiki/Bayesian_information_criterion

Examples
>>> from sklearn import linear_model
>>> reg = linear_model.LassoLarsIC(criterion='bic')
>>> reg.fit([[-1, 1], [0, 0], [1, 1]], [-1.1111, 0, -1.1111])
...
LassoLarsIC(copy_X=True, criterion='bic', eps=..., fit_intercept=True,
max_iter=500, normalize=True, positive=False, precompute='auto',
verbose=False)
>>> print(reg.coef_)
[ 0. -1.11...]

Methods

fit(X, y[, copy_X])              Fit the model using X, y as training data.
get_params([deep])               Get parameters for this estimator.
predict(X)                       Predict using the linear model
score(X, y[, sample_weight])     Returns the coefficient of determination R^2 of the prediction.
set_params(**params)             Set the parameters of this estimator.

__init__(criterion=’aic’, fit_intercept=True, verbose=False, normalize=True, precompute=’auto’,
max_iter=500, eps=2.2204460492503131e-16, copy_X=True, positive=False)
fit(X, y, copy_X=True)
Fit the model using X, y as training data.
Parameters X : array-like, shape (n_samples, n_features)
training data.
y : array-like, shape (n_samples,)
target values. Will be cast to X’s dtype if necessary
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
Returns self : object
returns an instance of self.
get_params(deep=True)
Get parameters for this estimator.
Parameters deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
Returns params : mapping of string to any
Parameter names mapped to their values.
predict(X)
Predict using the linear model
Parameters X : {array-like, sparse matrix}, shape = (n_samples, n_features)


Samples.
Returns C : array, shape = (n_samples,)
Returns predicted values.
score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True values for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns score : float
R^2 of self.predict(X) wrt. y.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
Examples using sklearn.linear_model.LassoLarsIC
• Lasso model selection: Cross-Validation / AIC / BIC
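A short comparative sketch, assuming synthetic data from make_regression, showing how the chosen alpha_ and the resulting sparsity can differ between the AIC and BIC criteria:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=4.0, random_state=0)
for criterion in ('aic', 'bic'):
    model = LassoLarsIC(criterion=criterion).fit(X, y)
    n_nonzero = np.sum(model.coef_ != 0)
    print(criterion, model.alpha_, n_nonzero)   # BIC tends to pick a sparser model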
Out of Bag Estimates
When using ensemble methods based upon bagging, i.e. generating new training sets using sampling with replacement, part of the training set remains unused. For each classifier in the ensemble, a different part of the training set is left out.
This left out portion can be used to estimate the generalization error without having to rely on a separate validation set. This estimate comes “for free” as no additional data is needed and can be used for model selection; a minimal usage sketch follows the table below.
This is currently implemented in the following classes:
ensemble.RandomForestClassifier([...])             A random forest classifier.
ensemble.RandomForestRegressor([...])              A random forest regressor.
ensemble.ExtraTreesClassifier([...])               An extra-trees classifier.
ensemble.ExtraTreesRegressor([n_estimators, ...])  An extra-trees regressor.
ensemble.GradientBoostingClassifier([loss, ...])   Gradient Boosting for classification.
ensemble.GradientBoostingRegressor([loss, ...])    Gradient Boosting for regression.
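A minimal sketch of the out-of-bag estimate mentioned above, assuming a synthetic classification problem; oob_score=True requires bootstrap sampling:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                bootstrap=True, random_state=0)
forest.fit(X, y)
print(forest.oob_score_)   # out-of-bag estimate of the generalization accuracy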

sklearn.ensemble.RandomForestClassifier
class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
A random forest classifier.
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
Read more in the User Guide.
Parameters n_estimators : integer, optional (default=10)
The number of trees in the forest.
criterion : string, optional (default=”gini”)
The function to measure the quality of a split. Supported criteria are “gini” for the Gini
impurity and “entropy” for the information gain. Note: this parameter is tree-specific.
max_features : int, float, string or None, optional (default=”auto”)
The number of features to consider when looking for the best split:
• If int, then consider max_features features at each split.
• If float, then max_features is a percentage and int(max_features * n_features) features
are considered at each split.
• If “auto”, then max_features=sqrt(n_features).
• If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
• If “log2”, then max_features=log2(n_features).
• If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node
samples is found, even if it requires to effectively inspect more than max_features
features.
max_depth : integer or None, optional (default=None)


The maximum depth of the tree. If None, then nodes are expanded until all leaves are
pure or until all leaves contain less than min_samples_split samples.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
• If int, then consider min_samples_split as the minimum number.
• If float, then min_samples_split is a percentage and ceil(min_samples_split *
n_samples) are the minimum number of samples for each split.
Changed in version 0.18: Added float values for percentages.
min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node:
• If int, then consider min_samples_leaf as the minimum number.
• If float, then min_samples_leaf is a percentage and ceil(min_samples_leaf *
n_samples) are the minimum number of samples for each node.
Changed in version 0.18: Added float values for percentages.
min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the sum total of weights (of all the input samples)
required to be at a leaf node. Samples have equal weight when sample_weight is not
provided.
max_leaf_nodes : int or None, optional (default=None)
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as
relative reduction in impurity. If None then unlimited number of leaf nodes.
min_impurity_split : float,
Threshold for early stopping in tree growth. A node will split if its impurity is above
the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19 and will be removed in 0.21. Use
min_impurity_decrease instead.
min_impurity_decrease : float, optional (default=0.)
A node will be split if this split induces a decrease of the impurity greater than or equal
to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity
- N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current
node, N_t_L is the number of samples in the left child, and N_t_R is the number of
samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed; a worked numerical sketch of this formula follows the parameter list below.
New in version 0.19.
bootstrap : boolean, optional (default=True)


Whether bootstrap samples are used when building trees.
oob_score : bool (default=False)
Whether to use out-of-bag samples to estimate the generalization accuracy.
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel for both fit and predict. If -1, then the number of
jobs is set to the number of cores.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
verbose : int, optional (default=0)
Controls the verbosity of the tree building process.
warm_start : bool, optional (default=False)
When set to True, reuse the solution of the previous call to fit and add more estimators
to the ensemble, otherwise, just fit a whole new forest.
class_weight : dict, list of dicts, “balanced”, “balanced_subsample” or None, optional (default=None)
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.
Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].
The “balanced” mode uses the values of y to automatically adjust weights inversely
proportional to class frequencies in the input data as n_samples / (n_classes
* np.bincount(y))
The “balanced_subsample” mode is the same as “balanced” except that weights are
computed based on the bootstrap sample for every tree grown.
For multi-output, the weights of each column of y will be multiplied.
Note that these weights will be multiplied with sample_weight (passed through the fit
method) if sample_weight is specified.
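A worked numerical sketch of the min_impurity_decrease formula given above, using hypothetical node statistics (it only reproduces the arithmetic, not any internal scikit-learn code):

# Hypothetical weighted node statistics.
N = 100.0                      # total weighted number of samples
N_t = 40.0                     # weighted samples at the current node
N_t_L, N_t_R = 30.0, 10.0      # weighted samples in the left / right child
impurity = 0.48                # impurity of the current node
left_impurity, right_impurity = 0.30, 0.10

decrease = N_t / N * (impurity
                      - N_t_R / N_t * right_impurity
                      - N_t_L / N_t * left_impurity)
print(decrease)   # 0.4 * (0.48 - 0.25 * 0.10 - 0.75 * 0.30) = 0.092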
Attributes estimators_ : list of DecisionTreeClassifier
The collection of fitted sub-estimators.
classes_ : array of shape = [n_classes] or a list of such arrays
The classes labels (single output problem), or a list of arrays of class labels (multi-output
problem).
n_classes_ : int or list
The number of classes (single output problem), or a list containing the number of classes
for each output (multi-output problem).
n_features_ : int


The number of features when fit is performed.
n_outputs_ : int
The number of outputs when fit is performed.
feature_importances_ : array of shape = [n_features]
The feature importances (the higher, the more important the feature).
oob_score_ : float
Score of the training dataset obtained using an out-of-bag estimate.
oob_decision_function_ : array of shape = [n_samples, n_classes]
Decision function computed with out-of-bag estimate on the training set. If
n_estimators is small it might be possible that a data point was never left out during
the bootstrap. In this case, oob_decision_function_ might contain NaN.
See also:
DecisionTreeClassifier, ExtraTreesClassifier
Notes
The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with
the same training data, max_features=n_features and bootstrap=False, if the improvement of the
criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic
behaviour during fitting, random_state has to be fixed.
References
[R23]
Examples
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import make_classification
>>>
>>> X, y = make_classification(n_samples=1000, n_features=4,
...                            n_informative=2, n_redundant=0,
...                            random_state=0, shuffle=False)
>>> clf = RandomForestClassifier(max_depth=2, random_state=0)
>>> clf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)
>>> print(clf.feature_importances_)
[ 0.17287856  0.80608704  0.01884792  0.00218648]
>>> print(clf.predict([[0, 0, 0, 0]]))
[1]

Methods

apply(X)                         Apply trees in the forest to X, return leaf indices.
decision_path(X)                 Return the decision path in the forest
fit(X, y[, sample_weight])       Build a forest of trees from the training set (X, y).
get_params([deep])               Get parameters for this estimator.
predict(X)                       Predict class for X.
predict_log_proba(X)             Predict class log-probabilities for X.
predict_proba(X)                 Predict class probabilities for X.
score(X, y[, sample_weight])     Returns the mean accuracy on the given test data and labels.
set_params(**params)             Set the parameters of this estimator.

__init__(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
apply(X)
Apply trees in the forest to X, return leaf indices.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, its dtype will be converted to dtype=np.float32. If
a sparse matrix is provided, it will be converted into a sparse csr_matrix.
Returns X_leaves : array_like, shape = [n_samples, n_estimators]
For each datapoint x in X and for each tree in the forest, return the index of the leaf x
ends up in.
decision_path(X)
Return the decision path in the forest
New in version 0.18.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, its dtype will be converted to dtype=np.float32. If
a sparse matrix is provided, it will be converted into a sparse csr_matrix.
Returns indicator : sparse csr array, shape = [n_samples, n_nodes]
Return a node indicator matrix where non-zero elements indicate that the samples go through the nodes.
n_nodes_ptr : array of size (n_estimators + 1, )
The columns from indicator[n_nodes_ptr[i]:n_nodes_ptr[i+1]] gives the indicator value
for the i-th estimator.
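A minimal sketch, assuming a small synthetic dataset, showing how n_nodes_ptr is used to slice the indicator matrix per estimator as described above:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=50, n_features=4, random_state=0)
forest = RandomForestClassifier(n_estimators=3, random_state=0).fit(X, y)
indicator, n_nodes_ptr = forest.decision_path(X)
for i in range(forest.n_estimators):
    # Columns n_nodes_ptr[i]:n_nodes_ptr[i + 1] belong to the i-th tree.
    tree_indicator = indicator[:, n_nodes_ptr[i]:n_nodes_ptr[i + 1]]
    print(i, tree_indicator.shape)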
feature_importances_


Return the feature importances (the higher, the more important the feature).
Returns feature_importances_ : array, shape = [n_features]
fit(X, y, sample_weight=None)
Build a forest of trees from the training set (X, y).
Parameters X : array-like or sparse matrix of shape = [n_samples, n_features]
The training input samples. Internally, its dtype will be converted to dtype=np.
float32. If a sparse matrix is provided, it will be converted into a sparse
csc_matrix.
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The target values (class labels in classification, real numbers in regression).
sample_weight : array-like, shape = [n_samples] or None
Sample weights. If None, then samples are equally weighted. Splits that would create
child nodes with net zero or negative weight are ignored while searching for a split in
each node. In the case of classification, splits are also ignored if they would result in
any single class carrying a negative weight in either child node.
Returns self : object
Returns self.
get_params(deep=True)
Get parameters for this estimator.
Parameters deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
Returns params : mapping of string to any
Parameter names mapped to their values.
predict(X)
Predict class for X.
The predicted class of an input sample is a vote by the trees in the forest, weighted by their probability
estimates. That is, the predicted class is the one with highest mean probability estimate across the trees.
Parameters X : array-like or sparse matrix of shape = [n_samples, n_features]
The input samples. Internally, its dtype will be converted to dtype=np.float32. If
a sparse matrix is provided, it will be converted into a sparse csr_matrix.
Returns y : array of shape = [n_samples] or [n_samples, n_outputs]
The predicted classes.
predict_log_proba(X)
Predict class log-probabilities for X.
The predicted class log-probabilities of an input sample is computed as the log of the mean predicted class
probabilities of the trees in the forest.
Parameters X : array-like or sparse matrix of shape = [n_samples, n_features]
The input samples. Internally, its dtype will be converted to dtype=np.float32. If
a sparse matrix is provided, it will be converted into a sparse csr_matrix.


Returns p : array of shape = [n_samples, n_classes], or a list of n_outputs
such arrays if n_outputs > 1. The class probabilities of the input samples. The order of
the classes corresponds to that in the attribute classes_.
predict_proba(X)
Predict class probabilities for X.
The predicted class probabilities of an input sample are computed as the mean predicted class probabilities
of the trees in the forest. The class probability of a single tree is the fraction of samples of the same class
in a leaf.
Parameters X : array-like or sparse matrix of shape = [n_samples, n_features]
The input samples. Internally, its dtype will be converted to dtype=np.float32. If
a sparse matrix is provided, it will be converted into a sparse csr_matrix.
Returns p : array of shape = [n_samples, n_classes], or a list of n_outputs
such arrays if n_outputs > 1. The class probabilities of the input samples. The order of
the classes corresponds to that in the attribute classes_.
score(X, y, sample_weight=None)
Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns score : float
Mean accuracy of self.predict(X) wrt. y.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
Examples using sklearn.ensemble.RandomForestClassifier
• Probability Calibration for 3-class classification
• Comparison of Calibration of Classifiers
• Classifier comparison
• OOB Errors for Random Forests
• Feature transformations with ensembles of trees


• Plot the decision surfaces of ensembles of trees on the iris dataset
• Plot class probabilities calculated by the VotingClassifier
• Comparing randomized search and grid search for hyperparameter estimation
• Classification of text documents using sparse features
sklearn.ensemble.RandomForestRegressor
class sklearn.ensemble.RandomForestRegressor(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False)
A random forest regressor.
A random forest is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
Read more in the User Guide.
Parameters n_estimators : integer, optional (default=10)
The number of trees in the forest.
criterion : string, optional (default=”mse”)
The function to measure the quality of a split. Supported criteria are “mse” for the
mean squared error, which is equal to variance reduction as feature selection criterion,
and “mae” for the mean absolute error.
New in version 0.18: Mean Absolute Error (MAE) criterion.
max_features : int, float, string or None, optional (default=”auto”)
The number of features to consider when looking for the best split:
• If int, then consider max_features features at each split.
• If float, then max_features is a percentage and int(max_features * n_features) features
are considered at each split.
• If “auto”, then max_features=n_features.
• If “sqrt”, then max_features=sqrt(n_features).
• If “log2”, then max_features=log2(n_features).
• If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node
samples is found, even if it requires to effectively inspect more than max_features
features.
max_depth : integer or None, optional (default=None)


The maximum depth of the tree. If None, then nodes are expanded until all leaves are
pure or until all leaves contain less than min_samples_split samples.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
• If int, then consider min_samples_split as the minimum number.
• If float, then min_samples_split is a percentage and ceil(min_samples_split *
n_samples) are the minimum number of samples for each split.
Changed in version 0.18: Added float values for percentages.
min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node:
• If int, then consider min_samples_leaf as the minimum number.
• If float, then min_samples_leaf is a percentage and ceil(min_samples_leaf *
n_samples) are the minimum number of samples for each node.
Changed in version 0.18: Added float values for percentages.
min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the sum total of weights (of all the input samples)
required to be at a leaf node. Samples have equal weight when sample_weight is not
provided.
max_leaf_nodes : int or None, optional (default=None)
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as
relative reduction in impurity. If None then unlimited number of leaf nodes.
min_impurity_split : float,
Threshold for early stopping in tree growth. A node will split if its impurity is above
the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19 and will be removed in 0.21. Use
min_impurity_decrease instead.
min_impurity_decrease : float, optional (default=0.)
A node will be split if this split induces a decrease of the impurity greater than or equal
to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity
- N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current
node, N_t_L is the number of samples in the left child, and N_t_R is the number of
samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is
passed.
New in version 0.19.
bootstrap : boolean, optional (default=True)


Whether bootstrap samples are used when building trees.
oob_score : bool, optional (default=False)
whether to use out-of-bag samples to estimate the R^2 on unseen data.
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel for both fit and predict. If -1, then the number of
jobs is set to the number of cores.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
verbose : int, optional (default=0)
Controls the verbosity of the tree building process.
warm_start : bool, optional (default=False)
When set to True, reuse the solution of the previous call to fit and add more estimators
to the ensemble, otherwise, just fit a whole new forest.
Attributes estimators_ : list of DecisionTreeRegressor
The collection of fitted sub-estimators.
feature_importances_ : array of shape = [n_features]
The feature importances (the higher, the more important the feature).
n_features_ : int
The number of features when fit is performed.
n_outputs_ : int
The number of outputs when fit is performed.
oob_score_ : float
Score of the training dataset obtained using an out-of-bag estimate.
oob_prediction_ : array of shape = [n_samples]
Prediction computed with out-of-bag estimate on the training set.
See also:
DecisionTreeRegressor, ExtraTreesRegressor
Notes
The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with
the same training data, max_features=n_features and bootstrap=False, if the improvement of the
criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic
behaviour during fitting, random_state has to be fixed.


References
[R24]
Examples
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.datasets import make_regression
>>>
>>> X, y = make_regression(n_features=4, n_informative=2,
...                        random_state=0, shuffle=False)
>>> regr = RandomForestRegressor(max_depth=2, random_state=0)
>>> regr.fit(X, y)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=0, verbose=0, warm_start=False)
>>> print(regr.feature_importances_)
[ 0.17339552  0.81594114  0.          0.01066333]
>>> print(regr.predict([[0, 0, 0, 0]]))
[-2.50699856]

Methods

apply(X)                         Apply trees in the forest to X, return leaf indices.
decision_path(X)                 Return the decision path in the forest
fit(X, y[, sample_weight])       Build a forest of trees from the training set (X, y).
get_params([deep])               Get parameters for this estimator.
predict(X)                       Predict regression target for X.
score(X, y[, sample_weight])     Returns the coefficient of determination R^2 of the prediction.
set_params(**params)             Set the parameters of this estimator.

__init__(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False)
apply(X)
Apply trees in the forest to X, return leaf indices.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, its dtype will be converted to dtype=np.float32. If
a sparse matrix is provided, it will be converted into a sparse csr_matrix.
Returns X_leaves : array_like, shape = [n_samples, n_estimators]
For each datapoint x in X and for each tree in the forest, return the index of the leaf x
ends up in.


decision_path(X)
Return the decision path in the forest
New in version 0.18.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, its dtype will be converted to dtype=np.float32. If
a sparse matrix is provided, it will be converted into a sparse csr_matrix.
Returns indicator : sparse csr array, shape = [n_samples, n_nodes]
Return a node indicator matrix where non-zero elements indicate that the samples go through the nodes.
n_nodes_ptr : array of size (n_estimators + 1, )
The columns from indicator[n_nodes_ptr[i]:n_nodes_ptr[i+1]] gives the indicator value
for the i-th estimator.
feature_importances_
Return the feature importances (the higher, the more important the feature).
Returns feature_importances_ : array, shape = [n_features]
fit(X, y, sample_weight=None)
Build a forest of trees from the training set (X, y).
Parameters X : array-like or sparse matrix of shape = [n_samples, n_features]
The training input samples. Internally, its dtype will be converted to dtype=np.
float32. If a sparse matrix is provided, it will be converted into a sparse
csc_matrix.
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The target values (class labels in classification, real numbers in regression).
sample_weight : array-like, shape = [n_samples] or None
Sample weights. If None, then samples are equally weighted. Splits that would create
child nodes with net zero or negative weight are ignored while searching for a split in
each node. In the case of classification, splits are also ignored if they would result in
any single class carrying a negative weight in either child node.
Returns self : object
Returns self.
get_params(deep=True)
Get parameters for this estimator.
Parameters deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
Returns params : mapping of string to any
Parameter names mapped to their values.
predict(X)
Predict regression target for X.


The predicted regression target of an input sample is computed as the mean predicted regression targets of
the trees in the forest.
Parameters X : array-like or sparse matrix of shape = [n_samples, n_features]
The input samples. Internally, its dtype will be converted to dtype=np.float32. If
a sparse matrix is provided, it will be converted into a sparse csr_matrix.
Returns y : array of shape = [n_samples] or [n_samples, n_outputs]
The predicted values.
score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True values for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns score : float
R^2 of self.predict(X) wrt. y.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
Examples using sklearn.ensemble.RandomForestRegressor
• Imputing missing values before building an estimator
• Prediction Latency
• Comparing random forests and the multi-output meta estimator


sklearn.ensemble.ExtraTreesClassifier
class sklearn.ensemble.ExtraTreesClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=False, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
An extra-trees classifier.
This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Read more in the User Guide.
Parameters n_estimators : integer, optional (default=10)
The number of trees in the forest.
criterion : string, optional (default=”gini”)
The function to measure the quality of a split. Supported criteria are “gini” for the Gini
impurity and “entropy” for the information gain.
max_features : int, float, string or None, optional (default=”auto”)
The number of features to consider when looking for the best split:
• If int, then consider max_features features at each split.
• If float, then max_features is a percentage and int(max_features * n_features) features
are considered at each split.
• If “auto”, then max_features=sqrt(n_features).
• If “sqrt”, then max_features=sqrt(n_features).
• If “log2”, then max_features=log2(n_features).
• If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node
samples is found, even if it requires to effectively inspect more than max_features
features.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are
pure or until all leaves contain less than min_samples_split samples.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
• If int, then consider min_samples_split as the minimum number.
• If float, then min_samples_split is a percentage and ceil(min_samples_split *
n_samples) are the minimum number of samples for each split.
Changed in version 0.18: Added float values for percentages.

min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node:
• If int, then consider min_samples_leaf as the minimum number.
• If float, then min_samples_leaf is a percentage and ceil(min_samples_leaf *
n_samples) are the minimum number of samples for each node.
Changed in version 0.18: Added float values for percentages.
min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the sum total of weights (of all the input samples)
required to be at a leaf node. Samples have equal weight when sample_weight is not
provided.
max_leaf_nodes : int or None, optional (default=None)
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as
relative reduction in impurity. If None then unlimited number of leaf nodes.
min_impurity_split : float,
Threshold for early stopping in tree growth. A node will split if its impurity is above
the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19 and will be removed in 0.21. Use
min_impurity_decrease instead.
min_impurity_decrease : float, optional (default=0.)
A node will be split if this split induces a decrease of the impurity greater than or equal
to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current
node, N_t_L is the number of samples in the left child, and N_t_R is the number of
samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is
passed.
New in version 0.19.
bootstrap : boolean, optional (default=False)
Whether bootstrap samples are used when building trees.
oob_score : bool, optional (default=False)
Whether to use out-of-bag samples to estimate the generalization accuracy.
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel for both fit and predict. If -1, then the number of
jobs is set to the number of cores.
random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
verbose : int, optional (default=0)
Controls the verbosity of the tree building process.
warm_start : bool, optional (default=False)
When set to True, reuse the solution of the previous call to fit and add more estimators
to the ensemble, otherwise, just fit a whole new forest.
class_weight : dict, list of dicts, “balanced”, “balanced_subsample” or None, optional (default=None)
Weights associated with classes in the form {class_label: weight}. If not
given, all classes are supposed to have weight one. For multi-output problems, a list of
dicts can be provided in the same order as the columns of y.
Note that for multioutput (including multilabel) weights should be defined for each class
of every column in its own dict. For example, for four-class multilabel classification
weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of
[{1:1}, {2:5}, {3:1}, {4:1}].
The “balanced” mode uses the values of y to automatically adjust weights inversely
proportional to class frequencies in the input data as n_samples / (n_classes
* np.bincount(y))
The “balanced_subsample” mode is the same as “balanced” except that weights are
computed based on the bootstrap sample for every tree grown.
For multi-output, the weights of each column of y will be multiplied.
Note that these weights will be multiplied with sample_weight (passed through the fit
method) if sample_weight is specified.
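A minimal sketch (toy, imbalanced data assumed; not part of the original reference) of the 'balanced' heuristic described above; helper names are illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# n_samples / (n_classes * np.bincount(y)), computed by hand
manual_weights = len(y) / (len(np.unique(y)) * np.bincount(y))
print(dict(enumerate(manual_weights)))

clf = ExtraTreesClassifier(n_estimators=50, class_weight='balanced', random_state=0)
clf.fit(X, y)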
Attributes estimators_ : list of DecisionTreeClassifier
The collection of fitted sub-estimators.
classes_ : array of shape = [n_classes] or a list of such arrays
The classes labels (single output problem), or a list of arrays of class labels (multi-output
problem).
n_classes_ : int or list
The number of classes (single output problem), or a list containing the number of classes
for each output (multi-output problem).
feature_importances_ : array of shape = [n_features]
The feature importances (the higher, the more important the feature).
n_features_ : int
The number of features when fit is performed.
n_outputs_ : int
The number of outputs when fit is performed.
oob_score_ : float
Score of the training dataset obtained using an out-of-bag estimate.

oob_decision_function_ : array of shape = [n_samples, n_classes]
Decision function computed with out-of-bag estimate on the training set. If
n_estimators is small it might be possible that a data point was never left out during
the bootstrap. In this case, oob_decision_function_ might contain NaN.
See also:
sklearn.tree.ExtraTreeClassifier Base classifier for this ensemble.
RandomForestClassifier Ensemble Classifier based on trees with optimal splits.
Notes
The default values for the parameters controlling the size of the trees (e.g. max_depth,
min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on
some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by
setting those parameter values.
References
[R19]
Methods

apply(X)                             Apply trees in the forest to X, return leaf indices.
decision_path(X)                     Return the decision path in the forest.
fit(X, y[, sample_weight])           Build a forest of trees from the training set (X, y).
get_params([deep])                   Get parameters for this estimator.
predict(X)                           Predict class for X.
predict_log_proba(X)                 Predict class log-probabilities for X.
predict_proba(X)                     Predict class probabilities for X.
score(X, y[, sample_weight])         Returns the mean accuracy on the given test data and labels.
set_params(**params)                 Set the parameters of this estimator.

__init__(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2,
    min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=False, oob_score=False,
    n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
apply(X)
Apply trees in the forest to X, return leaf indices.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, its dtype will be converted to dtype=np.float32. If
a sparse matrix is provided, it will be converted into a sparse csr_matrix.
Returns X_leaves : array_like, shape = [n_samples, n_estimators]
For each datapoint x in X and for each tree in the forest, return the index of the leaf x
ends up in.
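A minimal sketch (toy data assumed; not part of the original reference) of the leaf-index matrix returned by apply:

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=100, random_state=0)
clf = ExtraTreesClassifier(n_estimators=10, random_state=0).fit(X, y)

leaves = clf.apply(X)
print(leaves.shape)   # (100, 10): one leaf index per sample and per tree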
decision_path(X)
Return the decision path in the forest
New in version 0.18.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, its dtype will be converted to dtype=np.float32. If
a sparse matrix is provided, it will be converted into a sparse csr_matrix.
Returns indicator : sparse csr array, shape = [n_samples, n_nodes]
Return a node indicator matrix where non-zero elements indicate that the samples go
through the nodes.
n_nodes_ptr : array of size (n_estimators + 1, )
The columns from indicator[n_nodes_ptr[i]:n_nodes_ptr[i+1]] gives the indicator value
for the i-th estimator.
feature_importances_
Return the feature importances (the higher, the more important the feature).
Returns feature_importances_ : array, shape = [n_features]
fit(X, y, sample_weight=None)
Build a forest of trees from the training set (X, y).
Parameters X : array-like or sparse matrix of shape = [n_samples, n_features]
The training input samples. Internally, its dtype will be converted to dtype=np.
float32. If a sparse matrix is provided, it will be converted into a sparse
csc_matrix.
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The target values (class labels in classification, real numbers in regression).
sample_weight : array-like, shape = [n_samples] or None
Sample weights. If None, then samples are equally weighted. Splits that would create
child nodes with net zero or negative weight are ignored while searching for a split in
each node. In the case of classification, splits are also ignored if they would result in
any single class carrying a negative weight in either child node.
Returns self : object
Returns self.
get_params(deep=True)
Get parameters for this estimator.
Parameters deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
Returns params : mapping of string to any
Parameter names mapped to their values.
predict(X)
Predict class for X.

The predicted class of an input sample is a vote by the trees in the forest, weighted by their probability
estimates. That is, the predicted class is the one with highest mean probability estimate across the trees.
Parameters X : array-like or sparse matrix of shape = [n_samples, n_features]
The input samples. Internally, its dtype will be converted to dtype=np.float32. If
a sparse matrix is provided, it will be converted into a sparse csr_matrix.
Returns y : array of shape = [n_samples] or [n_samples, n_outputs]
The predicted classes.
predict_log_proba(X)
Predict class log-probabilities for X.
The predicted class log-probabilities of an input sample are computed as the log of the mean predicted class
probabilities of the trees in the forest.
Parameters X : array-like or sparse matrix of shape = [n_samples, n_features]
The input samples. Internally, its dtype will be converted to dtype=np.float32. If
a sparse matrix is provided, it will be converted into a sparse csr_matrix.
Returns p : array of shape = [n_samples, n_classes], or a list of n_outputs
such arrays if n_outputs > 1. The class log-probabilities of the input samples. The order of
the classes corresponds to that in the attribute classes_.
predict_proba(X)
Predict class probabilities for X.
The predicted class probabilities of an input sample are computed as the mean predicted class probabilities
of the trees in the forest. The class probability of a single tree is the fraction of samples of the same class
in a leaf.
Parameters X : array-like or sparse matrix of shape = [n_samples, n_features]
The input samples. Internally, its dtype will be converted to dtype=np.float32. If
a sparse matrix is provided, it will be converted into a sparse csr_matrix.
Returns p : array of shape = [n_samples, n_classes], or a list of n_outputs
such arrays if n_outputs > 1. The class probabilities of the input samples. The order of
the classes corresponds to that in the attribute classes_.
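A minimal sketch (toy data assumed; not part of the original reference) reproducing the averaging described above from the fitted sub-estimators:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = ExtraTreesClassifier(n_estimators=25, random_state=0).fit(X, y)

# Mean of the per-tree class probabilities.
manual = np.mean([tree.predict_proba(X) for tree in clf.estimators_], axis=0)
print(np.allclose(manual, clf.predict_proba(X)))   # expected: True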
score(X, y, sample_weight=None)
Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns score : float
Mean accuracy of self.predict(X) wrt. y.

set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form __ so that it’s possible to update each component
of a nested object.
Returns self :
Examples using sklearn.ensemble.ExtraTreesClassifier
• Feature importances with forests of trees
• Pixel importances with a parallel forest of trees
• Plot the decision surfaces of ensembles of trees on the iris dataset
• Hashing feature transformation using Totally Random Trees
sklearn.ensemble.ExtraTreesRegressor
class sklearn.ensemble.ExtraTreesRegressor(n_estimators=10, criterion='mse', max_depth=None,
    min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto',
    max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=False,
    oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False)
An extra-trees regressor.
This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on
various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Read more in the User Guide.
Parameters n_estimators : integer, optional (default=10)
The number of trees in the forest.
criterion : string, optional (default=”mse”)
The function to measure the quality of a split. Supported criteria are “mse” for the
mean squared error, which is equal to variance reduction as feature selection criterion,
and “mae” for the mean absolute error.
New in version 0.18: Mean Absolute Error (MAE) criterion.
max_features : int, float, string or None, optional (default=”auto”)
The number of features to consider when looking for the best split:
• If int, then consider max_features features at each split.
• If float, then max_features is a percentage and int(max_features * n_features) features
are considered at each split.
• If “auto”, then max_features=n_features.

• If “sqrt”, then max_features=sqrt(n_features).
• If “log2”, then max_features=log2(n_features).
• If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node
samples is found, even if it requires to effectively inspect more than max_features
features.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are
pure or until all leaves contain less than min_samples_split samples.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
• If int, then consider min_samples_split as the minimum number.
• If float, then min_samples_split is a percentage and ceil(min_samples_split *
n_samples) are the minimum number of samples for each split.
Changed in version 0.18: Added float values for percentages.
min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node:
• If int, then consider min_samples_leaf as the minimum number.
• If float, then min_samples_leaf is a percentage and ceil(min_samples_leaf *
n_samples) are the minimum number of samples for each node.
Changed in version 0.18: Added float values for percentages.
min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the sum total of weights (of all the input samples)
required to be at a leaf node. Samples have equal weight when sample_weight is not
provided.
max_leaf_nodes : int or None, optional (default=None)
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as
relative reduction in impurity. If None then unlimited number of leaf nodes.
min_impurity_split : float,
Threshold for early stopping in tree growth. A node will split if its impurity is above
the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19 and will be removed in 0.21. Use
min_impurity_decrease instead.
min_impurity_decrease : float, optional (default=0.)
A node will be split if this split induces a decrease of the impurity greater than or equal
to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current
node, N_t_L is the number of samples in the left child, and N_t_R is the number of
samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is
passed.
New in version 0.19.
bootstrap : boolean, optional (default=False)
Whether bootstrap samples are used when building trees.
oob_score : bool, optional (default=False)
Whether to use out-of-bag samples to estimate the R^2 on unseen data.
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel for both fit and predict. If -1, then the number of
jobs is set to the number of cores.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
verbose : int, optional (default=0)
Controls the verbosity of the tree building process.
warm_start : bool, optional (default=False)
When set to True, reuse the solution of the previous call to fit and add more estimators
to the ensemble, otherwise, just fit a whole new forest.
Attributes estimators_ : list of DecisionTreeRegressor
The collection of fitted sub-estimators.
feature_importances_ : array of shape = [n_features]
The feature importances (the higher, the more important the feature).
n_features_ : int
The number of features.
n_outputs_ : int
The number of outputs.
oob_score_ : float
Score of the training dataset obtained using an out-of-bag estimate.
oob_prediction_ : array of shape = [n_samples]
Prediction computed with out-of-bag estimate on the training set.
See also:
sklearn.tree.ExtraTreeRegressor Base estimator for this ensemble.
RandomForestRegressor Ensemble regressor using trees with optimal splits.

Notes
The default values for the parameters controlling the size of the trees (e.g. max_depth,
min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on
some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by
setting those parameter values.
References
[R20]
Methods

apply(X)                             Apply trees in the forest to X, return leaf indices.
decision_path(X)                     Return the decision path in the forest.
fit(X, y[, sample_weight])           Build a forest of trees from the training set (X, y).
get_params([deep])                   Get parameters for this estimator.
predict(X)                           Predict regression target for X.
score(X, y[, sample_weight])         Returns the coefficient of determination R^2 of the prediction.
set_params(**params)                 Set the parameters of this estimator.

__init__(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=2,
    min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=False, oob_score=False,
    n_jobs=1, random_state=None, verbose=0, warm_start=False)
apply(X)
Apply trees in the forest to X, return leaf indices.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, its dtype will be converted to dtype=np.float32. If
a sparse matrix is provided, it will be converted into a sparse csr_matrix.
Returns X_leaves : array_like, shape = [n_samples, n_estimators]
For each datapoint x in X and for each tree in the forest, return the index of the leaf x
ends up in.
decision_path(X)
Return the decision path in the forest
New in version 0.18.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, its dtype will be converted to dtype=np.float32. If
a sparse matrix is provided, it will be converted into a sparse csr_matrix.
Returns indicator : sparse csr array, shape = [n_samples, n_nodes]
Return a node indicator matrix where non-zero elements indicate that the samples go
through the nodes.

n_nodes_ptr : array of size (n_estimators + 1, )
The columns from indicator[n_nodes_ptr[i]:n_nodes_ptr[i+1]] gives the indicator value
for the i-th estimator.
feature_importances_
Return the feature importances (the higher, the more important the feature).
Returns feature_importances_ : array, shape = [n_features]
fit(X, y, sample_weight=None)
Build a forest of trees from the training set (X, y).
Parameters X : array-like or sparse matrix of shape = [n_samples, n_features]
The training input samples. Internally, its dtype will be converted to dtype=np.
float32. If a sparse matrix is provided, it will be converted into a sparse
csc_matrix.
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The target values (class labels in classification, real numbers in regression).
sample_weight : array-like, shape = [n_samples] or None
Sample weights. If None, then samples are equally weighted. Splits that would create
child nodes with net zero or negative weight are ignored while searching for a split in
each node. In the case of classification, splits are also ignored if they would result in
any single class carrying a negative weight in either child node.
Returns self : object
Returns self.
get_params(deep=True)
Get parameters for this estimator.
Parameters deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
Returns params : mapping of string to any
Parameter names mapped to their values.
predict(X)
Predict regression target for X.
The predicted regression target of an input sample is computed as the mean predicted regression targets of
the trees in the forest.
Parameters X : array-like or sparse matrix of shape = [n_samples, n_features]
The input samples. Internally, its dtype will be converted to dtype=np.float32. If
a sparse matrix is provided, it will be converted into a sparse csr_matrix.
Returns y : array of shape = [n_samples] or [n_samples, n_outputs]
The predicted values.
score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True values for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns score : float
R^2 of self.predict(X) wrt. y.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form __ so that it’s possible to update each component
of a nested object.
Returns self :
Examples using sklearn.ensemble.ExtraTreesRegressor
• Face completion with a multi-output estimators
sklearn.ensemble.GradientBoostingClassifier
class sklearn.ensemble.GradientBoostingClassifier(loss='deviance', learning_rate=0.1,
    n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2,
    min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0,
    min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0,
    max_leaf_nodes=None, warm_start=False, presort='auto')
Gradient Boosting for classification.
GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage n_classes_ regression trees are fit on the negative gradient of the binomial
or multinomial deviance loss function. Binary classification is a special case where only a single regression tree
is induced.
Read more in the User Guide.
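A minimal usage sketch (toy data assumed; not part of the original reference):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 shallow regression trees (max_depth=3) fit stage-wise on the deviance gradient.
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3,
                                 random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))   # mean accuracy on the held-out split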
Parameters loss : {‘deviance’, ‘exponential’}, optional (default=’deviance’)

loss function to be optimized. ‘deviance’ refers to deviance (= logistic regression) for
classification with probabilistic outputs. For loss ‘exponential’ gradient boosting recovers the AdaBoost algorithm.
learning_rate : float, optional (default=0.1)
learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off
between learning_rate and n_estimators.
n_estimators : int (default=100)
The number of boosting stages to perform. Gradient boosting is fairly robust to overfitting so a large number usually results in better performance.
max_depth : integer, optional (default=3)
maximum depth of the individual regression estimators. The maximum depth limits the
number of nodes in the tree. Tune this parameter for best performance; the best value
depends on the interaction of the input variables.
criterion : string, optional (default=”friedman_mse”)
The function to measure the quality of a split. Supported criteria are “friedman_mse” for
the mean squared error with improvement score by Friedman, “mse” for mean squared
error, and “mae” for the mean absolute error. The default value of “friedman_mse” is
generally the best as it can provide a better approximation in some cases.
New in version 0.18.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
• If int, then consider min_samples_split as the minimum number.
• If float, then min_samples_split is a percentage and ceil(min_samples_split *
n_samples) are the minimum number of samples for each split.
Changed in version 0.18: Added float values for percentages.
min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node:
• If int, then consider min_samples_leaf as the minimum number.
• If float, then min_samples_leaf is a percentage and ceil(min_samples_leaf *
n_samples) are the minimum number of samples for each node.
Changed in version 0.18: Added float values for percentages.
min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the sum total of weights (of all the input samples)
required to be at a leaf node. Samples have equal weight when sample_weight is not
provided.
subsample : float, optional (default=1.0)
The fraction of samples to be used for fitting the individual base learners. If smaller
than 1.0 this results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and
an increase in bias.
max_features : int, float, string or None, optional (default=None)

The number of features to consider when looking for the best split:
• If int, then consider max_features features at each split.
• If float, then max_features is a percentage and int(max_features * n_features) features
are considered at each split.
• If “auto”, then max_features=sqrt(n_features).
• If “sqrt”, then max_features=sqrt(n_features).
• If “log2”, then max_features=log2(n_features).
• If None, then max_features=n_features.
Choosing max_features < n_features leads to a reduction of variance and an increase in
bias.
Note: the search for a split does not stop until at least one valid partition of the node
samples is found, even if it requires to effectively inspect more than max_features
features.
max_leaf_nodes : int or None, optional (default=None)
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as
relative reduction in impurity. If None then unlimited number of leaf nodes.
min_impurity_split : float,
Threshold for early stopping in tree growth. A node will split if its impurity is above
the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19 and will be removed in 0.21. Use
min_impurity_decrease instead.
min_impurity_decrease : float, optional (default=0.)
A node will be split if this split induces a decrease of the impurity greater than or equal
to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current
node, N_t_L is the number of samples in the left child, and N_t_R is the number of
samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is
passed.
New in version 0.19.
init : BaseEstimator, None, optional (default=None)
An estimator object that is used to compute the initial predictions. init has to provide
fit and predict. If None it uses loss.init_estimator.
verbose : int, default: 0
Enable verbose output. If 1 then it prints progress and performance once in a while
(the more trees the lower the frequency). If greater than 1 then it prints progress and
performance for every tree.

warm_start : bool, default: False
When set to True, reuse the solution of the previous call to fit and add more estimators
to the ensemble, otherwise, just erase the previous solution.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
presort : bool or ‘auto’, optional (default=’auto’)
Whether to presort the data to speed up the finding of best splits in fitting. Auto mode
by default will use presorting on dense data and default to normal sorting on sparse data.
Setting presort to true on sparse data will raise an error.
New in version 0.17: presort parameter.
Attributes feature_importances_ : array, shape = [n_features]
The feature importances (the higher, the more important the feature).
oob_improvement_ : array, shape = [n_estimators]
The improvement in loss (= deviance) on the out-of-bag samples relative to the previous
iteration. oob_improvement_[0] is the improvement in loss of the first stage over
the init estimator.
train_score_ : array, shape = [n_estimators]
The i-th score train_score_[i] is the deviance (= loss) of the model at iteration i
on the in-bag sample. If subsample == 1 this is the deviance on the training data.
loss_ : LossFunction
The concrete LossFunction object.
init : BaseEstimator
The estimator that provides the initial predictions. Set via the init argument or loss.
init_estimator.
estimators_ : ndarray of DecisionTreeRegressor, shape = [n_estimators, loss_.K]
The collection of fitted sub-estimators. loss_.K is 1 for binary classification, otherwise n_classes.
See also:
sklearn.tree.DecisionTreeClassifier, RandomForestClassifier, AdaBoostClassifier

Notes
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with
the same training data and max_features=n_features, if the improvement of the criterion is identical for
several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting,
random_state has to be fixed.

References
J. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, The Annals of Statistics, Vol. 29,
No. 5, 2001.
J. Friedman, Stochastic Gradient Boosting, 1999.
T. Hastie, R. Tibshirani and J. Friedman. Elements of Statistical Learning Ed. 2, Springer, 2009.
Methods

apply(X)                               Apply trees in the ensemble to X, return leaf indices.
decision_function(X)                   Compute the decision function of X.
fit(X, y[, sample_weight, monitor])    Fit the gradient boosting model.
get_params([deep])                     Get parameters for this estimator.
predict(X)                             Predict class for X.
predict_log_proba(X)                   Predict class log-probabilities for X.
predict_proba(X)                       Predict class probabilities for X.
score(X, y[, sample_weight])           Returns the mean accuracy on the given test data and labels.
set_params(**params)                   Set the parameters of this estimator.
staged_decision_function(X)            Compute decision function of X for each iteration.
staged_predict(X)                      Predict class at each stage for X.
staged_predict_proba(X)                Predict class probabilities at each stage for X.

__init__(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0,
    criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1,
    min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0,
    min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0,
    max_leaf_nodes=None, warm_start=False, presort='auto')
apply(X)
Apply trees in the ensemble to X, return leaf indices.
New in version 0.17.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, its dtype will be converted to dtype=np.float32. If
a sparse matrix is provided, it will be converted to a sparse csr_matrix.
Returns X_leaves : array_like, shape = [n_samples, n_estimators, n_classes]
For each datapoint x in X and for each tree in the ensemble, return the index of the leaf
x ends up in each estimator. In the case of binary classification n_classes is 1.
decision_function(X)
Compute the decision function of X.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, it will be converted to dtype=np.float32 and if a
sparse matrix is provided to a sparse csr_matrix.
Returns score : array, shape = [n_samples, n_classes] or [n_samples]
The decision function of the input samples. The order of the classes corresponds to that
in the attribute classes_. Regression and binary classification produce an array of shape
[n_samples].
feature_importances_
Return the feature importances (the higher, the more important the feature).
Returns feature_importances_ : array, shape = [n_features]
fit(X, y, sample_weight=None, monitor=None)
Fit the gradient boosting model.
Parameters X : array-like, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples]
Target values (integers in classification, real numbers in regression) For classification,
labels must correspond to classes.
sample_weight : array-like, shape = [n_samples] or None
Sample weights. If None, then samples are equally weighted. Splits that would create
child nodes with net zero or negative weight are ignored while searching for a split in
each node. In the case of classification, splits are also ignored if they would result in
any single class carrying a negative weight in either child node.
monitor : callable, optional
The monitor is called after each iteration with the current iteration, a reference
to the estimator and the local variables of _fit_stages as keyword arguments
callable(i, self, locals()). If the callable returns True the fitting procedure is stopped. The monitor can be used for various things such as computing held-out
estimates, early stopping, model introspection, and snapshotting.
Returns self : object
Returns self.
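A minimal sketch (toy data assumed; not part of the original reference) of the monitor callable described above, stopping once the out-of-bag improvement has been negative for five consecutive stages; the helper name monitor is illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

def monitor(i, est, locals_):
    # oob_improvement_ is only populated when subsample < 1.0; entry i is
    # updated for the current stage before the monitor is called.
    if i >= 5 and (est.oob_improvement_[i - 4:i + 1] < 0).all():
        return True          # returning True stops the fitting procedure
    return False

clf = GradientBoostingClassifier(n_estimators=500, subsample=0.8, random_state=0)
clf.fit(X, y, monitor=monitor)
print(len(clf.estimators_))  # number of stages actually fitted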
get_params(deep=True)
Get parameters for this estimator.
Parameters deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
Returns params : mapping of string to any
Parameter names mapped to their values.
n_features
DEPRECATED: Attribute n_features was deprecated in version 0.19 and will be removed in 0.21.
predict(X)
Predict class for X.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, it will be converted to dtype=np.float32 and if a
sparse matrix is provided to a sparse csr_matrix.
Returns y : array of shape = [n_samples]

The predicted values.
predict_log_proba(X)
Predict class log-probabilities for X.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, it will be converted to dtype=np.float32 and if a
sparse matrix is provided to a sparse csr_matrix.
Returns p : array of shape = [n_samples]
The class log-probabilities of the input samples. The order of the classes corresponds to
that in the attribute classes_.
Raises AttributeError :
If the loss does not support probabilities.
predict_proba(X)
Predict class probabilities for X.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, it will be converted to dtype=np.float32 and if a
sparse matrix is provided to a sparse csr_matrix.
Returns p : array of shape = [n_samples]
The class probabilities of the input samples. The order of the classes corresponds to that
in the attribute classes_.
Raises AttributeError :
If the loss does not support probabilities.
score(X, y, sample_weight=None)
Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns score : float
Mean accuracy of self.predict(X) wrt. y.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form __ so that it’s possible to update each component
of a nested object.
Returns self :

staged_decision_function(X)
Compute decision function of X for each iteration.
This method allows monitoring (i.e. determine error on testing set) after each stage.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, it will be converted to dtype=np.float32 and if a
sparse matrix is provided to a sparse csr_matrix.
Returns score : generator of array, shape = [n_samples, k]
The decision function of the input samples. The order of the classes corresponds to that
in the attribute classes_. Regression and binary classification are special cases with k
== 1, otherwise k==n_classes.
staged_predict(X)
Predict class at each stage for X.
This method allows monitoring (i.e. determine error on testing set) after each stage.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, it will be converted to dtype=np.float32 and if a
sparse matrix is provided to a sparse csr_matrix.
Returns y : generator of array of shape = [n_samples]
The predicted value of the input samples.
staged_predict_proba(X)
Predict class probabilities at each stage for X.
This method allows monitoring (i.e. determine error on testing set) after each stage.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, it will be converted to dtype=np.float32 and if a
sparse matrix is provided to a sparse csr_matrix.
Returns y : generator of array of shape = [n_samples]
The predicted value of the input samples.
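A minimal sketch (toy data assumed; not part of the original reference) using staged_predict_proba to track held-out log-loss after every stage, as described above:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# One log-loss value per boosting stage; the argmin suggests a good n_estimators.
test_loss = [log_loss(y_test, proba) for proba in clf.staged_predict_proba(X_test)]
print(int(np.argmin(test_loss)) + 1, min(test_loss))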
Examples using sklearn.ensemble.GradientBoostingClassifier
• Feature transformations with ensembles of trees
• Gradient Boosting Out-of-Bag estimates
• Gradient Boosting regularization

sklearn.ensemble.GradientBoostingRegressor
class sklearn.ensemble.GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=100,
    subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1,
    min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0,
    min_impurity_split=None, init=None, random_state=None, max_features=None, alpha=0.9,
    verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto')
Gradient Boosting for regression.
GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.
Read more in the User Guide.
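A minimal sketch (toy data assumed; not part of the original reference) using the 'quantile' loss with the alpha parameter documented below, producing a rough 90% upper bound alongside the conditional median:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)

upper = GradientBoostingRegressor(loss='quantile', alpha=0.9, random_state=0).fit(X, y)
median = GradientBoostingRegressor(loss='quantile', alpha=0.5, random_state=0).fit(X, y)

print(upper.predict(X[:3]))    # approximate 90th-percentile predictions
print(median.predict(X[:3]))   # approximate median predictions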
Parameters loss : {‘ls’, ‘lad’, ‘huber’, ‘quantile’}, optional (default=’ls’)
loss function to be optimized. ‘ls’ refers to least squares regression. ‘lad’ (least absolute
deviation) is a highly robust loss function solely based on order information of the input
variables. ‘huber’ is a combination of the two. ‘quantile’ allows quantile regression
(use alpha to specify the quantile).
learning_rate : float, optional (default=0.1)
learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off
between learning_rate and n_estimators.
n_estimators : int (default=100)
The number of boosting stages to perform. Gradient boosting is fairly robust to overfitting so a large number usually results in better performance.
max_depth : integer, optional (default=3)
maximum depth of the individual regression estimators. The maximum depth limits the
number of nodes in the tree. Tune this parameter for best performance; the best value
depends on the interaction of the input variables.
criterion : string, optional (default=”friedman_mse”)
The function to measure the quality of a split. Supported criteria are “friedman_mse” for
the mean squared error with improvement score by Friedman, “mse” for mean squared
error, and “mae” for the mean absolute error. The default value of “friedman_mse” is
generally the best as it can provide a better approximation in some cases.
New in version 0.18.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
• If int, then consider min_samples_split as the minimum number.
• If float, then min_samples_split is a percentage and ceil(min_samples_split *
n_samples) are the minimum number of samples for each split.

Changed in version 0.18: Added float values for percentages.
min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node:
• If int, then consider min_samples_leaf as the minimum number.
• If float, then min_samples_leaf is a percentage and ceil(min_samples_leaf *
n_samples) are the minimum number of samples for each node.
Changed in version 0.18: Added float values for percentages.
min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the sum total of weights (of all the input samples)
required to be at a leaf node. Samples have equal weight when sample_weight is not
provided.
subsample : float, optional (default=1.0)
The fraction of samples to be used for fitting the individual base learners. If smaller
than 1.0 this results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and
an increase in bias.
max_features : int, float, string or None, optional (default=None)
The number of features to consider when looking for the best split:
• If int, then consider max_features features at each split.
• If float, then max_features is a percentage and int(max_features * n_features) features
are considered at each split.
• If “auto”, then max_features=n_features.
• If “sqrt”, then max_features=sqrt(n_features).
• If “log2”, then max_features=log2(n_features).
• If None, then max_features=n_features.
Choosing max_features < n_features leads to a reduction of variance and an increase in
bias.
Note: the search for a split does not stop until at least one valid partition of the node
samples is found, even if it requires to effectively inspect more than max_features
features.
max_leaf_nodes : int or None, optional (default=None)
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as
relative reduction in impurity. If None then unlimited number of leaf nodes.
min_impurity_split : float,
Threshold for early stopping in tree growth. A node will split if its impurity is above
the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19 and will be removed in 0.21. Use
min_impurity_decrease instead.
min_impurity_decrease : float, optional (default=0.)

A node will be split if this split induces a decrease of the impurity greater than or equal
to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current
node, N_t_L is the number of samples in the left child, and N_t_R is the number of
samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is
passed.
New in version 0.19.
alpha : float (default=0.9)
The alpha-quantile of the huber loss function and the quantile loss function. Only if
loss='huber' or loss='quantile'.
init : BaseEstimator, None, optional (default=None)
An estimator object that is used to compute the initial predictions. init has to provide
fit and predict. If None it uses loss.init_estimator.
verbose : int, default: 0
Enable verbose output. If 1 then it prints progress and performance once in a while
(the more trees the lower the frequency). If greater than 1 then it prints progress and
performance for every tree.
warm_start : bool, default: False
When set to True, reuse the solution of the previous call to fit and add more estimators
to the ensemble, otherwise, just erase the previous solution.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
presort : bool or ‘auto’, optional (default=’auto’)
Whether to presort the data to speed up the finding of best splits in fitting. Auto mode
by default will use presorting on dense data and default to normal sorting on sparse data.
Setting presort to true on sparse data will raise an error.
New in version 0.17: optional parameter presort.
Attributes feature_importances_ : array, shape = [n_features]
The feature importances (the higher, the more important the feature).
oob_improvement_ : array, shape = [n_estimators]
The improvement in loss (= deviance) on the out-of-bag samples relative to the previous
iteration. oob_improvement_[0] is the improvement in loss of the first stage over
the init estimator.
train_score_ : array, shape = [n_estimators]

The i-th score train_score_[i] is the deviance (= loss) of the model at iteration i
on the in-bag sample. If subsample == 1 this is the deviance on the training data.
loss_ : LossFunction
The concrete LossFunction object.
init : BaseEstimator
The estimator that provides the initial predictions. Set via the init argument or loss.
init_estimator.
estimators_ : ndarray of DecisionTreeRegressor, shape = [n_estimators, 1]
The collection of fitted sub-estimators.
See also:
DecisionTreeRegressor, RandomForestRegressor
Notes
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with
the same training data and max_features=n_features, if the improvement of the criterion is identical for
several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting,
random_state has to be fixed.
References
J. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, The Annals of Statistics, Vol. 29,
No. 5, 2001.
J. Friedman, Stochastic Gradient Boosting, 1999.
T. Hastie, R. Tibshirani and J. Friedman. Elements of Statistical Learning Ed. 2, Springer, 2009.
Methods

apply(X)                               Apply trees in the ensemble to X, return leaf indices.
fit(X, y[, sample_weight, monitor])    Fit the gradient boosting model.
get_params([deep])                     Get parameters for this estimator.
predict(X)                             Predict regression target for X.
score(X, y[, sample_weight])           Returns the coefficient of determination R^2 of the prediction.
set_params(**params)                   Set the parameters of this estimator.
staged_predict(X)                      Predict regression target at each stage for X.

__init__(loss='ls', learning_rate=0.1, n_estimators=100, subsample=1.0,
    criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1,
    min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0,
    min_impurity_split=None, init=None, random_state=None, max_features=None, alpha=0.9,
    verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto')
apply(X)
Apply trees in the ensemble to X, return leaf indices.

New in version 0.17.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, its dtype will be converted to dtype=np.float32. If
a sparse matrix is provided, it will be converted to a sparse csr_matrix.
Returns X_leaves : array_like, shape = [n_samples, n_estimators]
For each datapoint x in X and for each tree in the ensemble, return the index of the leaf
x ends up in each estimator.
feature_importances_
Return the feature importances (the higher, the more important the feature).
Returns feature_importances_ : array, shape = [n_features]
fit(X, y, sample_weight=None, monitor=None)
Fit the gradient boosting model.
Parameters X : array-like, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples]
Target values (integers in classification, real numbers in regression) For classification,
labels must correspond to classes.
sample_weight : array-like, shape = [n_samples] or None
Sample weights. If None, then samples are equally weighted. Splits that would create
child nodes with net zero or negative weight are ignored while searching for a split in
each node. In the case of classification, splits are also ignored if they would result in
any single class carrying a negative weight in either child node.
monitor : callable, optional
The monitor is called after each iteration with the current iteration, a reference
to the estimator and the local variables of _fit_stages as keyword arguments
callable(i, self, locals()). If the callable returns True the fitting procedure is stopped. The monitor can be used for various things such as computing held-out
estimates, early stopping, model introspection, and snapshotting.
Returns self : object
Returns self.
get_params(deep=True)
Get parameters for this estimator.
Parameters deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
Returns params : mapping of string to any
Parameter names mapped to their values.
n_features
DEPRECATED: Attribute n_features was deprecated in version 0.19 and will be removed in 0.21.

predict(X)
Predict regression target for X.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, it will be converted to dtype=np.float32 and if a
sparse matrix is provided to a sparse csr_matrix.
Returns y : array of shape = [n_samples]
The predicted values.
score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True values for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns score : float
R^2 of self.predict(X) wrt. y.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form __ so that it’s possible to update each component
of a nested object.
Returns self :
staged_predict(X)
Predict regression target at each stage for X.
This method allows monitoring (i.e. determine error on testing set) after each stage.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
The input samples. Internally, it will be converted to dtype=np.float32 and if a
sparse matrix is provided to a sparse csr_matrix.
Returns y : generator of array of shape = [n_samples]
The predicted value of the input samples.
Examples using sklearn.ensemble.GradientBoostingRegressor
• Model Complexity Influence
• Prediction Intervals for Gradient Boosting Regression

• Gradient Boosting regression
• Partial Dependence Plots

3.3.3 Model evaluation: quantifying the quality of predictions
There are 3 different APIs for evaluating the quality of a model’s predictions:
• Estimator score method: Estimators have a score method providing a default evaluation criterion for the
problem they are designed to solve. This is not discussed on this page, but in each estimator’s documentation.
• Scoring parameter: Model-evaluation tools using cross-validation (such as model_selection.
cross_val_score and model_selection.GridSearchCV ) rely on an internal scoring strategy. This
is discussed in the section The scoring parameter: defining model evaluation rules.
• Metric functions: The metrics module implements functions assessing prediction error for specific purposes.
These metrics are detailed in sections on Classification metrics, Multilabel ranking metrics, Regression metrics
and Clustering metrics.
Finally, Dummy estimators are useful to get a baseline value of those metrics for random predictions.
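A minimal sketch (toy data assumed; not part of the original reference) of the baseline comparison mentioned above:

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

# Majority-class baseline versus a real model, both scored with default accuracy.
print(cross_val_score(DummyClassifier(strategy='most_frequent'), X, y).mean())
print(cross_val_score(GradientBoostingClassifier(random_state=0), X, y).mean())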
See also:
For “pairwise” metrics, between samples and not estimators or predictions, see the Pairwise metrics, Affinities and
Kernels section.
The scoring parameter: defining model evaluation rules
Model selection and evaluation using tools, such as model_selection.GridSearchCV and
model_selection.cross_val_score, take a scoring parameter that controls what metric they apply to the estimators evaluated.
Common cases: predefined values
For the most common use cases, you can designate a scorer object with the scoring parameter; the table below
shows all possible values. All scorer objects follow the convention that higher return values are better than
lower return values. Thus metrics which measure the distance between the model and the data, like metrics.mean_squared_error, are available as neg_mean_squared_error, which returns the negated value of the metric.


Scoring                          Function                                 Comment
Classification
'accuracy'                       metrics.accuracy_score
'average_precision'              metrics.average_precision_score
'f1'                             metrics.f1_score                         for binary targets
'f1_micro'                       metrics.f1_score                         micro-averaged
'f1_macro'                       metrics.f1_score                         macro-averaged
'f1_weighted'                    metrics.f1_score                         weighted average
'f1_samples'                     metrics.f1_score                         by multilabel sample
'neg_log_loss'                   metrics.log_loss                         requires predict_proba support
'precision' etc.                 metrics.precision_score                  suffixes apply as with 'f1'
'recall' etc.                    metrics.recall_score                     suffixes apply as with 'f1'
'roc_auc'                        metrics.roc_auc_score
Clustering
'adjusted_mutual_info_score'     metrics.adjusted_mutual_info_score
'adjusted_rand_score'            metrics.adjusted_rand_score
'completeness_score'             metrics.completeness_score
'fowlkes_mallows_score'          metrics.fowlkes_mallows_score
'homogeneity_score'              metrics.homogeneity_score
'mutual_info_score'              metrics.mutual_info_score
'normalized_mutual_info_score'   metrics.normalized_mutual_info_score
'v_measure_score'                metrics.v_measure_score
Regression
'explained_variance'             metrics.explained_variance_score
'neg_mean_absolute_error'        metrics.mean_absolute_error
'neg_mean_squared_error'         metrics.mean_squared_error
'neg_mean_squared_log_error'     metrics.mean_squared_log_error
'neg_median_absolute_error'      metrics.median_absolute_error
'r2'                             metrics.r2_score
Usage examples:

>>> from sklearn import svm, datasets
>>> from sklearn.model_selection import cross_val_score
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf = svm.SVC(probability=True, random_state=0)
>>> cross_val_score(clf, X, y, scoring='neg_log_loss')
array([-0.07..., -0.16..., -0.06...])
>>> model = svm.SVC()
>>> cross_val_score(model, X, y, scoring='wrong_choice')
Traceback (most recent call last):
ValueError: 'wrong_choice' is not a valid scoring value. Valid options are ['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score', 'average_precision', 'completeness_score', 'explained_variance', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'fowlkes_mallows_score', 'homogeneity_score', 'mutual_info_score', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_median_absolute_error', 'normalized_mutual_info_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc', 'v_measure_score']


Note: The values listed by the ValueError exception correspond to the functions measuring prediction accuracy described in the following sections. The scorer objects for those functions are stored in the dictionary sklearn.metrics.SCORERS.
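For instance, the available names can be listed from that dictionary directly (an illustrative sketch, not part of the original guide; the exact set of keys depends on the installed scikit-learn version):

>>> from sklearn.metrics import SCORERS
>>> sorted(SCORERS.keys())
['accuracy', 'adjusted_mutual_info_score', ...]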

Defining your scoring strategy from metric functions
The module sklearn.metrics also exposes a set of simple functions measuring a prediction error given ground
truth and prediction:
• functions ending with _score return a value to maximize, the higher the better.
• functions ending with _error or _loss return a value to minimize, the lower the better. When converting into
a scorer object using make_scorer, set the greater_is_better parameter to False (True by default; see
the parameter description below).
Metrics available for various machine learning tasks are detailed in sections below.
Many metrics are not given names to be used as scoring values, sometimes because they require additional parameters, such as fbeta_score. In such cases, you need to generate an appropriate scoring object. The simplest way
to generate a callable object for scoring is by using make_scorer. That function converts metrics into callables that
can be used for model evaluation.
One typical use case is to wrap an existing metric function from the library with non-default values for its parameters,
such as the beta parameter for the fbeta_score function:
>>> from sklearn.metrics import fbeta_score, make_scorer
>>> ftwo_scorer = make_scorer(fbeta_score, beta=2)
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.svm import LinearSVC
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)

The second use case is to build a completely custom scorer object from a simple python function using
make_scorer, which can take several parameters:
• the python function you want to use (my_custom_loss_func in the example below)
• whether the python function returns a score (greater_is_better=True, the default) or a loss
(greater_is_better=False). If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.
• for classification metrics only: whether the python function you provided requires continuous decision certainties (needs_threshold=True). The default value is False.
• any additional parameters, such as beta or labels in f1_score.
Here is an example of building custom scorers, and of using the greater_is_better parameter:
>>> import numpy as np
>>> def my_custom_loss_func(ground_truth, predictions):
...     diff = np.abs(ground_truth - predictions).max()
...     return np.log(1 + diff)
...
>>> # loss_func will negate the return value of my_custom_loss_func,
>>> # which will be np.log(2), 0.693, given the values for ground_truth
>>> # and predictions defined below.
>>> loss = make_scorer(my_custom_loss_func, greater_is_better=False)


>>> score = make_scorer(my_custom_loss_func, greater_is_better=True)
>>> ground_truth = [[1], [1]]
>>> predictions = [0, 1]
>>> from sklearn.dummy import DummyClassifier
>>> clf = DummyClassifier(strategy='most_frequent', random_state=0)
>>> clf = clf.fit(ground_truth, predictions)
>>> loss(clf, ground_truth, predictions)
-0.69...
>>> score(clf, ground_truth, predictions)
0.69...

Implementing your own scoring object
You can generate even more flexible model scorers by constructing your own scoring object from scratch, without using
the make_scorer factory. For a callable to be a scorer, it needs to meet the protocol specified by the following two
rules:
• It can be called with parameters (estimator, X, y), where estimator is the model that should be
evaluated, X is validation data, and y is the ground truth target for X (in the supervised case) or None (in the
unsupervised case).
• It returns a floating point number that quantifies the estimator prediction quality on X, with reference to y.
Again, by convention higher numbers are better, so if your scorer returns loss, that value should be negated.
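As an illustration only (this sketch is not part of the original guide; the function name neg_mse_scorer and the choice of Ridge on the diabetes data are arbitrary), a bare callable satisfying this protocol could look like:

>>> from sklearn.datasets import load_diabetes
>>> from sklearn.linear_model import Ridge
>>> from sklearn.metrics import mean_squared_error
>>> from sklearn.model_selection import cross_val_score
>>> def neg_mse_scorer(estimator, X, y):
...     # follows the (estimator, X, y) protocol: predict on the validation
...     # data and return a float where higher is better, hence the negation
...     return -mean_squared_error(y, estimator.predict(X))
...
>>> X, y = load_diabetes(return_X_y=True)
>>> scores = cross_val_score(Ridge(), X, y, scoring=neg_mse_scorer)
>>> len(scores)    # one (non-positive) score per cross-validation fold
3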
Using multiple metric evaluation
Scikit-learn also permits evaluation of multiple metrics in GridSearchCV, RandomizedSearchCV and
cross_validate.
There are two ways to specify multiple scoring metrics for the scoring parameter:
• As an iterable of string metrics:
>>> scoring = ['accuracy', 'precision']

• As a dict mapping the scorer name to the scoring function:
>>> from sklearn.metrics import accuracy_score
>>> from sklearn.metrics import make_scorer
>>> scoring = {'accuracy': make_scorer(accuracy_score),
...            'prec': 'precision'}

Note that the dict values can either be scorer functions or one of the predefined metric strings.
Currently only those scorer functions that return a single score can be passed inside the dict. Scorer functions that
return multiple values are not permitted and will require a wrapper to return a single metric:
>>> from sklearn.model_selection import cross_validate
>>> from sklearn.metrics import confusion_matrix
>>> # A sample toy binary classification dataset
>>> X, y = datasets.make_classification(n_classes=2, random_state=0)
>>> svm = LinearSVC(random_state=0)
>>> def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
>>> def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
>>> def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
>>> def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]
>>> scoring = {'tp': make_scorer(tp), 'tn': make_scorer(tn),
...            'fp': make_scorer(fp), 'fn': make_scorer(fn)}
>>> cv_results = cross_validate(svm.fit(X, y), X, y, scoring=scoring)
>>> # Getting the test set true negative scores
>>> print(cv_results['test_tn'])
[12 13 15]
>>> # Getting the test set false positive scores
>>> print(cv_results['test_fp'])
[5 4 1]

Classification metrics
The sklearn.metrics module implements several loss, score, and utility functions to measure classification performance. Some metrics might require probability estimates of the positive class, confidence values, or binary decision values. Most implementations allow each sample to provide a weighted contribution to the overall score, through the sample_weight parameter.
Some of these are restricted to the binary classification case:
precision_recall_curve(y_true, probas_pred)       Compute precision-recall pairs for different probability thresholds
roc_curve(y_true, y_score[, pos_label, ...])      Compute Receiver operating characteristic (ROC)
Others also work in the multiclass case:
cohen_kappa_score(y1, y2[, labels, weights, ...])    Cohen's kappa: a statistic that measures inter-annotator agreement.
confusion_matrix(y_true, y_pred[, labels, ...])      Compute confusion matrix to evaluate the accuracy of a classification
hinge_loss(y_true, pred_decision[, labels, ...])     Average hinge loss (non-regularized)
matthews_corrcoef(y_true, y_pred[, ...])             Compute the Matthews correlation coefficient (MCC)

Some also work in the multilabel case:
accuracy_score(y_true, y_pred[, normalize, ...])     Accuracy classification score.
classification_report(y_true, y_pred[, ...])         Build a text report showing the main classification metrics
f1_score(y_true, y_pred[, labels, ...])              Compute the F1 score, also known as balanced F-score or F-measure
fbeta_score(y_true, y_pred, beta[, labels, ...])     Compute the F-beta score
hamming_loss(y_true, y_pred[, labels, ...])          Compute the average Hamming loss.
jaccard_similarity_score(y_true, y_pred[, ...])      Jaccard similarity coefficient score
log_loss(y_true, y_pred[, eps, normalize, ...])      Log loss, aka logistic loss or cross-entropy loss.
precision_recall_fscore_support(y_true, y_pred)      Compute precision, recall, F-measure and support for each class
precision_score(y_true, y_pred[, labels, ...])       Compute the precision
recall_score(y_true, y_pred[, labels, ...])          Compute the recall
zero_one_loss(y_true, y_pred[, normalize, ...])      Zero-one classification loss.

And some work with binary and multilabel (but not multiclass) problems:


average_precision_score(y_true, y_score[, ...])      Compute average precision (AP) from prediction scores
roc_auc_score(y_true, y_score[, average, ...])       Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.

In the following sub-sections, we will describe each of those functions, preceded by some notes on common API and
metric definition.
From binary to multiclass and multilabel
Some metrics are essentially defined for binary classification tasks (e.g. f1_score, roc_auc_score). In these
cases, by default only the positive label is evaluated, assuming by default that the positive class is labelled 1 (though
this may be configurable through the pos_label parameter). In extending a binary metric to multiclass or multilabel
problems, the data is treated as a collection of binary problems, one for each class. There are then a number of ways
to average binary metric calculations across the set of classes, each of which may be useful in some scenario. Where
available, you should select among these using the average parameter.
• "macro" simply calculates the mean of the binary metrics, giving equal weight to each class. In problems
where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their
performance. On the other hand, the assumption that all classes are equally important is often untrue, such that
macro-averaging will over-emphasize the typically low performance on an infrequent class.
• "weighted" accounts for class imbalance by computing the average of binary metrics in which each class’s
score is weighted by its presence in the true data sample.
• "micro" gives each sample-class pair an equal contribution to the overall metric (except as a result of sample_weight). Rather than summing the metric per class, this sums the dividends and divisors that make up the per-class metrics to calculate an overall quotient. Micro-averaging may be preferred in multilabel settings, including multiclass classification where a majority class is to be ignored.
• "samples" applies only to multilabel problems. It does not calculate a per-class measure, instead calculating the metric over the true and predicted classes for each sample in the evaluation data, and returning their
(sample_weight-weighted) average.
• Selecting average=None will return an array with the score for each class.
While multiclass data is provided to the metric, like binary targets, as an array of class labels, multilabel data is
specified as an indicator matrix, in which cell [i, j] has value 1 if sample i has label j and value 0 otherwise.
Accuracy score
The accuracy_score function computes the accuracy, either the fraction (default) or the count (normalize=False)
of correct predictions.
In multilabel classification, the function returns the subset accuracy. If the entire set of predicted labels for a sample strictly matches the true set of labels, then the subset accuracy is 1.0; otherwise it is 0.0.
If 𝑦ˆ𝑖 is the predicted value of the 𝑖-th sample and 𝑦𝑖 is the corresponding true value, then the fraction of correct
predictions over 𝑛samples is defined as
\texttt{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i)
where 1(𝑥) is the indicator function.


>>> import numpy as np
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2

In the multilabel case with binary label indicators:
>>> accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.5

Example:
• See Test with permutations the significance of a classification score for an example of accuracy score usage
using permutations of the dataset.

Cohen’s kappa
The function cohen_kappa_score computes Cohen’s kappa statistic. This measure is intended to compare labelings by different human annotators, not a classifier versus a ground truth.
The kappa score (see docstring) is a number between -1 and 1. Scores above .8 are generally considered good agreement; zero or lower means no agreement (practically random labels).
Kappa scores can be computed for binary or multiclass problems, but not for multilabel problems (except by manually
computing a per-label score) and not for more than two annotators.
>>> from sklearn.metrics import cohen_kappa_score
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> cohen_kappa_score(y_true, y_pred)
0.4285714285714286

Confusion matrix
The confusion_matrix function evaluates classification accuracy by computing the confusion matrix.
By definition, entry 𝑖, 𝑗 in a confusion matrix is the number of observations actually in group 𝑖, but predicted to be in
group 𝑗. Here is an example:
>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
[0, 0, 1],
[1, 0, 2]])


Here is a visual representation of such a confusion matrix (this figure comes from the Confusion matrix example):

For binary problems, we can get counts of true negatives, false positives, false negatives and true positives as follows:
>>> y_true = [0, 0, 0, 1, 1, 1, 1, 1]
>>> y_pred = [0, 1, 0, 1, 0, 1, 0, 1]
>>> tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
>>> tn, fp, fn, tp
(2, 1, 2, 3)

Example:
• See Confusion matrix for an example of using a confusion matrix to evaluate classifier output quality.
• See Recognizing hand-written digits for an example of using a confusion matrix to classify hand-written
digits.
• See Classification of text documents using sparse features for an example of using a confusion matrix to
classify text documents.

Classification report
The classification_report function builds a text report showing the main classification metrics. Here is a
small example with custom target_names and inferred labels:
>>> from sklearn.metrics import classification_report
>>> y_true = [0, 1, 2, 2, 0]
>>> y_pred = [0, 0, 2, 1, 0]
>>> target_names = ['class 0', 'class 1', 'class 2']
>>> print(classification_report(y_true, y_pred, target_names=target_names))
             precision    recall  f1-score   support

    class 0       0.67      1.00      0.80         2
    class 1       0.00      0.00      0.00         1
    class 2       1.00      0.50      0.67         2

avg / total       0.67      0.60      0.59         5

Example:
• See Recognizing hand-written digits for an example of classification report usage for hand-written digits.
• See Classification of text documents using sparse features for an example of classification report usage for
text documents.
• See Parameter estimation using grid search with cross-validation for an example of classification report usage
for grid search with nested cross-validation.

Hamming loss
The hamming_loss computes the average Hamming loss or Hamming distance between two sets of samples.
If 𝑦ˆ𝑗 is the predicted value for the 𝑗-th label of a given sample, 𝑦𝑗 is the corresponding true value, and 𝑛labels is the
number of classes or labels, then the Hamming loss 𝐿𝐻𝑎𝑚𝑚𝑖𝑛𝑔 between two samples is defined as:
L_\text{Hamming}(y, \hat{y}) = \frac{1}{n_\text{labels}} \sum_{j=0}^{n_\text{labels}-1} 1(\hat{y}_j \ne y_j)
where 1(𝑥) is the indicator function.
>>> from sklearn.metrics import hamming_loss
>>> y_pred = [1, 2, 3, 4]
>>> y_true = [2, 2, 3, 4]
>>> hamming_loss(y_true, y_pred)
0.25

In the multilabel case with binary label indicators:
>>> hamming_loss(np.array([[0, 1], [1, 1]]), np.zeros((2, 2)))
0.75

Note: In multiclass classification, the Hamming loss corresponds to the Hamming distance between y_true and
y_pred which is similar to the Zero one loss function. However, while zero-one loss penalizes prediction sets that
do not strictly match true sets, the Hamming loss penalizes individual labels. Thus the Hamming loss, upper bounded
by the zero-one loss, is always between zero and one, inclusive; and predicting a proper subset or superset of the true
labels will give a Hamming loss between zero and one, exclusive.

Jaccard similarity coefficient score
The jaccard_similarity_score function computes the average (default) or sum of Jaccard similarity coefficients, also called the Jaccard index, between pairs of label sets.


The Jaccard similarity coefficient of the 𝑖-th samples, with a ground truth label set 𝑦𝑖 and predicted label set 𝑦ˆ𝑖 , is
defined as
J(y_i, \hat{y}_i) = \frac{|y_i \cap \hat{y}_i|}{|y_i \cup \hat{y}_i|}.
In binary and multiclass classification, the Jaccard similarity coefficient score is equal to the classification accuracy.
>>> import numpy as np
>>> from sklearn.metrics import jaccard_similarity_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> jaccard_similarity_score(y_true, y_pred)
0.5
>>> jaccard_similarity_score(y_true, y_pred, normalize=False)
2

In the multilabel case with binary label indicators:
>>> jaccard_similarity_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.75

Precision, recall and F-measures
Intuitively, precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the
ability of the classifier to find all the positive samples.
The F-measure (𝐹𝛽 and 𝐹1 measures) can be interpreted as a weighted harmonic mean of the precision and recall. A
𝐹𝛽 measure reaches its best value at 1 and its worst score at 0. With 𝛽 = 1, 𝐹𝛽 and 𝐹1 are equivalent, and the recall
and the precision are equally important.
The precision_recall_curve computes a precision-recall curve from the ground truth label and a score given
by the classifier by varying a decision threshold.
The average_precision_score function computes the average precision (AP) from prediction scores. The
value is between 0 and 1 and higher is better. AP is defined as
\text{AP} = \sum_n (R_n - R_{n-1}) P_n

where 𝑃𝑛 and 𝑅𝑛 are the precision and recall at the nth threshold. With random predictions, the AP is the fraction of
positive samples.
References [Manning2008] and [Everingham2010] present alternative variants of AP that interpolate the precision-recall curve. Currently, average_precision_score does not implement any interpolated variant. References [Davis2006] and [Flach2015] describe why a linear interpolation of points on the precision-recall curve provides an overly-optimistic measure of classifier performance. This linear interpolation is used when computing area under the curve with the trapezoidal rule in auc.
Several functions allow you to analyze the precision, recall and F-measures score:
average_precision_score(y_true, y_score[, ...])      Compute average precision (AP) from prediction scores
f1_score(y_true, y_pred[, labels, ...])              Compute the F1 score, also known as balanced F-score or F-measure
fbeta_score(y_true, y_pred, beta[, labels, ...])     Compute the F-beta score
precision_recall_curve(y_true, probas_pred)          Compute precision-recall pairs for different probability thresholds
precision_recall_fscore_support(y_true, y_pred)      Compute precision, recall, F-measure and support for each class
precision_score(y_true, y_pred[, labels, ...])       Compute the precision
recall_score(y_true, y_pred[, labels, ...])          Compute the recall
Note that the precision_recall_curve function is restricted to the binary case. The average_precision_score function works only in binary classification and multilabel indicator format.

Examples:
• See Classification of text documents using sparse features for an example of f1_score usage to classify
text documents.
• See Parameter estimation using grid search with cross-validation for an example of precision_score
and recall_score usage to estimate parameters using grid search with nested cross-validation.
• See Precision-Recall for an example of precision_recall_curve usage to evaluate classifier output
quality.

References:

Binary classification
In a binary classification task, the terms "positive" and "negative" refer to the classifier's prediction, and the terms "true" and "false" refer to whether that prediction corresponds to the external judgment (sometimes known as the "observation"). Given these definitions, we can formulate the following table:

                               Actual class (observation)
Predicted class (expectation)  tp (true positive)  Correct result       fp (false positive)  Unexpected result
                               fn (false negative) Missing result       tn (true negative)   Correct absence of result

In this context, we can define the notions of precision, recall and F-measure:
\text{precision} = \frac{tp}{tp + fp},

\text{recall} = \frac{tp}{tp + fn},

F_\beta = (1 + \beta^2) \frac{\text{precision} \times \text{recall}}{\beta^2 \, \text{precision} + \text{recall}}.

Here are some small examples in binary classification:
>>> from sklearn import metrics
>>> y_pred = [0, 1, 0, 0]
>>> y_true = [0, 1, 0, 1]
>>> metrics.precision_score(y_true, y_pred)
1.0
>>> metrics.recall_score(y_true, y_pred)
0.5


>>> metrics.f1_score(y_true, y_pred)
0.66...
>>> metrics.fbeta_score(y_true, y_pred, beta=0.5)
0.83...
>>> metrics.fbeta_score(y_true, y_pred, beta=1)
0.66...
>>> metrics.fbeta_score(y_true, y_pred, beta=2)
0.55...
>>> metrics.precision_recall_fscore_support(y_true, y_pred, beta=0.5)
(array([ 0.66...,  1.        ]), array([ 1. ,  0.5]), array([ 0.71...,  0.83...]), array([2, 2]...))

>>> import numpy as np
>>> from sklearn.metrics import precision_recall_curve
>>> from sklearn.metrics import average_precision_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> precision, recall, threshold = precision_recall_curve(y_true, y_scores)
>>> precision
array([ 0.66...,  0.5       ,  1.        ,  1.        ])
>>> recall
array([ 1. , 0.5, 0.5, 0. ])
>>> threshold
array([ 0.35, 0.4 , 0.8 ])
>>> average_precision_score(y_true, y_scores)
0.83...

Multiclass and multilabel classification
In multiclass and multilabel classification tasks, the notions of precision, recall, and F-measures can be applied to each label independently. There are a few ways to combine results across labels, specified by the average argument to the average_precision_score (multilabel only), f1_score, fbeta_score, precision_recall_fscore_support, precision_score and recall_score functions, as described above. Note that "micro"-averaging in a multiclass setting with all labels included will produce equal precision, recall and F, while "weighted" averaging may produce an F-score that is not between precision and recall.
To make this more explicit, consider the following notation:
• 𝑦 the set of predicted (𝑠𝑎𝑚𝑝𝑙𝑒, 𝑙𝑎𝑏𝑒𝑙) pairs
• 𝑦ˆ the set of true (𝑠𝑎𝑚𝑝𝑙𝑒, 𝑙𝑎𝑏𝑒𝑙) pairs
• 𝐿 the set of labels
• 𝑆 the set of samples
• 𝑦𝑠 the subset of 𝑦 with sample 𝑠, i.e. 𝑦𝑠 := {(𝑠′ , 𝑙) ∈ 𝑦|𝑠′ = 𝑠}
• 𝑦𝑙 the subset of 𝑦 with label 𝑙
• similarly, 𝑦ˆ𝑠 and 𝑦ˆ𝑙 are subsets of 𝑦ˆ
• P(A, B) := \frac{|A \cap B|}{|A|}
• R(A, B) := \frac{|A \cap B|}{|B|} (Conventions vary on handling B = \emptyset; this implementation uses R(A, B) := 0, and similar for P.)
• F_\beta(A, B) := (1 + \beta^2) \frac{P(A, B) \times R(A, B)}{\beta^2 P(A, B) + R(A, B)}

Then the metrics are defined as:
"micro":     Precision = P(y, \hat{y});   Recall = R(y, \hat{y});   F_beta = F_\beta(y, \hat{y})
"samples":   Precision = \frac{1}{|S|} \sum_{s \in S} P(y_s, \hat{y}_s);   Recall = \frac{1}{|S|} \sum_{s \in S} R(y_s, \hat{y}_s);   F_beta = \frac{1}{|S|} \sum_{s \in S} F_\beta(y_s, \hat{y}_s)
"macro":     Precision = \frac{1}{|L|} \sum_{l \in L} P(y_l, \hat{y}_l);   Recall = \frac{1}{|L|} \sum_{l \in L} R(y_l, \hat{y}_l);   F_beta = \frac{1}{|L|} \sum_{l \in L} F_\beta(y_l, \hat{y}_l)
"weighted":  Precision = \frac{1}{\sum_{l \in L} |\hat{y}_l|} \sum_{l \in L} |\hat{y}_l| P(y_l, \hat{y}_l);   Recall = \frac{1}{\sum_{l \in L} |\hat{y}_l|} \sum_{l \in L} |\hat{y}_l| R(y_l, \hat{y}_l);   F_beta = \frac{1}{\sum_{l \in L} |\hat{y}_l|} \sum_{l \in L} |\hat{y}_l| F_\beta(y_l, \hat{y}_l)
None:        Precision = \langle P(y_l, \hat{y}_l) \mid l \in L \rangle;   Recall = \langle R(y_l, \hat{y}_l) \mid l \in L \rangle;   F_beta = \langle F_\beta(y_l, \hat{y}_l) \mid l \in L \rangle

>>> from sklearn import metrics
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> metrics.precision_score(y_true, y_pred, average='macro')
0.22...
>>> metrics.recall_score(y_true, y_pred, average='micro')
...
0.33...
>>> metrics.f1_score(y_true, y_pred, average='weighted')
0.26...
>>> metrics.fbeta_score(y_true, y_pred, average='macro', beta=0.5)
0.23...
>>> metrics.precision_recall_fscore_support(y_true, y_pred, beta=0.5, average=None)
...
(array([ 0.66...,  0.        ,  0.        ]), array([ 1.,  0.,  0.]), array([ 0.71...,  0.        ,  0.        ]), array([2, 2, 2]...))

For multiclass classification with a “negative class”, it is possible to exclude some labels:
>>> metrics.recall_score(y_true, y_pred, labels=[1, 2], average='micro')
... # excluding 0, no labels were correctly recalled
0.0

Similarly, labels not present in the data sample may be accounted for in macro-averaging.
>>> metrics.precision_score(y_true, y_pred, labels=[0, 1, 2, 3], average='macro')
...
0.166...

Hinge loss
The hinge_loss function computes the average distance between the model and the data using hinge loss, a one-sided metric that considers only prediction errors. (Hinge loss is used in maximal margin classifiers such as support vector machines.)
If the labels are encoded with +1 and -1, y is the true value, and w is the predicted decision as output by decision_function, then the hinge loss is defined as:
L_\text{Hinge}(y, w) = \max\{1 - wy, 0\} = |1 - wy|_+
If there are more than two labels, hinge_loss uses a multiclass variant due to Crammer & Singer. Here is the paper
describing it.
If y_w is the predicted decision for the true label and y_t is the maximum of the predicted decisions for all other labels, where predicted decisions are output by decision_function, then multiclass hinge loss is defined by:
L_\text{Hinge}(y_w, y_t) = \max\{1 + y_t - y_w, 0\}

Here is a small example demonstrating the use of the hinge_loss function with an SVM classifier in a binary class problem:
>>> from sklearn import svm
>>> from sklearn.metrics import hinge_loss
>>> X = [[0], [1]]
>>> y = [-1, 1]
>>> est = svm.LinearSVC(random_state=0)
>>> est.fit(X, y)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='squared_hinge', max_iter=1000,
multi_class='ovr', penalty='l2', random_state=0, tol=0.0001,
verbose=0)
>>> pred_decision = est.decision_function([[-2], [3], [0.5]])
>>> pred_decision
array([-2.18..., 2.36..., 0.09...])
>>> hinge_loss([-1, 1, 1], pred_decision)
0.3...

Here is an example demonstrating the use of the hinge_loss function with an SVM classifier in a multiclass problem:
>>> X = np.array([[0], [1], [2], [3]])
>>> Y = np.array([0, 1, 2, 3])
>>> labels = np.array([0, 1, 2, 3])
>>> est = svm.LinearSVC()
>>> est.fit(X, Y)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='squared_hinge', max_iter=1000,
multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
verbose=0)
>>> pred_decision = est.decision_function([[-1], [2], [3]])
>>> y_true = [0, 2, 3]
>>> hinge_loss(y_true, pred_decision, labels)
0.56...

Log loss
Log loss, also called logistic regression loss or cross-entropy loss, is defined on probability estimates. It is commonly
used in (multinomial) logistic regression and neural networks, as well as in some variants of expectation-maximization,
and can be used to evaluate the probability outputs (predict_proba) of a classifier instead of its discrete predictions.
For binary classification with a true label 𝑦 ∈ {0, 1} and a probability estimate 𝑝 = Pr(𝑦 = 1), the log loss per sample
is the negative log-likelihood of the classifier given the true label:
𝐿log (𝑦, 𝑝) = − log Pr(𝑦|𝑝) = −(𝑦 log(𝑝) + (1 − 𝑦) log(1 − 𝑝))
This extends to the multiclass case as follows. Let the true labels for a set of samples be encoded as a 1-of-K binary
indicator matrix 𝑌 , i.e., 𝑦𝑖,𝑘 = 1 if sample 𝑖 has label 𝑘 taken from a set of 𝐾 labels. Let 𝑃 be a matrix of probability
estimates, with 𝑝𝑖,𝑘 = Pr(𝑡𝑖,𝑘 = 1). Then the log loss of the whole set is
L_{\log}(Y, P) = -\log \Pr(Y|P) = -\frac{1}{N} \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} y_{i,k} \log p_{i,k}

To see how this generalizes the binary log loss given above, note that in the binary case, 𝑝𝑖,0 = 1 − 𝑝𝑖,1 and 𝑦𝑖,0 =
1 − 𝑦𝑖,1 , so expanding the inner sum over 𝑦𝑖,𝑘 ∈ {0, 1} gives the binary log loss.

The log_loss function computes log loss given a list of ground-truth labels and a probability matrix, as returned by
an estimator’s predict_proba method.
>>> from sklearn.metrics import log_loss
>>> y_true = [0, 0, 1, 1]
>>> y_pred = [[.9, .1], [.8, .2], [.3, .7], [.01, .99]]
>>> log_loss(y_true, y_pred)
0.1738...

The first [.9, .1] in y_pred denotes 90% probability that the first sample has label 0. The log loss is non-negative.
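As an illustration of feeding predict_proba output to log_loss (a sketch not in the original guide; the choice of the iris data and a logistic regression model is arbitrary, and the exact loss value is not shown):

>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.metrics import log_loss
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression().fit(X, y)
>>> proba = clf.predict_proba(X)      # one row of class probabilities per sample
>>> log_loss(y, proba) >= 0           # the log loss is always non-negative
True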
Matthews correlation coefficient
The matthews_corrcoef function computes the Matthews correlation coefficient (MCC) for binary classes.
Quoting Wikipedia:
“The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary
(two-class) classifications. It takes into account true and false positives and negatives and is generally
regarded as a balanced measure which can be used even if the classes are of very different sizes. The
MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents
a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The statistic is also
known as the phi coefficient.”
In the binary (two-class) case, if tp, tn, fp and fn are respectively the number of true positives, true negatives, false positives and false negatives, the MCC is defined as
MCC = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}.

In the multiclass case, the Matthews correlation coefficient can be defined in terms of a confusion_matrix 𝐶 for
𝐾 classes. To simplify the definition consider the following intermediate variables:
• t_k = \sum_i^K C_{ik} the number of times class k truly occurred,
• p_k = \sum_i^K C_{ki} the number of times class k was predicted,
• c = \sum_k^K C_{kk} the total number of samples correctly predicted,
• s = \sum_i^K \sum_j^K C_{ij} the total number of samples.
Then the multiclass MCC is defined as:
MCC = \frac{c \times s - \sum_k^K p_k \times t_k}{\sqrt{(s^2 - \sum_k^K p_k^2) \times (s^2 - \sum_k^K t_k^2)}}

When there are more than two labels, the value of the MCC will no longer range between -1 and +1. Instead the
minimum value will be somewhere between -1 and 0 depending on the number and distribution of ground true labels.
The maximum value is always +1.
Here is a small example illustrating the usage of the matthews_corrcoef function:
>>> from sklearn.metrics import matthews_corrcoef
>>> y_true = [+1, +1, +1, -1]
>>> y_pred = [+1, -1, +1, +1]
>>> matthews_corrcoef(y_true, y_pred)
-0.33...
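To make the multiclass definition above concrete, the following sketch (not part of the original guide; the label vectors are arbitrary) recomputes the quantities t_k, p_k, c and s from a confusion matrix and checks the result against matthews_corrcoef:

>>> import numpy as np
>>> from sklearn.metrics import confusion_matrix, matthews_corrcoef
>>> y_true = [0, 1, 2, 2, 1, 0, 1]
>>> y_pred = [0, 2, 2, 1, 1, 0, 1]
>>> C = confusion_matrix(y_true, y_pred)
>>> t = C.sum(axis=1)   # how often each class truly occurred
>>> p = C.sum(axis=0)   # how often each class was predicted
>>> c = np.trace(C)     # number of correctly predicted samples
>>> s = C.sum()         # total number of samples
>>> mcc = (c * s - (p * t).sum()) / np.sqrt(
...     (s ** 2 - (p ** 2).sum()) * (s ** 2 - (t ** 2).sum()))
>>> bool(np.isclose(mcc, matthews_corrcoef(y_true, y_pred)))
True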


Receiver operating characteristic (ROC)
The function roc_curve computes the receiver operating characteristic curve, or ROC curve. Quoting Wikipedia :
“A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates
the performance of a binary classifier system as its discrimination threshold is varied. It is created by
plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false
positives out of the negatives (FPR = false positive rate), at various threshold settings. TPR is also known
as sensitivity, and FPR is one minus the specificity or true negative rate.”
This function requires the true binary value and the target scores, which can either be probability estimates of the
positive class, confidence values, or binary decisions. Here is a small example of how to use the roc_curve function:
>>> import numpy as np
>>> from sklearn.metrics import roc_curve
>>> y = np.array([1, 1, 2, 2])
>>> scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)
>>> fpr
array([ 0. , 0.5, 0.5, 1. ])
>>> tpr
array([ 0.5, 0.5, 1. , 1. ])
>>> thresholds
array([ 0.8 , 0.4 , 0.35, 0.1 ])

This figure shows an example of such an ROC curve:
The roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, which is
also denoted by AUC or AUROC. By computing the area under the roc curve, the curve information is summarized in
one number. For more information see the Wikipedia article on AUC.
>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75

In multi-label classification, the roc_auc_score function is extended by averaging over the labels as above.
Compared to metrics such as the subset accuracy, the Hamming loss, or the F1 score, ROC doesn’t require optimizing a
threshold for each label. The roc_auc_score function can also be used in multi-class classification, if the predicted outputs have been binarized.
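As an illustration of the binarized usage (a sketch, not from the original guide; the choice of the iris data and a linear SVC is arbitrary, and the exact AUC value is not shown):

>>> from sklearn.datasets import load_iris
>>> from sklearn.metrics import roc_auc_score
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.preprocessing import label_binarize
>>> from sklearn.svm import SVC
>>> iris = load_iris()
>>> X_train, X_test, y_train, y_test = train_test_split(
...     iris.data, iris.target, random_state=0)
>>> clf = SVC(kernel='linear').fit(X_train, y_train)
>>> y_score = clf.decision_function(X_test)        # one score column per class (one-vs-rest)
>>> y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
>>> auc = roc_auc_score(y_test_bin, y_score, average='macro')
>>> 0.0 <= auc <= 1.0
True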
Examples:
• See Receiver Operating Characteristic (ROC) for an example of using ROC to evaluate the quality of the
output of a classifier.
• See Receiver Operating Characteristic (ROC) with cross validation for an example of using ROC to evaluate
classifier output quality, using cross-validation.
• See Species distribution modeling for an example of using ROC to model species distribution.

Zero one loss
The zero_one_loss function computes the sum or the average of the 0-1 classification loss (𝐿0−1 ) over 𝑛samples .
By default, the function normalizes over the sample. To get the sum of the 𝐿0−1 , set normalize to False.
In multilabel classification, the zero_one_loss scores a subset as one if its labels strictly match the predictions,
and as a zero if there are any errors. By default, the function returns the percentage of imperfectly predicted subsets.
To get the count of such subsets instead, set normalize to False.
If 𝑦ˆ𝑖 is the predicted value of the 𝑖-th sample and 𝑦𝑖 is the corresponding true value, then the 0-1 loss 𝐿0−1 is defined
as:
L_{0-1}(y_i, \hat{y}_i) = 1(\hat{y}_i \ne y_i)
where 1(𝑥) is the indicator function.


>>> from sklearn.metrics import zero_one_loss
>>> y_pred = [1, 2, 3, 4]
>>> y_true = [2, 2, 3, 4]
>>> zero_one_loss(y_true, y_pred)
0.25
>>> zero_one_loss(y_true, y_pred, normalize=False)
1

In the multilabel case with binary label indicators, where the first label set [0,1] has an error:
>>> zero_one_loss(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.5
>>> zero_one_loss(np.array([[0, 1], [1, 1]]), np.ones((2, 2)), normalize=False)
1

Example:
• See Recursive feature elimination with cross-validation for an example of zero one loss usage to perform
recursive feature elimination with cross-validation.

Brier score loss
The brier_score_loss function computes the Brier score for binary classes. Quoting Wikipedia:
“The Brier score is a proper score function that measures the accuracy of probabilistic predictions. It is
applicable to tasks in which predictions must assign probabilities to a set of mutually exclusive discrete
outcomes.”
This function returns a score of the mean square difference between the actual outcome and the predicted probability
of the possible outcome. The actual outcome has to be 1 or 0 (true or false), while the predicted probability of the
actual outcome can be a value between 0 and 1.
The Brier score loss is also between 0 and 1 and the lower the score (the mean square difference is smaller), the more accurate the prediction is. It can be thought of as a measure of the "calibration" of a set of probabilistic predictions.
BS = \frac{1}{N} \sum_{t=1}^{N} (f_t - o_t)^2
where N is the total number of predictions and f_t is the predicted probability of the actual outcome o_t.
Here is a small example of usage of this function:
>>> import numpy as np
>>> from sklearn.metrics import brier_score_loss
>>> y_true = np.array([0, 1, 1, 0])
>>> y_true_categorical = np.array(["spam", "ham", "ham", "spam"])
>>> y_prob = np.array([0.1, 0.9, 0.8, 0.4])
>>> y_pred = np.array([0, 1, 1, 0])
>>> brier_score_loss(y_true, y_prob)
0.055
>>> brier_score_loss(y_true, 1-y_prob, pos_label=0)
0.055
>>> brier_score_loss(y_true_categorical, y_prob, pos_label="ham")
0.055


>>> brier_score_loss(y_true, y_prob > 0.5)
0.0

Example:
• See Probability calibration of classifiers for an example of Brier score loss usage to perform probability
calibration of classifiers.

References:
• G. Brier, Verification of forecasts expressed in terms of probability, Monthly weather review 78.1 (1950)

Multilabel ranking metrics
In multilabel learning, each sample can have any number of ground truth labels associated with it. The goal is to give
high scores and better rank to the ground truth labels.
Coverage error
The coverage_error function computes the average number of labels that have to be included in the final prediction such that all true labels are predicted. This is useful if you want to know how many top-scored labels you have to predict on average without missing any true one. The best value of this metric is thus the average number of true labels.
Note: Our implementation’s score is 1 greater than the one given in Tsoumakas et al., 2010. This extends it to handle
the degenerate case in which an instance has 0 true labels.
Formally, given a binary indicator matrix of the ground truth labels y \in \{0, 1\}^{n_\text{samples} \times n_\text{labels}} and the score associated with each label \hat{f} \in \mathbb{R}^{n_\text{samples} \times n_\text{labels}}, the coverage is defined as
\text{coverage}(y, \hat{f}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \max_{j: y_{ij} = 1} \text{rank}_{ij}
with \text{rank}_{ij} = \left|\left\{k : \hat{f}_{ik} \ge \hat{f}_{ij}\right\}\right|. Given the rank definition, ties in y_scores are broken by giving the maximal rank that would have been assigned to all tied values.
Here is a small example of usage of this function:
>>> import numpy as np
>>> from sklearn.metrics import coverage_error
>>> y_true = np.array([[1, 0, 0], [0, 0, 1]])
>>> y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
>>> coverage_error(y_true, y_score)
2.5


Label ranking average precision
The label_ranking_average_precision_score function implements label ranking average precision
(LRAP). This metric is linked to the average_precision_score function, but is based on the notion of label ranking instead of precision and recall.
Label ranking average precision (LRAP) is the average over each ground truth label assigned to each sample, of the
ratio of true vs. total labels with lower score. This metric will yield better scores if you are able to give better rank to
the labels associated with each sample. The obtained score is always strictly greater than 0, and the best value is 1.
If there is exactly one relevant label per sample, label ranking average precision is equivalent to the mean reciprocal
rank.
Formally, given a binary indicator matrix of the ground truth labels y \in \{0, 1\}^{n_\text{samples} \times n_\text{labels}} and the score associated with each label \hat{f} \in \mathbb{R}^{n_\text{samples} \times n_\text{labels}}, the average precision is defined as
LRAP(y, \hat{f}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \frac{1}{|y_i|} \sum_{j: y_{ij} = 1} \frac{|\mathcal{L}_{ij}|}{\text{rank}_{ij}}
with \mathcal{L}_{ij} = \left\{k : y_{ik} = 1, \hat{f}_{ik} \ge \hat{f}_{ij}\right\}, \text{rank}_{ij} = \left|\left\{k : \hat{f}_{ik} \ge \hat{f}_{ij}\right\}\right| and |\cdot| is the \ell_0 norm or the cardinality of the set.
Here is a small example of usage of this function:
>>> import numpy as np
>>> from sklearn.metrics import label_ranking_average_precision_score
>>> y_true = np.array([[1, 0, 0], [0, 0, 1]])
>>> y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
>>> label_ranking_average_precision_score(y_true, y_score)
0.416...

Ranking loss
The label_ranking_loss function computes the ranking loss which averages over the samples the number of
label pairs that are incorrectly ordered, i.e. true labels have a lower score than false labels, weighted by the inverse
number of false and true labels. The lowest achievable ranking loss is zero.
Formally, given a binary indicator matrix of the ground truth labels y \in \{0, 1\}^{n_\text{samples} \times n_\text{labels}} and the score associated with each label \hat{f} \in \mathbb{R}^{n_\text{samples} \times n_\text{labels}}, the ranking loss is defined as
\text{ranking\_loss}(y, \hat{f}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \frac{1}{|y_i|(n_\text{labels} - |y_i|)} \left|\left\{(k, l) : \hat{f}_{ik} < \hat{f}_{il}, y_{ik} = 1, y_{il} = 0\right\}\right|
where |\cdot| is the \ell_0 norm or the cardinality of the set.
Here is a small example of usage of this function:
>>> import numpy as np
>>> from sklearn.metrics import label_ranking_loss
>>> y_true = np.array([[1, 0, 0], [0, 0, 1]])
>>> y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
>>> label_ranking_loss(y_true, y_score)
0.75...
>>> # With the following prediction, we have perfect and minimal loss
>>> y_score = np.array([[1.0, 0.1, 0.2], [0.1, 0.2, 0.9]])
>>> label_ranking_loss(y_true, y_score)
0.0


References:
• Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010). Mining multi-label data. In Data mining and knowledge
discovery handbook (pp. 667-685). Springer US.

Regression metrics
The sklearn.metrics module implements several loss, score, and utility functions to measure regression
performance. Some of those have been enhanced to handle the multioutput case: mean_squared_error,
mean_absolute_error, explained_variance_score and r2_score.
These functions have a multioutput keyword argument which specifies the way the scores or losses for each individual target should be averaged. The default is 'uniform_average', which specifies a uniformly weighted mean over outputs. If an ndarray of shape (n_outputs,) is passed, then its entries are interpreted as weights and an according weighted average is returned. If multioutput='raw_values' is specified, then all unaltered individual scores or losses will be returned in an array of shape (n_outputs,).
The r2_score and explained_variance_score accept an additional value 'variance_weighted' for
the multioutput parameter. This option leads to a weighting of each individual score by the variance of the
corresponding target variable. This setting quantifies the globally captured unscaled variance. If the target variables are of different scale, then this score puts more importance on well explaining the higher variance variables.
multioutput='variance_weighted' is the default value for r2_score for backward compatibility. This
will be changed to uniform_average in the future.
Explained variance score
The explained_variance_score computes the explained variance regression score.
If 𝑦ˆ is the estimated target output, 𝑦 the corresponding (correct) target output, and 𝑉 𝑎𝑟 is Variance, the square of the
standard deviation, then the explained variance is estimated as follows:
\texttt{explained\_variance}(y, \hat{y}) = 1 - \frac{Var\{y - \hat{y}\}}{Var\{y\}}

The best possible score is 1.0, lower values are worse.
Here is a small example of usage of the explained_variance_score function:
>>> from sklearn.metrics import explained_variance_score
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> explained_variance_score(y_true, y_pred)
0.957...
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> explained_variance_score(y_true, y_pred, multioutput='raw_values')
...
array([ 0.967...,  1.        ])
>>> explained_variance_score(y_true, y_pred, multioutput=[0.3, 0.7])
...
0.990...


Mean absolute error
The mean_absolute_error function computes mean absolute error, a risk metric corresponding to the expected
value of the absolute error loss or 𝑙1-norm loss.
If 𝑦ˆ𝑖 is the predicted value of the 𝑖-th sample, and 𝑦𝑖 is the corresponding true value, then the mean absolute error
(MAE) estimated over 𝑛samples is defined as
\text{MAE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} |y_i - \hat{y}_i|.

Here is a small example of usage of the mean_absolute_error function:
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_absolute_error(y_true, y_pred)
0.5
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> mean_absolute_error(y_true, y_pred)
0.75
>>> mean_absolute_error(y_true, y_pred, multioutput='raw_values')
array([ 0.5, 1. ])
>>> mean_absolute_error(y_true, y_pred, multioutput=[0.3, 0.7])
...
0.849...

Mean squared error
The mean_squared_error function computes mean square error, a risk metric corresponding to the expected
value of the squared (quadratic) error or loss.
If 𝑦ˆ𝑖 is the predicted value of the 𝑖-th sample, and 𝑦𝑖 is the corresponding true value, then the mean squared error
(MSE) estimated over 𝑛samples is defined as
\text{MSE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} (y_i - \hat{y}_i)^2.

Here is a small example of usage of the mean_squared_error function:
>>> from sklearn.metrics import mean_squared_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error(y_true, y_pred)
0.375
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> mean_squared_error(y_true, y_pred)
0.7083...

Examples:


• See Gradient Boosting regression for an example of mean squared error usage to evaluate gradient boosting
regression.

Mean squared logarithmic error
The mean_squared_log_error function computes a risk metric corresponding to the expected value of the
squared logarithmic (quadratic) error or loss.
If 𝑦ˆ𝑖 is the predicted value of the 𝑖-th sample, and 𝑦𝑖 is the corresponding true value, then the mean squared logarithmic
error (MSLE) estimated over 𝑛samples is defined as
\text{MSLE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} (\log_e(1 + y_i) - \log_e(1 + \hat{y}_i))^2.

where \log_e(x) means the natural logarithm of x. This metric is best to use when targets have exponential growth, such as population counts, average sales of a commodity over a span of years, etc. Note that this metric penalizes an under-predicted estimate more than an over-predicted estimate.
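For instance (an illustrative check, not from the original guide), under- and over-predicting a true value of 10 by the same amount gives a larger loss for the under-prediction:

>>> from sklearn.metrics import mean_squared_log_error
>>> mean_squared_log_error([10], [5]) > mean_squared_log_error([10], [15])
True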
Here is a small example of usage of the mean_squared_log_error function:
>>> from sklearn.metrics import mean_squared_log_error
>>> y_true = [3, 5, 2.5, 7]
>>> y_pred = [2.5, 5, 4, 8]
>>> mean_squared_log_error(y_true, y_pred)
0.039...
>>> y_true = [[0.5, 1], [1, 2], [7, 6]]
>>> y_pred = [[0.5, 2], [1, 2.5], [8, 8]]
>>> mean_squared_log_error(y_true, y_pred)
0.044...

Median absolute error
The median_absolute_error is particularly interesting because it is robust to outliers. The loss is calculated by
taking the median of all absolute differences between the target and the prediction.
If 𝑦ˆ𝑖 is the predicted value of the 𝑖-th sample and 𝑦𝑖 is the corresponding true value, then the median absolute error
(MedAE) estimated over 𝑛samples is defined as
MedAE(𝑦, 𝑦ˆ) = median(| 𝑦1 − 𝑦ˆ1 |, . . . , | 𝑦𝑛 − 𝑦ˆ𝑛 |).
The median_absolute_error does not support multioutput.
Here is a small example of usage of the median_absolute_error function:
>>> from sklearn.metrics import median_absolute_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> median_absolute_error(y_true, y_pred)
0.5

R2 score, the coefficient of determination
The r2_score function computes R2 , the coefficient of determination. It provides a measure of how well future
samples are likely to be predicted by the model. Best possible score is 1.0 and it can be negative (because the model
can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features,
would get a R^2 score of 0.0.
If 𝑦ˆ𝑖 is the predicted value of the 𝑖-th sample and 𝑦𝑖 is the corresponding true value, then the score R2 estimated over
𝑛samples is defined as
R^2(y, \hat{y}) = 1 - \frac{\sum_{i=0}^{n_\text{samples}-1} (y_i - \hat{y}_i)^2}{\sum_{i=0}^{n_\text{samples}-1} (y_i - \bar{y})^2}
where \bar{y} = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} y_i.

Here is a small example of usage of the r2_score function:
>>> from sklearn.metrics import r2_score
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> r2_score(y_true, y_pred)
0.948...
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> r2_score(y_true, y_pred, multioutput='variance_weighted')
...
0.938...
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> r2_score(y_true, y_pred, multioutput='uniform_average')
...
0.936...
>>> r2_score(y_true, y_pred, multioutput='raw_values')
...
array([ 0.965..., 0.908...])
>>> r2_score(y_true, y_pred, multioutput=[0.3, 0.7])
...
0.925...

Example:
• See Lasso and Elastic Net for Sparse Signals for an example of R2 score usage to evaluate Lasso and Elastic
Net on sparse signals.

Clustering metrics
The sklearn.metrics module implements several loss, score, and utility functions. For more information see the
Clustering performance evaluation section for instance clustering, and Biclustering evaluation for biclustering.
Dummy estimators
When doing supervised learning, a simple sanity check consists of comparing one’s estimator against simple rules of
thumb. DummyClassifier implements several such simple strategies for classification:
• stratified generates random predictions by respecting the training set class distribution.
• most_frequent always predicts the most frequent label in the training set.


• prior always predicts the class that maximizes the class prior (like most_frequent) and predict_proba returns the class prior.
• uniform generates predictions uniformly at random.
• constant always predicts a constant label that is provided by the user. A major motivation of this
method is F1-scoring, when the positive class is in the minority.
Note that with all these strategies, the predict method completely ignores the input data!
To illustrate DummyClassifier, first let’s create an imbalanced dataset:
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> y[y != 1] = -1
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Next, let’s compare the accuracy of SVC and most_frequent:
>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.svm import SVC
>>> clf = SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.63...
>>> clf = DummyClassifier(strategy='most_frequent',random_state=0)
>>> clf.fit(X_train, y_train)
DummyClassifier(constant=None, random_state=0, strategy='most_frequent')
>>> clf.score(X_test, y_test)
0.57...

We see that SVC doesn’t do much better than a dummy classifier. Now, let’s change the kernel:
>>> clf = SVC(kernel='rbf', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.97...

We see that the accuracy was boosted to almost 100%. A cross validation strategy is recommended for a better
estimate of the accuracy, if it is not too CPU costly. For more information see the Cross-validation: evaluating
estimator performance section. Moreover if you want to optimize over the parameter space, it is highly recommended
to use an appropriate methodology; see the Tuning the hyper-parameters of an estimator section for details.
More generally, when the accuracy of a classifier is too close to random, it probably means that something went wrong: features are not helpful, a hyperparameter is not correctly tuned, the classifier is suffering from class imbalance, etc.
DummyRegressor also implements four simple rules of thumb for regression:
• mean always predicts the mean of the training targets.
• median always predicts the median of the training targets.
• quantile always predicts a user provided quantile of the training targets.
• constant always predicts a constant value that is provided by the user.
In all these strategies, the predict method completely ignores the input data.
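A minimal sketch of the mean strategy (not part of the original guide; the toy data is arbitrary), which also illustrates the earlier remark that a constant mean predictor gets an R^2 score of 0.0 on the data it was fit on:

>>> import numpy as np
>>> from sklearn.dummy import DummyRegressor
>>> X = np.array([[1.0], [2.0], [3.0], [4.0]])
>>> y = np.array([2.0, 4.0, 6.0, 8.0])
>>> dummy = DummyRegressor(strategy='mean').fit(X, y)
>>> dummy.predict(X)       # always the mean of the training targets
array([ 5.,  5.,  5.,  5.])
>>> dummy.score(X, y)      # R^2 of a constant mean prediction
0.0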


3.3.4 Model persistence
After training a scikit-learn model, it is desirable to have a way to persist the model for future use without having to
retrain. The following section gives you an example of how to persist a model with pickle. We’ll also review a few
security and maintainability issues when working with pickle serialization.
Persistence example
It is possible to save a model in scikit-learn by using Python's built-in persistence model, namely pickle:
>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0

In the specific case of scikit-learn, it may be more interesting to use joblib's replacement of pickle (joblib.dump & joblib.load), which is more efficient on objects that carry large numpy arrays internally, as is often the case for fitted scikit-learn estimators, but can only pickle to the disk and not to a string:
>>> from sklearn.externals import joblib
>>> joblib.dump(clf, 'filename.pkl')

Later you can load back the pickled model (possibly in another Python process) with:
>>> clf = joblib.load('filename.pkl')

Note: the joblib.dump and joblib.load functions also accept file-like objects instead of filenames. More information on data persistence with Joblib is available here.

Security & maintainability limitations
pickle (and joblib by extension), has some issues regarding maintainability and security. Because of this,
• Never unpickle untrusted data as it could lead to malicious code being executed upon loading.
• While models saved using one version of scikit-learn might load in other versions, this is entirely unsupported
and inadvisable. It should also be kept in mind that operations performed on such data could give different and
unexpected results.
In order to rebuild a similar model with future versions of scikit-learn, additional metadata should be saved along the
pickled model:
• The training data, e.g. a reference to an immutable snapshot
• The Python source code used to generate the model
• The versions of scikit-learn and its dependencies
• The cross validation score obtained on the training data
This should make it possible to check that the cross-validation score is in the same range as before.
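A minimal sketch of such bookkeeping could look as follows; the metadata keys, the filename and the stored values are illustrative choices, not a scikit-learn convention:

>>> import pickle
>>> import sys
>>> import sklearn
>>> metadata = {
...     'sklearn_version': sklearn.__version__,
...     'python_version': sys.version,
...     'training_data': 'reference to an immutable snapshot',  # placeholder reference
...     'cv_score': None,  # fill in the cross-validation score obtained at training time
... }
>>> with open('model_with_metadata.pkl', 'wb') as f:
...     pickle.dump({'model': clf, 'metadata': metadata}, f)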
Since a model internal representation may be different on two different architectures, dumping a model on one architecture and loading it on another architecture is not supported.
If you want to know more about these issues and explore other possible serialization methods, please refer to this talk
by Alex Gaynor.

3.3.5 Validation curves: plotting scores to evaluate models
Every estimator has its advantages and drawbacks. Its generalization error can be decomposed in terms of bias,
variance and noise. The bias of an estimator is its average error for different training sets. The variance of an
estimator indicates how sensitive it is to varying training sets. Noise is a property of the data.
In the following plot, we see a function $f(x) = \cos(\frac{3}{2}\pi x)$ and some noisy samples from that function. We use three
different estimators to fit the function: linear regression with polynomial features of degree 1, 4 and 15. We see that
the first estimator can at best provide only a poor fit to the samples and the true function because it is too simple
(high bias), the second estimator approximates it almost perfectly and the last estimator approximates the training data
perfectly but does not fit the true function very well, i.e. it is very sensitive to varying training data (high variance).

Bias and variance are inherent properties of estimators and we usually have to select learning algorithms and hyperparameters so that both bias and variance are as low as possible (see Bias-variance dilemma). Another way to reduce
the variance of a model is to use more training data. However, you should only collect more training data if the true
function is too complex to be approximated by an estimator with a lower variance.
In the simple one-dimensional problem that we have seen in the example it is easy to see whether the estimator suffers
from bias or variance. However, in high-dimensional spaces, models can become very difficult to visualize. For this
reason, it is often helpful to use the tools described below.
Examples:
• Underfitting vs. Overfitting

• Plotting Validation Curves
• Plotting Learning Curves

Validation curve
To validate a model we need a scoring function (see Model evaluation: quantifying the quality of predictions), for
example accuracy for classifiers. The proper way of choosing multiple hyperparameters of an estimator is of course
grid search or similar methods (see Tuning the hyper-parameters of an estimator) that select the hyperparameter with
the maximum score on a validation set or multiple validation sets. Note that if we optimized the hyperparameters
based on a validation score the validation score is biased and not a good estimate of the generalization any longer. To
get a proper estimate of the generalization we have to compute the score on another test set.
However, it is sometimes helpful to plot the influence of a single hyperparameter on the training score and the validation score to find out whether the estimator is overfitting or underfitting for some hyperparameter values.
The function validation_curve can help in this case:
>>> import numpy as np
>>> from sklearn.model_selection import validation_curve
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import Ridge

>>> np.random.seed(0)
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> indices = np.arange(y.shape[0])
>>> np.random.shuffle(indices)
>>> X, y = X[indices], y[indices]

>>> train_scores, valid_scores = validation_curve(Ridge(), X, y, "alpha",
...                                               np.logspace(-7, 3, 3))
>>> train_scores
array([[ 0.94..., 0.92..., 0.92...],
       [ 0.94..., 0.92..., 0.92...],
       [ 0.47..., 0.45..., 0.42...]])
>>> valid_scores
array([[ 0.90..., 0.92..., 0.94...],
       [ 0.90..., 0.92..., 0.94...],
       [ 0.44..., 0.39..., 0.45...]])

If the training score and the validation score are both low, the estimator will be underfitting. If the training score is high and the validation score is low, the estimator is overfitting; otherwise it is working very well. A low training
score and a high validation score is usually not possible. All three cases can be found in the plot below where we vary
the parameter 𝛾 of an SVM on the digits dataset.
Learning curve
A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It
is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from
a variance error or a bias error. If both the validation score and the training score converge to a value that is too low
with increasing size of the training set, we will not benefit much from more training data. In the following plot you
can see an example: naive Bayes roughly converges to a low score.
We will probably have to use an estimator or a parametrization of the current estimator that can learn more complex
concepts (i.e. has a lower bias). If the training score is much greater than the validation score for the maximum number
of training samples, adding more training samples will most likely increase generalization. In the following plot you
can see that the SVM could benefit from more training examples.

We can use the function learning_curve to generate the values that are required to plot such a learning curve
(number of samples that have been used, the average scores on the training sets and the average scores on the validation
sets):
>>> from sklearn.model_selection import learning_curve
>>> from sklearn.svm import SVC

>>> train_sizes, train_scores, valid_scores = learning_curve(
...     SVC(kernel='linear'), X, y, train_sizes=[50, 80, 110], cv=5)
>>> train_sizes
array([ 50,  80, 110])
>>> train_scores
array([[ 0.98..., 0.98 , 0.98..., 0.98..., 0.98...],
       [ 0.98..., 1.   , 0.98..., 0.98..., 0.98...],
       [ 0.98..., 1.   , 0.98..., 0.98..., 0.99...]])
>>> valid_scores
array([[ 1. ,  0.93...,  1. ,  1. ,  0.96...],
       [ 1. ,  0.96...,  1. ,  1. ,  0.96...],
       [ 1. ,  0.96...,  1. ,  1. ,  0.96...]])

3.4 Dataset transformations
scikit-learn provides a library of transformers, which may clean (see Preprocessing data), reduce (see Unsupervised
dimensionality reduction), expand (see Kernel Approximation) or generate (see Feature extraction) feature representations.
Like other estimators, these are represented by classes with a fit method, which learns model parameters (e.g.
mean and standard deviation for normalization) from a training set, and a transform method which applies this
transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and
transforming the training data simultaneously.
Combining such transformers, either in parallel or in series, is covered in Pipeline and FeatureUnion: combining estimators. Pairwise metrics, Affinities and Kernels covers transforming feature spaces into affinity matrices, while
Transforming the prediction target (y) considers transformations of the target space (e.g. categorical labels) for use in
scikit-learn.

3.4.1 Pipeline and FeatureUnion: combining estimators
Pipeline: chaining estimators
Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of
steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:
Convenience and encapsulation You only have to call fit and predict once on your data to fit a whole sequence
of estimators.
Joint parameter selection You can grid search over parameters of all estimators in the pipeline at once.
Safety Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring
that the same samples are used to train the transformers and predictors.
All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The
last estimator may be any type (transformer, classifier, etc.).
Usage
The Pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you
want to give this step and value is an estimator object:
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
>>> pipe = Pipeline(estimators)
>>> pipe
Pipeline(memory=None,
steps=[('reduce_dim', PCA(copy=True,...)),
('clf', SVC(C=1.0,...))])

The utility function make_pipeline is a shorthand for constructing pipelines; it takes a variable number of estimators and returns a pipeline, filling in the names automatically:
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.preprocessing import Binarizer
>>> make_pipeline(Binarizer(), MultinomialNB())
Pipeline(memory=None,
steps=[('binarizer', Binarizer(copy=True, threshold=0.0)),
('multinomialnb', MultinomialNB(alpha=1.0,
class_prior=None,
fit_prior=True))])

The estimators of a pipeline are stored as a list in the steps attribute:
>>> pipe.steps[0]
('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None,
    random_state=None, svd_solver='auto', tol=0.0, whiten=False))

and as a dict in named_steps:

>>> pipe.named_steps['reduce_dim']
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)

Parameters of the estimators in the pipeline can be accessed using the <estimator>__<parameter> syntax:
>>> pipe.set_params(clf__C=10)
Pipeline(memory=None,
steps=[('reduce_dim', PCA(copy=True, iterated_power='auto',...)),
('clf', SVC(C=10, cache_size=200, class_weight=None,...))])

Attributes of named_steps map to keys, enabling tab completion in interactive environments:
>>> pipe.named_steps.reduce_dim is pipe.named_steps['reduce_dim']
True

This is particularly important for doing grid searches:
>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = dict(reduce_dim__n_components=[2, 5, 10],
...                   clf__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(pipe, param_grid=param_grid)

Individual steps may also be replaced as parameters, and non-final steps may be ignored by setting them to None:
>>> from sklearn.linear_model import LogisticRegression
>>> param_grid = dict(reduce_dim=[None, PCA(5), PCA(10)],
...                   clf=[SVC(), LogisticRegression()],
...                   clf__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(pipe, param_grid=param_grid)

Examples:
• Pipeline Anova SVM
• Sample pipeline for text feature extraction and evaluation
• Pipelining: chaining a PCA and a logistic regression
• Explicit feature map approximation for RBF kernels
• SVM-Anova: SVM with univariate feature selection
• Selecting dimensionality reduction with Pipeline and GridSearchCV

See also:
• Tuning the hyper-parameters of an estimator

Notes
Calling fit on the pipeline is the same as calling fit on each estimator in turn, then transforming the input and passing it
on to the next step. The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator
is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline.
Caching transformers: avoid repeated computation
Fitting transformers may be computationally expensive. With its memory parameter set, Pipeline will cache each
transformer after calling fit. This feature is used to avoid re-fitting the transformers within a pipeline when the
parameters and input data are identical. A typical example is the case of a grid search in which the transformers can
be fitted only once and reused for each configuration.
The parameter memory is needed in order to cache the transformers. memory can be either a string containing the
directory where to cache the transformers or a joblib.Memory object:
>>> from tempfile import mkdtemp
>>> from shutil import rmtree
>>> from sklearn.decomposition import PCA
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
>>> cachedir = mkdtemp()
>>> pipe = Pipeline(estimators, memory=cachedir)
>>> pipe
Pipeline(...,
steps=[('reduce_dim', PCA(copy=True,...)),
('clf', SVC(C=1.0,...))])
>>> # Clear the cache directory when you don't need it anymore
>>> rmtree(cachedir)

Warning: Side effect of caching transformers
Using a Pipeline without cache enabled, it is possible to inspect the original instance such as:
>>> from sklearn.datasets import load_digits
>>> digits = load_digits()
>>> pca1 = PCA()
>>> svm1 = SVC()
>>> pipe = Pipeline([('reduce_dim', pca1), ('clf', svm1)])
>>> pipe.fit(digits.data, digits.target)
...
Pipeline(memory=None,
steps=[('reduce_dim', PCA(...)), ('clf', SVC(...))])
>>> # The pca instance can be inspected directly
>>> print(pca1.components_)
[[ -1.77484909e-19 ... 4.07058917e-18]]

Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to
the pipeline cannot be inspected directly. In the following example, accessing the PCA instance pca2 will raise an
AttributeError since pca2 will be an unfitted transformer. Instead, use the attribute named_steps to
inspect estimators within the pipeline:
>>> cachedir = mkdtemp()
>>> pca2 = PCA()
>>> svm2 = SVC()
>>> cached_pipe = Pipeline([('reduce_dim', pca2), ('clf', svm2)],
...
memory=cachedir)
>>> cached_pipe.fit(digits.data, digits.target)
...
Pipeline(memory=...,
steps=[('reduce_dim', PCA(...)), ('clf', SVC(...))])
>>> print(cached_pipe.named_steps['reduce_dim'].components_)
...
[[ -1.77484909e-19 ... 4.07058917e-18]]
>>> # Remove the cache directory
>>> rmtree(cachedir)


Examples:
• Selecting dimensionality reduction with Pipeline and GridSearchCV

FeatureUnion: composite feature spaces
FeatureUnion combines several transformer objects into a new transformer that combines their output. A
FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently.
For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated
end-to-end into larger vectors.
FeatureUnion serves the same purposes as Pipeline - convenience and joint parameter estimation and validation.
FeatureUnion and Pipeline can be combined to create complex models.
(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only
produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)
Usage
A FeatureUnion is built using a list of (key, value) pairs, where the key is the name you want to give to a
given transformation (an arbitrary string; it only serves as an identifier) and value is an estimator object:
>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.decomposition import PCA
>>> from sklearn.decomposition import KernelPCA
>>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
>>> combined = FeatureUnion(estimators)
>>> combined
FeatureUnion(n_jobs=1,
transformer_list=[('linear_pca', PCA(copy=True,...)),
('kernel_pca', KernelPCA(alpha=1.0,...))],
transformer_weights=None)

Like pipelines, feature unions have a shorthand constructor called make_union that does not require explicit naming
of the components.
Like Pipeline, individual steps may be replaced using set_params, and ignored by setting to None:
>>> combined.set_params(kernel_pca=None)
...
FeatureUnion(n_jobs=1,
transformer_list=[('linear_pca', PCA(copy=True,...)),
('kernel_pca', None)],
transformer_weights=None)

Examples:
• Concatenating multiple feature extraction methods

• Feature Union with Heterogeneous Data Sources

3.4.2 Feature extraction
The sklearn.feature_extraction module can be used to extract features in a format supported by machine
learning algorithms from datasets consisting of formats such as text and image.
Note: Feature extraction is very different from Feature selection: the former consists in transforming arbitrary data,
such as text or images, into numerical features usable for machine learning. The latter is a machine learning technique
applied on these features.

Loading features from dicts
The class DictVectorizer can be used to convert feature arrays represented as lists of standard Python dict
objects to the NumPy/SciPy representation used by scikit-learn estimators.
While not particularly fast to process, Python’s dict has the advantages of being convenient to use, being sparse
(absent features need not be stored) and storing feature names in addition to values.
DictVectorizer implements what is called one-of-K or “one-hot” coding for categorical (aka nominal, discrete)
features. Categorical features are “attribute-value” pairs where the value is restricted to a list of discrete possibilities
without ordering (e.g. topic identifiers, types of objects, tags, names. . . ).
In the following, “city” is a categorical attribute while “temperature” is a traditional numerical feature:
>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Francisco', 'temperature': 18.},
... ]

>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()

>>> vec.fit_transform(measurements).toarray()
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])

>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']

DictVectorizer is also a useful representation transformation for training sequence classifiers in Natural Language Processing models that typically work by extracting feature windows around a particular word of interest.
For example, suppose that we have a first algorithm that extracts Part of Speech (PoS) tags that we want to use as
complementary tags for training a sequence classifier (e.g. a chunker). The following dict could be such a window of
features extracted around the word ‘sat’ in the sentence ‘The cat sat on the mat.’:
>>> pos_window = [
...     {
...         'word-2': 'the',
...         'pos-2': 'DT',
...         'word-1': 'cat',
...         'pos-1': 'NN',
...         'word+1': 'on',
...         'pos+1': 'PP',
...     },
...     # in a real application one would extract many such dictionaries
... ]

This description can be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe
after being piped into a text.TfidfTransformer for normalization):
>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized
<1x6 sparse matrix of type '<... 'numpy.float64'>'
with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[ 1., 1., 1., 1., 1., 1.]])
>>> vec.get_feature_names()
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']

As you can imagine, if one extracts such a context around each individual word of a corpus of documents the resulting
matrix will be very wide (many one-hot-features) with most of them being valued to zero most of the time. So as to
make the resulting data structure able to fit in memory the DictVectorizer class uses a scipy.sparse matrix
by default instead of a numpy.ndarray.
Feature hashing
The class FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing,
or the “hashing trick”. Instead of building a hash table of the features encountered in training, as the vectorizers
do, instances of FeatureHasher apply a hash function to the features to determine their column index in sample
matrices directly. The result is increased speed and reduced memory usage, at the expense of inspectability; the hasher
does not remember what the input features looked like and has no inverse_transform method.
Since the hash function might cause collisions between (unrelated) features, a signed hash function is used and the
sign of the hash value determines the sign of the value stored in the output matrix for a feature. This way, collisions
are likely to cancel out rather than accumulate error, and the expected mean of any output feature’s value is zero.
This mechanism is enabled by default with alternate_sign=True and is particularly useful for small hash table
sizes (n_features < 10000). For large hash table sizes, it can be disabled, to allow the output to be passed
to estimators like sklearn.naive_bayes.MultinomialNB or sklearn.feature_selection.chi2
feature selectors that expect non-negative inputs.
FeatureHasher accepts either mappings (like Python’s dict and its variants in the collections module),
(feature, value) pairs, or strings, depending on the constructor parameter input_type. Mappings are treated
as lists of (feature, value) pairs, while single strings have an implicit value of 1, so ['feat1', 'feat2',
'feat3'] is interpreted as [('feat1', 1), ('feat2', 1), ('feat3', 1)]. If a single feature occurs
multiple times in a sample, the associated values will be summed (so ('feat', 2) and ('feat', 3.5) become
('feat', 5.5)). The output from FeatureHasher is always a scipy.sparse matrix in the CSR format.
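For instance, here is a small sketch using (feature, value) pairs as input; the tiny n_features value is only for readability, alternate_sign=False keeps the output non-negative, and the printed outputs may display slightly differently depending on the environment:

>>> from sklearn.feature_extraction import FeatureHasher
>>> h = FeatureHasher(n_features=8, input_type='pair', alternate_sign=False)
>>> X = h.transform([[('feat', 2), ('feat', 3.5)]])  # the two values are summed into one column
>>> X.shape
(1, 8)
>>> X.sum()
5.5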
Feature hashing can be employed in document classification, but unlike text.CountVectorizer,
FeatureHasher does not do word splitting or any other preprocessing except Unicode-to-UTF-8 encoding; see
Vectorizing a large text corpus with the hashing trick, below, for a combined tokenizer/hasher.
As an example, consider a word-level natural language processing task that needs features extracted from (token,
part_of_speech) pairs. One could use a Python generator function to extract features:


def token_features(token, part_of_speech):
    if token.isdigit():
        yield "numeric"
    else:
        yield "token={}".format(token.lower())
        yield "token,pos={},{}".format(token, part_of_speech)
    if token[0].isupper():
        yield "uppercase_initial"
    if token.isupper():
        yield "all_uppercase"
    yield "pos={}".format(part_of_speech)

Then, the raw_X to be fed to FeatureHasher.transform can be constructed using:
raw_X = (token_features(tok, pos_tagger(tok)) for tok in corpus)

and fed to a hasher with:
hasher = FeatureHasher(input_type='string')
X = hasher.transform(raw_X)

to get a scipy.sparse matrix X.
Note the use of a generator comprehension, which introduces laziness into the feature extraction: tokens are only
processed on demand from the hasher.
Implementation details
FeatureHasher uses the signed 32-bit variant of MurmurHash3. As a result (and because of limitations in scipy.sparse), the maximum number of features supported is currently $2^{31} - 1$.
The original formulation of the hashing trick by Weinberger et al. used two separate hash functions ℎ and 𝜉 to determine the column index and sign of a feature, respectively. The present implementation works under the assumption
that the sign bit of MurmurHash3 is independent of its other bits.
Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two
as the n_features parameter; otherwise the features will not be mapped evenly to the columns.
References:
• Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola and Josh Attenberg (2009). Feature hashing for large scale multitask learning. Proc. ICML.
• MurmurHash3.

Text feature extraction
The Bag of Words representation
Text Analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.


In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from
text content, namely:
• tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and
punctuation as token separators.
• counting the occurrences of tokens in each document.
• normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.
In this scheme, features and samples are defined as follows:
• each individual token occurrence frequency (normalized or not) is treated as a feature.
• the vector of all the token frequencies for a given document is considered a multivariate sample.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token
(e.g. word) occurring in the corpus.
We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This
specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of
the words in the document.
Sparsity
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will
have many feature values that are zeros (typically more than 99% of them).
For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order
of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
In order to be able to store such a matrix in memory but also to speed up algebraic matrix / vector operations, implementations will typically use a sparse representation such as those available in the scipy.sparse
package.
Common Vectorizer usage
CountVectorizer implements both tokenization and occurrence counting in a single class:
>>> from sklearn.feature_extraction.text import CountVectorizer

This model has many parameters, however the default values are quite reasonable (please see the reference documentation for the details):
>>> vectorizer = CountVectorizer()
>>> vectorizer
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)

Let’s use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:


>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X
<4x9 sparse matrix of type '<... 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse ... format>

The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does
this step can be requested explicitly:
>>> analyze = vectorizer.build_analyzer()
>>> analyze("This is a text document to analyze.") == (
...     ['this', 'is', 'text', 'document', 'to', 'analyze'])
True

Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the
resulting matrix. This interpretation of the columns can be retrieved as follows:
>>> vectorizer.get_feature_names() == (
...     ['and', 'document', 'first', 'is', 'one',
...      'second', 'the', 'third', 'this'])
True

>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)

The converse mapping from feature name to column index is stored in the vocabulary_ attribute of the vectorizer:
>>> vectorizer.vocabulary_.get('document')
1

Hence words that were not seen in the training corpus will be completely ignored in future calls to the transform
method:
>>> vectorizer.transform(['Something completely new.']).toarray()
...
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]]...)

Note that in the previous corpus, the first and the last documents have exactly the same words hence are encoded in
equal vectors. In particular we lose the information that the last document is an interrogative form. To preserve some
of the local ordering information we can extract 2-grams of words in addition to the 1-grams (individual words):
>>> bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
...                                     token_pattern=r'\b\w+\b', min_df=1)
>>> analyze = bigram_vectorizer.build_analyzer()
>>> analyze('Bi-grams are cool!') == (
...     ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool'])
True

The vocabulary extracted by this vectorizer is hence much bigger and can now resolve ambiguities encoded in local
positioning patterns:
>>> X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
>>> X_2
...
array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]]...)

In particular the interrogative form “Is this” is only present in the last document:
>>> feature_index = bigram_vectorizer.vocabulary_.get('is this')
>>> X_2[:, feature_index]
array([0, 0, 0, 1]...)

Tf–idf term weighting
In a large text corpus, some words will be very present (e.g. “the”, “a”, “is” in English) hence carrying very little
meaningful information about the actual contents of the document. If we were to feed the direct count data directly to
a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms.
In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common
to use the tf–idf transform.
Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency: tf-idf(t,d) =
tf(t,d) × idf(t).
Using the TfidfTransformer’s default settings, TfidfTransformer(norm='l2', use_idf=True,
smooth_idf=True, sublinear_tf=False) the term frequency, the number of times a term occurs in a
given document, is multiplied with the idf component, which is computed as
$\text{idf}(t) = \log{\frac{1 + n_d}{1 + \text{df}(d, t)}} + 1,$

where 𝑛𝑑 is the total number of documents, and df(𝑑, 𝑡) is the number of documents that contain term 𝑡. The resulting
tf-idf vectors are then normalized by the Euclidean norm:
$v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}.$

This was originally a term weighting scheme developed for information retrieval (as a ranking function for search
engines results) that has also found good use in document classification and clustering.
The following sections contain further explanations and examples that illustrate how the tf-idfs are computed exactly
and how the tf-idfs computed in scikit-learn’s TfidfTransformer and TfidfVectorizer differ slightly from
the standard textbook notation that defines the idf as
$\text{idf}(t) = \log{\frac{n_d}{1 + \text{df}(d, t)}}.$

In the TfidfTransformer and TfidfVectorizer with smooth_idf=False, the “1” count is added to the
idf instead of the idf’s denominator:
$\text{idf}(t) = \log{\frac{n_d}{\text{df}(d, t)}} + 1$

This normalization is implemented by the TfidfTransformer class:
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer(smooth_idf=False)
>>> transformer
TfidfTransformer(norm=...'l2', smooth_idf=False, sublinear_tf=False,
use_idf=True)


Again please see the reference documentation for the details on all the parameters.
Let’s take an example with the following counts. The first term is present 100% of the time, hence not very interesting. The two other features are present in less than 50% of the documents, hence probably more representative of the content of the documents:
>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<... 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse ... format>

>>> tfidf.toarray()
array([[ 0.81940995,  0.        ,  0.57320793],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.47330339,  0.88089948,  0.        ],
       [ 0.58149261,  0.        ,  0.81355169]])

Each row is normalized to have unit Euclidean norm:
$v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}$

For example, we can compute the tf-idf of the first term in the first document in the counts array as follows:
$n_{d, \text{term1}} = 6$

$\text{df}(d, t)_{\text{term1}} = 6$

$\text{idf}(d, t)_{\text{term1}} = \log \frac{n_d}{\text{df}(d, t)} + 1 = \log(1) + 1 = 1$

$\text{tf-idf}_{\text{term1}} = \text{tf} \times \text{idf} = 3 \times 1 = 3$

Now, if we repeat this computation for the remaining 2 terms in the document, we get

$\text{tf-idf}_{\text{term2}} = 0 \times (\log(6/1) + 1) = 0$

$\text{tf-idf}_{\text{term3}} = 1 \times (\log(6/2) + 1) \approx 2.0986$

and the vector of raw tf-idfs:

$\text{tf-idf}_{\text{raw}} = [3, 0, 2.0986].$

Then, applying the Euclidean (L2) norm, we obtain the following tf-idfs for document 1:

$\frac{[3, 0, 2.0986]}{\sqrt{3^2 + 0^2 + 2.0986^2}} = [0.819, 0, 0.573].$

Furthermore, the default parameter smooth_idf=True adds “1” to the numerator and denominator as if an extra
document was seen containing every term in the collection exactly once, which prevents zero divisions:
$\text{idf}(t) = \log{\frac{1 + n_d}{1 + \text{df}(d, t)}} + 1$

Using this modification, the tf-idf of the third term in document 1 changes to 1.8473:
$\text{tf-idf}_{\text{term3}} = 1 \times (\log(7/3) + 1) \approx 1.8473$
And the L2-normalized tf-idf changes to

$\frac{[3, 0, 1.8473]}{\sqrt{3^2 + 0^2 + 1.8473^2}} = [0.8515, 0, 0.5243]$:

>>> transformer = TfidfTransformer()
>>> transformer.fit_transform(counts).toarray()
array([[ 0.85151335,  0.        ,  0.52433293],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.55422893,  0.83236428,  0.        ],
       [ 0.63035731,  0.        ,  0.77630514]])

The weights of each feature computed by the fit method call are stored in a model attribute:
>>> transformer.idf_
array([ 1. ...,  2.25...,  1.84...])

As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all
the options of CountVectorizer and TfidfTransformer in a single model:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer()
>>> vectorizer.fit_transform(corpus)
...
<4x9 sparse matrix of type '<... 'numpy.float64'>'
with 19 stored elements in Compressed Sparse ... format>

While the tf–idf normalization is often very useful, there might be cases where the binary occurrence markers might
offer better features. This can be achieved by using the binary parameter of CountVectorizer. In particular,
some estimators such as Bernoulli Naive Bayes explicitly model discrete boolean random variables. Also, very short
texts are likely to have noisy tf–idf values while the binary occurrence info is more stable.
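As a small illustration reusing the toy corpus from above, setting binary=True clips all counts to 0/1; compare with the earlier count matrix, where the repeated word 'second' was counted twice (the exact dtype suffix of the printed array may vary):

>>> binary_vectorizer = CountVectorizer(binary=True)
>>> binary_vectorizer.fit_transform(corpus).toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)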
As usual the best way to adjust the feature extraction parameters is to use a cross-validated grid search, for instance by
pipelining the feature extractor with a classifier:
• Sample pipeline for text feature extraction and evaluation
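A minimal, hedged sketch of such a grid search follows; the parameter grid and the choice of classifier are arbitrary illustrations rather than the referenced example:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.model_selection import GridSearchCV
>>> text_clf = Pipeline([('tfidf', TfidfVectorizer()),
...                      ('clf', SGDClassifier())])
>>> parameters = {'tfidf__ngram_range': [(1, 1), (1, 2)],
...               'tfidf__use_idf': (True, False),
...               'clf__alpha': (1e-2, 1e-3)}
>>> grid_search = GridSearchCV(text_clf, parameters, cv=3)
>>> # grid_search.fit(documents, labels) would then select the best combination

Here documents and labels stand for your own training texts and targets.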
Decoding text files
Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding.
To work with text files in Python, their bytes must be decoded to a character set called Unicode. Common encodings
are ASCII, Latin-1 (Western Europe), KOI8-R (Russian) and the universal encodings UTF-8 and UTF-16. Many
others exist.
Note: An encoding can also be called a ‘character set’, but this term is less accurate: several encodings can exist for
a single character set.
The text feature extractors in scikit-learn know how to decode text files, but only if you tell them what encoding the
files are in. The CountVectorizer takes an encoding parameter for this purpose. For modern text files, the
correct encoding is probably UTF-8, which is therefore the default (encoding="utf-8").
If the text you are loading is not actually encoded with UTF-8, however, you will get a UnicodeDecodeError.
The vectorizers can be told to be silent about decoding errors by setting the decode_error parameter to either
"ignore" or "replace". See the documentation for the Python function bytes.decode for more details (type
help(bytes.decode) at the Python prompt).


If you are having trouble decoding text, here are some things to try:
• Find out what the actual encoding of the text is. The file might come with a header or README that tells you
the encoding, or there might be some standard encoding you can assume based on where the text comes from.
• You may be able to find out what kind of encoding it is in general using the UNIX command file. The Python
chardet module comes with a script called chardetect.py that will guess the specific encoding, though
you cannot rely on its guess being correct.
• You could try UTF-8 and disregard the errors. You can decode byte strings with bytes.decode(errors='replace') to replace all decoding errors with a meaningless character, or set decode_error='replace' in the vectorizer. This may damage the usefulness of your features.
• Real text may come from a variety of sources that may have used different encodings, or even be sloppily
decoded in a different encoding than the one it was encoded with. This is common in text retrieved from the
Web. The Python package ftfy can automatically sort out some classes of decoding errors, so you could try
decoding the unknown text as latin-1 and then using ftfy to fix errors.
• If the text is in a mish-mash of encodings that is simply too hard to sort out (which is the case for the 20
Newsgroups dataset), you can fall back on a simple single-byte encoding such as latin-1. Some text may
display incorrectly, but at least the same sequence of bytes will always represent the same feature.
For example, the following snippet uses chardet (not shipped with scikit-learn, must be installed separately) to
figure out the encoding of three texts. It then vectorizes the texts and prints the learned vocabulary. The output is not
shown here.
>>> import chardet
>>> text1 = b"Sei mir gegr\xc3\xbc\xc3\x9ft mein Sauerkraut"
>>> text2 = b"holdselig sind deine Ger\xfcche"
>>> text3 = b"\xff\xfeA\x00u\x00f\x00 \x00F\x00l\x00\xfc\x00g\x00e\x00l\x00n\x00
˓→\x00d\x00e\x00s\x00 \x00G\x00e\x00s\x00a\x00n\x00g\x00e\x00s\x00,\x00
˓→\x00H\x00e\x00r\x00z\x00l\x00i\x00e\x00b\x00c\x00h\x00e\x00n\x00,\x00
˓→\x00t\x00r\x00a\x00g\x00 \x00i\x00c\x00h\x00 \x00d\x00i\x00c\x00h\x00
˓→\x00f\x00o\x00r\x00t\x00"
>>> decoded = [x.decode(chardet.detect(x)['encoding'])
...            for x in (text1, text2, text3)]
>>> v = CountVectorizer().fit(decoded).vocabulary_
>>> for term in v: print(v)

(Depending on the version of chardet, it might get the first one wrong.)
For an introduction to Unicode and character encodings in general, see Joel Spolsky’s Absolute Minimum Every
Software Developer Must Know About Unicode.
Applications and examples
The bag of words representation is quite simplistic but surprisingly useful in practice.
In particular in a supervised setting it can be successfully combined with fast and scalable linear models to train
document classifiers, for instance:
• Classification of text documents using sparse features
In an unsupervised setting it can be used to group similar documents together by applying clustering algorithms such
as K-means:
• Clustering text documents using k-means
Finally it is possible to discover the main topics of a corpus by relaxing the hard assignment constraint of clustering,
for instance by using Non-negative matrix factorization (NMF or NNMF):


• Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation
Limitations of the Bag of Words representation
A collection of unigrams (what bag of words is) cannot capture phrases and multi-word expressions, effectively disregarding any word order dependence. Additionally, the bag of words model doesn’t account for potential misspellings
or word derivations.
N-grams to the rescue! Instead of building a simple collection of unigrams (n=1), one might prefer a collection of
bigrams (n=2), where occurrences of pairs of consecutive words are counted.
One might alternatively consider a collection of character n-grams, a representation resilient against misspellings and
derivations.
For example, let’s say we’re dealing with a corpus of two documents: ['words', 'wprds']. The second document contains a misspelling of the word ‘words’. A simple bag of words representation would consider these two as
very distinct documents, differing in both of the two possible features. A character 2-gram representation, however,
would find the documents matching in 4 out of 8 features, which may help the preferred classifier decide better:
>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
>>> counts = ngram_vectorizer.fit_transform(['words', 'wprds'])
>>> ngram_vectorizer.get_feature_names() == (
...
[' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp'])
True
>>> counts.toarray().astype(int)
array([[1, 1, 1, 0, 1, 1, 1, 0],
[1, 1, 0, 1, 1, 1, 0, 1]])

In the above example, the 'char_wb' analyzer is used, which creates n-grams only from characters inside word boundaries (padded with space on each side). The 'char' analyzer, alternatively, creates n-grams that span across words:
>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(5, 5))
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
...
<1x4 sparse matrix of type '<... 'numpy.int64'>'
with 4 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
...
[' fox ', ' jump', 'jumpy', 'umpy '])
True
>>> ngram_vectorizer = CountVectorizer(analyzer='char', ngram_range=(5, 5))
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
...
<1x5 sparse matrix of type '<... 'numpy.int64'>'
with 5 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
...
['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox'])
True

The word boundaries-aware variant char_wb is especially interesting for languages that use white-spaces for word
separation as it generates significantly less noisy features than the raw char variant in that case. For such languages
it can increase both the predictive accuracy and convergence speed of classifiers trained using such features while
retaining the robustness with regards to misspellings and word derivations.
While some local positioning information can be preserved by extracting n-grams instead of individual words, bag of
words and bag of n-grams destroy most of the inner structure of the document and hence most of the meaning carried
by that internal structure.

516

Chapter 3. User Guide

scikit-learn user guide, Release 0.19.1

In order to address the wider task of Natural Language Understanding, the local structure of sentences and paragraphs
should thus be taken into account. Many such models will thus be cast as “Structured output” problems which are
currently outside of the scope of scikit-learn.
Vectorizing a large text corpus with the hashing trick
The above vectorization scheme is simple but the fact that it holds an in-memory mapping from the string tokens
to the integer feature indices (the vocabulary_ attribute) causes several problems when dealing with large
datasets:
• the larger the corpus, the larger the vocabulary will grow and hence the memory use too,
• fitting requires the allocation of intermediate data structures of size proportional to that of the original dataset.
• building the word-mapping requires a full pass over the dataset hence it is not possible to fit text classifiers in a
strictly online manner.
• pickling and un-pickling vectorizers with a large vocabulary_ can be very slow (typically much slower than
pickling / un-pickling flat data structures such as a NumPy array of the same size),
• it is not easily possible to split the vectorization work into concurrent sub tasks as the vocabulary_ attribute
would have to be a shared state with a fine grained synchronization barrier: the mapping from token string
to feature index is dependent on ordering of the first occurrence of each token hence would have to be shared,
potentially harming the concurrent workers’ performance to the point of making them slower than the sequential
variant.
It is possible to overcome those limitations by combining the “hashing trick” (Feature hashing) implemented by the
sklearn.feature_extraction.FeatureHasher class and the text preprocessing and tokenization features
of the CountVectorizer.
This combination is implemented in HashingVectorizer, a transformer class that is mostly API compatible with
CountVectorizer. HashingVectorizer is stateless, meaning that you don’t have to call fit on it:
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> hv = HashingVectorizer(n_features=10)
>>> hv.transform(corpus)
...
<4x10 sparse matrix of type '<... 'numpy.float64'>'
with 16 stored elements in Compressed Sparse ... format>

You can see that 16 non-zero feature tokens were extracted in the vector output: this is less than the 19 non-zeros
extracted previously by the CountVectorizer on the same toy corpus. The discrepancy comes from hash function
collisions because of the low value of the n_features parameter.
In a real world setting, the n_features parameter can be left to its default value of 2 ** 20 (roughly one million
possible features). If memory or downstream models size is an issue selecting a lower value such as 2 ** 18 might
help without introducing too many additional collisions on typical text classification tasks.
Note that the dimensionality does not affect the CPU training time of algorithms which operate on CSR matrices
(LinearSVC(dual=True), Perceptron, SGDClassifier, PassiveAggressive) but it does for algorithms that work with CSC matrices (LinearSVC(dual=False), Lasso(), etc).
Let’s try again with the default setting:
>>> hv = HashingVectorizer()
>>> hv.transform(corpus)
...
<4x1048576 sparse matrix of type '<... 'numpy.float64'>'
with 19 stored elements in Compressed Sparse ... format>


We no longer get the collisions, but this comes at the expense of a much larger dimensionality of the output space. Of
course, other terms than the 19 used here might still collide with each other.
The HashingVectorizer also comes with the following limitations:
• it is not possible to invert the model (no inverse_transform method), nor to access the original string
representation of the features, because of the one-way nature of the hash function that performs the mapping.
• it does not provide IDF weighting as that would introduce statefulness in the model. A TfidfTransformer
can be appended to it in a pipeline if required.
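For the second point, a minimal sketch of appending a TfidfTransformer in a pipeline (reusing the toy corpus from above; the n_features value is an arbitrary illustrative choice):

>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
>>> hashing_tfidf = make_pipeline(HashingVectorizer(n_features=2 ** 10),
...                               TfidfTransformer())
>>> hashing_tfidf.fit_transform(corpus).shape
(4, 1024)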
Performing out-of-core scaling with HashingVectorizer
An interesting development of using a HashingVectorizer is the ability to perform out-of-core scaling. This
means that we can learn from data that does not fit into the computer’s main memory.
A strategy to implement out-of-core scaling is to stream data to the estimator in mini-batches. Each mini-batch is
vectorized using HashingVectorizer so as to guarantee that the input space of the estimator has always the same
dimensionality. The amount of memory used at any time is thus bounded by the size of a mini-batch. Although there is
no limit to the amount of data that can be ingested using such an approach, from a practical point of view the learning
time is often limited by the CPU time one wants to spend on the task.
For a full-fledged example of out-of-core scaling in a text classification task see Out-of-core classification of text
documents.
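A hedged sketch of this mini-batch pattern is shown below; get_minibatches and all_classes are hypothetical placeholders for your own data stream and label set, not scikit-learn API:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer()   # stateless: no pass over the full corpus is required
classifier = SGDClassifier()

for text_batch, y_batch in get_minibatches():     # hypothetical generator of (texts, labels)
    X_batch = vectorizer.transform(text_batch)     # every batch gets the same dimensionality
    classifier.partial_fit(X_batch, y_batch, classes=all_classes)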
Customizing the vectorizer classes
It is possible to customize the behavior by passing a callable to the vectorizer constructor:
>>> def my_tokenizer(s):
...     return s.split()
...
>>> vectorizer = CountVectorizer(tokenizer=my_tokenizer)
>>> vectorizer.build_analyzer()(u"Some... punctuation!") == (
...     ['some...', 'punctuation!'])
True

In particular we name:
• preprocessor: a callable that takes an entire document as input (as a single string), and returns a possibly
transformed version of the document, still as an entire string. This can be used to remove HTML tags, lowercase
the entire document, etc.
• tokenizer: a callable that takes the output from the preprocessor and splits it into tokens, then returns a list
of these.
• analyzer: a callable that replaces the preprocessor and tokenizer. The default analyzers all call the preprocessor and tokenizer, but custom analyzers will skip this. N-gram extraction and stop word filtering take place
at the analyzer level, so a custom analyzer may have to reproduce these steps.
(Lucene users might recognize these names, but be aware that scikit-learn concepts may not map one-to-one onto
Lucene concepts.)
To make the preprocessor, tokenizer and analyzers aware of the model parameters it is possible to derive from the
class and override the build_preprocessor, build_tokenizer and build_analyzer factory methods
instead of passing custom functions.
Some tips and tricks:


• If documents are pre-tokenized by an external package, then store them in files (or strings) with the tokens
separated by whitespace and pass analyzer=str.split
• Fancy token-level analysis such as stemming, lemmatizing, compound splitting, filtering based on part-of-speech, etc. are not included in the scikit-learn codebase, but can be added by customizing either the tokenizer
or the analyzer. Here’s a CountVectorizer with a tokenizer and lemmatizer using NLTK:
>>> from nltk import word_tokenize
>>> from nltk.stem import WordNetLemmatizer
>>> class LemmaTokenizer(object):
...     def __init__(self):
...         self.wnl = WordNetLemmatizer()
...     def __call__(self, doc):
...         return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
...
>>> vect = CountVectorizer(tokenizer=LemmaTokenizer())

(Note that this will not filter out punctuation.)
The following example will, for instance, transform some British spelling to American spelling:
>>> import re
>>> def to_british(tokens):
...     for t in tokens:
...         t = re.sub(r"(...)our$", r"\1or", t)
...         t = re.sub(r"([bt])re$", r"\1er", t)
...         t = re.sub(r"([iy])s(e$|ing|ation)", r"\1z\2", t)
...         t = re.sub(r"ogue$", "og", t)
...         yield t
...
>>> class CustomVectorizer(CountVectorizer):
...     def build_tokenizer(self):
...         tokenize = super(CustomVectorizer, self).build_tokenizer()
...         return lambda doc: list(to_british(tokenize(doc)))
...
>>> print(CustomVectorizer().build_analyzer()(u"color colour"))
[...'color', ...'color']

The same approach can be used for other styles of preprocessing; examples include stemming, lemmatization, or normalizing numerical tokens, with the latter illustrated in:
– Biclustering documents with the Spectral Co-clustering algorithm
Customizing the vectorizer can also be useful when handling Asian languages that do not use an explicit word separator
such as whitespace.
Image feature extraction
Patch extraction
The extract_patches_2d function extracts patches from an image stored as a two-dimensional array, or
three-dimensional with color information along the third axis. For rebuilding an image from all its patches, use
reconstruct_from_patches_2d. For example, let us generate a 4x4 pixel picture with 3 color channels (e.g.
in RGB format):
>>> import numpy as np
>>> from sklearn.feature_extraction import image


>>> one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
>>> one_image[:, :, 0]  # R channel of a fake RGB picture
array([[ 0,  3,  6,  9],
       [12, 15, 18, 21],
       [24, 27, 30, 33],
       [36, 39, 42, 45]])
>>> patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2,
...                                    random_state=0)
>>> patches.shape
(2, 2, 2, 3)
>>> patches[:, :, :, 0]
array([[[ 0,  3],
        [12, 15]],

       [[15, 18],
        [27, 30]]])
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> patches.shape
(9, 2, 2, 3)
>>> patches[4, :, :, 0]
array([[15, 18],
       [27, 30]])

Let us now try to reconstruct the original image from the patches by averaging on overlapping areas:
>>> reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3))
>>> np.testing.assert_array_equal(one_image, reconstructed)

The PatchExtractor class works in the same way as extract_patches_2d, only it supports multiple images
as input. It is implemented as an estimator, so it can be used in pipelines. See:
>>> five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)
>>> patches = image.PatchExtractor((2, 2)).transform(five_images)
>>> patches.shape
(45, 2, 2, 3)

Connectivity graph of an image
Several estimators in scikit-learn can use connectivity information between features or samples. For instance Ward
clustering (Hierarchical clustering) can cluster together only neighboring pixels of an image, thus forming contiguous
patches:
For this purpose, the estimators use a ‘connectivity’ matrix, giving which samples are connected.
The function img_to_graph returns such a matrix from a 2D or 3D image. Similarly, grid_to_graph builds a connectivity matrix for images given the shape of these images.
These matrices can be used to impose connectivity in estimators that use connectivity information, such as Ward
clustering (Hierarchical clustering), but also to build precomputed kernels, or similarity matrices.
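As a minimal sketch (the random 8x8 array img below is a stand-in for a real grayscale image), such a connectivity matrix can be passed to agglomerative clustering:

>>> import numpy as np
>>> from sklearn.feature_extraction.image import grid_to_graph
>>> from sklearn.cluster import AgglomerativeClustering
>>> img = np.random.RandomState(0).rand(8, 8)
>>> connectivity = grid_to_graph(*img.shape)  # pixels connected to their grid neighbours
>>> ward = AgglomerativeClustering(n_clusters=4, linkage='ward',
...                                connectivity=connectivity)
>>> labels = ward.fit_predict(img.reshape(-1, 1))  # one pixel intensity per sample
>>> labels.shape
(64,)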
Note: Examples
• A demo of structured Ward hierarchical clustering on a raccoon face image
• Spectral clustering for image segmentation


• Feature agglomeration vs. univariate selection

3.4.3 Preprocessing data
The sklearn.preprocessing package provides several common utility functions and transformer classes to
change raw feature vectors into a representation that is more suitable for the downstream estimators.
In general, learning algorithms benefit from standardization of the data set. If some outliers are present in the set, robust
scalers or transformers are more appropriate. The behaviors of the different scalers, transformers, and normalizers on
a dataset containing marginal outliers is highlighted in Compare the effect of different scalers on data with outliers.
Standardization, or mean removal and variance scaling
Standardization of datasets is a common requirement for many machine learning estimators implemented in
scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean
value of each feature, then scale it by dividing non-constant features by their standard deviation.
For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support
Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and
have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might
dominate the objective function and make the estimator unable to learn from other features correctly as expected.
The function scale provides a quick and easy way to perform this operation on a single array-like dataset:
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X_train)

>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])

Scaled data has zero mean and unit variance:


>>> X_scaled.mean(axis=0)
array([ 0., 0., 0.])
>>> X_scaled.std(axis=0)
array([ 1., 1., 1.])

The preprocessing module further provides a utility class StandardScaler that implements the
Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply
the same transformation on the testing set. This class is hence suitable for use in the early steps of a sklearn.
pipeline.Pipeline:
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> scaler.mean_
array([ 1. ...,  0. ...,  0.33...])
>>> scaler.scale_
array([ 0.81...,  0.81...,  1.24...])

>>> scaler.transform(X_train)
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])

The scaler instance can then be used on new data to transform it the same way it did on the training set:
>>> X_test = [[-1., 1., 0.]]
>>> scaler.transform(X_test)
array([[-2.44..., 1.22..., -0.26...]])

It is possible to disable either centering or scaling by either passing with_mean=False or with_std=False to
the constructor of StandardScaler.
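For instance, a minimal sketch of such a pipeline (the iris dataset and the LogisticRegression classifier are chosen
here purely for illustration) could look like:

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = load_iris(return_X_y=True)
>>> X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
>>> pipe = make_pipeline(StandardScaler(), LogisticRegression())
>>> pipe = pipe.fit(X_tr, y_tr)     # mean and scale are learned on the training split only
>>> score = pipe.score(X_te, y_te)  # the test split is standardized with the training statistics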
Scaling features to a range
An alternative standardization is scaling features to lie between a given minimum and maximum value, often between
zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using
MinMaxScaler or MaxAbsScaler, respectively.
The motivations to use this scaling include robustness to very small standard deviations of features and preserving zero
entries in sparse data.
Here is an example to scale a toy data matrix to the [0, 1] range:
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
...
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])


The same instance of the transformer can then be applied to some new test data unseen during the fit call: the same
scaling and shifting operations will be applied to be consistent with the transformation performed on the train data:
>>> X_test = np.array([[ -3., -1.,  4.]])
>>> X_test_minmax = min_max_scaler.transform(X_test)
>>> X_test_minmax
array([[-1.5       ,  0.        ,  1.66666667]])

It is possible to introspect the scaler attributes to find out about the exact nature of the transformation learned on the
training data:
>>> min_max_scaler.scale_
array([ 0.5       ,  0.5       ,  0.33...])

>>> min_max_scaler.min_
array([ 0.        ,  0.5       ,  0.33...])

If MinMaxScaler is given an explicit feature_range=(min, max) the full formula is:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

MaxAbsScaler works in a very similar fashion, but scales the training data so that it lies within the range [-1,
1], by dividing each feature by its maximum absolute value. It is meant for data that is already centered at zero
or sparse data.
Here is how to use the toy data from the previous example with this scaler:
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
...
>>> max_abs_scaler = preprocessing.MaxAbsScaler()
>>> X_train_maxabs = max_abs_scaler.fit_transform(X_train)
>>> X_train_maxabs
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
>>> X_test = np.array([[ -3., -1.,  4.]])
>>> X_test_maxabs = max_abs_scaler.transform(X_test)
>>> X_test_maxabs
array([[-1.5, -1. ,  2. ]])
>>> max_abs_scaler.scale_
array([ 2.,  1.,  2.])

As with scale, the module further provides convenience functions minmax_scale and maxabs_scale if you
don’t want to create an object.
Scaling sparse data
Centering sparse data would destroy the sparseness structure in the data, and thus rarely is a sensible thing to do.
However, it can make sense to scale sparse inputs, especially if features are on different scales.
MaxAbsScaler and maxabs_scale were specifically designed for scaling sparse data, and are the recommended
way to go about this. However, scale and StandardScaler can accept scipy.sparse matrices as input, as
long as with_mean=False is explicitly passed to the constructor. Otherwise a ValueError will be raised as


silently centering would break the sparsity and would often crash the execution by allocating excessive amounts of
memory unintentionally. RobustScaler cannot be fitted to sparse inputs, but you can use the transform method
on sparse inputs.
Note that the scalers accept both Compressed Sparse Rows and Compressed Sparse Columns format (see scipy.
sparse.csr_matrix and scipy.sparse.csc_matrix). Any other sparse input will be converted to the
Compressed Sparse Rows representation. To avoid unnecessary memory copies, it is recommended to choose the
CSR or CSC representation upstream.
Finally, if the centered data is expected to be small enough, explicitly converting the input to an array using the
toarray method of sparse matrices is another option.
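As a small sketch of scaling sparse input (the data matrix below is made up for illustration):

>>> import scipy.sparse as sp
>>> X_sparse = sp.csr_matrix([[ 1., -2.,  2.],
...                           [ 0.,  4.,  0.],
...                           [ 3.,  0., -1.]])
>>> max_abs_scaler = preprocessing.MaxAbsScaler()
>>> X_scaled = max_abs_scaler.fit_transform(X_sparse)  # result stays a sparse CSR matrix
>>> max_abs_scaler.scale_        # per-feature maximum absolute value
array([ 3.,  4.,  2.])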
Scaling data with outliers
If your data contains many outliers, scaling using the mean and variance of the data is likely to not work very well.
In these cases, you can use robust_scale and RobustScaler as drop-in replacements instead. They use more
robust estimates for the center and range of your data.
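A minimal sketch (with a made-up outlier in the first feature) could look like:

>>> import numpy as np
>>> X_outliers = np.array([[  1., -2.,  2.],
...                        [ -2.,  1.,  3.],
...                        [  4.,  1., -2.],
...                        [100.,  1.,  7.]])   # the first feature contains a large outlier
>>> robust_scaler = preprocessing.RobustScaler()
>>> X_robust = robust_scaler.fit_transform(X_outliers)
>>> # centering uses the per-feature median, scaling uses the interquartile range,
>>> # so the outlier has little influence on the learned transformation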
References:
Further discussion on the importance of centering and scaling data is available on this FAQ: Should I normalize/standardize/rescale the data?

Scaling vs Whitening
It is sometimes not enough to center and scale the features independently, since a downstream model can further
make some assumption on the linear independence of the features.
To address this issue you can use sklearn.decomposition.PCA or sklearn.decomposition.
RandomizedPCA with whiten=True to further remove the linear correlation across features.

Scaling target variables in regression
scale and StandardScaler work out-of-the-box with 1d arrays. This is very useful for scaling the target /
response variables used for regression.

Centering kernel matrices
If you have a kernel matrix of a kernel 𝐾 that computes a dot product in a feature space defined by a function 𝜑, a
KernelCenterer can transform the kernel matrix so that it contains inner products in the feature space defined by
𝜑 followed by removal of the mean in that space.
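As a small sketch (the linear kernel matrix built with pairwise_kernels is used purely for illustration):

>>> import numpy as np
>>> from sklearn.preprocessing import KernelCenterer
>>> from sklearn.metrics.pairwise import pairwise_kernels
>>> X = np.array([[ 1., -2.,  2.],
...               [-2.,  1.,  3.],
...               [ 4.,  1., -2.]])
>>> K = pairwise_kernels(X, metric='linear')     # K[i, j] is the dot product of X[i] and X[j]
>>> K_centered = KernelCenterer().fit_transform(K)
>>> # equivalent to computing the linear kernel on mean-centered data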
Non-linear transformation
Like scalers, QuantileTransformer puts each feature into the same range or distribution. However, by performing a rank transformation, it smooths out unusual distributions and is less influenced by outliers than scaling methods.
It does, however, distort correlations and distances within and across features.


QuantileTransformer and quantile_transform provide a non-parametric transformation based on the
quantile function to map the data to a uniform distribution with values between 0 and 1:
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
>>> X_train_trans = quantile_transformer.fit_transform(X_train)
>>> X_test_trans = quantile_transformer.transform(X_test)
>>> np.percentile(X_train[:, 0], [0, 25, 50, 75, 100])
array([ 4.3, 5.1, 5.8, 6.5, 7.9])

This feature corresponds to the sepal length in cm. Once the quantile transformation is applied, those landmarks
closely approach the previously defined percentiles:
>>> np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100])
...
array([ 0.00... , 0.24..., 0.49..., 0.73..., 0.99... ])

This can be confirmed on an independent testing set, with similar observations:
>>> np.percentile(X_test[:, 0], [0, 25, 50, 75, 100])
...
array([ 4.4 , 5.125, 5.75 , 6.175, 7.3 ])
>>> np.percentile(X_test_trans[:, 0], [0, 25, 50, 75, 100])
...
array([ 0.01..., 0.25..., 0.46..., 0.60... , 0.94...])

It is also possible to map the transformed data to a normal distribution by setting
output_distribution='normal':

>>> quantile_transformer = preprocessing.QuantileTransformer(
...     output_distribution='normal', random_state=0)
>>> X_trans = quantile_transformer.fit_transform(X)
>>> quantile_transformer.quantiles_
array([[ 4.3...,   2...,     1...,     0.1...],
       [ 4.31...,  2.02...,  1.01...,  0.1...],
       [ 4.32...,  2.05...,  1.02...,  0.1...],
       ...,
       [ 7.84...,  4.34...,  6.84...,  2.5...],
       [ 7.87...,  4.37...,  6.87...,  2.5...],
       [ 7.9...,   4.4...,   6.9...,   2.5...]])

Thus the median of the input becomes the mean of the output, centered at 0. The normal output is clipped so that the
input’s minimum and maximum — corresponding to the 1e-7 and 1 - 1e-7 quantiles respectively — do not become
infinite under the transformation.
Normalization
Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan
to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.
This assumption is the base of the Vector Space Model often used in text classification and clustering contexts.
The function normalize provides a quick and easy way to perform this operation on a single array-like dataset,
either using the l1 or l2 norms:

>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X_normalized = preprocessing.normalize(X, norm='l2')
>>> X_normalized
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])

The preprocessing module further provides a utility class Normalizer that implements the same operation
using the Transformer API (even though the fit method is useless in this case: the class is stateless as this
operation treats samples independently).
This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline:
>>> normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
>>> normalizer
Normalizer(copy=True, norm='l2')

The normalizer instance can then be used on sample vectors as any transformer:
>>> normalizer.transform(X)
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])

>>> normalizer.transform([[-1., 1., 0.]])
array([[-0.70...,  0.70...,  0.  ...]])

Sparse input
normalize and Normalizer accept both dense array-like and sparse matrices from scipy.sparse as input.
For sparse input the data is converted to the Compressed Sparse Rows representation (see scipy.sparse.
csr_matrix) before being fed to efficient Cython routines. To avoid unnecessary memory copies, it is recommended to choose the CSR representation upstream.

Binarization
Feature binarization
Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for
downstream probabilistic estimators that assume that the input data is distributed according to a multi-variate
Bernoulli distribution. For instance, this is the case for the sklearn.neural_network.BernoulliRBM .
It is also common among the text processing community to use binary feature values (probably to simplify the probabilistic reasoning) even if normalized counts (a.k.a. term frequencies) or TF-IDF valued features often perform slightly
better in practice.
As for the Normalizer, the utility class Binarizer is meant to be used in the early stages of sklearn.
pipeline.Pipeline. The fit method does nothing as each sample is treated independently of others:


>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
>>> binarizer
Binarizer(copy=True, threshold=0.0)

>>> binarizer.transform(X)
array([[ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])

It is possible to adjust the threshold of the binarizer:
>>> binarizer = preprocessing.Binarizer(threshold=1.1)
>>> binarizer.transform(X)
array([[ 0., 0., 1.],
[ 1., 0., 0.],
[ 0., 0., 0.]])

As for the StandardScaler and Normalizer classes, the preprocessing module provides a companion function
binarize to be used when the transformer API is not necessary.
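For instance, reusing the toy X above, a quick sketch with the binarize function (the threshold of 0.5 is an
arbitrary illustrative choice) would be:

>>> preprocessing.binarize(X, threshold=0.5)
array([[ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])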
Sparse input
binarize and Binarizer accept both dense array-like and sparse matrices from scipy.sparse as input.
For sparse input the data is converted to the Compressed Sparse Rows representation (see scipy.sparse.
csr_matrix). To avoid unnecessary memory copies, it is recommended to choose the CSR representation upstream.

Encoding categorical features
Often features are not given as continuous values but categorical. For example a person could have features ["male", "female"], ["from Europe", "from US", "from Asia"], ["uses Firefox",
"uses Chrome", "uses Safari", "uses Internet Explorer"]. Such features can be efficiently
coded as integers, for instance ["male", "from US", "uses Internet Explorer"] could be expressed
as [0, 1, 3] while ["female", "from Asia", "uses Chrome"] would be [1, 2, 1].
Such integer representation can not be used directly with scikit-learn estimators, as these expect continuous input,
and would interpret the categories as being ordered, which is often not desired (i.e. the set of browsers was ordered
arbitrarily).
One possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is implemented in OneHotEncoder. This estimator transforms each categorical
feature with m possible values into m binary features, with only one active.
Continuing the example above:
>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
>>> enc.transform([[0, 1, 3]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 0., 0., 1.]])


By default, how many values each feature can take is inferred automatically from the dataset. It is possible to specify
this explicitly using the parameter n_values. There are two genders, three possible continents and four web browsers
in our dataset. Then we fit the estimator, and transform a data point. In the result, the first two numbers encode the
gender, the next set of three numbers the continent and the last four the web browser.
Note that, if there is a possibility that the training data might have missing categorical features, one has to explicitly
set n_values. For example,
>>> enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
>>> # Note that there are missing categorical values for the 2nd and 3rd
>>> # features
>>> enc.fit([[1, 2, 3], [0, 2, 0]])
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values=[2, 3, 4], sparse=True)
>>> enc.transform([[1, 0, 0]]).toarray()
array([[ 0., 1., 1., 0., 0., 1., 0., 0., 0.]])

See Loading features from dicts for categorical features that are represented as a dict, not as integers.
Imputation of missing values
For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array
are numerical, and that all have and hold meaning. A basic strategy to use incomplete datasets is to discard entire rows
and/or columns containing missing values. However, this comes at the price of losing data which may be valuable
(even though incomplete). A better strategy is to impute the missing values, i.e., to infer them from the known part of
the data.
The Imputer class provides basic strategies for imputing missing values, either using the mean, the median or the
most frequent value of the row or column in which the missing values are located. This class also allows for different
missing values encodings.
The following snippet demonstrates how to replace missing values, encoded as np.nan, using the mean value of the
columns (axis 0) that contain the missing values:
>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))
[[ 4.          2.        ]
 [ 6.          3.666...]
 [ 7.          6.        ]]

The Imputer class also supports sparse matrices:
>>> import scipy.sparse as sp
>>> X = sp.csc_matrix([[1, 2], [0, 3], [7, 6]])
>>> imp = Imputer(missing_values=0, strategy='mean', axis=0)
>>> imp.fit(X)
Imputer(axis=0, copy=True, missing_values=0, strategy='mean', verbose=0)
>>> X_test = sp.csc_matrix([[0, 2], [6, 0], [7, 6]])
>>> print(imp.transform(X_test))
[[ 4.          2.        ]
 [ 6.          3.666...]
 [ 7.          6.        ]]

Note that, here, missing values are encoded by 0 and are thus implicitly stored in the matrix. This format is thus
suitable when there are many more missing values than observed values.
Imputer can be used in a Pipeline as a way to build a composite estimator that supports imputation. See Imputing
missing values before building an estimator.
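A minimal sketch of such a composite estimator (the regressor choice and the toy data are purely illustrative) might
look like:

>>> import numpy as np
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import Imputer
>>> from sklearn.tree import DecisionTreeRegressor
>>> X = [[np.nan, 2.], [6., np.nan], [7., 6.], [3., 5.]]
>>> y = [1., 2., 3., 4.]
>>> pipe = Pipeline([('imputer', Imputer(strategy='mean')),
...                  ('regressor', DecisionTreeRegressor(random_state=0))])
>>> pipe = pipe.fit(X, y)                  # missing entries are imputed before the tree is fit
>>> pred = pipe.predict([[np.nan, 4.]])    # new samples are imputed in the same way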
Generating polynomial features
Often it’s useful to add complexity to the model by considering nonlinear features of the input data. A simple and
common method is to use polynomial features, which can capture higher-order and interaction terms of the features. They are implemented
in PolynomialFeatures:
>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
[2, 3],
[4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)
array([[  1.,   0.,   1.,   0.,   0.,   1.],
       [  1.,   2.,   3.,   4.,   6.,   9.],
       [  1.,   4.,   5.,  16.,  20.,  25.]])

The features of X have been transformed from (𝑋₁, 𝑋₂) to (1, 𝑋₁, 𝑋₂, 𝑋₁², 𝑋₁𝑋₂, 𝑋₂²).
In some cases, only interaction terms among features are required, and they can be obtained by setting
interaction_only=True:
>>> X = np.arange(9).reshape(3, 3)
>>> X
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> poly = PolynomialFeatures(degree=3, interaction_only=True)
>>> poly.fit_transform(X)
array([[   1.,    0.,    1.,    2.,    0.,    0.,    2.,    0.],
       [   1.,    3.,    4.,    5.,   12.,   15.,   20.,   60.],
       [   1.,    6.,    7.,    8.,   42.,   48.,   56.,  336.]])

The features of X have been transformed from (𝑋₁, 𝑋₂, 𝑋₃) to (1, 𝑋₁, 𝑋₂, 𝑋₃, 𝑋₁𝑋₂, 𝑋₁𝑋₃, 𝑋₂𝑋₃, 𝑋₁𝑋₂𝑋₃).
Note that polynomial features are used implicitly in kernel methods (e.g., sklearn.svm.SVC, sklearn.
decomposition.KernelPCA) when using polynomial Kernel functions.
See Polynomial interpolation for Ridge regression using created polynomial features.
Custom transformers
Often, you will want to convert an existing Python function into a transformer to assist in data cleaning or processing.
You can implement a transformer from an arbitrary function with FunctionTransformer. For example, to build
a transformer that applies a log transformation in a pipeline, do:


>>> import numpy as np
>>> from sklearn.preprocessing import FunctionTransformer
>>> transformer = FunctionTransformer(np.log1p)
>>> X = np.array([[0, 1], [2, 3]])
>>> transformer.transform(X)
array([[ 0.        ,  0.69314718],
       [ 1.09861229,  1.38629436]])

For a full code example that demonstrates using a FunctionTransformer to do custom feature selection, see
Using FunctionTransformer to select columns
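FunctionTransformer can also be chained with other preprocessing steps; a small sketch (the log1p/scaler combination
is just an example) is:

>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> log_and_scale = make_pipeline(FunctionTransformer(np.log1p), StandardScaler())
>>> X = np.array([[0., 1.], [2., 3.], [4., 5.]])
>>> X_trans = log_and_scale.fit_transform(X)   # log1p is applied first, then standardization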

3.4.4 Unsupervised dimensionality reduction
If your number of features is high, it may be useful to reduce it with an unsupervised step prior to supervised steps.
Many of the Unsupervised learning methods implement a transform method that can be used to reduce the dimensionality. Below we discuss two specific examples of this pattern that are heavily used.
Pipelining
The unsupervised data reduction and the supervised estimator can be chained in one step. See Pipeline: chaining
estimators.
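For instance, a minimal sketch chaining a PCA reduction with a classifier (the digits dataset and LogisticRegression
are used only for illustration):

>>> from sklearn.datasets import load_digits
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.decomposition import PCA
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = load_digits(return_X_y=True)
>>> reduce_and_classify = make_pipeline(PCA(n_components=20), LogisticRegression())
>>> reduce_and_classify = reduce_and_classify.fit(X, y)   # PCA is fit first, its output feeds the classifier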

PCA: principal component analysis
decomposition.PCA looks for a combination of features that capture well the variance of the original features.
See Decomposing signals in components (matrix factorization problems).
Examples
• Faces recognition example using eigenfaces and SVMs

Random projections
The sklearn.random_projection module provides several tools for data reduction by random projections. See the
relevant section of the documentation: Random Projection.
Examples
• The Johnson-Lindenstrauss bound for embedding with random projections

Feature agglomeration
cluster.FeatureAgglomeration applies Hierarchical clustering to group together features that behave similarly.


Examples
• Feature agglomeration vs. univariate selection
• Feature agglomeration

Feature scaling
Note that if features have very different scaling or statistical properties, cluster.FeatureAgglomeration
may not be able to capture the links between related features. Using a preprocessing.StandardScaler
can be useful in these settings.
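A small sketch combining both (the digits dataset and the number of clusters are illustrative choices):

>>> from sklearn.datasets import load_digits
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.cluster import FeatureAgglomeration
>>> X, y = load_digits(return_X_y=True)
>>> agglo = make_pipeline(StandardScaler(), FeatureAgglomeration(n_clusters=16))
>>> X_reduced = agglo.fit_transform(X)   # the 64 pixel features are merged into 16 groups
>>> X_reduced.shape
(1797, 16)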

3.4.5 Random Projection
The sklearn.random_projection module implements a simple and computationally efficient way to reduce
the dimensionality of the data by trading a controlled amount of accuracy (as additional variance) for faster processing
times and smaller model sizes. This module implements two types of unstructured random matrix: Gaussian random
matrix and sparse random matrix.
The dimensions and distribution of random projection matrices are controlled so as to preserve the pairwise distances
between any two samples of the dataset. Thus random projection is a suitable approximation technique for distance-based methods.
References:
• Sanjoy Dasgupta. 2000. Experiments with random projection. In Proceedings of the Sixteenth conference
on Uncertainty in artificial intelligence (UAI‘00), Craig Boutilier and Moisés Goldszmidt (Eds.). Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA, 143-151.
• Ella Bingham and Heikki Mannila. 2001. Random projection in dimensionality reduction: applications to
image and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge
discovery and data mining (KDD ‘01). ACM, New York, NY, USA, 245-250.

The Johnson-Lindenstrauss lemma
The main theoretical result behind the efficiency of random projection is the Johnson-Lindenstrauss lemma (quoting
Wikipedia):
In mathematics, the Johnson-Lindenstrauss lemma is a result concerning low-distortion embeddings of
points from high-dimensional into low-dimensional Euclidean space. The lemma states that a small set
of points in a high-dimensional space can be embedded into a space of much lower dimension in such a
way that distances between the points are nearly preserved. The map used for the embedding is at least
Lipschitz, and can even be taken to be an orthogonal projection.
Knowing only the number of samples, sklearn.random_projection.johnson_lindenstrauss_min_dim
conservatively estimates the minimal size of the random subspace
to guarantee a bounded distortion introduced by the random projection:
>>> from sklearn.random_projection import johnson_lindenstrauss_min_dim
>>> johnson_lindenstrauss_min_dim(n_samples=1e6, eps=0.5)
663


>>> johnson_lindenstrauss_min_dim(n_samples=1e6, eps=[0.5, 0.1, 0.01])
array([    663,   11841, 1112658])
>>> johnson_lindenstrauss_min_dim(n_samples=[1e4, 1e5, 1e6], eps=0.1)
array([ 7894,  9868, 11841])

Example:
• See The Johnson-Lindenstrauss bound for embedding with random projections for a theoretical explication
on the Johnson-Lindenstrauss lemma and an empirical validation using sparse random matrices.

References:
• Sanjoy Dasgupta and Anupam Gupta, 1999. An elementary proof of the Johnson-Lindenstrauss Lemma.

Gaussian random projection
The sklearn.random_projection.GaussianRandomProjection reduces the dimensionality by projecting the original input space on a randomly generated matrix where components are drawn from the distribution
𝑁(0, 1/𝑛_components).
Here is a small excerpt which illustrates how to use the Gaussian random projection transformer:
>>> import numpy as np
>>> from sklearn import random_projection
>>> X = np.random.rand(100, 10000)
>>> transformer = random_projection.GaussianRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.shape
(100, 3947)

Sparse random projection
The sklearn.random_projection.SparseRandomProjection reduces the dimensionality by projecting
the original input space using a sparse random matrix.
Sparse random matrices are an alternative to dense Gaussian random projection matrices; they guarantee similar embedding quality while being much more memory efficient and allowing faster computation of the projected data.
If we define s = 1 / density, the elements of the random matrix are drawn from

    -sqrt(s / n_components)   with probability 1 / (2s)
     0                        with probability 1 - 1 / s
    +sqrt(s / n_components)   with probability 1 / (2s)

where n_components is the size of the projected subspace. By default the density of non-zero elements is set to the
minimum density as recommended by Ping Li et al.: 1 / sqrt(n_features).
Here is a small excerpt which illustrates how to use the sparse random projection transformer:
>>> import numpy as np
>>> from sklearn import random_projection
>>> X = np.random.rand(100,10000)
>>> transformer = random_projection.SparseRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.shape
(100, 3947)


References:
• D. Achlioptas. 2003. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66 (2003) 671–687
• Ping Li, Trevor J. Hastie, and Kenneth W. Church. 2006. Very sparse random projections. In Proceedings
of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ‘06).
ACM, New York, NY, USA, 287-296.

3.4.6 Kernel Approximation
This submodule contains functions that approximate the feature mappings that correspond to certain kernels, as they
are used for example in support vector machines (see Support Vector Machines). The following feature functions
perform non-linear transformations of the input, which can serve as a basis for linear classification or other algorithms.
The advantage of using approximate explicit feature maps compared to the kernel trick, which makes use of feature
maps implicitly, is that explicit mappings can be better suited for online learning and can significantly reduce the
cost of learning with very large datasets. Standard kernelized SVMs do not scale well to large datasets, but using an
approximate kernel map it is possible to use much more efficient linear SVMs. In particular, the combination of kernel
map approximations with SGDClassifier can make non-linear learning on large datasets possible.
Since there has not been much empirical work using approximate embeddings, it is advisable to compare results
against exact kernel methods when possible.
See also:
Polynomial regression: extending linear models with basis functions for an exact polynomial transformation.
Nystroem Method for Kernel Approximation
The Nystroem method, as implemented in Nystroem, is a general method for low-rank approximations of kernels.
It achieves this by essentially subsampling the data on which the kernel is evaluated. By default Nystroem uses the
rbf kernel, but it can use any kernel function or a precomputed kernel matrix. The number of samples used - which
is also the dimensionality of the features computed - is given by the parameter n_components.
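As an illustrative sketch (the digits dataset and the SGDClassifier as an arbitrary downstream linear learner are
assumptions made for this example):

>>> from sklearn import datasets
>>> from sklearn.kernel_approximation import Nystroem
>>> from sklearn.linear_model import SGDClassifier
>>> digits = datasets.load_digits()
>>> X, y = digits.data / 16., digits.target
>>> feature_map = Nystroem(gamma=.2, n_components=300, random_state=1)
>>> X_features = feature_map.fit_transform(X)   # approximate RBF kernel feature map
>>> X_features.shape
(1797, 300)
>>> clf = SGDClassifier(random_state=1)
>>> clf = clf.fit(X_features, y)                # a linear model trained on the approximate map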
Radial Basis Function Kernel
The RBFSampler constructs an approximate mapping for the radial basis function kernel, also known as Random
Kitchen Sinks [RR2007]. This transformation can be used to explicitly model a kernel map, prior to applying a linear
algorithm, for example a linear SVM:
>>> from sklearn.kernel_approximation import RBFSampler
>>> from sklearn.linear_model import SGDClassifier
>>> X = [[0, 0], [1, 1], [1, 0], [0, 1]]
>>> y = [0, 0, 1, 1]
>>> rbf_feature = RBFSampler(gamma=1, random_state=1)
>>> X_features = rbf_feature.fit_transform(X)
>>> clf = SGDClassifier()
>>> clf.fit(X_features, y)
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
eta0=0.0, fit_intercept=True, l1_ratio=0.15,
learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None,
n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
shuffle=True, tol=None, verbose=0, warm_start=False)


>>> clf.score(X_features, y)
1.0

The mapping relies on a Monte Carlo approximation to the kernel values. The fit function performs the Monte Carlo
sampling, whereas the transform method performs the mapping of the data. Because of the inherent randomness
of the process, results may vary between different calls to the fit function.
The fit function takes two arguments: n_components, which is the target dimensionality of the feature transform,
and gamma, the parameter of the RBF-kernel. A higher n_components will result in a better approximation of the
kernel and will yield results more similar to those produced by a kernel SVM. Note that “fitting” the feature function
does not actually depend on the data given to the fit function. Only the dimensionality of the data is used. Details
on the method can be found in [RR2007].
For a given value of n_components RBFSampler is often less accurate than Nystroem. RBFSampler is cheaper
to compute, though, making use of larger feature spaces more efficient.

Fig. 3.9: Comparing an exact RBF kernel (left) with the approximation (right)

Examples:
• Explicit feature map approximation for RBF kernels

Additive Chi Squared Kernel
The additive chi squared kernel is a kernel on histograms, often used in computer vision.
The additive chi squared kernel as used here is given by

    k(x, y) = Σᵢ 2 xᵢ yᵢ / (xᵢ + yᵢ)

This is not exactly the same as sklearn.metrics.additive_chi2_kernel. The authors of [VZ2010] prefer
the version above as it is always positive definite. Since the kernel is additive, it is possible to treat all components
𝑥𝑖 separately for embedding. This makes it possible to sample the Fourier transform in regular intervals, instead of
approximating using Monte Carlo sampling.


The class AdditiveChi2Sampler implements this component wise deterministic sampling. Each component
is sampled 𝑛 times, yielding 2𝑛 + 1 dimensions per input dimension (the multiple of two stems from the real and
complex part of the Fourier transform). In the literature, 𝑛 is usually chosen to be 1 or 2, transforming the dataset to
size n_samples * 5 * n_features (in the case of 𝑛 = 2).
The approximate feature map provided by AdditiveChi2Sampler can be combined with the approximate feature
map provided by RBFSampler to yield an approximate feature map for the exponentiated chi squared kernel. See
[VZ2010] for details and [VVZ2010] for the combination with the RBFSampler.
Skewed Chi Squared Kernel
The skewed chi squared kernel is given by:

    k(x, y) = ∏ᵢ 2 √(xᵢ + c) √(yᵢ + c) / (xᵢ + yᵢ + 2c)

It has properties that are similar to the exponentiated chi squared kernel often used in computer vision, but allows for
a simple Monte Carlo approximation of the feature map.
The usage of the SkewedChi2Sampler is the same as the usage described above for the RBFSampler. The only
difference is in the free parameter, which is called 𝑐. For a motivation for this mapping and the mathematical details see
[LS2010].
Mathematical Details
Kernel methods like support vector machines or kernelized PCA rely on a property of reproducing kernel Hilbert
spaces. For any positive definite kernel function 𝑘 (a so called Mercer kernel), it is guaranteed that there exists a
mapping 𝜑 into a Hilbert space ℋ, such that
𝑘(𝑥, 𝑦) = ⟨𝜑(𝑥), 𝜑(𝑦)⟩
where ⟨·, ·⟩ denotes the inner product in the Hilbert space.
If an algorithm, such as a linear support vector machine or PCA, relies only on the scalar product of data points 𝑥𝑖 ,
one may use the value of 𝑘(𝑥𝑖 , 𝑥𝑗 ), which corresponds to applying the algorithm to the mapped data points 𝜑(𝑥𝑖 ). The
advantage of using 𝑘 is that the mapping 𝜑 never has to be calculated explicitly, allowing for arbitrarily large
(even infinite-dimensional) feature spaces.
One drawback of kernel methods is that it might be necessary to store many kernel values 𝑘(𝑥𝑖 , 𝑥𝑗 ) during optimization. If a kernelized classifier is applied to new data 𝑦𝑗 , 𝑘(𝑥𝑖 , 𝑦𝑗 ) needs to be computed to make predictions, possibly
for many different 𝑥𝑖 in the training set.
The classes in this submodule allow approximating the embedding 𝜑, thereby working explicitly with the representations 𝜑(𝑥𝑖 ), which obviates the need to apply the kernel or store training examples.
References:
• [RR2007] “Random features for large-scale kernel machines” Rahimi, A. and Recht, B. - Advances in Neural Information Processing Systems 2007
• [LS2010] “Random Fourier approximations for skewed multiplicative histogram kernels” Li, F., Ionescu, C. and Sminchisescu, C. - Pattern Recognition (DAGM) 2010
• [VZ2010] “Efficient additive kernels via explicit feature maps” Vedaldi, A. and Zisserman, A. - Computer Vision and Pattern Recognition 2010
• [VVZ2010] “Generalized RBF feature maps for Efficient Detection” Vempati, S., Vedaldi, A., Zisserman, A. and Jawahar, C.V. - British Machine Vision Conference 2010

3.4.7 Pairwise metrics, Affinities and Kernels
The sklearn.metrics.pairwise submodule implements utilities to evaluate pairwise distances or affinity of
sets of samples.
This module contains both distance metrics and kernels. A brief summary is given on the two here.


Distance metrics are functions d(a, b) such that d(a, b) < d(a, c) if objects a and b are considered “more
similar” than objects a and c. Two objects exactly alike would have a distance of zero. One of the most popular
examples is Euclidean distance. To be a ‘true’ metric, it must obey the following four conditions:
1. d(a, b) >= 0, for all a and b
2. d(a, b) == 0, if and only if a = b, positive definiteness
3. d(a, b) == d(b, a), symmetry
4. d(a, c) <= d(a, b) + d(b, c), the triangle inequality

Kernels are measures of similarity, i.e. s(a, b) > s(a, c) if objects a and b are considered “more similar” than
objects a and c. A kernel must also be positive semi-definite.
There are a number of ways to convert between a distance metric and a similarity measure, such as a kernel. Let D be
the distance, and S be the kernel:
1. S = np.exp(-D * gamma), where one heuristic for choosing gamma is 1 / num_features
2. S = 1. / (D / np.max(D))
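As a quick sketch of the first heuristic (the toy points and the gamma choice are illustrative):

>>> import numpy as np
>>> from sklearn.metrics.pairwise import euclidean_distances
>>> X = np.array([[0., 1.], [1., 0.], [2., 2.]])
>>> D = euclidean_distances(X)       # pairwise distance matrix, zeros on the diagonal
>>> gamma = 1.0 / X.shape[1]         # heuristic: 1 / num_features
>>> S = np.exp(-D * gamma)           # similarity matrix; identical samples map to S == 1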
Cosine similarity
cosine_similarity computes the L2-normalized dot product of vectors. That is, if 𝑥 and 𝑦 are row vectors, their
cosine similarity 𝑘 is defined as:
    𝑘(𝑥, 𝑦) = 𝑥𝑦⊤ / (‖𝑥‖ ‖𝑦‖)

This is called cosine similarity, because Euclidean (L2) normalization projects the vectors onto the unit sphere, and
their dot product is then the cosine of the angle between the points denoted by the vectors.
This kernel is a popular choice for computing the similarity of documents represented as tf-idf vectors.
cosine_similarity accepts scipy.sparse matrices. (Note that the tf-idf functionality in sklearn.
feature_extraction.text can produce normalized vectors, in which case cosine_similarity is equivalent to linear_kernel, only slower.)
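As an illustrative sketch on a few toy documents (the texts themselves are made up):

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.metrics.pairwise import cosine_similarity
>>> docs = ["the cat sat on the mat",
...         "a cat sat on a mat",
...         "completely different words"]
>>> tfidf = TfidfVectorizer().fit_transform(docs)   # sparse, L2-normalized rows
>>> similarities = cosine_similarity(tfidf)         # pairwise similarity of the documents
>>> similarities.shape
(3, 3)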
References:
• C.D. Manning, P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book/html/htmledition/the-vector-space-model-for-scoring-1.html

Linear kernel
The function linear_kernel computes the linear kernel, that is, a special case of polynomial_kernel with
degree=1 and coef0=0 (homogeneous). If x and y are column vectors, their linear kernel is:
𝑘(𝑥, 𝑦) = 𝑥⊤ 𝑦
Polynomial kernel
The function polynomial_kernel computes the degree-d polynomial kernel between two vectors. The polynomial kernel represents the similarity between two vectors. Conceptually, the polynomial kernel considers not only
the similarity between vectors under the same dimension, but also across dimensions. When used in machine learning
algorithms, this allows accounting for feature interaction.


The polynomial kernel is defined as:
𝑘(𝑥, 𝑦) = (𝛾𝑥⊤ 𝑦 + 𝑐0 )ᵈ
where:
• x, y are the input vectors
• d is the kernel degree
If 𝑐0 = 0 the kernel is said to be homogeneous.
Sigmoid kernel
The function sigmoid_kernel computes the sigmoid kernel between two vectors. The sigmoid kernel is also
known as hyperbolic tangent, or Multilayer Perceptron (because, in the neural network field, it is often used as a
neuron activation function). It is defined as:
𝑘(𝑥, 𝑦) = tanh(𝛾𝑥⊤ 𝑦 + 𝑐0 )
where:
• x, y are the input vectors
• 𝛾 is known as slope
• 𝑐0 is known as intercept
RBF kernel
The function rbf_kernel computes the radial basis function (RBF) kernel between two vectors. This kernel is
defined as:
𝑘(𝑥, 𝑦) = exp(−𝛾‖𝑥 − 𝑦‖²)
where x and y are the input vectors. If 𝛾 = 𝜎⁻² the kernel is known as the Gaussian kernel of variance 𝜎².
Laplacian kernel
The function laplacian_kernel is a variant on the radial basis function kernel defined as:
𝑘(𝑥, 𝑦) = exp(−𝛾‖𝑥 − 𝑦‖₁)
where x and y are the input vectors and ‖𝑥 − 𝑦‖₁ is the Manhattan distance between the input vectors.
It has proven useful in ML applied to noiseless data. See e.g. Machine learning for quantum mechanics in a nutshell.
Chi-squared kernel
The chi-squared kernel is a very popular choice for training non-linear SVMs in computer vision applications. It can
be computed using chi2_kernel and then passed to an sklearn.svm.SVC with kernel="precomputed":


>>> from sklearn.svm import SVC
>>> from sklearn.metrics.pairwise import chi2_kernel
>>> X = [[0, 1], [1, 0], [.2, .8], [.7, .3]]
>>> y = [0, 1, 0, 1]
>>> K = chi2_kernel(X, gamma=.5)
>>> K
array([[ 1.    ,  0.36...,  0.89...,  0.58...],
       [ 0.36...,  1.    ,  0.51...,  0.83...],
       [ 0.89...,  0.51...,  1.    ,  0.77...],
       [ 0.58...,  0.83...,  0.77...,  1.    ]])
>>> svm = SVC(kernel='precomputed').fit(K, y)
>>> svm.predict(K)
array([0, 1, 0, 1])

It can also be directly used as the kernel argument:
>>> svm = SVC(kernel=chi2_kernel).fit(X, y)
>>> svm.predict(X)
array([0, 1, 0, 1])

The chi squared kernel is given by

    k(x, y) = exp(−γ Σᵢ (x[i] − y[i])² / (x[i] + y[i]))

The data is assumed to be non-negative, and is often normalized to have an L1-norm of one. The normalization is
rationalized with the connection to the chi squared distance, which is a distance between discrete probability distributions.
The chi squared kernel is most commonly used on histograms (bags) of visual words.
References:
• Zhang, J. and Marszalek, M. and Lazebnik, S. and Schmid, C. Local features and kernels for classification
of texture and object categories: A comprehensive study International Journal of Computer Vision 2007
http://research.microsoft.com/en-us/um/people/manik/projects/trade-off/papers/ZhangIJCV06.pdf

3.4.8 Transforming the prediction target (y)
Label binarization
LabelBinarizer is a utility class to help create a label indicator matrix from a list of multi-class labels:
>>> from sklearn import preprocessing
>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit([1, 2, 6, 4, 2])
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
>>> lb.classes_
array([1, 2, 4, 6])
>>> lb.transform([1, 6])
array([[1, 0, 0, 0],
[0, 0, 0, 1]])


For multiple labels per instance, use MultiLabelBinarizer:
>>> lb = preprocessing.MultiLabelBinarizer()
>>> lb.fit_transform([(1, 2), (3,)])
array([[1, 1, 0],
[0, 0, 1]])
>>> lb.classes_
array([1, 2, 3])

Label encoding
LabelEncoder is a utility class to help normalize labels such that they contain only values between 0 and n_classes-1. This is sometimes useful for writing efficient Cython routines. LabelEncoder can be used as follows:
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2])
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical
labels:
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

3.5 Dataset loading utilities
The sklearn.datasets package embeds some small toy datasets as introduced in the Getting Started section.
To evaluate the impact of the scale of the dataset (n_samples and n_features) while controlling the statistical
properties of the data (typically the correlation and informativeness of the features), it is also possible to generate
synthetic data.
This package also features helpers to fetch larger datasets commonly used by the machine learning community to
benchmark algorithms on data that comes from the ‘real world’.

3.5.1 General dataset API
There are three distinct kinds of dataset interfaces for different types of datasets. The simplest one is the interface for
sample images, which is described below in the Sample images section.


The dataset generation functions and the svmlight loader share a simplistic interface, returning a tuple (X, y) consisting of a n_samples * n_features numpy array X and an array of length n_samples containing the targets
y.
The toy datasets as well as the ‘real world’ datasets and the datasets fetched from mldata.org have more sophisticated
structure. These functions return a dictionary-like object holding at least two items: an array of shape n_samples *
n_features with key data (except for 20newsgroups) and a numpy array of length n_samples, containing the
target values, with key target.
The datasets also contain a description in DESCR and some contain feature_names and target_names. See
the dataset descriptions below for details.

3.5.2 Toy datasets
scikit-learn comes with a few small standard datasets that do not require downloading any file from an external
website.
load_boston([return_X_y])            Load and return the boston house-prices dataset (regression).
load_iris([return_X_y])              Load and return the iris dataset (classification).
load_diabetes([return_X_y])          Load and return the diabetes dataset (regression).
load_digits([n_class, return_X_y])   Load and return the digits dataset (classification).
load_linnerud([return_X_y])          Load and return the linnerud dataset (multivariate regression).
load_wine([return_X_y])              Load and return the wine dataset (classification).
load_breast_cancer([return_X_y])     Load and return the breast cancer wisconsin dataset (classification).

These datasets are useful to quickly illustrate the behavior of the various algorithms implemented in the scikit. They
are however often too small to be representative of real world machine learning tasks.

3.5.3 Sample images
The scikit also embeds a couple of sample JPEG images published under Creative Commons license by their authors.
Those images can be useful to test algorithms and pipelines on 2D data.
load_sample_images()              Load sample images for image manipulation.
load_sample_image(image_name)     Load the numpy array of a single sample image.

Warning: The default coding of images is based on the uint8 dtype to spare memory. Often machine learning
algorithms work best if the input is converted to a floating point representation first. Also, if you plan to use


matplotlib.pyplot.imshow, don’t forget to scale to the range 0 - 1 as done in the following example.

Examples:
• Color Quantization using K-Means

3.5.4 Sample generators
In addition, scikit-learn includes various random sample generators that can be used to build artificial datasets of
controlled size and complexity.
Generators for classification and clustering
These generators produce a matrix of features and corresponding discrete targets.
Single label
Both make_blobs and make_classification create multiclass datasets by allocating each class one or more
normally-distributed clusters of points. make_blobs provides greater control regarding the centers and standard deviations of each cluster, and is used to demonstrate clustering. make_classification specialises in introducing
noise by way of: correlated, redundant and uninformative features; multiple Gaussian clusters per class; and linear
transformations of the feature space.
make_gaussian_quantiles divides a single Gaussian cluster into near-equal-size classes separated
by concentric hyperspheres.
make_hastie_10_2 generates a similar binary, 10-dimensional problem.

make_circles and make_moons generate 2d binary classification datasets that are challenging to certain algorithms (e.g. centroid-based clustering or linear
classification), including optional Gaussian noise. They are useful for visualisation. make_circles produces Gaussian data with a
spherical decision boundary for binary classification, while make_moons produces two interleaving half circles.
Multilabel
make_multilabel_classification generates random samples with multiple labels, reflecting a bag of words
drawn from a mixture of topics. The number of topics for each document is drawn from a Poisson distribution, and the
topics themselves are drawn from a fixed random distribution. Similarly, the number of words is drawn from Poisson,
with words drawn from a multinomial, where each topic defines a probability distribution over words. Simplifications
with respect to true bag-of-words mixtures include:
• Per-topic word distributions are independently drawn, where in reality all would be affected by a sparse base
distribution, and would be correlated.
• For a document generated from multiple topics, all topics are weighted equally in generating its bag of words.
• Documents without labels have their words drawn at random, rather than from a base distribution.

Biclustering

make_biclusters(shape, n_clusters[, noise, ...])   Generate an array with constant block diagonal structure for biclustering.
make_checkerboard(shape, n_clusters[, ...])        Generate an array with block checkerboard structure for biclustering.

Generators for regression
make_regression produces regression targets as an optionally-sparse random linear combination of random features, with noise. Its informative features may be uncorrelated, or low rank (few features account for most of the
variance).
Other regression generators generate functions deterministically from randomized features.
make_sparse_uncorrelated produces a target as a linear combination of four features with fixed coefficients. Others encode explicitly non-linear relations: make_friedman1 is related by polynomial and sine
transforms; make_friedman2 includes feature multiplication and reciprocation; and make_friedman3 is
similar with an arctan transformation on the target.


Generators for manifold learning
make_s_curve([n_samples, noise, random_state])       Generate an S curve dataset.
make_swiss_roll([n_samples, noise, random_state])    Generate a swiss roll dataset.

Generators for decomposition
make_low_rank_matrix([n_samples, ...])             Generate a mostly low rank matrix with bell-shaped singular values.
make_sparse_coded_signal(n_samples, ...[, ...])    Generate a signal as a sparse combination of dictionary elements.
make_spd_matrix(n_dim[, random_state])             Generate a random symmetric, positive-definite matrix.
make_sparse_spd_matrix([dim, alpha, ...])          Generate a sparse symmetric definite positive matrix.

3.5.5 Datasets in svmlight / libsvm format
scikit-learn includes utility functions for loading datasets in the svmlight / libsvm format. In this format, each line
takes the form <label> <feature-id>:<feature-value> <feature-id>:<feature-value> ....
