Practical Machine Learning with Python
A Problem-Solver's Guide to Building Real-World Intelligent Systems

Dipanjan Sarkar
Raghav Bali
Tushar Sharma


Practical Machine Learning with Python

Dipanjan Sarkar
Bangalore, Karnataka, India

Raghav Bali
Bangalore, Karnataka, India

Tushar Sharma
Bangalore, Karnataka, India

ISBN-13 (pbk): 978-1-4842-3206-4
ISBN-13 (electronic): 978-1-4842-3207-1
https://doi.org/10.1007/978-1-4842-3207-1

Library of Congress Control Number: 2017963290
Copyright © 2018 by Dipanjan Sarkar, Raghav Bali and Tushar Sharma
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage
and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or
hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with
every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an
editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to
proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material
contained herein.
Cover image by Freepik (www.freepik.com)
Managing Director: Welmoed Spahr
Editorial Director: Todd Green
Acquisitions Editor: Celestin Suresh John
Development Editor: Matthew Moodie
Technical Reviewer: Jojo Moolayil
Coordinating Editor: Sanchita Mandal
Copy Editor: Kezia Endsley
Distributed to the book trade worldwide by Springer Science+Business Media New York,
233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail
orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC
and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc).
SSBM Finance Inc is a Delaware corporation.
For information on translations, please e-mail rights@apress.com, or visit http://www.apress.com/
rights-permissions.
Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions
and licenses are also available for most titles. For more information, reference our Print and eBook Bulk
Sales web page at http://www.apress.com/bulk-sales.
Any source code or other supplementary material referenced by the author in this book is available to
readers on GitHub via the book’s product page, located at www.apress.com/978-1-4842-3206-4. For
more detailed information, please visit http://www.apress.com/source-code.
Printed on acid-free paper

This book is dedicated to my parents, partner, friends, family, and well-wishers.
—Dipanjan Sarkar
To all my inspirations, who would never read this!
—Raghav Bali
Dedicated to my family and friends.
—Tushar Sharma

Contents
About the Authors ... xvii
About the Technical Reviewer ... xix
Acknowledgments ... xxi
Foreword ... xxiii
Introduction ... xxv

Part I: Understanding Machine Learning ... 1

Chapter 1: Machine Learning Basics ... 3
The Need for Machine Learning ... 4
Making Data-Driven Decisions ... 4
Efficiency and Scale ... 5
Traditional Programming Paradigm ... 5
Why Machine Learning? ... 6
Understanding Machine Learning ... 8
Why Make Machines Learn? ... 8
Formal Definition ... 9
A Multi-Disciplinary Field ... 13
Computer Science ... 14
Theoretical Computer Science ... 15
Practical Computer Science ... 15
Important Concepts ... 15
Data Science ... 16

Mathematics ... 18
Important Concepts ... 19
Statistics ... 24
Data Mining ... 25
Artificial Intelligence ... 25
Natural Language Processing ... 26
Deep Learning ... 28
Important Concepts ... 31
Machine Learning Methods ... 34
Supervised Learning ... 35
Classification ... 36
Regression ... 37
Unsupervised Learning ... 38
Clustering ... 39
Dimensionality Reduction ... 40
Anomaly Detection ... 41
Association Rule-Mining ... 41
Semi-Supervised Learning ... 42
Reinforcement Learning ... 42
Batch Learning ... 43
Online Learning ... 44
Instance Based Learning ... 44
Model Based Learning ... 45
The CRISP-DM Process Model ... 45
Business Understanding ... 46
Data Understanding ... 48
Data Preparation ... 50
Modeling ... 51
Evaluation ... 52
Deployment ... 52

Building Machine Intelligence ... 52
Machine Learning Pipelines ... 52
Supervised Machine Learning Pipeline ... 54
Unsupervised Machine Learning Pipeline ... 55
Real-World Case Study: Predicting Student Grant Recommendations ... 55
Objective ... 56
Data Retrieval ... 56
Data Preparation ... 57
Modeling ... 60
Model Evaluation ... 61
Model Deployment ... 61
Prediction in Action ... 62
Challenges in Machine Learning ... 64
Real-World Applications of Machine Learning ... 64
Summary ... 65

Chapter 2: The Python Machine Learning Ecosystem ... 67
Python: An Introduction ... 67
Strengths ... 68
Pitfalls ... 68
Setting Up a Python Environment ... 69
Why Python for Data Science? ... 71
Introducing the Python Machine Learning Ecosystem ... 72
Jupyter Notebooks ... 72
NumPy ... 75
Pandas ... 84
Scikit-learn ... 96
Neural Networks and Deep Learning ... 102
Text Analytics and Natural Language Processing ... 112
Statsmodels ... 116
Summary ... 118

Part II: The Machine Learning Pipeline ... 119

Chapter 3: Processing, Wrangling, and Visualizing Data ... 121
Data Collection ... 122
CSV ... 122
JSON ... 124
XML ... 128
HTML and Scraping ... 131
SQL ... 136
Data Description ... 137
Numeric ... 137
Text ... 137
Categorical ... 137
Data Wrangling ... 138
Understanding Data ... 138
Filtering Data ... 141
Typecasting ... 144
Transformations ... 144
Imputing Missing Values ... 145
Handling Duplicates ... 147
Handling Categorical Data ... 147
Normalizing Values ... 148
String Manipulations ... 149
Data Summarization ... 149
Data Visualization ... 151
Visualizing with Pandas ... 152
Visualizing with Matplotlib ... 161
Python Visualization Ecosystem ... 176
Summary ... 176

Chapter 4: Feature Engineering and Selection ... 177
Features: Understand Your Data Better ... 178
Data and Datasets ... 178
Features ... 179
Models ... 179
Revisiting the Machine Learning Pipeline ... 179
Feature Extraction and Engineering ... 181
What Is Feature Engineering? ... 181
Why Feature Engineering? ... 183
How Do You Engineer Features? ... 184
Feature Engineering on Numeric Data ... 185
Raw Measures ... 185
Binarization ... 187
Rounding ... 188
Interactions ... 189
Binning ... 191
Statistical Transformations ... 197
Feature Engineering on Categorical Data ... 200
Transforming Nominal Features ... 201
Transforming Ordinal Features ... 202
Encoding Categorical Features ... 203
Feature Engineering on Text Data ... 209
Text Pre-Processing ... 210
Bag of Words Model ... 211
Bag of N-Grams Model ... 212
TF-IDF Model ... 213
Document Similarity ... 214
Topic Models ... 216
Word Embeddings ... 217

Feature Engineering on Temporal Data ... 220
Date-Based Features ... 221
Time-Based Features ... 222
Feature Engineering on Image Data ... 224
Image Metadata Features ... 225
Raw Image and Channel Pixels ... 225
Grayscale Image Pixels ... 227
Binning Image Intensity Distribution ... 227
Image Aggregation Statistics ... 228
Edge Detection ... 229
Object Detection ... 230
Localized Feature Extraction ... 231
Visual Bag of Words Model ... 233
Automated Feature Engineering with Deep Learning ... 236
Feature Scaling ... 239
Standardized Scaling ... 240
Min-Max Scaling ... 240
Robust Scaling ... 241
Feature Selection ... 242
Threshold-Based Methods ... 243
Statistical Methods ... 244
Recursive Feature Elimination ... 247
Model-Based Selection ... 248
Dimensionality Reduction ... 249
Feature Extraction with Principal Component Analysis ... 250
Summary ... 252

Chapter 5: Building, Tuning, and Deploying Models ... 255
Building Models ... 256
Model Types ... 257
Learning a Model ... 260
Model Building Examples ... 263

Model Evaluation ... 271
Evaluating Classification Models ... 271
Evaluating Clustering Models ... 278
Evaluating Regression Models ... 281
Model Tuning ... 282
Introduction to Hyperparameters ... 283
The Bias-Variance Tradeoff ... 284
Cross Validation ... 288
Hyperparameter Tuning Strategies ... 291
Model Interpretation ... 295
Understanding Skater ... 297
Model Interpretation in Action ... 298
Model Deployment ... 302
Model Persistence ... 302
Custom Development ... 303
In-House Model Deployment ... 303
Model Deployment as a Service ... 304
Summary ... 304

Part III: Real-World Case Studies ... 305

Chapter 6: Analyzing Bike Sharing Trends ... 307
The Bike Sharing Dataset ... 307
Problem Statement ... 308
Exploratory Data Analysis ... 308
Preprocessing ... 308
Distribution and Trends ... 310
Outliers ... 312
Correlations ... 314

R
 egression Analysis����������������������������������������������������������������������������������������������������� 315
Types of Regression��������������������������������������������������������������������������������������������������������������������������� 315
Assumptions��������������������������������������������������������������������������������������������������������������������������������������� 316
Evaluation Criteria������������������������������������������������������������������������������������������������������������������������������ 316

Modeling���������������������������������������������������������������������������������������������������������������������� 317
Linear Regression������������������������������������������������������������������������������������������������������������������������������� 319
Decision Tree Based Regression��������������������������������������������������������������������������������������������������������� 323

Next Steps�������������������������������������������������������������������������������������������������������������������� 330
Summary���������������������������������������������������������������������������������������������������������������������� 330
■Chapter 7: Analyzing Movie Reviews Sentiment����������������������������������������������� 331
Problem Statement������������������������������������������������������������������������������������������������������ 332
Setting Up Dependencies��������������������������������������������������������������������������������������������� 332
Getting the Data����������������������������������������������������������������������������������������������������������� 333
Text Pre-Processing and Normalization����������������������������������������������������������������������� 333
Unsupervised Lexicon-Based Models�������������������������������������������������������������������������� 336
Bing Liu’s Lexicon������������������������������������������������������������������������������������������������������������������������������� 337
MPQA Subjectivity Lexicon����������������������������������������������������������������������������������������������������������������� 337
Pattern Lexicon����������������������������������������������������������������������������������������������������������������������������������� 338
AFINN Lexicon������������������������������������������������������������������������������������������������������������������������������������ 338
SentiWordNet Lexicon������������������������������������������������������������������������������������������������������������������������ 340
VADER Lexicon������������������������������������������������������������������������������������������������������������������������������������ 342

Classifying Sentiment with Supervised Learning��������������������������������������������������������� 345
Traditional Supervised Machine Learning Models������������������������������������������������������� 346
Newer Supervised Deep Learning Models������������������������������������������������������������������� 349
Advanced Supervised Deep Learning Models�������������������������������������������������������������� 355
Analyzing Sentiment Causation������������������������������������������������������������������������������������ 363
Interpreting Predictive Models����������������������������������������������������������������������������������������������������������� 363
Analyzing Topic Models���������������������������������������������������������������������������������������������������������������������� 368

Summary���������������������������������������������������������������������������������������� 372

■Chapter 8: Customer Segmentation and Effective Cross Selling����������������������� 373
Online Retail Transactions Dataset������������������������������������������������������������������������������� 374
Exploratory Data Analysis��������������������������������������������������������������������������������������������� 374
Customer Segmentation����������������������������������������������������������������������������������������������� 378
Objectives������������������������������������������������������������������������������������������������������������������������� 378
Strategies������������������������������������������������������������������������������������������������������������������������������������������� 379
Clustering Strategy����������������������������������������������������������������������������������������������������������������������������� 380

Cross Selling���������������������������������������������������������������������������������������������������������������� 392
Market Basket Analysis with Association Rule-Mining����������������������������������������������������������������������� 393
Association Rule-Mining Basics��������������������������������������������������������������������������������������������������������� 394
Association Rule-Mining in Action������������������������������������������������������������������������������������������������������ 396

Summary���������������������������������������������������������������������������������������������������������������������� 405
■Chapter 9: Analyzing Wine Types and Quality��������������������������������������������������� 407
Problem Statement������������������������������������������������������������������������������������������������������ 407
Setting Up Dependencies��������������������������������������������������������������������������������������������� 408
Getting the Data����������������������������������������������������������������������������������������������������������� 408
Exploratory Data Analysis��������������������������������������������������������������������������������������������� 409
Process and Merge Datasets�������������������������������������������������������������������������������������������������������������� 409
Understanding Dataset Features�������������������������������������������������������������������������������������������������������� 410
Descriptive Statistics�������������������������������������������������������������������������������������������������������������������������� 413
Inferential Statistics���������������������������������������������������������������������������������������������������������������������������� 414
Univariate Analysis����������������������������������������������������������������������������������������������������������������������������� 416
Multivariate Analysis�������������������������������������������������������������������������������������������������������������������������� 419

Predictive Modeling������������������������������������������������������������������������������������ 426
Predicting Wine Types�������������������������������������������������������������������������������������������������� 427
Predicting Wine Quality������������������������������������������������������������������������������������������������ 433
Summary���������������������������������������������������������������������������������������������������������������������� 446

■Chapter 10: Analyzing Music Trends and Recommendations��������������������������� 447
The Million Song Dataset Taste Profile������������������������������������������������������������������������� 448
Exploratory Data Analysis��������������������������������������������������������������������������������������������� 448
Loading and Trimming Data���������������������������������������������������������������������������������������������������������������� 448
Enhancing the Data���������������������������������������������������������������������������������������������������������������������������� 451
Visual Analysis������������������������������������������������������������������������������������������������������������������������������������ 452

Recommendation Engines�������������������������������������������������������������������������������� 456
Types of Recommendation Engines���������������������������������������������������������������������������������������������������� 457
Utility of Recommendation Engines���������������������������������������������������������������������������������������������������� 457
Popularity-Based Recommendation Engine��������������������������������������������������������������������������������������� 458
Item Similarity Based Recommendation Engine��������������������������������������������������������������������������������� 459
Matrix Factorization Based Recommendation Engine������������������������������������������������������������������������ 461

A Note on Recommendation Engine Libraries�������������������������������������������������������������� 466
Summary���������������������������������������������������������������������������������������������������������������������� 466
■Chapter 11: Forecasting Stock and Commodity Prices������������������������������������� 467
Time Series Data and Analysis������������������������������������������������������������������������������������� 467
Time Series Components�������������������������������������������������������������������������������������������������������������������� 469
Smoothing Techniques����������������������������������������������������������������������������������������������������������������������� 471

Forecasting Gold Price������������������������������������������������������������������������������������������������� 474
Problem Statement����������������������������������������������������������������������������������������������������������������������� 474
Dataset����������������������������������������������������������������������������������������������������������������������������������������������� 474
Traditional Approaches����������������������������������������������������������������������������������������������������������������������� 474
Modeling��������������������������������������������������������������������������������������������������������������������������������������������� 476

Stock Price Prediction�������������������������������������������������������������������������������������������������� 483
Problem Statement����������������������������������������������������������������������������������������������������������������������� 484
Dataset����������������������������������������������������������������������������������������������������������������������������������������������� 484
Recurrent Neural Networks: LSTM����������������������������������������������������������������������������������������������������� 485
Upcoming Techniques: Prophet���������������������������������������������������������������������������������������������������������� 495

Summary���������������������������������������������������������������������������������������� 497

■Chapter 12: Deep Learning for Computer Vision����������������������������������������������� 499
Convolutional Neural Networks������������������������������������������������������������������������������������ 499
Image Classification with CNNs����������������������������������������������������������������������������������� 501
Problem Statement����������������������������������������������������������������������������������������������������������������������� 501
Dataset����������������������������������������������������������������������������������������������������������������������������������������������� 501
CNN Based Deep Learning Classifier from Scratch���������������������������������������������������������������������������� 502
CNN Based Deep Learning Classifier with Pretrained Models������������������������������������������������������������ 505

Artistic Style Transfer with CNNs��������������������������������������������������������������������������������� 509
Background���������������������������������������������������������������������������������������������������������������������������� 510
Preprocessing������������������������������������������������������������������������������������������������������������������������������������� 511
Loss Functions������������������������������������������������������������������������������������������������������������������������������������ 513
Custom Optimizer������������������������������������������������������������������������������������������������������������������������������� 515
Style Transfer in Action����������������������������������������������������������������������������������������������������������������������� 516

Summary���������������������������������������������������������������������������������������� 520
Index��������������������������������������������������������������������������������������������������������������������� 521


About the Authors
Dipanjan Sarkar is a data scientist at Intel, on a mission to make the
world more connected and productive. He primarily works on Data
Science, analytics, business intelligence, application development, and
building large-scale intelligent systems. He holds a master of technology
degree in Information Technology with specializations in Data Science
and Software Engineering from the International Institute of Information
Technology, Bangalore. He is also an avid supporter of self-learning,
especially through Massive Open Online Courses, and holds a Data Science
Specialization from Johns Hopkins University on Coursera.
Dipanjan has been an analytics practitioner for several years,
specializing in statistical, predictive, and text analytics. Having a
passion for Data Science and education, he is a Data Science Mentor
at Springboard, helping people up-skill on areas like Data Science and
Machine Learning. Dipanjan has also authored several books on R,
Python, Machine Learning, and analytics, including Text Analytics with
Python (Apress, 2016). Besides this, he occasionally reviews technical books
and acts as a course beta tester for Coursera. Dipanjan’s interests include learning about new technology,
financial markets, disruptive start-ups, Data Science, and more recently, artificial intelligence and Deep
Learning.

Raghav Bali is a data scientist at Intel, enabling proactive and data-driven
IT initiatives. He primarily works on Data Science, analytics, business
intelligence, and development of scalable Machine Learning-based
solutions. He has also worked in domains such as ERP and finance with
some of the leading organizations in the world. Raghav has a master’s
degree (gold medalist) in Information Technology from International
Institute of Information Technology, Bangalore.
Raghav is a technology enthusiast who loves reading and playing
around with new gadgets and technologies. He has also authored
several books on R, Machine Learning, and Analytics. He is a shutterbug,
capturing moments when he isn’t busy solving problems.


Tushar Sharma has a master’s degree from International Institute of
Information Technology, Bangalore. He works as a Data Scientist with
Intel. His work involves developing analytical solutions at scale using
enormous volumes of infrastructure data. In his previous role, he worked
in the financial domain developing scalable Machine Learning solutions
for major financial organizations. He is proficient in Python, R, and Big
Data frameworks like Spark and Hadoop.
Apart from work, Tushar enjoys watching movies, playing badminton,
and is an avid reader. He has also authored a book on R and social media
analytics.


About the Technical Reviewer
Jojo Moolayil is an Artificial Intelligence professional and published
author of the book: Smarter Decisions – The Intersection of IoT and
Decision Science. With over five years of industrial experience in A.I.,
Machine Learning, Decision Science, and IoT, he has worked with
industry leaders on high impact and critical projects across multiple
verticals. He is currently working with General Electric, the pioneer and
leader in Data Science for Industrial IoT, and lives in Bengaluru—the
Silicon Valley of India.
He was born and raised in Pune, India and graduated from University
of Pune with a major in Information Technology Engineering. He started
his career with Mu Sigma Inc., the world's largest pure-play analytics
provider, and then Flutura, an IoT Analytics startup. He has also worked
with the leaders of many Fortune 50 clients.
In his present role with General Electric, he focuses on solving A.I.
and decision science problems for Industrial IoT use cases and developing
Data Science products and platforms for Industrial IoT.
Apart from authoring books on decision science and IoT, Jojo has also been technical reviewer for
various books on Machine Learning and Business Analytics with Apress. He is an active Data Science tutor
and maintains a blog at http://www.jojomoolayil.com/web/blog/.
You can reach out to Jojo at:
http://www.jojomoolayil.com/
https://www.linkedin.com/in/jojo62000
I would like to thank my family, friends, and mentors for their kind support and constant motivation
throughout my life.
—Jojo John Moolayil


Acknowledgments
This book would have definitely not been a reality without the help and support from some excellent people
and organizations that have helped us along this journey. First and foremost, a big thank you to all our
readers for not only reading our books but also supporting us with valuable feedback and insights. Truly,
we have learnt a lot from all of you and still continue to do so. We would like to acknowledge the entire
team at Apress for working tirelessly behind the scenes to create and publish quality content for everyone.
A big shout-out goes to the entire Python developer community, especially to the developers of frameworks
like numpy, scipy, scikit-learn, spacy, nltk, pandas, statsmodels, keras, and tensorflow. Thanks also to
organizations like Anaconda, for making the lives of data scientists easier and for fostering an amazing
ecosystem around Data Science and Machine Learning that has been growing exponentially with time. We
also thank our friends, colleagues, teachers, managers, and well-wishers for supporting us with excellent
challenges, strong motivation, and good thoughts. A special mention goes to Ram Varra for not only being
a great mentor and guide to us, but also for teaching us how to leverage Data Science as an effective tool, from
technical aspects as well as business and domain perspectives, to add real impact and value.
We would also like to express our gratitude to our managers and mentors, both past and present, including
Nagendra Venkatesh, Sanjeev Reddy, Tamoghna Ghosh and Sailaja Parthasarathy.
A lot of the content in this book wouldn’t have been possible without the help from several people and
some excellent resources. We would like to thank Christopher Olah for providing some excellent depictions
and explanation for LSTM models (http://colah.github.io), Edwin Chen for also providing an excellent
depiction for LSTM models in his blog (http://blog.echen.me), Gabriel Moreira for providing some
excellent pointers on feature engineering techniques, Ian London for his resources on the Visual Bag of
Words Model (https://ianlondon.github.io), the folks at DataScience.com, especially Pramit Choudhary,
Ian Swanson, and Aaron Kramer, for helping us cover a lot of ground in model interpretation with skater
(https://www.datascience.com), Karlijn Willems and DataCamp for providing an excellent source of
information pertaining to wine quality analysis (https://www.datacamp.com), Siraj Raval for creating
amazing content especially with regard to time series analysis and recommendation engines, Amar Lalwani
for giving us some vital inputs around time series forecasting with Deep Learning, Harish Narayanan for an
excellent article on neural style transfer (https://harishnarayanan.org/writing), and last but certainly
not the least, François Chollet for creating keras and writing an excellent book on Deep Learning.
I would also like to acknowledge and express my gratitude to my parents, Digbijoy and Sampa, my
partner Durba and my family and well-wishers for their constant love, support, and encouragement that
drive me to strive to achieve more. Special thanks to my fellow colleagues, friends, and co-authors Raghav
and Tushar for slogging many days and nights with me and making this experience worthwhile! Finally, once
again I would like to thank the entire team at Apress, especially Sanchita Mandal, Celestin John, Matthew
Moodie, and our technical reviewer, Jojo Moolayil, for being a part of this wonderful journey.
—Dipanjan Sarkar


I am indebted to my family, teachers, friends, colleagues, and mentors who have inspired and encouraged
me over the years. I would also like to take this opportunity to thank my co-authors and good friends
Dipanjan Sarkar and Tushar Sharma; you guys are amazing. Special thanks to Sanchita Mandal, Celestin
John, Matthew Moodie, and Apress for the opportunity and support, and last but not least, thank you to
Jojo Moolayil for the feedback and reviews.
—Raghav Bali
I would like to express my gratitude to my family, teachers, and friends who have encouraged, supported,
and taught me over the years. Special thanks to my classmates, friends, and colleagues, Dipanjan Sarkar and
Raghav Bali, for co-authoring and making this journey wonderful through their valuable inputs and eye for
detail.
I would also like to thank Matthew Moodie, Sanchita Mandal, Celestin John, and Apress for the
opportunity and their support throughout the journey. Special thanks to Jojo Moolayil for his valuable
reviews and comments.
—Tushar Sharma


Foreword
The availability of affordable compute power, made possible by Moore's law, has been enabling rapid advances
in Machine Learning solutions and driving adoption across diverse segments of the industry. The ability
to learn complex models underlying the real-world processes from observed (training) data through
systemic, easy-to-apply Machine Learning solution stacks has been of tremendous attraction to businesses
to harness meaningful business value. The appeal and opportunities of Machine Learning have resulted in
the availability of many resources—books, tutorials, online training, and courses for solution developers,
analysts, engineers, and scientists to learn the algorithms and implement platforms and methodologies. It
is not uncommon for someone just starting out to get overwhelmed by the abundance of the material. In
addition, not following a structured workflow can lead to inconsistent or irrelevant results from Machine
Learning solutions.
Key requirements for building robust Machine Learning applications and getting consistent, actionable
results involve investing significant time and effort in understanding the objectives and key value of
the project, establishing robust data pipelines, analyzing and visualizing data, and feature engineering,
selection, and modeling. The iterative nature of these projects involves several Select → Apply → Validate
→ Tune cycles before coming up with a suitable Machine Learning-based model. A final and important
step is to integrate the solution (Machine Learning model) into existing (or new) organization systems
or business processes to sustain actionable and relevant results. Hence, the broad requirements of the
ingredients for a robust Machine Learning solution require a development platform that is suited not just
for interactive modeling of Machine Learning, but also excels in data ingestion, processing, visualization,
systems integration, and strong ecosystem support for runtime deployment and maintenance. Python is
an excellent choice of language because it fits the need of the hour with its multi-purpose capabilities, ease
of implementation and integration, active developer community, and ever-growing Machine Learning
ecosystem; as a result, its adoption for Machine Learning has been growing rapidly.
The authors of this book have leveraged their hands-on experience with solving real-world problems
using Python and its Machine Learning ecosystem to help the readers gain the solid knowledge needed to
apply essential concepts, methodologies, tools, and techniques for solving their own real-world problems
and use-cases. Practical Machine Learning with Python aims to cater to readers with varying skill levels
ranging from beginners to experts and enable them in structuring and building practical Machine
Learning solutions.
—Ram R. Varra, Senior Principal Engineer, Intel


Introduction
Data is the new oil and Machine Learning is a powerful concept and framework for making the best out of
it. In this age of automation and intelligent systems, it is hardly a surprise that Machine Learning and Data
Science are some of the top buzz words. The tremendous interest and renewed investments in the field of
Data Science across industries, enterprises, and domains are clear indicators of its enormous potential.
Intelligent systems and data-driven organizations are becoming a reality, and the advancements in tools
and techniques are only helping the field expand further. With data being of paramount importance, there has never
been a higher demand for Machine Learning and Data Science practitioners than there is now. Indeed,
the world is facing a shortage of data scientists, and the role has been called “the sexiest job of the 21st century,” which
makes it all the more worthwhile to build some valuable expertise in this domain.
Practical Machine Learning with Python is a problem solver’s guide to building real-world intelligent
systems. It follows a comprehensive three-tiered approach packed with concepts, methodologies, hands-on
examples, and code. This book helps its readers master the essential skills needed to recognize and solve
complex problems with Machine Learning and Deep Learning by following a data-driven mindset. Using
real-world case studies that leverage the popular Python Machine Learning ecosystem, this book is your
perfect companion for learning the art and science of Machine Learning to become a successful practitioner.
The concepts, techniques, tools, frameworks, and methodologies used in this book will teach you how to
think, design, build, and execute Machine Learning systems and projects successfully.
This book will get you started on the ways to leverage the Python Machine Learning ecosystem with its
diverse set of frameworks and libraries. The three-tiered approach of this book starts by focusing on building
a strong foundation around the basics of Machine Learning and relevant tools and frameworks, the next part
emphasizes the core processes around building Machine Learning pipelines, and the final part leverages this
knowledge on solving some real-world case studies from diverse domains, including retail, transportation,
movies, music, computer vision, art, and finance. We also cover a wide range of Machine Learning models,
including regression, classification, forecasting, rule-mining, and clustering. This book also touches on
cutting edge methodologies and research from the field of Deep Learning, including concepts like transfer
learning and case studies relevant to computer vision, including image classification and neural style
transfer. Each chapter consists of detailed concepts with complete hands-on examples, code, and detailed
discussions. The main intent of this book is to give a wide range of readers—including IT professionals,
analysts, developers, data scientists, engineers, and graduate students—a structured approach to gaining
essential skills pertaining to Machine Learning and enough knowledge about leveraging state-of-the-art
Machine Learning techniques and frameworks so that they can start solving their own real-world problems.
This book is application-focused, so it’s not a replacement for gaining deep conceptual and theoretical
knowledge about Machine Learning algorithms, methods, and their internal implementations. We strongly
recommend you supplement the practical knowledge gained through this book with some standard books
on data mining, statistical analysis, and theoretical aspects of Machine Learning algorithms and methods to
gain deeper insights into the world of Machine Learning.


PART I

Understanding Machine
Learning

CHAPTER 1

Machine Learning Basics
The idea of making intelligent, sentient, and self-aware machines is not something that suddenly came into
existence in the last few years. In fact, a lot of lore from Greek mythology talks about intelligent machines
and inventions having self-awareness and intelligence of their own. The origins and the evolution of the
computer have been really revolutionary over a period of several centuries, starting from the basic Abacus
and its descendant the slide rule in the 17th Century to the first general purpose computer designed by
Charles Babbage in the 1800s. In fact, once computers started evolving with the invention of the Analytical
Engine by Babbage and the first computer program, which was written by Ada Lovelace in 1842, people
started wondering whether there could come a time when computers or machines truly become
intelligent and start thinking for themselves. In fact, the renowned computer scientist Alan Turing was
highly influential in the development of theoretical computer science, algorithms, and formal language and
addressed concepts like artificial intelligence and Machine Learning as early as the 1950s. This brief insight
into the evolution of making machines learn is just to give you an idea of something that has been around
for centuries but has recently started gaining a lot of attention and focus.
With faster computers, better processing power, and more storage, we have been
living in what I like to call the “age of information” or the “age of data”. Day in and day out, we deal with
managing Big Data and building intelligent systems by using concepts and methodologies from Data
Science, Artificial Intelligence, Data Mining, and Machine Learning. Of course, most of you must have heard
many of the terms I just mentioned and come across sayings like “data is the new oil”. The main challenge
that businesses and organizations have taken on in the last decade is to make
sense of all the data that they have and use the valuable information and insights from it to make better
decisions. Indeed, with great advancements in technology, including the availability of cheap and massive
computing hardware (including GPUs) and storage, we have seen a thriving ecosystem built around
domains like Artificial Intelligence, Machine Learning, and most recently Deep Learning. Researchers,
developers, data scientists, and engineers are working round the clock to research and build
tools, frameworks, algorithms, techniques, and methodologies to build intelligent models and systems that
can predict events, automate tasks, perform complex analyses, detect anomalies, self-heal failures, and even
understand and respond to human inputs.
This chapter follows a structured approach to cover various concepts, methodologies, and ideas
associated with Machine Learning. The core idea is to give you enough background on why we need
Machine Learning, the fundamental building blocks of Machine Learning, and what Machine Learning
offers us presently. This will enable you to learn about how best you can leverage Machine Learning to
get the maximum from your data. Since this is a book on practical Machine Learning, while we will be
focused on specific use cases, problems, and real-world case studies in subsequent chapters, it is extremely
important to understand formal definitions, concepts, and foundations with regard to learning algorithms,
data management, model building, evaluation, and deployment. Hence, we cover all these aspects,
including industry standards related to data mining and Machine Learning workflows, to give you a
foundational framework that can be applied to approach and tackle any of the real-world problems we solve

© Dipanjan Sarkar, Raghav Bali and Tushar Sharma 2018
D. Sarkar et al., Practical Machine Learning with Python, https://doi.org/10.1007/978-1-4842-3207-1_1



in subsequent chapters. Besides this, we also cover the different interdisciplinary fields associated with
Machine Learning, which in fact all fall under the umbrella of artificial intelligence.
This book is focused on applied or practical Machine Learning, hence the major emphasis in most
of the chapters will be on applying Machine Learning techniques and algorithms to solve real-world
problems, so some level of proficiency in basic mathematics, statistics, and Machine Learning will be
beneficial. However, since this book takes into account the varying levels of expertise of various readers, this
foundational chapter, along with the other chapters in Parts I and II, will get you up to speed on the key aspects
of Machine Learning and building Machine Learning pipelines. If you are already familiar with the basic
concepts relevant to Machine Learning and its significance, you can quickly skim through this chapter and
head over to Chapter 2, “The Python Machine Learning Ecosystem,” where we discuss the benefits of Python
for building Machine Learning systems and the major tools and frameworks typically used to solve Machine
Learning problems.

This book heavily emphasizes learning by doing, with a lot of code snippets, examples, and multiple case
studies. We leverage Python 3 and depict all our examples with relevant code files (.py) and Jupyter notebooks
(.ipynb) for a more interactive experience. We encourage you to refer to the GitHub repository for this book at
https://github.com/dipanjanS/practical-machine-learning-with-python, where we will be sharing
the necessary code and datasets pertaining to each chapter. You can leverage this repository to try all the examples
yourself as you go through the book and adapt them to solve your own real-world problems. Bonus content
relevant to Machine Learning and Deep Learning will also be shared in the future, so keep watching that space!

The Need for Machine Learning
Human beings are perhaps the most advanced and intelligent lifeform on this planet at the moment. We can
think, reason, build, evaluate, and solve complex problems. The human brain is something we ourselves
haven't figured out completely, and hence artificial intelligence has still not surpassed human
intelligence in several aspects. So a pressing question arises: why do we really need
Machine Learning? Why go out of our way to spend time and effort making machines learn
and become intelligent? The answer can be summed up in a simple sentence: "To make data-driven
decisions at scale". We dive into the details behind this sentence in the following sections.

Making Data-Driven Decisions
Getting key information or insights from data is the prime reason businesses and organizations invest
heavily in a good workforce as well as in newer paradigms and domains like Machine Learning and artificial
intelligence. The idea of data-driven decisions is not new. Fields like operations research, statistics, and
management information systems have existed for decades and attempt to bring efficiency to any business
or organization by using data and analytics. The art and science of leveraging
your data to get actionable insights and make better decisions is known as making data-driven decisions.
Of course, this is easier said than done because rarely can we directly use raw data to make any insightful
decisions. Another important aspect of this problem is that often we use the power of reasoning or intuition
to try to make decisions based on what we have learned over a period of time and on the job. Our brain is
an extremely powerful device that helps us do so. Consider problems like understanding what your fellow
colleagues or friends are speaking, recognizing people in images, deciding whether to approve or reject a
business transaction, and so on. While we can solve these problems almost involuntarily, can you explain to
someone the process of how you solved each of them? Maybe to some extent, but after a while, it would
be like, "Hey! My brain did most of the thinking for me!" This is exactly why it is difficult to make machines
learn to solve these problems the way regular computational programs solve tasks like computing loan interest
or tax rebates. Solutions to problems that cannot be programmed inherently need a different approach, where
we use the data itself to drive decisions instead of using programmable logic, rules, or code to make these
decisions. We discuss this further in future sections.

Efficiency and Scale
While getting insights and making decisions driven by data are of paramount importance, they also need to
be done with efficiency and at scale. The key idea of using techniques from Machine Learning or artificial
intelligence is to automate processes or tasks by learning specific patterns from the data. We all want computers
or machines to tell us when a stock might rise or fall, whether an image is of a computer or a television, whether
our product placement and offers are optimal, determine shopping price trends, detect failures or outages
before they occur, and the list just goes on! While human intelligence and expertise are things we
definitely can't do without, we need to solve real-world problems at huge scale with efficiency.

A REAL-WORLD PROBLEM AT SCALE
Consider the following real-world problem. You are the manager of a world-class infrastructure team
for the DSS Company that provides Data Science services in the form of cloud based infrastructure
and analytical platforms for other businesses and consumers. Being a provider of services and
infrastructure, you want your infrastructure to be top-notch and robust to failures and outages.
Considering you are starting out of St. Louis in a small office, you have a good grasp on regularly
monitoring all your network devices, including routers, switches, firewalls, and load balancers, with your
team of 10 experienced employees. Soon you make a breakthrough with providing cloud based Deep
Learning services and GPUs for development and earn huge profits. However, now you keep getting
more and more customers. The time has come for expanding your base to offices in San Francisco,
New York, and Boston. You have a huge connected infrastructure now with hundreds of network devices
in each building! How will you manage your infrastructure at scale now? Do you hire more manpower
for each office or do you try to leverage Machine Learning to deal with tasks like outage prediction,
auto-recovery, and device monitoring? Think about this for some time from both an engineer's as well as
a manager's point of view.

Traditional Programming Paradigm
Computers, while being extremely sophisticated and complex devices, are just another version of our well
known idiot box, the television! “How can that be?” is a very valid question at this point. Let’s consider a
television or even one of the so-called smart TVs, which are available these days. In theory as well as in
practice, the TV will do whatever you program it to do. It will show you the channels you want to see, record
the shows you want to view later on, and play the applications you want to play! The computer has been
doing the exact same thing but in a different way. Traditional programming paradigms basically involve the
user or programmer writing a set of instructions or operations in code that makes the computer perform
specific computations on data to give the desired results. Figure 1-1 depicts a typical workflow for traditional
programming paradigms.

Figure 1-1. Traditional programming paradigm
From Figure 1-1, you can see that the core inputs given to the computer are data and
one or more programs, which are basically code written in a programming language, whether
high-level like Java or Python, or low-level like C or even Assembly. Programs enable computers
to work on data, perform computations, and generate output. A task that can be performed really well with
traditional programming paradigms is computing your annual tax.
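A task like this fits the traditional paradigm perfectly, because the rules are fully known upfront. A minimal sketch in Python (the slab rates below are invented purely for illustration, not real tax law):

```python
# A traditional program: fixed rules written by a programmer, applied to data.
# The tax brackets here are hypothetical, purely for illustration.
def annual_tax(income):
    """Compute tax owed using hard-coded slab rates."""
    brackets = [(10000, 0.0), (40000, 0.1), (float('inf'), 0.2)]
    tax, lower = 0.0, 0
    for upper, rate in brackets:
        if income > lower:
            # Tax only the portion of income falling inside this slab
            tax += (min(income, upper) - lower) * rate
        lower = upper
    return tax

print(annual_tax(50000))  # 5000.0: 30000 taxed at 10% plus 10000 taxed at 20%
```

No data beyond the input is needed; the logic never changes unless a programmer rewrites it. This is exactly the property that breaks down for problems like outage prediction.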
Now, let's think about the real-world infrastructure problem we discussed in the previous section for
DSS Company. Do you think a traditional programming approach might be able to solve this problem? Well,
it could to some extent. We might be able to tap into the device data, event streams, and logs and access
various device attributes like usage levels, signal strength, incoming and outgoing connections, memory
and processor usage levels, error logs and events, and so on. We could then use the domain knowledge
of our network and infrastructure experts and set up event monitoring systems based
on specific decisions and rules derived from these data attributes. This would give us what we could call a
rule-based reactive analytical solution, where we can monitor devices, observe if any specific anomalies or
outages occur, and then take necessary action to quickly resolve any potential issues. We might also have
to hire some support and operations staff to continuously monitor and resolve issues as needed. However,
there is still the pressing problem of trying to prevent as many outages or issues as possible before they
actually take place. Can Machine Learning help us in some way?

Why Machine Learning?
We will now address the question that started this discussion of why we need Machine Learning.
Considering what you have learned so far, while the traditional programming paradigm is quite good and
human intelligence and domain expertise is definitely an important factor in making data-driven decisions,
we need Machine Learning to make faster and better decisions. The Machine Learning paradigm takes
into account data and expected outputs or results, if any, and uses the computer to build the program,
also known as a model. This program or model can then be used in the future to make necessary
decisions and give expected outputs for new inputs. Figure 1-2 shows how the Machine Learning
paradigm is similar to, yet different from, traditional programming paradigms.

Figure 1-2. Machine Learning paradigm
Figure 1-2 reinforces the fact that in the Machine Learning paradigm, the machine (in this context the
computer) uses input data and expected outputs to learn inherent patterns in the data, which
ultimately helps in building a model analogous to a computer program. That model can then make
data-driven decisions in the future (predict or tell us the output) for new input data points by using the
knowledge learned from previous data points (its experience). You might start to see the
benefit in this. We would not need hand-coded rules, complex flowcharts, case and if-then conditions, and
other criteria that are typically used to build any decision making or decision support system. The
basic idea is to use Machine Learning to make insightful decisions.
This will be clearer once we discuss our real-world problem of managing infrastructure for DSS
Company. In the traditional programming approach, we talked about hiring new staff, setting up rule-based
monitoring systems, and so on. If we were to use a Machine Learning paradigm shift here, we could go about
solving the problem using the following steps.
•	Leverage device data and logs and make sure we have enough historical data in
	some data store (database, logs, or flat files).

•	Decide key data attributes that could be useful for building a model. These could be
	device usage, logs, memory, processor, connections, line strength, links, and so on.

•	Observe and capture device attributes and their behavior over various time periods,
	covering both normal device behavior and anomalous device behavior or outages.
	These outcomes would be your outputs and the device data would be your inputs.

•	Feed these input and output pairs to a specific Machine Learning algorithm in
	your computer and build a model that learns inherent device patterns and the
	corresponding output or outcome.

•	Deploy this model such that for newer values of device attributes it can predict if a
	specific device is behaving normally or might cause a potential outage.
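These steps can be sketched end to end in code. The following is only an illustrative toy, assuming scikit-learn is available: the three device attributes and the synthetic data stand in for real historical device logs, and the feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Steps 1-3: assemble historical data. Each row is one device observation with
# hypothetical attributes [cpu_usage, memory_usage, error_rate]; the label is
# 1 for an observed outage and 0 for normal behavior (synthetic here).
X = rng.random((500, 3))
y = (0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2] > 0.6).astype(int)

# Step 4: feed input-output pairs to a learning algorithm to build a model.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Step 5: "deploy" the model -- predict outage risk for a new device reading.
new_reading = [[0.9, 0.8, 0.7]]
print(model.predict(new_reading))  # a prediction of 1 would flag a likely outage
```

In a real deployment, the model would be retrained periodically on fresh device data and served behind a monitoring pipeline rather than called from a script.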

Thus, once you are able to build a Machine Learning model, you can easily deploy it and build an
intelligent system around it such that you not only monitor devices reactively but are also able
to proactively identify potential problems and even fix them before any issues crop up. Imagine building
self-healing or auto-healing systems coupled with round the clock device monitoring. The possibilities are
indeed endless and you will not have to keep hiring new staff every time you expand your office or buy new
infrastructure.
Of course, the workflow discussed earlier, with the series of steps needed for building a Machine
Learning model, is much more complex than how it has been portrayed, but again this is just to emphasize
and make you think conceptually rather than technically about how the paradigm has shifted in the case
of Machine Learning processes, and why you need to shift your own thinking from traditional rule-based
approaches toward being data-driven. The beauty of Machine Learning is that it is never domain
constrained, and you can use its techniques to solve problems spanning multiple domains, businesses, and
industries. Also, as depicted in Figure 1-2, you do not always need output data points to build a model;
sometimes input data is sufficient (or rather, output data might not be present) for techniques more suited
toward unsupervised learning (which we discuss in depth later in this chapter). A simple example is
trying to determine customer shopping patterns by looking at the grocery items they typically buy together
in a store based on past transactional data. In the next section, we take a deeper dive into understanding
Machine Learning.
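A minimal sketch of that grocery example, assuming scikit-learn: no output labels are supplied anywhere, and the clustering algorithm groups similar baskets on its own. The items and baskets below are invented for illustration.

```python
from sklearn.cluster import KMeans

# Each row is a customer's basket as binary indicators over hypothetical items:
# [bread, butter, milk, beer, chips]. No labels are supplied anywhere.
baskets = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 0],
]

# KMeans discovers latent groups (here, roughly "breakfast" vs "snack" shoppers)
# purely from similarities among the input points themselves.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(baskets)
print(labels)  # one cluster id per customer
```

Notice that the algorithm was never told what the groups mean; interpreting them is still up to the analyst.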

Understanding Machine Learning
By now, you have seen what a typical real-world problem suitable for Machine Learning looks
like. Besides this, you have also gotten a good grasp of the basics of traditional programming and
Machine Learning paradigms. In this section, we discuss Machine Learning in more detail, looking at it
from a conceptual as well as a domain-specific standpoint.
Machine Learning came into prominence perhaps in the 1990s, when researchers and scientists started
treating it as a sub-field of Artificial Intelligence (AI) whose techniques borrow concepts
from AI, probability, and statistics and perform far better than fixed rule-based models
requiring a lot of manual time and effort. Of course, as we have pointed out earlier, Machine Learning didn't
just come out of nowhere in the 1990s. It is a multi-disciplinary field that has gradually evolved over time
and is still evolving as we speak.
A brief mention of history of evolution would be really helpful to get an idea of the various concepts
and techniques that have been involved in the development of Machine Learning and AI. You could say
that it started off in the late 1700s and the early 1800s when the first works of research were published which
basically talked about the Bayes’ Theorem. In fact Thomas Bayes’ major work, “An Essay Towards Solving
a Problem in the Doctrine of Chances,” was published in 1763. Besides this, a lot of research and discovery
was done during this time in the field of probability and mathematics. This paved the way for more ground
breaking research and inventions in the 20th Century, which included Markov Chains by Andrey Markov
in the early 1900s, proposition of a learning system by Alan Turing, and the invention of the very famous
perceptron by Frank Rosenblatt in the 1950s. Many of you might know that neural networks had several
highs and lows since the 1950s and they finally came back to prominence in the 1980s with the discovery
of backpropagation (thanks to Rumelhart, Hinton, and Williams!) and several other inventions, including
Hopfield networks, the neocognitron, convolutional and recurrent neural networks, and Q-learning. Of course,
rapid strides of evolution have also taken place in Machine Learning since the 1990s, with the development
of random forests, support vector machines, and long short-term memory networks (LSTMs), and the
release of frameworks for both Machine Learning and Deep Learning, including Torch, Theano, TensorFlow,
scikit-learn, and so on. We also saw the rise of intelligent systems including IBM Watson, DeepFace, and
AlphaGo. Indeed the journey has been quite a roller coaster ride and there’s still miles to go in this journey.
Take a moment and reflect on this evolutional journey and let’s talk about the purpose of this journey. Why
and when should we really make machines learn?

Why Make Machines Learn?
We discussed a fair bit about why we need Machine Learning in a previous section, when we addressed
the issue of leveraging data to make data-driven decisions at scale using learning algorithms, without
depending too much on manual effort and fixed rule-based systems. In this section, we discuss in more
detail why and when we should make machines learn. There are several real-world tasks and problems
that humans, businesses, and organizations try to solve day in and day out for our benefit. Several
scenarios where it might be beneficial to make machines learn are mentioned as follows.

•	Lack of sufficient human expertise in a domain (e.g., simulating navigation in
	unknown territories or even other planets).

•	Scenarios and behavior can keep changing over time (e.g., availability of
	infrastructure in an organization, network connectivity, and so on).

•	Humans have sufficient expertise in the domain but it is extremely difficult to
	formally explain or translate this expertise into computational tasks (e.g., speech
	recognition, translation, scene recognition, cognitive tasks, and so on).

•	Addressing domain-specific problems at scale with huge volumes of data with too
	many complex conditions and constraints.

The previously mentioned scenarios are just several examples where making machines learn would be
more effective than investing time, effort, and money in trying to build sub-par intelligent systems that might
be limited in scope, coverage, performance, and intelligence. We as humans and domain experts already
have enough knowledge about the world and our respective domains, which can be objective, subjective,
and sometimes even intuitive. With the availability of large volumes of historical data, we can leverage the
Machine Learning paradigm to make machines perform specific tasks by gaining enough experience by
observing patterns in data over a period of time and then use this experience in solving tasks in the future
with minimal manual intervention. The core idea remains making machines solve tasks that humans can
define intuitively and perform almost involuntarily, but that are extremely hard to define formally.

Formal Definition
We are now ready to define Machine Learning formally. You may have come across multiple definitions of
Machine Learning by now, including "techniques to make machines intelligent", "automation on steroids",
"automating the task of automation itself", "the sexiest job of the 21st century", "making computers learn by
themselves", and countless others! While all of them are good quotes and true to certain extents, the best way
to define Machine Learning is to start from the basics, as defined by renowned
professor Tom Mitchell in 1997.
The idea of Machine Learning is that there will be some learning algorithm that will help the machine
learn from data. Professor Mitchell defined it as follows.

“A computer program is said to learn from experience E with respect to some class of tasks
T and performance measure P, if its performance at tasks in T, as measured by P, improves
with experience E.”
While this definition might seem daunting at first, read through it a couple of times slowly,
focusing on the three parameters—T, P, and E—which are the main components of any learning algorithm,
as depicted in Figure 1-3.

Figure 1-3. Defining the components of a learning algorithm
We can simplify the definition as follows. Machine Learning is a field that consists of learning
algorithms that:
•	Improve their performance, P

•	At executing some task, T

•	Over time with experience, E

While we discuss each of these entities at length in the following sections, we will not spend time
formally or mathematically defining them, since the scope of the book leans toward
applied or practical Machine Learning. Considering our real-world problem from earlier, one of the tasks,
T, could be predicting outages for our infrastructure; the experience, E, would be what our Machine Learning
model gains over time by observing patterns from various device data attributes; and the performance
of the model, P, could be measured in various ways, like how accurately the model predicts outages.

Defining the Task, T
We discussed briefly in the previous section the task, T, which can be defined in a two-fold
approach. From a problem standpoint, the task, T, is basically the real-world problem to be solved,
which could be anything from finding the best marketing or product mix to predicting infrastructure failures.
In the Machine Learning world, it is best to define the task as concretely as possible, stating
exactly what problem you are planning to solve and how you could formulate
the problem as a specific Machine Learning task.
Machine Learning based tasks are difficult to solve by conventional and traditional programming
approaches. A task, T, can usually be defined as a Machine Learning task based on the process or workflow
that the system should follow to operate on data points or samples. Typically a data sample or point will
consist of multiple data attributes (also called features in Machine Learning lingo) just like the various
device parameters we mentioned in our problem for DSS Company earlier. A typical data point can be
denoted by a vector (a Python list) such that each element in the vector is for a specific data feature or
attribute. We discuss more about features and data points in detail in a future section as well as in Chapter 4,
“Feature Engineering and Selection”.
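For instance, a device observation of the kind described for DSS Company might be represented as follows; the feature names and values here are hypothetical, purely for illustration.

```python
import numpy as np

# One data sample: each position in the vector maps to one feature (attribute).
feature_names = ['cpu_usage', 'memory_usage', 'signal_strength', 'error_count']
data_point = [0.72, 0.55, -63.0, 4]   # a plain Python list works as a vector
x = np.array(data_point)              # NumPy arrays are the usual representation
print(dict(zip(feature_names, x.tolist())))
```

A dataset is then simply many such vectors stacked into a matrix, one row per sample.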
Coming back to the typical tasks that could be classified as Machine Learning tasks, the following list
describes some popular tasks.
•	Classification or categorization: This typically encompasses problems or tasks
	where the machine has to take in data points or samples and assign a specific
	class or category to each sample. A simple example would be classifying animal
	images into dogs, cats, and zebras.

•	Regression: These tasks usually involve performing a prediction such that
	a real numerical value is the output instead of a class or category for an input data
	point. The best way to understand a regression task is the real-world problem of
	predicting housing prices considering the plot area, number of floors, bathrooms,
	bedrooms, and kitchens as input attributes for each data point.

•	Anomaly detection: These tasks involve the machine going over event logs,
	transaction logs, and other data points to find anomalous or unusual
	patterns or events that differ from normal behavior. Examples include trying
	to find denial of service attacks in logs, indications of fraud, and so on.

•	Structured annotation: This usually involves performing some analysis on input
	data points and adding structured metadata as annotations to the original data
	that depict extra information and relationships among the data elements. Simple
	examples would be annotating text with parts of speech, named entities,
	grammar, and sentiment. Annotations can also be done for images, like assigning
	specific categories to image pixels or annotating specific areas of images based on
	their type, location, and so on.

•	Translation: Automated machine translation tasks take input data samples
	belonging to a specific language and translate them into another desired
	language. Natural language based translation is definitely a huge area dealing
	with a lot of text data.

•	Clustering or grouping: Clusters or groups are usually formed from input data
	samples by making the machine learn or observe inherent latent patterns,
	relationships, and similarities among the input data points themselves. Usually there
	is a lack of pre-labeled or pre-annotated data for these tasks, hence they form a part
	of unsupervised Machine Learning (which we will discuss later on). Examples would
	be grouping similar products, events, and entities.

•	Transcription: These tasks usually entail converting data representations that are
	continuous and unstructured into more structured and discrete data elements.
	Examples include speech to text, optical character recognition, images to text,
	and so on.

This should give you a good idea of typical tasks that are often solved using Machine Learning, but this
list is definitely not an exhaustive one as the limits of tasks are indeed endless and more are being discovered
with extensive research over time.
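To make the idea of a task, T, concrete, here is a minimal classification task using scikit-learn's bundled iris dataset: flower measurements go in as data points, and a species category comes out. This is a sketch, not a full modeling workflow.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Task T: classification -- assign each flower (a 4-feature data point) a species.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.predict(X_test[:5]))    # one predicted class per data point
print(clf.score(X_test, y_test))  # fraction classified correctly
```

Swapping the dataset and estimator turns the same skeleton into a regression or anomaly detection task; the structure of "data points in, task output out" stays the same.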

Defining the Experience, E
At this point, you know that any learning algorithm typically needs data to learn over time and perform a
specific task, which we named T. The process of consuming a dataset of data samples or data
points, such that a learning algorithm or model learns inherent patterns, is defined as the experience, E,
gained by the learning algorithm. Any experience the algorithm gains is from data samples or data
points, and this can happen at any point of time. You can feed it data samples in one go using historical
data or supply fresh data samples whenever they are acquired.
Thus, the idea of a model or algorithm gaining experience usually occurs as an iterative process, also
known as training the model. You could think of the model as an entity that, just like a human being,
gains knowledge or experience from data points by observing and learning more and more about various
attributes, relationships, and patterns present in the data. Of course, there are various forms and ways of
learning and gaining experience, including supervised, unsupervised, and reinforcement learning, but we
will discuss those methods in a future section. For now, remember the analogy we
drew: when a machine truly learns, it does so based on data that is fed to it from time to time, allowing
it to gain experience and knowledge about the task to be solved, such that it can use this experience, E, to
solve the same task, T, in the future for previously unseen data points.

Defining the Performance, P
Let’s say we have a Machine Learning algorithm that is supposed to perform a task, T, and is gaining
experience, E, with data points over a period of time. But how do we know if it’s performing well or behaving
the way it is supposed to behave? This is where the performance, P, of the model comes into the picture.
The performance, P, is usually a quantitative measure or metric that’s used to see how well the algorithm or
model is performing the task, T, with experience, E. While performance metrics are usually standard metrics
that have been established after years of research and development, each metric is usually computed
specific to the task, T, which we are trying to solve at any given point of time.
Typical performance measures include accuracy, precision, recall, F1 score, sensitivity, specificity,
error rate, misclassification rate, and many more. Performance measures are usually evaluated on training
data samples (used by the algorithm to gain experience, E) as well as data samples which it has not seen or
learned from before, usually known as validation and test data samples. The idea behind this is to
ensure the algorithm generalizes well and doesn't become too biased toward the training data points, so that
it performs well in the future on newer data points. More on training, validation, and test data will be
discussed when we talk about model building and validation.
While solving any Machine Learning problem, most of the time the choice of performance measure,
P, is accuracy, F1 score, precision, or recall. While this is true in most scenarios, you should always
remember that sometimes it is difficult to choose performance measures that accurately
convey how well the algorithm is performing based on the actual behavior or outcome
expected from it. A simple example is that sometimes we want to penalize misclassifications
or false positives more than we reward correct hits or predictions. In such a scenario, we might need to use
a modified cost function or priors such that we sacrifice hit rate or overall accuracy for more accurate
predictions with fewer false positives. A real-world example would be an intelligent system that predicts
if we should give a loan to a customer. It’s better to build the system in such a way that it is more cautious
against giving a loan than denying one. The simple reason is because one big mistake of giving a loan to
a potential defaulter can lead to huge losses as compared to denying several smaller loans to potential
customers. To conclude, you need to take into account all parameters and attributes involved in task, T,
such that you can decide on the right performance measures, P, for your system.
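These measures can be computed directly with scikit-learn. The tiny example below uses made-up loan decisions purely for illustration, to show how accuracy, precision, and recall diverge when false positives matter.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical loan decisions: 1 = "safe to grant loan", 0 = "likely defaulter".
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # one false negative, one false positive

print(accuracy_score(y_true, y_pred))    # 0.75 -- overall fraction correct
print(precision_score(y_true, y_pred))   # of predicted "safe", how many truly were
print(recall_score(y_true, y_pred))      # of truly "safe", how many we caught
```

For the cautious loan system described above, precision on the "safe" class would be the metric to push up, even at the cost of some recall and overall accuracy.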

A Multi-Disciplinary Field
We have formally introduced and defined Machine Learning in the previous section, which should give
you a good idea about the main components involved with any learning algorithm. Let’s now shift our
perspective to Machine Learning as a domain and field. You might already know that Machine Learning
is mostly considered to be a sub-field of artificial intelligence and even computer science from some
perspectives. Machine Learning has concepts that have been derived and borrowed from multiple fields
over a period of time since its inception, making it a true multi-disciplinary or inter-disciplinary field.
Figure 1-4 should give you a good idea of the major fields that overlap with Machine Learning
in terms of concepts, methodologies, ideas, and techniques. An important point to remember here is that
this is definitely not an exhaustive list of domains or fields, but it pretty much depicts the major fields
associated with Machine Learning.

Figure 1-4. Machine Learning: a true multi-disciplinary field
The major domains or fields associated with Machine Learning include the following, as depicted in
Figure 1-4. We will discuss each of these fields in upcoming sections.
•	Artificial intelligence

•	Natural language processing

•	Data mining

•	Mathematics

•	Statistics

•	Computer science

•	Deep Learning

•	Data Science

You could say that Data Science is like a broad inter-disciplinary field spanning all the other fields,
which are sub-fields inside it. Of course, this is just a simple generalization and doesn't strictly mean Data
Science is inclusive of all the other fields as a superset; rather, it borrows important concepts and
methodologies from them. The basic idea of Data Science is, once again, processes, methodologies, and
techniques to extract information from data and domain knowledge. This is a big part of what we discuss in
an upcoming section on Data Science in further detail.
Coming back to Machine Learning, ideas of pattern recognition and basic data mining methodologies
like knowledge discovery in databases (KDD) came into existence when relational databases were very
prominent. These areas focus more on the ability and techniques to mine information from large datasets,
such that you can extract patterns, knowledge, and insights of interest. Of course, KDD is a whole process by
itself that includes data acquisition, storage, warehousing, processing, and analysis. Machine Learning
borrows concepts that are more concerned with the analysis phase, although you do need to go through the
other steps to reach the final stage. Data mining is again an interdisciplinary or multi-disciplinary field and
borrows concepts from computer science, mathematics, and statistics. A consequence of this is the fact
that computational statistics form an important part of most Machine Learning algorithms and techniques.
Artificial intelligence (AI) is the superset consisting of Machine Learning as one of its specialized areas.
The basic idea of AI is the study and development of intelligence as exhibited by machines, based on their
perception of their environment and input parameters and attributes, and their response such that they can
perform desired tasks as expected. AI is a truly massive field that is itself inter-disciplinary.
It draws on concepts from mathematics, statistics, computer science, cognitive science, linguistics,
neuroscience, and many more. Machine Learning is more concerned with algorithms and techniques that
can be used to understand data, build representations, and perform tasks such as predictions. Another
major sub-field under AI related to Machine Learning is natural language processing (NLP), which borrows
concepts heavily from computational linguistics and computer science. Text analytics is a prominent field
today among analysts and data scientists for extracting, processing, and understanding natural human
language. Combine NLP with AI and Machine Learning and you get chatbots, machine translators, and
virtual personal assistants, which are indeed the future of innovation and technology!
Coming to Deep Learning, it is a sub-field of Machine Learning itself that deals more with techniques
related to representational learning, such that it improves with more and more data by gaining more
experience. It follows a layered and hierarchical approach, trying to represent the given input
attributes and its current surroundings using a nested, layered hierarchy of concept representations, such
that each complex layer is built from another layer of simpler concepts. Neural networks are heavily
utilized by Deep Learning; we will look into Deep Learning in a bit more detail in a
future section and solve some real-world problems later on in this book.
Computer science is pretty much the foundation for most of these domains dealing with study,
development, engineering, and programming of computers. Hence we won’t be expanding too much on this
but you should definitely remember the importance of computer science for Machine Learning to exist and
be easily applied to solve real-world problems. This should give you a good idea about the broad landscape
of the multi-disciplinary field of Machine Learning and how it is connected across multiple related and
overlapping fields. We will discuss some of these fields in more detail in upcoming sections and cover some
basic concepts in each of these fields wherever necessary.
Let’s look at some core fundamentals of Computer Science in the following section.

Computer Science
The field of computer science (CS) can be defined as the science of understanding computers.
This involves the study, research, development, engineering, and experimentation of areas dealing with
understanding, designing, building, and using computers. It also involves extensive design and
development of algorithms and programs that can be used to make the computer perform computations
and tasks as desired. There are two major areas or fields under computer science, as follows.

•	Theoretical computer science
•	Applied or practical computer science

The two major areas under computer science span across multiple fields and domains wherein each
field forms a part or a sub-field of computer science. The main essence of computer science includes
formal languages, automata and theory of computation, algorithms, data structures, computer design and
architecture, programming languages, and software engineering principles.

Theoretical Computer Science
Theoretical computer science is the study of theory and logic that tries to explain the principles and
processes behind computation. This involves understanding the theory of computation which talks about
how computation can be used efficiently to solve problems. Theory of computation includes the study of
formal languages, automata, and understanding complexities involved in computations and algorithms.
Information and coding theory is another major field under theoretical CS that has given us domains like
signal processing, cryptography, and data compression. Principles of programming languages and their
analysis is another important aspect that talks about features, design, analysis, and implementations
of various programming languages and how compilers and interpreters work in understanding these
languages. Last but not least, data structures and algorithms are the two fundamental pillars of
theoretical CS, used extensively in computational programs and functions.

Practical Computer Science
Practical computer science, also known as applied computer science, is more about tools, methodologies,
and processes that deal with applying concepts and principles from computer science in the real world to
solve practical, day-to-day problems. This includes emerging sub-fields like artificial intelligence, Machine
Learning, computer vision, Deep Learning, natural language processing, data mining, and robotics, which
try to solve complex real-world problems based on multiple constraints and parameters and emulate
tasks that require considerable human intelligence and experience. Besides these, we also have well-established
fields, including computer architecture, operating systems, digital logic and design, distributed
computing, computer networks, security, databases, and software engineering.

Important Concepts
There are several concepts from computer science that you should know and remember, since they will be
useful foundational concepts for understanding the concepts and examples in other chapters better. It's not an
exhaustive list but should pretty much cover enough to get started.

Algorithms
An algorithm can be described as a sequence of steps, operations, computations, or functions that can
be executed to carry out a specific task. They are basically methods to describe and represent a computer
program formally through a series of operations, which are often described using plain natural language,
mathematical symbols, and diagrams. Typically flowcharts, pseudocode, and natural language are used
extensively to represent algorithms. An algorithm can be as simple as adding two numbers and as complex
as computing the inverse of a matrix.
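To make this concrete, here is a small illustration (not taken from this book's examples): Euclid's algorithm for the greatest common divisor, whose sequence of steps translates directly into a short Python function.

```python
def gcd(a, b):
    """Euclidean algorithm: repeatedly replace (a, b) with (b, a mod b)
    until the remainder is zero; the last non-zero value is the GCD."""
    while b != 0:
        a, b = b, a % b
    return a

print(gcd(54, 24))  # 6
```

Notice how the plain-language description of the steps maps one-to-one onto the code, which is exactly the relationship between an algorithm and its implementation.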

Programming Languages
A programming language is a language that has its own set of symbols, words, tokens, and operators, each
having its own significance and meaning. Thus syntax and semantics combine to form a formal language.
This language can be used to write computer programs, which are real-world implementations of
algorithms that give specific instructions to the computer so that it carries out the necessary
computations and operations. Programming languages can be low-level like C and Assembly or high-level
like Java and Python.

Code
This is basically source code that forms the foundation of computer programs. Code is written using
programming languages and consists of a collection of computer statements and instructions to make the
computer perform specific desired tasks. Code helps convert algorithms into programs using programming
languages. We will be using Python to implement most of our real-world Machine Learning solutions.

Data Structures
Data structures are specialized structures that are used to manage data. Basically they are real-world
implementations for abstract data type specifications that can be used to store, retrieve, manage, and
operate on data efficiently. There is a whole suite of data structures like arrays, lists, tuples, records,
structures, unions, classes, and many more. We will be using Python data structures like lists, arrays,
dataframes, and dictionaries extensively to operate on real-world data!
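As a minimal sketch of these built-in structures (the values below are made up purely for illustration), here is how a list, a tuple, and a dictionary behave in Python:

```python
# a few core Python data structures we will lean on throughout the book
row = [5.1, 3.5, 1.4, 0.2]                  # list: ordered and mutable
point = (12.97, 77.59)                      # tuple: ordered and immutable
record = {'name': 'iris', 'features': row}  # dictionary: key-value lookup

record['features'].append(0.5)              # lists can grow in place
print(record['features'])                   # [5.1, 3.5, 1.4, 0.2, 0.5]
print(point[0])                             # 12.97
```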

Data Science
The field of Data Science is a very diverse, inter-disciplinary field which encompasses multiple fields that
we depicted in Figure 1-4. Data Science basically deals with principles, methodologies, processes, tools, and
techniques to gather knowledge or information from data (structured as well as unstructured). Data Science
is more of a compilation of processes, techniques, and methodologies to foster a data-driven decision-making
culture. In fact, Drew Conway's "Data Science Venn Diagram," depicted in Figure 1-5, shows the core
components and essence of Data Science; the diagram went viral and became insanely popular!

Figure 1-5. Drew Conway’s Data Science Venn diagram
Figure 1-5 is quite intuitive and easy to interpret. Basically, there are three major components and
Data Science sits at their intersection. Math and statistics knowledge is all about applying various
computational and quantitative math- and statistics-based techniques to extract insights from data. Hacking
skills basically indicate the capability of handling, processing, manipulating, and wrangling data into easy-to-understand
and analyzable formats. Substantive expertise is the actual real-world domain expertise,
which is extremely important when you are solving a problem, because you need to know about the various
factors, attributes, constraints, and knowledge related to the domain besides your expertise in data and
algorithms.
Thus Drew rightly points out that Machine Learning is a combination of expertise on data hacking
skills, math, and statistical learning methods and for Data Science, you need some level of domain expertise
and knowledge along with Machine Learning. You can check out Drew’s personal insights in his article at
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram, where he talks all about the Data
Science Venn diagram. Besides this, we also have Brendan Tierney, who talks about the true nature of Data
Science being a multi-disciplinary field with his own depiction, as shown in Figure 1-6.

Figure 1-6. Brendan Tierney's depiction of Data Science as a true multi-disciplinary field
If you observe his depiction closely, you will see that a lot of the domains mentioned here are the ones we just
talked about in the previous sections, and it matches a substantial part of Figure 1-4. You can clearly see Data
Science being the center of attention, drawing parts from all the other fields, with Machine Learning as a
sub-field.

Mathematics
The field of mathematics deals with numbers, logic, and formal systems. The best definition of mathematics
was coined by Aristotle as “The science of quantity”. The scope of mathematics as a scientific field is huge
spanning across areas including algebra, trigonometry, calculus, geometry, and number theory just to
name a few major fields. Linear algebra and probability are two major sub-fields under mathematics that
are used extensively in Machine Learning and we will be covering a few important concepts from them in
this section. Our major focus will always be on practical Machine Learning, and applied mathematics is an
important aspect for the same. Linear algebra deals with mathematical objects and structures like vectors,
matrices, lines, planes, hyperplanes, and vector spaces. The theory of probability is a mathematical field
and framework used for studying and quantifying events of chance and uncertainty and deriving theorems
and axioms from the same. These laws and axioms help us in reasoning, understanding, and quantifying
uncertainty and its effects in any real-world system or scenario, which helps us in building our Machine
Learning models by leveraging this framework.

Important Concepts
In this section, we discuss some key terms and concepts from applied mathematics, namely linear algebra
and probability theory. These concepts are widely used across Machine Learning and form some of the
foundational structures and principles across Machine Learning algorithms, models, and processes.

Scalar
A scalar usually denotes a single number as opposed to a collection of numbers. A simple example might be
x = 5 or x ∈ R, where x is the scalar element pointing to a single number or a real-valued single number.

Vector
A vector is defined as a structure that holds an array of numbers which are arranged in order. This basically
means the order or sequence of numbers in the collection is important. Vectors can be mathematically
denoted as x = [x1, x2, …, xn], which basically tells us that x is a one-dimensional vector having n elements in
the array. Each element can be referred to using an array index determining its position in the vector. The
following snippet shows us how we can represent simple vectors in Python.
In [1]: x = [1, 2, 3, 4, 5]
   ...: x
Out[1]: [1, 2, 3, 4, 5]
In [2]: import numpy as np
   ...: x = np.array([1, 2, 3, 4, 5])
   ...:
   ...: print(x)
   ...: print(type(x))
[1 2 3 4 5]
<class 'numpy.ndarray'>

Thus you can see that Python lists as well as numpy based arrays can be used to represent vectors. Each
row in a dataset can act as a one-dimensional vector of n attributes, which can serve as inputs to learning
algorithms.

Matrix
A matrix is a two-dimensional structure that basically holds numbers. It’s also often referred to as a 2D array.
Each element can be referred to using a row and column index as compared to a single vector index in case
ém11 m12 m13 ù
ê
ú
of vectors. Mathematically, you can depict a matrix as M = êm21 m22 m23 ú such that M is a 3 x 3 matrix
êëm31 m32 m33 úû
having three rows and three columns and each element is denoted by mrc such that r denotes the row index
and c denotes the column index. Matrices can be easily represented as list of lists in Python and we can
leverage the numpy array structure as depicted in the following snippet.
In [3]: m = np.array([[1, 5, 2],
   ...:               [4, 7, 4],
   ...:               [2, 0, 9]])

In [4]: # view matrix
   ...: print(m)
[[1 5 2]
 [4 7 4]
 [2 0 9]]
In [5]: # view dimensions
   ...: print(m.shape)
(3, 3)
Thus you can see how we can easily leverage numpy arrays to represent matrices. You can think of
a dataset with rows and columns as a matrix such that the data features or attributes are represented by
columns and each row denotes a data sample. We will be using the same analogy later on in our analyses.
Of course, you can perform matrix operations like add, subtract, products, inverse, transpose, determinants,
and many more. The following snippet shows some popular matrix operations.
In [9]: # matrix transpose
   ...: print('Matrix Transpose:\n', m.transpose(), '\n')
   ...:
   ...: # matrix determinant
   ...: print ('Matrix Determinant:', np.linalg.det(m), '\n')
   ...:
   ...: # matrix inverse
   ...: m_inv = np.linalg.inv(m)
   ...: print ('Matrix inverse:\n', m_inv, '\n')
   ...:
   ...: # identity matrix (result of matrix x matrix_inverse)
   ...: iden_m =  np.dot(m, m_inv)
   ...: iden_m = np.round(np.abs(iden_m), 0)
   ...: print ('Product of matrix and its inverse:\n', iden_m)
   ...:
Matrix Transpose:
[[1 4 2]
 [5 7 0]
 [2 4 9]]
Matrix Determinant: -105.0
Matrix inverse:
[[-0.6         0.42857143 -0.05714286]
 [ 0.26666667 -0.04761905 -0.03809524]
 [ 0.13333333 -0.0952381   0.12380952]]
Product of matrix and its inverse:
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
This should give you a good idea to get started with matrices and their basic operations. More on this is
covered in Chapter 2, “The Python Machine Learning Ecosystem”.

Tensor
You can think of a tensor as a generic array. Tensors are basically arrays with a variable number of axes.
An element in a three-dimensional tensor T can be denoted by Tx,y,z where x, y, z denote the three axes for
specifying element T.
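A minimal sketch with numpy (the shape chosen here is arbitrary) shows how a three-dimensional tensor and its axes can be represented:

```python
import numpy as np

# a three-dimensional tensor with axes of sizes 2, 2, and 3
T = np.arange(12).reshape(2, 2, 3)

print(T.ndim)       # number of axes: 3
print(T.shape)      # size along each axis: (2, 2, 3)
print(T[1, 0, 2])   # element T at x=1, y=0, z=2
```

In this view, a scalar is a 0-axis tensor, a vector a 1-axis tensor, and a matrix a 2-axis tensor.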

Norm
The norm is a measure that is used to compute the size of a vector, often also defined as the measure of
distance from the origin to the point denoted by the vector. Mathematically, the pth norm of a vector is
denoted as follows.

$$L^p = \left\Vert x \right\Vert_p = \left( \sum_i \left| x_i \right|^p \right)^{1/p}$$

such that p ≥ 1 and p ∈ R. Popular norms in Machine Learning include the L1 norm, used extensively in Lasso
regression models, and the L2 norm, also known as the Euclidean norm, used in ridge regression models.
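As a quick sketch, numpy's `linalg.norm` function can compute these norms directly (the vector below is chosen so the results are easy to verify by hand):

```python
import numpy as np

x = np.array([3, -4])

# L1 norm: sum of absolute values -> |3| + |-4| = 7
l1 = np.linalg.norm(x, ord=1)
# L2 (Euclidean) norm: square root of the sum of squares -> 5
l2 = np.linalg.norm(x, ord=2)

print(l1, l2)  # 7.0 5.0
```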

Eigen Decomposition
This is basically a matrix decomposition process such that we decompose or break down a matrix into a
set of eigen vectors and eigen values. The eigen decomposition of a matrix can be mathematically denoted
by $M = V \, \mathrm{diag}(\lambda) \, V^{-1}$, such that the matrix M has a total of n linearly independent eigen vectors represented
as {v⁽¹⁾, v⁽²⁾, …, v⁽ⁿ⁾} and their corresponding eigen values can be represented as {λ₁, λ₂, …, λₙ}. The matrix V
consists of one eigen vector per column of the matrix, i.e., V = [v⁽¹⁾, v⁽²⁾, …, v⁽ⁿ⁾], and the vector λ consists of all
the eigen values together, i.e., λ = [λ₁, λ₂, …, λₙ].
An eigen vector of the matrix is defined as a non-zero vector such that on multiplying the matrix by the
eigen vector, the result only changes the scale of the eigen vector itself, i.e., the result is a scalar multiplied by
the eigen vector. This scalar is known as the eigen value corresponding to the eigen vector. Mathematically
this can be denoted by Mv = λv where M is our matrix, v is the eigen vector and λ is the corresponding eigen
value. The following Python snippet depicts how to extract eigen values and eigen vectors from a matrix.
In [4]: # eigendecomposition
   ...: m = np.array([[1, 5, 2],
   ...:               [4, 7, 4],
   ...:               [2, 0, 9]])
   ...:
   ...: eigen_vals, eigen_vecs = np.linalg.eig(m)
   ...:
   ...: print('Eigen Values:', eigen_vals, '\n')
   ...: print('Eigen Vectors:\n', eigen_vecs)
   ...:
Eigen Values: [ -1.32455532  11.32455532   7.        ]
Eigen Vectors:
[[-0.91761521  0.46120352 -0.46829291]
 [ 0.35550789  0.79362022 -0.74926865]
 [ 0.17775394  0.39681011  0.46829291]]

Singular Value Decomposition
The process of singular value decomposition, also known as SVD, is another matrix decomposition or
factorization process such that we are able to break down a matrix to obtain singular vectors and singular
values. Any real matrix can always be decomposed using SVD, even in cases where eigen decomposition is
not applicable. Mathematically, SVD can be defined as follows. Considering a matrix M having
dimensions m x n such that m denotes total rows and n denotes total columns, the SVD of the matrix can be
represented with the following equation.
$$M_{m \times n} = U_{m \times m} \, S_{m \times n} \, V^T_{n \times n}$$
This gives us the following main components of the decomposition equation.
•	U_{m×m} is an m x m unitary matrix where each column represents a left singular vector
•	S_{m×n} is an m x n matrix with positive numbers on the diagonal, which can also be represented as a vector of the singular values
•	V^T_{n×n} is an n x n unitary matrix where each row represents a right singular vector

In some representations, the rows and columns might be interchanged but the end result should be
the same, i.e., U and V are always orthogonal. The following snippet shows a simple SVD decomposition in
Python.
In [7]: # SVD
   ...: m = np.array([[1, 5, 2],
   ...:               [4, 7, 4],
   ...:               [2, 0, 9]])
   ...:
   ...: U, S, VT = np.linalg.svd(m)
   ...:
   ...: print('Getting SVD outputs:-\n')
   ...: print('U:\n', U, '\n')
   ...: print('S:\n', S, '\n')
   ...: print('VT:\n', VT, '\n')
   ...:
Getting SVD outputs:-

U:
[[ 0.3831556  -0.39279153  0.83600634]
 [ 0.68811254 -0.48239977 -0.54202545]
 [ 0.61619228  0.78294653  0.0854506 ]]

S:
[ 12.10668383   6.91783499   1.25370079]

VT:
[[ 0.36079164  0.55610321  0.74871798]
 [-0.10935467 -0.7720271   0.62611158]
 [-0.92621323  0.30777163  0.21772844]]
SVD as a technique and the singular values in particular are very useful in summarization based
algorithms and various other methods like dimensionality reduction.

Random Variable
Used frequently in probability and uncertainty measurement, a random variable is basically a variable that
can take on various values at random. These variables can be of discrete or continuous type in general.

Probability Distribution
A probability distribution is a distribution or arrangement that depicts the likelihood of a random variable
(or variables) taking on each of its possible states. There are usually two main types of distributions, based on
whether the variable is discrete or continuous.

Probability Mass Function
A probability mass function, also known as PMF, is a probability distribution over discrete random variables.
Popular examples include the Poisson and binomial distributions.

Probability Density Function
A probability density function, also known as PDF, is a probability distribution over continuous random
variables. Popular examples include the normal, uniform, and Student's t distributions.
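The two distribution types above can be sketched with `scipy.stats` (the binomial parameters here are purely illustrative):

```python
from scipy import stats

# PMF: probability of exactly 5 successes in 10 fair coin flips (binomial)
print(stats.binom.pmf(5, n=10, p=0.5))

# PDF: density of the standard normal distribution at its mean
print(stats.norm.pdf(0))

# a PMF sums to 1 over all possible discrete states
print(sum(stats.binom.pmf(k, n=10, p=0.5) for k in range(11)))
```

Note that a PDF value is a density, not a probability; probabilities for continuous variables come from integrating the PDF over an interval.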

Marginal Probability
The marginal probability rule is used when we already have the probability distribution for a set of random
variables and we want to compute the probability distribution for a subset of these random variables. For
discrete random variables, we can define marginal probability as follows.
$$P(x) = \sum_y P(x, y)$$

For continuous random variables, we can define it using the integration operation as follows.

$$p(x) = \int p(x, y) \, dy$$
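A minimal sketch of marginalization, assuming a small made-up joint probability table:

```python
import numpy as np

# hypothetical joint distribution P(x, y): 2 states of x, 3 states of y
P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.30, 0.15, 0.15]])

# marginal P(x): sum out y along axis 1
P_x = P_xy.sum(axis=1)
print(P_x)        # [0.4 0.6]
print(P_x.sum())  # a valid distribution still sums to 1
```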

Conditional Probability
The conditional probability rule is used when we want to determine the probability that an event will
take place, given that another event has already taken place. This is mathematically represented as follows.
$$P(x \mid y) = \frac{P(x, y)}{P(y)}$$

This tells us the conditional probability of x, given that y has already taken place.
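Continuing the same idea with a made-up joint table, conditional probabilities can be computed by dividing the joint distribution by the appropriate marginal:

```python
import numpy as np

# hypothetical joint distribution P(x, y) over two binary variables
P_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

P_y = P_xy.sum(axis=0)      # marginal P(y), summing out x
P_x_given_y = P_xy / P_y    # P(x | y) = P(x, y) / P(y), applied column-wise

print(P_x_given_y)
print(P_x_given_y.sum(axis=0))  # each conditional distribution sums to 1
```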

Bayes Theorem
This is another rule or theorem that is useful when we know the probability of an event of interest P(A) and the
conditional probability of another event based on our event of interest P(B | A), and we want to determine
the conditional probability of our event of interest given that the other event has taken place, P(A | B). This can
be defined mathematically using the following expression.
$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$

such that A and B are events and $P(B) = \sum_A P(B \mid A) \, P(A)$.
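To make the theorem concrete, here is a sketch with purely hypothetical numbers for a diagnostic test (the prevalence, sensitivity, and false positive rate below are all made up for illustration):

```python
# hypothetical numbers for a diagnostic test
p_a = 0.01              # P(A): prior probability of the condition
p_b_given_a = 0.95      # P(B | A): probability of a positive test given A
p_b_given_not_a = 0.05  # P(B | not A): false positive rate

# total probability: P(B) = P(B | A) P(A) + P(B | not A) P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # 0.161
```

Even with a sensitive test, a rare condition yields a modest posterior probability, which is why the prior P(A) matters so much in the formula.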

Statistics
The field of statistics can be defined as a specialized branch of mathematics that consists of frameworks
and methodologies to collect, organize, analyze, interpret, and present data. Generally this falls more under
applied mathematics and borrows concepts from linear algebra, distributions, probability theory, and
inferential methodologies. There are two major areas under statistics that are mentioned as follows.
•	Descriptive statistics
•	Inferential statistics

The core component of any statistical process is data. Hence, data collection is typically done first; the data
could be in global terms, often called a population, or a more restricted subset due to various constraints,
often known as a sample. Samples are usually collected manually, from surveys, experiments, data stores,
and observational studies. From this data, various analyses are carried out using statistical methods.
Descriptive statistics is used to understand basic characteristics of the data using various aggregation
and summarization measures to describe and understand the data better. These could be standard
measures like mean, median, mode, skewness, kurtosis, standard deviation, variance, and so on. You can
refer to any standard book on statistics to deep dive into these measures if you’re interested. The following
snippet depicts how to compute some essential descriptive statistical measures.
In [74]: # descriptive statistics
    ...: import scipy as sp
    ...: import numpy as np
    ...:
    ...: # get data
    ...: nums = np.random.randint(1,20, size=(1,15))[0]
    ...: print('Data: ', nums)
    ...:
    ...: # get descriptive stats
    ...: print ('Mean:', sp.mean(nums))
    ...: print ('Median:', sp.median(nums))
    ...: print ('Mode:', sp.stats.mode(nums))
    ...: print ('Standard Deviation:', sp.std(nums))
    ...: print ('Variance:', sp.var(nums))
    ...: print ('Skew:', sp.stats.skew(nums))
    ...: print ('Kurtosis:', sp.stats.kurtosis(nums))
Data:  [ 2 19  8 10 17 13 18  9 19 16  4 14 16 15  5]
Mean: 12.3333333333
Median: 14.0
Mode: ModeResult(mode=array([16]), count=array([2]))
Standard Deviation: 5.44875113112
Variance: 29.6888888889
Skew: -0.49820055879944575
Kurtosis: -1.0714842769550714
Libraries and frameworks like pandas, scipy, and numpy in general help us compute descriptive
statistics and summarize data easily in Python. We cover these frameworks as well as basic data analysis and
visualization in Chapters 2 and 3.
Inferential statistics is used when we want to test hypotheses, draw inferences, and reach conclusions about
various characteristics of our data sample or population. Frameworks and techniques like hypothesis
testing, correlation and regression analysis, forecasting, and prediction are typically used for any form
of inferential statistics. We look at this in much more detail in subsequent chapters when we cover predictive
analytics as well as time series based forecasting.
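As a small sketch of hypothesis testing (the samples below are synthetic, drawn with an arbitrary seed), a two-sample t-test checks whether two samples could plausibly share the same mean:

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(42)

# two hypothetical samples whose means we want to compare
sample_a = rng.normal(loc=10.0, scale=2.0, size=100)
sample_b = rng.normal(loc=10.8, scale=2.0, size=100)

# two-sample t-test; the null hypothesis is that both means are equal
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print('t-statistic:', t_stat)
print('p-value:', p_value)  # a small p-value is evidence against the null
```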

Data Mining
The field of data mining involves processes, methodologies, tools, and techniques to discover and extract
patterns, knowledge, insights, and valuable information from non-trivial datasets. Datasets are defined
as non-trivial when they are substantially huge, usually available from databases and data warehouses.
Once again, data mining itself is a multi-disciplinary field, incorporating concepts and techniques from
mathematics, statistics, computer science, databases, Machine Learning, and Data Science. The term is a
misnomer in general, since the "mining" refers to the mining of actual insights or information from the data
and not the data itself! In the whole process of KDD, or Knowledge Discovery in Databases, data mining is the
step where all the analysis takes place.
In general, both KDD as well as data mining are closely linked with Machine Learning since they
are all concerned with analyzing data to extract useful patterns and insights. Hence methodologies,
concepts, techniques, and processes are shared among them. The standard process for data mining followed
in the industry is known as the CRISP-DM model, which we discuss in more detail in an upcoming section in
this chapter.

Artificial Intelligence
The field of artificial intelligence encompasses multiple sub-fields including Machine Learning, natural
language processing, data mining, and so on. It can be defined as the art, science, and engineering of making
intelligent agents, machines, and programs. The field aims to provide solutions for one simple yet extremely
tough objective: "Can machines think, reason, and act like human beings?" Ideas of AI in fact existed as early as
the 1300s, when people started asking such questions and conducting research and development on building
tools that could work on concepts instead of numbers, as a calculator does. Progress in AI took place at a
steady pace with discoveries and inventions by Alan Turing and the McCulloch-Pitts artificial neuron. After a
slowdown, AI was revived once again in the 1980s with the success of expert systems and the resurgence of neural
networks, thanks to Hopfield, Rumelhart, McClelland, Hinton, and many more. Faster and better computation,
thanks to Moore's Law, led fields like data mining, Machine Learning, and even Deep Learning to come into
prominence in solving complex problems that would otherwise have been impossible to solve using traditional
approaches. Figure 1-7 shows some of the major facets under the broad umbrella of AI.

Figure 1-7. Diverse major facets under the AI umbrella
Some of the main objectives of AI include emulation of cognitive functions, also known as cognitive
learning, along with semantics and knowledge representation, learning, reasoning, problem solving, planning,
and natural language processing.
mathematics, optimization methods, logic, probability theory, Machine Learning, data mining, pattern
recognition, and linguistics. AI is still evolving over time and a lot of innovation is being done in this
field including some of the latest discoveries and inventions like self-driving cars, chatbots, drones, and
intelligent robots.

Natural Language Processing
The field of Natural Language Processing (NLP) is a multi-disciplinary field combining concepts from
computational linguistics, computer science and artificial intelligence. NLP involves the ability to
make machines process, understand, and interact with natural human languages. The major objective
of applications or systems built using NLP is to enable interactions between machines and natural
languages that have evolved over time. Major challenges in this aspect include knowledge and semantics
representation, natural language understanding, generation, and processing. Some of the major applications
of NLP are mentioned as follows.

•	Machine translation
•	Speech recognition
•	Question answering systems
•	Context recognition and resolution
•	Text summarization
•	Text categorization
•	Information extraction
•	Sentiment and emotion analysis
•	Topic segmentation

Using techniques from NLP and text analytics, you can work on text data to process, annotate, classify,
cluster, summarize, extract semantics, determine sentiment, and much more! The following example
snippet depicts some basic NLP operations on textual data where we annotate a document (text sentence)
with various components like parts of speech, phrase level tags, and so on based on its constituent grammar.
You can refer to page 159 of Text Analytics with Python (Apress; Dipanjan Sarkar, 2016) for more details on
constituency parsing.
from nltk.parse.stanford import StanfordParser

sentence = 'The quick brown fox jumps over the lazy dog'

# create parser object
scp = StanfordParser(path_to_jar='E:/stanford/stanford-parser-full-2015-04-20/stanford-parser.jar',
                     path_to_models_jar='E:/stanford/stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar')

# get parse tree
result = list(scp.raw_parse(sentence))
tree = result[0]
In [98]: # print the constituency parse tree
    ...: print(tree)
(ROOT
  (NP
    (NP (DT The) (JJ quick) (JJ brown) (NN fox))
    (NP
      (NP (NNS jumps))
      (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))))))
In [99]: # visualize constituency parse tree
    ...: tree.draw()

Figure 1-8. Constituency parse tree for our sample sentence

Figure 1-8 clearly depicts the constituency grammar based parse tree for our sample sentence, which
consists of multiple noun phrases (NP). Each phrase has several words that are also annotated with their
own part of speech (POS) tags. We cover more on processing and analyzing textual data for various steps in
the Machine Learning pipeline, as well as practical use cases, in subsequent chapters.

Deep Learning
The field of Deep Learning, as depicted earlier, is a sub-field of Machine Learning that has recently come
into much prominence. Its main objective is to get Machine Learning research closer to its true goal
of “making machines intelligent”. Deep Learning is often termed as a rebranded fancy term for neural
networks. This is true to some extent but there is definitely more to Deep Learning than just basic neural
networks. Deep Learning based algorithms involves the use of concepts from representation learning
where various representations of the data are learned in different layers that also aid in automated
feature extraction from the data. In simple terms, a Deep Learning based approach tries to build machine
intelligence by representing data as a layered hierarchy of concepts, where each layer of concepts is built
from other simpler layers. This layered architecture itself is one of the core components of any Deep
Learning algorithm.
In any basic supervised Machine Learning technique, we basically try to learn a mapping between
our data samples and our output and then try to predict output for newer data samples. Representation
learning tries to understand the representations in the data itself, besides learning a mapping from inputs
to outputs. This makes Deep Learning algorithms extremely powerful as compared to regular techniques,
which require significant expertise in areas like feature extraction and engineering. Deep Learning is
also extremely effective with regard to its performance as well as scalability with more and more data as
compared to older Machine Learning algorithms. This is depicted in Figure 1-9 based on a slide from
Andrew Ng’s talk at the Extract Data Conference.


Figure 1-9. Performance comparison of Deep Learning and traditional Machine Learning by Andrew Ng
Indeed, as rightly pointed out by Andrew Ng, there have been several noticeable trends and characteristics
related to Deep Learning that we have noticed over the past decade. They are summarized as follows.
•	Deep Learning algorithms are based on distributed representational learning and
	they start performing better with more data over time.
•	Deep Learning could be said to be a rebranding of neural networks, but there is a lot
	more to it than traditional neural networks.
•	Better software frameworks like tensorflow, theano, caffe, mxnet, and keras,
	coupled with superior hardware, have made it possible to build extremely complex,
	multi-layered Deep Learning models with huge sizes.
•	Deep Learning has multiple advantages related to automated feature extraction as
	well as performing supervised learning operations, which have helped data scientists
	and engineers solve increasingly complex problems over time.

The following points describe the salient features of most Deep Learning algorithms, some of which we
will be using in this book.
•	Hierarchical layered representation of concepts. These concepts are also called
	features in Machine Learning terminology (data attributes).
•	Distributed representational learning of the data happens through a multi-layered
	architecture (unsupervised learning).
•	More complex and high-level features and concepts are derived from simpler,
	low-level features.


•	A “deep” neural network is usually considered to have more than one hidden layer
	besides the input and output layers; typically it consists of a minimum of three to
	four hidden layers.
•	Deep architectures have a multi-layered architecture where each layer consists of
	multiple non-linear processing units. Each layer’s input is the output of the previous
	layer in the architecture. The first layer is usually the input and the last layer is the output.
•	They can perform automated feature extraction, classification, anomaly detection,
	and many other Machine Learning tasks.

This should give you a good foundational grasp of the concepts pertaining to Deep Learning. Suppose
we had a real-world problem of object recognition from images. Figure 1-10 will give us a good idea of how
typical Machine Learning and Deep Learning pipelines differ (Source: Yann LeCun).

Figure 1-10. Comparing various learning pipelines by Yann LeCun
You can clearly see how Deep Learning methods involve a hierarchical layered representation of features
and concepts from the raw data as compared to other Machine Learning methods. We conclude this section
with a brief coverage of some essential concepts pertaining to Deep Learning.


Important Concepts
In this section, we discuss some key terms and concepts from Deep Learning algorithms and architecture.
This should be useful in the future when you are building your own Deep Learning models.

Artificial Neural Networks
An Artificial Neural Network (ANN) is a computational model and architecture that simulates biological
neurons and the way they function in our brain. Typically, an ANN has layers of interconnected nodes. The
nodes and their inter-connections are analogous to the network of neurons in our brain. A typical ANN has
an input layer, an output layer, and at least one hidden layer between the input and output, with
inter-connections, as depicted in Figure 1-11.

Figure 1-11. A typical artificial neural network
Any basic ANN will always have multiple layers of nodes, specific connection patterns and links
between the layers, connection weights and activation functions for the nodes/neurons that convert
weighted inputs to outputs. The process of learning for the network typically involves a cost function and the
objective is to optimize the cost function (typically minimize the cost). The weights keep getting updated in
the process of learning.
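
The weighted-inputs-to-outputs conversion just described can be sketched in a few lines of NumPy. This is
an illustrative sketch rather than code from any specific library: the 3-4-2 layer sizes, the random weights,
and the sigmoid activation are all arbitrary choices for demonstration.

```python
import numpy as np

def sigmoid(z):
    # activation function: squashes weighted inputs into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """One forward pass through a fully connected network: each layer
    converts its weighted inputs to outputs via activation(W . a + b)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# a toy 3-4-2 network: input layer, one hidden layer, output layer
rng = np.random.default_rng(42)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
output = forward(np.array([0.5, -0.2, 0.1]), weights, biases)
```

During learning it is exactly these weights (and biases) that keep getting updated to minimize the cost.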


Backpropagation
The backpropagation algorithm is a popular technique to train ANNs and it led to a resurgence in the
popularity of neural networks in the 1980s. The algorithm typically has two main stages—propagation and
weight updates. They are described briefly as follows.
1.	Propagation
	a.	The input data sample vectors are propagated forward through the neural
		network to generate the output values from the output layer.
	b.	Compare the generated output vector with the actual/desired output vector
		for that input data vector.
	c.	Compute the difference in error at the output units.
	d.	Backpropagate error values to generate deltas at each node/neuron.
2.	Weight Update
	a.	Compute weight gradients by multiplying the output delta (error) and input
		activation.
	b.	Use the learning rate to determine the percentage of the gradient to be
		subtracted from the original weight, and update the weights of the nodes.

These two stages are repeated multiple times with multiple iterations/epochs until we get satisfactory
results. Typically backpropagation is used along with optimization algorithms or functions like stochastic
gradient descent.
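
The two stages described above can be condensed into a single training step. The following NumPy sketch
makes some simplifying assumptions (a one-hidden-layer network, sigmoid activations, a squared-error
cost, and no bias terms), so treat it as an illustration of the mechanics rather than a production
implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, W2, lr=0.5):
    """One propagation + weight-update cycle for a one-hidden-layer net."""
    # Stage 1: propagation -- forward pass, then error deltas
    h = sigmoid(W1 @ x)                         # hidden activations
    out = sigmoid(W2 @ h)                       # output activations
    err = out - y                               # difference at the output units
    delta_out = err * out * (1 - out)           # deltas at the output neurons
    delta_h = (W2.T @ delta_out) * h * (1 - h)  # backpropagated deltas
    # Stage 2: weight update -- gradient = delta (error) x input activation
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_h, x)
    return 0.5 * np.sum(err ** 2)               # squared error before the update

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))                    # 2 inputs -> 3 hidden units
W2 = rng.normal(size=(1, 3))                    # 3 hidden units -> 1 output
x, y = np.array([1.0, 0.0]), np.array([1.0])
losses = [backprop_step(x, y, W1, W2) for _ in range(200)]
```

Repeating the step over many iterations/epochs drives the squared error down, which is exactly the cost
minimization described earlier.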

Multilayer Perceptrons
A multilayer perceptron, also known as MLP, is a fully connected, feed-forward artificial neural network with
at least three layers (input, output, and at least one hidden layer) where each layer is fully connected to the
adjacent layer. Each neuron usually is a non-linear functional processing unit. Backpropagation is typically
used to train MLPs, and even deep neural nets are MLPs when they have multiple hidden layers. MLPs are
typically used for supervised Machine Learning tasks like classification.

Convolutional Neural Networks
A convolutional neural network, also known as convnet or CNN, is a variant of the artificial neural network,
which specializes in emulating functionality and behavior of our visual cortex. CNNs typically consist of the
following three components.
•	Multiple convolutional layers, which consist of multiple filters that are convolved
	across the height and width of the input data (e.g., raw image pixels), basically
	computing a dot product to give a two-dimensional activation map. Stacking all
	the maps across all the filters gives the final output of a convolutional layer.
•	Pooling layers, which perform non-linear downsampling to reduce the input size
	and the number of parameters from the convolutional layer output. This helps
	generalize the model, prevent overfitting, and reduce computation time. Filters
	move across the height and width of the input and reduce it by taking an aggregate
	like the sum, average, or max. Typical pooling components are average or max pooling.
•	Fully connected MLPs to perform tasks such as image classification and object
	recognition.

A typical CNN architecture with all the components is depicted in Figure 1-12, which shows the
LeNet CNN model (Source: deeplearning.net).

Figure 1-12. LeNet CNN model (Source: deeplearning.net)
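
To make the pooling operation concrete, here is a minimal NumPy sketch of 2×2 max pooling on a single
feature map. The 2×2 patch size is an assumed choice for illustration; real CNN layers also involve strides
and padding, which we omit here.

```python
import numpy as np

def max_pool2x2(feature_map):
    """Non-linear downsampling: keep only the max of each 2x2 patch."""
    h, w = feature_map.shape
    pooled = feature_map[:h - h % 2, :w - w % 2]   # trim odd edges if any
    pooled = pooled.reshape(h // 2, 2, w // 2, 2)  # group into 2x2 patches
    return pooled.max(axis=(1, 3))                 # aggregate each patch by max

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [0, 1, 3, 2],
                 [2, 2, 0, 1]])
# each 2x2 quadrant collapses to its maximum: [[4, 5], [2, 3]]
```

The 4×4 map shrinks to 2×2, reducing both the input size and the downstream parameter count, as
described above.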

Recurrent Neural Networks
A recurrent neural network, also known as RNN, is a special type of an artificial neural network that allows
persisting information based on past knowledge by using a special type of looped architecture. They are used a
lot in areas related to data with sequences like predicting the next word of a sentence. These looped networks
are called recurrent because they perform the same operations and computation for each and every element
in a sequence of input data. RNNs have memory that helps in capturing information from past sequences.
Figure 1-13 (Source: Colah’s blog at http://colah.github.io/posts/2015-08-Understanding-LSTMs/) shows
the typical structure of an RNN and how it works by unrolling the network based on the length of the input
sequence to be fed at any point in time.

Figure 1-13. A recurrent neural network (Source: Colah's Blog)


Figure 1-13 clearly depicts how the unrolled network will accept sequences of length t in each pass of
the input data and operate on the same.
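
The unrolling idea can be sketched directly: the same weight matrices are applied at every time step, and
the hidden state is the “memory” that carries information from earlier elements of the sequence. The sizes
and random weights below are arbitrary choices for illustration.

```python
import numpy as np

def rnn_forward(sequence, W_xh, W_hh, h0):
    """Unrolled RNN forward pass: the *same* weights are applied at every
    time step, and the hidden state h carries memory of earlier elements."""
    h = h0
    states = []
    for x_t in sequence:                  # unroll over the sequence length
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        states.append(h)
    return states

rng = np.random.default_rng(1)
W_xh = rng.normal(size=(4, 3))            # input-to-hidden weights
W_hh = rng.normal(size=(4, 4))            # hidden-to-hidden (recurrent) weights
sequence = [rng.normal(size=3) for _ in range(5)]
states = rnn_forward(sequence, W_xh, W_hh, np.zeros(4))
```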

Long Short-Term Memory Networks
RNNs work well on sequence based data, but as the sequences grow longer, they start losing
historical context over time in the sequence, and hence the outputs are not always what is desired. This is where
Long Short-Term Memory Networks, popularly known as LSTMs, come into the picture! Introduced by
Hochreiter & Schmidhuber in 1997, LSTMs can remember information from really long sequence based
data and prevent issues like the vanishing gradient problem, which typically occurs in ANNs trained with
backpropagation. LSTMs usually consist of three or four gates, including input, output, and a special forget
gate. Figure 1-14 shows a high-level pictorial representation of a single LSTM cell.

Figure 1-14. An LSTM cell (Source: deeplearning.net)
The input gate usually can allow or deny incoming signals or inputs to alter the memory cell state. The
output gate usually propagates the value to other neurons as needed. The forget gate controls the memory
cell’s self-recurrent connection to remember or forget previous states as necessary. Multiple LSTM cells are
usually stacked in any Deep Learning network to solve real-world problems like sequence prediction.
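
The three gates can be written out as a single cell step. The following NumPy sketch uses the standard
LSTM formulation, with the input, forget, and output gate pre-activations computed from one stacked
weight matrix; the dimensions and random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step with input (i), forget (f), and output (o) gates."""
    z = W @ x + U @ h + b                # all gate pre-activations, stacked
    n = len(c)
    i = sigmoid(z[:n])                   # input gate: admit incoming signal?
    f = sigmoid(z[n:2 * n])              # forget gate: keep previous state?
    o = sigmoid(z[2 * n:3 * n])          # output gate: expose the cell value?
    g = np.tanh(z[3 * n:])               # candidate memory content
    c_new = f * c + i * g                # gated self-recurrent memory update
    h_new = o * np.tanh(c_new)           # value propagated to other neurons
    return h_new, c_new

rng = np.random.default_rng(2)
n_in, n_hid = 3, 4
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
```

Stacking such cells and chaining them over time steps is how LSTM layers are built in practice.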

Autoencoders
An autoencoder is a specialized Artificial Neural Network that is primarily used for performing unsupervised
Machine Learning tasks. Its main objective is to learn data representations, approximations, and encodings.
Autoencoders can be used for building generative models, performing dimensionality reduction, and
detecting anomalies.

Machine Learning Methods
Machine Learning has multiple algorithms, techniques, and methodologies that can be used to build models
to solve real-world problems using data. This section classifies these Machine Learning methods
under some broad categories to give a sense of the overall landscape of Machine Learning methods that
are ultimately used to perform the specific Machine Learning tasks we discussed in a previous section. Typically,
the same Machine Learning methods can be classified in multiple ways under multiple umbrellas. Following
are some of the major broad areas of Machine Learning methods.


1.	Methods based on the amount of human supervision in the learning process
	a.	Supervised learning
	b.	Unsupervised learning
	c.	Semi-supervised learning
	d.	Reinforcement learning
2.	Methods based on the ability to learn from incremental data samples
	a.	Batch learning
	b.	Online learning
3.	Methods based on their approach to generalization from data samples
	a.	Instance based learning
	b.	Model based learning

We briefly cover the various types of learning methods in the following sections to build a good
foundation with regard to Machine Learning methods and the type of tasks they usually solve. This should
give you enough knowledge to start understanding which methods should be applied in what scenarios
when we tackle various real-world use cases and problems in the subsequent chapters of the book.

■■ Discussing mathematical details and internals of each and every Machine Learning algorithm would be out
of the current scope and intent of the book, since the focus is more on solving real-world problems by applying
Machine Learning and not on theoretical Machine Learning. Hence you are encouraged to refer to standard
Machine Learning references like Pattern Recognition and Machine Learning, Christopher Bishop, 2006, and
The Elements of Statistical Learning, Trevor Hastie et al., 2001, for more theoretical and mathematical
details on the internals of Machine Learning algorithms and methods.

Supervised Learning
Supervised learning methods or algorithms include learning algorithms that take in data samples (known
as training data) and associated outputs (known as labels or responses) with each data sample during the
model training process. The main objective is to learn a mapping or association between input data samples
x and their corresponding outputs y based on multiple training data instances. This learned knowledge can
then be used in the future to predict an output y′ for any new input data sample x′ which was previously
unknown or unseen during the model training process. These methods are termed as supervised because
the model learns on data samples where the desired output responses/labels are already known beforehand
in the training phase.
Supervised learning basically tries to model the relationship between the inputs and their
corresponding outputs from the training data so that we would be able to predict output responses for new
data inputs based on the knowledge it gained earlier with regard to relationships and mappings between the
inputs and their target outputs. This is precisely why supervised learning methods are extensively used in
predictive analytics where the main objective is to predict some response for some input data that’s typically
fed into a trained supervised ML model. Supervised learning methods are of two major classes based on the
type of ML tasks they aim to solve.


•	Classification
•	Regression

Let’s look at these two Machine Learning tasks and observe the subset of supervised learning methods
that are best suited for tackling these tasks.

Classification
The classification based tasks are a sub-field under supervised Machine Learning, where the key objective
is to predict output labels or responses that are categorical in nature for input data based on what the model
has learned in the training phase. Output labels here are also known as classes or class labels, and these are
categorical in nature, meaning they are unordered and discrete values. Thus, each output response belongs
to a specific discrete class or category.
Suppose we take a real-world example of predicting the weather. Let’s keep it simple and say we
are trying to predict if the weather is sunny or rainy based on multiple input data samples consisting of
attributes or features like humidity, temperature, pressure, and precipitation. Since the prediction can be
either sunny or rainy, there are two distinct classes in total; hence this problem can also be termed
as a binary classification problem. Figure 1-15 depicts the binary weather classification task of predicting
weather as either sunny or rainy based on training the supervised model on input data samples having
feature vectors, (precipitation, humidity, pressure, and temperature) for each data sample/observation and
their corresponding class labels as either sunny or rainy.

Figure 1-15. Supervised learning: binary classification for weather prediction
A task where the total number of distinct classes is more than two becomes a multi-class classification
problem where each prediction response can be any one of the probable classes from this set. A simple
example would be trying to predict numeric digits from scanned handwritten images. In this case it becomes
a 10-class classification problem because the output class label for any image can be any digit from 0 - 9. In
both the cases, the output class is a scalar value pointing to one specific class. Multi-label classification tasks
are such that based on any input data sample, the output response is usually a vector having one or more
than one output class label. A simple real-world problem would be trying to predict the category of a news
article that could have multiple output classes like news, finance, politics, and so on.
Popular classification algorithms include logistic regression, support vector machines, neural networks,
ensembles like random forests and gradient boosting, K-nearest neighbors, decision trees, and many more.
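
As a toy illustration of the weather example above, the following snippet builds a one-nearest-neighbor
classifier (the simplest form of K-nearest neighbors) on a handful of made-up (precipitation, humidity,
pressure, temperature) samples; the data and feature scales are purely illustrative assumptions.

```python
# toy training data: (precipitation, humidity, pressure, temperature) -> label
train = [
    ((0.0, 40, 1020, 32), 'sunny'),
    ((0.1, 45, 1018, 30), 'sunny'),
    ((8.0, 90, 1002, 22), 'rainy'),
    ((6.5, 85, 1005, 21), 'rainy'),
]

def predict(sample):
    """1-nearest-neighbor: assign the class of the closest training sample."""
    def dist(a, b):
        # squared Euclidean distance between two feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda t: dist(t[0], sample))
    return label
```

A new observation like (7.0, 88, 1003, 20) lands nearest the rainy samples and is classified as rainy; this is
the learned input-to-label mapping in its most minimal form.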

Regression
Machine Learning tasks where the main objective is value estimation can be termed as regression tasks.
Regression based methods are trained on input data samples having output responses that are continuous
numeric values unlike classification, where we have discrete categories or classes. Regression models
make use of input data attributes or features (also called explanatory or independent variables) and their
corresponding continuous numeric output values (also called response, dependent, or outcome variables)
to learn specific relationships and associations between the inputs and their corresponding outputs. With
this knowledge, it can predict output responses for new, unseen data instances similar to classification but
with continuous numeric outputs.
One of the most common real-world examples of regression is prediction of house prices. You can build
a simple regression model to predict house prices based on data pertaining to land plot areas in square feet.
Figure 1-16 shows two possible regression models based on different methods to predict house prices based
on plot area.

Figure 1-16. Supervised learning: regression models for house price prediction
The basic idea here is that we try to determine if there is any relationship or association between the
data feature plot area and the outcome variable, which is the house price and is what we want to predict.
Thus once we learn this trend or relationship depicted in Figure 1-16, we can predict house prices in the
future for any given plot of land. If you have noticed the figure closely, we depicted two types of models on
purpose to show that there can be multiple ways to build a model on your training data. The main objective
is to minimize errors during training and validating the model so that it generalizes well, does not overfit or
get biased only to the training data, and performs well on future predictions.
Simple linear regression models try to model relationships on data with one feature or explanatory
variable x and a single response variable y where the objective is to predict y. Methods like ordinary least
squares (OLS) are typically used to get the best linear fit during model training.
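
For a single feature, the OLS fit has a simple closed form, sketched below on made-up plot-area/price
pairs (the numbers are illustrative, not real housing data).

```python
def ols_fit(x, y):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # slope = covariance(x, y) / variance(x); intercept from the means
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return slope, my - slope * mx

# hypothetical plot areas (sq. ft) and prices (in thousands)
areas  = [1000, 1500, 2000, 2500]
prices = [150, 225, 300, 375]        # perfectly linear: price = 0.15 * area
slope, intercept = ols_fit(areas, prices)
```

The fitted slope and intercept define the straight-line model on the left of Figure 1-16.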


Multiple regression is also known as multivariable regression. These methods try to model data where we
have one response output variable y in each observation but multiple explanatory variables in the form of a
vector X instead of a single explanatory variable. The idea is to predict y based on the different features present
in X. A real-world example would be extending our house prediction model to build a more sophisticated model
where we predict the house price based on multiple features instead of just plot area in each data sample. The
features could be represented in a vector as plot area, number of bedrooms, number of bathrooms, total floors,
furnished, or unfurnished. Based on all these attributes, the model tries to learn the relationship between each
feature vector and its corresponding house price so that it can predict them in the future.
Polynomial regression is a special case of multiple regression where the response variable y is modeled
as an nth degree polynomial of the input feature x. Basically it is multiple regression, where each feature in
the input feature vector is a power of x. The model on the right in Figure 1-16 to predict house prices is a
polynomial model of degree 2.
Non-linear regression methods try to model relationships between input features and outputs based on
a combination of non-linear functions applied on the input features and necessary model parameters.
Lasso regression is a special form of regression that performs normal regression and generalizes the
model well by performing regularization as well as feature or variable selection. Lasso stands for least absolute
shrinkage and selection operator. The L1 norm is typically used as the regularization term in lasso regression.
Ridge regression is another special form of regression that performs normal regression and generalizes
the model by performing regularization to prevent overfitting the model. Typically the L2 norm is used as the
regularization term in ridge regression.
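
The shrinkage effect of the L2 penalty can be seen directly from the closed-form ridge solution
w = (XᵀX + λI)⁻¹Xᵀy: as λ grows, the weights are pulled toward zero. A small NumPy sketch on synthetic
data (the true weights and noise level are assumptions for the demonstration):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: the L2 penalty lam * ||w||^2
    shrinks the learned weights toward zero as lam grows."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)
w_small = ridge_fit(X, y, lam=0.01)   # near-OLS weights
w_large = ridge_fit(X, y, lam=100.0)  # heavily regularized, smaller norm
```

Lasso behaves similarly except that its L1 penalty can drive some weights exactly to zero, which is what
gives it the variable selection property described above.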
Generalized linear models are generic frameworks that can be used to model data predicting different
types of output responses, including continuous, discrete, and ordinal data. Algorithms like logistic
regression are used for categorical data and ordered probit regression for ordinal data.

Unsupervised Learning
Supervised learning methods usually require some training data where the outcomes which we are trying
to predict are already available in the form of discrete labels or continuous values. However, often we do not
have the liberty or advantage of having pre-labeled training data and we still want to extract useful insights
or patterns from our data. In this scenario, unsupervised learning methods are extremely powerful. These
methods are called unsupervised because the model or algorithm tries to learn inherent latent structures,
patterns and relationships from given data without any help or supervision like providing annotations in the
form of labeled outputs or outcomes.
Unsupervised learning is more concerned with trying to extract meaningful insights or information
from data rather than trying to predict some outcome based on previously available supervised training
data. There is more uncertainty in the results of unsupervised learning but you can also gain a lot of
information from these models that was previously unavailable to view just by looking at the raw data.
Often unsupervised learning could be one of the tasks involved in building a huge intelligence system. For
example, we could use unsupervised learning to get possible outcome labels for tweet sentiments by using
the knowledge of the English vocabulary and then train a supervised model on similar data points and their
outcomes which we obtained previously through unsupervised learning. There is no hard and fast rule with
regard to using just one specific technique. You can always combine multiple methods as long as they are
relevant in solving the problem. Unsupervised learning methods can be categorized under the following
broad areas of ML tasks.

•	Clustering
•	Dimensionality reduction
•	Anomaly detection
•	Association rule-mining

We explore these tasks briefly in the following sections to get a good feel of how unsupervised learning
methods are used in the real world.

Clustering
Clustering methods are Machine Learning methods that try to find patterns of similarity and relationships
among data samples in our dataset and then cluster these samples into various groups, such that each group
or cluster of data samples has some similarity, based on the inherent attributes or features. These methods
are completely unsupervised because they try to cluster data by looking at the data features without any
prior training, supervision, or knowledge about data attributes, associations, and relationships.
Consider a real-world problem of running multiple servers in a data center and trying to analyze logs
for typical issues or errors. Our main task is to determine the various kinds of log messages that usually
occur frequently each week. In simple words, we want to group log messages into various clusters based on
some inherent characteristics. A simple approach would be to extract features from the log messages, which
would be in textual format and apply clustering on the same and group similar log messages together based
on similarity in content. Figure 1-17 shows how clustering would solve this problem. Basically we have raw
log messages to start with. Our clustering system would employ feature extraction to extract features from
text like word occurrences, phrase occurrences, and so on. Finally, a clustering algorithm like K-means or
hierarchical clustering would be employed to group or cluster messages based on similarity of their
inherent features.

Figure 1-17. Unsupervised learning: clustering log messages
It is quite clear from Figure 1-17 that our system has three distinct clusters of log messages, where
the first cluster depicts disk issues, the second cluster is about memory issues, and the third cluster is about
processor issues. Top feature words that helped in distinguishing the clusters and grouping similar data
samples (logs) together are also depicted in the figure. Of course, sometimes some features might be present
across multiple data samples hence there can be slight overlap of clusters too since this is unsupervised
learning. However, the main objective is always to create clusters such that elements of each cluster are near
each other and far apart from elements of other clusters.
There are various types of clustering methods that can be classified under the following major approaches.
•	Centroid based methods, such as K-means and K-medoids
•	Hierarchical clustering methods, such as agglomerative and divisive (Ward’s,
	affinity propagation)
•	Distribution based clustering methods, such as Gaussian mixture models
•	Density based methods, such as DBSCAN and OPTICS


Besides these, we have several methods that recently came into the clustering landscape, like BIRCH
and CLARANS.
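
The centroid based idea can be sketched in a few lines: alternate between assigning points to their nearest
centroid and moving each centroid to the mean of its points. This is a naive illustration (first-k
initialization, fixed iteration count, synthetic blobs), not a replacement for a library implementation.

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Naive K-means: alternate assign-to-nearest-centroid and
    recompute-centroid steps until the clusters settle."""
    centroids = points[:k].copy()               # naive init: first k points
    for _ in range(iters):
        # assign every point to its nearest centroid
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        centroids = np.array([points[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids

# two tight, well-separated blobs of 10 points each
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(10, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(10, 2))
pts = np.vstack([blob_a, blob_b])
labels, centroids = kmeans(pts, k=2)
```

In the log-clustering scenario above, the points would instead be feature vectors extracted from the log
messages (word and phrase occurrences).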

Dimensionality Reduction
Once we start extracting attributes or features from raw data samples, sometimes our feature space gets
bloated up with a humongous number of features. This poses multiple challenges including analyzing and
visualizing data with thousands or millions of features, which makes the feature space extremely complex
posing problems with regard to training models, memory, and space constraints. In fact this is referred
to as the “curse of dimensionality”. Unsupervised methods can also be used in these scenarios, where we
reduce the number of features or attributes for each data sample. These methods reduce the number of
feature variables by extracting or selecting a set of principal or representative features. There are multiple
popular algorithms available for dimensionality reduction like Principal Component Analysis (PCA), nearest
neighbors, and discriminant analysis. Figure 1-18 shows the output of a typical feature reduction process
applied to a Swiss Roll 3D structure having three dimensions to obtain a two-dimensional feature space for
each data sample using PCA.

Figure 1-18. Unsupervised learning: dimensionality reduction
From Figure 1-18, it is quite clear that each data sample originally had three features or dimensions,
namely D(x1, x2, x3) and after applying PCA, we reduce each data sample from our dataset into two
dimensions, namely D’(z1, z2). Dimensionality reduction techniques can be classified in two major
approaches as follows.

•	Feature Selection methods: Specific features are selected for each data sample from
	the original list of features and other features are discarded. No new features are
	generated in this process.
•	Feature Extraction methods: We engineer or extract new features from the original
	list of features in the data. Thus the reduced subset of features will contain newly
	generated features that were not part of the original feature set. PCA falls under this
	category.
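
A feature extraction sketch along the lines of Figure 1-18: project centered data onto its top principal
components via SVD. The synthetic 3-D dataset below, which lies near a 2-D plane, is an assumed
stand-in for the Swiss Roll structure.

```python
import numpy as np

def pca(X, n_components):
    """Project centered data onto its top principal components."""
    Xc = X - X.mean(axis=0)                  # center each feature
    # rows of Vt are the principal directions, sorted by variance explained
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# synthetic 3-D points that lie near a 2-D plane (z is roughly x + y)
rng = np.random.default_rng(3)
XY = rng.normal(size=(100, 2))
X3 = np.column_stack([XY, XY.sum(axis=1) + 0.01 * rng.normal(size=100)])
X2 = pca(X3, n_components=2)                 # D(x1, x2, x3) -> D'(z1, z2)
```

Each sample drops from three features to two, with the retained components capturing almost all of the
variance, which is exactly the D to D' reduction shown in Figure 1-18.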


Anomaly Detection
The process of anomaly detection is also termed as outlier detection, where we are interested in finding
out occurrences of rare events or observations that typically do not occur normally based on historical
data samples. Sometimes anomalies occur infrequently and are thus rare events, and in other instances,
anomalies might not be rare but might occur in very short bursts over time and thus have specific patterns.
Unsupervised learning methods can be used for anomaly detection such that we train the algorithm
on the training dataset having normal, non-anomalous data samples. Once it learns the necessary data
representations, patterns, and relations among attributes in normal samples, for any new data sample, it
would be able to identify it as anomalous or a normal data point by using its learned knowledge. Figure 1-19
depicts some typical anomaly detection based scenarios where you could apply supervised methods like
one-class SVM and unsupervised methods like clustering, K-nearest neighbors, auto-encoders, and so on to
detect anomalies based on data and its features.

Figure 1-19. Unsupervised learning: anomaly detection
Anomaly detection based methods are extremely popular in real-world scenarios like detection of
security attacks or breaches, credit card fraud, manufacturing anomalies, network issues, and many more.
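
A minimal unsupervised sketch of the idea: learn the mean and spread of normal historical readings and
flag anything too many standard deviations away. The three-standard-deviation threshold and the toy
sensor readings below are illustrative assumptions, not a recommendation for production use.

```python
def zscore_anomalies(values, threshold=3.0):
    """Flag points lying more than `threshold` standard deviations
    from the mean of the (assumed mostly normal) historical data."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) > threshold * std]

# mostly normal readings hovering around 10.0, with one injected spike
readings = [10.0 + 0.1 * ((i % 5) - 2) for i in range(20)] + [100.0]
```

Calling `zscore_anomalies(readings)` isolates the spike while leaving the normal readings untouched;
real systems would learn richer representations of “normal” (e.g., via clustering or autoencoders), but the
principle is the same.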

Association Rule-Mining
Typically, association rule-mining is a data mining method used to examine and analyze large transactional
datasets to find patterns and rules of interest. These patterns represent interesting relationships and
associations among various items across transactions. Association rule-mining is also often termed
market basket analysis, which is used to analyze customer shopping patterns. Association rules help in
detecting and predicting transactional patterns based on the knowledge gained from training transactions.
Using this technique, we can answer questions like which items people tend to buy together, thereby
indicating frequent item sets. We can also associate or correlate products and items, i.e., derive insights like
“people who buy beer also tend to buy chicken wings at a pub”. Figure 1-20 shows how a typical association
rule-mining method should work ideally on a transactional dataset.


Figure 1-20. Unsupervised learning: association rule-mining
From Figure 1-20, you can clearly see that based on different customer transactions over a period of
time, we have obtained the items that are closely associated and that customers tend to buy together. Some
of these frequent item sets are depicted, like {meat, eggs}, {milk, eggs}, and so on. The quality of association
rules or frequent item sets is usually determined using metrics like support, confidence, and lift.
This is an unsupervised method because we have no idea beforehand what the frequent item sets are or which
items are more strongly associated with which. Only after applying algorithms like the
Apriori algorithm or FP-growth can we detect and predict products or items closely associated with each
other and find conditional probabilistic dependencies. We cover association rule-mining in further detail in
Chapter 8.
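These three metrics are easy to compute by hand. The following sketch, using a made-up toy set of transactions, evaluates the candidate rule {milk} → {eggs}: support is the fraction of transactions containing both items, confidence is the conditional probability of eggs given milk, and lift compares that confidence to the baseline popularity of eggs.

```python
# Toy transactions; we evaluate the candidate rule {milk} -> {eggs}
transactions = [
    {"milk", "eggs", "bread"},
    {"milk", "eggs"},
    {"milk", "bread"},
    {"eggs", "beer"},
    {"milk", "eggs", "beer"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / n

antecedent, consequent = {"milk"}, {"eggs"}
supp = support(antecedent | consequent)    # P(milk and eggs)
conf = supp / support(antecedent)          # P(eggs | milk)
lift = conf / support(consequent)          # confidence vs. eggs' baseline

print(f"support={supp:.2f} confidence={conf:.2f} lift={lift:.2f}")
# support=0.60 confidence=0.75 lift=0.94
```

A lift below 1, as here, suggests the two items co-occur slightly less than their individual popularities would predict; algorithms like Apriori simply search for itemsets and rules where these metrics exceed chosen thresholds.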

Semi-Supervised Learning
The semi-supervised learning methods typically fall between supervised and unsupervised learning
methods. These methods usually use a lot of training data that’s unlabeled (forming the unsupervised
learning component) and a small amount of pre-labeled and annotated data (forming the supervised
learning component). Multiple techniques are available in the form of generative methods, graph based
methods, and heuristic based methods.
A simple approach would be building a supervised model on the limited labeled data, applying it
to large amounts of unlabeled data to obtain more labeled samples, training the model
on those, and repeating the process. Another approach would be to use unsupervised algorithms to cluster
similar data samples, use human-in-the-loop efforts to manually annotate or label these groups, and then
use a combination of this information in the future. This approach is used in many image tagging systems.
Covering semi-supervised methods in detail is out of the scope of this book.
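The first approach above can be sketched as a simple self-training loop. This is an illustrative toy, not code from this book: the simulated dataset, the 20-sample labeled seed, and the 0.95 confidence threshold are all arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Simulate a dataset where only the first 20 samples are labeled
X, y = make_classification(n_samples=500, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:20] = True

clf = LogisticRegression(max_iter=1000)
for _ in range(5):  # a few self-training rounds
    clf.fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[~labeled])
    confident = proba.max(axis=1) > 0.95
    if not confident.any():
        break
    # Adopt the model's confident predictions as pseudo-labels
    idx = np.where(~labeled)[0][confident]
    y[idx] = clf.predict(X[idx])  # overwrite "unknown" labels
    labeled[idx] = True

print(f"labeled samples after self-training: {labeled.sum()}")
```

Each round, the model labels the unlabeled points it is most confident about and retrains on the enlarged labeled set, which is exactly the iterative scheme described above.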

Reinforcement Learning
Reinforcement learning methods are a bit different from conventional supervised or unsupervised
methods. In this context, we have an agent that we want to train to interact with a
specific environment and improve its performance over time with regard to the actions it
performs on the environment. Typically, the agent starts with a set of strategies or policies for interacting with
the environment. On observing the current state of the environment, it takes a particular action based on a rule or
policy. Based on the action, the agent gets a reward, which could be
beneficial, or detrimental in the form of a penalty. It updates its current policies and strategies if needed and
this iterative process continues till it learns enough about its environment to get the desired rewards. The
main steps of a reinforcement learning method are as follows.
1. Prepare the agent with an initial set of policies and strategies.
2. Observe the environment and its current state.
3. Select the optimal policy and perform an action.
4. Get the corresponding reward (or penalty).
5. Update the policies if needed.
6. Repeat Steps 2 - 5 iteratively until the agent learns the most optimal policies.

Consider a real-world problem of trying to make a robot or a machine learn to play chess. In this case
the agent would be the robot and the environment and states would be the chessboard and the positions of
the chess pieces. A suitable reinforcement learning methodology is depicted in Figure 1-21.

Figure 1-21. Reinforcement learning: training a robot to play chess
The main steps involved in making the robot learn to play chess are pictorially depicted in Figure 1-21,
based on the steps discussed earlier for any reinforcement learning method. In fact, Google’s
DeepMind built the AlphaGo AI with components of reinforcement learning to train the system to play the
game of Go.
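The observe → act → reward → update loop described above can be sketched with tabular Q-learning, one of the best-known reinforcement learning algorithms. The five-state "chain" environment and all parameter values below are made-up choices for this toy example.

```python
import numpy as np

# Tiny "chain" environment: states 0..4, actions 0 = left, 1 = right,
# and a reward of 1 for reaching the goal state 4
n_states, n_actions, goal = 5, 2, 4
Q = np.zeros((n_states, n_actions))      # action-value table (the "policy")
alpha, gamma, epsilon = 0.5, 0.9, 0.5    # learning rate, discount, exploration
rng = np.random.RandomState(0)

for episode in range(300):
    s = 0
    for _ in range(100):                 # cap steps per episode
        # Epsilon-greedy: mostly exploit the current policy, sometimes explore
        a = rng.randint(n_actions) if rng.rand() < epsilon else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == goal else 0.0
        # Update the value estimate toward reward + discounted future value
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
        if s == goal:
            break

print(Q.argmax(axis=1))  # learned policy: action 1 (right) in states 0-3
```

Each inner-loop iteration is one pass through Steps 2-5: observe the state, act, receive a reward, and update the policy table.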

Batch Learning
Batch learning methods are also popularly known as offline learning methods. These are Machine Learning
methods used in end-to-end Machine Learning systems where the model is trained using all the
available training data in one go. Once training is complete and the model achieves satisfactory performance,
it is deployed into production, where it predicts outputs for new data
samples. However, the model doesn’t keep learning continuously over time with new data;
once training is complete, the model stops learning. Since the model trains on the data in one single
batch and it is usually a one-time procedure, this is known as batch or offline learning.


We can always train the model on new data, but then we would have to add the new data samples
to the older historical training data and re-build the model using this new batch of data. If most of
the model building workflow has already been implemented, retraining a model would not involve a lot of
effort; however, with the data size growing with each new data sample, the retraining process will start
consuming more processor, memory, and disk resources over time. These are points to be
considered when you are building models that will run on systems with limited capacity.

Online Learning
Online learning methods work differently from batch learning methods. The training
data is usually fed in multiple incremental batches to the algorithm. These data batches are also known as
mini-batches in ML terminology. However, unlike batch learning methods, the training process does not end there.
The model keeps on learning over time based on new data samples that are sent to it for
prediction. Basically, it predicts and learns in the process with new data on the fly without having to re-run the
whole model on previous data samples.
There are several advantages to online learning—it is suitable in real-world scenarios where the model
might need to keep learning and re-training on new data samples as they arrive. Problems like device failure
or anomaly prediction and stock market forecasting are two relevant scenarios. Besides this, since the data
is fed to the model in incremental mini-batches, you can build these models on commodity hardware
without worrying about memory or disk constraints; unlike batch learning methods, you do not need
to load the full dataset in memory before training the model. In addition, once the model trains on a dataset,
you can discard it, since the model learns incrementally and
remembers what it has learned in the past.
One of the major caveats of online learning methods is that bad data samples can adversely affect
model performance. All ML methods work on the principle of “Garbage In, Garbage Out”. Hence,
if you supply bad data samples to a well-trained model, it can start learning relationships and patterns that
have no real significance, and this ends up affecting the overall model performance. Since online learning
methods keep learning based on new data samples, you should ensure proper checks are in place to
notify you in case the model performance suddenly drops. Also, suitable model parameters like the learning
rate should be selected with care to ensure the model doesn’t overfit or become biased toward specific data
samples.
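scikit-learn supports this style of learning through the partial_fit API on estimators such as SGDClassifier. The sketch below uses simulated mini-batches (the data-generating rule is a made-up assumption) and updates a linear model incrementally, never holding more than one batch in memory.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(7)
clf = SGDClassifier(random_state=7)
classes = np.array([0, 1])  # all classes must be declared up front for partial_fit

# Simulate mini-batches arriving over time; each partial_fit call updates
# the model incrementally without revisiting earlier batches
for _ in range(20):
    X_batch = rng.randn(50, 3)
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)

# The model can keep predicting (and learning) as new data flows in
X_new = rng.randn(200, 3)
y_new = (X_new[:, 0] + X_new[:, 1] > 0).astype(int)
print((clf.predict(X_new) == y_new).mean())  # accuracy on a fresh batch
```

Each old batch can be discarded after its partial_fit call, which is exactly the memory advantage discussed above; the caveat about bad batches also shows up here, since a single corrupted mini-batch would directly perturb the model weights.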

Instance Based Learning
There are various ways to build Machine Learning models using methods that try to generalize based
on input data. Instance based learning involves ML systems and methods that use the raw data points
themselves to figure out outcomes for newer, previously unseen data samples instead of building an explicit
model on training data and then testing it out.
A simple example would be a K-nearest neighbor algorithm. Assuming k = 3, we have our initial training
data. The ML method knows the representation of the data from the features, including its dimensions,
position of each data point, and so on. For any new data point, it will use a similarity measure (like cosine or
Euclidean distance) to find the three nearest input data points to this new data point. Once that is decided,
we simply take a majority vote of the outcomes for those three training points and assign it as the
outcome label/response for this new data point. Thus, instance based learning works by looking at the input
data points and using a similarity metric to generalize and predict for new data points.
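A quick sketch of this with scikit-learn's KNeighborsClassifier and a made-up six-point training set:

```python
from sklearn.neighbors import KNeighborsClassifier

# Tiny 2D training set: two well-separated clusters labeled 0 and 1
X_train = [[1, 1], [1, 2], [2, 1],      # class 0
           [8, 8], [8, 9], [9, 8]]      # class 1
y_train = [0, 0, 0, 1, 1, 1]

# k=3: a new point takes the majority label of its 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[1.5, 1.5], [8.5, 8.5]]))  # prints [0 1]
```

Note that fit here mostly just stores the training points; the real work of measuring distances and voting happens at prediction time, which is the defining trait of instance based learning.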


Model Based Learning
Model based learning methods are the more traditional ML approach of generalizing based on
training data. Typically, an iterative process takes place where the input data is used to extract features and
models are built based on various model parameters (known as hyperparameters). These hyperparameters
are optimized using various model validation techniques to select the model that generalizes best on the
training data and some amount of validation and test data (split from the initial dataset). Finally, the best
model is used to make predictions or decisions as and when needed.

The CRISP-DM Process Model
The CRISP-DM model stands for CRoss Industry Standard Process for Data Mining. More popularly known
by the acronym itself, CRISP-DM is a tried, tested, and robust industry standard process model followed for
data mining and analytics projects. CRISP-DM clearly depicts necessary steps, processes, and workflows for
executing any project right from formalizing business requirements to testing and deploying a solution to
transform data into insights. Data Science, Data Mining, and Machine Learning are all about trying to run
multiple iterative processes to extract insights and information from data. Hence we can say that analyzing
data is truly both an art and a science, because it is not always about running algorithms without
reason; a lot of the major effort involves understanding the business, the actual value of the efforts being
invested, and proper methods to articulate end results and insights.
The CRISP-DM model tells us that for building an end-to-end solution for any analytics project or
system, there are a total of six major steps or phases, some of them being iterative. Just like we have a
software development lifecycle with several major phases or steps for a software development project, we
have a data mining or analysis lifecycle in this scenario. Figure 1-22 depicts the data mining lifecycle with the
CRISP-DM model.


Figure 1-22. The CRISP-DM model depicting the data mining lifecycle
Figure 1-22 clearly shows there are a total of six major phases in the data mining lifecycle and the
direction to proceed is depicted with arrows. This model is not a rigid imposition but rather a framework to
ensure you are on the right track when going through the lifecycle of any analytics project. In some scenarios
like anomaly detection or trend analysis, you might be more interested in data understanding, exploration,
and visualization rather than intensive modeling. Each of the six phases is described in detail as follows.

Business Understanding
This is the initial phase before kick-starting any project in full flow, and it is one of the most important
phases in the lifecycle. The main objective here is understanding the business context and
requirements for the problem to be solved. Defining the business requirements is crucial to convert
the business problem into a data mining or analytics problem and to set expectations and success criteria
for both the customer and the solution task force. The final deliverable from this phase would be a
detailed plan with the major milestones of the project and expected timelines, along with success criteria,
assumptions, constraints, caveats, and challenges.


Define Business Problem
The first task in this phase would be to start by understanding the business objective of the problem to
be solved and build a formal definition of the problem. The following points are crucial toward clearly
articulating and defining the business problem.
• Get the business context of the problem to be solved and assess the problem with the help of domain and subject matter experts (SMEs).
• Describe the main pain points or target areas for the business objective to be solved.
• Understand the solutions that are currently in place, what is lacking, and what needs to be improved.
• Define the business objective along with proper deliverables and success criteria based on inputs from business, data scientists, analysts, and SMEs.

Assess and Analyze Scenarios
Once the business problem is defined clearly, the main tasks involve analyzing and assessing the
current scenario with regard to the business problem definition. This includes looking at what is currently
available and making a note of the various items required, ranging from resources and personnel to data. Besides
this, risks need to be properly assessed and contingency plans discussed. The main steps involved in the
assessment stage are as follows.
• Assess and analyze what is currently available to solve the problem from various perspectives including data, personnel, resource time, and risks.
• Build out a brief report of the key resources needed (both hardware and software) and personnel involved. In case of any shortcomings, make sure to call them out as necessary.
• Discuss the business objective requirements one by one, then identify and record possible assumptions and constraints for each requirement with the help of SMEs.
• Verify assumptions and constraints based on the data available (a lot of this might be answered only after detailed analysis, hence it depends on the problem to be solved and the data available).
• Document and report possible risks involved in the project, including timelines, resources, personnel, data, and financial concerns. Build contingency plans for each possible scenario.
• Discuss success criteria and try to document a comparative return on investment or cost versus valuation analysis if needed. This just needs to be a rough benchmark to make sure the project aligns with the company or business vision.


Define Data Mining Problem
This could be defined as the pre-analysis phase, which starts once the success criteria and the business
problem are defined and all the risks, assumptions, and constraints have been documented. This phase involves
having detailed technical discussions with your analysts, data scientists, and developers while keeping the
business stakeholders in sync. The following are the key tasks to be undertaken in this phase.
• Discuss and document possible Machine Learning and data mining methods suitable for the solution by assessing possible tools, algorithms, and techniques.
• Develop high-level designs for the end-to-end solution architecture.
• Record notes on what the end output from the solution will be and how it will integrate with existing business components.
• Record success evaluation criteria from a Data Science standpoint. A simple example could be making sure that predictions are at least 80% accurate.

Project Plan
This is the final stage under the business understanding phase. A project plan is generally created consisting
of all six major phases of the CRISP-DM model, estimated timelines, allocated resources and
personnel, and possible risks and contingency plans. Care is taken to ensure concrete high-level deliverables
and success criteria are defined for each phase, and iterative phases like modeling are highlighted with
annotations, e.g., that feedback from SMEs might require models to be rebuilt and retuned before deployment.
You should be ready for the next step once you have the following points covered.
• Definition of business objectives for the problem
• Success criteria for business and data mining efforts
• Budget allocation and resource planning
• Clear, well-defined Machine Learning and data mining methodologies to be followed, including high-level workflows from exploration to deployment
• Detailed project plan with all six phases of the CRISP-DM model defined with estimated timelines and risks

Data Understanding
The second phase in the CRISP-DM process involves taking a deep dive into the data available and
understanding it in further detail before starting the process of analysis. This involves collecting the data,
describing the various attributes, performing some exploratory analysis of the data, and keeping tabs on data
quality. This phase should not be neglected because bad data or insufficient knowledge about available data
can have cascading adverse effects in the later stages in this process.

Data Collection
This task is undertaken to extract, curate, and collect all the necessary data needed for your business
objective. Usually this involves making use of the organization’s historical data warehouses, data marts, data
lakes, and so on. An assessment is done based on the existing data available in the organization and whether there is
any need for additional data. This can be obtained from the web, i.e., open data sources, or it can be obtained
from other channels like surveys, purchases, experiments, and simulations. Detailed documents should keep
track of all datasets that will be used for analysis, and of any additional data sources that are necessary. This
document can be combined with the subsequent stages of this phase.

Data Description
Data description involves carrying out initial analysis on the data to understand more about the data, its
source, volume, attributes, and relationships. Once these details are documented, any shortcomings, if
noted, should be communicated to the relevant personnel. The following factors are crucial to building a proper data
description document.
• Data sources (SQL, NoSQL, Big Data), record of origin (ROO), record of reference (ROR)
• Data volume (size, number of records, total databases, tables)
• Data attributes and their description (variables, data types)
• Relationship and mapping schemes (understand attribute representations)
• Basic descriptive statistics (mean, median, variance)
• Focus on which attributes are important for the business

Exploratory Data Analysis
Exploratory data analysis, also known as EDA, is one of the first major analysis stages in the lifecycle. Here,
the main objective is to explore and understand the data in detail. You can make use of descriptive statistics,
plots, charts, and visualizations to look at the various data attributes, find associations and correlations and
make a note of data quality problems if any. Following are some of the major tasks in this stage.
• Explore, describe, and visualize data attributes
• Select the data and attribute subsets that seem most important for the problem
• Perform extensive analysis to find correlations and associations and test hypotheses
• Note missing data points if any

Data Quality Analysis
Data quality analysis is the final stage in the data understanding phase, where we analyze the quality of data
in our datasets and document potential errors, shortcomings, and issues that need to be resolved before
analyzing the data further or starting modeling efforts. The main focus of data quality analysis involves the
following.
• Missing values
• Inconsistent values
• Wrong information due to data errors (manual/automated)
• Wrong metadata information


Data Preparation
The third phase in the CRISP-DM process takes place after gaining enough knowledge of the business
problem and the relevant dataset. Data preparation is mainly a set of tasks performed to clean, wrangle,
curate, and prepare the data before running any analytical or Machine Learning methods and building
models. We will briefly discuss some of the major tasks under the data preparation phase in this section. An
important point to remember here is that data preparation is usually the most time consuming phase in the
data mining lifecycle, often taking 60% to 70% of the time in the overall project. However, this phase should be
taken very seriously because, as we have discussed multiple times before, bad data will lead to bad models
and poor performance and results.

Data Integration
The process of data integration is mainly done when we have multiple datasets that we might want to
integrate or merge. This can be done in two ways: appending several datasets by stacking them together, which
is typically done for datasets having the same attributes, and merging several datasets having different
attributes or columns by using common fields like keys.
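Both forms of integration map directly onto pandas operations. In this toy sketch (the DataFrames are made-up examples), concat appends datasets with the same attributes, while merge joins datasets on a common key:

```python
import pandas as pd

# Appending: same attributes, stacked row-wise
jan = pd.DataFrame({"emp_id": [1, 2], "sales": [100, 200]})
feb = pd.DataFrame({"emp_id": [1, 2], "sales": [150, 250]})
combined = pd.concat([jan, feb], ignore_index=True)   # 4 rows, same columns

# Merging: different attributes, joined on the common key emp_id
employees = pd.DataFrame({"emp_id": [1, 2], "name": ["Asha", "Ravi"]})
merged = pd.merge(combined, employees, on="emp_id", how="left")
print(merged)
```

The how parameter of merge controls the join semantics (left, right, inner, outer), just as in SQL.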

Data Wrangling
The process of data wrangling or data munging involves data processing, cleaning, normalization, and
formatting. Data in its raw form is rarely consumable by Machine Learning methods to build models. Hence
we need to process the data based on its form, clean underlying errors and inconsistencies, and format it
into more consumable formats for ML algorithms. Following are the main tasks relevant to data wrangling.
• Handling missing values (remove rows, impute missing values)
• Handling data inconsistencies (delete rows, attributes, fix inconsistencies)
• Fixing incorrect metadata and annotations
• Handling ambiguous attribute values
• Curating and formatting data into necessary formats (CSV, JSON, relational)
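The first of these tasks, handling missing values, can be sketched with pandas. The small DataFrame below is a made-up example, and the drop-versus-impute choice always depends on the problem at hand:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 29, np.nan],
    "salary": [50000, 62000, np.nan, 58000, 61000],
    "dept":   ["IT", "HR", "IT", None, "HR"],
})

# Option 1: drop rows with any missing value (loses data)
dropped = df.dropna()

# Option 2: impute -- numeric columns with the median, categorical with the mode
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["salary"] = imputed["salary"].fillna(imputed["salary"].median())
imputed["dept"] = imputed["dept"].fillna(imputed["dept"].mode()[0])

print(imputed.isna().sum().sum())  # 0 missing values remain
```

Note the trade-off: dropping keeps only fully observed rows (here just one of five), while imputation keeps every row at the cost of introducing estimated values.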

Attribute Generation and Selection
Data comprises observations or samples (rows) and attributes or features (columns). The process of
attribute generation is also known as feature extraction and engineering in Machine Learning terminology.
Attribute generation is basically creating new attributes or variables from existing attributes based on some
rules, logic, or hypotheses. A simple example would be creating a new numeric variable called age based on
two date-time fields—current_date and birth_date—for a dataset of employees in an organization. There
are several techniques with regard to attribute generation that we discuss in future chapters.
Attribute selection is basically selecting a subset of features or attributes from the dataset based on
parameters like attribute importance, quality, relevancy, assumptions, and constraints. Sometimes even
Machine Learning methods are used to select relevant attributes based on the data. This is popularly known
as feature selection in Machine Learning terminology.
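The age example above might look as follows in pandas; the employee records and the fixed reference date are made-up for illustration:

```python
import pandas as pd

employees = pd.DataFrame({
    "name": ["Asha", "Ravi"],
    "birth_date": pd.to_datetime(["1990-05-12", "1985-11-03"]),
})

# Attribute generation: derive a numeric age column from the date fields
current_date = pd.Timestamp("2018-01-01")
employees["age"] = (current_date - employees["birth_date"]).dt.days // 365
print(employees[["name", "age"]])
```

The integer-division-by-365 conversion is a rough approximation that ignores leap days, which is usually acceptable for a coarse feature like age.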


Modeling
The fourth phase in the CRISP-DM process is the core phase in the process where most of the analysis
takes place with regard to using clean, formatted data and its attributes to build models to solve business
problems. This is an iterative process, as depicted in Figure 1-22 earlier, along with model evaluation and all
the preceding steps leading up to modeling. The basic idea is to build multiple models iteratively trying to
get to the best model that satisfies our success criteria, data mining objectives, and business objectives. We
briefly talk about some of the major stages relevant to modeling in this section.

Selecting Modeling Techniques
In this stage, we pick from the list of relevant Machine Learning and data mining tools, frameworks, techniques,
and algorithms prepared in the “Business Understanding” phase. Techniques that are proven to be robust
and useful in solving the problem are usually selected based on inputs and insights from data analysts and
data scientists. These are mainly decided by the data currently available, the business goals, the data mining goals,
algorithm requirements, and constraints.

Model Building
The process of model building is also known as training the model, using data and features from our dataset.
A combination of data (features) and a Machine Learning algorithm together gives us a model that tries
to generalize on the training data and give the necessary results in the form of insights and/or predictions.
Generally, various algorithms are used to try out multiple modeling approaches on the same data to solve
the same problem, so as to get the best model that performs and gives outputs closest to the business
success criteria. Key things to keep track of here are the models created, the model parameters used, and
their results.

Model Evaluation and Tuning
In this stage, we evaluate each model based on several metrics like model accuracy, precision, recall,
F1 score, and so on. We also tune the model parameters based on techniques like grid search and cross
validation to get to the model that gives us the best results. Tuned models are also matched against the data
mining goals to see if we are able to get the desired results as well as performance. Model tuning is also
termed hyperparameter optimization in the Machine Learning world.
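A typical sketch of grid search with cross validation, using scikit-learn's GridSearchCV on the classic Iris dataset (the parameter grid values here are arbitrary illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Try every combination of hyperparameters, each scored with 5-fold
# cross validation, and keep the best-performing setting
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Cross validation guards against tuning to one lucky train/test split: each candidate is scored on five held-out folds and the scores averaged before a winner is chosen.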

Model Assessment
Once we have models that are providing desirable and relevant results, a detailed assessment of the model is
performed based on the following parameters.
• Model performance is in line with defined success criteria
• Reproducible and consistent results from models
• Scalability, robustness, and ease of deployment
• Future extensibility of the model
• Model evaluation gives satisfactory results


Evaluation
The fifth phase in the CRISP-DM process takes place once we have the final models from the modeling
phase that satisfy necessary success criteria with respect to our data mining goals and have the desired
performance and results with regard to model evaluation metrics like accuracy. The evaluation phase
involves carrying out a detailed assessment and review of the final models and the results which are
obtained from them. Some of the main points that are evaluated in this section are as follows.
• Ranking final models based on the quality of results and their relevancy based on alignment with business objectives
• Any assumptions or constraints that were invalidated by the models
• Cost of deployment of the entire Machine Learning pipeline from data extraction and processing to modeling and predictions
• Any pain points in the whole process? What should be recommended? What should be avoided?
• Data sufficiency report based on results
• Final suggestions, feedback, and recommendations from the solutions team and SMEs

Based on the report formed from these points, after a discussion, the team can decide whether they
want to proceed to the next phase of model deployment or a full reiteration is needed, starting from business
and data understanding to modeling.

Deployment
The final phase in the CRISP-DM process is all about deploying your selected models to production and
making sure the transition from development to production is seamless. Usually most organizations follow
a standard path-to-production methodology. A proper plan for deployment is built based on resources
required, servers, hardware, software, and so on. Models are validated, saved, and deployed on necessary
systems and servers. A plan is also put in place for regular monitoring and maintenance of models to
continuously evaluate their performance, check for results and their validity, and retire, replace, and update
models as and when needed.

Building Machine Intelligence
The objective of Machine Learning, data mining, or artificial intelligence is to make our lives easier,
automate tasks, and help us make better decisions. Building machine intelligence involves everything we have
learned until now, starting from Machine Learning concepts to actually implementing and building models
and using them in the real world. Machine intelligence can be built using non-traditional computing
approaches like Machine Learning. In this section, we establish full-fledged end-to-end Machine Learning
pipelines based on the CRISP-DM model, which will help us solve real-world problems by building machine
intelligence using a structured process.

Machine Learning Pipelines
The best way to solve a real-world Machine Learning or analytics problem is to use a Machine Learning
pipeline, starting from getting your data to transforming it into information and insights using Machine
Learning algorithms and techniques. This is more of a technical or solution based pipeline and it assumes
that several aspects of the CRISP-DM model are already covered, including the following points.
• Business and data understanding
• ML/DM technique selection
• Risk, assumptions, and constraints assessment

A Machine Learning pipeline will mainly consist of elements related to data retrieval and extraction,
preparation, modeling, evaluation, and deployment. Figure 1-23 shows a high-level overview of a standard
Machine Learning pipeline with the major phases highlighted in their blocks.

Figure 1-23. A standard Machine Learning pipeline
From Figure 1-23, it is evident that there are several major phases in the Machine Learning pipeline and
they are quite similar to the CRISP-DM process model, which is why we talked about it in detail earlier. The
major steps in the pipeline are briefly mentioned here.
• Data retrieval: This is mainly data collection, extraction, and acquisition from various data sources and data stores. We cover data retrieval mechanisms in detail in Chapter 3, “Processing, Wrangling, and Visualizing Data”.
• Data preparation: In this step, we pre-process the data, clean it, wrangle it, and manipulate it as needed. Initial exploratory data analysis is also carried out. The next steps involve extracting, engineering, and selecting features/attributes from the data.
    • Data processing and wrangling: Mainly concerned with data processing, cleaning, munging, wrangling, and performing initial descriptive and exploratory data analysis. We cover this in further detail with hands-on examples in Chapter 3, “Processing, Wrangling, and Visualizing Data”.
    • Feature extraction and engineering: Here, we extract important features or attributes from the raw data and even create or engineer new features from existing features. Details on various feature engineering techniques are covered in Chapter 4, “Feature Engineering and Selection”.


    • Feature scaling and selection: Data features often need to be normalized and scaled to prevent Machine Learning algorithms from getting biased. Besides this, we often need to select a subset of all available features based on feature importance and quality. This process is known as feature selection. Chapter 4, “Feature Engineering and Selection,” covers these aspects.
• Modeling: In the process of modeling, we usually feed the data features to a Machine Learning method or algorithm and train the model, typically to optimize a specific cost function, in most cases with the objective of reducing errors and generalizing the representations learned from the data. Chapter 5, “Building, Tuning, and Deploying Models,” covers the art and science behind building Machine Learning models.
• Model evaluation and tuning: Built models are evaluated and tested on validation datasets and, based on metrics like accuracy, F1 score, and others, the model performance is evaluated. Models have various parameters that are tuned in a process called hyperparameter optimization to get models with the best and optimal results. Chapter 5, “Building, Tuning, and Deploying Models,” covers these aspects.
• Deployment and monitoring: Selected models are deployed in production and are constantly monitored based on their predictions and results. Details on model deployment are covered in Chapter 5, “Building, Tuning and Deploying Models”.

Supervised Machine Learning Pipeline
By now we know that supervised Machine Learning methods are all about working with labeled data to
train models and then predict outcomes for new data samples. Some processes like feature
engineering, scaling, and selection should always remain constant so that the same features are used for
training the model and the same features are extracted from new data samples to feed the model in the
prediction phase. Based on our earlier generic Machine Learning pipeline, Figure 1-24 shows a standard
supervised Machine Learning pipeline.

Figure 1-24. Supervised Machine Learning pipeline
You can clearly see the two phases of model training and prediction highlighted in Figure 1-24.
Also, based on what we had mentioned earlier, the same sequence of data processing, wrangling, feature
engineering, scaling, and selection is used for both data used in training the model and future data samples
for which the model predicts outcomes. This is a very important point that you must remember whenever
you are building any supervised model. Besides this, as depicted, the model is a combination of a Machine
Learning (supervised) algorithm and training data features and corresponding labels. This model will take
features from new data samples and output predicted labels in the prediction phase.
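The requirement that training and prediction share the same preprocessing can be made explicit with scikit-learn's Pipeline object, which bundles the fitted transformers together with the estimator; a minimal hypothetical sketch on synthetic data (names and parameters are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=4, random_state=7)

# scaling and modeling bundled together, so prediction reuses the fitted scaler
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression(solver='liblinear'))])
pipe.fit(X, y)

# new samples pass through the identical preprocessing automatically
preds = pipe.predict(X[:5])
print(preds)
```

Because the scaler is fit only once, inside `pipe.fit`, there is no way to accidentally apply different transformations at prediction time.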

Unsupervised Machine Learning Pipeline
Unsupervised Machine Learning is all about extracting patterns, relationships, associations, and clusters from data. The processes related to feature engineering, scaling, and selection are similar to those in supervised learning. However, there is no concept of pre-labeled data here. Hence, the unsupervised Machine Learning pipeline is slightly different from the supervised pipeline. Figure 1-25 depicts a standard unsupervised Machine Learning pipeline.

Figure 1-25. Unsupervised Machine Learning pipeline
Figure 1-25 clearly depicts that no supervised labeled data is used for training the model. With the
absence of labels, we just have training data that goes through the same data preparation phase as in the
supervised learning pipeline and we build our unsupervised model with an unsupervised Machine Learning
algorithm and training features. In the prediction phase, we extract features from new data samples and pass them through the model, which gives relevant results according to the type of Machine Learning task we are trying to perform, such as clustering, pattern detection, association rule mining, or dimensionality reduction.
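As a sketch of this pipeline, consider a clustering task; the data, cluster count, and parameters below are hypothetical illustrations, not the book's examples:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# synthetic unlabeled data: three blobs in two dimensions
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# same data preparation step (scaling), but no labels are involved
X_scaled = StandardScaler().fit_transform(X)

# unsupervised model: discover cluster structure from features alone
km = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = km.fit_predict(X_scaled)

print('Discovered clusters:', sorted(set(cluster_ids)))
```

Note that the model is built purely from training features; the "prediction" for a new sample is simply its assigned cluster.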

Real-World Case Study: Predicting Student Grant
Recommendations
Let’s take a step back from what we have learned so far! The main objective here was to gain a solid grasp
over the entire Machine Learning landscape, understand crucial concepts, build on the basic foundations,
and understand how to execute Machine Learning projects with the help of Machine Learning pipelines
with the CRISP-DM process model being the source of all inspiration. Let's put all this together and tackle a very basic real-world case study by building a supervised Machine Learning pipeline on a toy dataset. Our major objective is as follows: given several students with multiple attributes like grades, performance, and scores, can you build a model based on past historical data to predict the chance of a student getting a recommendation grant for a research project?
This will be a quick walkthrough with the main intent of depicting how to build and deploy a real-world
Machine Learning pipeline and perform predictions. This will also give you a good hands-on experience to
get started with Machine Learning. Do not worry too much if you don’t understand the details of each and
every line of code; the subsequent chapters cover all the tools, techniques, and frameworks used here in
detail. We will be using Python 3.5 in this book; you can refer to Chapter 2, “The Python Machine Learning
Ecosystem” to understand more about Python and the various tools and frameworks used in Machine
Learning. You can follow along with the code snippets in this section or open the Predicting Student
Recommendation Machine Learning Pipeline.ipynb jupyter notebook by running jupyter notebook
in the command line/terminal in the same directory as this notebook. You can then run the relevant code
snippets in the notebook from your browser. Chapter 2 covers jupyter notebooks in detail.

Objective
You have historical student performance data and their grant recommendation outcomes in the form of a comma-separated values (CSV) file named student_records.csv. Each data sample consists of the following attributes.
• Name (the student's name)
• OverallGrade (overall grade obtained)
• Obedient (whether they were diligent during their course of stay)
• ResearchScore (marks obtained in their research work)
• ProjectScore (marks obtained in the project)
• Recommend (whether they got the grant recommendation)

Your main objective is to build a predictive model based on this data such that you can predict for any future student whether they will be recommended for the grant based on their performance attributes.

Data Retrieval
Here, we will leverage the pandas framework to retrieve the data from the CSV file. The following snippet
shows us how to retrieve the data and view it.
In [1]: import pandas as pd
   ...: # turn off warning messages
   ...: pd.options.mode.chained_assignment = None  # default='warn'
   ...: # get data
   ...: df = pd.read_csv('student_records.csv')
   ...: df

Figure 1-26. Raw data depicting student records and their recommendations
Now that we can see data samples showing records for each student and their corresponding
recommendation outcomes in Figure 1-26, we will perform necessary tasks relevant to data preparation.

Data Preparation
Based on the dataset we saw earlier, we do not have any data errors or missing values, hence we will mainly
focus on feature engineering and scaling in this section.

Feature Extraction and Engineering
Let’s start by extracting the existing features from the dataset and the outcomes in separate variables. The
following snippet shows this process. See Figures 1-27 and 1-28.
In [2]: # get features and corresponding outcomes
   ...: feature_names = ['OverallGrade', 'Obedient', 'ResearchScore',
                         'ProjectScore']
   ...: training_features = df[feature_names]
   ...:
   ...: outcome_name = ['Recommend']
   ...: outcome_labels = df[outcome_name]
In [3]: # view features
   ...: training_features


Figure 1-27. Dataset features
In [4]: # view outcome labels
   ...: outcome_labels

Figure 1-28. Dataset recommendation outcome labels for each student


Now that we have extracted our initial available features from the data and their corresponding outcome
labels, let’s separate out our available features based on their type (numerical and categorical). Types of
feature variables are covered in more detail in Chapter 3, “Processing, Wrangling, and Visualizing Data”.
In [5]: # list down features based on type
   ...: numeric_feature_names = ['ResearchScore', 'ProjectScore']
   ...: categoricial_feature_names = ['OverallGrade', 'Obedient']
We will now use the StandardScaler from scikit-learn to scale or normalize our two numeric score-based attributes using the following code.
In [6]: from sklearn.preprocessing import StandardScaler
   ...: ss = StandardScaler()
   ...:
   ...: # fit scaler on numeric features
   ...: ss.fit(training_features[numeric_feature_names])
   ...:
   ...: # scale numeric features now
   ...: training_features[numeric_feature_names] = \
   ...:     ss.transform(training_features[numeric_feature_names])
   ...:
   ...: # view updated featureset
   ...: training_features

Figure 1-29. Feature set with scaled numeric attributes
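Under the hood, StandardScaler performs z-score normalization: it subtracts each feature's mean and divides by its standard deviation. A quick check on a hypothetical toy column confirms this (the values are illustrative assumptions, not the book's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

scores = np.array([[30.0], [50.0], [70.0], [90.0]])  # toy score column

ss = StandardScaler()
scaled = ss.fit_transform(scores)

# manual z-score: (x - mean) / std, using the population std as StandardScaler does
manual = (scores - scores.mean()) / scores.std()

print(np.allclose(scaled, manual))  # the two computations agree
```

After scaling, the column has zero mean and unit variance, which keeps features with large numeric ranges from dominating distance- or gradient-based algorithms.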


Now that we have successfully scaled our numeric features (see Figure 1-29), let’s handle our categorical
features and carry out the necessary feature engineering needed based on the following code.
In [7]: training_features = pd.get_dummies(training_features,
                                           columns=categoricial_feature_names)
   ...: # view newly engineered features
   ...: training_features

Figure 1-30. Feature set with engineered categorical variables
In [8]: # get list of new categorical features
   ...: categorical_engineered_features = list(set(training_features.columns) -
   ...:                                        set(numeric_feature_names))
Figure 1-30 shows us the updated feature set with the newly engineered categorical variables. This
process is also known as one hot encoding.
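The idea behind one hot encoding can be seen on a tiny hypothetical frame: each distinct categorical value becomes its own 0/1 indicator column.

```python
import pandas as pd

toy = pd.DataFrame({'OverallGrade': ['A', 'F', 'A']})

# one hot encoding: one indicator column per distinct category value
encoded = pd.get_dummies(toy, columns=['OverallGrade'])
print(encoded)
```

The single OverallGrade column is replaced by OverallGrade_A and OverallGrade_F, with a 1 marking which grade each row had.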

Modeling
We will now build a simple classification (supervised) model based on our feature set by using the logistic
regression algorithm. The following code depicts how to build the supervised model.
In [9]: from sklearn.linear_model import LogisticRegression
   ...: import numpy as np
   ...:
   ...: # fit the model
   ...: lr = LogisticRegression()
   ...: model = lr.fit(training_features,
                       np.array(outcome_labels['Recommend']))
   ...: # view model parameters
   ...: model
Out[9]: LogisticRegression(C=1.0, class_weight=None, dual=False,
                        fit_intercept=True, intercept_scaling=1, max_iter=100,
                        multi_class='ovr', n_jobs=1, penalty='l2',
                        random_state=None, solver='liblinear', tol=0.0001,
                        verbose=0, warm_start=False)
Thus, we now have our supervised learning model based on the logistic regression model with
L2 regularization, as you can see from the parameters in the previous output.
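Because logistic regression models class membership through the sigmoid function, the fitted model can also report per-class probabilities rather than just hard labels; a small hypothetical sketch on synthetic data (not this chapter's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

lr = LogisticRegression(solver='liblinear')  # L2 penalty by default
lr.fit(X, y)

# probability estimates for each of the two classes; every row sums to 1
proba = lr.predict_proba(X[:3])
print(proba.shape)
```

The `predict` method simply picks the class with the higher probability, so inspecting `predict_proba` shows how confident the model is in each label.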


Model Evaluation
Typically, model evaluation is done on some holdout or validation dataset that is different from the training dataset, to prevent overfitting and a biased estimate of performance. Since this example uses a toy dataset, let's evaluate the performance of our model on the training data using the following snippet.
In [10]: # simple evaluation on training data
    ...: pred_labels = model.predict(training_features)
    ...: actual_labels = np.array(outcome_labels['Recommend'])
    ...:
    ...: # evaluate model performance
    ...: from sklearn.metrics import accuracy_score
    ...: from sklearn.metrics import classification_report
    ...:
    ...: print('Accuracy:', float(accuracy_score(actual_labels,
                pred_labels))*100, '%')
    ...: print('Classification Stats:')
    ...: print(classification_report(actual_labels, pred_labels))
Accuracy: 100.0 %
Classification Stats:
             precision    recall  f1-score   support

         No       1.00      1.00      1.00         5
        Yes       1.00      1.00      1.00         3

avg / total       1.00      1.00      1.00         8

Thus you can see the various metrics that we had mentioned earlier, like accuracy, precision, recall, and
F1 score depicting the model performance. We talk about these metrics in detail in Chapter 5, “Building,
Tuning, and Deploying Models”.
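On a realistic project, the evaluation above would use a held-out split rather than the training data; a minimal hypothetical sketch with scikit-learn's train_test_split (synthetic data and split ratio are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=1)

# hold out 30% of the data purely for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

clf = LogisticRegression(solver='liblinear').fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print('Holdout accuracy:', round(acc, 3))
```

Because the test rows never influence training, the reported accuracy is a less biased estimate of how the model will behave on genuinely new samples.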

Model Deployment
We built our first supervised learning model, and to deploy this model, typically on a system or server, we need to persist the model. We also need to save the scaler object we used to scale the numerical features, since we will use it to transform the numeric features of new data samples. The following snippet depicts a way to store the model and scaler objects.
In [11]: from sklearn.externals import joblib
    ...: import os
    ...: # save models to be deployed on your server
    ...: if not os.path.exists('Model'):
    ...:     os.mkdir('Model')
    ...: if not os.path.exists('Scaler'):
    ...:     os.mkdir('Scaler')
    ...:
    ...: joblib.dump(model, r'Model/model.pickle')
    ...: joblib.dump(ss, r'Scaler/scaler.pickle')
These files can be easily deployed on a server with necessary code to reload the model and predict new
data samples, which we will see in the upcoming sections.


Prediction in Action
We are now ready to start predicting with our newly built and deployed model! To start predictions, we need to load our model and scaler objects into memory. The following code helps us do this.
In [12]: # load model and scaler objects
    ...: model = joblib.load(r'Model/model.pickle')
    ...: scaler = joblib.load(r'Scaler/scaler.pickle')
We have some sample new student records (for two students) for which we want our model to predict if
they will get the grant recommendation. Let’s retrieve and view this data using the following code.
In [13]: ## data retrieval
    ...: new_data = pd.DataFrame([{'Name': 'Nathan', 'OverallGrade': 'F',
                   'Obedient': 'N', 'ResearchScore': 30, 'ProjectScore': 20},
    ...:                         {'Name': 'Thomas', 'OverallGrade': 'A',
                   'Obedient': 'Y', 'ResearchScore': 78, 'ProjectScore': 80}])
    ...: new_data = new_data[['Name', 'OverallGrade', 'Obedient',
                              'ResearchScore', 'ProjectScore']]
    ...: new_data

Figure 1-31. New student records
We will now carry out the tasks relevant to data preparation—feature extraction, engineering, and
scaling—in the following code snippet.
In [14]: ## data preparation
    ...: prediction_features = new_data[feature_names]
    ...:
    ...: # scaling
    ...: prediction_features[numeric_feature_names] = \
    ...:     scaler.transform(prediction_features[numeric_feature_names])
    ...:
    ...: # engineering categorical variables
    ...: prediction_features = pd.get_dummies(prediction_features,
                                     columns=categoricial_feature_names)
    ...:
    ...: # view feature set
    ...: prediction_features


Figure 1-32. Updated feature set for new students
We now have the relevant features for the new students! However, you can see that some of the categorical feature columns (for grades like B, C, and E) are missing. This is because none of these students obtained those grades, but we still need those columns because the model was trained on all attributes, including these. The following snippet helps us identify and add the missing categorical features, setting the value of each such feature to 0 for every student, since they did not obtain those grades.
In [15]: # add missing categorical feature columns
    ...: current_categorical_engineered_features = \
    ...:     set(prediction_features.columns) - set(numeric_feature_names)
    ...: missing_features = set(categorical_engineered_features) - \
    ...:     current_categorical_engineered_features
    ...: for feature in missing_features:
    ...:     # add zeros since feature is absent in these data samples
    ...:     prediction_features[feature] = [0] * len(prediction_features)
    ...:
    ...: # view final feature set
    ...: prediction_features

Figure 1-33. Final feature set for new students
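As a side note, a more concise alternative to the loop above (a hypothetical variation, not the book's code) is DataFrame.reindex, which aligns the prediction frame to the training columns and fills any missing dummy columns with 0 in one step:

```python
import pandas as pd

# training-time columns (as produced by one hot encoding); illustrative subset
train_cols = ['ResearchScore', 'ProjectScore',
              'OverallGrade_A', 'OverallGrade_B', 'OverallGrade_F']

# prediction-time frame missing the OverallGrade_B column
pred = pd.DataFrame({'ResearchScore': [0.5], 'ProjectScore': [0.2],
                     'OverallGrade_A': [1], 'OverallGrade_F': [0]})

# align to training columns; absent dummies are created and filled with 0
aligned = pred.reindex(columns=train_cols, fill_value=0)
print(list(aligned.columns))
```

This also guarantees the column ordering matches training, which some estimators silently depend on.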
We have our complete feature set ready for both the new students. Let’s put our model to the test and
get the predictions with regard to grant recommendations!
In [16]: ## predict using model
    ...: predictions = model.predict(prediction_features)
    ...:
    ...: ## display results
    ...: new_data['Recommend'] = predictions
    ...: new_data

Figure 1-34. New student records with model predictions for grant recommendations
We can clearly see from Figure 1-34 that our model has predicted grant recommendation labels for both the new students. Thomas, being diligent and having a straight A average with decent scores, is far more likely than Nathan to get the grant recommendation. Thus you can see that our model has learned how to predict grant recommendation outcomes based on past historical student data. This should whet your appetite for getting started with Machine Learning. We are about to dive deep into more complex real-world problems in the upcoming chapters!

Challenges in Machine Learning
Machine Learning is a rapidly evolving, fast-paced, and exciting field with a lot of promise, opportunity, and scope. However, it comes with its own set of challenges, due to the complex nature of Machine Learning methods, their dependency on data, and their departure from traditional computing paradigms. The following points cover some of the main challenges in Machine Learning.
• Data quality issues lead to problems, especially with regard to data processing and feature extraction.
• Data acquisition, extraction, and retrieval is an extremely tedious and time-consuming process.
• Lack of good quality and sufficient training data in many scenarios.
• Formulating business problems clearly with well-defined goals and objectives.
• Feature extraction and engineering, especially hand-crafting features, is one of the most difficult yet important tasks in Machine Learning. Deep Learning seems to have gained some advantage in this area recently.
• Overfitting or underfitting can lead to the model learning poor representations and relationships from the training data, leading to detrimental performance.
• The curse of dimensionality: too many features can be a real hindrance.
• Complex models can be difficult to deploy in the real world.

This is not an exhaustive list of challenges faced in Machine Learning today, but it is definitely a list of
the top problems data scientists or analysts usually face in Machine Learning projects and tasks. We will
cover dealing with these issues in detail when we discuss more about the various stages in the Machine
Learning pipeline as well as solve real-world problems in subsequent chapters.

Real-World Applications of Machine Learning
Machine Learning is widely being applied and used in the real world today to solve complex problems that
would otherwise have been impossible to solve based on traditional approaches and rule-based systems.
The following list depicts some of the real-world applications of Machine Learning.


• Product recommendations in online shopping platforms
• Sentiment and emotion analysis
• Anomaly detection
• Fraud detection and prevention
• Content recommendation (news, music, movies, and so on)
• Weather forecasting
• Stock market forecasting
• Market basket analysis
• Customer segmentation
• Object and scene recognition in images and video
• Speech recognition
• Churn analytics
• Click-through predictions
• Failure/defect detection and prevention
• E-mail spam filtering

Summary
The intent of this chapter was to get you familiarized with the foundations of Machine Learning before
taking a deep dive into Machine Learning pipelines and solving real-world problems. The need for Machine
Learning in today’s world is introduced in the chapter with a focus on making data-driven decisions at scale.
We also talked about the various programming paradigms and how Machine Learning has disrupted the
traditional programming paradigm. Next up, we explored the Machine Learning landscape starting from the
formal definition to the various domains and fields associated with Machine Learning. Basic foundational
concepts were covered in areas like mathematics, statistics, computer science, Data Science, data mining,
artificial intelligence, natural language processing, and Deep Learning since all of them tie back to Machine
Learning and we will also be using tools, techniques, methodologies, and processes from these fields in
future chapters. Concepts relevant to the various Machine Learning methods have also been covered, including supervised, unsupervised, semi-supervised, and reinforcement learning. Other classifications of Machine Learning methods were also depicted, like batch versus online learning methods and instance-based versus model-based learning methods. A detailed depiction of the CRISP-DM process model was given to explain the industry-standard process for data mining projects. Analogies were drawn from this model to build Machine Learning pipelines, focusing on both supervised and unsupervised learning pipelines.
We brought everything covered in this chapter together in solving a small real-world problem of
predicting grant recommendations for students and building a sample Machine Learning pipeline from
scratch. This should definitely get you ready for the next chapters, where you will explore each of the stages in a Machine Learning pipeline in further detail and cover ground on the Python Machine Learning ecosystem. Last but not least, the sections on challenges and real-world applications of Machine Learning should give you a good idea of the vast scope of Machine Learning and make you aware of the caveats and pitfalls associated with Machine Learning problems.


CHAPTER 2

The Python Machine Learning
Ecosystem
In the first chapter we explored the absolute basics of Machine Learning and looked at some of the
algorithms that we can use. Machine Learning is a very popular and relevant topic in the world of technology
today. Hence we have very diverse and varied support for Machine Learning in terms of programming languages and frameworks. There are Machine Learning libraries for almost all popular languages, including
C++, R, Julia, Scala, Python, etc. In this chapter we try to justify why Python is an apt language for Machine
Learning. Once we have argued our selection logically, we give you a brief introduction to the Python
Machine Learning (ML) ecosystem. This Python ML ecosystem is a collection of libraries that enable the
developers to extract and transform data, perform data wrangling operations, apply existing robust Machine
Learning algorithms and also develop custom algorithms easily. These libraries include numpy, scipy,
pandas, scikit-learn, statsmodels, tensorflow, keras, and so on. We cover several of these libraries
in a nutshell so that the user will have some familiarity with the basics of each of these libraries. These will
be used extensively in the later chapters of the book. An important thing to keep in mind here is that the
purpose of this chapter is to acquaint you with the diverse set of frameworks and libraries in the Python
ML ecosystem to get an idea of what can be leveraged to solve Machine Learning problems. We enrich the
content with useful links that you can refer to for extensive documentation and tutorials. We assume some
basic proficiency with Python and programming in general. All the code snippets and examples used in this chapter are available in the GitHub repository for this book at https://github.com/dipanjanS/practical-machine-learning-with-python under the directory/folder for Chapter 2. You can refer to the Python file named python_ml_ecosystem.py for all the examples used in this chapter and try the examples as you read this chapter, or you can refer to the jupyter notebook named The Python Machine Learning Ecosystem.ipynb for a more interactive experience.
Python: An Introduction
Python was created by Guido van Rossum at Stichting Mathematisch Centrum (CWI, see https://www.cwi.nl/)
in the Netherlands. The first version of Python was released in 1991. Guido wrote Python as a successor of the
language called ABC. In the following years Python has developed into an extensively used high level language
and a general programming language. Python is an interpreted language, which means that the source code of
a Python program is converted into bytecode, which is then executed by the Python virtual machine.

© Dipanjan Sarkar, Raghav Bali and Tushar Sharma 2018
D. Sarkar et al., Practical Machine Learning with Python, https://doi.org/10.1007/978-1-4842-3207-1_2

Python is different from major compiled languages like C and C++, as Python code is not required to be built and linked like code for these languages. This distinction makes for two important points:
• Python code is fast to develop: As the code is not required to be compiled and built, Python code can be changed and executed much more readily. This makes for a fast development cycle.
• Python code is not as fast in execution: Since the code is not directly compiled and executed, and an additional layer of the Python virtual machine is responsible for execution, Python code runs a little slower compared to conventional languages like C, C++, etc.

Strengths
Python has steadily risen in the charts of widely used programming languages and, according to several surveys and research studies, it is the fifth most important language in the world. Several recent surveys depict Python as the most popular language for Machine Learning and Data Science! We will compile a brief list of the advantages Python offers that probably explain its popularity.
1. Easy to learn: Python is a relatively easy-to-learn language. Its syntax is simple for a beginner to learn and understand. Compared with languages like C or Java, minimal boilerplate code is required to execute a Python program.
2. Supports multiple programming paradigms: Python is a multi-paradigm, multi-purpose programming language. It supports object oriented programming, structured programming, functional programming, and even aspect oriented programming. This versatility allows it to be used by a multitude of programmers.
3. Extensible: Extensibility is one of Python's most important characteristics. Python has a huge number of easily available modules that can be readily installed and used. These modules cover every aspect of programming, from data access to implementations of popular algorithms. This easy-to-extend feature ensures that a Python developer is more productive, as a large array of problems can be solved using available libraries.
4. Active open source community: Python is open source and supported by a large developer community. This makes it robust and adaptive. Bugs encountered are easily fixed by the Python community. Being open source, developers can tinker with the Python source code if their requirements call for it.

Pitfalls
Although Python is a very popular programming language, it comes with its own share of pitfalls. The most important limitation it suffers from is execution speed. Being an interpreted language, it is slow compared to compiled languages. This limitation can be restrictive in scenarios where extremely high-performance code is required. This is a major area of improvement for future implementations of Python, and every subsequent Python version addresses it. Although we have to admit it can never be as fast as a compiled language, we are convinced that it makes up for this deficiency by being super-efficient and effective in other departments.


Setting Up a Python Environment
The starting step for our journey into the world of Data Science is the setup of our Python environment. We
usually have two options for setting up our environment:
• Install Python and the necessary libraries individually
• Use a pre-packaged Python distribution that comes with the necessary libraries, i.e., Anaconda

Anaconda is a packaged compilation of Python along with a whole suite of a variety of libraries,
including core libraries which are widely used in Data Science. Developed by Anaconda, formerly known
as Continuum Analytics, it is often the go-to setup for data scientists. Travis Oliphant, primary contributor
to both the numpy and scipy libraries, is Anaconda’s president and one of the co-founders. The Anaconda
distribution is BSD licensed and hence it allows us to use it for commercial and redistribution purposes.
A major advantage of this distribution is that we don’t require an elaborate setup and it works well on all
flavors of operating systems and platforms, especially Windows, which can often cause problems with
installing specific Python packages. Thus, we can get started with our Data Science journey with just one
download and install. The Anaconda distribution is widely used across industry Data Science environments
and it also comes with a wonderful IDE, Spyder (Scientific Python Development Environment), besides
other useful utilities like jupyter notebooks, the IPython console, and the excellent package management
tool, conda. Recently they have also talked extensively about JupyterLab, the next-generation UI for Project Jupyter. We recommend using the Anaconda distribution and also checking out https://www.anaconda.com/what-is-anaconda/ to learn more about Anaconda.

Set Up Anaconda Python Environment
The first step in setting up your environment with the required Anaconda distribution is downloading
the required installation package from https://www.anaconda.com/download/, which is the provider of
the Anaconda distribution. The important point to note here is that we will be using Python 3.5 and the corresponding Anaconda distribution. Python 3.5.2 was released in June 2016, while 3.6 was released in December 2016. We have opted for 3.5 because we want to ensure that none of the libraries we will be using in this book have any compatibility issues; as Python 3.5 has been around for a while, we avoid such issues by opting for it. However, you are free to use Python 3.6, and the code used in this book is expected to work without major issues. We chose to leave out Python 2.7 since support for Python 2 will be ending in 2020, and from the Python community's vision it is clear that Python 3 is the future, so we recommend you use it.
Download the Anaconda3-4.2.0-Windows-x86_64 package (the one with Python 3.5) from https://repo.continuum.io/archive/. A screenshot of the target page is shown in Figure 2-1. We have chosen the Windows OS specifically because sometimes a few Python packages or libraries cause issues with installation or at runtime, and we wanted to make sure we cover those details. If you are using any other OS, like Linux or Mac OS X, download the correct version for your OS and install it.


Figure 2-1. Downloading the Anaconda package
Installing the downloaded file is as simple as double-clicking the file and letting the installer take care of
the entire process. To check if the installation was successful, just open a command prompt or terminal and
start up Python. You should be greeted with the message shown in Figure 2-2 identifying the Python and the
Anaconda version. We also recommend that you use the IPython shell (the command is ipython) instead of the regular Python shell, because you get a lot of extra features, including inline plots, autocomplete, and so on.

Figure 2-2. Verifying installation with the Python shell
This should complete the process of setting up your Python environment for Data Science and
Machine Learning.


Installing Libraries
We will not be covering the basics of Python, as we assume you are already acquainted with basic Python
syntax. Feel free to check out any standard course or book on Python programming to pick up on the basics.
We will cover one very basic but very important aspect of installing additional libraries. In Python the
preferred way to install additional libraries is using the pip installer. The basic syntax to install a package
from Python Package Index (PyPI) using pip is as follows.
pip install required_package
This will install the required_package if it is present in PyPI. We can also use sources other than PyPI to install packages, but that is generally not required. The Anaconda distribution already comes with a plethora of additional libraries, hence it is very unlikely that we will need additional packages from other sources.
Another way to install packages, limited to Anaconda, is to use the conda install command. This will
install the packages from the Anaconda package channels and usually we recommend using this, especially
on Windows.

Why Python for Data Science?
According to a 2017 survey by StackOverflow (https://insights.stackoverflow.com/survey/2017),
Python is the world's fifth most used language. It is one of the top three languages used by data scientists and one
of the most "wanted" languages among StackOverflow users. In fact, in a 2017 poll by KDnuggets,
Python got the maximum number of votes as the leading platform for Analytics, Data Science, and
Machine Learning based on the choice of users (http://www.kdnuggets.com/2017/08/python-overtakes-r-leader-analytics-data-science.html). Python has a lot of advantages that make it a language of
choice when it comes to the practice of Data Science. We will now try to illustrate these advantages and
argue our case for why Python is a language of choice for data scientists.

Powerful Set of Packages
Python is known for its extensive and powerful set of packages. In fact, one of the philosophies shared by
Python is batteries included, meaning that Python ships with a rich and powerful set of packages ready to be
used in a wide variety of domains and use cases. This philosophy extends to the packages required
for Data Science and Machine Learning. Packages like numpy, scipy, pandas, and scikit-learn are
tailor-made for solving a variety of real-world Data Science problems and are immensely powerful. This
makes Python a go-to language for solving Data Science related problems.

Easy and Rapid Prototyping
Python's simplicity is another important aspect when discussing its suitability for Data Science.
Python syntax is easy to understand as well as idiomatic, which makes comprehending existing code a
relatively simple task. This allows developers to easily modify existing implementations and develop
their own. This is especially useful for developing new algorithms that may be experimental
or not yet supported by any external library. As discussed earlier, Python development is
free of time-consuming build and link processes. Using the REPL shell, IDEs, and notebooks, you
can rapidly build and iterate over multiple research and development cycles, and all changes can be
readily made and tested.

Easy to Collaborate
Data Science solutions are rarely a one-person job; a lot of collaboration is usually required within a
Data Science team to develop a great analytical solution. Luckily, Python provides tools that make it
extremely easy for a diverse team to collaborate. One of the best liked features that empowers this
collaboration is Jupyter notebooks. Notebooks allow data scientists to share code, data, and
insightful results in a single place, which makes for an easily reproducible research tool. We consider
this a very important feature and will devote an entire section to the advantages offered by the use of
notebooks.

One-Stop Solution
In the first chapter we explored how Data Science as a field is interconnected to various domains. A typical
project will have an iterative lifecycle that will involve data extraction, data manipulation, data analysis,
feature engineering, modeling, evaluation, solution development, deployment, and continued updating
of the solution. Python as a multi-purpose programming language is extremely diverse and it allows
developers to address all these assorted operations from a common platform. Using Python libraries you
can consume data from a multitude of sources, apply different data wrangling operations to that data, apply
Machine Learning algorithms on the processed data, and deploy the developed solution. This makes Python
extremely useful, as no extra interfacing is required, i.e., you don't need to port any part of the pipeline
to a different programming language. Enterprise-level Data Science projects also often require interfacing
with different programming languages, which is achievable with Python as well. For example, suppose an
enterprise uses a custom-made Java library for some esoteric data manipulation; you can then use the Jython
implementation of Python to call that Java library without writing a custom interfacing layer.

Large and Active Community Support
The Python developer community is huge and very active. This large community ensures that the
core Python language and packages remain efficient and bug free, and a developer can seek support
for a Python issue on a variety of platforms like the Python mailing list, Stack Overflow, blogs, and
Usenet groups. This large support ecosystem is another reason Python is a favored language
for Data Science.

Introducing the Python Machine Learning Ecosystem
In this section, we address the important components of the Python Machine Learning ecosystem and give
a small introduction to each of them. These components are a few of the reasons why Python is an important
language for Data Science. This section is structured to give you a gentle introduction and acquaint you
with these core Data Science libraries. Covering all of them in depth would be impractical and beyond the
current scope, since we will be using them in detail in subsequent chapters. Another advantage of having a
great community of Python developers is the rich content that can be found about each of these libraries
with a simple search. The list of components we cover is by no means exhaustive, but we have shortlisted
them on the basis of their importance in the whole ecosystem.

Jupyter Notebooks
Jupyter notebooks, formerly known as IPython notebooks, are an interactive computational environment
that can be used to develop Python based Data Science analyses with an emphasis on reproducible
research. The interactive environment is great for development and enables us to easily share the notebook

and hence the code among peers, who can replicate our research and analyses by themselves. Jupyter
notebooks can contain code, text, images, output, etc., arranged so as to give a complete step-by-step
illustration of the whole analysis process. This capability makes notebooks a valuable tool for
reproducible analyses and research, especially when you want to share your work with a peer. While
developing your analyses, you can document your thought process and capture the results as part of the
notebook. This seamless intertwining of documentation, code, and results makes Jupyter notebooks a
valuable tool for every data scientist.
We will be using Jupyter notebooks, which are installed by default with the Anaconda distribution. The
notebook is similar to the IPython shell, with the difference that it can be used with different programming
backends, i.e., not just Python. The functionality is similar for both, with the added advantage that
notebooks can display interactive visualizations and much more.

Installation and Execution
We don't require any additional installation for Jupyter notebooks, as they are already installed by the
Anaconda distribution. We can invoke a Jupyter notebook by executing the following command at the
command prompt or terminal.
C:\>jupyter notebook
This starts a notebook server at the address localhost:8888 of your machine. An important point to
note here is that you access the notebook through a browser, so you can even start it on a remote server and
use it locally using techniques like SSH tunneling. This is extremely useful when you have a powerful
computing resource that you can only access remotely but that lacks a GUI; a Jupyter notebook lets you
use those resources through a visually interactive shell. Once you invoke this command, navigate to the
address localhost:8888 in your browser to find the landing page depicted in Figure 2-3, which can be used
to access existing notebooks or create new ones.

Figure 2-3. Jupyter notebook landing page
On the landing page we can initiate a new notebook by clicking the New button on top right. By default
it will use the default kernel (i.e., the Python 3.5 kernel) but we can also associate the notebook with a

different kernel (for example, a Python 2.7 kernel, if installed on your system). A notebook is just a collection
of cells. There are three major types of cells in a notebook:
1. Code cells: Just like the name suggests, these are the cells you use to write your
code and associated comments. The contents of these cells are sent to the
kernel associated with the notebook and the computed outputs are displayed as
the cells' outputs.
2. Markdown cells: Markdown can be used to intelligently annotate the computation
process. These cells can contain simple text comments, HTML tags, images, and even
LaTeX equations. They come in very handy when we are dealing with a new,
non-standard algorithm and want to capture the stepwise math and logic
related to the algorithm.
3. Raw cells: These are the simplest of the cells and display the text written in
them as is. They can be used to add text that you don't want to be converted by
the conversion mechanism of the notebooks.

In Figure 2-4 we see a sample jupyter notebook, which touches on the ideas we just discussed in this section.

Figure 2-4. Sample jupyter notebook

NumPy
Numpy is the backbone of Machine Learning in Python. It is one of the most important libraries in Python
for numerical computations. It adds support to core Python for multi-dimensional arrays (and matrices) and
fast vectorized operations on these arrays. The present day NumPy library is a successor of an early library,
Numeric, which was created by Jim Hugunin and other developers. Travis Oliphant, Anaconda's
president and co-founder, took the Numeric library as a base and added a lot of modifications to launch the
present-day NumPy library in 2005. It is a major open source project and one of the most popular Python
libraries, used in almost all Machine Learning and scientific computing libraries. The extent of NumPy's
popularity is verified by the fact that major OS distributions, like Linux and macOS, bundle it as a
default package instead of treating it as an add-on.

Numpy ndarray
All of the numeric functionality of numpy is orchestrated by two important constituents of the numpy package:
the ndarray and Ufuncs (universal functions). The numpy ndarray is a multi-dimensional array object which is the
core data container for all numpy operations. Universal functions are functions that operate on
ndarrays in an element-by-element fashion. They are the lesser known members of the numpy package, and
we give a brief introduction to them later in this section. We will mostly be learning
about ndarrays in the subsequent sections. (We will refer to them as arrays from now on for simplicity's sake.)
Arrays (or matrices) are one of the fundamental representations of data. Mostly, an array will hold a
single data type (it is homogeneous) and will possibly be multi-dimensional. The numpy ndarray is a
generalization of the same. Let's get started by creating an array.
In [4]: import numpy as np
   ...: arr = np.array([1,3,4,5,6])
   ...: arr
Out[4]: array([1, 3, 4, 5, 6])
In [5]: arr.shape
Out[5]: (5,)
In [6]: arr.dtype
Out[6]: dtype('int32')
In the previous example, we created a one-dimensional array from a normal list containing integers.
The shape attribute of the array object tells us the dimensions of the array. The data type was
picked up from the elements: as they were all integers, the data type is int32. One important thing to keep
in mind is that all the elements in an array must have the same data type. If you try to initialize an array with
mixed elements, i.e., you mix some strings with the numbers, then all of the elements get
converted into a string type and we won't be able to perform most of the numpy operations on that array. So a
simple rule of thumb is to deal only with numeric data. You are encouraged to type the following code into
an IPython shell and look at the resulting data type in such a scenario!
In [16]: arr = np.array([1,'st','er',3])
    ...: arr.dtype
Out[16]: dtype('<U11')
In [6]: city_data[city_data > 0]
Out[6]:
array([ 1.78780089,  1.41016167,  0.4912639 ,  1.48625779,  0.62758167,
        0.77321756])
Notice that the shape of the array is not maintained, so we cannot always use this indexing method
directly. But it is quite useful for conditional data substitution. Suppose that in the previous case
we want to substitute all the positive values with 0. We can achieve that with the following code.
In [7]: city_data[city_data >0] = 0
   ...: city_data
Out[7]:
array([[ 0.        , -0.25099029, -0.26002244],
       [ 0.        , -0.43878679,  0.        ],
       [-0.32176723, -0.01912549, -1.22891881],
       [-0.93371835, -0.03604015, -0.37319556],
       [ 0.        ,  0.        ,  0.        ]])
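The in-place substitution above overwrites city_data. A non-destructive alternative, sketched below on a freshly seeded random array (the seed and variable names are our own, not from the text), is numpy's where function, which builds a new array and preserves the original array and its shape:

```python
import numpy as np

# A seeded random array standing in for city_data (seed chosen arbitrarily).
rng = np.random.RandomState(42)
city_data = rng.randn(5, 3)

# np.where(condition, value_if_true, value_if_false) builds a NEW array:
# positive entries become 0, everything else is copied over unchanged.
clipped = np.where(city_data > 0, 0, city_data)

print(clipped.shape)          # the (5, 3) shape is preserved
print((clipped <= 0).all())   # True: no positive values remain
print((city_data > 0).any())  # True: the original array is untouched
```

This keeps the original array available for further analysis, whereas the in-place assignment destroys it.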

Operations on Arrays
At the start of this section, we mentioned the concept of universal functions (Ufuncs). In this sub-section,
we learn some of the functionality they provide. Most operations on numpy arrays
are performed using these functions. Numpy provides a rich set of functions that we can leverage for various
operations on arrays. We cover some of them briefly, but we recommend always referring to
the official documentation of the project to learn more and leverage them in your own projects.
Universal functions operate on arrays in an element-by-element fashion. The
implementation of Ufuncs is vectorized, which means that executing Ufuncs on arrays is quite fast. The
Ufuncs implemented in the numpy package are written in compiled C code for speed and efficiency,
but it is also possible to create custom Ufunc-like functions of your own.
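As a concrete sketch of creating a custom element-wise function, numpy's frompyfunc can wrap any plain Python function into a Ufunc-like object (the Collatz example below is our own illustration, not from the text; note that frompyfunc returns arrays of dtype object):

```python
import numpy as np

def collatz_step(n):
    # One step of the Collatz map: halve evens, 3n + 1 for odds.
    return n // 2 if n % 2 == 0 else 3 * n + 1

# Wrap the scalar function into a Ufunc-like object: 1 input, 1 output.
collatz_ufunc = np.frompyfunc(collatz_step, 1, 1)

arr = np.array([1, 2, 3, 4, 5])
print(collatz_ufunc(arr))   # [4 1 10 2 16], applied element by element
```

Such wrapped functions run at Python speed per element, so they trade the performance of compiled Ufuncs for flexibility.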

Ufuncs are simple and easy to understand once you relate the output they produce to a
particular array.
In [23]: arr = np.arange(15).reshape(3,5)
    ...: arr
    ...:
Out[23]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
In [24]: arr + 5
Out[24]:
array([[ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
In [25]: arr * 2
Out[25]:
array([[ 0,  2,  4,  6,  8],
       [10, 12, 14, 16, 18],
       [20, 22, 24, 26, 28]])
We see that the standard operators, when used in conjunction with arrays, work element-wise. Some
Ufuncs take two arrays as input and output a single array, while a few even output two arrays.
In [29]: arr1 = np.arange(15).reshape(5,3)
    ...: arr2 = np.arange(5).reshape(5,1)
    ...: arr2 + arr1
Out[29]:
array([[ 0,  1,  2],
       [ 4,  5,  6],
       [ 8,  9, 10],
       [12, 13, 14],
       [16, 17, 18]])
In [30]: arr1
Out[30]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])
In [31]: arr2
Out[31]:

array([[0],
       [1],
       [2],
       [3],
       [4]])
Here we see that we were able to add two arrays even though they were of different sizes, thanks to
the concept of broadcasting. We will conclude this brief discussion of operations on arrays by
demonstrating a function that returns two arrays.
In [32]: arr1 = np.random.randn(5,3)
    ...: arr1
Out[32]:
array([[-0.57863219, -0.36613451, -0.92311378],
       [ 0.81557068,  0.20486617, -0.16740779],
       [ 0.73806067,  1.30173294,  0.6144705 ],
       [ 0.26294157, -0.09300711,  1.1794524 ],
       [ 0.25011242, -0.65374314, -0.57663904]])
In [35]: np.modf(arr1)
Out[35]:
(array([[-0.57863219, -0.36613451, -0.92311378],
        [ 0.81557068,  0.20486617, -0.16740779],
        [ 0.73806067,  0.30173294,  0.6144705 ],
        [ 0.26294157, -0.09300711,  0.1794524 ],
        [ 0.25011242, -0.65374314, -0.57663904]]),
array([[-0., -0., -0.],
        [ 0.,  0., -0.],
        [ 0.,  1.,  0.],
        [ 0., -0.,  1.],
        [ 0., -0., -0.]]))
The function modf returns the fractional and the integral parts of the input supplied to it, and hence
returns two arrays of the same size. We have tried to give you a basic idea of the operations on arrays
provided by the numpy package, but this list is not exhaustive; for the complete list, refer to the reference
page for Ufuncs at https://docs.scipy.org/doc/numpy/reference/ufuncs.html.
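Beyond element-wise application, every binary Ufunc also carries aggregation methods such as reduce and accumulate; a brief sketch (our own example, not from the text):

```python
import numpy as np

arr = np.arange(1, 6)               # [1 2 3 4 5]

# reduce folds the Ufunc over the array; accumulate keeps running results.
print(np.add.reduce(arr))           # 15, equivalent to arr.sum()
print(np.add.accumulate(arr))       # [ 1  3  6 10 15], the running sum
print(np.multiply.reduce(arr))      # 120, i.e. 5 factorial
```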

Linear Algebra Using numpy
Linear algebra is an integral part of Machine Learning; most of the algorithms we deal
with can be concisely expressed using its operations. Numpy was initially built to provide
functions similar to MATLAB, and hence linear algebra functions on arrays have always been an important
part of it. In this section, we learn a bit about performing linear algebra on ndarrays using the functions
implemented in the numpy package.
One of the most widely used operations in linear algebra is the dot product. It can be performed on
two compatible ndarrays (brush up on your matrix skills if you need a reminder of which arrays are
compatible for a dot product) by using the dot function.
In [39]: A = np.array([[1,2,3],[4,5,6],[7,8,9]])
    ...: B = np.array([[9,8,7],[6,5,4],[1,2,3]])

In [40]: A.dot(B)
Out[40]:
array([[ 24,  24,  24],
       [ 72,  69,  66],
       [120, 114, 108]])
Similarly, there are functions implemented for finding different products of matrices, like inner, outer, and so
on. Another popular matrix operation is the transpose of a matrix, which is easily obtained using the T attribute.
In [41]: A = np.arange(15).reshape(3,5)
In [46]: A.T
Out[46]:
array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])
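The inner and outer products mentioned above can be sketched as follows (our own small example):

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

# Inner product of two vectors: the sum of element-wise products.
print(np.inner(u, v))    # 1*4 + 2*5 + 3*6 = 32

# Outer product: every pairwise product, giving a 3x3 matrix here.
print(np.outer(u, v))

# Transposing twice via the T attribute recovers the original array.
A = np.arange(15).reshape(3, 5)
print(np.array_equal(A.T.T, A))   # True
```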

Oftentimes, we need to decompose a matrix into its constituent factors; this is called
matrix factorization and can be achieved with the appropriate functions. A popular matrix factorization
method is SVD (covered briefly in Chapter 1), which decomposes a
matrix into three different matrices. This can be done using the linalg.svd function.
In [48]: np.linalg.svd(A)
Out[48]:
(array([[-0.15425367,  0.89974393,  0.40824829],
        [-0.50248417,  0.28432901, -0.81649658],
        [-0.85071468, -0.3310859 ,  0.40824829]]),
array([  3.17420265e+01,   2.72832424e+00,   4.58204637e-16]),
array([[-0.34716018, -0.39465093, -0.44214167, -0.48963242, -0.53712316],
        [-0.69244481, -0.37980343, -0.06716206,  0.24547932,  0.55812069],
        [ 0.33717486, -0.77044776,  0.28661392,  0.38941603, -0.24275704],
        [-0.36583339,  0.32092943, -0.08854543,  0.67763613, -0.54418674],
        [-0.39048565,  0.05843412,  0.8426222 , -0.29860414, -0.21196653]]))
Linear algebra is often also used to solve a system of equations. Using the matrix notation of system of
equations and the provided function of numpy, we can easily solve such a system of equation. Consider the
system of equations:
                          7x + 5y -3z = 16
                          3x - 5y + 2z = -8
                          5x + 3y - 7z = 0
This can be represented using two arrays: the coefficient matrix (a in the example) and the constants
vector (b in the example).
In [51]: a = np.array([[7,5,-3], [3,-5,2],[5,3,-7]])
    ...: b = np.array([16,-8,0])
    ...: x = np.linalg.solve(a, b)
    ...: x
Out[51]: array([ 1.,  3.,  2.])

We can also check if the solution is correct using the np.allclose function.
In [52]: np.allclose(np.dot(a, x), b)
Out[52]: True
Similarly, there are functions for finding the inverse of a matrix, the eigenvectors and eigenvalues of
a matrix, the norm of a matrix, the determinant of a matrix, and so on, some of which we covered in detail in
Chapter 1. Take a look at the details of the functions implemented at https://docs.scipy.org/doc/numpy/
reference/routines.linalg.html.
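A short sketch of some of those routines, reusing the coefficient matrix from the system of equations above:

```python
import numpy as np

a = np.array([[7, 5, -3],
              [3, -5, 2],
              [5, 3, -7]])

# Determinant: non-zero, so the matrix is invertible.
det = np.linalg.det(a)
print(round(det))                                 # 256

# Inverse: a times its inverse gives the identity matrix.
a_inv = np.linalg.inv(a)
print(np.allclose(a.dot(a_inv), np.eye(3)))       # True

# Eigenvalues/eigenvectors: each pair satisfies a @ v = lambda * v.
eigvals, eigvecs = np.linalg.eig(a)
print(np.allclose(a.dot(eigvecs), eigvecs * eigvals))  # True

# Frobenius norm of the matrix.
print(round(np.linalg.norm(a), 4))
```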

Pandas
Pandas is an important Python library for data manipulation, wrangling, and analysis. It functions as an
intuitive and easy-to-use set of tools for performing operations on any kind of data. Initial work for pandas
was done by Wes McKinney in 2008 while he was a developer at AQR Capital Management. Since then,
the scope of the pandas project has increased a lot and it has become a popular library of choice for data
scientists all over the world. Pandas allows you to work with both cross-sectional data and time series based
data. So let’s get started exploring pandas!

Data Structures of Pandas
All the data representation in pandas is done using two primary data structures:
•	Series
•	Dataframes

Series
A Series in pandas is a one-dimensional ndarray with axis labels, meaning that in functionality it is
quite similar to a simple array. The values in a series have an index, whose labels need to be hashable;
this requirement matters when we perform manipulation and summarization on data contained in a series
data structure. Series objects can also be used to represent time series data, in which case the index is a
datetime object.
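A minimal sketch of both flavors of Series (the labels and values below are our own illustrative picks based on the world cities data):

```python
import pandas as pd

# A Series with an explicit, hashable label index.
pop = pd.Series([2997, 15000, 201546],
                index=['Qal eh-ye Now', 'Chaghcharan', 'Lashkar Gah'],
                name='pop')
print(pop['Lashkar Gah'])               # 201546, looked up by label

# A Series with a datetime index, i.e. a simple time series.
ts = pd.Series([10, 12, 9],
               index=pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-03']))
print(ts[pd.Timestamp('2017-01-02')])   # 12
```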

Dataframe
A dataframe is the most important and useful data structure, used for almost all kinds of data
representation and manipulation in pandas. Unlike numpy arrays (in general), a dataframe can contain
heterogeneous data. Typically, tabular data is represented using dataframes, analogous to an Excel
sheet or a SQL table. This is extremely useful for representing raw datasets as well as processed feature sets
in Machine Learning and Data Science. All operations can be performed along the axes, rows, and
columns of a dataframe. This will be the primary data structure we leverage in most of the use
cases in later chapters.

Data Retrieval
Pandas provides numerous ways to retrieve and read in data. We can convert data from CSV files, databases,
flat files, and so on into dataframes. We can also convert a list of dictionaries (Python dict) into a dataframe.
The data sources pandas can handle cover almost all major sources. For our
introduction, we will cover three of the most important ones:
•	List of dictionaries
•	CSV files
•	Databases

List of Dictionaries to Dataframe
This is one of the simplest methods of creating a dataframe. It is useful in scenarios where we arrive at the
data we want to analyze after performing computations and manipulations on the raw data, and it allows us
to integrate a pandas based analysis into data being generated by other Python processing pipelines.
In[27]: import pandas as pd
In[28]: d =  [{'city':'Delhi',"data":1000},
   ...:       {'city':'Bangalore',"data":2000},
   ...:       {'city':'Mumbai',"data":1000}]
In[29]: pd.DataFrame(d)
Out[29]:
        city  data
0      Delhi  1000
1  Bangalore  2000
2     Mumbai  1000
In[30]: df = pd.DataFrame(d)
In[31]: df
Out[31]:
        city  data
0      Delhi  1000
1  Bangalore  2000
2     Mumbai  1000
Here we provided a list of Python dictionaries to the DataFrame class of the pandas library, and the
list was converted into a DataFrame. Two important things to note: first, the keys of the dictionaries
are picked up as the column names in the dataframe (we can also supply other names as arguments
for different column names); second, we didn't supply an index, so the dataframe picked up the default
integer index of normal arrays.
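For instance, passing the columns and index arguments (a sketch; the labels 'a', 'b', 'c' and the renamed column are arbitrary choices of ours) lets us reorder the columns and replace the default index:

```python
import pandas as pd

d = [{'city': 'Delhi', 'data': 1000},
     {'city': 'Bangalore', 'data': 2000},
     {'city': 'Mumbai', 'data': 1000}]

# columns reorders (or subsets) the columns; index replaces the default 0..n-1.
df = pd.DataFrame(d, columns=['data', 'city'], index=['a', 'b', 'c'])
print(df)

# Columns can also be renamed after construction.
df = df.rename(columns={'data': 'value'})
print(list(df.columns))   # ['value', 'city']
```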

CSV Files to Dataframe
CSV (Comma Separated Files) files are perhaps one of the most widely used ways of creating a
dataframe. We can easily read in a CSV, or any delimited file (like TSV), using pandas and convert
into a dataframe. For our example we will read in the following file and convert into a dataframe by using
Python. The data in Figure 2-5 is a sample slice of a CSV file containing the data of cities of the world from
http://simplemaps.com/data/world-cities. We will use the same data in a later part of this chapter also.

Figure 2-5. A sample CSV file

We can convert this file into a dataframe with the help of the following code leveraging pandas.
In [1]: import pandas as pd
In [2]: city_data = pd.read_csv(filepath_or_buffer='simplemaps-worldcities-basic.csv')
In [3]: city_data.head(n=10)
Out[3]:
             city      city_ascii        lat        lng     pop      country  \
0   Qal eh-ye Now       Qal eh-ye  34.983000  63.133300    2997  Afghanistan
1     Chaghcharan     Chaghcharan  34.516701  65.250001   15000  Afghanistan
2     Lashkar Gah     Lashkar Gah  31.582998  64.360000  201546  Afghanistan
3          Zaranj          Zaranj  31.112001  61.886998   49851  Afghanistan
4      Tarin Kowt      Tarin Kowt  32.633298  65.866699   10000  Afghanistan
5    Zareh Sharan    Zareh Sharan  32.850000  68.416705   13737  Afghanistan
6        Asadabad        Asadabad  34.866000  71.150005   48400  Afghanistan
7         Taloqan         Taloqan  36.729999  69.540004   64256  Afghanistan
8  Mahmud-E Eraqi  Mahmud-E Eraqi  35.016696  69.333301    7407  Afghanistan
9      Mehtar Lam      Mehtar Lam  34.650000  70.166701   17345  Afghanistan

  iso2 iso3 province
0   AF  AFG  Badghis
1   AF  AFG     Ghor
2   AF  AFG  Hilmand
3   AF  AFG   Nimroz
4   AF  AFG  Uruzgan
5   AF  AFG  Paktika
6   AF  AFG    Kunar
7   AF  AFG   Takhar
8   AF  AFG   Kapisa
9   AF  AFG  Laghman

As the file we supplied had a header included, those values were used as the names of the columns in
the resultant dataframe. This is a very basic yet core usage of the function pandas.read_csv. The function
comes with a multitude of parameters that can be used to modify its behavior as required. We will not cover
the entire gamut of parameters available, and you are encouraged to read the documentation of this function,
as it is one of the starting points of most Python based data analyses.
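As a small sketch of a few commonly tweaked parameters (sep, usecols, and nrows are actual read_csv parameters; the in-memory data below is made up for illustration, where a real file path would normally go):

```python
import io
import pandas as pd

# An in-memory buffer standing in for a semicolon-delimited file on disk.
csv_data = io.StringIO(
    "city;pop;country\n"
    "Delhi;11000000;India\n"
    "Mumbai;12400000;India\n"
    "Tokyo;13500000;Japan\n"
)

# sep sets the delimiter, usecols keeps a subset of the columns,
# nrows reads only the first n data rows.
df = pd.read_csv(csv_data, sep=';', usecols=['city', 'pop'], nrows=2)
print(df)
print(df.shape)   # (2, 2)
```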

Databases to Dataframe
The most important data source for data scientists is often the existing data of their organizations.
Relational databases (DBs) and data warehouses are the de facto standard of data storage in almost all
organizations. Pandas provides the capability to connect to these databases directly, execute queries on them
to extract data, and then convert the result of the query into a structured dataframe. The pandas.read_sql
function, combined with Python's powerful database libraries, means that the task of getting data from DBs
is simple and easy, and no intermediate steps of data extraction are required. We will now take
an example of reading data from a Microsoft SQL Server database. The following code achieves this task.
import pymssql
import pandas as pd

server = 'xxxxxxxx' # Address of the database server
user = 'xxxxxx'     # Username for the database server
password = 'xxxxx'  # Password for the above user
database = 'xxxxx'  # Database in which the table is present
conn = pymssql.connect(server=server, user=user, password=password, database=database)
query = "select * from some_table"
df = pd.read_sql(query, conn)
The important thing to note here is the connection object (conn in the code), which identifies the
database server information and the type of database to pandas. Based on the endpoint database server,
we change the connection object; for example, we are using the pymssql library to access a Microsoft SQL
Server here. If our data source changed to a Postgres database, the connection object would change but the
rest of the procedure would remain similar. This facility is really handy when we need to perform similar
analyses on data originating from different sources. Once again, the read_sql function of pandas provides a
lot of parameters that allow us to control its behavior. We also recommend checking out the sqlalchemy
library, which makes creating connection objects easier irrespective of the database vendor and also
provides a lot of other utilities.
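A self-contained sketch of the same flow, using Python's built-in sqlite3 module as a stand-in for an enterprise database (the table name and rows are made up); only the connection object would differ for SQL Server or Postgres:

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database replaces the remote database server.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE some_table (city TEXT, pop INTEGER)")
conn.executemany("INSERT INTO some_table VALUES (?, ?)",
                 [('Delhi', 11000000), ('Mumbai', 12400000)])
conn.commit()

# The read_sql call is identical regardless of the backing database.
df = pd.read_sql("SELECT * FROM some_table", conn)
print(df)
print(df.shape)   # (2, 2)
conn.close()
```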

Data Access
After reading in our data, the next important part is accessing it using the data structures' access
mechanisms. Accessing data in pandas dataframe and series objects is very similar to the
access mechanisms that exist for Python lists or numpy arrays, but they also offer some extra methods for
data access specific to dataframes and series.

Head and Tail
In the previous section, we used the method head, which gives us the first few rows (by default, five) of
the data. The corresponding function tail gives us the last few rows of the dataframe. These are among
the most widely used pandas functions, as we often need to peek at our data as we apply
different operations and selections to it. We have already seen the output of head, so we'll use the tail
function on the same dataframe and look at its output.
In [11]: city_data.tail()
Out[11]:
             city   city_ascii        lat        lng        pop   country  \
7317       Mutare       Mutare -18.970019  32.650038   216785.0  Zimbabwe
7318       Kadoma       Kadoma -18.330006  29.909947    56400.0  Zimbabwe
7319  Chitungwiza  Chitungwiza -18.000001  31.100003   331071.0  Zimbabwe
7320       Harare       Harare -17.817790  31.044709  1557406.5  Zimbabwe
7321     Bulawayo     Bulawayo -20.169998  28.580002   697096.0  Zimbabwe

     iso2 iso3          province
7317   ZW  ZWE        Manicaland
7318   ZW  ZWE  Mashonaland West
7319   ZW  ZWE            Harare
7320   ZW  ZWE            Harare
7321   ZW  ZWE          Bulawayo

Slicing and Dicing
The usual rules of slicing and dicing data that we used in Python lists apply to the Series object as well.
In [12]: series_es = city_data.lat
In [13]: type(series_es)
Out[13]: pandas.core.series.Series
In [14]: series_es[1:10:2]
Out[14]:
1 34.516701
3 31.112001
5 32.850000
7 36.729999
9 34.650000
Name: lat, dtype: float64
In [15]: series_es[:7]
Out[15]:
0    34.983000
1    34.516701
2    31.582998
3    31.112001
4    32.633298
5    32.850000
6    34.866000
Name: lat, dtype: float64
In [23]: series_es[:-7315]
Out[23]:
0    34.983000
1    34.516701
2    31.582998
3    31.112001
4    32.633298
5    32.850000
6    34.866000
Name: lat, dtype: float64

The examples given here are self-explanatory; you can refer to the numpy section for more details.
Similar slicing rules apply to dataframes, with the only difference being that simple slicing refers to
slicing rows, and all the columns end up in the result. Consider the following example.
In [24]: city_data[:7]
Out[24]:
            city    city_ascii        lat        lng     pop      country  \
0  Qal eh-ye Now     Qal eh-ye  34.983000  63.133300    2997  Afghanistan
1    Chaghcharan   Chaghcharan  34.516701  65.250001   15000  Afghanistan
2    Lashkar Gah   Lashkar Gah  31.582998  64.360000  201546  Afghanistan
3         Zaranj        Zaranj  31.112001  61.886998   49851  Afghanistan
4     Tarin Kowt    Tarin Kowt  32.633298  65.866699   10000  Afghanistan
5   Zareh Sharan  Zareh Sharan  32.850000  68.416705   13737  Afghanistan
6       Asadabad      Asadabad  34.866000  71.150005   48400  Afghanistan

  iso2 iso3 province
0   AF  AFG  Badghis
1   AF  AFG     Ghor
2   AF  AFG  Hilmand
3   AF  AFG   Nimroz
4   AF  AFG  Uruzgan
5   AF  AFG  Paktika
6   AF  AFG    Kunar

To provide access to specific rows and specific columns, pandas offers the useful indexers loc and
iloc, which can be used to refer to specific rows and columns in a dataframe. (There is also the ix indexer,
but we recommend using either loc or iloc.) The following example leverages iloc,
which allows us to select rows and columns using a structure similar to array slicing. In the
example, we pick up only the first five rows and the first four columns.
In [28]: city_data.iloc[:5,:4]
Out[28]:
city city_ascii lat lng
0 Qal eh-ye Now Qal eh-ye 34.983000 63.133300
1 Chaghcharan Chaghcharan 34.516701 65.250001
2 Lashkar Gah Lashkar Gah 31.582998 64.360000
3 Zaranj Zaranj 31.112001 61.886998
4 Tarin Kowt Tarin Kowt 32.633298 65.866699
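As a companion to iloc, label-based selection with loc works similarly, except that both endpoints of a label slice are included. Here is a minimal sketch using a small stand-in for city_data (only a few illustrative rows, not the full dataset from the text):

```python
import pandas as pd

# A small stand-in for the city_data dataframe used in the text
city_data = pd.DataFrame({
    'city': ['Qal eh-ye Now', 'Chaghcharan', 'Lashkar Gah'],
    'lat': [34.983000, 34.516701, 31.582998],
    'lng': [63.133300, 65.250001, 64.360000],
})

# loc is label based: both endpoints of the row slice are included
subset = city_data.loc[0:2, ['city', 'lat']]
print(subset.shape)
```

loc becomes especially handy when the dataframe has a non-integer index, where positional slicing would be ambiguous.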
Another access mechanism is Boolean-based access to the dataframe rows or columns. This is
particularly important for dataframes, as it allows us to work with a specific set of rows and columns. Let’s
consider the following example, in which we want to select cities that have a population of more than 10
million and select columns that start with the letter l:
In [56]: city_data[city_data['pop'] >  
                  10000000][city_data.columns[pd.Series(city_data.columns).str.
startswith('l')]]
Out[53]:
lat lng


360 -34.602502 -58.397531
1171 -23.558680 -46.625020
2068 31.216452 121.436505
3098 28.669993 77.230004
3110 19.016990 72.856989
3492 35.685017 139.751407
4074 19.442442 -99.130988
4513 24.869992 66.990009
5394 55.752164 37.615523
6124 41.104996 29.010002
7071 40.749979 -73.980017
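The same selection can also be written in a single loc call, which is generally preferred over the chained indexing used above. A minimal sketch with stand-in data (a couple of illustrative rows, not the full cities dataset):

```python
import pandas as pd

# A couple of illustrative rows standing in for the full city_data dataframe
city_data = pd.DataFrame({
    'city': ['Tokyo', 'Groningen'],
    'lat': [35.685017, 53.220407],
    'lng': [139.751407, 6.580001],
    'pop': [22006299, 198941],
})

# Boolean row mask and column selection expressed in one loc call
l_columns = [col for col in city_data.columns if col.startswith('l')]
big_cities = city_data.loc[city_data['pop'] > 10000000, l_columns]
```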
When we select data based on some condition, we always get the part of the dataframe that satisfies the
condition supplied. Sometimes we want to test a condition against a dataframe but preserve the
shape of the dataframe. In these cases, we can use the where function (check out numpy's where function
also to see the analogy!). We’ll illustrate this function with an example in which we try to select all the
cities that have a population greater than 15 million.
In [6]: city_greater_10mil = city_data[city_data['pop'] > 10000000]
In [23]: city_greater_10mil.where(city_greater_10mil.population > 15000000)
Out[23]:
city city_ascii lat lng population country iso2 iso3 \
360 NaN NaN NaN NaN NaN NaN NaN NaN
1171 NaN NaN NaN NaN NaN NaN NaN NaN
2068 NaN NaN NaN NaN NaN NaN NaN NaN
3098 NaN NaN NaN NaN NaN NaN NaN NaN
3110 Mumbai Mumbai 19.016990 72.856989 15834918.0 India IN IND
3492 Tokyo Tokyo 35.685017 139.751407 22006299.5 Japan JP JPN
4074 NaN NaN NaN NaN NaN NaN NaN NaN
4513 NaN NaN NaN NaN NaN NaN NaN NaN
5394 NaN NaN NaN NaN NaN NaN NaN NaN
6124 NaN NaN NaN NaN NaN NaN NaN NaN
7071 NaN NaN NaN NaN NaN NaN NaN NaN
province
360 NaN
1171 NaN
2068 NaN
3098 NaN
3110 Maharashtra
3492 Tokyo
4074 NaN
4513 NaN
5394 NaN
6124 NaN
7071 NaN
Here we see that we get the output dataframe of the same size but the rows that don’t conform to the
condition are replaced with NaN.
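To see the analogy with numpy, here is a minimal sketch on a made-up series: pandas' where preserves the shape and masks non-matching entries with NaN, while numpy's where makes an elementwise choice between two alternatives:

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 25, 10, 40])

# pandas where: same shape back, non-matching entries become NaN
masked = s.where(s > 20)

# numpy where: elementwise choice between two alternatives
picked = np.where(s > 20, s, -1)
```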


In this section, we learned some of the core data access mechanisms of pandas dataframes. The data
access mechanisms of pandas are as simple and extensive as those of numpy, which ensures that we have
various ways to access our data.

Data Operations
In subsequent chapters of our book, the pandas dataframe will be our data structure of choice for most data
processing and wrangling operations. So we would like to spend some more time exploring some important
operations that can be performed on dataframes using specific supplied functions.

Values Attribute
Each pandas dataframe has certain attributes. One of the important ones is values. It is important
because it gives us access to the raw values stored in the dataframe, and if they are all homogeneous, i.e., of the
same kind, then we can use numpy operations on them. This becomes important when our data is a mix of
numeric and other data types and, after some selections and computations, we arrive at the required subset
of numeric data. Using the values attribute of the output dataframe, we can treat it in the same way as a
numpy array. This is very useful when working with feature sets in Machine Learning. Traditionally, numpy
vectorized operations are much faster than function-based operations on dataframes.
In [55]:  df = pd.DataFrame(np.random.randn(8, 3),
    ...:   columns=['A', 'B', 'C'])
In [56]: df
Out[56]:
          A         B         C
0 -0.271131  0.084627 -1.707637
1  1.895796  0.590270 -0.505681
2 -0.628760 -1.623905  1.143701
3  0.005082  1.316706 -0.792742
4  0.135748 -0.274006  1.989651
5  1.068555  0.669145  0.128079
6 -0.783522  0.167165 -0.426007
7  0.498378 -0.950698  2.342104
In [58]: nparray = df.values
In [59]: type(nparray)
Out[59]: numpy.ndarray
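To illustrate, the extracted array supports plain numpy vectorized operations directly. A minimal sketch with a small made-up dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})

arr = df.values          # homogeneous columns -> a float64 ndarray
scaled = arr * 10        # plain vectorized numpy arithmetic
col_means = arr.mean(axis=0)
```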

Missing Data and the fillna Function
In real-world datasets, the data is seldom clean and polished. We usually will have a lot of issues with data
quality (missing values, wrong values and so on). One of the most common data quality issues is that of
missing data. Pandas provides us with a convenient function that allows us to handle the missing values of a
dataframe.
For demonstrating the use of the fillna function, we will use the dataframe we created in the previous
example and introduce missing values in it.
In [65]: df.iloc[4,2] = np.nan
In [66]: df
Out[66]:


          A         B         C
0 -0.271131  0.084627 -1.707637
1  1.895796  0.590270 -0.505681
2 -0.628760 -1.623905  1.143701
3  0.005082  1.316706 -0.792742
4  0.135748 -0.274006       NaN
5  1.068555  0.669145  0.128079
6 -0.783522  0.167165 -0.426007
7  0.498378 -0.950698  2.342104
In [70]: df.fillna(0)
Out[70]:
          A         B         C
0 -0.271131  0.084627 -1.707637
1  1.895796  0.590270 -0.505681
2 -0.628760 -1.623905  1.143701
3  0.005082  1.316706 -0.792742
4  0.135748 -0.274006  0.000000
5  1.068555  0.669145  0.128079
6 -0.783522  0.167165 -0.426007
7  0.498378 -0.950698  2.342104
Here we have substituted the missing value with a default value. We can use a variety of methods to
arrive at the substituting value (mean, median, and so on). We will see more methods of missing value
treatment (like imputation) in subsequent chapters.
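For instance, instead of a constant we can pass per-column statistics to fillna. A minimal sketch with a made-up dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4.0, 5.0, np.nan]})

# Substitute each missing value with its own column's mean
filled = df.fillna(df.mean())
```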

Descriptive Statistics Functions
A general practice when dealing with datasets is to learn as much about them as possible. Descriptive statistics
of a dataframe give data scientists a comprehensive look at important information about the attributes
and features in the dataset. Pandas packs a bunch of functions that facilitate easy access to these statistics.
Consider the cities dataframe (city_data) that we consulted in the earlier section. We will use pandas
functions to gather some descriptive statistical information about the attributes of that dataframe. As we
only have three numeric columns in that particular dataframe, we will deal with a subset of the dataframe
that contains only those three columns.
In [76]: columns_numeric = ['lat','lng','pop']
In [78]: city_data[columns_numeric].mean()
Out[78]:
lat        20.662876
lng        10.711914
pop    265463.071633
dtype: float64
In [79]: city_data[columns_numeric].sum()
Out[79]:
lat    1.512936e+05
lng    7.843263e+04
pop    1.943721e+09
dtype: float64


In [80]: city_data[columns_numeric].count()
Out[80]:
lat    7322
lng    7322
pop    7322
dtype: int64
In [81]: city_data[columns_numeric].median()
Out[81]:
lat       26.792730
lng       18.617509
pop    61322.750000
dtype: float64
In [83]: city_data[columns_numeric].quantile(0.8)
Out[83]:
lat        46.852480
lng        89.900018
pop    269210.000000
dtype: float64
All these operations were applied to each of the columns, which is the default behavior. We can also get all these
statistics for each row by using a different axis. This will give us the calculated statistics for each row in the
dataframe.
In [85]: city_data[columns_numeric].sum(axis = 1)
Out[85]:
0       3.095116e+03
1       1.509977e+04
2       2.016419e+05
3       4.994400e+04
4       1.009850e+04
Pandas also provides us with another very handy function called describe. This function will calculate
the most important statistics for numerical data in one go so that we don’t have to use individual functions.
In [86]: city_data[columns_numeric].describe()
Out[86]:
               lat          lng           pop
count  7322.000000  7322.000000  7.322000e+03
mean     20.662876    10.711914  2.654631e+05
std      29.134818    79.044615  8.287622e+05
min     -89.982894  -179.589979 -9.900000e+01
25%      -0.324710   -64.788472  1.734425e+04
50%      26.792730    18.617509  6.132275e+04
75%      43.575448    73.103628  2.001726e+05
max      82.483323   179.383304  2.200630e+07


Concatenating Dataframes
Most Data Science projects will have data from more than one data source. These data sources will mostly
contain data that’s related in some way, and subsequent steps in data analysis will require the data
to be concatenated or joined. Pandas provides a rich set of functions that allow us to merge different
data sources; we cover a small subset of such methods. In this section, we explore and learn about two
methods that can be used to perform all kinds of amalgamations of dataframes.

Concatenating Using the concat Method
The first method to concatenate different dataframes in pandas is by using the concat method. The majority
of the concatenation operations on dataframes will be possible by tweaking the parameters of the concat
method. Let’s look at a couple of examples to understand how the concat method works.
The simplest scenario of concatenating is when we have more than one fragment of the same dataframe
(which may happen if you are reading it from a stream or in chunks). In that case, we can just supply the
constituent dataframes to the concat function as follows.
In [25]: city_data1 = city_data.sample(3)
In [26]: city_data2 = city_data.sample(3)
In [29]: city_data_combine = pd.concat([city_data1,city_data2])
In [30]: city_data_combine
Out[30]:
city city_ascii lat lng pop \
4255 Groningen Groningen 53.220407 6.580001 198941.0
5171 Tambov Tambov 52.730023 41.430019 296207.5
4204 Karibib Karibib -21.939003 15.852996 6898.0
4800 Focsani Focsani 45.696551 27.186547 92636.5
1183 Pleven Pleven 43.423769 24.613371 110445.5
7005 Indianapolis Indianapolis 39.749988 -86.170048 1104641.5
country iso2 iso3 province
4255 Netherlands NL NLD Groningen
5171 Russia RU RUS Tambov
4204 Namibia NaN NAM Erongo
4800 Romania RO ROU Vrancea
1183 Bulgaria BG BGR Pleven
7005 United States of America US USA Indiana
Another common concatenation scenario is when we have information about the columns of the same
dataframe split across different dataframes. Then we can use the concat method again to combine all the
dataframes. Consider the following example.
In [32]: df1 = pd.DataFrame({'col1': ['col10', 'col11', 'col12', 'col13'],
    ...:                     'col2': ['col20', 'col21', 'col22', 'col23'],
    ...:                     'col3': ['col30', 'col31', 'col32', 'col33'],
    ...:                     'col4': ['col40', 'col41', 'col42', 'col43']},
    ...:                    index=[0, 1, 2, 3])

In [33]: df1
Out[33]:
   col1  col2  col3  col4
0 col10 col20 col30 col40
1 col11 col21 col31 col41
2 col12 col22 col32 col42
3 col13 col23 col33 col43
In [34]: df4 = pd.DataFrame({'col2': ['col22', 'col23', 'col26', 'col27'],
    ...:                     'Col4': ['Col42', 'Col43', 'Col46', 'Col47'],
    ...:                     'col6': ['col62', 'col63', 'col66', 'col67']},
    ...:                    index=[2, 3, 6, 7])
In [37]: pd.concat([df1,df4], axis=1)
Out[37]:
   col1  col2  col3  col4  Col4  col2  col6
0 col10 col20 col30 col40   NaN   NaN   NaN
1 col11 col21 col31 col41   NaN   NaN   NaN
2 col12 col22 col32 col42 Col42 col22 col62
3 col13 col23 col33 col43 Col43 col23 col63
6   NaN   NaN   NaN   NaN Col46 col26 col66
7   NaN   NaN   NaN   NaN Col47 col27 col67
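As a small aside, when concatenating fragments it can be useful to keep track of where each row came from; the keys parameter of concat adds an outer index level for this. A minimal sketch with made-up fragments:

```python
import pandas as pd

part1 = pd.DataFrame({'val': [1, 2]})
part2 = pd.DataFrame({'val': [3, 4]})

# keys adds an outer index level identifying each source fragment
combined = pd.concat([part1, part2], keys=['first', 'second'])
```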

Database Style Concatenations Using the merge Command
The most familiar way to concatenate data (for those acquainted with relational databases) is using the
join operation provided by databases. Pandas provides a database-friendly set of join operations for
dataframes. These operations are optimized for high performance and are often the preferred method for
joining disparate dataframes.
Joining by columns: This is the most natural way of joining two dataframes. In this method, we have
two dataframes sharing a common column, and we can join the two dataframes using that column. The
pandas library has a full range of join operations (inner, outer, left, right, etc.), and we will demonstrate
the use of an inner join in this sub-section. You can easily figure out how to do the rest of the join operations by
checking out the pandas documentation.
For this example, we will break our original cities data into two different dataframes, one having the
city information and the other having the country information. Then, we can join them using one of the
shared common columns.
In [51]: country_data = city_data[['iso3','country']].drop_duplicates()
In [52]: country_data.shape
Out[52]: (223, 2)
In [53]: country_data.head()
Out[53]:
iso3 country
0 AFG Afghanistan
33 ALD Aland
34 ALB Albania
60 DZA Algeria
111 ASM American Samoa


In [56]: del(city_data['country'])
In [59]: city_data.merge(country_data, 'inner').head()
Out[59]:
city city_ascii lat lng pop iso2 iso3 \
0 Qal eh-ye Now Qal eh-ye 34.983000 63.133300 2997 AF AFG
1 Chaghcharan Chaghcharan 34.516701 65.250001 15000 AF AFG
2 Lashkar Gah Lashkar Gah 31.582998 64.360000 201546 AF AFG
3 Zaranj Zaranj 31.112001 61.886998 49851 AF AFG
4 Tarin Kowt Tarin Kowt 32.633298 65.866699 10000 AF AFG
province country
0 Badghis Afghanistan
1 Ghor Afghanistan
2 Hilmand Afghanistan
3 Nimroz Afghanistan
4 Uruzgan Afghanistan
Here we had a common column in both the dataframes, iso3, which the merge function was able to
pick up automatically. In the absence of such common names, we can provide the column names
to join on using the on parameter of the merge function. The merge function provides a rich set of
parameters that can be used to change its behavior as and when required. We will leave it to you to discover
more about the merge function by trying out a few examples.
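For instance, the join key and the join type can be named explicitly via the on and how parameters. A minimal sketch with made-up city and country tables (not the full dataset from the text):

```python
import pandas as pd

cities = pd.DataFrame({'city': ['Kabul', 'Tirana'],
                       'iso3': ['AFG', 'ALB']})
countries = pd.DataFrame({'iso3': ['AFG', 'ALB'],
                          'country': ['Afghanistan', 'Albania']})

# Name the join key and join type explicitly
joined = cities.merge(countries, on='iso3', how='inner')
```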

Scikit-learn
Scikit-learn is one of the most important and indispensable Python frameworks for Data Science and
Machine Learning. It implements a wide range of Machine Learning algorithms covering major
areas such as classification, clustering, regression, and so on. All the mainstream Machine
Learning algorithms, like support vector machines, logistic regression, random forests, K-means clustering,
hierarchical clustering, and many more, are implemented efficiently in this library. This
library arguably forms the foundation of applied and practical Machine Learning. Besides this, its easy-to-use API and
code design patterns have been widely adopted across other frameworks too!
The scikit-learn project was initiated as a Google Summer of Code project by David Cournapeau.
The first public release of the library was in late 2010. It is one of the most active Python projects and is
still under active development, with new capabilities and enhancements to existing ones being added constantly.
Scikit-learn is mostly written in Python, but to provide better performance some of the core code is
written in Cython. It also uses wrappers around popular implementations of learning algorithms like logistic
regression (using LIBLINEAR) and support vector machines (using LIBSVM).
In our introduction of scikit-learn we will first go through the basic design principles of the library
and then build on this theoretical knowledge of the package. We will implement some of the algorithms
on sample data to get you acquainted with the basic syntax. We leverage scikit-learn extensively in
subsequent chapters, so the intent here is to acquaint you with how the library is structured and its core
components.


Core APIs
Scikit-learn is an evolving and active project, as witnessed by its GitHub repository statistics. This
framework is built on quite a small and simple list of core API ideas and design patterns. In this section we
will briefly touch on the core APIs on which the central operations of scikit-learn are based.
•	Dataset representation: The data representation of most Machine Learning tasks
is quite similar. Very often we have a collection of data points represented as a stack
of data point vectors: each row in a dataset represents the vector for a specific data
point observation. A data point vector contains multiple independent variables (or
features) and one or more dependent variables (response variables). For example, a
linear regression problem can be represented as [(X1, X2, X3, X4, …, Xn), (Y)], where
the independent variables (features) are represented by the Xs and the dependent
variable (response variable) is represented by Y. The idea is to predict Y by fitting
a model on the features. This data representation resembles a matrix (considering
multiple data point vectors), and a natural way to depict it is by using numpy arrays.
This choice of data representation is simple yet powerful, as it gives us access to the
powerful functionality and efficiency of vectorized numpy array operations. In fact,
recent releases of scikit-learn even accept pandas dataframes as inputs instead of
explicitly needing you to convert them to feature arrays!

•	Estimators: The estimator interface is one of the most important components of
the scikit-learn library. All the Machine Learning algorithms in the package
implement it. Learning happens in two steps. The first step is the initialization
of the estimator object; this involves selecting the appropriate class for the
algorithm and supplying its parameters or hyperparameters. The second step is
applying the fit function to the supplied data (feature set and response variables).
The fit function learns the output parameters of the Machine Learning algorithm
and exposes them as public attributes of the object for easy inspection of the final
model. The data is generally supplied to the fit function as an input-output matrix
pair. In addition to the Machine Learning algorithms, several data transformation
mechanisms (for example, scaling of features, PCA, etc.) are also implemented
using the estimator API. This allows for simple data transformations and a
consistent mechanism for exposing transformations.

•	Predictors: The predictor interface is implemented to generate predictions,
forecasts, etc., using a learned estimator for unknown data. For example, in the case
of a supervised learning problem, the predictor interface will provide predicted
classes for the unknown test array supplied to it. The predictor interface also
supports quantifying the outputs it supplies: a predictor implementation is required
to provide a score function, which returns a scalar value for the test input provided
to it, quantifying the effectiveness of the model used. Such values are used later for
tuning our Machine Learning models.


•	Transformers: Transformation of input data before learning a model is a very
common task in Machine Learning. Some data transformations are simple, for
example, replacing some missing data with a constant or taking a log transform,
while others are similar to learning algorithms themselves (for example, PCA).
To simplify such transformations, some estimator objects implement the
transformer interface. This interface allows us to perform a non-trivial
transformation on the input data and supply the output to our actual learning
algorithm. Since the transformer object retains the estimator used for the
transformation, it becomes very easy to apply the same transformation to unknown
test data using the transform function.
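The estimator, predictor, and transformer interfaces described above can be seen working together in a minimal sketch; StandardScaler and LogisticRegression are used here purely as illustrative choices on tiny made-up data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

# Transformer interface: fit learns the scaling, transform applies it
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Estimator interface: initialize with hyperparameters, then fit to data
clf = LogisticRegression(C=1.0).fit(X_scaled, y)

# Predictor interface: predict for unseen (scaled) inputs; score quantifies fit
preds = clf.predict(scaler.transform(np.array([[1.5], [3.5]])))
accuracy = clf.score(X_scaled, y)
```

Note how the same transformation learned on the training data is re-applied to the unseen inputs before prediction.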

Advanced APIs
In the earlier section we saw some of the most basic tenets of the scikit-learn package. In this section we
will briefly touch on the advanced constructs that are built on those basics. This advanced set of APIs
often helps data scientists express a complex set of essential operations using simple and streamlined
syntax.

•	Meta estimators: The meta estimator interface (implemented using the multiclass
interface) is a collection of estimators which can be composed by accumulating
simple binary classifiers. It allows us to extend the binary classifiers to implement
multi-class, multi-label, multi-regression, and multi-class-multi-label classifications.
This interface is important as these scenarios are common in modern day
Machine Learning and the capability to implement this out-of-the-box reduces the
programming requirements for data scientists. We should also remember that most
binary estimators in the scikit-learn library have multiclass capabilities built in
and we won’t be using the meta-estimators unless we need custom behavior.

•	Pipeline and feature unions: The steps of Machine Learning are mostly sequential
in nature. We read in the data, apply some simple or complex transformations,
fit an appropriate model, and predict using the model for unseen data. Another
hallmark of the Machine Learning process is that these steps are repeated multiple
times, due to its iterative nature, to arrive at the best possible model, which is then
deployed. It is convenient to chain these operations together and repeat them as a
single unit instead of applying them piecemeal. This concept is also known
as a Machine Learning pipeline. Scikit-learn provides a Pipeline API for this
purpose. A Pipeline() object from the pipeline module can chain multiple
estimators together (transformations, modeling, etc.), and the resultant object can
be used as an estimator itself. In addition to the Pipeline API, which applies these
estimators sequentially, we also have access to a FeatureUnion API, which
performs a specified set of operations in parallel and combines the outputs of all the
parallel operations. The use of pipelines is a fairly advanced topic, and it will become
clearer when we see specific examples in subsequent chapters.
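The Pipeline idea can be sketched as follows; the chosen steps (StandardScaler followed by Lasso) and the tiny synthetic data are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny synthetic regression problem: y is a linear function of the features
X = np.arange(20, dtype=float).reshape(10, 2)
y = X[:, 0] * 2.0 + 1.0

# Chain a transformer and an estimator; the pipeline is itself an estimator
pipe = Pipeline([('scale', StandardScaler()),
                 ('model', Lasso(alpha=0.01))])
pipe.fit(X, y)
preds = pipe.predict(X)
```

Because the pipeline is itself an estimator, it can be passed as a single unit to model selection utilities such as GridSearchCV.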

•	Model tuning and selection: Each learning algorithm will have a bunch of
parameters or hyperparameters associated with it. The iterative process of Machine
Learning aims at finding the best set of parameters that give us the model having the
best performance. For example, the process of tuning various hyperparameters of a
random forest algorithm, to find the set which gives the best prediction accuracy (or
any other performance metric). This process sometimes involves traversing through
the parameter space, searching for the best parameter set. Do note that even though
we mention the term parameter here, we typically indicate the hyperparameters of
a model. Scikit-learn provides useful APIs that help us navigate this parameter
space easily to find the best possible parameter combinations. We can use two meta-estimators, GridSearchCV and RandomizedSearchCV, to facilitate the search for
the best parameters. GridSearchCV, as the name suggests, involves providing a grid
of possible parameters and trying each possible combination among them to arrive
at the best one. A more efficient approach is often to use a random search through
the possible parameter set; this approach is provided by the RandomizedSearchCV
API. It samples the parameters and avoids the combinatorial explosions that can
result in the case of a higher number of parameters. In addition to the parameter
search, these model selection methods also allow us to use different cross-validation
schemes and score functions to measure performance.
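Since the text demonstrates GridSearchCV shortly, here is a small sketch of the RandomizedSearchCV alternative on the bundled diabetes dataset; the sampling distribution and n_iter value are illustrative choices, not recommendations:

```python
from scipy.stats import uniform
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

X, y = load_diabetes(return_X_y=True)

# Sample 10 candidate alpha values from a distribution instead of a fixed grid
search = RandomizedSearchCV(Lasso(random_state=0),
                            param_distributions={'alpha': uniform(1e-4, 0.5)},
                            n_iter=10, random_state=0)
search.fit(X, y)
best_alpha = search.best_params_['alpha']
```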

Scikit-learn Example: Regression Models
In the first chapter, we discussed an example that involved the task of classification. In this section, we will
tackle another interesting Machine Learning problem, that of regression. Keep in mind that the focus here is to
introduce you to the basic steps involved in using some of the scikit-learn library APIs. We will not try to
over-engineer our solution to arrive at the best model. Future chapters will focus on those aspects with real-world datasets.
For our regression example, we will use one of the datasets bundled with the scikit-learn library, the
diabetes dataset.

The Dataset
The diabetes dataset is one of the datasets bundled with the scikit-learn library. This small dataset allows
new users of the library to learn and experiment with various Machine Learning concepts using a well-known
dataset. It contains observations of 10 baseline variables: age, sex, body mass index, average blood pressure,
and six blood serum measurements for 442 diabetes patients. The dataset bundled with the package is
already standardized (scaled), i.e., the features have zero mean and unit L2 norm. The response (or target) variable
is a quantitative measure of disease progression one year after baseline. The dataset can be used to answer
two questions:
•	What is the baseline prediction of disease progression for future patients?
•	Which independent variables (features) are important factors for predicting disease progression?

We will try to answer the first question here by building a simple linear regression model. Let’s get
started by loading the data.
In [60]: from sklearn import datasets
In [61]: diabetes = datasets.load_diabetes()
In [63]: y = diabetes.target
In [66]: X = diabetes.data


In [67]: X.shape
Out[67]: (442L, 10L)
In [68]: X[:5]
Out[68]:
array([[ 0.03807591, 0.05068012, 0.06169621, 0.02187235, -0.0442235 ,
        -0.03482076, -0.04340085, -0.00259226, 0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, -0.02632783, -0.00844872,
        -0.01916334, 0.07441156, -0.03949338, -0.06832974, -0.09220405],
       [ 0.08529891, 0.05068012, 0.04445121, -0.00567061, -0.04559945,
        -0.03419447, -0.03235593, -0.00259226, 0.00286377, -0.02593034],
       [-0.08906294, -0.04464164, -0.01159501, -0.03665645, 0.01219057,
         0.02499059, -0.03603757, 0.03430886, 0.02269202, -0.00936191],
       [ 0.00538306, -0.04464164, -0.03638469, 0.02187235, 0.00393485,
         0.01559614, 0.00814208, -0.00259226, -0.03199144, -0.04664087])
In [69]: y[:10]
Out[69]: array([ 151., 75., 141., 206., 135., 97., 138., 63., 110., 310.])
Since we are using the data in the form of numpy arrays, we don’t get the names of the features in the data
itself. But we will keep a reference to the variable names, as they may be needed later in our process or just
for future reference.
In [78]: feature_names=['age', 'sex', 'bmi', 'bp',
    ...:                's1', 's2', 's3', 's4', 's5', 's6']
For predicting the response variable here, we will learn a Lasso model. A Lasso model is an extension
of the normal linear regression model that allows us to apply L1 regularization. Simply put, a
lasso regression tries to minimize the number of independent variables in the final model. This gives
us a model with only the most important variables (feature selection).
In [2]: from sklearn import datasets
   ...: from sklearn.linear_model import Lasso
   ...: import numpy as np
   ...: from sklearn import linear_model, datasets
   ...: from sklearn.model_selection import GridSearchCV

We will split our data into separate train and test sets (train is used to train the model and
test is used for model performance testing and evaluation).
In [3]: diabetes = datasets.load_diabetes()
   ...: X_train = diabetes.data[:310]
   ...: y_train = diabetes.target[:310]
   ...: X_test = diabetes.data[310:]
   ...: y_test = diabetes.target[310:]


Then we will define the model we want to use and the parameter space for one of the model’s
hyperparameters. Here we will search over the parameter alpha of the Lasso model. This parameter basically
controls the strictness of our regularization.
In [4]: lasso = Lasso(random_state=0)
   ...: alphas = np.logspace(-4, -0.5, 30)
Then we will initialize an estimator that identifies the model to be used. Notice that the
process is identical for learning a single model and for a grid search over models, i.e., both are objects of
the estimator class.
In [9]: estimator = GridSearchCV(lasso, dict(alpha=alphas))
In [10]: estimator.fit(X_train, y_train)
Out[10]:
GridSearchCV(cv=None, error_score='raise',
             estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
                            normalize=False, positive=False, precompute=False, random_state=0,
                            selection='cyclic', tol=0.0001, warm_start=False),
             fit_params={}, iid=True, n_jobs=1,         
             param_grid={'alpha': array([ 1.00000e-04, 1.32035e-04, 1.74333e-04, 2.30181e-04,
                                          3.03920e-04, ..., 2.39503e-01, 3.16228e-01])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True, scoring=None,   
             verbose=0)
This will take our train set and learn a group of Lasso models by varying the value of the alpha
hyperparameter. The GridSearchCV object will also score the models that we are learning, and we can use the
best_estimator_ attribute to identify the model and the optimal value of the hyperparameter that gave us
the best score. We can also directly use the same object for predicting with the best model on unknown data.
In [12]: estimator.best_score_
Out[12]: 0.46540637590235312
In [13]: estimator.best_estimator_
Out[13]:
Lasso(alpha=0.025929437974046669, copy_X=True, fit_intercept=True, max_iter=1000,    
      normalize=False, positive=False, precompute=False, random_state=0, selection='cyclic',     
      tol=0.0001, warm_start=False)
In [18]: estimator.predict(X_test)
Out[18]:
array([ 203.42104984, 177.6595529 , 122.62188598, 212.81136958, 173.61633075, 114.76145025,   
        202.36033584, 171.70767813, 164.28694562, 191.29091477, 191.41279009, 288.2772433,
        296.47009002, 234.53378413, 210.61427168, 228.62812055,...])
The next steps involve reiterating the whole process, making changes to the data transformation, the
Machine Learning algorithm, the tuning of the algorithm’s hyperparameters, etc., but the basic steps remain
the same. We will go into elaborate detail on these processes in future chapters of the book. Here we
conclude our introduction to the scikit-learn framework and encourage you to check out its extensive
documentation at http://scikit-learn.org/stable, which points to the home page of the most current
stable version of scikit-learn.


Neural Networks and Deep Learning
Deep Learning has become one of the most well-known representations of Machine Learning in recent
years. Deep Learning applications have achieved remarkable accuracy and popularity in various fields,
especially in image and audio related domains. Python is the language of choice when it comes to learning
deep networks and complex representations of data. In this section, we briefly discuss ANNs (Artificial
Neural Networks) and Deep Learning networks. Then we will move on to the popular Deep Learning
frameworks for Python. Since the mathematics behind ANNs is quite advanced, we will keep our
introduction minimal and focused on the practical aspects of learning a neural network. If you are more
interested in the internal details, we recommend you check out some standard literature on the theoretical
aspects of Deep Learning and neural networks, like Deep Learning by Goodfellow, Bengio, and Courville. The
following section gives a brief refresher on neural networks and Deep Learning based on what we covered in
detail in Chapter 1.

Artificial Neural Networks
Deep Learning can be considered an extension of Artificial Neural Networks (ANNs). Neural networks
were first introduced as a method of learning by Frank Rosenblatt in 1958; although his learning model,
called the perceptron, was different from modern day neural networks, we can still regard the perceptron as the
first artificial neural network.
Artificial neural networks loosely work on the principle of learning a distributed representation of data.
The underlying assumption is that the generated data is the result of a nonlinear combination of a set of latent
factors, and if we are able to learn this distributed representation, then we can make accurate predictions
about a new set of unknown data. The simplest neural network will have an input layer, a hidden layer (a
result of applying a nonlinear transformation to the input data), and an output layer. The parameters of the
ANN model are the weights of each connection in the network and sometimes a bias parameter.
This simple neural network is represented as shown in Figure 2-6.


Figure 2-6. A simple neural network
This network has an input vector of size 3, a hidden layer of size 4, and a binary output layer. The
process of learning an ANN involves the following steps.

1. Define the structure or architecture of the network we want to use. This is critical
because if we choose a very extensive network containing a lot of neurons/units (each
circle in Figure 2-6 can be labeled as a neuron or a unit), we can overfit our
training data and our model won't generalize well.

2. Choose the nonlinear transformation to be applied to each connection. This
transformation controls the activation of each neuron in the network.

3. Decide on the loss function we will use for the output layer. This is applicable
when we have a supervised learning problem, i.e., an output label
associated with each of the input data points.

4. Learn the parameters of the neural network, i.e., determine the value of each
connection weight. Each arrow in Figure 2-6 carries a connection weight. We
will learn these weights by optimizing our loss function using some optimization
algorithm and a method called backpropagation.
We will not go into the details of backpropagation here, as it is beyond the scope of this chapter;
we will revisit these topics when we actually use neural networks.
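The steps above can be made concrete with a tiny NumPy sketch of a single forward pass through the 3-4-1 network of Figure 2-6. This is an illustrative toy, not keras or theano code; the weights here are random placeholders, not learned, and the choice of tanh for the hidden layer is just one possible nonlinear transformation (step 2).

```python
import numpy as np

def sigmoid(z):
    # squashes any real value into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, W1, b1, W2, b2):
    """One forward pass through a 3-4-1 network like the one in Figure 2-6."""
    hidden = np.tanh(W1 @ x + b1)       # nonlinear transformation of the input
    output = sigmoid(W2 @ hidden + b2)  # single sigmoid unit for a binary output
    return output

rng = np.random.RandomState(42)
x = np.array([0.5, -1.2, 3.0])          # input vector of size 3
W1, b1 = rng.randn(4, 3), np.zeros(4)   # connection weights into the 4 hidden units
W2, b2 = rng.randn(1, 4), np.zeros(1)   # connection weights into the output unit
y = forward_pass(x, W1, b1, W2, b2)
print(y.shape, 0.0 < y[0] < 1.0)
```

Learning (step 4) would then consist of adjusting W1, b1, W2, b2 via backpropagation so that outputs like y match the training labels.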

Deep Neural Networks
Deep neural networks are an extension of normal artificial neural networks. Compared to normal neural
networks, deep neural networks differ in a few major ways.

Number of Layers
Normal neural networks are shallow, which means that they have at most one or two hidden layers.
The major difference in deep neural networks is that they have many more hidden layers, and this
number is usually very large. For example, the Google Brain project used a neural network that had millions
of neurons.

Diverse Architectures
Based on what we discussed in Chapter 1, we have a wide variety of deep neural network architectures,
ranging from DNNs and CNNs to RNNs and LSTMs. Recent research has even given us attention-based networks
that place special emphasis on specific parts of a deep neural network. Hence, with Deep Learning, we have
definitely gone beyond the traditional ANN architecture.

Computation Power
The larger the network and the more layers it has, the more complex the network becomes, and training
it takes a lot of time and resources. Deep neural networks work best on GPU-based architectures and
take far less time to train there than on traditional CPUs, and recent improvements have further decreased
training times.

Python Libraries for Deep Learning
Python is a language of choice, across both academia and enterprises, for developing and using normal/deep
neural networks. We will learn about two packages, Theano and TensorFlow, which allow us to build
neural network based models on datasets. In addition to these, we will learn to use Keras, a high-level
interface for building neural networks easily, with a concise API, capable of running on top of both
TensorFlow and Theano. Besides these, there are more excellent frameworks for Deep Learning; we
also recommend checking out PyTorch, MXNet, Caffe (Caffe2 was recently released), and Lasagne.


Theano
The first library popularly used for learning neural networks is Theano. Although Theano is not by itself a
traditional Machine Learning or neural network framework, what it provides is a powerful set of
constructs that can be used to train both normal Machine Learning models and neural networks. Theano
allows us to symbolically define mathematical functions and automatically derive their gradient expressions,
one of the most frequently used steps in learning any Machine Learning model. Using Theano, we can
express our learning process with normal symbolic expressions and Theano then generates optimized
functions that carry out those steps.
Training Machine Learning models is a computationally intensive process. Neural networks in particular
have steep computational requirements, due to both the number of learning steps involved and the nonlinearities
in them, and this problem is multiplied manifold when we decide to learn a deep neural network. One of
the important reasons Theano matters for neural network learning is its capability to generate
code that executes seamlessly on both CPUs and GPUs. Thus, if we specify our Machine Learning models
using Theano, we can also get the speed advantage offered by modern-day GPUs.
In the rest of this section, we see how we can install Theano and carry out simple symbolic operations
using the expressions it provides.

Installation
Theano can be easily installed by using the Python package manager pip or conda.
pip install theano
Often the pip installer fails on Windows, hence we recommend using conda install theano on the
Windows platform. We can verify the installation by importing the newly installed package in a Python shell.
In [1]: import theano
If you get no errors, you have successfully installed the theano library on your system.

Theano Basics (Barebones Version)
In this section, we discuss some of the basic symbolic capabilities offered by theano and how they can be
leveraged to build some simple learning models. We will not directly use theano to build a neural network in
this section, but you will learn how to carry out symbolic operations in theano. Besides this, you will see in the
coming section that building neural networks is much easier when we use a higher-level library such as keras.
Theano expresses symbolic computations using something called tensors. A tensor, in its simplest
definition, is a multi-dimensional array. So a zero-order tensor is a scalar, a first-order tensor is a vector,
and a second-order tensor is a matrix.
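This hierarchy is easy to see with NumPy arrays, whose ndim attribute is exactly the tensor order:

```python
import numpy as np

scalar = np.array(5.0)                # zero-order tensor: a single number
vector = np.array([1.0, 2.0, 3.0])    # first-order tensor: a vector
matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0]])       # second-order tensor: a matrix
print(scalar.ndim, vector.ndim, matrix.ndim)   # 0 1 2
```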
Now we look at how we can work on a zero-order tensor, or scalar, using the constructs provided by theano.
In [3]: import numpy
   ...: import theano.tensor as T
   ...: from theano import function
   ...: x = T.dscalar('x')
   ...: y = T.dscalar('y')
   ...: z = x + y
   ...: f = function([x, y], z)
   ...: f(8, 2)

Out[3]: array(10.0)


Here, we defined a symbolic operation (denoted by the symbol z) and then bound the input and the
operation in a function. This was achieved using the function construct provided by theano. Contrast this
with the normal programming paradigm, where we would need to define the whole function ourselves. This
is one of the most powerful aspects of using a symbolic mathematical package like theano. Using constructs
similar to these, we can define complex sets of operations.
Graph structure: Theano represents symbolic mathematical operations as graphs. So when we
define an operation like z, as depicted in the earlier example, no calculation happens; instead, what we get is
a graph representation of the expression. These graphs are made up of Apply, Op, and variable nodes. An
Apply node represents the application of some op to some set of variable nodes. If we wanted to visualize the
operation we defined in the preceding step as a graph, it would look like the depiction in Figure 2-7. (Source:
http://deeplearning.net/software/theano/extending/graphstructures.html.)

Figure 2-7. Graph structure of Theano operation
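The Apply/Op/variable structure can be sketched in a few lines of plain Python (a toy illustration of the idea only; Theano's actual classes are much richer and also perform optimization and code generation):

```python
class Variable:
    """A node holding a value: either a named input or the output of an Apply."""
    def __init__(self, name=None, owner=None):
        self.name, self.owner = name, owner

class Apply:
    """Application of an op to some input variables, producing an output variable."""
    def __init__(self, op, inputs):
        self.op, self.inputs = op, inputs
        self.output = Variable(owner=self)

def add(x, y):
    # building the expression creates graph nodes; nothing is computed yet
    return Apply('add', [x, y]).output

def evaluate(var, env):
    # walk the graph: leaf variables read from env, Apply nodes recurse
    if var.owner is None:
        return env[var.name]
    args = [evaluate(v, env) for v in var.owner.inputs]
    if var.owner.op == 'add':
        return args[0] + args[1]
    raise ValueError('unknown op: %s' % var.owner.op)

x, y = Variable('x'), Variable('y')
z = add(x, y)                           # just a graph, no calculation
print(evaluate(z, {'x': 8, 'y': 2}))    # 10
```

The separation between building z and evaluating it mirrors what theano's function construct does when it compiles the graph.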
Theano has various low-level tensor APIs for building neural network architectures using tensor
arithmetic and ops. These are available in the theano.tensor.nnet module, and you can check out the relevant
functions at http://deeplearning.net/software/theano/library/tensor/nnet/index.html, which
include conv for convolutional neural networks and nnet for regular neural network operations. This
concludes our basic introduction to theano. We kept it simple because we will rarely use theano directly,
instead relying on high-level libraries like keras to build powerful deep neural networks with minimal code
and focus more on solving problems efficiently and effectively.


Tensorflow
Tensorflow is an open source software library for Machine Learning released by Google in November 2015.
It is based on the internal system that Google uses to power its research and production systems.
Tensorflow is quite similar to Theano and can be considered Google's attempt to provide an upgrade to
Theano, with easy-to-use interfaces for Deep Learning, neural networks, and Machine Learning,
and a strong focus on rapid prototyping and model deployment constructs. Like Theano, it provides
constructs for symbolic mathematics, which are then translated into computational graphs. These graphs
are then compiled into lower-level code and executed efficiently. Like theano, tensorflow supports
CPUs and GPUs seamlessly. In fact, tensorflow works best on a TPU, the Tensor Processing Unit,
which was invented by Google. In addition to its Python API, tensorflow also exposes APIs for the
C++, Haskell, Java, and Go languages. One of the major differences tensorflow has compared to
theano is its support for higher-level operations, which ease the process of Machine Learning, and
its focus on model development as well as deployment to production and model serving via multiple
mechanisms (https://www.tensorflow.org/serving/serving_basic). Also, the documentation and usage
of theano is not very intuitive, another gap tensorflow aims to fill with its easy-to-understand
implementations and extensive documentation.
The constructs provided by tensorflow are quite similar to those of Theano, so we will not reiterate
them. You can always refer to the tensorflow documentation at https://www.tensorflow.org/
for more details.

Installation
Tensorflow works well on Linux and Mac systems, but was initially not available on Windows due to internal
dependencies on Bazel. The good news is that it has since been successfully launched for the Windows platform
too, where it requires a minimum of Python 3.5. The library can be installed using pip or the
conda install command. Note that for a successful installation of Tensorflow, we also require
updated dask and pandas libraries on our system.
conda install tensorflow
Once we have installed the library, we can verify a successful install in the ipython
console with the following commands.
In [21]: import tensorflow as tf
    ...: hello = tf.constant('Hello, TensorFlow!')
    ...: sess = tf.Session()
    ...: print(sess.run(hello))

b'Hello, TensorFlow!'
This message verifies our successful install of the tensorflow library. You are also likely to see a bunch
of warning messages, but you can safely ignore them; they appear because the default tensorflow build is
not compiled with support for some CPU instruction sets, which may slow down the learning process a bit.


Keras
Keras is a high-level Deep Learning framework for Python, capable of running on top of both
Theano and Tensorflow. Developed by François Chollet, the most important advantage of using Keras is the
time saved by its easy-to-use but powerful high-level APIs that enable rapid prototyping of an idea. Keras
allows us to use the constructs offered by Tensorflow and Theano in a much more intuitive and easy-to-use
way, without writing excess boilerplate code for building neural network based models. This flexibility
and simplicity is the major reason for the popularity of keras. In addition to providing easy access to both
of these somewhat esoteric libraries, keras ensures that we can still take advantage of what they
offer. In this section, you learn how to install Keras, the basics of model development
with Keras, and how to develop an example neural network model using keras and tensorflow.

Installation
Keras is easy to install using the familiar pip or conda command. We assume that both
tensorflow and theano are installed, as they will be required as backends for keras model
development.
conda install keras
We can check for a successful installation of keras in our environment by importing it in IPython.
Upon a successful import, it will display the current backend, which is usually theano by default. To switch
the backend, you need to edit the keras.json file, available under the .keras directory in your user account's
home directory. Our config file contents are as follows.
{"epsilon": 1e-07, "floatx": "float32",
"backend": "tensorflow", "image_data_format": "channels_last"}
You can refer to https://keras.io/backend/, which explains how easily you can switch the backend in
keras from theano to tensorflow. Once the backend is specified in the config file, you
should see the following message in your ipython shell on importing keras.
In [22]: import keras
Using TensorFlow backend

Keras Basics
The main abstraction for a neural network in keras is a model. A model is a collection of neurons that
defines the structure of a neural network. There are two different types of models:

•	Sequential model: Sequential models are just stacks of layers that together can
define a neural network. If you refer back to Figure 2-6, where we introduced
neural networks, that network can be defined by specifying three layers in a sequential
keras model. We will see an example of a sequential model later in this section.

•	Functional API model: Sequential models are very useful, but sometimes our
requirements will exceed the constructs possible with sequential models. This is
where the functional API comes into the picture. This API allows us to
specify complex networks, i.e., networks that can have multiple outputs, networks
with shared layers, and so on. These kinds of models are needed when we use
advanced neural networks like convolutional neural networks or recurrent neural
networks.


Model Building
The model building process with keras is a three-step process. The first step is specifying the structure of the
model. This is done by configuring the base model we want to use, which is either a sequential model or
a functional model. Once we have identified a base model for our problem, we further enrich it
by adding layers, starting with the input layer, to which we feed our input data feature
vectors. The subsequent layers to be added are based on the requirements of the model. keras
provides a bunch of layers that can be added to the model (hidden layers, fully connected, CNN, LSTM,
RNN, and so on); we describe some of them while running through our neural network example. We can
stack these layers together in a complex manner and add the final output layer to arrive at our overall model
architecture.
The next step in the model learning process is the compilation of the model architecture that we
defined in the first step. Based on what we learned in the preceding sections on Theano and Tensorflow,
most of the model building steps are symbolic and the actual learning is deferred until later. In the
compilation step, we configure the learning process. In addition to the structure of the model, the learning
process needs us to specify the following three important parameters:
•	Optimizer: We learned in the first chapter that the simplest explanation of a learning
process is the optimization of a loss function. Once we have the model and the loss
function, we can specify the optimizer, which identifies the actual optimization
algorithm or program we will use to train the model and minimize the loss or error.
This could be a string identifier for one of the already implemented optimizers, a
function, or an object of the Optimizer class that we implement.

•	Loss function: A loss function, also known as an objective function, specifies the
objective of minimizing loss/error, which our model leverages to get the best
performance over multiple epochs/iterations. It again can be a string identifier for
a pre-implemented loss function like cross-entropy loss (classification) or mean
squared error (regression), or a custom loss function that we develop.

•	Performance metrics: A metric is a quantifiable measure of the learning process.
While compiling a model, we can specify a performance metric we want to track
(for example, accuracy for a classification model), which educates us about the
effectiveness of the learning process and helps in evaluating model performance.

The last step in the model building process is executing the compiled model to start the training
process. This executes the lower-level compiled code to find the necessary parameters and weights of
our model during the training process. In keras, as in scikit-learn, this is achieved by calling the fit function
on our model. We can control the behavior of the function by supplying appropriate arguments; you can
learn about these arguments at https://keras.io/models/sequential/.

Learning an Example Neural Network
We will conclude this section by building a simple working neural network model on one of the datasets
bundled with the scikit-learn package. We use the tensorflow backend in our example, but
you can try a theano backend and verify the execution of the model on both backends.
For our example, we use the Wisconsin Breast Cancer dataset, which is bundled with the
scikit-learn library. The dataset contains attributes computed from a digitized image of a fine needle aspirate
of a breast mass; they describe characteristics of the cell nuclei present in the image. On the basis of those
attributes, the mass can be marked as malignant or benign, and the goal of our classification system is to
predict that label. So let's get started by loading the dataset.


In [33]: from sklearn.datasets import load_breast_cancer
    ...: cancer = load_breast_cancer()
    ...: X_train = cancer.data[:340]
    ...: y_train = cancer.target[:340]
    ...: X_test = cancer.data[340:]
    ...: y_test = cancer.target[340:]

The next step of the process is to define the model architecture using the keras model class. We see that
our input vector has 30 attributes, so we will have a shallow network with one hidden layer of half
that number of units (neurons), i.e., 15 units in the hidden layer. We add a one-unit output layer to predict
either 1 or 0, based on whether the input data point is benign or malignant. This is a simple neural network
and doesn't involve Deep Learning.
In [39]: import numpy as np
    ...: from keras.models import Sequential
    ...: from keras.layers import Dense, Dropout
    ...:
In [40]: model = Sequential()
    ...: model.add(Dense(15, input_dim=30, activation='relu'))
    ...: model.add(Dense(1, activation='sigmoid'))
Here we have defined a sequential keras model with a dense hidden layer of 15 units.
A dense layer is a fully connected layer, meaning that each of those 15 units (neurons) is fully
connected to all 30 input features. The output layer for our example is a dense layer with the sigmoid
activation. The sigmoid activation is used to convert a real-valued input into a binary output (1 or 0). Once
we have defined the model, we compile it by supplying the necessary optimizer, loss
function, and the metric on which we want to evaluate the model performance.
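As a quick aside, what the sigmoid activation computes can be sketched with NumPy (the math only, not the keras internals):

```python
import numpy as np

def sigmoid(z):
    # maps any real value into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, 0.0, 4.0])
probs = sigmoid(z)                     # interpretable as class probabilities
labels = (probs >= 0.5).astype(int)    # threshold at 0.5 to get a binary label
print(probs.round(2), labels)
```

Thresholding the sigmoid output at 0.5 is exactly how a probability becomes a hard 1-or-0 prediction.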
In [41]: model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
Here we used the binary_crossentropy loss function, a standard loss function for binary
classification problems. For the optimizer, we used rmsprop, an upgrade over the normal gradient
descent algorithm. The next step is to fit the model using the fit function.
In [41]: model.fit(X_train, y_train, epochs=20, batch_size=50)
Epoch 1/20
340/340 [==============================] - 0s - loss: 7.3616 - acc: 0.5382
Epoch 2/20
340/340 [==============================] - 0s - loss: 7.3616 - acc: 0.5382
...
Epoch 19/20
340/340 [==============================] - 0s - loss: 7.3616 - acc: 0.5382
Epoch 20/20
340/340 [==============================] - 0s - loss: 7.3616 - acc: 0.5382

Here, an epoch is one complete forward and backward pass over all the training
examples. The batch_size parameter indicates the number of samples propagated through
the NN model at a time, for one forward and backward pass, to train the model and update the gradient.
Thus, if you have 100 observations and your batch size is 10, each epoch will consist of 10 iterations where 10
observations (data points) are passed through the network at a time and the weights of the hidden layer
units are updated. However, we can see that the overall loss and training accuracy remain the same,
which means the model isn't really learning anything, from the looks of it!
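For our actual run, the samples/batch-size/epoch arithmetic works out as follows (a quick sketch):

```python
import math

n_samples, batch_size, epochs = 340, 50, 20
# the last batch of an epoch may be smaller than batch_size, hence the ceiling
iters_per_epoch = math.ceil(n_samples / batch_size)
total_updates = iters_per_epoch * epochs
print(iters_per_epoch, total_updates)   # 7 weight updates per epoch, 140 in total
```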
The keras API again follows the conventions of scikit-learn models, so we can use the predict
function to predict on the data points in the test set. In fact, we use predict_classes to get the actual class
label predicted for each test data instance.
In [43]: predictions = model.predict_classes(X_test)
128/229 [===============>..............] - ETA: 0s
Let's evaluate the model performance by looking at the test data accuracy and other performance
metrics like precision, recall, and F1 score. Do not despair if you do not understand some of these terms;
we will cover them in detail in Chapter 5. For now, you should know that scores closer to 1 indicate
better results, i.e., an accuracy of 1 would indicate 100% model accuracy, which is perfection. Luckily,
scikit-learn provides us with the necessary performance metric APIs.
In [44]: from sklearn import metrics
    ...: print('Accuracy:', metrics.accuracy_score(y_true=y_test, y_pred=predictions))
    ...: print(metrics.classification_report(y_true=y_test, y_pred=predictions))
Accuracy: 0.759825327511
             precision    recall  f1-score   support
          0       0.00      0.00      0.00        55
          1       0.76      1.00      0.86       174
avg / total       0.58      0.76      0.66       229
From these performance metrics, we can see that even though the model accuracy is 76%, for data
points having cancer (malignant), i.e., label 0, it misclassifies all 55 instances as 1, while the remaining 174
instances, where the class label is 1 (benign), it classifies perfectly. Thus this model hasn't learned much
and predicts every response as benign (label 1). Can we do better than this?

The Power of Deep Learning
The idea of Deep Learning is to use multiple hidden layers to learn latent and complex data patterns,
relationships, and representations to build a model that learns and generalizes well on the underlying
data. Let’s take the previous example and convert it to a fully connected deep neural network (DNN)
by introducing two more hidden layers. The following snippet builds and trains a DNN with the same
configuration as our previous experiment only with the addition of two new hidden layers.
In [45]: model = Sequential()
    ...: model.add(Dense(15, input_dim=30, activation='relu'))
    ...: model.add(Dense(15, activation='relu'))
    ...: model.add(Dense(15, activation='relu'))
    ...: model.add(Dense(1, activation='sigmoid'))
    ...:
    ...: model.compile(loss='binary_crossentropy',
    ...:               optimizer='rmsprop',
    ...:               metrics=['accuracy'])
    ...:
    ...: model.fit(X_train, y_train, epochs=20, batch_size=50)
Epoch 1/20
340/340 [==============================] - 0s - loss: 3.3799 - acc: 0.3941
Epoch 2/20
340/340 [==============================] - 0s - loss: 1.3740 - acc: 0.6059
Epoch 3/20
340/340 [==============================] - 0s - loss: 0.4258 - acc: 0.8471
...
Epoch 19/20
340/340 [==============================] - 0s - loss: 0.2361 - acc: 0.9235
Epoch 20/20
340/340 [==============================] - 0s - loss: 0.3154 - acc: 0.9000

We see a remarkable jump in the training accuracy and a drop in the loss based on the preceding
training output. This is indeed excellent and seems promising! Let’s check out our model performance on
the test data now.
In [46]: predictions = model.predict_classes(X_test)
    ...: print('Accuracy:', metrics.accuracy_score(y_true=y_test, y_pred=predictions))
    ...: print(metrics.classification_report(y_true=y_test, y_pred=predictions))
Accuracy: 0.912663755459
             precision    recall  f1-score   support
          0       0.78      0.89      0.83        55
          1       0.96      0.92      0.94       174
avg / total       0.92      0.91      0.91       229
We achieve an overall accuracy and F1 score of 91%, and we now have an F1 score of 83%,
compared to 0% from the previous model, for class label 0 (malignant). Thus you can clearly get a feel for
the power of Deep Learning: just introducing more hidden layers in our network enabled our model
to learn better representations of our data. Try experimenting with other architectures or
even introducing regularization aspects like dropout.
Thus, in this section, you learned about some of the important frameworks relevant to neural networks
and Deep Learning. We will revisit the more advanced aspects of these frameworks in subsequent chapters
when we work on real-world case studies.

Text Analytics and Natural Language Processing
In the sections so far, we have mostly dealt with structured data formats and datasets, i.e., data in which
the observations occur as rows and the features or attributes of those observations
occur as columns. This format is the most convenient for Machine Learning algorithms, but the problem is
that raw data is not always available in this easy-to-interpret format. This is the case with unstructured data
formats like audio, video, and textual datasets. In this section, we give a brief overview of the frameworks
we can use when the data we are working with is unstructured text data. We will not
go into detailed examples of using these frameworks; if you are interested, we recommend checking out
Chapter 7 of this book, which deals with a real-world case study on analyzing text data.


The Natural Language Tool Kit
Perhaps the most important Python library for working with text data is NLTK, the Natural Language Tool
Kit. This section introduces NLTK, covering the installation procedure of the library and a brief description
of its important modules.

Installation and Introduction
The nltk package can be installed in the same way as most of the other packages used in this book, which is
by using the pip or conda command.
conda install nltk
We can verify the installation by importing the package in an IPython/Python shell.
In [1]: import nltk
There's an important difference between the nltk library and other standard libraries. For
other libraries, in general, we don't need to download any auxiliary data, but for the nltk library to work
to its full potential, we require some auxiliary data, mostly various corpora. This data is
leveraged by multiple functions and modules in the library. We can download it by executing the
following command in the Python shell.
In [5]: nltk.download()
This command brings up the screen shown in Figure 2-8, where we can select the additional data
we want to install and the installation location. We will choose to install all the additional data and
packages available.

Figure 2-8. nltk download option


You can also choose to download all necessary datasets without the GUI by using the following
command from the ipython or Python shell.
nltk.download('all', halt_on_error=False)
Once the download finishes, we will be able to use all the necessary functionality and bundled
data of the nltk package. We will now take a look at the major modules of the nltk library and introduce the
functionality each of them provides.

Corpora
The starting point of any text analytics process is collecting the documents of interest into
a single dataset. This dataset is central to the subsequent steps of processing and analysis. Such a collection of
documents is generally called a corpus; multiple corpus datasets are called corpora. The nltk module
nltk.corpus provides the necessary functions for reading corpus files in a variety of formats. It
supports reading corpora both from the datasets bundled in the nltk package and from external corpora.

Tokenization
Tokenization is one of the core steps in text pre-processing and normalization. Each text document
has several components, like paragraphs, sentences, and words, that together make up the document.
The process of tokenization is used to break down the document into these smaller components. This
tokenization can be into sentences, words, clauses, and so on. The most popular ways to tokenize any
document are sentence tokenization and/or word tokenization. The nltk.tokenize module of the
nltk library provides functionality that enables efficient tokenization of any textual data.
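As an illustration of the idea, here is a minimal regex-based word tokenizer in plain Python (a sketch only; nltk.tokenize handles many more edge cases, such as contractions and abbreviations):

```python
import re

def simple_word_tokenize(text):
    # capture runs of word characters, or single punctuation marks as tokens
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("The brown fox saw the yellow dog.")
print(tokens)   # ['The', 'brown', 'fox', 'saw', 'the', 'yellow', 'dog', '.']
```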

Tagging
A text document is constructed based on various grammatical rules and constructs. The grammar depends
on the language of the text document. Each language’s grammar will contain different entities and parts of
speech like nouns, pronouns, adjectives, adverbs, and so on. The process of tagging will involve getting a text
corpus, tokenizing the text and assigning metadata information like tags to each word in the corpora. The
nltk.tag module contains implementation of different algorithms that can be used for such tagging and
other related activities.

Stemming and Lemmatization
A word can have several different forms based on what part of speech it is representing. Consider the
word fly; it can be present in various forms in the same text, like flying, flies, flyer, and so on. The process of
stemming is used to convert all the different forms of a word into a base form, known as the root
stem. Lemmatization is similar to stemming, but its base form, known as the root word or lemma, is always a
semantically and lexicographically correct word. This conversion is crucial, as a lot of the time the core word
carries more information about the document, which can be diluted by these different forms. The nltk
module nltk.stem contains different techniques that can be used for stemming and lemmatizing a corpus.


Chapter 2 ■ The Python Machine Learning Ecosystem

Chunking
Chunking is a process which is similar to parsing or tokenization but the major difference is that instead of
trying to parse each word, we will target phrases present in the document. Consider the sentence “The brown
fox saw the yellow dog”. In this sentence, we have two phrases which are of interest. The first is the phrase
“the brown fox,” which is a noun phrase and the second one is the phrase “the yellow dog,” which again is
a noun phrase. By using the process of chunking, we are able to tag phrases with additional parts of speech
information, which is important for understanding the structure of the document. The nltk module
nltk.chunk consists of necessary techniques that can be used for applying the chunking process to our corpora.

Sentiment
Sentiment or emotion analysis is one of the most recognizable applications on text data. Sentiment analysis
is the process of taking a text document and trying to determine the opinion and polarity being represented
by that document. Polarity in reference to a text document can mean the emotion, e.g., positive, negative, or neutral, being represented by the data. Sentiment analysis on textual data can be done using different
algorithms and at different levels of text segmentation. The nltk.sentiment package is the module that can
be used to perform different sentiment analyses on text documents. Check out Chapter 7 for a real-world
case study on sentiment analysis!

Classification/Clustering
Classification of text documents is a supervised learning problem, as we explained in the first chapter.
Classification of text documents may involve learning the sentiment, topic, theme, category, and so on of
several text documents (corpus) and then using the trained model to label unknown documents in the
future. The major difference from normal structured data comes in the form of feature representations of
unstructured text we will be using. Clustering involves grouping together similar documents based on some
similarity measure, like cosine similarity, bm25 distance, or even semantic similarity. The nltk.classify
and nltk.cluster modules are typically used to perform these operations once we do the necessary feature
engineering and extraction.

Other Text Analytics Frameworks
Typically, nltk is our go-to library for dealing with text data, but the Python ecosystem also contains other
libraries that can be useful in dealing with textual data. We will briefly mention some of these libraries so
that you get a good grasp of the toolkit that you can arm yourself with when dealing with unstructured
textual data.
•	pattern: The pattern framework is a web mining module for the Python programming language. It has tools for web mining (extracting data from Google, Twitter, a web crawler, or an HTML DOM parser), information retrieval, NLP, Machine Learning, sentiment analysis, network analysis, and visualization. Unfortunately, pattern currently works best on Python 2.7 and there is no official port for Python 3.x.
•	gensim: The gensim framework, which stands for generate similar, is a Python library that has a core purpose of topic modeling at scale! It can be used to extract semantic topics from documents. The focus of gensim is on providing efficient topic modeling and similarity analysis. It also contains a Python implementation of Google’s popular word2vec model.


•	textblob: This is another Python library that promises simplified text processing. It provides a simple API for doing common text processing tasks including parts of speech tagging, tokenization, phrase extraction, sentiment analysis, classification, translation, and much more!
•	spacy: This is a recent addition to the Python text processing landscape but an excellent and robust framework nonetheless. The focus of spacy is industrial strength natural language processing, so it targets efficient text analytics for large-scale corpora. It achieves this efficiency by leveraging carefully memory-managed operations in Cython. We recommend using spacy for natural language processing and you will also see it being used extensively for our text normalization process in Chapter 7.

Statsmodels
Statsmodels is a library for statistical and econometric analysis in Python. The advantage of languages like
R is that it’s a statistically focused language with a lot of capabilities. It consists of easy-to-use yet powerful models that can be used for statistical analysis and modeling. However, from the deployment, integration, and performance perspectives, data scientists and engineers often prefer Python, which lacks R’s breadth of easy-to-use statistical functions and libraries. The statsmodels library aims to bridge this gap for
Python users. It provides the capabilities for statistical, financial and econometric operations with the aim
of combining the advantages of Python with the statistical powers of languages like R. Hence users familiar
with R, SAS, Stata, SPSS, and so on who might want similar functionality in Python can use statsmodels.
The initial statsmodels package was developed by Jonathan Taylor, a statistician at Stanford, as part of SciPy under the name models. Improving this codebase was then accepted as a SciPy-focused project for the Google Summer of Code in 2009 and again in 2010. The current package is available as a scikit, i.e., an add-on package for SciPy. We recommend checking out the paper by Skipper Seabold and Josef Perktold, “Statsmodels: Econometric and Statistical Modeling with Python,” Proceedings of the 9th Python in Science Conference, 2010.

Installation
The package can be installed using either pip or conda with one of the following commands.
pip install statsmodels
conda install -c conda-forge statsmodels

Modules
In this section, we briefly cover the important modules that comprise the statsmodels package and the capabilities those modules provide. This should give you a good idea of what to leverage to build statistical models and perform statistical analysis and inference.

Distributions
One of the central ideas in statistics is the distributions of statistical datasets. Distributions are a listing or
function that assigns a probability value to all the possible values of the data. The distributions module of
the statsmodels package implements some important functions related to statistical distribution including
sampling from the distribution, transformations of distributions, generating cumulative distribution
functions of important distributions, and so on.
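For example, the empirical cumulative distribution function of a sample can be built in a couple of lines; a minimal sketch on simulated data (the sample itself is made up):

```python
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF

rng = np.random.RandomState(3)
sample = rng.randn(1000)   # a standard normal sample

ecdf = ECDF(sample)
# for N(0, 1), the ECDF evaluated at 0 should be close to 0.5
print(ecdf(0.0))
```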


Linear Regression
Linear regression is the simplest form of statistical modeling for the relationship between a response (dependent) variable and one or more independent variables, such that the response variable typically follows a normal distribution. The statsmodels.regression module allows us to learn linear models on data with IID (independently and identically distributed) errors. This module allows us to use different methods like ordinary least squares (OLS), weighted least squares (WLS), and generalized least squares (GLS) for the estimation of the linear model parameters.

Generalized Linear Models
Normal linear regression can be generalized if the dependent variable follows a different distribution than
the normal distribution. The statsmodels.genmod module allows us to extend the normal linear models to
different response variables. This allows us to predict the linear relationship between the independent and
dependent variable when the dependent variable follows distributions other than normal distributions.

ANOVA
Analysis of variance (ANOVA) is a collection of statistical procedures used to analyze the differences between group means. ANOVA is an important way to test whether the means of several groups are equal or unequal. This is an extremely powerful tool in hypothesis testing and statistical inference and is implemented in the anova_lm module of the statsmodels package.

Time Series Analysis
Time series analysis is an important part of data analytics. A lot of data sources like stock prices, rainfall,
population statistics, etc. are periodic in nature. Time series analysis is used to find structures, trends, and
patterns in these streams of data. These trends can be used to understand the underlying phenomena using
a mathematical model and even make predictions and forecasts about future events. Basic time series
models include univariate autoregressive models (AR), vector autoregressive models (VAR), univariate
autoregressive moving average models (ARMA), as well as the very popular autoregressive integrated
moving average (ARIMA) model. The tsa module of the statsmodels package provides implementation of
time series models and also provides tools for time series data manipulation.

Statistical Inference
An important part of traditional statistical inference is the process of hypothesis testing. A statistical
hypothesis is an assumption about a population parameter. Hypothesis testing is the formal process of
accepting or rejecting the assumption made about the data on the basis of observational data collected from
samples taken from the population. The stats.stattools module of the statsmodels package implements the most important hypothesis tests. Some of these tests are independent of any model, while some are
tied to a particular model only.

Nonparametric Methods
Nonparametric statistics refers to statistics that is not based on any parameterized family of probability
distributions. When we make an assumption about the distribution of a random variable we assign the
number of parameters required to ascertain its behavior. For example, if we say that some metric of interest
follows a normal distribution it means that we can understand its behavior if we are able to determine


the mean and variance of that metric. This is the key difference in non-parametric methods, i.e., we don’t
have a fixed number of parameters that are required to describe an unknown random variable. Instead, the number of parameters is dependent on the amount of training data. The module nonparametric
in the statsmodels library will help us perform non-parametric analysis on our data. It includes kernel
density estimation for univariate and multivariate data, kernel regression, and locally weighted scatterplot
smoothing.

Summary
This chapter introduced a select group of packages that we will use routinely to process, analyze, and model
our data. You can consider these libraries and frameworks as the core tools of a data scientist’s toolbox. The
list of packages we covered is far from exhaustive but they certainly are the most important packages. We
strongly suggest you get more familiar with the packages by going through their documentation and relevant
tutorials. We will keep introducing and explaining other important features and aspects of these frameworks
in future chapters. The examples in this chapter, along with the conceptual knowledge provided in the first chapter, should give you a good foundation for understanding Machine Learning and solving problems in a simple and concise way. We will observe, in the subsequent chapters, that the process of learning models on our data is often a reiteration of these simple steps and concepts. In the next chapter, you will learn how to wield this set of tools to solve bigger and more complex problems in the areas of data processing, wrangling, and visualization.


PART II

The Machine Learning Pipeline

CHAPTER 3

Processing, Wrangling, and
Visualizing Data
The world around us has changed tremendously since computers and the Internet became mainstream.
With ubiquitous mobile phones and now Internet-enabled devices, the line between the digital and physical worlds is more blurred than it ever was. At the heart of all this is data. Data is at the center of everything around us, be it finance, supply chains, medical science, space exploration, or communication. It is not surprising that we have generated 90% of the world’s data in just the last few years, and this is just the beginning. Rightly, data is being termed the oil of the 21st century. The last couple of chapters
introduced the concepts of Machine Learning and the Python ecosystem to get started. This chapter
introduces the core entity upon which the Machine Learning world relies to show its magic and wonders.
Everything digital has data at its core in some form or the other. Data is generated at various rates by
numerous sources across the globe in numerous formats. Before we dive into the specifics of Machine
Learning, we will spend some time and effort understanding this central entity called data. It is important
that we understand various aspects of it and get equipped with different techniques to handle it based on
requirements.
In this chapter we will cover the journey data takes through a typical Machine Learning related use
case where it goes from its initial raw form to a form where it can be used by Machine Learning algorithms/
models to work upon. We cover various data formats, processing and wrangling techniques to get the
data into a form where it can be utilized by Machine Learning algorithms for analysis. We also learn about
different visualization techniques to better understand the data at hand. Together these techniques will help
us be prepared for the problems to be solved in the coming chapters as well as in real-world scenarios.
Chapter 1 introduced the CRISP-DM methodology. It is one of the standard workflows followed by
Data Science teams across the world. In the coming sections of this chapter, we will concentrate on the
following sub-sections of this methodology:
•	Data collection: To understand different data retrieval mechanisms for different data types
•	Data description: To understand various attributes and properties of the data collected
•	Data wrangling: To prepare data for consumption in the modeling steps
•	Data visualization: To visualize different attributes for sharing results, better understanding, and so on

The code samples, jupyter notebooks, and sample datasets for this chapter are available in the GitHub
repository for this book at https://github.com/dipanjanS/practical-machine-learning-with-python
under the directory/folder for Chapter 3.

© Dipanjan Sarkar, Raghav Bali and Tushar Sharma 2018
D. Sarkar et al., Practical Machine Learning with Python, https://doi.org/10.1007/978-1-4842-3207-1_3


Chapter 3 ■ Processing, Wrangling, and Visualizing Data

Data Collection
Data collection is where it all begins. Though listed as a step that comes post business understanding and
problem definition, data collection often happens in parallel. This is done in order to assist in augmenting
the business understanding process with facts like availability, potential value, and so on before a complete
use case can be formed and worked upon. Of course, data collection takes a formal and better form once the
problem statement is defined and the project gets underway.
Data is at the center of everything around us, which is a tremendous opportunity. Yet this also means that it comes in different formats, shapes, and sizes. Its omnipresence also means that it exists in systems such as legacy machines (say mainframes), the web (say web sites and web applications), databases, flat files, sensors, mobile devices, and so on.
Let’s look at some of the most commonly occurring data formats and ways of collecting such data.

CSV
A CSV data file is one of the most widely available formats of data. It is also one of the oldest formats still
used and preferred by different systems across domains. Comma Separated Values (CSV) are data files that
contain data with each of its attributes delimited by a “,” (a comma). Figure 3-1 depicts a quick snapshot of
how a typical CSV file looks.
The sample CSV shows how data is typically arranged. It contains attributes of different data types
separated/delimited by a comma. A CSV may contain an optional header row (as shown in the example).
CSVs may also optionally enclose each of the attributes in single or double quotes to better demarcate.
Though usually CSVs are used to store tabular data, i.e., data in the form of rows and columns, this is not the
only way.

Figure 3-1. Sample CSV file
CSVs come in different variations and just changing the delimiter to a tab makes one a TSV (or tab separated values) file. The basic idea here is to use a unique symbol to delimit/separate different attributes.
Now that we know how a CSV looks, let’s employ some Python magic to read/extract this data for use.
One of the advantages of using a language like Python is its ability to abstract and handle a whole lot of stuff.
Unlike other languages where specific libraries or a lot of code is required to get basic stuff done, Python
handles it with élan. Along the same lines is reading a CSV file. The simplest way to read a CSV is through the
Python csv module. This module provides an abstraction function called reader().


The reader function takes a file object as input and returns an iterator containing the information read from the csv file. The following code snippet uses the csv.reader() function to read a given file.
csv_reader = csv.reader(open(file_name, 'r', newline=''), delimiter=',')
Once the iterator is returned, we can easily iterate through the contents and get the data in the
form/format required. For the sake of completeness let’s go through an example where we read the contents
of the CSV shown in Figure 3-1 using the csv module. We will then extract each of its attributes and convert
the data into a dict with keys representing them. The following snippet forms the actions.
csv_rows = list()
# initialize the attribute lists up front; appending to missing keys
# would otherwise raise a KeyError
csv_attr_dict = {'sno': [], 'fruit': [], 'color': [], 'price': []}

# read csv (text mode, with newline='' as recommended for the csv module)
csv_reader = csv.reader(open(file_name, 'r', newline=''), delimiter=delimiter)
# iterate and extract data
for row in csv_reader:
    print(row)
    csv_rows.append(row)
# iterate and add data to attribute lists, skipping the header row
for row in csv_rows[1:]:
    csv_attr_dict['sno'].append(row[0])
    csv_attr_dict['fruit'].append(row[1])
    csv_attr_dict['color'].append(row[2])
    csv_attr_dict['price'].append(row[3])
The output is a dict containing each attribute as a key, with its values as an ordered list of values read from the CSV file.
CSV Attributes::
{'color': ['red', 'yellow', 'yellow', 'orange', 'green', 'yellow', 'green'],
'fruit': ['apple', 'banana', 'mango', 'orange', 'kiwi', 'pineapple', 'guava'],
'price': ['110.85', '50.12', '70.29', '80.00', '150.00', '90.00', '20.00'],
'sno': ['1', '2', '3', '4', '5', '6', '7']}
The extraction of data from a CSV and its transformation depend on the use case requirements. The conversion of our sample CSV into a dict of attributes is one way. We may choose a different output format depending on the data and our requirements.
Though the workflow to handle and read a CSV file is pretty straightforward and easy to use, we would
like to standardize and speed up our process. Also, more often than not, it is easier to understand data in
a tabular format. We were introduced to the pandas library in the previous chapter with some amazing
capabilities. Let’s now utilize pandas to read a CSV as well.
The following snippet shows how pandas makes reading and extracting data from a CSV simpler and more consistent compared to the csv module.
df = pd.read_csv(file_name,sep=delimiter)


With a single line and a few optional parameters (as per requirements), pandas extracts data from a
CSV file into a dataframe, which is a tabular representation of the same data. One of the major advantages of
using pandas is the fact that it can handle a lot of different variations in CSV files, such as files with or without
headers, attribute values enclosed in quotes, inferring data types, and many more. Also, the fact that various
machine learning libraries have the capability to directly work on pandas dataframes, makes it virtually a de
facto standard package to handle CSV files.
The previous snippet generates the following output dataframe:
   sno      fruit   color   price
0    1      apple     red  110.85
1    2     banana  yellow   50.12
2    3      mango  yellow   70.29
3    4     orange  orange   80.00
4    5       kiwi   green  150.00
5    6  pineapple  yellow   90.00
6    7      guava   green   20.00

■■Note pandas makes the process of reading CSV files a breeze, yet the csv module comes in handy when
we need more flexibility. For example, not every use case requires data in tabular form or the data might not be
consistently formatted and requires a flexible library like csv to enable custom logic to handle such data.
Along the same lines, data from flat files containing delimiters other than ’,’ (comma) like tabs or
semicolons can be easily handled with these two modules. We will use these utilities while working on
specific use cases in further chapters; until then, you are encouraged to explore and play around with these
for a better understanding.

JSON
JavaScript Object Notation (JSON) is one of the most widely used data interchange formats across the digital
realm. JSON is a lightweight alternative to legacy formats like XML (we shall discuss this format next). JSON
is a text format that is language independent with certain defined conventions. JSON is a human-readable
format that is easy/simple to parse in most programming/scripting languages. A JSON file/object is simply
a collection of name(key)-value pairs. Such key-value pair structures have corresponding data structures
available in programming languages in the form of dictionaries (Python dict), struct, object, record,
keyed lists, and so on. More details are available at http://www.json.org/.
The JSON standard defines the structure, as depicted in Figure 3-2.

Figure 3-2. JSON object structure (reference: http://www.json.org/)


Figure 3-3 is a sample JSON depicting a record of glossary with various attributes of different data types.

Figure 3-3. Sample JSON (reference: http://www.json.org/)
JSONs are widely used to send information across systems. The Python equivalent of a JSON object is the
dict data type, which itself is a key-value pair structure. Python has various JSON related libraries that
provide abstractions and utility functions. The json library is one such option that allows us to handle JSON
files/objects. Let’s first take a look at our sample JSON file and then use this library to bring this data into
Python for use.

Figure 3-4. Sample JSON with nested attributes


The JSON object in Figure 3-4 depicts a fairly nested structure that contains values of string, numeric,
and array type. JSON also supports objects, Booleans, and other data types as values. The following
snippet reads the contents of the file and then utilizes json.loads() utility to parse and convert it into a
standard Python dict.
json_filedata = open(file_name).read()
json_data = json.loads(json_filedata)
json_data is a Python dict with keys and values of the JSON file parsed and type casted as Python data
types. The json library also provides utilities to write back Python dictionaries as JSON files with capabilities
of error checking and typecasting. The output of the previous operation is as follows.
outer_col_1 :
        nested_inner_col_1 : val_1
        nested_inner_col_2 : 2
        nested_inner_col_1 : val_2
        nested_inner_col_2 : 2
outer_col_2 :
        inner_col_1 : 3
outer_col_3 : 4

Before we move on to our next format, it is worth noting that pandas also provides utilities to parse
JSONs. The pandas read_json() is a very powerful utility that provides multiple options to handle JSONs
created in different styles. Figure 3-5 depicts a sample JSON representing multiple data points, each with two
attributes listed as col_1 and col_2.


Figure 3-5. Sample JSON depicting records with similar attributes
We can easily parse such a JSON using pandas by setting the orientation parameter to “records”, as
shown here.
df = pd.read_json(file_name,orient="records")
The output is a tabular dataframe with each data point represented by two attribute values as follows.
  col_1 col_2
0     a     b
1     c     d
2     e     f
3     g     h
4     i     j
5     k     l


You are encouraged to read more about pandas read_json() at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html.

XML
Having covered two of the most widely used data formats, let’s now take a look at XML. XML is quite a dated format, yet it is still used by many systems. XML, or eXtensible Markup Language, is a markup language
that defines rules for encoding data/documents to be shared across the Internet. Like JSON, XML is also a
text format that is human readable. Its design goals involved strong support for various human languages
(via Unicode), platform independence, and simplicity. XMLs are widely used for representing data of varied
shapes and sizes.
XMLs are widely used as configuration formats by different systems, metadata, and data representation
format for services like RSS, SOAP, and many more.
XML is a language with syntactic rules and schemas defined and refined over the years. The most important components of an XML are as follows:
•	Tag: A markup construct denoted by strings enclosed with angled braces (“<” and “>”).
•	Content: Any data not marked within the tag syntax is the content of the XML file/object.
•	Element: A logical construct of an XML. An element may be defined with a start and an end tag with or without attributes, or it may be simply an empty tag.
•	Attribute: Key-value pairs that represent the properties or attributes of the element in consideration. These are enclosed within a start or an empty tag.

Figure 3-6 is a sample XML depicting various components of the eXtensible Markup Language. More details on key concepts can be found at https://www.w3schools.com/xml/.

Figure 3-6. Sample XML annotated with key components
XMLs can be viewed as tree structures, starting with one root element that branches off into various
elements, each with their own attributes and further branches, the content being at leaf nodes.


Most XML parsers use this tree-like structure to read XML content. The following are the two major
types of XML parsers:
•	DOM parser: The Document Object Model parser is the closest form of tree representation of an XML. It parses the XML and generates the tree structure. One big disadvantage with DOM parsers is their instability with huge XML files.
•	SAX parser: The Simple API for XML (or SAX for short) is a variant widely used on the web. This is an event-based parser that parses an XML element by element and provides hooks to trigger events based on tags. This overcomes the memory-based restrictions of DOM but lacks overall representation power.

There are multiple variants available that derive from these two types. To begin with, let’s take a look at the
ElementTree parser available from Python’s xml library. The ElementTree parser is an optimization over the
DOM parser and it utilizes Python data structures like lists and dicts to handle data in a concise manner.
The following snippet uses the ElementTree parser to load and parse the sample XML file we saw
previously. The parse() function returns a tree object, which has various attributes, iterators, and utilities to
extract root and further components of the parsed XML.
tree = ET.parse(file_name)
root = tree.getroot()
print("Root tag:{0}".format(root.tag))
print("Attributes of Root:: {0}".format(root.attrib))
The two print statements provide us with values related to the root tag and its attributes (if there are
any). The root object also has an iterator attached to it which can be used to extract information related to all
child nodes. The following snippet iterates the root object to print the contents of child nodes.
indent_level = 1
for child in root:
    print("{0}tag:{1}, attribute:{2}".format("\t"*indent_level,
                                             child.tag,
                                             child.attrib))
    print("{0}tag data:{1}".format("\t"*indent_level,
                                   child.text))
The final output generated by parsing the XML using ElementTree is as follows. We used a custom print
utility to make the output more readable, the code for which is available on the repository.
Root tag:records
Attributes of Root:: {'attr': 'sample xml records'}
tag:record, attribute:{'name': 'rec_1'}
tag data:
        tag:sub_element, attribute:{}
        tag data:
                tag:detail1, attribute:{}
                tag data:Attribute 1
                tag:detail2, attribute:{}
                tag data:2


        tag:sub_element_with_attr, attribute:{'attr': 'complex'}
        tag data:
            Sub_Element_Text
        tag:sub_element_only_attr, attribute:{'attr_val': 'only_attr'}
        tag data:None
tag:record, attribute:{'name': 'rec_2'}
tag data:
        tag:sub_element, attribute:{}
        tag data:
                tag:detail1, attribute:{}
                tag data:Attribute 1
                tag:detail2, attribute:{}
                tag data:2
        tag:sub_element_with_attr, attribute:{'attr': 'complex'}
        tag data:
            Sub_Element_Text
        tag:sub_element_only_attr, attribute:{'attr_val': 'only_attr'}
        tag data:None
The xml library provides very useful utilities exposed through the ElementTree parser, yet it lacks a lot of firepower. Another Python library, xmltodict, provides similar capabilities but uses Python’s native data structures like dicts to provide a more Pythonic way to handle XMLs. The following is a quick snippet
to parse the same XML. Unlike ElementTree, the parse() function of xmltodict reads a file object and
converts the contents into nested dictionaries.
xml_filedata = open(file_name).read()
ordered_dict = xmltodict.parse(xml_filedata)
The output generated is similar to the one generated using ElementTree, with the exception that xmltodict automatically prefixes attribute keys with the @ symbol and marks element text with #text. The following is the sample output.
records :
        @attr : sample xml records
        record :
                @name : rec_1
                sub_element :
                        detail1 : Attribute 1
                        detail2 : 2
                sub_element_with_attr :
                        @attr : complex
                        #text : Sub_Element_Text
                sub_element_only_attr :
                        @attr_val : only_attr


HTML and Scraping
We began the chapter talking about the immense amount of information/data being generated at breakneck speeds. The Internet or the web is one of the driving forces for this revolution coupled with immense
reach due to computers, smartphones and tablets.
The Internet is a huge interconnected web of information connected through hyperlinks. A large
amount of data on the Internet is in the form of web pages. These web pages are generated, updated,
and consumed millions of times day in and day out. With information residing in these web pages, it is
imperative that we learn how to interact with and extract this information/data as well.
So far we have dealt with formats like CSV, JSON, and XML, which can be made available/extracted
through various methods like manual downloads, APIs, and so on. With web pages, the methods change.
In this section we will discuss the HTML format (the most common web page related format) and
web-scraping techniques.

HTML
The Hyper Text Markup Language (HTML) is a markup language similar to XML. HTML is mainly used by
web browsers and similar applications to render web pages for consumption.
HTML defines rules and structure to describe web pages using markup. The following are standard
components of an HTML page:
•	Element: Logical constructs that form the basic building blocks of an HTML page.

•	Tags: A markup construct defined by angled braces (< and >). Some of the important tags are:

	•	<html></html>: This pair of tags contains the whole HTML document. It marks the start and end of the HTML page.

	•	<body></body>: This pair of tags contains the main content of the HTML page rendered by the browser.

There are many more standard tags defined in the HTML standard; further information is
available at https://www.w3schools.com/html/html_intro.asp.
The following is a snippet to generate an HTML page that's rendered by a web browser, as shown in the
screenshot in Figure 3-7.

<html>
<head>
<title>Sample HTML Page</title>
</head>
<body>
<h1>Sample WebPage</h1>
<p>HTML has been rendered</p>
</body>
</html>
Figure 3-7. Sample HTML page as rendered in browser

Browsers use markup tags to understand special instructions like text formatting, positioning, hyperlinks, and so on, but render only the content for the end user to see. For use cases where data/information resides in HTML pages, we need special techniques to extract this content.

Web Scraping

Web scraping is a technique to scrape or extract data from the web, particularly from web pages. Web scraping may involve manually copying the data or using automation to crawl, parse, and extract information from web pages. In most contexts, web scraping refers to automatically crawling a particular web site or a portion of the web to extract and parse information that can later be used for analytics or other use cases. A typical web scraping flow can be summarized as follows:

•	Crawl: A bot or a web crawler is designed to query a web server using the required set of URLs to fetch the web pages. A crawler may employ sophisticated techniques to fetch information from pages linked from the URLs in question and even parse information to a certain extent. Web sites maintain a file called robots.txt to employ what is called the "Robots Exclusion Protocol" to restrict/provide access to their content. More details are available at http://www.robotstxt.org/robotstxt.html.

•	Scrape: Once the raw web page has been fetched, the next task is to extract information from it. The task of scraping involves utilizing techniques like regular expressions, extraction based on XPath or specific tags, and so on to narrow down to the required information on the page.

Web scraping involves creativity from the point of view of narrowing down to the exact piece of information required. With web sites changing constantly, web pages becoming dynamic (see asp, jsp, etc.), and the presence of access controls (username/password, CAPTCHA, and so on), the task is complicated even more.
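The robots.txt check mentioned above can be automated with Python's standard library; the following is a minimal sketch, with a made-up robots.txt for illustration:

```python
from urllib import robotparser

# Hypothetical robots.txt content for illustration
robots_txt = """User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# check whether a given URL may be fetched by any crawler
print(rp.can_fetch("*", "http://example.com/blog/post"))      # True
print(rp.can_fetch("*", "http://example.com/private/page"))   # False
```

In practice, set_url() and read() would fetch the live robots.txt from the target site instead of parsing an inline string.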
Python is a very powerful programming language, which should be evident by now, and scraping the web is another task for which it provides multiple utilities. Let's begin with extracting a blog post's text from the Apress blog to better understand web scraping. The first task is to identify the URL we are interested in. For our current example, we concentrate on the first blog post of the day on the Apress web site's blog page at http://www.apress.com/in/blog/all-blog-posts. Clicking on the topmost blog post takes us to the main article in consideration. The article is shown in the screen in Figure 3-8.

Figure 3-8. A blog post on Apress.com

Now that we have the required page and its URL, we will use the requests library to query the required URL and get a response. The following snippet does the same.

base_url = "http://www.apress.com/in/blog/all-blog-posts"
blog_suffix = "/wannacry-how-to-prepare/12302194"

response = requests.get(base_url+blog_suffix)

If the get request is successful, the response object's status_code attribute contains a value of 200 (the HTTP success code). Upon getting a successful response, the next task is to devise a method to extract the required information. Since in this case we are interested in the blog post's actual content, let's analyze the HTML behind the page and see if we can find specific tags of interest.

■■Note Most modern browsers come with HTML inspection tools built in. If you are using Google Chrome, press F12 or right-click on the page and select Inspect or View Source. This opens the HTML code for you to analyze.

Figure 3-9 depicts a snapshot of the HTML behind the blog post we are interested in.

Figure 3-9. Inspecting the HTML content of a blog post on Apress.com

Upon careful inspection, we can clearly see that the text of the blog post is contained within the div tag <div class="cms-richtext">. Now that we have narrowed down to the tag of interest, we use Python's regular expression library re to search and extract data contained within these tags only. The following snippet utilizes re.compile() to compile a regular expression and then uses re.findall() to extract the information from the fetched response (here, content holds the response's HTML text).

content_pattern = re.compile(r'<div class="cms-richtext">(.*?)</div>')
result = re.findall(content_pattern, content)

The output of the find operation is the required text from the blog post. Of course, it still contains HTML tags interlaced between the actual text. We can perform further cleanup to reach the required levels, yet this is a good start. The following is a snapshot of the information extracted using regular expressions.

Out[59]: '<p>By Mike Halsey</p><p>It was a perfectly ordinary Friday when the Wannacry ransomware struck in May 2017. The malware spread around the world to more than 150 countries in just a matter of a few hours, affecting the National Health Service in the UK, telecoms provider Telefonica in Spain, and many other organisations and businesses in the USA, Canada, China, Japan, Russia, and right across Europe, the Middle-East, and Asia.</p><p>The malware was reported to have been stolen in an attack on the US National Security Agency (NSA), though the NSA denied this, and exploited vulnerabilities in the Microsoft Windows operating system. Microsoft had been aware of the vulnerabilities since early in the year, and had patched them back in March.'

This was a straightforward and very basic approach to get the required data. What if we want to go a step further and extract information related to all blog posts on the page and perform a better cleanup? For such a task, we utilize the BeautifulSoup library. BeautifulSoup is the go-to standard library for web scraping and related tasks. It provides some amazing functionality to ease the scraping process. For the task at hand, our process would be to first crawl the index page and extract the URLs of all the blog post links listed on the page. For this we would use the requests.get() function to extract the content and then utilize BeautifulSoup's utilities to get the content from the URLs. The following snippet showcases the function get_post_mapping(), which parses the home page content to extract the blog post headings and corresponding URLs into a dictionary. The function finally returns a list of such dictionaries.

def get_post_mapping(content):
    """This function extracts blog post title and url from response object

    Args:
        content (request.content): String content returned from requests.get

    Returns:
        list: a list of dictionaries with keys title and url
    """
    post_detail_list = []
    post_soup = BeautifulSoup(content,"lxml")
    h3_content = post_soup.find_all("h3")

    for h3 in h3_content:
        post_detail_list.append(
            {'title':h3.a.get_text(),'url':h3.a.attrs.get('href')}
            )

    return post_detail_list

The previous function first creates an object of BeautifulSoup, specifying lxml as its parser.
It then uses find_all() on the h3 tag to extract the required list of tags (we got to the h3 tag by the same inspect-element approach we utilized previously). The next task is to simply iterate through the list of h3 tags and utilize the get_text() utility function from BeautifulSoup to get the blog post heading and its corresponding URL. The list returned from the function is as follows.

[{'title': u"Wannacry: Why It's Only the Beginning, and How to Prepare for What Comes Next",
  'url': '/in/blog/all-blog-posts/wannacry-how-to-prepare/12302194'},
 {'title': u'Reusing ngrx/effects in Angular (communicating between reducers)',
  'url': '/in/blog/all-blog-posts/reusing-ngrx-effects-in-angular/12279358'},
 {'title': u'Interview with Tony Smith - Author and SharePoint Expert',
  'url': '/in/blog/all-blog-posts/interview-with-tony-smith-author-and-sharepoint-expert/12271238'},
 {'title': u'Making Sense of Sensors \u2013 Types and Levels of Recognition',
  'url': '/in/blog/all-blog-posts/making-sense-of-sensors/12253808'},
 {'title': u'VS 2017, .NET Core, and JavaScript Frameworks, Oh My!',
  'url': '/in/blog/all-blog-posts/vs-2017-net-core-and-javascript-frameworks-oh-my/12261706'}]

Now that we have the list, the final step is to iterate through this list of URLs and extract each blog post's text. The following function showcases how BeautifulSoup simplifies the task as compared to our previous method of using regular expressions. The method of identifying the required tag remains the same, though we utilize the power of this library to get text that is free from all HTML tags.
def get_post_content(content):
    """This function extracts blog post content from response object

    Args:
        content (request.content): String content returned from requests.get

    Returns:
        str: blog's content in plain text
    """
    plain_text = ""
    text_soup = BeautifulSoup(content,"lxml")
    para_list = text_soup.find_all("div",
                                   {'class':'cms-richtext'})

    for p in para_list[0]:
        plain_text += p.getText()

    return plain_text

The following output is the content from one of the posts. Pay attention to the cleaner text in this case as compared to our previous approach.

By Mike HalseyIt was a perfectly ordinary Friday when the Wannacry ransomware struck in May 2017. The malware spread around the world to more than 150 countries in just a matter of a few hours, affecting the National Health Service in the UK, telecoms provider Telefonica in Spain, and many other organisations and businesses in the USA, Canada, China, Japan, Russia, and right across Europe, the Middle-East, and Asia.The malware was reported to have been stolen in an attack on the US National Security Agency (NSA), though the NSA denied this, and exploited vulnerabilities in the Microsoft Windows operating system. Microsoft had been aware of the vulnerabilities since early in the year, and had patched them back in March.

Through these two methods, we crawled and extracted information related to blog posts from our web site of interest. You are encouraged to experiment with other utilities from BeautifulSoup along with other web sites for a better understanding. Of course, do read the robots.txt and honor the rules set by the webmaster.

SQL

Databases date back to the 1970s and represent a large volume of data stored in relational form. Data available in the form of tables in databases, or to be more specific, relational databases, comprises another format of structured data that we encounter when working on different use cases.
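As a minimal, self-contained sketch of querying relational data from Python, the standard library's sqlite3 module lets us create, populate, and query an in-memory table (the table and values here are made up for illustration):

```python
import sqlite3

# throwaway in-memory database for the sketch
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (product_id INTEGER, price REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [(1, 9.99), (2, 19.99), (3, 4.50)])

# plain SQL query returning Python tuples
rows = conn.execute(
    "SELECT product_id, price FROM transactions WHERE price > 10").fetchall()
print(rows)  # [(2, 19.99)]
conn.close()
```

The same query could be run against a file-backed database by passing a path instead of ":memory:".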
Over the years, there have been various flavors of databases available, most of them conforming to the SQL standard. The Python ecosystem handles data from databases in two major ways. The first and most common way, used while working on data science and related use cases, is to access data using SQL queries directly. To access data using SQL queries, powerful libraries like sqlalchemy and pyodbc provide convenient interfaces to connect, extract, and manipulate data from a variety of relational databases like MS SQL Server, MySQL, Oracle, and so on. The sqlite3 library provides a lightweight, easy-to-use interface to work with SQLite databases, though the same can be handled by the other two libraries as well.

The second way of interacting with databases is the ORM or Object Relational Mapper method. This method is synonymous with the object oriented model of data, i.e., relational data is mapped in terms of objects and classes. sqlalchemy provides a high-level interface to interact with databases in the ORM fashion. We will explore more on these based on the use cases in the subsequent chapters.

Data Description

In the previous section, we discussed various data formats and ways of extracting information from them. Each of the data formats comprises data points with attributes of diverse types. These data types in their raw forms form the basis of input features utilized by Machine Learning algorithms and other tasks in the overall Data Science workflow. In this section, we touch upon the major data types we deal with while working on different use cases.

Numeric

This is the simplest of the data types available. It is also the type that is directly usable and understood by most algorithms (though this does not imply that we use numeric data in its raw form).
Numeric data represents scalar information about entities being observed, for instance, number of visits to a web site, price of a product, weight of a person, and so on. Numeric values also form the basis of vector features, where each dimension is represented by a scalar value. The scale, range, and distribution of numeric data have an implicit effect on the algorithm and/or the overall workflow. For handling numeric data, we use techniques such as normalization, binning, quantization, and many more to transform numeric data as per our requirements.

Text

Data comprising unstructured, alphanumeric content is one of the most common data types. Textual data, when representing human language content, contains implicit grammatical structure and meaning. This type of data requires additional care and effort for transformation and understanding. We cover aspects of transforming and using textual data in the coming chapters.

Categorical

This data type stands in between numeric and text. Categorical variables refer to categories of entities being observed. For instance, hair color being black, brown, blonde, and red, or economic status as low, medium, or high. The values may be represented as numeric or alphanumeric, which describe properties of items in consideration. Based on certain characteristics, categorical variables can be seen as:

•	Nominal: These define only the category of the data point, without any ordering possible. For instance, hair color can be black, brown, blonde, etc., but there cannot be any order to these categories.

•	Ordinal: These define a category but can also be ordered based on rules of the context. For example, people categorized by economic status of low, medium, or high can be clearly ordered/sorted in that order.

It is important to note that standard mathematical operations like addition, subtraction, multiplication, etc. do not carry meaning for categorical variables, even though they may be allowed syntactically (when categorical variables are represented as numbers). Thus it is important to handle categorical variables with care, and we will see a couple of ways of handling categorical data in the coming sections.

Different data types form the basis of features that are ingested by algorithms for analysis of the data at hand. In the coming sections and chapters, especially Chapter 4: Feature Engineering and Selection, you will learn more about how to work with specific data types.

Data Wrangling

So far in this chapter we discussed data formats and data types and learned about ways of collecting data from different sources. Now that we have an understanding of the initial process of collecting and understanding data, the next logical step is to be able to use it for analysis using various Machine Learning algorithms based upon the use case at hand. But before we reach the stage where this "raw" data is anywhere close to being usable for the algorithms or visualizations, we need to polish and shape it up. Data wrangling or data munging is the process of cleaning, transforming, and mapping data from one form to another to utilize it for tasks such as analytics, summarization, reporting, visualization, and so on.

Understanding Data

Data wrangling is one of the most important and involved steps in the whole Data Science workflow. The output of this process directly impacts all downstream steps such as exploration, summarization, visualization, analysis, and even the final result. This clearly shows why Data Scientists spend a lot of time in Data Collection and Wrangling. There are many surveys which bring out the fact that, more often than not, Data Scientists end up spending 80% of their time in data processing and wrangling!
So before we get started with actual use cases and algorithms in the coming chapters, it is imperative that we understand and learn how to wrangle our data and transform it into a usable form. To begin with, let's first describe the dataset at hand. For the sake of simplicity, we prepared a sample dataset describing product purchase transactions by certain users. Since we already discussed ways of collecting/extracting data, we will skip that step for this section. Figure 3-10 shows a snapshot of the dataset.

Figure 3-10. Sample dataset

■■Note The dataset in consideration has been generated using standard Python libraries like random, datetime, numpy, pandas, and so on. This dataset has been generated using a utility function called generate_sample_data() available in the code repository for this book. The data has been randomly generated and is for representational purposes only.

The dataset describes transactions having the following attributes/features/properties:

•	Date: The date of the transaction
•	Price: The price of the product purchased
•	Product ID: Product identification number
•	Quantity Purchased: The quantity of product purchased in this transaction
•	Serial No: The transaction serial number
•	User ID: Identification number for the user performing the transaction
•	User Type: The type of user

Let's now begin our wrangling/munging process and understand various methods/tricks to clean, transform, and map our dataset to bring it into a usable form. The first and foremost step usually is to get a quick peek into the number of records/rows, the number of columns/attributes, column/attribute names, and their data types. For the majority of this section and subsequent ones, we will be relying on pandas and its utilities to perform the required tasks. The following snippet provides the details on row counts, attribute counts, and details.
print("Number of rows::",df.shape[0])
print("Number of columns::",df.shape[1])
print("Column Names::",df.columns.values.tolist())
print("Column Data Types::\n",df.dtypes)

The required information is available straight from the pandas dataframe itself. The shape attribute is a two-value tuple representing the row count and column count, respectively. The column names are available through the columns attribute, while the dtypes attribute provides us with the data type of each of the columns in the dataset. The following is the output generated by this snippet.

Number of rows:: 1001
Number of columns:: 7
Column Names:: ['Date', 'Price', 'Product ID', 'Quantity Purchased', 'Serial No', 'User ID', 'User Type']
Column Data Types::
Date                   object
Price                 float64
Product ID              int32
Quantity Purchased      int32
Serial No               int32
User ID                 int32
User Type              object
dtype: object

The column names are clearly listed and have been explained previously. Upon inspecting the data types, we can clearly see that the Date attribute is represented as an object. Before we move on to transformations and cleanup, let's dig in further and collect more information to understand and prepare a strategy of required tasks for dataset wrangling. The following snippet helps get information related to attributes/columns containing missing values, the count of rows, and indices that have missing values in them.

print("Columns with Missing Values::",df.columns[df.isnull().any()].tolist())
print("Number of rows with Missing Values::",len(pd.isnull(df).any(1).nonzero()[0].tolist()))
print("Sample Indices with missing data::",pd.isnull(df).any(1).nonzero()[0].tolist()[0:5])

With pandas, subscripting works with both rows and columns (see Chapter 2 for details). We use isnull() to identify columns containing missing values.
The utilities any() and nonzero() provide nice abstractions to identify any row/column conforming to a condition (in this case, pointing to rows/columns having missing values). The output is as follows.

Columns with Missing Values:: ['Date', 'Price', 'User Type']
Number of rows with Missing Values:: 61
Sample Indices with missing data:: [0L, 1L, 6L, 7L, 10L]

Let's also do a quick fact check to get details on non-null rows for each column and the amount of memory consumed by this dataframe. We also get some basic summary statistics like min, max, and so on; these will be useful in coming tasks. For the first task, we use the info() utility, while the summary statistics are provided by the describe() function. The following snippet does this.

print("General Stats::")
print(df.info())

print("Summary Stats::")
print(df.describe())

The following is the output generated using the info() and describe() utilities. It shows that Date and Price both have about 970 non-null rows, while the dataset consumes close to 40KB of memory. The summary stats are self-explanatory and drop the non-numeric columns like Date and User Type from the output.
General Stats::
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 7 columns):
Date                  970 non-null object
Price                 970 non-null float64
Product ID            1001 non-null int32
Quantity Purchased    1001 non-null int32
Serial No             1001 non-null int32
User ID               1001 non-null int32
User Type             1000 non-null object
dtypes: float64(1), int32(4), object(2)
memory usage: 39.2+ KB
None

Summary Stats::
             Price   Product ID  Quantity Purchased    Serial No      User ID
count   970.000000  1001.000000         1001.000000  1001.000000  1001.000000
mean   1523.906402   600.236763           20.020979  1452.528472  5335.669331
std    1130.331869   308.072110           11.911782   386.505376   994.777199
min       2.830000     0.000000            0.000000    -1.000000  -101.000000
25%     651.622500   342.000000           10.000000  1223.000000  5236.000000
50%    1330.925000   635.000000           20.000000  1480.000000  5496.000000
75%    2203.897500   875.000000           30.000000  1745.000000  5726.000000
max    5840.370000  1099.000000           41.000000  2000.000000  6001.000000

Filtering Data

We have completed our first pass of the dataset at hand and understood what it has and what is missing. The next stage is about cleanup. Cleaning a dataset involves tasks such as removing/handling incorrect or missing data, handling outliers, and so on. Cleaning also involves standardizing attribute column names to make them more readable, intuitive, and conforming to certain standards for everyone involved to understand. To perform this task, we write a small function and utilize the rename() utility of pandas to complete this step. The rename() function takes a dict with keys representing the old column names while values point to newer ones.
We can also decide to modify the existing dataframe or generate a new one by setting the inplace flag appropriately. The following snippet showcases this function.

def cleanup_column_names(df,rename_dict={},do_inplace=True):
    """This function renames columns of a pandas dataframe
       It converts column names to snake case if rename_dict is not passed.

    Args:
        rename_dict (dict): keys represent old column names and values point to
                            newer ones
        do_inplace (bool): flag to update existing dataframe or return a new one

    Returns:
        pandas dataframe if do_inplace is set to False, None otherwise
    """
    if not rename_dict:
        return df.rename(columns={col: col.lower().replace(' ','_')
                    for col in df.columns.values.tolist()},
                  inplace=do_inplace)
    else:
        return df.rename(columns=rename_dict,inplace=do_inplace)

Upon using this function on our dataframe in consideration, the output in Figure 3-11 is generated. Since we do not pass any dict with old and new column names, the function updates all columns to snake case.

Figure 3-11. Dataset with columns renamed

For different algorithms, analysis, and even visualizations, we often require only a subset of attributes to work with. With pandas, we can vertically slice (select a subset of columns) in a variety of ways. pandas provides different ways to suit different scenarios, as we shall see in the following snippet.

print("Using Column Index::")
print(df[[3]].values[:, 0])

print("Using Column Name::")
print(df.quantity_purchased.values)

print("Using Column Data Type::")
print(df.select_dtypes(include=['float64']).values[:,0])

In this snippet, we have performed attribute selection in three different ways. The first method utilizes the column index number to get the required information.
In this case, we wanted to work with only the field quantity_purchased, hence index number 3 (pandas columns are 0-indexed). The second method also extracts data for the same attribute by directly referring to the column name in dot notation. While the first method is very handy when working in loops, the second one is more readable and blends well when we are utilizing the object oriented nature of Python. Yet there are times when we would need to get attributes based on their data types alone. The third method makes use of the select_dtypes() utility to get this job done. It provides ways of both including and excluding columns based on data types alone. In this example, we selected the column(s) with data type float (the price column in our dataset). The output from this snippet is as follows.

Using Column Index::
[13  1  2 ...,  2 30 17]
Using Column Name::
[13  1  2 ...,  2 30 17]
Using Column Data Type::
[ 3021.06  1822.62   542.36 ...,  1768.66  1848.5   1712.22]

Selecting specific attributes/columns is one way of subsetting a dataframe. There may be requirements for horizontally splitting a dataframe as well. To work with a subset of rows, pandas provides the ways outlined in the following snippet.

print("Select Specific row indices::")
print(df.iloc[[10,501,20]])

print("Excluding Specific Row indices::")
print(df.drop([0,24,51], axis=0).head())

print("Subsetting based on logical condition(s)::")
print(df[df.quantity_purchased>25].head())

print("Subsetting based on offset from top (bottom)::")
print(df[100:].head())  # df.tail(-100)

The first method utilizes iloc (integer index/location) based selection; we need to specify a list of the indices we need from the dataframe. The second method allows removing/filtering out specific row indices from the dataframe itself. This comes in handy in scenarios where rows not satisfying certain criteria need to be filtered out.
The third method showcases conditional logic based filtering of rows. The final method filters based on an offset from the top of the dataframe. A similar method, called tail(), can be used to offset from the bottom as well. The output generated is depicted in Figure 3-12.

Figure 3-12. Different ways of subsetting rows

Typecasting

Typecasting, or converting data into appropriate data types, is an important part of cleanup and wrangling in general. Often data gets converted into the wrong data types while being extracted or converted from one form to another. Also, different platforms and systems handle each data type differently, and thus getting the right data type is important. While starting the wrangling discussion, we checked the data types of all the columns of our dataset. If you remember, the date column was marked as an object. Though it may not be an issue if we are not going to work with dates, in cases where we need dates and related attributes, having them as objects/strings can pose problems. Moreover, it is difficult to handle date operations if they are available as strings. To fix our dataframe, we use the to_datetime() function from pandas. This is a very flexible utility that allows us to set different attributes like date time formats, timezone, and so on. Since in our case the values are just dates, we use the function as follows with defaults.

df['date'] = pd.to_datetime(df.date)
print(df.dtypes)

Similarly, we can convert numeric columns marked as strings using to_numeric(), along with direct Python-style typecasting as well. Upon checking the data types now, we clearly see the date column in the correct data type of datetime64.
date                  datetime64[ns]
price                        float64
product_id                     int32
quantity_purchased             int32
serial_no                      int32
user_id                        int32
user_type                     object
dtype: object

Transformations

Another common task with data wrangling is to transform existing columns or derive new attributes based on requirements of the use case or the data itself. To derive or transform columns, pandas provides three different utilities—apply(), applymap(), and map(). The apply() function is used to perform actions on the whole object, depending upon the axis (default is on all rows). The applymap() and map() functions work element-wise, with map() coming from the pandas.Series hierarchy. As an example to understand these three utilities, let's derive some new attributes. First, let's expand the user_type attribute using the map() function. We write a small function to map each of the distinct user_type codes into their corresponding user classes as follows.

def expand_user_type(u_type):
    if u_type in ['a','b']:
        return 'new'
    elif u_type == 'c':
        return 'existing'
    elif u_type == 'd':
        return 'loyal_existing'
    else:
        return 'error'

df['user_class'] = df['user_type'].map(expand_user_type)
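The map() step above can be exercised on a small toy column (the sample values are made up for illustration; assumes pandas is installed):

```python
import pandas as pd

def expand_user_type(u_type):
    if u_type in ['a', 'b']:
        return 'new'
    elif u_type == 'c':
        return 'existing'
    elif u_type == 'd':
        return 'loyal_existing'
    else:
        return 'error'

# toy column covering each code plus an unknown value
df = pd.DataFrame({'user_type': ['a', 'b', 'c', 'd', 'x']})
df['user_class'] = df['user_type'].map(expand_user_type)
print(df['user_class'].tolist())
# ['new', 'new', 'existing', 'loyal_existing', 'error']
```

The same mapping could also be expressed as a dict passed directly to map().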
df['purchase_week'] = df[['date']].applymap(lambda dt: dt.week
                                            if not pd.isnull(dt.week)
                                            else 0)

Figure 3-13 depicts our dataframe with two additional attributes—user_class and purchase_week.

Figure 3-13. Dataframe with derived attributes using map and applymap

Let's now use the apply() function to perform an action on the whole dataframe object itself. The following snippet uses apply() to get the range (maximum value minus minimum value) of all numeric attributes. We use the previously discussed select_dtypes along with a lambda function to complete the task.

df.select_dtypes(include=[np.number]).apply(lambda x: x.max() - x.min())

The output is a reduced pandas.Series object showcasing the range values for each of the numeric columns.

price                 5837.54
product_id            1099.00
quantity_purchased      41.00
serial_no             2001.00
user_id               6102.00
purchase_week           53.00

Imputing Missing Values

Missing values can lead to all sorts of problems when dealing with Machine Learning and Data Science related use cases. Not only can they cause problems for algorithms, they can mess up calculations and even final outcomes. Missing values also pose the risk of being interpreted in non-standard ways, leading to confusion and more errors. Hence, imputing missing values carries a lot of weight in the overall data wrangling process. One of the easiest ways of handling missing values is to ignore or remove them altogether from the dataset. When the dataset is fairly large and we have enough samples of the various types required, this option can be safely exercised. We use the dropna() function from pandas in the following snippet to remove rows of data where the date of the transaction is missing.
print("Drop Rows with missing dates::")
df_dropped = df.dropna(subset=['date'])
print("Shape::", df_dropped.shape)

The result is a dataframe whose rows have no missing dates. The output dataframe is depicted in Figure 3-14.

Figure 3-14. Dataframe without any missing date information

Often, dropping rows is a very expensive and unfeasible option. In many scenarios, missing values are imputed with the help of other values in the dataframe. One commonly used trick is to replace missing values with a central tendency measure like the mean or median; one may also choose other, more sophisticated measures/statistics. In our dataset, the price column seems to have some missing data. We utilize the fillna() method from pandas to fill these values with the mean price value from our dataframe. Along the same lines, we use forward fill (ffill) and backward fill (bfill) to impute missing values for the user_type attribute. Since user_type is a string type attribute, we use a proximity based solution to handle its missing values: forward fill copies the value from the previous row, while backward fill copies the value from the next row. The following snippet showcases the three approaches.

print("Fill Missing Price values with mean price::")
df_dropped['price'].fillna(value=np.round(df.price.mean(), decimals=2),
                           inplace=True)

print("Fill Missing user_type values with value from \
previous row (forward fill)::")
df_dropped['user_type'].fillna(method='ffill', inplace=True)

print("Fill Missing user_type values with value from \
next row (backward fill)::")
df_dropped['user_type'].fillna(method='bfill', inplace=True)

Apart from these approaches, there are conditions where a record is not of much use if more than a certain threshold of its attribute values is missing.
For instance, if a transaction in our dataset has fewer than three non-null attributes, it might be almost unusable. In such a scenario, it may be advisable to drop that data point itself. We can filter out such data points using dropna() with the parameter thresh set to the threshold of non-null attributes. More details are available on the official documentation page.

Handling Duplicates

Another issue with many datasets is the presence of duplicates. While data is important, and usually the more the merrier, duplicates do not add much value per se. Moreover, duplicates can help us identify potential areas of error in the recording/collection of the data itself. To identify duplicates, we have a utility called duplicated() that can be applied to the whole dataframe as well as to a subset of it. We may handle duplicates by fixing the errors flagged by duplicated(), or we may choose to drop the duplicate data points altogether. To drop duplicates, we use the drop_duplicates() method. The following snippet showcases both functions.

df_dropped[df_dropped.duplicated(subset=['serial_no'])]
df_dropped.drop_duplicates(subset=['serial_no'], inplace=True)

The output of identifying a subset of a dataframe having duplicate values for the field serial_no is depicted in Figure 3-15. The second line in the snippet simply drops those duplicates.

Figure 3-15. Dataframe with duplicate serial_no values

Handling Categorical Data

As discussed in the section "Data Description," categorical attributes consist of data that can take a limited number of values (though not always). Here, the attribute user_type is a categorical variable that can take values only from the allowed set {a, b, c, d}. The algorithms that we will be learning and utilizing in the coming chapters mostly work with numerical data, so categorical variables may pose some issues.
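Before moving on to categorical data, here is a minimal, self-contained sketch of the thresh-based dropping and duplicate handling just discussed. The column names mirror our dataset, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy transactions (hypothetical values, not the book's dataset)
toy = pd.DataFrame({
    'serial_no': [1, 1, 2, 3],
    'price': [10.0, 10.0, np.nan, 12.5],
    'user_type': ['a', 'a', None, 'c'],
})

# Keep only rows having at least 2 non-null attributes;
# the row with serial_no 2 has just one and gets dropped
enough_data = toy.dropna(thresh=2)

# Flag rows whose serial_no repeats, then drop the repeats
dupes = toy[toy.duplicated(subset=['serial_no'])]
deduped = toy.drop_duplicates(subset=['serial_no'])
```

By default drop_duplicates() keeps the first occurrence of each serial_no; pass keep='last' or keep=False to change that behavior.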
With pandas, we can handle categorical variables in a couple of different ways. The first is to use the map() function, where we simply map each value from the allowed set to a numeric value. Though this may be useful, this approach should be handled with care. For instance, statistical operations like addition, mean, and so on, though syntactically valid on the mapped values, should be avoided for obvious reasons (more on this in coming chapters). The second method is to convert the categorical variable into indicator variables using the get_dummies() function. This function is simply a wrapper that generates a one hot encoding for the variable in consideration. One hot encoding and other encodings can also be handled using libraries like sklearn (we will see more examples in coming chapters). The following snippet showcases both methods, using map() and get_dummies().

# using map to dummy encode
type_map = {'a': 0, 'b': 1, 'c': 2, 'd': 3, np.nan: -1}
df['encoded_user_type'] = df.user_type.map(type_map)
print(df.head())

# using get_dummies to one hot encode
print(pd.get_dummies(df, columns=['user_type']).head())

The output is generated as depicted in Figures 3-16 and 3-17. Figure 3-16 shows the output of dummy encoding. With the map() approach we keep the number of features in check, yet we have to be careful about the caveats mentioned in this section.

Figure 3-16. Dataframe with user_type attribute dummy encoded

The second image, Figure 3-17, showcases the output of one hot encoding the user_type attribute. We discuss these approaches in more detail in Chapter 4, when we cover feature engineering.

Figure 3-17. Dataframe with user_type attribute one hot encoded

Normalizing Values

Attribute normalization is the process of standardizing the range of values of attributes.
Machine learning algorithms in many cases utilize distance metrics, and attributes or features of different scales/ranges might adversely affect the calculations or bias the outcomes. Normalization is also called feature scaling. There are various ways of scaling/normalizing features, among them rescaling, standardization (or zero-mean unit variance), unit scaling, and many more. We may choose a normalization technique based upon the feature, algorithm, and use case at hand. This will become clearer when we work on use cases. We also cover feature scaling strategies in detail in Chapter 4: Feature Engineering and Selection. The following snippet showcases a quick example of using the min-max scaler, available from the preprocessing module of sklearn, which rescales attributes to the desired given range.

df_normalized = df.dropna().copy()
min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(df_normalized['price'].values.reshape(-1, 1))
df_normalized['normalized_price'] = np_scaled.reshape(-1, 1)

Note that we reshape the underlying values array; calling reshape() directly on a pandas Series is not supported in recent pandas versions. Figure 3-18 showcases the unscaled price values and the normalized price values scaled to the range [0, 1].

Figure 3-18. Original and normalized values for price

String Manipulations

Raw data presents all sorts of issues and complexities before it can be used for analysis. Strings are another class of raw data that needs special attention and treatment before our algorithms can make sense of it. As mentioned while discussing wrangling methods for categorical data, there are limitations and issues with directly using string data in algorithms. String data representing natural language is highly noisy and requires its own set of wrangling steps. Though most of these steps are use case dependent, it is worth mentioning them here (we will cover them in detail along with use cases for better clarity).
String data usually undergoes wrangling steps such as:

• Tokenization: Splitting string data into constituent units. For example, splitting sentences into words or words into characters.

• Stemming and lemmatization: These are normalization methods that bring words into their root or canonical forms. While stemming is a heuristic process to achieve the root form, lemmatization utilizes rules of grammar and vocabulary to derive the root.

• Stopword removal: Text contains words that occur at high frequency yet do not convey much information (punctuation, conjunctions, and so on). These words/phrases are usually removed to reduce dimensionality and noise in the data.

Apart from these three common steps, there are other manipulations like POS tagging, hashing, indexing, and so on. Each of these is required and tuned based on the data and problem statement at hand. Stay tuned for more details on these in the coming chapters.

Data Summarization

Data summarization refers to the process of preparing a compact representation of the raw data at hand. This process involves aggregating data using different statistical, mathematical, and other methods. Summarization is helpful for visualization, for compressing raw data, and for better understanding its attributes. The pandas library provides various powerful summarization techniques to suit different requirements; we will cover a couple of them here. The most widely used form of summarization is to group values based on certain conditions or attributes. The following snippet illustrates one such summarization.

print(df['price'][df['user_type']=='a'].mean())
print(df['purchase_week'].value_counts())

The first statement calculates the mean price of all transactions made by users of type 'a', while the second counts the number of transactions per week.
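The same conditional summarization can be sketched end to end on a toy frame (hypothetical values, chosen so the results are easy to verify by hand):

```python
import pandas as pd

# Hypothetical mini transaction table
df = pd.DataFrame({
    'user_type': ['a', 'a', 'b', 'c'],
    'price': [10.0, 20.0, 30.0, 40.0],
    'purchase_week': [1, 1, 2, 2],
})

# Mean price over transactions of user_type 'a' only
mean_price_a = df['price'][df['user_type'] == 'a'].mean()

# Number of transactions per purchase week
tx_per_week = df['purchase_week'].value_counts()
```

Here mean_price_a averages only the two type-'a' rows, and tx_per_week tallies two transactions for each of weeks 1 and 2.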
Though these calculations are helpful, grouping data based on attributes gives us an even better understanding of it. The groupby() function helps us perform such grouping, as shown in the following snippet.

print(df.groupby(['user_class'])['quantity_purchased'].sum())

This statement generates a tabular output representing the sum of quantities purchased by each user_class. The output generated is as follows.

user_class
existing           4830
loyal_existing     5515
new               10100
Name: quantity_purchased, dtype: int32

The groupby() function is a powerful interface that allows us to perform complex groupings and aggregations. In the previous example we grouped on only a single attribute and performed a single aggregation (i.e., sum). With groupby() we can perform multi-attribute groupings and apply multiple aggregations across attributes. The following snippet showcases three variants of groupby() usage and their corresponding outputs.

# variant-1: multiple aggregations on single attribute
df.groupby(['user_class'])['quantity_purchased'].agg([np.sum, np.mean,
                                                      np.count_nonzero])

# variant-2: different aggregation functions for each attribute
df.groupby(['user_class','user_type']).agg({'price': np.mean,
                                            'quantity_purchased': np.max})

# variant-3: multiple aggregations on price, single aggregation on quantity
df.groupby(['user_class','user_type']).agg({'price': {'total_price': np.sum,
                                                      'mean_price': np.mean,
                                                      'variance_price': np.std,
                                                      'count': np.count_nonzero},
                                            'quantity_purchased': np.sum})

The three variants can be explained as follows.
Variant 1: Here we apply three different aggregations on quantity_purchased, grouped by user_class (see Figure 3-19).

Figure 3-19. Groupby with multiple aggregations on a single attribute

Variant 2: Here, we apply different aggregation functions to two different attributes. The agg() function takes a dictionary as input, containing attributes as keys and aggregation functions as values (see Figure 3-20).

Figure 3-20. Groupby with different aggregation functions for different attributes

Variant 3: Here, we combine variants 1 and 2, i.e., we apply multiple aggregations on the price field while applying only a single one on quantity_purchased. Again a dictionary is passed, as shown in the snippet. The output is shown in Figure 3-21.

Figure 3-21. Groupby showcasing a complex operation

Apart from groupby() based summarization, other functions such as pivot(), pivot_table(), stack(), unstack(), crosstab(), and melt() provide capabilities to reshape a pandas dataframe as per requirements. A complete description of these methods, with examples, is available in the pandas documentation at https://pandas.pydata.org/pandas-docs/stable/reshaping.html. We encourage you to go through it.

Data Visualization

Data Science is a type of storytelling that involves data as its lead character. As Data Science practitioners, we work with loads of data that undergoes processing, wrangling, and analysis day in and day out for various use cases. Augmenting this storytelling with visual aspects like charts, graphs, and maps not only helps improve the understanding of the data (and in turn the use case/business problem) but also provides opportunities to find hidden patterns and potential insights.
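Before we dive into plots, the first two groupby()/agg() variants above can be sketched on a toy frame. Note that the nested-dictionary form of variant 3 is rejected by recent pandas versions (named aggregation replaces it), so this sketch, with made-up values, sticks to variants 1 and 2 using string aggregation names.

```python
import pandas as pd

# Hypothetical mini frame mirroring the dataset's columns
df = pd.DataFrame({
    'user_class': ['new', 'new', 'existing'],
    'user_type': ['a', 'b', 'c'],
    'price': [10.0, 20.0, 30.0],
    'quantity_purchased': [1, 2, 3],
})

# variant-1 style: several aggregations on one attribute
v1 = df.groupby('user_class')['quantity_purchased'].agg(['sum', 'mean'])

# variant-2 style: a different aggregation per attribute
v2 = df.groupby('user_class').agg({'price': 'mean',
                                   'quantity_purchased': 'max'})
```

v1 is indexed by user_class with 'sum' and 'mean' columns; v2 keeps one aggregated column per input attribute.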
Data visualization is thus the process of visually representing information in the form of charts, graphs, pictures, and so on for a better and universally consistent understanding. We mention universally consistent understanding to point out a very common issue with human languages. Human languages are inherently complex, and depending upon the intentions and skills of the writer, the audience may perceive the written information in different ways (causing all sorts of problems). Presenting data visually thus provides us with a consistent language for presenting and understanding information (though this too is not free from misinterpretation, it does provide a certain consistency). In this section, we begin by utilizing pandas and its capabilities to understand data visually through different visualizations. We will then introduce visualizations from the matplotlib perspective.

■ Note  Data visualization is in itself a popular and deep field of study utilized across domains. This chapter and section present only a few topics to get us started; this is by no means a comprehensive and detailed guide on data visualization. Interested readers may explore further, though the topics covered here and in coming chapters should be enough for the most common visualization tasks.

Visualizing with Pandas

Data visualization is a diverse field and a science of its own. Though the selection of the type of visualization depends highly on the data, the audience, and more, we will continue with our product transaction dataset from the previous section to understand and visualize it. As a quick recap, the dataset at hand consists of transactions indicating the purchase of products by certain users. Each transaction has the following attributes.
• Date: The date of the transaction
• Price: The price of the product purchased
• Product ID: Product identification number
• Quantity Purchased: The quantity of product purchased in this transaction
• Serial No: The transaction serial number
• User ID: Identification number of the user performing the transaction
• User Type: The type of user

We wrangled our dataset to clean up the column names, convert attributes to the correct data types, and derive the additional attributes user_class and purchase_week, as discussed in the previous section. pandas is a very popular and powerful library, examples of which we have seen throughout the chapter. Visualization is another important and widely used feature of pandas. It exposes its visualization capabilities through the plot interface and closely follows matplotlib style visualization syntax.

Line Charts

We begin by looking at the purchase patterns of the user who has the maximum number of transactions (we leave it as an exercise for you to identify such a user). A trend is best visualized using a line chart. Simply by subsetting the dataframe on the required fields, the plot() interface charts out a line chart by default. The following snippet shows the price-wise trend for the given user.

df[df.user_id == max_user_id][['price']].plot(style='blue')
plt.title('Price Trends for Particular User')

The plt alias refers to matplotlib.pyplot. We will discuss this more in the coming section; for now, assume we require it to add enhancements to plots generated by pandas. In this case we use it to add a title to our plot. The plot generated is depicted in Figure 3-22.

Figure 3-22. Line chart showing price trend for a user

Though we can see a visual representation of the prices of different transactions by this user, it is not helping us much.
Let's now use the line chart again to understand how this user's purchases trend over time (remember, we have the date of each transaction available in the dataset). We use the same plot interface, subsetting the dataframe to the two required attributes. The following snippet outlines the process.

df[df.user_id == max_user_id].plot(x='date', y='price', style='blue')
plt.title('Price Trends for Particular User Over Time')

This time, since we have two attributes, we inform pandas to use date as our x-axis and price as the y-axis. The plot interface handles datetime data types with élan, as is evident in the output depicted in Figure 3-23.

Figure 3-23. Price trend over time for a given user

This time our visualization clearly helps us see the purchase pattern for this user. Though we could discuss insights from this visualization at length, a quick inference is clearly visible: the user seems to have purchased high valued items at the start of the year, with a decreasing trend as the year progressed. Also, the transactions at the beginning of the year are more numerous and closer together compared to the rest of the year. We can correlate such details with more data to identify patterns and behaviors. We will cover more such aspects in coming chapters.

Bar Plots

Having seen trends for a particular user, let's take a look at our dataset at an aggregated level. Since we already have a derived attribute called purchase_week, let's use it to aggregate quantities purchased by users over time. We first group the data at a week level using the groupby() function and then sum the attribute quantity_purchased. The final step is to plot the aggregation on a bar plot. The following snippet helps us plot this information.
df[['purchase_week',
    'quantity_purchased']].groupby('purchase_week').sum().plot.barh(color='orange')
plt.title('Quantities Purchased per Week')

We use the barh() function to prepare a horizontal bar chart. It is similar to a standard bar() plot in terms of the way it represents information; the difference is in the orientation of the plot. Figure 3-24 shows the generated output.

Figure 3-24. Bar plot representing quantities purchased at a weekly level

Histograms

One of the most important aspects of exploratory data analysis (EDA) is understanding the distribution of the various numerical attributes of a given dataset. The simplest and most common way of visualizing a distribution is through a histogram. We plot the price distribution of the products purchased in our dataset as shown in the following snippet.

df.price.hist(color='green')
plt.title('Price Distribution')

We use the hist() function to plot the price distribution in a single line of code. The output is depicted in Figure 3-25.

Figure 3-25. Histogram representing price distribution

The output shown in Figure 3-25 clearly shows a skewed and tailed distribution. This information will be useful when using such attributes in our algorithms; more will become clear when we work on actual use cases. We can take this a step further and try to visualize the price distribution on a per week basis. We do so by using the parameter by in the hist() function. This parameter groups the data based on the attribute mentioned and then generates a subplot for each such grouping. In our case, we group by purchase week as shown in the following snippet.
df[['price','purchase_week']].hist(by='purchase_week', sharex=True)

The output depicted in Figure 3-26 showcases the distribution of price on a weekly basis, with the highest bin clearly marked in a different color.

Figure 3-26. Histograms on a weekly basis

Pie Charts

One of the most commonly asked questions while understanding data or extracting insights is which category contributes the most. To visualize a percentage distribution, pie charts are best utilized. For our dataset, the following snippet helps us visualize how much each user class purchased.

class_series = df.groupby('user_class').size()
class_series.name = 'User Class Distribution'
class_series.plot.pie(autopct='%.2f')
plt.title('User Class Share')
plt.show()

This snippet uses groupby() to extract a series representing the number of transactions per user_class. We then use the pie() function to plot the percentage distribution. We use the autopct parameter to annotate the plot with the actual percentage contribution of each user_class. Figure 3-27 depicts the output pie chart.

Figure 3-27. Pie chart representing user class transaction distribution

The plot in Figure 3-27 clearly points out that new users account for more than 50% of the total transaction share, while existing and loyal_existing users make up the rest. We do not recommend using pie charts, especially when you have more than three or four categories; use bar charts instead.

Box Plots

Box plots are important visualizations that help us understand the quartile distribution of numerical data. A box plot, or box-whisker plot, is a concise representation that helps us understand the different quartiles, skewness, dispersion, and outliers in the data. We'll look at the attributes quantity_purchased and purchase_week using box plots. The following snippet generates the required plot for us.
df[['quantity_purchased','purchase_week']].plot.box()
plt.title('Quantity and Week value distribution')

Now, let's look at the plots generated (see Figure 3-28). The bottom edge of the box in a box plot marks the first quartile, while the top edge marks the third. The line in the middle of the box marks the second quartile, or the median. The whiskers extending from the top and bottom of the box mark the range of values, and outliers are marked beyond the whisker boundaries. In our example, for quantity_purchased the median is quite close to the middle of the box, while for purchase_week it is toward the bottom (clearly pointing out the skewness in the data). You are encouraged to read more about box plots for an in-depth understanding at http://www.physics.csbsju.edu/stats/box2.html and http://www.stat.yale.edu/Courses/1997-98/101/boxplot.htm.

Figure 3-28. Box plots using pandas

Scatter Plots

Scatter plots are another class of visualizations, usually used to identify correlations or patterns between attributes. Like most visualizations we have seen so far, scatter plots are also available through the plot() interface of pandas. To understand scatter plots, we first need to perform a couple of data wrangling steps to get our data into the required shape. We first encode user_class with dummy encoding (as discussed in the previous section) using map(), and then get the mean price and count of transactions on a per week, per user_class level using groupby(). The following snippet helps us get our dataframe.
uclass_map = {'new': 1, 'existing': 2, 'loyal_existing': 3, 'error': 0}
df['enc_uclass'] = df.user_class.map(uclass_map)

bubble_df = df[['enc_uclass',
                'purchase_week',
                'price','product_id']].groupby(['purchase_week',
                                                'enc_uclass']).agg(
                                                       {'price': 'mean',
                                                        'product_id': 'count'}
                                                       ).reset_index()
bubble_df.rename(columns={'product_id': 'total_transactions'}, inplace=True)

Figure 3-29 showcases the resultant dataframe.

Figure 3-29. Dataframe aggregated on a per week per user_class level

Now, let's visualize this data using a scatter plot. The following snippet does the job for us.

bubble_df.plot.scatter(x='purchase_week',
                       y='price')
plt.title('Purchase Week Vs Price')
plt.show()

This generates the plot in Figure 3-30, showcasing an almost random spread of data across weeks and average price, with some slight concentration in the top left of the plot.

Figure 3-30. Scatter plot showing spread of data across purchase_week and price

Scatter plots also provide us the capability to visualize more than these basic dimensions: we can plot a third and fourth dimension using color and size. The following snippet helps us understand the spread, with color denoting the user_class and the size of the bubble indicating the number of transactions.
bubble_df.plot.scatter(x='purchase_week',
                       y='price',
                       c=bubble_df['enc_uclass'],
                       s=bubble_df['total_transactions']*10)
plt.title('Purchase Week Vs Price Per User Class Based on Tx')

The parameters are self-explanatory—c represents color, while s stands for the size of the bubble. Such plots are also called bubble charts. The output generated is shown in Figure 3-31.

Figure 3-31. Scatter plot visualizing multi-dimensional data

In this section, we utilized pandas to plot all sorts of visualizations. These were some of the most widely used visualizations, and pandas provides a lot of flexibility to do more with them. There is also an extended list of plots that can be generated using pandas; the complete information is available in the pandas documentation.

Visualizing with Matplotlib

matplotlib is a popular plotting library. It provides interfaces and utilities to generate publication quality visualizations. Since its first version in 2003, matplotlib has been continuously improved by its active developer community, and it forms the base of, and inspiration for, many other plotting libraries. As discussed in the previous section, pandas, along with SciPy (another popular Python library for scientific computing), provides wrappers over matplotlib implementations for ease of visualizing data. matplotlib provides two primary modules to work with, pylab and pyplot. In this section, we will concentrate only on the pyplot module (the use of pylab is not much encouraged). The pyplot interface is an object oriented interface that favors explicit instantiations, as opposed to pylab's implicit ones. In the previous section, we briefly introduced different visualizations and saw a few ways of tweaking them. Since pandas visualizations are derived from matplotlib itself, we will cover additional concepts and capabilities of matplotlib.
This will enable you not only to use matplotlib with ease but also to pick up tricks to improve the visualizations generated using pandas.

Figures and Subplots

First things first: the base of any matplotlib style visualization begins with figure and subplot objects. The figure module helps matplotlib generate the plotting window object and its associated elements; in short, it is the top-level container for all visualization components. In matplotlib syntax, a figure is the top-most container, and within one figure we have the flexibility to visualize multiple plots. Thus, subplots are the plots within the high-level figure container. Let's get started with a simple example and the required imports. We will then build on the example to better understand the concept of figures and subplots. The following snippet imports the pyplot module of matplotlib and plots a simple sine curve, using numpy to generate the x and y values.

import numpy as np
import matplotlib.pyplot as plt

# sample plot
x = np.linspace(-10, 10, 50)
y = np.sin(x)
plt.plot(x, y)
plt.title('Sine Curve using matplotlib')
plt.xlabel('x-axis')
plt.ylabel('y-axis')

The pyplot module exposes methods such as plot() to generate visualizations. In this example, with plt.plot(x, y), matplotlib works behind the scenes to generate the figure and axes objects to output the plot in Figure 3-32. For completeness' sake, the statements plt.title(), plt.xlabel(), and so on provide ways to set the figure title and axis labels, respectively.

Figure 3-32. Sample plot

Now that we have a sample plot done, let's look at how different objects interact in the matplotlib universe. As mentioned, the figure object is the top-most container of all elements. Before complicating things, we begin by plotting different figures, i.e., each figure containing only a single plot.
The following snippet plots a sine and a cosine wave in two different figures using numpy and matplotlib.

# first figure
plt.figure(1)
plt.plot(x, y)
plt.title('Fig1: Sine Curve')
plt.xlabel('x-axis')
plt.ylabel('y-axis')

# second figure
plt.figure(2)
y = np.cos(x)
plt.plot(x, y)
plt.title('Fig2: Cosine Curve')
plt.xlabel('x-axis')
plt.ylabel('y-axis')

The statement plt.figure() creates an instance of type Figure. The number passed in as a parameter is the figure identifier, which is helpful when referring back to a particular figure if multiple exist. The rest of the statements are similar to our sample plot, with pyplot always drawing to the current figure object. Note that the moment a new figure is instantiated, pyplot refers to the newly created object unless specified otherwise. The output generated is shown in Figure 3-33.

Figure 3-33. Multiple figures using matplotlib

We plot multiple figures while telling the data story for a use case. Yet there are cases where we need multiple plots in the same figure; this is where the concept of subplots comes into the picture. A subplot divides a figure into a grid of specified rows and columns and also provides interfaces to interact with the plotting elements. Subplots can be generated in a few different ways, and their use depends on personal preference and use case demands. We begin with the most intuitive one, the add_subplot() method. This method is exposed through the figure object itself; its parameters help define the grid layout and other properties. The following snippet generates four subplots in a figure.
y = np.sin(x)
figure_obj = plt.figure(figsize=(8, 6))
ax1 = figure_obj.add_subplot(2, 2, 1)
ax1.plot(x, y)
ax2 = figure_obj.add_subplot(2, 2, 2)
ax3 = figure_obj.add_subplot(2, 2, 3)
ax4 = figure_obj.add_subplot(2, 2, 4)
ax4.plot(x + 10, y)

This snippet first defines a figure object using plt.figure(). We then get an axes object pointing to the first subplot, generated using the statement figure_obj.add_subplot(2, 2, 1). This statement divides the figure into two rows and two columns. The last parameter (value 1) points to the first subplot in this grid. The snippet simply plots the sine curve in the top-left subplot (identified as 2,2,1) and another sine curve shifted by 10 units on the x-axis in the fourth subplot (identified as 2,2,4). The output generated is shown in Figure 3-34.

Figure 3-34. Subplots using add_subplot method

The second method for generating subplots is through the pyplot module directly. The pyplot module exposes a method subplots(), which returns a figure object and a list of axes objects, each of which points to a subplot in the layout specified by the subplots() parameters. This method is useful when we know up front how many subplots will be required. The following snippet showcases the same.

fig, ax_list = plt.subplots(2, 1, sharex=True, figsize=(8, 6))
y = np.sin(x)
ax_list[0].plot(x, y)
y = np.cos(x)
ax_list[1].plot(x, y)

The statement plt.subplots(2, 1, sharex=True) does three things in one go. First, it generates a figure object, which is divided into 2 rows and 1 column (i.e., two subplots in total). Second, the two subplots are returned in the form of a list of axes objects. The third thing is the sharing of the x-axis, which we achieve using the parameter sharex. This sharing allows all subplots in this figure to have the same x-axis.
This lets us view data on the same scale, along with aesthetic improvements. The output is depicted in Figure 3-35, showcasing sine and cosine curves on the same x-axis.

Figure 3-35. Subplots using subplots() method

Another variant is the subplot() function, which is also exposed through the pyplot module directly. This closely emulates the add_subplot() method of the figure object. You can find examples listed in the code for this chapter. Before moving on to other concepts, we quickly touch on the subplot2grid() function, also exposed through the pyplot module. This function provides capabilities similar to the ones already discussed, along with finer control to define a grid layout where subplots can span an arbitrary number of columns and rows. The following snippet showcases a grid with subplots of different sizes.

y = np.abs(x)
z = x**2
plt.subplot2grid((4, 3), (0, 0), rowspan=4, colspan=2)
plt.plot(x, y, 'b', x, z, 'r')
ax2 = plt.subplot2grid((4, 3), (0, 2), rowspan=2)
plt.plot(x, y, 'b')
plt.setp(ax2.get_xticklabels(), visible=False)
plt.subplot2grid((4, 3), (2, 2), rowspan=2)
plt.plot(x, z, 'r')

The subplot2grid() function takes a number of parameters, explained as follows:

• shape: A tuple representing the rows and columns in the grid as (rows, columns).
• loc: A tuple representing the location of a subplot. This parameter is 0-indexed.
• rowspan: This parameter represents the number of rows the subplot covers.
• colspan: This parameter represents the number of columns the subplot extends to.

The output generated by the snippet, shown in Figure 3-36, has one subplot covering four rows and two columns containing two functions. The other two subplots cover two rows and one column each.

Figure 3-36. Subplots using subplot2grid()

Plot Formatting

Formatting a plot is another important aspect of storytelling, and matplotlib provides us with plenty of features here.
From changing colors to markers and so on, matplotlib provides easy-to-use, intuitive interfaces. We begin with the color attribute, which is available as part of the plot() interface. The color attribute works on the RGBA specification, allowing us to provide alpha and color values as strings ('red', 'green', and so on), as single letters ('r', 'g', and so on), and even as hex values. More details are available in the matplotlib documentation at https://matplotlib.org/api/colors_api.html. The following example, with output depicted in Figure 3-37, illustrates how easy it is to set the color and alpha properties of plots.

y = x

# color
ax1 = plt.subplot(321)
plt.plot(x, y, color='green')
ax1.set_title('Line Color')

# alpha
ax2 = plt.subplot(322, sharex=ax1)
alpha = plt.plot(x, y)
alpha[0].set_alpha(0.3)
ax2.set_title('Line Alpha')
plt.setp(ax2.get_yticklabels(), visible=False)

Figure 3-37. Setting color and alpha properties of a plot

Along the same lines, we also have options to use different shapes to mark data points as well as different styles to plot lines. These options come in handy when representing different attributes/classes on the same plot. The following snippet, with output depicted in Figure 3-38, showcases the same.

# marker
# markers -> '+', 'o', '*', 's', ',', '.', etc.
ax3 = plt.subplot(323, sharex=ax1)
plt.plot(x, y, marker='*')
ax3.set_title('Point Marker')

# linestyle
# linestyles -> '-', '--', '-.', ':', 'steps'
ax4 = plt.subplot(324, sharex=ax1)
plt.plot(x, y, linestyle='--')
ax4.set_title('Line Style')
plt.setp(ax4.get_yticklabels(), visible=False)

Figure 3-38. Setting marker and line style properties of a plot

Though there are many more fine-tuning options available, you are encouraged to go through the documentation for detailed information.
We conclude the formatting section with two final tricks related to line width and a shorthand notation to do it all quickly, as shown in the following snippet.

# line width
ax5 = plt.subplot(325, sharex=ax1)
line = plt.plot(x, y)
line[0].set_linewidth(3.0)
ax5.set_title('Line Width')

# combined styling shorthand
ax6 = plt.subplot(326, sharex=ax1)
plt.plot(x, y, 'b^')
ax6.set_title('Styling Shorthand')
plt.setp(ax6.get_yticklabels(), visible=False)

This snippet uses the line object returned by the plot() function to set the line width. The second part of the snippet showcases the shorthand notation to set the line color and data point marker in one go, as 'b^'. The output shown in Figure 3-39 helps show this effect.

Figure 3-39. Example to show line_width and shorthand notation

Legends

A graph legend is a key that helps us map colors/shapes or other attributes to the different attributes being visualized by the plot. Though in most cases matplotlib does a wonderful job at preparing and showing legends, there are times when we require a finer level of control. The legend of a plot can be controlled using the legend() function, available through the pyplot module directly. We can set the location, size, and other formatting attributes through this function. The following example shows the legend being placed in the best possible location. (Note that y holds the identity line and z the parabola at this point, so the labels are assigned accordingly.)

plt.plot(x, y, 'g', label='y=x')
plt.plot(x, z, 'b:', label='y=x^2')
plt.legend(loc="best")
plt.title('Legend Sample')

Figure 3-40. Sample plot with legend

One of the primary goals of matplotlib is to provide publication-quality visualizations. Matplotlib supports LaTeX-style formatting of the legends to cleanly visualize mathematical symbols and equations. The $ symbol is used to mark the start and end of LaTeX-style formatting. The same is shown in the following snippet and output plot (see Figure 3-41).
# legend with latex formatting
plt.plot(x, y, 'g', label='$y = x$')
plt.plot(x, z, 'b:', linewidth=3, label='$y = x^2$')
plt.legend(loc="best", fontsize='x-large')
plt.title('Legend with $LaTEX$ formatting')

Figure 3-41. Sample plot with LaTeX formatted legend

Axis Controls

The next feature from matplotlib is the ability to control the x- and y-axes of a plot. Apart from basic features like setting the axis labels and colors using the methods set_xlabel() and set_ylabel(), there are finer controls available as well. Let’s first see how to add a secondary y-axis. There are many scenarios where we plot data related to different features (having values at different scales) on the same plot. To get a proper understanding, it usually helps to have each feature on its own y-axis (each scaled to the respective range). To get an additional y-axis, we use the function twinx(), exposed through the axes object. The following snippet outlines the scenario.

## axis controls
# secondary y-axis
fig, ax1 = plt.subplots()
ax1.plot(x, y, 'g')
ax1.set_ylabel(r"primary y-axis", color="green")
ax2 = ax1.twinx()
ax2.plot(x, z, 'b:', linewidth=3)
ax2.set_ylabel(r"secondary y-axis", color="blue")
plt.title('Secondary Y Axis')

At first it may sound odd to have a function named twinx() generate a secondary y-axis. Smartly, matplotlib uses this name to point out the fact that the additional y-axis shares the same x-axis, hence the name twinx(). Along the same lines, an additional x-axis is obtained using the function twiny(). The output plot is depicted in Figure 3-42.

Figure 3-42. Sample plot with secondary y-axis

By default, matplotlib identifies the range of values being plotted and adjusts the ticks and the range of both the x- and y-axes. It also provides the capability to manually set these through the axis() function.
Through this function, we can set the axis range using predefined keywords like tight, scaled, and equal, or by passing a list that marks the values as [xmin, xmax, ymin, ymax]. The following snippet shows how to adjust the axis range manually.

# manual
y = np.log(x)
z = np.log2(x)
w = np.log10(x)
plt.plot(x, y, 'r', x, z, 'g', x, w, 'b')
plt.axis([0, 2, -1, 2])
plt.title('Manual Axis Range')

The output in Figure 3-43 showcases the plot generated without any axis adjustments on the left, while the one on the right shows the axis adjustment done in the previous snippet.

Figure 3-43. Plots showcasing default axis and manually adjusted axis

Now that we have seen how to set the axis range, we will quickly touch on setting the ticks, or axis markers, manually as well. For axis ticks, we have two separate functions available: one sets the range of ticks, while the second sets the tick labels. The functions are intuitively named set_ticks() and set_ticklabels(), respectively. In the following example, we set the ticks to be marked for the x-axis, while for the y-axis we set both the tick range and the labels using the appropriate functions.

# Manual ticks
plt.plot(x, y)
ax = plt.gca()
ax.xaxis.set_ticks(np.arange(-2, 2, 1))
ax.yaxis.set_ticks(np.arange(0, 5))
ax.yaxis.set_ticklabels(["min", 2, 4, "max"])
plt.grid(True)
plt.title("Manual ticks on the x-axis")

The output is a plot with the x-axis having labels marked only between -2 and 1, while the y-axis has a range of 0 to 5 with labels changed manually. The output plot is shown in Figure 3-44.

Figure 3-44. Plot showcasing axes with manual ticks

Before we move on to the next set of features/capabilities, it is worth noting that, apart from setting the axis range manually (as seen previously), matplotlib also provides the capability to scale the axis based on the data range in a standard manner. The following is a quick snippet scaling the y-axis on a log scale.
The output is shown in Figure 3-45.

# scaling
plt.plot(x, y)
ax = plt.gca()
# values: log, logit, symlog
ax.set_yscale("log")
plt.grid(True)
plt.title("Log Scaled Axis")

Figure 3-45. Plot showcasing log scaled y-axis

Annotations

The text() interface from the pyplot module exposes the annotation capabilities of matplotlib. We can annotate any part of the figure/plot/subplot using this interface. It takes the x and y coordinates, the text to be displayed, alignment, and fontsize parameters as inputs to place the annotations at the desired place on the plot. The following snippet annotates the minima of a parabolic plot.

# annotations
y = x**2
min_x = 0
min_y = min_x**2
plt.plot(x, y, "b-", min_x, min_y, "ro")
plt.axis([-10, 10, -25, 100])
plt.text(0, 60, "Parabola\n$y = x^2$", fontsize=15, ha="center")
plt.text(min_x, min_y + 2, "Minima", ha="center")
plt.text(min_x, min_y - 6, "(%0.1f, %0.1f)" % (min_x, min_y), ha='center', color='gray')
plt.title("Annotated Plot")

The text() interface provides many more capabilities and formatting features. You are encouraged to go through the official documentation and examples for details on this. The output plot showcasing the annotated parabola is shown in Figure 3-46.

Figure 3-46. Plot showcasing annotations

Global Parameters

To maintain consistency, we usually try to keep the plot sizes, fonts, and colors the same across the visual story. Setting each of these attributes for every plot creates complexity and makes the code base difficult to maintain. To overcome these issues, we can set the formatting settings globally, as shown in the following snippet.
# global formatting params
params = {'legend.fontsize': 'large',
          'figure.figsize': (10, 10),
          'axes.labelsize': 'large',
          'axes.titlesize': 'large',
          'xtick.labelsize': 'large',
          'ytick.labelsize': 'large'}
plt.rcParams.update(params)

Once set using rcParams.update(), the attributes provided in the params dictionary are applied to every plot generated. You are encouraged to apply these settings and generate the plots discussed in this section again to understand the difference.

Python Visualization Ecosystem

The matplotlib library is, without a doubt, a very powerful and popular visualization/plotting library. It provides most of the tools and tricks required to plot any type of data, with the capability to control even the finest elements. Yet matplotlib leaves a lot to be desired, even for pro users. Being a low-level API, it requires a lot of boilerplate code, interactivity is limited, and the styling and other formatting defaults seem dated. To address these issues and provide high-level interfaces that work well with the current Python ecosystem, the Python universe has quite a few visualization libraries to choose from. Some of the most popular and powerful ones are bokeh, seaborn, ggplot, and plotly. Each of these libraries builds on the understanding and feature set of matplotlib, while providing its own set of features and easy-to-use wrappers to plug the gaps. You are encouraged to explore these libraries and understand the differences. We will introduce some of them in the coming chapters as and when required. Though different, most libraries work on concepts similar to matplotlib’s, and hence the learning curve is shorter if you are well versed in matplotlib.

Summary

This chapter covered quite a lot of ground in terms of understanding, processing, and wrangling data. We covered major data formats like flat files (CSV, JSON, XML, HTML, etc.) and used standard libraries to extract/collect data.
We touched on standard data types and their importance in the overall process of data science. A major part of this chapter covered data wrangling tasks to transform, clean, and process data so as to bring it into usable form. Though the techniques were explained using the pandas library, the concepts are universal and apply to most data science related use cases. You may use these techniques as pointers that can be easily applied using different libraries and programming/scripting languages. We covered the major plots using sample datasets, describing their usage. We also touched on the basics and the powerful tricks of matplotlib. We strongly encourage you to read the referred links for an in-depth understanding. This chapter covers the initial steps in the CRISP-DM model of data collection, processing, and visualization. In the coming chapters, we build on these concepts and apply them to solving specific real-world problems. Stay tuned!

CHAPTER 4

Feature Engineering and Selection

Building Machine Learning systems and pipelines takes significant effort, which is evident from the knowledge you gained in the previous chapters. In the first chapter, we presented a high-level architecture for building Machine Learning pipelines. The path from data to insights and information is not an easy and direct one. It is tough and also iterative in nature, requiring data scientists and analysts to iterate through several steps multiple times to get to the perfect model and derive correct insights. A limitation of Machine Learning algorithms is the fact that they can only understand numerical values as inputs. This is because, at the heart of any algorithm, we usually have multiple mathematical equations, constraints, optimizations, and computations. Hence it is almost impossible for us to feed raw data into any algorithm and expect results. This is where features and attributes are extremely helpful in building models on top of our data.
Building machine intelligence is a multi-layered process having multiple facets. In this book, so far, we have already explored how you can retrieve, process, wrangle, and visualize data. Exploratory data analysis and visualization are the first steps toward understanding your data better. Understanding your data involves understanding its complete scope, including the domain, constraints, caveats, quality, and available attributes. From Chapter 3, you might remember that data comprises multiple fields, attributes, or variables. Each attribute by itself is an inherent feature of the data. You can then derive further features from these inherent features, and this itself forms a major part of feature engineering. Feature selection is another important task that goes hand in hand with feature engineering, where the data scientist is tasked with selecting the best possible subset of features and attributes that would help in building the right model. An important point to remember here is that feature engineering and selection is not a one-time process to be carried out in an ad hoc manner. The nature of building Machine Learning systems is iterative (following the CRISP-DM principle), and hence extracting and engineering features from the dataset is not a one-time task. You may need to extract new features and try out multiple selections each time you build a model to get the best and most optimal model for your problem. Data processing and feature engineering is often described by data scientists as the toughest task or step in building any Machine Learning system. With the need for both domain knowledge and mathematical transformations, feature engineering is often said to be both an art and a science. The obvious complexities involve dealing with diverse types of data and variables.
Besides this, each Machine Learning problem or task needs specific features, and there is no one-size-fits-all solution in the case of feature engineering. This makes feature engineering all the more difficult and complex. Hence we follow a properly structured approach in this chapter, covering the following three major areas in the feature engineering workflow. They are mentioned as follows.

• Feature extraction and engineering
• Feature scaling
• Feature selection

This chapter covers essential concepts for all three major areas mentioned above. Techniques for feature engineering will be covered in detail for diverse data types, including numeric, categorical, temporal, text, and image data. We would like to thank our good friend and fellow data scientist, Gabriel Moreira, for helping us with some excellent compilations of feature engineering techniques over these diverse data types. We also cover different feature scaling methods, typically used as a part of the feature engineering process to normalize values, preventing higher-valued features from taking unnecessary prominence. Several feature selection techniques like filter, wrapper, and embedded methods will also be covered. Techniques and concepts will be supplemented with sufficient hands-on examples and code snippets. Remember to check out the relevant code under Chapter 4 in the GitHub repository at https://github.com/dipanjanS/practical-machine-learning-with-python, which contains the necessary code, notebooks, and data. This will make things easier to understand, help you gain enough knowledge to know which technique should be used in which scenario, and thus help you get started on your own journey toward feature engineering for building Machine Learning models!
Features: Understand Your Data Better

The essence of any Machine Learning model is comprised of two components, namely data and algorithms. You might remember the same from the Machine Learning paradigm that we introduced in Chapter 1. Any Machine Learning algorithm is, in essence, a combination of mathematical functions, equations, and optimizations, which are often augmented with business logic as needed. These algorithms are usually not intelligent enough to process raw data and discover the latent patterns in it that would be used to train the system. Hence we need better representations of the data for building Machine Learning models, also known as data features or attributes. Let’s look at some important concepts associated with data and features in this section.

Data and Datasets

Data is essential for analytics and Machine Learning. Without data we are literally powerless to implement any intelligent system. The formal definition of data would be a collection or set of qualitative and/or quantitative variables containing values based on observations. Typically, data is measured and collected from various observations. It is then stored in its raw form, which can be processed further and analyzed as required. Typically, in any analytics or Machine Learning system, you might need multiple sources of data, and processed data from one component can be fed as raw data to another component for further processing. Data can be structured, having definite rows and columns indicating observations and attributes, or unstructured, like free textual data. A dataset can be defined as a collection of data. Typically, this indicates data present in the form of flat files like CSV files or MS Excel files, relational database tables or views, or even raw data in two-dimensional matrices. Sample datasets that are quite popular in Machine Learning are available in the scikit-learn package, to help you get started quickly.
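As a quick, hedged illustration of these bundled sample datasets (assuming scikit-learn is installed), the following sketch loads the classic Iris toy dataset and inspects its shape:

```python
# Minimal sketch: loading one of scikit-learn's bundled toy datasets.
# The Iris dataset ships with scikit-learn, so no download is needed.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)  # the four measured attributes
print(iris.data.shape)     # feature matrix: 150 observations x 4 features
print(iris.target.shape)   # one class label per observation
```

Each row of iris.data is one observation and each column one attribute, which maps directly onto the notion of structured data described above.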
The sklearn.datasets module has these sample datasets readily available, along with other utilities pertaining to loading and handling datasets. You can follow the link http://scikit-learn.org/stable/datasets/index.html#datasets to learn more about the toy datasets and best practices for handling and loading data. Another popular resource for Machine Learning datasets is the UC Irvine Machine Learning Repository, which can be found at http://archive.ics.uci.edu/ml/index.php and contains a wide variety of datasets from real-world problems, scenarios, and devices. In fact, the popular Machine Learning and predictive analytics competition platform Kaggle also features some datasets from UCI and other datasets pertaining to various competitions. Feel free to check out these resources; we will in fact be using some datasets from them in this chapter as well as in subsequent chapters.

Features

Raw data is hardly ever used directly to build any Machine Learning model, mostly because algorithms can’t work with data that is not properly processed and wrangled into a desired format. Features are attributes or properties obtained from raw data. Each feature is a specific representation on top of the raw data. Typically, each feature is an individual measurable attribute, usually depicted by a column in a two-dimensional dataset. Each observation is depicted by a row, and each feature has a specific value for an observation. Thus each row typically indicates a feature vector, and the entire set of features across all the observations forms a two-dimensional feature matrix, also known as a feature set. Features are extremely important toward building Machine Learning models, and each feature represents a specific chunk of representation and information from the data that is used by the model. Both the quality and the quantity of features influence the performance of the model.
Features can be of two major types based on the dataset. Inherent raw features are obtained directly from the dataset with no extra data manipulation or engineering. Derived features are usually what we obtain from feature engineering, where we extract features from existing data attributes. A simple example would be creating a new feature Age from an employee dataset containing Birthdate, by just subtracting the birth date from the current date. The next major section covers more details on how to handle, extract, and engineer features based on diverse data types.

Models

Features are better representations of the underlying raw data, which act as inputs to any Machine Learning model. Typically, a model is comprised of data features, optional class labels or numeric responses for supervised learning, and a Machine Learning algorithm. The algorithm is chosen based on the type of problem we want to solve, after converting it into a specific Machine Learning task. Models are built by training the system on data features iteratively until we get the desired performance. Thus, a model is basically used to represent the relationships among the various features of our data. Typically, the process of modeling involves multiple major steps. Model building focuses on training the model on data features. Model tuning and optimization involves tuning specific model parameters, known as hyperparameters, and optimizing the model to get the best one. Model evaluation involves using standard performance evaluation metrics, like accuracy, to evaluate model performance. Model deployment is usually the final step where, once we have selected the most suitable model, we deploy it live in production, which usually involves building an entire system around this model based on the CRISP-DM methodology. Chapter 5 will focus on these aspects in further detail.
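Circling back to the derived-feature example from the Features discussion (computing Age from Birthdate), here is a minimal pandas sketch; the employee records, column names, and the fixed "current date" below are all made up for illustration:

```python
# Hedged sketch: deriving an Age feature from a Birthdate column.
# The data is hypothetical; a fixed reference date keeps the result reproducible.
import pandas as pd

employees = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Birthdate': pd.to_datetime(['1985-03-14', '1992-07-01'])
})
today = pd.Timestamp('2018-01-01')
# Derived feature: approximate age in whole years
employees['Age'] = (today - employees['Birthdate']).dt.days // 365
print(employees[['Name', 'Age']])
```

In a real pipeline you would use the actual current date (pd.Timestamp.now()) and a calendar-aware year calculation if exact ages matter.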
Revisiting the Machine Learning Pipeline

We covered the standard Machine Learning pipeline in detail in Chapter 1, which was based on the CRISP-DM standard. Let’s refresh our memory by looking at Figure 4-1, which depicts our standard generic Machine Learning pipeline with the major components identified with the various building blocks.

Figure 4-1. Revisiting our standard Machine Learning pipeline

The figure clearly depicts the main components in the pipeline, which you should already be well versed in by now. These components are mentioned once more for ease of understanding.

• Data retrieval
• Data preparation
• Modeling
• Model evaluation and tuning
• Model deployment and monitoring

Our area of focus in this chapter falls under the blocks under “Data Preparation”. We already covered processing and wrangling data in Chapter 3 in detail. Here, we will be focusing on the three major steps essential to handling data features. These are mentioned as follows.

1. Feature extraction and engineering
2. Feature scaling
3. Feature selection

These blocks are highlighted in Figure 4-1 and are essential to the process of transforming processed data into features. By processed, we mean the raw data after it has gone through the necessary pre-processing and wrangling operations. The sequence of steps usually followed in the pipeline for transforming processed data into features is depicted in a more detailed view in Figure 4-2.

Figure 4-2. A standard pipeline for feature engineering, scaling, and selection

It is quite evident from the sequence of steps depicted in the figure that features are first crafted and engineered, then necessary normalization and scaling is performed, and finally the most relevant features are selected to give us the final set of features.
We will cover these three components in detail in subsequent sections, following the same sequence as depicted in the figure.

Feature Extraction and Engineering

The process of feature extraction and engineering is perhaps the most important one in the entire Machine Learning pipeline. Good features depicting the most suitable representations of the data help in building effective Machine Learning models. In fact, more often than not it’s not the algorithms but the features that determine the effectiveness of the model. In simple words, good features give good models. A data scientist spends approximately 70% to 80% of their time on data processing, wrangling, and feature engineering for building any Machine Learning model. Hence it’s of paramount importance to understand all aspects pertaining to feature engineering if you want to be proficient in Machine Learning. Typically, feature extraction and feature engineering are synonyms that indicate the process of using a combination of domain knowledge, hand-crafted techniques, and mathematical transformations to convert data into features. Henceforth we will be using the term feature engineering to refer to all aspects concerning the task of extracting or creating new features from data. While the choice of Machine Learning algorithm is very important when building a model, more often than not the choice and number of features tend to have a greater impact on model performance. In this section, we will look to answer questions such as the why, what, and how of feature engineering to gain a more in-depth understanding of it.

What Is Feature Engineering?

We already informally explained the core concept behind feature engineering, where we use specific components from domain knowledge and specific techniques to transform data into features. Data in this case is raw data after necessary pre-processing and wrangling, as we mentioned earlier.
This includes dealing with bad data, imputing missing values, transforming specific values, and so on. Features are the final end result of the process of feature engineering, depicting various representations of the underlying data. Let’s now look at a couple of definitions and quotes relevant to feature engineering from several renowned people in the world of data science! Renowned computer and data scientist Andrew Ng talks about Machine Learning and feature engineering.

“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied Machine Learning’ is basically feature engineering.” —Prof. Andrew Ng

This basically reinforces what we mentioned earlier about data scientists spending close to 80% of their time engineering features, which is a difficult and time-consuming process requiring both domain knowledge and mathematical computations. Besides this, practical or applied Machine Learning is mostly feature engineering, because the time taken in building and evaluating models is considerably less than the total time spent on feature engineering. However, this doesn’t mean that modeling and evaluation are any less important than feature engineering.

We will now look at a definition of feature engineering by Dr. Jason Brownlee, a data scientist and ML practitioner who provides a lot of excellent resources at http://machinelearningmastery.com with regard to Machine Learning and data science. Dr. Brownlee defines feature engineering as follows.

“Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.” —Dr. Jason Brownlee

Let’s spend some more time on this definition of feature engineering.
It tells us that the process of feature engineering involves transforming data into features while taking into account several aspects pertaining to the problem, model, performance, and data. These aspects are highlighted in this definition and are explained in further detail as follows.

• Raw data: This is data in its native form after data retrieval from the source. Typically some amount of data processing and wrangling is done before the actual process of feature engineering.

• Features: These are specific representations obtained from the raw data after the process of feature engineering.

• The underlying problem: This refers to the specific business problem or use case we want to solve with the help of Machine Learning. The business problem is typically converted into a Machine Learning task.

• The predictive models: Typically feature engineering is used for extracting features to build Machine Learning models that learn about the data and the problem to be solved from these features. Supervised predictive models are widely used for solving diverse problems.

• Model accuracy: This refers to the model performance metrics that are used to evaluate the model.

• Unseen data: This is basically new data that was not used previously to build or train the model. The model is expected to learn and generalize well on unseen data based on good quality features.

Thus feature engineering is the process of transforming data into features that act as inputs for Machine Learning models, such that good quality features help improve overall model performance. Features are also very much dependent on the underlying problem. Thus, even though the Machine Learning task might be the same in different scenarios, like classifying e-mails into spam and non-spam or classifying handwritten digits, the features extracted in each scenario will be very different from the other. By now you must be getting a good grasp on the idea and significance of feature engineering.
Always remember that for solving any Machine Learning problem, feature engineering is the key! This is in fact reinforced by Prof. Pedro Domingos from the University of Washington, in his paper titled "A Few Useful Things to Know about Machine Learning", available at http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf, which tells us the following.

"At the end of the day, some Machine Learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." —Prof. Pedro Domingos

Feature engineering is indeed both an art and a science of transforming data into features for feeding into models. Sometimes you need a combination of domain knowledge, experience, intuition, and mathematical transformations to give you the features you need. By solving more problems over time, you will gain the experience to know which features might be best suited to a problem. Hence do not be overwhelmed; with practice you will master feature engineering over time. The following list depicts some examples of engineered features.

• Deriving a person's age from their birth date and the current date

• Getting the average and median view counts of specific songs and music videos

• Extracting word and phrase occurrence counts from text documents

• Extracting pixel information from raw images

• Tabulating occurrences of various grades obtained by students

The final quote to whet your appetite for feature engineering is from renowned Kaggler Xavier Conort. Most of you already know that tough Machine Learning problems are regularly posted on Kaggle, and the competitions are usually open to everyone. Xavier's thoughts on feature engineering are as follows.

"The algorithms we used are very standard for Kagglers. ...We spent most of our efforts in feature engineering.
...We were also very careful to discard features likely to expose us to the risk of over-fitting our model." —Xavier Conort

This should give you a good idea of what feature engineering is, the various aspects surrounding it, and a very basic introduction to why we really need it. In the following section, we will expand more on why we need feature engineering and its benefits and advantages.

Why Feature Engineering?

We defined feature engineering in the previous section and also touched upon the basics pertaining to its importance. Let's now look at why we need feature engineering and how it can be an advantage when we are building Machine Learning models and working with data.

• Better representation of data: Features are basically various representations of the underlying raw data. These representations can be better understood by Machine Learning algorithms. Besides this, we can also often easily visualize these representations. A simple example would be to visualize the frequent word occurrences of a newspaper article as opposed to being totally perplexed as to what to do with the raw text!

• Better performing models: The right features tend to give models that outperform other models no matter how complex the algorithm is. In general, if you have the right feature set, even a simple model will perform well and give the desired results. In short, better features make better models.

• Essential for model building and evaluation: We have mentioned this numerous times by now: raw data cannot be used directly to build Machine Learning models. Get your data, extract features, and start building models! Also, when evaluating model performance and tuning models, you can iterate over your feature set to choose the right set of features for the best model.
• More flexibility on data types: While it is definitely easier to use numeric data types directly with Machine Learning algorithms with little or no data transformation, the real challenge is to build models on more complex data types like text, images, and even videos. Feature engineering helps us build models on diverse data types by applying the necessary transformations, and enables us to work even on complex unstructured data.

• Emphasis on the business and domain: Data scientists and analysts are usually busy processing and cleaning data and building models as part of their day-to-day tasks. This often creates a gap between the business stakeholders and the technical/analytics team. Feature engineering enables data scientists to take a step back and try to understand the domain and the business better, by taking valuable inputs from the business and subject matter experts. This is necessary to create and select features that might be useful for building the right model to solve the problem. Pure statistical and mathematical knowledge is rarely sufficient to solve a complex real-world problem. Hence feature engineering places the focus on the business and the domain of the problem when building features.

This list, though not an exhaustive one, gives us pretty good insight into the importance of feature engineering and how it is an essential aspect of building Machine Learning models. The importance of the problem to be solved and of the domain also matters greatly in feature engineering.

How Do You Engineer Features?

There are no fixed rules for engineering features. It involves using a combination of domain knowledge, business constraints, hand-crafted transformations, and mathematical transformations to turn the raw data into the desired features. Different data types have different techniques for feature extraction.
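As a tiny, concrete illustration of one of the engineered features listed earlier (deriving a person's age from their birth date), consider the following minimal sketch. The birth dates and the fixed reference date below are made up for demonstration; in practice the reference date would be the current date.

```python
# Minimal sketch: engineering an "age in whole years" feature from a birth date.
from datetime import date

def age_in_years(birth_date, on_date):
    """Whole years elapsed between birth_date and on_date."""
    years = on_date.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in on_date's year
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

reference = date(2017, 6, 1)  # fixed reference date so the result is reproducible
ages = [age_in_years(d, reference) for d in (date(1990, 7, 15), date(1985, 3, 2))]
```

The same few lines of domain logic (calendar arithmetic here) turn a raw attribute into a directly usable numeric feature.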
Hence in this chapter, we focus on various feature engineering techniques and strategies for the following major data types.

• Numeric data
• Categorical data
• Text data
• Temporal data
• Image data

Subsequent sections in this chapter focus on dealing with these diverse data types and the specific techniques that can be applied to engineer features. You can use them as a reference and guidebook for engineering features from your own datasets in the future. Another aspect of feature engineering has recently gained prominence. Here, you do not use hand-crafted features; instead, you make the machine itself try to detect patterns and extract useful data representations from the raw data, which can then be used as features. This process is also known as automated feature generation. Deep Learning has proved to be extremely effective in this area, and neural network architectures like convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Long Short-Term Memory networks (LSTMs) are extensively used for automated feature engineering and extraction. Let's dive into the world of feature engineering now with some real-world datasets and examples.

Feature Engineering on Numeric Data

Numeric data, fields, variables, or features typically represent data in the form of scalar information denoting an observation, recording, or measurement. Of course, numeric data can also be represented as a vector of scalars, where each specific entity in the vector is a numeric data point in itself. Integers and floats are the most common and widely used numeric data types. Besides this, numeric data is perhaps the easiest to process and is often used directly by Machine Learning models. If you remember, we talked about numeric data previously in the "Data Description" section in Chapter 3.
Even though numeric data can be directly fed into Machine Learning models, you still need to engineer features that are relevant to the scenario, problem, and domain before building a model. Hence the need for feature engineering remains. Important aspects of numeric features include feature scale and distribution, and you will observe some of these aspects in the examples in this section. In some scenarios, we need to apply specific transformations to change the scale of numeric values, and in other scenarios we need to change the overall distribution of the numeric values, such as transforming a skewed distribution into a normal distribution. The code used for this section is available in the code files for this chapter. You can load feature_engineering_numeric.py directly and start running the examples or use the jupyter notebook, Feature Engineering on Numeric Data.ipynb, for a more interactive experience. Before we begin, let's load the following dependencies and configuration settings.

In [1]: import pandas as pd
   ...: import matplotlib.pyplot as plt
   ...: import matplotlib as mpl
   ...: import numpy as np
   ...: import scipy.stats as spstats
   ...:
   ...: %matplotlib inline
   ...: mpl.style.reload_library()
   ...: mpl.style.use('classic')
   ...: mpl.rcParams['figure.facecolor'] = (1, 1, 1, 0)
   ...: mpl.rcParams['figure.figsize'] = [6.0, 4.0]
   ...: mpl.rcParams['figure.dpi'] = 100

Now that we have the initial dependencies loaded, let's look at some ways to engineer features from numeric data in the following sections.

Raw Measures

Just like we mentioned earlier, numeric features can often be fed directly to Machine Learning models, since they are in a format that can be easily understood, interpreted, and operated on. Raw measures typically indicate numeric variables used directly as features, without any form of transformation or engineering. Typically these features can indicate values or counts.
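To make the idea of raw measures tangible before we load a real dataset, here is a tiny sketch with made-up numbers standing in for the Pokémon stats loaded next: the numeric columns are usable as features exactly as they are, and describe() summarizes their scale.

```python
# Made-up numeric stats used as raw-value features; no transformation is applied.
import pandas as pd

stats_df = pd.DataFrame({'HP': [45, 60, 80], 'Attack': [49, 62, 82]})
summary = stats_df.describe()  # count, mean, std, and quartiles per column
```

Inspecting such summary statistics early also tells you whether later steps like scaling or binning will be needed.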
Values

Usually, scalar values in their raw form indicate a specific measurement, metric, or observation belonging to a specific variable or field. The semantics of the field are usually obtained from the field name itself or from a data dictionary, if present. Let's load a dataset now about Pokémon! This dataset is also available on Kaggle. If you do not know, Pokémon is a huge media franchise surrounding fictional characters called Pokémon, which stands for pocket monsters. In short, you can think of them as fictional animals with superpowers! The following snippet gives us an idea about this dataset.

In [2]: poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
   ...: poke_df.head()

Figure 4-3. Raw data from the Pokémon dataset

If you observe the dataset depicted in Figure 4-3, there are several attributes that represent numeric raw values which can be used directly. The following snippet depicts some of these features with more emphasis.

In [3]: poke_df[['HP', 'Attack', 'Defense']].head()
Out[3]:
   HP  Attack  Defense
0  45      49       49
1  60      62       63
2  80      82       83
3  80     100      123
4  39      52       43

You can directly use these attributes as the features depicted in the previous dataframe. These include each Pokémon's HP (Hit Points), Attack, and Defense stats. In fact, we can also compute some basic statistical measures on these fields using the following code.
In [4]: poke_df[['HP', 'Attack', 'Defense']].describe()
Out[4]:
               HP      Attack     Defense
count  800.000000  800.000000  800.000000
mean    69.258750   79.001250   73.842500
std     25.534669   32.457366   31.183501
min      1.000000    5.000000    5.000000
25%     50.000000   55.000000   50.000000
50%     65.000000   75.000000   70.000000
75%     80.000000  100.000000   90.000000
max    255.000000  190.000000  230.000000

We can see multiple statistical measures like count, average, standard deviation, and quartiles for each of the numeric features in this output. Try plotting their distributions if possible!

Counts

Raw numeric measures can also indicate counts, frequencies, and occurrences of specific attributes. Let's look at a sample of data from the million-song dataset, which depicts counts or frequencies of songs that have been heard by various users.

In [5]: popsong_df = pd.read_csv('datasets/song_views.csv', encoding='utf-8')
   ...: popsong_df.head(10)

Figure 4-4. Song listen counts as a numeric feature

We can see that the listen_count field in the data depicted in Figure 4-4 can be directly used as a count/frequency based numeric feature.

Binarization

Often raw numeric frequencies or counts are not necessary in building models, especially with regard to methods applied in building recommender engines. For example, if I want to know whether a person is interested in or has listened to a particular song, I do not need to know the total number of times he/she has listened to that song. I am more concerned with the various songs he/she has listened to. In this case, a binary feature is preferred over a count based feature. We can binarize our listen_count field from our earlier dataset in the following way.
In [6]: watched = np.array(popsong_df['listen_count'])
   ...: watched[watched >= 1] = 1
   ...: popsong_df['watched'] = watched

You can also use scikit-learn's Binarizer class from its preprocessing module to perform the same task instead of using numpy arrays, as depicted in the following code.

In [7]: from sklearn.preprocessing import Binarizer
   ...:
   ...: bn = Binarizer(threshold=0.9)
   ...: pd_watched = bn.transform([popsong_df['listen_count']])[0]
   ...: popsong_df['pd_watched'] = pd_watched
   ...: popsong_df.head(11)

Figure 4-5. Binarizing song counts

You can clearly see from Figure 4-5 that both methods have produced the same result, depicted in the features watched and pd_watched. Thus, we have the song listen counts as a binarized feature indicating whether or not the song was listened to by each user.

Rounding

Often when dealing with numeric attributes like proportions or percentages, we may not need values with a high amount of precision. Hence it often makes sense to round off these high-precision percentages into numeric integers. These integers can then be directly used as raw numeric values or even as categorical (discrete class-based) features. Let's try applying this concept to a dummy dataset depicting store items and their popularity percentages.
In [8]: items_popularity = pd.read_csv('datasets/item_popularity.csv', encoding='utf-8')
   ...: # rounding off percentages
   ...: items_popularity['popularity_scale_10'] = np.array(
   ...:         np.round((items_popularity['pop_percent'] * 10)), dtype='int')
   ...: items_popularity['popularity_scale_100'] = np.array(
   ...:         np.round((items_popularity['pop_percent'] * 100)), dtype='int')
   ...: items_popularity
Out[8]:
    item_id  pop_percent  popularity_scale_10  popularity_scale_100
0  it_01345      0.98324                   10                    98
1  it_03431      0.56123                    6                    56
2  it_04572      0.12098                    1                    12
3  it_98021      0.35476                    4                    35
4  it_01298      0.92101                    9                    92
5  it_90120      0.81212                    8                    81
6  it_10123      0.56502                    6                    57

Thus after our rounding operations, you can see the new features in the data depicted in the previous dataframe. Basically we tried two forms of rounding. The features now depict the item popularities both on a scale of 1-10 and on a scale of 1-100. You can use these values as either numeric or categorical features depending on the scenario and problem.

Interactions

A model is usually built in such a way that we try to model the output responses (discrete classes or continuous values) as a function of the input feature variables. For example, a simple linear regression equation can be depicted as y = c1x1 + c2x2 + ... + cnxn, where the input features are depicted by the variables {x1, x2, ..., xn} having weights or coefficients {c1, c2, ..., cn} respectively, and the goal is to predict the response y. In this case, this simple linear model depicts the relationship between the output and the inputs purely based on the individual, separate input features.
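The linear formulation above is just a weighted sum of the input features, which a two-feature numeric sketch makes concrete (the coefficients and inputs below are illustrative only, not fitted values).

```python
# y = c1*x1 + c2*x2 computed as a matrix-vector product; values are made up.
import numpy as np

c = np.array([2.0, -1.0])      # coefficients c1, c2
X = np.array([[3.0, 1.0],      # each row is one observation (x1, x2)
              [0.5, 2.0]])
y = X @ c                      # predicted responses, one per row
```

Because the model is a pure sum of per-feature terms, no term can depend on two features at once; that is exactly the gap interaction features fill.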
However, in several real-world datasets and scenarios, it often makes sense to also try to capture the interactions between these feature variables as part of the input feature set. A simple depiction of the extension of the above linear regression formulation with interaction features would be y = c1x1 + c2x2 + ... + cnxn + c11x1^2 + c22x2^2 + c12x1x2 + ..., where features like {x1x2, x1^2, ...} denote the interaction features. Let's try engineering some interaction features on our Pokémon dataset now.

In [9]: atk_def = poke_df[['Attack', 'Defense']]
   ...: atk_def.head()
Out[9]:
   Attack  Defense
0      49       49
1      62       63
2      82       83
3     100      123
4      52       43

We can see in this output the two numeric features depicting Pokémon attack and defense. The following code helps us build interaction features from these two features. We will build features up to the second degree using the PolynomialFeatures class from scikit-learn's API.

In [10]: from sklearn.preprocessing import PolynomialFeatures
    ...:
    ...: pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
    ...: res = pf.fit_transform(atk_def)
    ...: res
Out[10]:
array([[    49.,     49.,   2401.,   2401.,   2401.],
       [    62.,     63.,   3844.,   3906.,   3969.],
       [    82.,     83.,   6724.,   6806.,   6889.],
       ...,
       [   110.,     60.,  12100.,   6600.,   3600.],
       [   160.,     60.,  25600.,   9600.,   3600.],
       [   110.,    120.,  12100.,  13200.,  14400.]])

We can clearly see from this output that we have a total of five features, including the new interaction features. We can see the degree of each feature in the matrix using the following snippet.
In [11]: pd.DataFrame(pf.powers_, columns=['Attack_degree', 'Defense_degree'])
Out[11]:
   Attack_degree  Defense_degree
0              1               0
1              0               1
2              2               0
3              1               1
4              0               2

Now that we know what each feature actually represents from the degrees depicted, we can assign a name to each feature as follows to get the updated feature set.

In [12]: intr_features = pd.DataFrame(res,
    ...:                              columns=['Attack', 'Defense',
    ...:                                       'Attack^2', 'Attack x Defense', 'Defense^2'])
    ...: intr_features.head(5)
Out[12]:
   Attack  Defense  Attack^2  Attack x Defense  Defense^2
0    49.0     49.0    2401.0            2401.0     2401.0
1    62.0     63.0    3844.0            3906.0     3969.0
2    82.0     83.0    6724.0            6806.0     6889.0
3   100.0    123.0   10000.0           12300.0    15129.0
4    52.0     43.0    2704.0            2236.0     1849.0

Thus we can see our original and interaction features in this output. The fit_transform(...) API function from scikit-learn is useful for building a feature engineering representation object on the training data, which can be reused on new data during model predictions by calling the transform(...) function. Let's take some sample new observations for the Pokémon attack and defense features and try to transform them using this same mechanism.

In [13]: new_df = pd.DataFrame([[95, 75], [121, 120], [77, 60]],
    ...:                       columns=['Attack', 'Defense'])
    ...: new_df
Out[13]:
   Attack  Defense
0      95       75
1     121      120
2      77       60

We can now use the pf object that we created earlier and transform these input features to give us the interaction features as follows.
In [14]: new_res = pf.transform(new_df)
    ...: new_intr_features = pd.DataFrame(new_res,
    ...:                                  columns=['Attack', 'Defense',
    ...:                                           'Attack^2', 'Attack x Defense', 'Defense^2'])
    ...: new_intr_features
Out[14]:
   Attack  Defense  Attack^2  Attack x Defense  Defense^2
0    95.0     75.0    9025.0            7125.0     5625.0
1   121.0    120.0   14641.0           14520.0    14400.0
2    77.0     60.0    5929.0            4620.0     3600.0

Thus you can see that we have successfully obtained the necessary interaction features for the new dataset. Try building interaction features on three or more features now!

Binning

Often when working with numeric data, you might come across features or attributes which depict raw measures such as values or frequencies. In many cases, the distributions of these attributes are skewed, in the sense that some sets of values will occur a lot and some will be very rare. Besides that, there is also the added problem of the varying range of these values. Suppose we are talking about song or video view counts. In some cases, the view counts will be abnormally large and in some cases very small. Directly using these features in modeling might cause issues. Metrics like similarity measures, cluster distances, regression coefficients, and more might be adversely affected if we use raw numeric features having values that range across multiple orders of magnitude. There are various ways to engineer features from these raw values so we can avoid these issues. These methods include transformations, scaling, and binning/quantization. In this section, we will talk about binning, which is also known as quantization. The operation of binning is used for transforming continuous numeric values into discrete ones. These discrete numbers can be thought of as bins into which the raw values or numbers are binned or grouped.
Each bin represents a specific degree of intensity and has a specific range of values which must fall into that bin. There are various ways of binning data, which include fixed-width and adaptive binning. Specific techniques can be employed for each binning process. We will use a dataset extracted from the 2016 FreeCodeCamp Developer/Coder survey, which talks about various attributes pertaining to coders and software developers. You can check it out yourself at https://github.com/freeCodeCamp/2016-new-coder-survey for more details. Let's load the dataset and take a peek at some interesting attributes.

In [15]: fcc_survey_df = pd.read_csv('datasets/fcc_2016_coder_survey_subset.csv',
    ...:                             encoding='utf-8')
    ...: fcc_survey_df[['ID.x', 'EmploymentField', 'Age', 'Income']].head()

Figure 4-6. Important attributes from the FCC coder survey dataset

The dataframe depicted in Figure 4-6 shows us some interesting attributes of the coder survey dataset, some of which we will be analyzing in this section. The ID.x variable is basically a unique identifier for each coder/developer who took the survey, and the other fields are pretty self-explanatory.

Fixed-Width Binning

In fixed-width binning, as the name indicates, we have specific fixed widths for each of the bins, which are usually pre-defined by the user analyzing the data. Each bin has a pre-fixed range of values which should be assigned to that bin on the basis of some business or custom logic, rules, or necessary transformations. Binning based on rounding is one way, where you can use the rounding operation that we discussed earlier to bin raw values. Let's consider the Age feature from the coder survey dataset. The following code shows the distribution of developer ages who took the survey.
In [16]: fig, ax = plt.subplots()
    ...: fcc_survey_df['Age'].hist(color='#A9C5D3')
    ...: ax.set_title('Developer Age Histogram', fontsize=12)
    ...: ax.set_xlabel('Age', fontsize=12)
    ...: ax.set_ylabel('Frequency', fontsize=12)

Figure 4-7. Histogram depicting developer age distribution

The histogram in Figure 4-7 depicts the distribution of developer ages, which is slightly right skewed, as expected. Let's try to assign these raw age values to specific bins based on the following logic.

Age Range: Bin
---------------
 0 -  9  : 0
10 - 19  : 1
20 - 29  : 2
30 - 39  : 3
40 - 49  : 4
50 - 59  : 5
60 - 69  : 6
  ... and so on

We can easily do this using what we learned in the "Rounding" section earlier, where we round off these raw age values by taking the floor value after dividing by 10. The following code depicts the same.

In [17]: fcc_survey_df['Age_bin_round'] = np.array(np.floor(
    ...:                                      np.array(fcc_survey_df['Age']) / 10.))
    ...: fcc_survey_df[['ID.x', 'Age', 'Age_bin_round']].iloc[1071:1076]
Out[17]:
                                  ID.x   Age  Age_bin_round
1071  6a02aa4618c99fdb3e24de522a099431  17.0            1.0
1072  f0e5e47278c5f248fe861c5f7214c07a  38.0            3.0
1073  6e14f6d0779b7e424fa3fdd9e4bd3bf9  21.0            2.0
1074  c2654c07dc929cdf3dad4d1aec4ffbb3  53.0            5.0
1075  f07449fc9339b2e57703ec7886232523  35.0            3.0

We take a specific slice of the dataset (rows 1071-1076) to depict users of varying ages. You can see that the corresponding bin for each age has been assigned based on rounding. But what if we need more flexibility? What if I want to decide and fix the bin widths myself? Binning based on custom ranges is the answer to all our questions about fixed-width binning, some of which I just mentioned.
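The floor-based binning just shown can be reproduced in isolation; the ages below are made up but mirror the sliced rows depicted above.

```python
# Fixed-width binning via rounding: floor(age / 10) maps 0-9 -> 0, 10-19 -> 1, ...
import numpy as np

ages = np.array([17.0, 38.0, 21.0, 53.0, 35.0])  # made-up ages
age_bin_round = np.floor(ages / 10.0)
```

Note that this one-liner hard-wires a bin width of 10; the custom-range approach described next removes that restriction.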
Let's define some custom age ranges for binning developer ages using the following scheme.

Age Range : Bin
---------------
 0 -  15  : 1
16 -  30  : 2
31 -  45  : 3
46 -  60  : 4
61 -  75  : 5
76 - 100  : 6

Based on this custom binning scheme, we will now label the bins for each developer age value with the help of the following code. We will store both the bin range as well as the corresponding label.

In [18]: bin_ranges = [0, 15, 30, 45, 60, 75, 100]
    ...: bin_names = [1, 2, 3, 4, 5, 6]
    ...: fcc_survey_df['Age_bin_custom_range'] = pd.cut(np.array(fcc_survey_df['Age']),
    ...:                                                bins=bin_ranges)
    ...: fcc_survey_df['Age_bin_custom_label'] = pd.cut(np.array(fcc_survey_df['Age']),
    ...:                                                bins=bin_ranges, labels=bin_names)
    ...: fcc_survey_df[['ID.x', 'Age', 'Age_bin_round',
    ...:                'Age_bin_custom_range', 'Age_bin_custom_label']].iloc[1071:1076]

Figure 4-8. Custom age binning for developer ages

We can see from the dataframe output in Figure 4-8 that the custom bins based on our scheme have been assigned for each developer's age. Try out some of your own binning schemes!

Adaptive Binning

So far, we have decided the bin widths and ranges ourselves in fixed-width binning. However, this technique can lead to irregular bins that are not uniform in terms of the number of data points or values which fall in each bin. Some of the bins might be densely populated and some of them might be sparsely populated or even empty! Adaptive binning is a safer and better approach, where we use the data distribution itself to decide what the appropriate bins should be. Quantile based binning is a good strategy to use for adaptive binning. Quantiles are specific values or cut-points which help in partitioning the continuous valued distribution of a specific numeric field into discrete contiguous bins or intervals.
Thus, q-Quantiles help in partitioning a numeric attribute into q equal partitions. Popular examples of quantiles include the 2-Quantile, known as the median, which divides the data distribution into two equal bins; 4-Quantiles, known as quartiles, which divide the data into four equal bins; and 10-Quantiles, also known as deciles, which create 10 equal bins. Let's now look at a slice of data pertaining to developer income values in our coder survey dataset.

In [19]: fcc_survey_df[['ID.x', 'Age', 'Income']].iloc[4:9]
Out[19]:
                               ID.x   Age   Income
4  9368291c93d5d5f5c8cdb1a575e18bec  20.0   6000.0
5  dd0e77eab9270e4b67c19b0d6bbf621b  34.0  40000.0
6  7599c0aa0419b59fd11ffede98a3665d  23.0  32000.0
7  6dff182db452487f07a47596f314bddc  35.0  40000.0
8  9dc233f8ed1c6eb2432672ab4bb39249  33.0  80000.0

This slice of data shows us the income values for each developer in our dataset. Let's look at the whole data distribution for this Income variable now using the following code.

In [20]: fig, ax = plt.subplots()
    ...: fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3')
    ...: ax.set_title('Developer Income Histogram', fontsize=12)
    ...: ax.set_xlabel('Developer Income', fontsize=12)
    ...: ax.set_ylabel('Frequency', fontsize=12)

Figure 4-9. Histogram depicting developer income distribution

We can see from the distribution depicted in Figure 4-9 that, as expected, there is a right skew, with fewer developers earning more money and vice versa. Let's take a 4-Quantile or quartile based adaptive binning scheme. The following snippet helps us obtain the income values that fall on the four quartiles of the distribution.

In [21]: quantile_list = [0, .25, .5, .75, 1.]
    ...: quantiles = fcc_survey_df['Income'].quantile(quantile_list)
    ...: quantiles
Out[21]:
0.00      6000.0
0.25     20000.0
0.50     37000.0
0.75     60000.0
1.00    200000.0

To better visualize the quartiles obtained in this output, we can plot them on our data distribution using the following code snippet.

In [22]: fig, ax = plt.subplots()
    ...: fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3')
    ...:
    ...: for quantile in quantiles:
    ...:     qvl = plt.axvline(quantile, color='r')
    ...: ax.legend([qvl], ['Quantiles'], fontsize=10)
    ...:
    ...: ax.set_title('Developer Income Histogram with Quantiles', fontsize=12)
    ...: ax.set_xlabel('Developer Income', fontsize=12)
    ...: ax.set_ylabel('Frequency', fontsize=12)

Figure 4-10. Histogram depicting developer income distribution with quartile values

The 4-Quantile values for the income attribute are depicted by red vertical lines in Figure 4-10. Let's now use quantile binning to bin each of the developer income values into specific bins using the following code.

In [23]: quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
    ...: fcc_survey_df['Income_quantile_range'] = pd.qcut(fcc_survey_df['Income'],
    ...:                                                  q=quantile_list)
    ...: fcc_survey_df['Income_quantile_label'] = pd.qcut(fcc_survey_df['Income'],
    ...:                                                  q=quantile_list,
    ...:                                                  labels=quantile_labels)
    ...: fcc_survey_df[['ID.x', 'Age', 'Income',
    ...:                'Income_quantile_range', 'Income_quantile_label']].iloc[4:9]

Figure 4-11.
Quantile based bin ranges and labels for developer incomes

The result dataframe depicted in Figure 4-11 clearly shows the quantile based bin range and the corresponding label assigned for each developer income value in the Income_quantile_range and Income_quantile_label features, respectively.

Statistical Transformations

Let's look at a different strategy of feature engineering on numerical data: statistical or mathematical transformations. In this section, we will look at the log transform as well as the Box-Cox transform. Both of these transform functions belong to the power transform family of functions. These functions are typically used to create monotonic data transformations, but their main significance is that they help in stabilizing variance, making the distribution more closely approximate a normal distribution, and making the data independent of the mean based on its distribution. Several transformations are also used as a part of feature scaling, which we cover in a later section.

Log Transform

The log transform belongs to the power transform family of functions. This function can be defined as y = log_b(x), which reads as "log of x to the base b is equal to y". This translates to b^y = x, which indicates the power to which the base b must be raised in order to get x. The natural logarithm uses the base b = e, where e = 2.71828 is popularly known as Euler's number. You can also use base b = 10, which is popular in the decimal system. Log transforms are useful when applied to skewed distributions because they tend to expand the values that fall in the range of lower magnitudes and to compress or reduce the values that fall in the range of higher magnitudes. This tends to make the skewed distribution as normal-like as possible. Let's use the log transform on the developer income feature from our coder survey dataset.
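As a quick aside, the base/exponent relationship b^y = x described above can be sanity-checked numerically; this throwaway sketch uses NumPy (not part of the chapter's listings):

```python
import numpy as np

# log of x to the base 10: y such that 10**y == x
x = 1000.0
y = np.log10(x)                        # 3.0, since 10**3 == 1000
assert np.isclose(10 ** y, x)          # raising the base to y recovers x
assert np.isclose(np.log(np.e), 1.0)   # natural log uses base e (Euler's number)
```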
In [24]: fcc_survey_df['Income_log'] = np.log((1 + fcc_survey_df['Income']))
    ...: fcc_survey_df[['ID.x', 'Age', 'Income', 'Income_log']].iloc[4:9]
Out[24]:
                               ID.x   Age   Income  Income_log
4  9368291c93d5d5f5c8cdb1a575e18bec  20.0   6000.0    8.699681
5  dd0e77eab9270e4b67c19b0d6bbf621b  34.0  40000.0   10.596660
6  7599c0aa0419b59fd11ffede98a3665d  23.0  32000.0   10.373522
7  6dff182db452487f07a47596f314bddc  35.0  40000.0   10.596660
8  9dc233f8ed1c6eb2432672ab4bb39249  33.0  80000.0   11.289794

The dataframe obtained in this output depicts the log transformed income feature in the Income_log field. Let's now plot the data distribution of this transformed feature using the following code.

In [25]: income_log_mean = np.round(np.mean(fcc_survey_df['Income_log']), 2)
    ...:
    ...: fig, ax = plt.subplots()
    ...: fcc_survey_df['Income_log'].hist(bins=30, color='#A9C5D3')
    ...: plt.axvline(income_log_mean, color='r')
    ...: ax.set_title('Developer Income Histogram after Log Transform', fontsize=12)
    ...: ax.set_xlabel('Developer Income (log scale)', fontsize=12)
    ...: ax.set_ylabel('Frequency', fontsize=12)
    ...: ax.text(11.5, 450, r'$\mu$='+str(income_log_mean), fontsize=10)

Figure 4-12. Histogram depicting developer income distribution after log transform

Thus we can clearly see that the original developer income distribution, which was right skewed in Figure 4-10, is more Gaussian or normal-like in Figure 4-12 after applying the log transform.

Box-Cox Transform

Let's now look at the Box-Cox transform, another popular function belonging to the power transform family of functions. This function has a prerequisite that the numeric values to be transformed must be positive (similar to what the log transform expects). In case they are negative, shifting them using a constant value helps.
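That shifting step can be sketched as follows; the values here are hypothetical, and scipy.stats is imported as spstats to match the chapter's alias:

```python
import numpy as np
from scipy import stats as spstats

# Hypothetical feature containing negative values (illustrative only)
values = np.array([-5.0, 3.0, 12.0, 47.0])
shift = 1.0 - values.min()   # constant chosen so every shifted value is >= 1
shifted = values + shift     # strictly positive, so Box-Cox is now applicable
transformed, fitted_lambda = spstats.boxcox(shifted)
```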
Mathematically, the Box-Cox transform function can be defined as:

    y = f(x, λ) = (x^λ − 1) / λ    for λ ≠ 0
    y = f(x, λ) = log_e(x)         for λ = 0

such that the resulting transformed output y is a function of the input x and the transformation parameter λ, and when λ = 0 the transform reduces to the natural log transform, which we discussed earlier. The optimal value of λ is usually determined using maximum likelihood or log-likelihood estimation. Let's apply the Box-Cox transform on our developer income feature. To do this, we first get the optimal lambda value from the data distribution after removing the null values, using the following code.

In [26]: # get optimal lambda value from non-null income values
    ...: income = np.array(fcc_survey_df['Income'])
    ...: income_clean = income[~np.isnan(income)]
    ...: l, opt_lambda = spstats.boxcox(income_clean)
    ...: print('Optimal lambda value:', opt_lambda)
Optimal lambda value: 0.117991239456

Now that we have obtained the optimal λ value, let's use the Box-Cox transform for two values of λ, namely λ = 0 and λ = λ_optimal, and transform the raw numeric values pertaining to developer incomes.

In [27]: fcc_survey_df['Income_boxcox_lambda_0'] = spstats.boxcox((1 + fcc_survey_df['Income']),
    ...:                                                          lmbda=0)
    ...: fcc_survey_df['Income_boxcox_lambda_opt'] = spstats.boxcox(fcc_survey_df['Income'],
    ...:                                                            lmbda=opt_lambda)
    ...: fcc_survey_df[['ID.x', 'Age', 'Income', 'Income_log',
    ...:                'Income_boxcox_lambda_0', 'Income_boxcox_lambda_opt']].iloc[4:9]

Figure 4-13.
Dataframe depicting the developer income feature after the Box-Cox transform

The dataframe obtained in the output shown in Figure 4-13 depicts the income feature after applying the Box-Cox transform for λ = 0 and λ = λ_optimal in the Income_boxcox_lambda_0 and Income_boxcox_lambda_opt fields, respectively. Also, as expected, the Income_log field has the same values as the Box-Cox transform with λ = 0. Let's now plot the data distribution for the Box-Cox transformed developer income values with the optimal lambda. See Figure 4-14.

In [30]: income_boxcox_mean = np.round(np.mean(fcc_survey_df['Income_boxcox_lambda_opt']), 2)
    ...:
    ...: fig, ax = plt.subplots()
    ...: fcc_survey_df['Income_boxcox_lambda_opt'].hist(bins=30, color='#A9C5D3')
    ...: plt.axvline(income_boxcox_mean, color='r')
    ...: ax.set_title('Developer Income Histogram after Box-Cox Transform', fontsize=12)
    ...: ax.set_xlabel('Developer Income (Box-Cox transform)', fontsize=12)
    ...: ax.set_ylabel('Frequency', fontsize=12)
    ...: ax.text(24, 450, r'$\mu$='+str(income_boxcox_mean), fontsize=10)

Figure 4-14. Histogram depicting developer income distribution after Box-Cox transform (λ = λ_optimal)

The distribution of the transformed numeric values for developer income after the Box-Cox transform also looks similar to the one we obtained after the log transform: it is more normal-like, and the extreme right skew that was present in the raw data has been minimized.

Feature Engineering on Categorical Data

So far, we have been working on continuous numeric data and you have seen various techniques for engineering features from it. We will now look at another structured data type, which is categorical data. Any attribute or feature that is categorical in nature represents discrete values that belong to a specific finite set of categories or classes. Category or class labels can be text or numeric in nature.
Usually there are two types of categorical variables: nominal and ordinal. Nominal categorical features have no concept of ordering among their values, i.e., it does not make sense to sort or order them. Movie or video game genres, weather seasons, and country names are some examples of nominal attributes. Ordinal categorical variables can be ordered and sorted on the basis of their values, and hence these values have specific significance such that their order makes sense. Examples of ordinal attributes include clothing size, education level, and so on.

In this section, we look at various strategies and techniques for transforming and encoding categorical features and attributes. The code used for this section is available in the code files for this chapter. You can load feature_engineering_categorical.py directly and start running the examples or use the jupyter notebook, Feature Engineering on Categorical Data.ipynb, for a more interactive experience. Before we begin, let's load the following dependencies.

In [1]: import pandas as pd
   ...: import numpy as np

Once you have these dependencies loaded, let's get started and engineer some features from categorical data.

Transforming Nominal Features

Nominal features or attributes are categorical variables that usually have a finite set of distinct discrete values. Often these values are in string or text format, and Machine Learning algorithms cannot understand them directly. Hence you will usually need to transform these features into a more representative numeric format. Let's look at a new dataset pertaining to video game sales. This dataset is also available on Kaggle (https://www.kaggle.com/gregorut/videogamesales). We have downloaded a copy of it for your convenience. The following code helps us load this dataset and view some of the attributes of interest.
In [2]: vg_df = pd.read_csv('datasets/vgsales.csv', encoding='utf-8')
   ...: vg_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher']].iloc[1:7]
Out[2]:
                       Name Platform    Year         Genre Publisher
1         Super Mario Bros.      NES  1985.0      Platform  Nintendo
2            Mario Kart Wii      Wii  2008.0        Racing  Nintendo
3         Wii Sports Resort      Wii  2009.0        Sports  Nintendo
4  Pokemon Red/Pokemon Blue       GB  1996.0  Role-Playing  Nintendo
5                    Tetris       GB  1989.0        Puzzle  Nintendo
6     New Super Mario Bros.       DS  2006.0      Platform  Nintendo

The dataset depicted in this dataframe shows us various attributes pertaining to video games. Features like Platform, Genre, and Publisher are nominal categorical variables. Let's now try to transform the video game Genre feature into a numeric representation. Do note that this doesn't mean the transformed feature becomes a numeric variable; it will still be a discrete valued categorical feature, with numbers instead of text for each genre. The following code depicts the total distinct genre labels for video games.

In [3]: genres = np.unique(vg_df['Genre'])
   ...: genres
Out[3]:
array(['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle',
       'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports',
       'Strategy'], dtype=object)

This output tells us we have 12 distinct video game genres in our dataset. Let's transform this feature now using a mapping scheme in the following code.
In [4]: from sklearn.preprocessing import LabelEncoder
   ...:
   ...: gle = LabelEncoder()
   ...: genre_labels = gle.fit_transform(vg_df['Genre'])
   ...: genre_mappings = {index: label for index, label in enumerate(gle.classes_)}
   ...: genre_mappings
Out[4]:
{0: 'Action', 1: 'Adventure', 2: 'Fighting', 3: 'Misc',
 4: 'Platform', 5: 'Puzzle', 6: 'Racing', 7: 'Role-Playing',
 8: 'Shooter', 9: 'Simulation', 10: 'Sports', 11: 'Strategy'}

From the output, we can see that a mapping scheme has been generated where each genre value is mapped to a number with the help of the LabelEncoder object gle. The transformed labels are stored in the genre_labels variable. Let's write it back to the original dataframe and view the results.

In [5]: vg_df['GenreLabel'] = genre_labels
   ...: vg_df[['Name', 'Platform', 'Year', 'Genre', 'GenreLabel']].iloc[1:7]
Out[5]:
                       Name Platform    Year         Genre  GenreLabel
1         Super Mario Bros.      NES  1985.0      Platform           4
2            Mario Kart Wii      Wii  2008.0        Racing           6
3         Wii Sports Resort      Wii  2009.0        Sports          10
4  Pokemon Red/Pokemon Blue       GB  1996.0  Role-Playing           7
5                    Tetris       GB  1989.0        Puzzle           5
6     New Super Mario Bros.       DS  2006.0      Platform           4

The GenreLabel field depicts the mapped numeric labels for each of the Genre labels, and we can clearly see that this adheres to the mappings that we generated earlier.

Transforming Ordinal Features

Ordinal features are similar to nominal features except that order matters and is an inherent property with which we can interpret the values of these features. Like nominal features, ordinal features might also be present in text form, and you need to map and transform them into their numeric representation.
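Whether a feature is nominal or ordinal, once you have encoded it you may also need to map the numeric labels back to the original strings; LabelEncoder supports this via its inverse_transform(...) method. A tiny self-contained sketch with toy genre values:

```python
from sklearn.preprocessing import LabelEncoder

# Toy genre values, standing in for the vg_df['Genre'] column
genres = ['Action', 'Racing', 'Sports', 'Racing']
le = LabelEncoder()
labels = le.fit_transform(genres)        # classes_ are sorted: Action=0, Racing=1, Sports=2
decoded = le.inverse_transform(labels)   # recovers the original genre strings
```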
Let's now load our Pokémon dataset, which we used earlier, and look at the various values of the Generation attribute for each Pokémon.

In [6]: poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
   ...: poke_df = poke_df.sample(random_state=1, frac=1).reset_index(drop=True)
   ...: np.unique(poke_df['Generation'])
Out[6]:
array(['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6'], dtype=object)

We shuffle the dataset in this code just so we can later get a good slice of data that represents all the distinct values we are looking for. From this output we can see that there are a total of six generations of Pokémon. This attribute is definitely ordinal because Pokémon belonging to Generation 1 were introduced earlier in the video games and the television shows than those of Generation 2, and so on; hence there is a sense of order among the generations. Unfortunately, since a specific logic or set of rules is involved in the case of each ordinal variable, there is no generic module or function to map and transform these features into numeric representations. Hence we need to hand-craft this using our own logic, which is depicted in the following code snippet.
In [7]: gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3,
   ...:                'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}
   ...:
   ...: poke_df['GenerationLabel'] = poke_df['Generation'].map(gen_ord_map)
   ...: poke_df[['Name', 'Generation', 'GenerationLabel']].iloc[4:10]
Out[7]:
                  Name Generation  GenerationLabel
4            Octillery      Gen 2                2
5           Helioptile      Gen 6                6
6               Dialga      Gen 4                4
7  DeoxysDefense Forme      Gen 3                3
8             Rapidash      Gen 1                1
9               Swanna      Gen 5                5

Thus, you can see that it is really easy to build your own transformation mapping scheme with the help of Python dictionaries and the map(...) function from pandas to transform an ordinal feature.

Encoding Categorical Features

We have mentioned several times that Machine Learning algorithms usually work well with numerical values. You might now be wondering: since we already transformed and mapped the categorical variables into numeric representations in the previous sections, why would we need more levels of encoding? The answer is pretty simple. If we directly fed these transformed numeric representations of categorical features into any algorithm, the model would essentially try to interpret them as raw numeric features, and hence the notion of magnitude would be wrongly introduced into the system. A simple example from our previous output dataframe: a model fit on GenerationLabel would think that value 6 > 5 > 4 and so on. While order is important in the case of Pokémon generations (an ordinal variable), there is no notion of magnitude here. Generation 6 is not "larger" than Generation 5, and Generation 1 is not "smaller" than Generation 6. Hence models built directly on such features would be sub-optimal and incorrect.
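The magnitude problem is easy to demonstrate numerically; this short sketch (toy values, not from the chapter) shows how integer label codes fabricate distances that one hot vectors do not have:

```python
import numpy as np

# With integer codes, Gen 1 (0) looks five times "farther" from Gen 6 (5)
# than from Gen 2 (1), a purely artificial notion of distance.
codes = {'Gen 1': 0, 'Gen 2': 1, 'Gen 6': 5}
print(abs(codes['Gen 1'] - codes['Gen 2']))   # 1
print(abs(codes['Gen 1'] - codes['Gen 6']))   # 5

# One hot vectors remove the artifact: every pair of distinct categories
# is the same Euclidean distance apart, sqrt(2).
onehot = {'Gen 1': np.array([1., 0., 0.]),
          'Gen 2': np.array([0., 1., 0.]),
          'Gen 6': np.array([0., 0., 1.])}
d12 = np.linalg.norm(onehot['Gen 1'] - onehot['Gen 2'])
d16 = np.linalg.norm(onehot['Gen 1'] - onehot['Gen 6'])
```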
There are several schemes and strategies where dummy features are created for each unique value or label out of all the distinct categories in any feature. In the subsequent sections, we will discuss some of these schemes, including the one hot encoding, dummy coding, effect coding, and feature hashing schemes.

One Hot Encoding Scheme

Considering we have a numeric representation of any categorical feature with m labels, the one hot encoding scheme encodes or transforms the feature into m binary features, each of which can only contain a value of 1 or 0. Each observation in the categorical feature is thus converted into a vector of size m with only one of the values as 1 (indicating it as active). Let's take our Pokémon dataset and perform some one hot encoding transformations on some of its categorical features.

In [8]: poke_df[['Name', 'Generation', 'Legendary']].iloc[4:10]
Out[8]:
                  Name Generation  Legendary
4            Octillery      Gen 2      False
5           Helioptile      Gen 6      False
6               Dialga      Gen 4       True
7  DeoxysDefense Forme      Gen 3       True
8             Rapidash      Gen 1      False
9               Swanna      Gen 5      False

Considering the dataframe depicted in the output, we have two categorical features, Generation and Legendary, depicting the Pokémon generations and their legendary status. First, we need to transform these text labels into numeric representations. The following code helps us achieve this.
In [9]: from sklearn.preprocessing import OneHotEncoder, LabelEncoder
   ...:
   ...: # transform and map pokemon generations
   ...: gen_le = LabelEncoder()
   ...: gen_labels = gen_le.fit_transform(poke_df['Generation'])
   ...: poke_df['Gen_Label'] = gen_labels
   ...:
   ...: # transform and map pokemon legendary status
   ...: leg_le = LabelEncoder()
   ...: leg_labels = leg_le.fit_transform(poke_df['Legendary'])
   ...: poke_df['Lgnd_Label'] = leg_labels
   ...:
   ...: poke_df_sub = poke_df[['Name', 'Generation', 'Gen_Label', 'Legendary', 'Lgnd_Label']]
   ...: poke_df_sub.iloc[4:10]
Out[9]:
                  Name Generation  Gen_Label  Legendary  Lgnd_Label
4            Octillery      Gen 2          1      False           0
5           Helioptile      Gen 6          5      False           0
6               Dialga      Gen 4          3       True           1
7  DeoxysDefense Forme      Gen 3          2       True           1
8             Rapidash      Gen 1          0      False           0
9               Swanna      Gen 5          4      False           0

The features Gen_Label and Lgnd_Label now depict the numeric representations of our categorical features. Let's now apply the one hot encoding scheme on these features using the following code.
In [10]: # encode generation labels using one-hot encoding scheme
    ...: gen_ohe = OneHotEncoder()
    ...: gen_feature_arr = gen_ohe.fit_transform(poke_df[['Gen_Label']]).toarray()
    ...: gen_feature_labels = list(gen_le.classes_)
    ...: gen_features = pd.DataFrame(gen_feature_arr, columns=gen_feature_labels)
    ...:
    ...: # encode legendary status labels using one-hot encoding scheme
    ...: leg_ohe = OneHotEncoder()
    ...: leg_feature_arr = leg_ohe.fit_transform(poke_df[['Lgnd_Label']]).toarray()
    ...: leg_feature_labels = ['Legendary_'+str(cls_label) for cls_label in leg_le.classes_]
    ...: leg_features = pd.DataFrame(leg_feature_arr, columns=leg_feature_labels)

Now, you should remember that you can always encode both features together using the fit_transform(...) function by passing it a two-dimensional array of the two features. But we are depicting this encoding for each feature separately, to make things easier to understand. Besides this, we can also create separate dataframes and label them accordingly. Let's now concatenate these feature frames and see the final result.

In [11]: poke_df_ohe = pd.concat([poke_df_sub, gen_features, leg_features], axis=1)
    ...: columns = sum([['Name', 'Generation', 'Gen_Label'], gen_feature_labels,
    ...:                ['Legendary', 'Lgnd_Label'], leg_feature_labels], [])
    ...: poke_df_ohe[columns].iloc[4:10]

Figure 4-15. Feature set depicting one hot encoded features for Pokémon generation and legendary status

From the result feature set depicted in Figure 4-15, we can clearly see the new one hot encoded features for Gen_Label and Lgnd_Label. Each of these one hot encoded features is binary in nature, and if one contains the value 1, it means that feature is active for the corresponding observation.
For example, row 6 indicates that the Pokémon Dialga is a Gen 4 Pokémon with Gen_Label 3 (the mapping starts from 0), so the corresponding one hot encoded feature Gen 4 has the value 1 while the remaining one hot encoded generation features are 0. Similarly, its Legendary status is True, the corresponding Lgnd_Label is 1, and the one hot encoded feature Legendary_True is also 1, indicating it is active.

Suppose we used this data to train and build a model, but now we have some new Pokémon data for which we need to engineer the same features before running it through our trained model. We can use the transform(...) function of the LabelEncoder and OneHotEncoder objects that we previously constructed to engineer the features from the training data. The following code shows us two dummy data points pertaining to new Pokémon.

In [12]: new_poke_df = pd.DataFrame([['PikaZoom', 'Gen 3', True],
    ...:                             ['CharMyToast', 'Gen 4', False]],
    ...:                            columns=['Name', 'Generation', 'Legendary'])
    ...: new_poke_df
Out[12]:
          Name Generation  Legendary
0     PikaZoom      Gen 3       True
1  CharMyToast      Gen 4      False

We will follow the same process as before, first converting the text categories into numeric representations using our previously built LabelEncoder objects, as depicted in the following code.
In [13]: new_gen_labels = gen_le.transform(new_poke_df['Generation'])
    ...: new_poke_df['Gen_Label'] = new_gen_labels
    ...:
    ...: new_leg_labels = leg_le.transform(new_poke_df['Legendary'])
    ...: new_poke_df['Lgnd_Label'] = new_leg_labels
    ...:
    ...: new_poke_df[['Name', 'Generation', 'Gen_Label', 'Legendary', 'Lgnd_Label']]
Out[13]:
          Name Generation  Gen_Label  Legendary  Lgnd_Label
0     PikaZoom      Gen 3          2       True           1
1  CharMyToast      Gen 4          3      False           0

We can now use our previously built OneHotEncoder objects to perform one hot encoding on these new data observations using the following code. See Figure 4-16.

In [14]: new_gen_feature_arr = gen_ohe.transform(new_poke_df[['Gen_Label']]).toarray()
    ...: new_gen_features = pd.DataFrame(new_gen_feature_arr, columns=gen_feature_labels)
    ...:
    ...: new_leg_feature_arr = leg_ohe.transform(new_poke_df[['Lgnd_Label']]).toarray()
    ...: new_leg_features = pd.DataFrame(new_leg_feature_arr, columns=leg_feature_labels)
    ...:
    ...: new_poke_ohe = pd.concat([new_poke_df, new_gen_features, new_leg_features], axis=1)
    ...: columns = sum([['Name', 'Generation', 'Gen_Label'], gen_feature_labels,
    ...:                ['Legendary', 'Lgnd_Label'], leg_feature_labels], [])
    ...: new_poke_ohe[columns]

Figure 4-16. Feature set depicting one hot encoded features for new Pokémon data points

Thus, you can see how we used the fit_transform(...) functions to engineer features on our dataset and were then able to use the encoder objects to engineer features on new data with the transform(...) function, based on what the encoders observed previously, specifically the distinct categories and their corresponding labels and one hot encodings. You should always follow this fit-then-transform workflow whenever you deal with training and test datasets while building models.
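As a side note, on scikit-learn 0.20 and newer, OneHotEncoder can consume string columns directly (skipping the LabelEncoder step) and can be told to silently ignore categories never seen during fit. A minimal sketch of the same fit-on-train, transform-on-new workflow with toy data:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Generation': ['Gen 1', 'Gen 2', 'Gen 3']})
new = pd.DataFrame({'Generation': ['Gen 2', 'Gen 4']})  # 'Gen 4' unseen in training

ohe = OneHotEncoder(handle_unknown='ignore')  # unseen categories encode as all zeros
ohe.fit(train[['Generation']])                # learn categories from training data only
encoded = ohe.transform(new[['Generation']]).toarray()
# 'Gen 2' -> [0, 1, 0]; the unknown 'Gen 4' -> [0, 0, 0]
```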
Pandas also provides a handy function called get_dummies(...), which helps us easily perform one hot encoding. The following code depicts how to achieve this.

In [15]: gen_onehot_features = pd.get_dummies(poke_df['Generation'])
    ...: pd.concat([poke_df[['Name', 'Generation']], gen_onehot_features], axis=1).iloc[4:10]
Out[15]:
                  Name Generation  Gen 1  Gen 2  Gen 3  Gen 4  Gen 5  Gen 6
4            Octillery      Gen 2      0      1      0      0      0      0
5           Helioptile      Gen 6      0      0      0      0      0      1
6               Dialga      Gen 4      0      0      0      1      0      0
7  DeoxysDefense Forme      Gen 3      0      0      1      0      0      0
8             Rapidash      Gen 1      1      0      0      0      0      0
9               Swanna      Gen 5      0      0      0      0      1      0

The output depicts the one hot encoding scheme for Pokémon generation values, similar to what we depicted in our previous analyses.

Dummy Coding Scheme

The dummy coding scheme is similar to the one hot encoding scheme, except that when applied on a categorical feature with m distinct labels, it yields m-1 binary features. Thus each value of the categorical variable gets converted into a vector of size m-1. The dropped feature is completely disregarded, and so if the category values range over {0, 1, ..., m-1}, the category whose binary feature was dropped (usually the first or the last) is represented by a vector of all zeros. The following code depicts the dummy coding scheme on Pokémon Generation by dropping the first-level binary encoded feature (Gen 1).
In [16]: gen_dummy_features = pd.get_dummies(poke_df['Generation'], drop_first=True)
    ...: pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features], axis=1).iloc[4:10]
Out[16]:
                  Name Generation  Gen 2  Gen 3  Gen 4  Gen 5  Gen 6
4            Octillery      Gen 2      1      0      0      0      0
5           Helioptile      Gen 6      0      0      0      0      1
6               Dialga      Gen 4      0      0      1      0      0
7  DeoxysDefense Forme      Gen 3      0      1      0      0      0
8             Rapidash      Gen 1      0      0      0      0      0
9               Swanna      Gen 5      0      0      0      1      0

If you want, you can also choose to drop the last-level binary encoded feature (Gen 6) by using the following code.

In [17]: gen_onehot_features = pd.get_dummies(poke_df['Generation'])
    ...: gen_dummy_features = gen_onehot_features.iloc[:,:-1]
    ...: pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features], axis=1).iloc[4:10]
Out[17]:
                  Name Generation  Gen 1  Gen 2  Gen 3  Gen 4  Gen 5
4            Octillery      Gen 2      0      1      0      0      0
5           Helioptile      Gen 6      0      0      0      0      0
6               Dialga      Gen 4      0      0      0      1      0
7  DeoxysDefense Forme      Gen 3      0      0      1      0      0
8             Rapidash      Gen 1      1      0      0      0      0
9               Swanna      Gen 5      0      0      0      0      1

Thus from these outputs you can see that, depending on which binary encoded feature we drop, that particular categorical value is represented by an encoded feature vector of all 0s. For example, in the previous result feature set, the Pokémon Helioptile belongs to Gen 6 and is represented by all 0s in the encoded dummy features.

Effect Coding Scheme

The effect coding scheme is very similar to the dummy coding scheme in most aspects.
However, in the effect coding scheme, the encoded feature vector for the category value that is represented by all 0s in the dummy coding scheme is instead filled with -1s. The following code depicts the effect coding scheme on the Pokémon Generation feature.

In [18]: gen_onehot_features = pd.get_dummies(poke_df['Generation'])
    ...: gen_effect_features = gen_onehot_features.iloc[:,:-1]
    ...: gen_effect_features.loc[np.all(gen_effect_features == 0, axis=1)] = -1.
    ...: pd.concat([poke_df[['Name', 'Generation']], gen_effect_features], axis=1).iloc[4:10]
Out[18]:
                  Name Generation  Gen 1  Gen 2  Gen 3  Gen 4  Gen 5
4            Octillery      Gen 2    0.0    1.0    0.0    0.0    0.0
5           Helioptile      Gen 6   -1.0   -1.0   -1.0   -1.0   -1.0
6               Dialga      Gen 4    0.0    0.0    0.0    1.0    0.0
7  DeoxysDefense Forme      Gen 3    0.0    0.0    1.0    0.0    0.0
8             Rapidash      Gen 1    1.0    0.0    0.0    0.0    0.0
9               Swanna      Gen 5    0.0    0.0    0.0    0.0    1.0

We can clearly see from the output feature set that all 0s have been replaced by -1 for the value that was previously encoded as all 0s in the dummy coding scheme.

Bin-Counting Scheme

The encoding schemes discussed so far work quite well on categorical data in general, but they start causing problems when the number of distinct categories in any feature becomes very large. Essentially, for any categorical feature with m distinct labels, you get m separate features. This can easily blow up the size of the feature set, causing storage issues and model training problems with regard to time, space, and memory. Besides this, we also have to deal with what is popularly known as the curse of dimensionality: with an enormous number of features and not enough representative samples, model performance starts to degrade.
Hence we need to look toward other categorical data feature engineering schemes for features having a large number of possible categories (like IP addresses).

The bin-counting scheme is useful for dealing with categorical variables that have many categories. In this scheme, instead of using the actual label values for encoding, we use probability based statistical information relating each value to the actual target or response variable that we aim to predict in our modeling efforts. A simple example: based on past historical data for IP addresses and the ones that were used in DDOS attacks, we can build probability values for a DDOS attack being caused by each of these IP addresses. Using this information, we can encode an input feature that depicts, if the same IP address comes up in the future, the probability of a DDOS attack being caused. This scheme needs historical data as a prerequisite and is an elaborate one. Depicting it with a complete example is out of the scope of this chapter, but there are several resources online that you can refer to.

Feature Hashing Scheme

The feature hashing scheme is another useful feature engineering scheme for dealing with large scale categorical features. In this scheme, a hash function is typically used with the number of encoded features pre-set (as a vector of pre-defined length), such that the hashed values of the features are used as indices in this pre-defined vector and the values are updated accordingly. Since a hash function maps a large number of values into a small finite set of values, multiple different values might create the same hash, which is termed a collision. Typically, a signed hash function is used, so that the sign of the value obtained from the hash is used as the sign of the value stored in the final feature vector at the appropriate index. This should ensure fewer collisions and less accumulation of error due to collisions.
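The index-plus-sign mechanics just described can be illustrated with a toy function. This is a simplified stand-in, using Python's hashlib MD5 digest rather than the MurmurHash3 scheme scikit-learn actually uses, and it processes one whole string per feature:

```python
import hashlib

def toy_hash_vector(value, h=6):
    """Map a category string into a length-h vector via a signed hash.

    Toy illustration only; real implementations such as scikit-learn's
    FeatureHasher use signed MurmurHash3, not MD5."""
    vec = [0.0] * h
    digest = int(hashlib.md5(value.encode('utf-8')).hexdigest(), 16)
    index = digest % h                               # which of the h bins to update
    sign = 1.0 if (digest >> 8) % 2 == 0 else -1.0   # signed hash reduces collision bias
    vec[index] += sign
    return vec

# Identical inputs always land in the same bin with the same sign,
# so equal category values always map to equal feature vectors.
print(toy_hash_vector('Platform') == toy_hash_vector('Platform'))  # True
```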
Hashing schemes work on strings, numbers, and other structures like vectors. You can think of hashed outputs as a finite set of h bins such that when the hash function is applied to the same values, they get assigned to the same bin out of the h bins based on the hash value. We can assign the value of h, which becomes the final size of the encoded feature vector for each categorical feature we encode using the feature hashing scheme. Thus even if we have over 1,000 distinct categories in a feature and we set h = 10, the output feature set will still have only 10 features, as compared to 1,000 features if we had used a one hot encoding scheme. Let's look at the following code snippet, which shows us the number of distinct genres we have in our video game dataset.

In [19]: unique_genres = np.unique(vg_df[['Genre']])
    ...: print("Total game genres:", len(unique_genres))
    ...: print(unique_genres)

208 Chapter 4 ■ Feature Engineering and Selection

Total game genres: 12
['Action' 'Adventure' 'Fighting' 'Misc' 'Platform' 'Puzzle' 'Racing'
 'Role-Playing' 'Shooter' 'Simulation' 'Sports' 'Strategy']

We can clearly see from the output that there are 12 distinct genres, and if we used a one hot encoding scheme on the Genre feature, we would end up having 12 binary features. Instead, we will now use a feature hashing scheme by leveraging scikit-learn's FeatureHasher class, which uses a signed 32-bit version of the MurmurHash3 hash function. The following code shows us how to use the feature hashing scheme, where we pre-set the feature vector size to 6 (6 features instead of 12).

In [21]: from sklearn.feature_extraction import FeatureHasher
    ...: fh = FeatureHasher(n_features=6, input_type='string')
    ...: hashed_features = fh.fit_transform(vg_df['Genre'])
    ...: hashed_features = hashed_features.toarray()
    ...: pd.concat([vg_df[['Name', 'Genre']], pd.DataFrame(hashed_features)], axis=1).iloc[1:7]
Out[21]:
                       Name         Genre    0    1    2    3    4    5
1         Super Mario Bros.      Platform  0.0  2.0  2.0 -1.0  1.0  0.0
2            Mario Kart Wii        Racing -1.0  0.0  0.0  0.0  0.0 -1.0
3         Wii Sports Resort        Sports -2.0  2.0  0.0 -2.0  0.0  0.0
4  Pokemon Red/Pokemon Blue  Role-Playing -1.0  1.0  2.0  0.0  1.0 -1.0
5                    Tetris        Puzzle  0.0  1.0  1.0 -2.0  1.0 -1.0
6     New Super Mario Bros.      Platform  0.0  2.0  2.0 -1.0  1.0  0.0

Thus we can clearly see from the resulting feature set that the Genre categorical feature has been encoded using the hashing scheme into 6 features instead of 12. We can also see that rows 1 and 6 denote games of the same genre, Platform, which have rightly been encoded into the same feature vector, as expected.

Feature Engineering on Text Data

Dealing with structured data attributes like numeric or categorical variables is usually not as challenging as dealing with unstructured attributes like text and images. In the case of unstructured data like text documents, the first challenge is dealing with the unpredictable nature of the syntax, format, and content of the documents, which makes it a challenge to extract useful information for building models. The second challenge is transforming these textual representations into numeric representations that can be understood by Machine Learning algorithms. There exist various feature engineering techniques employed by data scientists daily to extract numeric feature vectors from unstructured text. In this section, we discuss several of these techniques. Before we get started, you should remember that there are two aspects to executing feature engineering on text data.

• Pre-processing and normalizing text
• Feature extraction and engineering

Without text pre-processing and normalization, the feature engineering techniques will not work at their full efficiency; hence it is of paramount importance to pre-process textual documents.
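As a brief aside before turning to text data: the signed-hashing mechanics described in the previous section can be sketched from scratch in a few lines. This is illustrative only; it uses Python's built-in md5 from hashlib for determinism, whereas scikit-learn's FeatureHasher actually uses MurmurHash3:

```python
import hashlib

def hashed_vector(categories, h=6):
    """Fold category strings into a signed h-dimensional vector (illustrative sketch)."""
    vec = [0.0] * h
    for value in categories:
        digest = int(hashlib.md5(value.encode('utf-8')).hexdigest(), 16)
        index = (digest >> 1) % h                  # which of the h slots to update
        sign = 1.0 if (digest & 1) == 0 else -1.0  # signed hashing offsets collision error
        vec[index] += sign
    return vec

# The same input always lands in the same slot with the same sign
print(hashed_vector(['Platform']))
print(hashed_vector(['Platform', 'Racing', 'Sports', 'Puzzle']))
```

With many more categories than slots, collisions are inevitable; the random signs make colliding contributions partially cancel instead of accumulating.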
You can load feature_engineering_text.py directly and start running the examples or use the jupyter notebook, Feature Engineering on Text Data.ipynb, for a more interactive experience. Let's load the following necessary dependencies before we start.

In [1]: import pandas as pd
   ...: import numpy as np
   ...: import re
   ...: import nltk

Let's now load some sample text documents, do some basic pre-processing, and learn about various feature engineering strategies to deal with text data. The following code creates our sample text corpus (a collection of text documents), which we will use in this section.

In [2]: corpus = ['The sky is blue and beautiful.',
   ...:           'Love this blue and beautiful sky!',
   ...:           'The quick brown fox jumps over the lazy dog.',
   ...:           'The brown fox is quick and the blue dog is lazy!',
   ...:           'The sky is very blue and the sky is very beautiful today',
   ...:           'The dog is lazy but the brown fox is quick!'
   ...: ]
   ...: labels = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']
   ...: corpus = np.array(corpus)
   ...: corpus_df = pd.DataFrame({'Document': corpus,
   ...:                           'Category': labels})
   ...: corpus_df = corpus_df[['Document', 'Category']]
   ...: corpus_df
Out[2]:
                                            Document Category
0                     The sky is blue and beautiful.  weather
1                  Love this blue and beautiful sky!  weather
2       The quick brown fox jumps over the lazy dog.  animals
3   The brown fox is quick and the blue dog is lazy!  animals
4  The sky is very blue and the sky is very beaut...  weather
5        The dog is lazy but the brown fox is quick!  animals

We can see that we have a total of six documents, where three of them are relevant to weather and the other three talk about animals, as depicted by the Category class label.
Text Pre-Processing

Before feature engineering, we need to pre-process, clean, and normalize the text like we mentioned before. There are multiple pre-processing techniques, some of which are quite elaborate. We will not go into a lot of detail in this section, but we will cover many of them in further detail in a future chapter when we work on text classification and sentiment analysis. Following are some of the popular pre-processing techniques.

• Text tokenization and lower casing
• Removing special characters
• Contraction expansion
• Removing stopwords
• Correcting spellings
• Stemming
• Lemmatization

For more details on these topics, you can jump ahead to Chapter 7 of this book or refer to the section “Text Normalization,” Chapter 3, page 115 of Text Analytics with Python (Apress; Dipanjan Sarkar, 2016), which covers each of these techniques in detail. We will be normalizing our text here by lowercasing, removing special characters, tokenizing, and removing stopwords. The following code helps us achieve this.

In [3]: wpt = nltk.WordPunctTokenizer()
   ...: stop_words = nltk.corpus.stopwords.words('english')
   ...:
   ...: def normalize_document(doc):
   ...:     # lower case and remove special characters\whitespaces
   ...:     doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I)
   ...:     doc = doc.lower()
   ...:     doc = doc.strip()
   ...:     # tokenize document
   ...:     tokens = wpt.tokenize(doc)
   ...:     # filter stopwords out of document
   ...:     filtered_tokens = [token for token in tokens if token not in stop_words]
   ...:     # re-create document from filtered tokens
   ...:     doc = ' '.join(filtered_tokens)
   ...:     return doc
   ...:
   ...: normalize_corpus = np.vectorize(normalize_document)

The np.vectorize(...) function helps us run the same function over all elements of a numpy array instead of writing a loop. We will now use this function to pre-process our text corpus.
In [4]: norm_corpus = normalize_corpus(corpus)
   ...: norm_corpus
Out[4]:
array(['sky blue beautiful', 'love blue beautiful sky',
       'quick brown fox jumps lazy dog', 'brown fox quick blue dog lazy',
       'sky blue sky beautiful today', 'dog lazy brown fox quick'],
      dtype='<U30')

> 0.6]
    ...:     print(topic)
    ...:     print()

[('fox', 1.7265536238698524), ('quick', 1.7264910761871224), ('dog', 1.7264019823624879), ('brown', 1.7263774760262807), ('lazy', 1.7263567668213813), ('jumps', 1.0326450363521607), ('blue', 0.7770158513472083)]

[('sky', 2.263185143458752), ('beautiful', 1.9057084998062579), ('blue', 1.7954559705805626), ('love', 1.1476805311187976), ('today', 1.0064979209198706)]

The preceding output represents each of the two topics as a collection of terms, with their importance depicted by the corresponding weights. It is definitely interesting to see that the two topics are quite distinguishable from each other by looking at the terms. The first topic shows terms relevant to animals and the second topic shows terms relevant to weather. This is reinforced by applying our unsupervised K-means clustering algorithm on our document-topic feature matrix (dt_matrix) using the following code snippet.

In [13]: km = KMeans(n_clusters=2)
    ...: km.fit_transform(features)
    ...: cluster_labels = km.labels_
    ...: cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
    ...: pd.concat([corpus_df, cluster_labels], axis=1)
Out[13]:
                                            Document Category  ClusterLabel
0                     The sky is blue and beautiful.  weather             0
1                  Love this blue and beautiful sky!  weather             0
2       The quick brown fox jumps over the lazy dog.  animals             1
3   The brown fox is quick and the blue dog is lazy!  animals             1
4  The sky is very blue and the sky is very beaut...  weather             0
5        The dog is lazy but the brown fox is quick!  animals             1

This clearly makes sense, and we can see that by just using two topic-model based features, we are still able to cluster our documents efficiently!

Word Embeddings

There are several advanced word vectorization models that have recently gained a lot of prominence. Almost all of them deal with the concept of word embeddings. Basically, word embeddings can be used for feature extraction and language modeling. This representation tries to map each word or phrase into a complete numeric vector such that semantically similar words or terms tend to occur closer to each other in the vector space, and this similarity can be quantified using the embeddings. The word2vec model is perhaps one of the most popular neural network based probabilistic language models and can be used to learn distributed representational vectors for words. Word embeddings produced by word2vec involve taking in a corpus of text documents and representing words in a large high dimensional vector space such that each word has a corresponding vector in that space and similar words (even semantically) are located close to one another, analogous to what we observed in document similarity earlier. The word2vec model was released by Google in 2013 and uses a neural network based implementation with architectures like Continuous Bag of Words and Skip-Gram to learn the distributed vector representations of words in a corpus. We will be using the gensim framework to implement the same model on our corpus to extract features. Some of the important parameters in the model are explained briefly as follows.

• size: Represents the feature vector size for each word in the corpus when transformed.
• window: Sets the context window size, specifying the length of the window of words to be taken into account as belonging to a single, similar context when training.
• min_count: Specifies the minimum word frequency value needed across the corpus to consider the word as a part of the final vocabulary during training of the model.
• sample: Used to downsample the effects of words which occur very frequently.

The following snippet builds a word2vec embedding model on the documents of our sample corpus. Remember to tokenize each document before passing it to the model.

In [14]: from gensim.models import word2vec
    ...:
    ...: wpt = nltk.WordPunctTokenizer()
    ...: tokenized_corpus = [wpt.tokenize(document) for document in norm_corpus]
    ...:
    ...: # Set values for various parameters
    ...: feature_size = 10    # Word vector dimensionality
    ...: window_context = 10  # Context window size
    ...: min_word_count = 1   # Minimum word count
    ...: sample = 1e-3        # Downsample setting for frequent words
    ...:
    ...: w2v_model = word2vec.Word2Vec(tokenized_corpus, size=feature_size,
    ...:                               window=window_context, min_count=min_word_count,
    ...:                               sample=sample)
Using TensorFlow backend.

Each word in the corpus will essentially now be a vector itself of size 10. We can verify the same using the following code.

In [15]: w2v_model.wv['sky']
Out[15]:
array([ 0.02626196, -0.02171229, -0.04910386,  0.0194816 ,  0.01649994,
        0.01200452,  0.04641563,  0.01844106,  0.02693636, -0.02992732], dtype=float32)

A question might arise in your mind now: so far, we had feature vectors for each complete document, but now we have vectors for each word. How on earth do we represent entire documents now? We can do that using various aggregations and combinations. A simple scheme would be to use an averaged word vector representation, where we simply sum all the word vectors occurring in a document and then divide by the count of word vectors to represent an averaged word vector for the document. The following code enables us to do the same.
In [16]: def average_word_vectors(words, model, vocabulary, num_features):
    ...:
    ...:     feature_vector = np.zeros((num_features,), dtype="float64")
    ...:     nwords = 0.
    ...:
    ...:     for word in words:
    ...:         if word in vocabulary:
    ...:             nwords = nwords + 1.
    ...:             feature_vector = np.add(feature_vector, model[word])
    ...:
    ...:     if nwords:
    ...:         feature_vector = np.divide(feature_vector, nwords)
    ...:
    ...:     return feature_vector
    ...:
    ...:
    ...: def averaged_word_vectorizer(corpus, model, num_features):
    ...:     vocabulary = set(model.wv.index2word)
    ...:     features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
    ...:                     for tokenized_sentence in corpus]
    ...:     return np.array(features)

In [17]: w2v_feature_array = averaged_word_vectorizer(corpus=tokenized_corpus, model=w2v_model,
    ...:                                              num_features=feature_size)
    ...: pd.DataFrame(w2v_feature_array)

Figure 4-19. Averaged word vector feature set for our corpus documents

Thus, we have our averaged word vector based feature set for all our corpus documents, as depicted by the dataframe in Figure 4-19. Let's use a different clustering algorithm this time, known as Affinity Propagation, to try to cluster our documents based on these new features. Affinity Propagation is based on the concept of message passing, and you do not need to specify the number of clusters beforehand like you did in K-means clustering.
In [18]: from sklearn.cluster import AffinityPropagation
    ...:
    ...: ap = AffinityPropagation()
    ...: ap.fit(w2v_feature_array)
    ...: cluster_labels = ap.labels_
    ...: cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
    ...: pd.concat([corpus_df, cluster_labels], axis=1)
Out[18]:
                                            Document Category  ClusterLabel
0                     The sky is blue and beautiful.  weather             0
1                  Love this blue and beautiful sky!  weather             0
2       The quick brown fox jumps over the lazy dog.  animals             1
3   The brown fox is quick and the blue dog is lazy!  animals             1
4  The sky is very blue and the sky is very beaut...  weather             0
5        The dog is lazy but the brown fox is quick!  animals             1

The preceding output uses the averaged word vectors based on word embeddings to cluster the documents in our corpus, and we can clearly see that it has obtained the right clusters! There are several other schemes for aggregating word vectors, like using TF-IDF weights along with the word vector representations. Besides this, there have been recent advancements in the field of Deep Learning where architectures like RNNs and LSTMs are also used for engineering features from text data.

Feature Engineering on Temporal Data

Temporal data involves datasets that change over a period of time, and time-based attributes are of paramount importance in these datasets. Usually temporal attributes include some form of date, time, and timestamp values, and often optionally include other metadata like time zones, daylight savings time information, and so on. Temporal data, especially time-series based data, is extensively used in multiple domains like stock, commodity, and weather forecasting. You can load feature_engineering_temporal.py directly and start running the examples or use the jupyter notebook, Feature Engineering on Temporal Data.ipynb, for a more interactive experience. Let's load the following dependencies before we move on to acquiring some temporal data.

In [1]: import datetime
   ...: import numpy as np
   ...: import pandas as pd
   ...: from dateutil.parser import parse
   ...: import pytz

We will now use some sample time-based data as our source of temporal data by loading the following values in a dataframe.

In [2]: time_stamps = ['2015-03-08 10:30:00.360000+00:00', '2017-07-13 15:45:05.755000-07:00',
   ...:                '2012-01-20 22:30:00.254000+05:30', '2016-12-25 00:30:00.000000+10:00']
   ...: df = pd.DataFrame(time_stamps, columns=['Time'])
   ...: df
Out[2]:
                               Time
0  2015-03-08 10:30:00.360000+00:00
1  2017-07-13 15:45:05.755000-07:00
2  2012-01-20 22:30:00.254000+05:30
3  2016-12-25 00:30:00.000000+10:00

Of course, by default they are stored as strings or text in the dataframe, so we can convert them into Timestamp objects by using the following code snippet.

In [3]: ts_objs = np.array([pd.Timestamp(item) for item in np.array(df.Time)])
   ...: df['TS_obj'] = ts_objs
   ...: ts_objs
Out[3]:
array([Timestamp('2015-03-08 10:30:00.360000+0000', tz='UTC'),
       Timestamp('2017-07-13 15:45:05.755000-0700', tz='pytz.FixedOffset(-420)'),
       Timestamp('2012-01-20 22:30:00.254000+0530', tz='pytz.FixedOffset(330)'),
       Timestamp('2016-12-25 00:30:00+1000', tz='pytz.FixedOffset(600)')], dtype=object)

You can clearly see from the temporal values that each Timestamp object has multiple components, including date, time, and a time-based offset, which can also be used to identify the time zone. Of course, there is no way we can directly ingest or use these features in any Machine Learning model. Hence we need specific strategies to extract meaningful features from this data.
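As a tiny preview of what such strategies look like, even the standard library can already pull several model-ready numeric features out of one of these strings (a minimal sketch; datetime.fromisoformat requires Python 3.7+):

```python
from datetime import datetime

# Parse one of the sample timestamps (offset-aware) with the standard library
ts = datetime.fromisoformat('2015-03-08 10:30:00.360000+00:00')

# A few simple numeric features derived from its date and time components
features = {
    'year': ts.year,
    'month': ts.month,
    'day_of_week': ts.weekday(),        # Monday=0 ... Sunday=6
    'hour': ts.hour,
    'utc_offset_hours': ts.utcoffset().total_seconds() / 3600,
}
print(features)
# → {'year': 2015, 'month': 3, 'day_of_week': 6, 'hour': 10, 'utc_offset_hours': 0.0}
```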
In the following sections, we cover some of these strategies that you can start using on your own temporal data in the future.

Date-Based Features

Each temporal value has a date component that can be used to extract useful information and features pertaining to the date. These include features and components like year, month, day, quarter, day of the week, day name, day and week of the year, and many more. The following code depicts how we can obtain some of these features from our temporal data.

In [4]: df['Year'] = df['TS_obj'].apply(lambda d: d.year)
   ...: df['Month'] = df['TS_obj'].apply(lambda d: d.month)
   ...: df['Day'] = df['TS_obj'].apply(lambda d: d.day)
   ...: df['DayOfWeek'] = df['TS_obj'].apply(lambda d: d.dayofweek)
   ...: df['DayName'] = df['TS_obj'].apply(lambda d: d.weekday_name)
   ...: df['DayOfYear'] = df['TS_obj'].apply(lambda d: d.dayofyear)
   ...: df['WeekOfYear'] = df['TS_obj'].apply(lambda d: d.weekofyear)
   ...: df['Quarter'] = df['TS_obj'].apply(lambda d: d.quarter)
   ...:
   ...: df[['Time', 'Year', 'Month', 'Day', 'Quarter',
   ...:     'DayOfWeek', 'DayName', 'DayOfYear', 'WeekOfYear']]

Figure 4-20. Date based features in temporal data

The features depicted in Figure 4-20 show some of the attributes we talked about earlier and have been derived purely from the date segment of each temporal value. Each of these features can be used as categorical features, and further feature engineering can be done, like one hot encoding, aggregations, binning, and more.

Time-Based Features

Each temporal value also has a time component that can be used to extract useful information and features pertaining to the time. These include attributes like hour, minute, second, microsecond, UTC offset, and more. The following code snippet extracts some of the previously mentioned time-based features from our temporal data.
In [5]: df['Hour'] = df['TS_obj'].apply(lambda d: d.hour)
   ...: df['Minute'] = df['TS_obj'].apply(lambda d: d.minute)
   ...: df['Second'] = df['TS_obj'].apply(lambda d: d.second)
   ...: df['MUsecond'] = df['TS_obj'].apply(lambda d: d.microsecond)
   ...: df['UTC_offset'] = df['TS_obj'].apply(lambda d: d.utcoffset())
   ...: df[['Time', 'Hour', 'Minute', 'Second', 'MUsecond', 'UTC_offset']]

Figure 4-21. Time based features in temporal data

The features depicted in Figure 4-21 show some of the attributes we talked about earlier, which have been derived purely from the time segment of each temporal value. We can further engineer these features based on categorical feature engineering techniques and even derive other features like extracting time zones. Let's try to use binning to bin each temporal value into a specific time of the day by leveraging the Hour feature we just obtained.

In [6]: hour_bins = [-1, 5, 11, 16, 21, 23]
   ...: bin_names = ['Late Night', 'Morning', 'Afternoon', 'Evening', 'Night']
   ...: df['TimeOfDayBin'] = pd.cut(df['Hour'],
   ...:                            bins=hour_bins, labels=bin_names)
   ...: df[['Time', 'Hour', 'TimeOfDayBin']]
Out[6]:
                               Time  Hour TimeOfDayBin
0  2015-03-08 10:30:00.360000+00:00    10      Morning
1  2017-07-13 15:45:05.755000-07:00    15    Afternoon
2  2012-01-20 22:30:00.254000+05:30    22        Night
3  2016-12-25 00:30:00.000000+10:00     0   Late Night

Thus you can see from the preceding output that, based on hour ranges (0-5, 5-11, 11-16, 16-21, 21-23), we have assigned a specific time of the day bin to each temporal value. The UTC offset component of the temporal data is very useful in knowing how far ahead of or behind UTC (Coordinated Universal Time) that time value is; UTC is the primary time standard by which clocks and time are regulated.
This information can also be used to engineer new features, like potential time zones from which each temporal value might have been obtained. The following code helps us achieve the same.

In [7]: df['TZ_info'] = df['TS_obj'].apply(lambda d: d.tzinfo)
   ...: df['TimeZones'] = df['TS_obj'].apply(lambda d: list({d.astimezone(tz).tzname()
   ...:                               for tz in map(pytz.timezone,
   ...:                                             pytz.all_timezones_set)
   ...:                                  if d.astimezone(tz).utcoffset() == d.utcoffset()}))
   ...:
   ...: df[['Time', 'UTC_offset', 'TZ_info', 'TimeZones']]

Figure 4-22. Time zone relevant features in temporal data

Thus, as we mentioned earlier, the features depicted in Figure 4-22 show some of the attributes pertaining to time zone relevant information for each temporal value. We can also get time components in other formats, like the Epoch, which is basically the number of seconds that have elapsed since January 1, 1970 (midnight UTC), and the Gregorian Ordinal, where January 1st of year 1 is represented as 1, and so on. The following code helps us extract these representations. See Figure 4-23.

In [8]: df['TimeUTC'] = df['TS_obj'].apply(lambda d: d.tz_convert(pytz.utc))
   ...: df['Epoch'] = df['TimeUTC'].apply(lambda d: d.timestamp())
   ...: df['GregOrdinal'] = df['TimeUTC'].apply(lambda d: d.toordinal())
   ...: df[['Time', 'TimeUTC', 'Epoch', 'GregOrdinal']]

Figure 4-23. Time components depicted in various representations

Do note that we converted each temporal value to UTC before deriving the other features. These alternate representations of time can be further used for easy date arithmetic. The epoch gives us time elapsed in seconds and the Gregorian ordinal gives us time elapsed in days.
We can use this to derive further features, like time elapsed from the current time or time elapsed from major events of importance based on the problem we are trying to solve. Let's compute the time elapsed for each temporal value since the current time. See Figure 4-24.

In [9]: curr_ts = datetime.datetime.now(pytz.utc)
   ...: # compute days elapsed since today
   ...: df['DaysElapsedEpoch'] = (curr_ts.timestamp() - df['Epoch']) / (3600*24)
   ...: df['DaysElapsedOrdinal'] = (curr_ts.toordinal() - df['GregOrdinal'])
   ...: df[['Time', 'TimeUTC', 'DaysElapsedEpoch', 'DaysElapsedOrdinal']]

Figure 4-24. Deriving elapsed time difference from current time

Based on our computations, each new derived feature should give us the elapsed time difference between the current time and the time value in the Time column (actually TimeUTC, since conversion to UTC is necessary). Both values are almost equal to one another, which is expected. Thus you can use time and date arithmetic to extract and engineer more features, which can help build better models. Alternate time representations enable you to do date time arithmetic directly instead of dealing with specific API methods of Timestamp and datetime objects from Python. However, you can use any method to get to the results you want. It's all about ease of use and efficiency!

Feature Engineering on Image Data

Another very popular format of unstructured data is images. Sound and visual data in the form of images, video, and audio are very popular sources of data which pose a lot of challenges to data scientists in terms of processing, storage, feature extraction, and modeling. However, their benefits as sources of data are quite rewarding, especially in the fields of artificial intelligence and computer vision. Due to the unstructured nature of the data, it is not possible to directly use images for training models.
If you are given a raw image, you might have a hard time trying to think of ways to represent it so that any Machine Learning algorithm can utilize it for model training. There are various strategies and techniques that can be used in this case to engineer the right features from images. One of the core principles to remember when dealing with images is that any image can be represented as a matrix of numeric pixel values. With that thought in mind, let's get started! You can load feature_engineering_image.py directly and start running the examples or use the jupyter notebook, Feature Engineering on Image Data.ipynb, for a more interactive experience. Let's start by loading the necessary dependencies and configuration settings.

In [1]: import skimage
   ...: import numpy as np
   ...: import pandas as pd
   ...: import matplotlib.pyplot as plt
   ...: from skimage import io
   ...:
   ...: %matplotlib inline

The scikit-image (skimage) library is an excellent framework consisting of several useful interfaces and algorithms for image processing and feature extraction. Besides this, we will also leverage the mahotas framework, which is useful in computer vision and image processing. OpenCV is another useful framework that you can check out if you are interested in aspects pertaining to computer vision. Let's now look at ways to represent images as useful feature vector representations.

Image Metadata Features

There are tons of useful features obtainable from the image metadata itself, without even processing the image. Most of this information can be found in the EXIF data, which is usually recorded for each image by the device when the picture is taken. Following are some of the popular features that are obtainable from the image EXIF data.
• Image create date and time
• Image dimensions
• Image compression format
• Device make and model
• Image resolution and aspect ratio
• Image artist
• Flash, aperture, focal length, and exposure

For more details on what other data points can be used as features from image EXIF metadata, you can refer to https://sno.phy.queensu.ca/~phil/exiftool/TagNames/EXIF.html, which lists the possible EXIF tags.

Raw Image and Channel Pixels

An image can be represented by the value of each of its pixels as a two dimensional array. We can leverage numpy arrays for this. However, color images usually have three components, also known as channels. The R, G, and B channels stand for the red, green, and blue channels, respectively. This can be represented as a three dimensional array (m, n, c), where m indicates the number of rows in the image and n indicates the number of columns. These are determined by the image dimensions. The c indicates which channel it represents (R, G, or B). Let's load some sample color images now and try to understand their representation.

In [2]: cat = io.imread('datasets/cat.png')
   ...: dog = io.imread('datasets/dog.png')
   ...: df = pd.DataFrame(['Cat', 'Dog'], columns=['Image'])
   ...:
   ...: print(cat.shape, dog.shape)
(168, 300, 3) (168, 300, 3)

In [3]: fig = plt.figure(figsize = (8,4))
   ...: ax1 = fig.add_subplot(1,2, 1)
   ...: ax1.imshow(cat)
   ...: ax2 = fig.add_subplot(1,2, 2)
   ...: ax2.imshow(dog)

Figure 4-25. Our two sample color images

We can clearly see from Figure 4-25 that we have two images, of a cat and a dog, having dimensions 168x300 pixels, where each row and column denotes a specific pixel of the image. The third dimension indicates these are color images having three color channels. Let's now try to use numpy indexing to slice out and extract the three color channels separately for the dog image.
In [4]: dog_r = dog.copy() # Red Channel
   ...: dog_r[:,:,1] = dog_r[:,:,2] = 0 # set G,B pixels = 0
   ...: dog_g = dog.copy() # Green Channel
   ...: dog_g[:,:,0] = dog_g[:,:,2] = 0 # set R,B pixels = 0
   ...: dog_b = dog.copy() # Blue Channel
   ...: dog_b[:,:,0] = dog_b[:,:,1] = 0 # set R,G pixels = 0
   ...:
   ...: plot_image = np.concatenate((dog_r, dog_g, dog_b), axis=1)
   ...: plt.figure(figsize = (10,4))
   ...: plt.imshow(plot_image)

Figure 4-26. Extracting red, green, and blue channels from our color RGB image

We can clearly see from Figure 4-26 how we can easily use numpy indexing to extract the three color channels from the sample image. You can now refer to any of these channels' raw image pixel matrices and even flatten them if needed to form feature vectors.

In [5]: dog_r[:,:,0]
Out[5]:
array([[160, 160, 160, ..., 113, 113, 112],
       [160, 160, 160, ..., 113, 113, 112],
       ...,
       [165, 165, 165, ..., 212, 211, 210],
       [165, 165, 165, ..., 210, 210, 209],
       [164, 164, 164, ..., 209, 209, 209]], dtype=uint8)

This image pixel matrix is a two-dimensional matrix, so you can extract features from it further or even flatten it to a one-dimensional vector to use as input for any Machine Learning algorithm.

Grayscale Image Pixels

If you are dealing with color images, it might get difficult working with multiple channels and three-dimensional arrays. Hence, converting images to grayscale is a nice way of keeping the necessary pixel intensity values while getting an easy-to-process two-dimensional image.
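For intuition, grayscale conversion can be sketched on a tiny synthetic RGB array with plain NumPy, using the same luminance weights that scikit-image documents for rgb2gray (a minimal illustration, not a replacement for the library call):

```python
import numpy as np

# A tiny synthetic 1x3 RGB "image": one white, one red, one green pixel
# (float values in [0, 1], as skimage uses after normalization)
img = np.array([[[1.0, 1.0, 1.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0]]])

# Luminance weights for the R, G, B channels (as used by skimage.color.rgb2gray)
weights = np.array([0.2125, 0.7154, 0.0721])

# Weighted sum over the channel axis collapses (m, n, 3) to (m, n)
gray = img @ weights
print(gray.shape)        # → (1, 3)
print(np.round(gray, 4))
```

Note how green dominates the perceived brightness (0.7154), which is why a pure green pixel maps to a much lighter gray than a pure red one.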
Grayscale images usually capture the luminance or intensity of each pixel, such that each pixel value can be computed using the equation

Y = 0.2125 x R + 0.7154 x G + 0.0721 x B

where R, G, and B are the pixel values of the three channels and Y captures the final pixel intensity information, which usually ranges from 0 (complete absence of intensity, i.e., black) to 1 (complete presence of intensity, i.e., white). The following snippet shows us how to convert RGB color images to grayscale and extract the raw pixel values, which can be used as features.

In [6]: from skimage.color import rgb2gray
   ...:
   ...: cgs = rgb2gray(cat)
   ...: dgs = rgb2gray(dog)
   ...:
   ...: print('Image shape:', cgs.shape, '\n')
   ...:
   ...: # 2D pixel map
   ...: print('2D image pixel map')
   ...: print(np.round(cgs, 2), '\n')
   ...:
   ...: # flattened pixel feature vector
   ...: print('Flattened pixel map:', (np.round(cgs.flatten(), 2)))
Image shape: (168, 300)

2D image pixel map
[[ 0.42  0.41  0.41 ...,  0.5   0.52  0.53]
 [ 0.41  0.41  0.4  ...,  0.51  0.52  0.54]
 ...,
 [ 0.11  0.11  0.1  ...,  0.51  0.51  0.51]
 [ 0.11  0.11  0.1  ...,  0.51  0.51  0.51]]

Flattened pixel map: [ 0.42  0.41  0.41 ...,  0.51  0.51  0.51]

Binning Image Intensity Distribution

We already obtained the raw image intensity values for the grayscale images in the previous section. One approach would be to use these raw pixel values themselves as features. Another approach would be to bin the image intensity distribution based on intensity values using a histogram and use the bins as features. The following code snippet shows us how the image intensity distribution looks for the two sample images.
In [7]: fig = plt.figure(figsize = (8,4))
   ...: ax1 = fig.add_subplot(2,2, 1)
   ...: ax1.imshow(cgs, cmap="gray")
   ...: ax2 = fig.add_subplot(2,2, 2)
   ...: ax2.imshow(dgs, cmap='gray')
   ...: ax3 = fig.add_subplot(2,2, 3)
   ...: c_freq, c_bins, c_patches = ax3.hist(cgs.flatten(), bins=30)
   ...: ax4 = fig.add_subplot(2,2, 4)
   ...: d_freq, d_bins, d_patches = ax4.hist(dgs.flatten(), bins=30)

Figure 4-27. Binning image intensity distributions with histograms

As mentioned, image intensity ranges from 0 to 1, as is evident from the x-axes in Figure 4-27. The y-axes depict the frequencies of the respective bins. We can clearly see that the dog image has a higher concentration of bin frequencies around 0.6 - 0.8, indicating higher intensity; the reason is that the Labrador dog is white in color, and white has a high intensity value, as mentioned in the previous section. The variables c_freq, c_bins and d_freq, d_bins can be used to get the numeric values pertaining to the bins for use as features.

Image Aggregation Statistics

We already obtained the raw image intensity values for the grayscale images in the previous section. Besides using them as features directly, we can compute aggregations and statistical measures from the pixels and their intensities. We already saw one approach: binning intensity values using histograms. In this section, we use descriptive statistical measures and aggregations to compute specific features from the image pixel values. We can compute the RGB range for each image by subtracting the minimum from the maximum pixel value in each channel. The following code helps us achieve this.
In [8]: from scipy.stats import describe
   ...:
   ...: cat_rgb = cat.reshape((168*300), 3).T
   ...: dog_rgb = dog.reshape((168*300), 3).T
   ...:
   ...: cs = describe(cat_rgb, axis=1)
   ...: ds = describe(dog_rgb, axis=1)
   ...:
   ...: cat_rgb_range = cs.minmax[1] - cs.minmax[0]
   ...: dog_rgb_range = ds.minmax[1] - ds.minmax[0]
   ...: rgb_range_df = pd.DataFrame([cat_rgb_range, dog_rgb_range],
   ...:                             columns=['R_range', 'G_range', 'B_range'])
   ...: pd.concat([df, rgb_range_df], axis=1)
Out[8]:
  Image  R_range  G_range  B_range
0   Cat      240      223      235
1   Dog      246      250      246

We can then use these range features as specific characteristic attributes of each image. Besides this, we can also compute other metrics like mean, median, variance, skewness, and kurtosis for each image channel as follows.

In [9]: cat_stats = np.array([np.round(cs.mean, 2), np.round(cs.variance, 2),
   ...:                      np.round(cs.kurtosis, 2), np.round(cs.skewness, 2),
   ...:                      np.round(np.median(cat_rgb, axis=1), 2)]).flatten()
   ...: dog_stats = np.array([np.round(ds.mean, 2), np.round(ds.variance, 2),
   ...:                      np.round(ds.kurtosis, 2), np.round(ds.skewness, 2),
   ...:                      np.round(np.median(dog_rgb, axis=1), 2)]).flatten()
   ...:
   ...: stats_df = pd.DataFrame([cat_stats, dog_stats],
   ...:                         columns=['R_mean', 'G_mean', 'B_mean', 'R_var', 'G_var',
   ...:                                  'B_var', 'R_kurt', 'G_kurt', 'B_kurt', 'R_skew',
   ...:                                  'G_skew', 'B_skew', 'R_med', 'G_med', 'B_med'])
   ...: pd.concat([df, stats_df], axis=1)

Figure 4-28. Image channel aggregation statistical features

We can observe from the features in Figure 4-28 that the mean, median, and kurtosis values for the various channels of the dog image are mostly greater than the corresponding values for the cat image, while variance and skewness are greater for the cat image.

Edge Detection

One of the more interesting and sophisticated techniques involves detecting edges in an image. Edge detection algorithms can be used to detect sharp intensity and brightness changes in an image and find areas of interest. The Canny edge detector, developed by John Canny, is one of the most widely used edge detection algorithms today. The algorithm typically starts by using a Gaussian distribution with a specific standard deviation σ (sigma) to smooth and denoise the image. A Sobel filter is then applied to extract the image intensity gradients, and the norm of this gradient determines the edge strength. Potential edges are thinned down to curves one pixel wide, and hysteresis-based thresholding is used to label all points above a specific high threshold as edges, then recursively label points above the low threshold as edges if they connect to any previously labeled points. The following code applies the Canny edge detector to our sample images.

In [10]: from skimage.feature import canny
    ...:
    ...: cat_edges = canny(cgs, sigma=3)
    ...: dog_edges = canny(dgs, sigma=3)
    ...:
    ...: fig = plt.figure(figsize = (8,4))
    ...: ax1 = fig.add_subplot(1,2, 1)
    ...: ax1.imshow(cat_edges, cmap='binary')
    ...: ax2 = fig.add_subplot(1,2, 2)
    ...: ax2.imshow(dog_edges, cmap='binary')

Figure 4-29. Canny edge detection to extract edge based features

The image plots based on the edge feature arrays in Figure 4-29 clearly show the prominent edges of our cat and dog.
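The Sobel-gradient step of the algorithm described above can be sketched in plain NumPy; this toy implementation (my own, not skimage's internals) computes the gradient magnitude, i.e. the edge strength, on a small synthetic step-edge image:

```python
import numpy as np

def sobel_magnitude(img):
    """Gradient magnitude via Sobel filters: the edge-strength measure canny relies on."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal gradient
    ky = kx.T                                                         # vertical gradient
    h, w = img.shape
    gx, gy = np.zeros_like(img), np.zeros_like(img)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = img[i-1:i+2, j-1:j+2]
            gx[i, j] = np.sum(patch * kx)
            gy[i, j] = np.sum(patch * ky)
    return np.hypot(gx, gy)  # norm of the gradient vector

# synthetic image: dark left half, bright right half (step edge between columns 3 and 4)
img = np.zeros((6, 8))
img[:, 4:] = 1.0

mag = sobel_magnitude(img)
print(np.argmax(mag[3]))  # strongest gradient response sits at the step edge
```

The magnitude peaks exactly where the intensity jumps, which is why thresholding this quantity yields edge candidates.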
You can use these edge feature arrays (cat_edges and dog_edges) by flattening them, by extracting the pixel values and positions pertaining to the edges (the non-zero values), or by aggregating them, for example counting the total number of edge pixels, computing their mean value, and so on.

Object Detection

Another interesting technique in the world of computer vision is object detection, where features useful in highlighting specific objects in an image are detected and extracted. The histogram of oriented gradients, also known as HOG, is one of the techniques used extensively in object detection. Going into the details of this technique is not possible in the current scope, but for the purpose of feature engineering, you should remember that the HOG algorithm follows a sequence of steps similar to edge detection. The image is first normalized and denoised to remove excess illumination effects. First order image gradients are then computed to capture image attributes like contour, texture, and so on. Gradient histograms are built on top of these gradients over specific windows called cells. Finally, these cells are normalized and a flattened feature descriptor is obtained, which can be used as a feature vector for our models. The following code shows the HOG object detection technique on our sample images.
In [11]: from skimage.feature import hog
    ...: from skimage import exposure
    ...:
    ...: fd_cat, cat_hog = hog(cgs, orientations=8, pixels_per_cell=(8, 8),
    ...:                       cells_per_block=(3, 3), visualise=True)
    ...: fd_dog, dog_hog = hog(dgs, orientations=8, pixels_per_cell=(8, 8),
    ...:                       cells_per_block=(3, 3), visualise=True)
    ...:
    ...: # rescaling intensity to get better plots
    ...: cat_hogs = exposure.rescale_intensity(cat_hog, in_range=(0, 0.04))
    ...: dog_hogs = exposure.rescale_intensity(dog_hog, in_range=(0, 0.04))
    ...:
    ...: fig = plt.figure(figsize = (10,4))
    ...: ax1 = fig.add_subplot(1,2, 1)
    ...: ax1.imshow(cat_hogs, cmap='binary')
    ...: ax2 = fig.add_subplot(1,2, 2)
    ...: ax2.imshow(dog_hogs, cmap='binary')

Figure 4-30. HOG object detector to extract features based on object detection

The image plots in Figure 4-30 show how the HOG detector has identified the objects in our sample images. You can also get the flattened feature descriptors as follows.

In [12]: print(fd_cat, fd_cat.shape)
[ 0.00288784  0.00301086  0.0255757  ...,  0.          0.          0.        ] (47880,)

Localized Feature Extraction

We have talked about aggregating pixel values from two-dimensional image or feature matrices and also about flattening them into feature vectors. Localized feature extraction techniques are slightly better methods that try to detect and extract localized feature descriptors on various small localized regions of our input images, which is exactly what the name suggests. We will be using the popular, patented SURF algorithm invented by Herbert Bay, et al. SURF stands for Speeded Up Robust Features. The main idea is to obtain scale-invariant local feature descriptors from images, which can later be used as image features. This algorithm is similar to the popular SIFT algorithm.
There are two major phases in this algorithm. The first is to detect points of interest using square-shaped filters and Hessian matrices. The second is to build feature descriptors by extracting localized features around these points of interest. The descriptors are usually computed by taking a localized square image region around a point of interest and then aggregating Haar wavelet responses at specific interval-based sample points. We use the mahotas Python framework for extracting SURF feature descriptors from our sample images.

In [13]: from mahotas.features import surf
    ...: import mahotas as mh
    ...:
    ...: cat_mh = mh.colors.rgb2gray(cat)
    ...: dog_mh = mh.colors.rgb2gray(dog)
    ...:
    ...: cat_surf = surf.surf(cat_mh, nr_octaves=8, nr_scales=16, initial_step_size=1,
    ...:                      threshold=0.1, max_points=50)
    ...: dog_surf = surf.surf(dog_mh, nr_octaves=8, nr_scales=16, initial_step_size=1,
    ...:                      threshold=0.1, max_points=54)
    ...:
    ...: fig = plt.figure(figsize = (10,4))
    ...: ax1 = fig.add_subplot(1,2, 1)
    ...: ax1.imshow(surf.show_surf(cat_mh, cat_surf))
    ...: ax2 = fig.add_subplot(1,2, 2)
    ...: ax2.imshow(surf.show_surf(dog_mh, dog_surf))

Figure 4-31. Localized feature extraction with SURF

The square boxes in the image plots in Figure 4-31 depict the square image regions around the points of interest that were used for localized feature extraction. You can also use the surf.dense(...) function to extract uniform dimensional feature descriptors at dense points with regular interval spacing in pixels. The following code depicts how to achieve this.

In [14]: cat_surf_fds = surf.dense(cat_mh, spacing=10)
    ...: dog_surf_fds = surf.dense(dog_mh, spacing=10)
    ...: cat_surf_fds.shape
Out[14]: (140, 64)

We see from the preceding output that we have obtained 140 feature descriptors of 64 elements each.
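Such a descriptor matrix can be collapsed into fixed-length feature vectors in several ways; the following sketch uses a random stand-in matrix of the same 140x64 shape (not the actual SURF output) to show two simple schemes:

```python
import numpy as np

rng = np.random.RandomState(42)
descriptors = rng.rand(140, 64)   # stand-in for the 140x64 SURF descriptor matrix

# scheme 1: aggregate across descriptors -> one 64-dimensional feature vector
mean_fv = descriptors.mean(axis=0)

# scheme 2: flatten the whole matrix -> one 140*64 = 8960-dimensional feature vector
flat_fv = descriptors.flatten()

print(mean_fv.shape, flat_fv.shape)  # → (64,) (8960,)
```

Aggregation keeps dimensionality fixed regardless of how many interest points an image yields, which is often the deciding factor when feeding such features into downstream models.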
You can further apply other schemes on these descriptors, like aggregation, flattening, and so on, to derive further features. Another sophisticated technique that you can apply on top of SURF feature descriptors is the visual bag of words model, which we discuss in the next section.

Visual Bag of Words Model

We have seen the effectiveness of the popular Bag of Words model in extracting meaningful features from unstructured text documents. Bag of words refers to breaking a document down into its constituent words and computing frequencies of occurrence or other measures like tf-idf. Similarly, in the case of raw image pixel matrices or feature descriptors derived from other algorithms, we can apply a bag of words principle. However, the constituents here are not words but subsets of features/pixels extracted from images that are similar to each other.

Imagine you have multiple pictures of octopuses and you were able to extract the 140 dense SURF features, each having 64 values per feature vector. You can now use an unsupervised learning algorithm like clustering to extract clusters of similar feature descriptors. Each cluster can be labeled as a visual word or visual feature, and each feature descriptor can then be binned into one of these clusters. Thus, for the 140x64 feature descriptor matrix, you end up with a one-dimensional visual bag of words vector holding the counts of feature descriptors assigned to each visual word. Each visual word tends to capture some portion of the images that is similar across them, like octopus eyes, tentacles, suckers, and so on, as depicted in Figure 4-32.

Figure 4-32. Visual bag of words (Courtesy of Ian London, Image Classification in Python with Visual Bag of Words)

The basic idea is hence to obtain a feature descriptor matrix using an algorithm like SURF, apply an unsupervised algorithm like K-means clustering, and extract k bins or visual features/words along with their counts (based on the number of feature descriptors assigned to each bin). Then, for each subsequent image, once you extract its feature descriptors, you can use the K-means model to assign each descriptor to one of the visual feature clusters and obtain a one-dimensional vector of counts. This is depicted in Figure 4-33 for a sample octopus image, assuming our VBOW (Visual Bag of Words) model has three bins: eyes, tentacles, and suckers.

Figure 4-33. Transforming an image into a VBOW vector (Courtesy of Ian London, Image Classification in Python with Visual Bag of Words)

You can see from Figure 4-33 how a two-dimensional image and its corresponding feature descriptors can easily be transformed into a one-dimensional VBOW vector [1, 3, 5]. Going into extensive details of the VBOW model is not possible in the current scope, but I would like to thank my friend and fellow data scientist, Ian London, for providing the two figures on VBOW models. I also recommend checking out his wonderful blog article https://ianlondon.github.io/blog/visual-bag-of-words/, which talks about using the VBOW model for image classification.

We will now take the 140x64 SURF feature descriptors for our two sample images, run K-means clustering on them, and compute VBOW vectors for each image by assigning each feature descriptor to one of the bins. We take k=20 in this case. See Figure 4-34.
In [15]: from sklearn.cluster import KMeans
    ...:
    ...: k = 20
    ...: km = KMeans(k, n_init=100, max_iter=100)
    ...:
    ...: surf_fd_features = np.array([cat_surf_fds, dog_surf_fds])
    ...: km.fit(np.concatenate(surf_fd_features))
    ...:
    ...: vbow_features = []
    ...: for feature_desc in surf_fd_features:
    ...:     labels = km.predict(feature_desc)
    ...:     vbow = np.bincount(labels, minlength=k)
    ...:     vbow_features.append(vbow)
    ...:
    ...: vbow_df = pd.DataFrame(vbow_features)
    ...: pd.concat([df, vbow_df], axis=1)

Figure 4-34. Transforming SURF descriptors into VBOW vectors for sample images

You can see how easy it is to transform complex two-dimensional SURF feature descriptor matrices into easy to interpret VBOW vectors. Let's now take a new image and think about how we could apply the VBOW pipeline. First we extract the SURF feature descriptors from the image using the following snippet. (This is only to depict the localized image subsets used in SURF; we will actually use the dense features as before.) See Figure 4-35.

In [16]: new_cat = io.imread('datasets/new_cat.png')
    ...: newcat_mh = mh.colors.rgb2gray(new_cat)
    ...: newcat_surf = surf.surf(newcat_mh, nr_octaves=8, nr_scales=16, initial_step_size=1,
    ...:                         threshold=0.1, max_points=50)
    ...:
    ...: fig = plt.figure(figsize = (10,4))
    ...: ax1 = fig.add_subplot(1,2, 1)
    ...: ax1.imshow(surf.show_surf(newcat_mh, newcat_surf))

Figure 4-35. Localized feature extraction with SURF for new image

Let's now extract the dense SURF features and transform them into a VBOW vector using our previously trained VBOW model. The following code helps us achieve this. See Figure 4-36.
In [17]: new_surf_fds = surf.dense(newcat_mh, spacing=10)
    ...: labels = km.predict(new_surf_fds)
    ...: new_vbow = np.bincount(labels, minlength=k)
    ...: pd.DataFrame([new_vbow])

Figure 4-36. Transforming new image SURF descriptors into a VBOW vector

Thus you can see the final VBOW feature vector for the new image based on its SURF feature descriptors. This is also an example of using an unsupervised Machine Learning model for feature engineering. You can now compare the similarity of this new image with the two sample images using similarity metrics.

In [18]: from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity
    ...:
    ...: eucdis = euclidean_distances(new_vbow.reshape(1,-1), vbow_features)
    ...: cossim = cosine_similarity(new_vbow.reshape(1,-1), vbow_features)
    ...:
    ...: result_df = pd.DataFrame({'EuclideanDistance': eucdis[0],
    ...:                           'CosineSimilarity': cossim[0]})
    ...: pd.concat([df, result_df], axis=1)
Out[18]:
  Image  CosineSimilarity  EuclideanDistance
0   Cat          0.871609          21.260292
1   Dog          0.722096          30.000000

Based on the distance and similarity metrics, we can see that our new image (of a cat) is definitely closer to the cat image than to the dog image. Try this out with a bigger dataset to get better results!

Automated Feature Engineering with Deep Learning

We have used many simple and sophisticated feature engineering techniques so far in this section. Building complex feature engineering systems and pipelines is time consuming, and building algorithms for the same is even more taxing. Deep Learning is a novel approach to automating this complex task by having the machine extract features automatically through learning multiple layered, complex representations of the underlying raw data.
Convolutional neural networks, or CNNs, are used extensively for automated feature extraction in images. We already covered the basic principles of CNNs in Chapter 1; go ahead and refresh your memory by heading to the "Important Concepts" sub-section under the "Deep Learning" section in Chapter 1. As mentioned before, CNNs operate on the principles of convolution and pooling, besides the regular activation function layers.

Convolutional layers typically slide, or convolve, learnable filters (also known as kernels or convolution matrices) across the entire width and height of the input image pixels. Dot products between the input pixels and the filter are computed at each position as the filter slides, creating a two-dimensional activation map for that filter; consequently the network is able to learn filters that activate on detecting specific features like edges, corners, and so on. If we take n filters, we get n separate two-dimensional activation maps, which can then be stacked along the depth dimension to obtain the output volume.

Pooling is a kind of aggregation or downsampling layer, typically a non-linear downsampling operation inserted between convolutional layers. Filters are applied here too: they are slid along the convolution output matrix and, for each sliding operation (each step of which is known as a stride), the elements in the slice of the matrix covered by the pooling filter are either summed (sum pooling), averaged (mean pooling), or reduced to their maximum value (max pooling). Max pooling tends to work really well in several real-world scenarios. Pooling helps reduce feature dimensionality and control model overfitting.

Let's now try to use Deep Learning for automated feature extraction on our sample images using CNNs. Load the following dependencies necessary for building deep networks.
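As an aside, the max pooling operation just described can be sketched in plain NumPy on a hypothetical 4x4 feature map (a toy illustration, independent of the keras pipeline):

```python
import numpy as np

def max_pool_2x2(fmap):
    """Toy max pooling: slide a non-overlapping 2x2 window (stride 2) and keep the max."""
    h, w = fmap.shape
    # group the map into 2x2 blocks, then take the max within each block
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 0],
                 [1, 8, 3, 3]])
print(max_pool_2x2(fmap))  # → [[6 4] [8 9]]
```

Each 2x2 block collapses to a single value, halving both spatial dimensions while preserving the strongest activations; replacing max with sum or mean gives the other pooling variants.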
In [19]: from keras.models import Sequential
    ...: from keras.layers.convolutional import Conv2D
    ...: from keras.layers.convolutional import MaxPooling2D
    ...: from keras import backend as K
Using TensorFlow backend.

You can use Theano or TensorFlow as the backend Deep Learning framework for keras; I am using TensorFlow in this scenario. Let's now build a basic two-layer CNN with a max pooling layer between the convolutional layers.

In [20]: model = Sequential()
    ...: model.add(Conv2D(4, (4, 4), input_shape=(168, 300, 3), activation='relu',
    ...:                  kernel_initializer='glorot_uniform'))
    ...: model.add(MaxPooling2D(pool_size=(2, 2)))
    ...: model.add(Conv2D(4, (4, 4), activation='relu',
    ...:                  kernel_initializer='glorot_uniform'))

We can visualize this network architecture using the following code snippet to better understand the layers used in this network.

In [21]: from IPython.display import SVG
    ...: from keras.utils.vis_utils import model_to_dot
    ...:
    ...: SVG(model_to_dot(model, show_shapes=True, show_layer_names=True,
    ...:                  rankdir='TB').create(prog='dot', format='svg'))

Figure 4-37. Visualizing our two-layer convolutional neural network architecture

From the depiction in Figure 4-37, you can see that we are using two two-dimensional convolutional layers containing four (4x4) filters each, with a (2x2) max pooling layer between them for some downsampling. Let's now build some functions to extract features from these intermediate network layers.
In [22]: first_conv_layer = K.function([model.layers[0].input, K.learning_phase()],
    ...:                               [model.layers[0].output])
    ...: second_conv_layer = K.function([model.layers[0].input, K.learning_phase()],
    ...:                                [model.layers[2].output])

Let's now use these functions to extract the feature representations learned in the convolutional layers and visualize them to see what the network is trying to learn from the images.

In [23]: catr = cat.reshape(1, 168, 300, 3)
    ...:
    ...: # extract features
    ...: first_conv_features = first_conv_layer([catr])[0][0]
    ...: second_conv_features = second_conv_layer([catr])[0][0]
    ...:
    ...: # view feature representations
    ...: fig = plt.figure(figsize = (14,4))
    ...: ax1 = fig.add_subplot(2,4, 1)
    ...: ax1.imshow(first_conv_features[:,:,0])
    ...: ax2 = fig.add_subplot(2,4, 2)
    ...: ax2.imshow(first_conv_features[:,:,1])
    ...: ax3 = fig.add_subplot(2,4, 3)
    ...: ax3.imshow(first_conv_features[:,:,2])
    ...: ax4 = fig.add_subplot(2,4, 4)
    ...: ax4.imshow(first_conv_features[:,:,3])
    ...: ax5 = fig.add_subplot(2,4, 5)
    ...: ax5.imshow(second_conv_features[:,:,0])
    ...: ax6 = fig.add_subplot(2,4, 6)
    ...: ax6.imshow(second_conv_features[:,:,1])
    ...: ax7 = fig.add_subplot(2,4, 7)
    ...: ax7.imshow(second_conv_features[:,:,2])
    ...: ax8 = fig.add_subplot(2,4, 8)
    ...: ax8.imshow(second_conv_features[:,:,3])

Figure 4-38. Intermediate feature maps obtained after passing through the convolutional layers

The feature map visualizations depicted in Figure 4-38 are definitely interesting. You can clearly see that each feature matrix produced by the convolutional neural network is trying to learn something about the image, like its texture, corners, edges, illumination, hue, brightness, and so on.
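The convolution step that produces each such activation map can be sketched directly; this is a minimal valid-mode 2D convolution of my own, using a hand-crafted 1x2 edge filter for illustration (unlike the learned filters in the network above):

```python
import numpy as np

def conv2d(img, kernel):
    """Valid-mode 2D convolution: dot product of the kernel with each image patch."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

img = np.zeros((5, 5))
img[:, 2:] = 1.0                       # vertical step edge in a toy 5x5 image
vert_edge = np.array([[-1.0, 1.0]])    # hypothetical filter responding to left-to-right jumps

fmap = conv2d(img, vert_edge)          # the resulting 5x4 activation map
print(np.argmax(fmap[0]))              # strongest activation sits at the edge location
```

A trained CNN learns many such kernels from data instead of us hand-crafting them, and each kernel produces one activation map of this kind.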
This should give you an idea of how these activation feature maps can be used as features for images. In fact, you can stack the output of a CNN, flatten it if needed, and pass it as an input layer to a multi-layer, fully connected perceptron neural network to solve the problem of image classification. This should give you a head start on automated feature extraction with the power of Deep Learning! Don't worry if you did not understand some of the terms mentioned in this section; we cover Deep Learning and CNNs in more depth in a subsequent chapter. If you can't wait to get started with Deep Learning, you can fire up the bonus notebook provided with this chapter, called Bonus - Classifying handwritten digits using Deep CNNs.ipynb, for a complete real-world example of applying CNNs and Deep Learning to classify handwritten digits!

Feature Scaling

When dealing with numeric features, we may have attributes that are completely unbounded in nature, like view counts of a video or web page hits. Using the raw values as input features might make models biased toward features having really high magnitude values. Models like linear and logistic regression are typically sensitive to the magnitude or scale of features, while others like tree based methods can still work without feature scaling. However, it is still recommended to normalize and scale down features, especially if you want to try out multiple Machine Learning algorithms on the same input features. We have already seen some examples of scaling and transforming features using log and Box-Cox transforms earlier in this chapter. In this section, we look at some popular feature scaling techniques. You can load feature_scaling.py directly and start running the examples, or use the jupyter notebook, Feature Scaling.ipynb, for a more interactive experience. Let's start by loading the following dependencies and configurations.
In [1]: from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
   ...: import numpy as np
   ...: import pandas as pd
   ...: np.set_printoptions(suppress=True)

Let's now load some sample data of user views pertaining to online videos. The following snippet creates this sample dataset.

In [2]: views = pd.DataFrame([1295., 25., 19000., 5., 1., 300.], columns=['views'])
   ...: views
Out[2]:
     views
0   1295.0
1     25.0
2  19000.0
3      5.0
4      1.0
5    300.0

From the preceding dataframe, we can see that we have six videos whose total view counts are depicted by the feature views. Some videos have clearly been viewed far more than others, giving rise to values of high scale and magnitude. Let's look at how we can scale this feature using several handy techniques.

Standardized Scaling

The standard scaler standardizes each value in a feature column by removing the mean and scaling the variance to 1. This is also known as centering and scaling and can be denoted mathematically as

SS(Xi) = (Xi - μX) / σX

where each value in feature X is subtracted by the mean μX and the resultant is divided by the standard deviation σX. This is also popularly known as Z-score scaling. You can also divide the resultant by the variance instead of the standard deviation if needed. The following snippet helps us achieve this.

In [3]: ss = StandardScaler()
   ...: views['zscore'] = ss.fit_transform(views[['views']])
   ...: views
Out[3]:
     views    zscore
0   1295.0 -0.307214
1     25.0 -0.489306
2  19000.0  2.231317
3      5.0 -0.492173
4      1.0 -0.492747
5    300.0 -0.449877

We can see the standardized and scaled values in the zscore column of the preceding dataframe. In fact, you can manually apply the formula we used earlier to compute the same result. The following example computes the z-score mathematically.
In [4]: vw = np.array(views['views'])
   ...: (vw[0] - np.mean(vw)) / np.std(vw)
Out[4]: -0.30721413311687235

Min-Max Scaling

With min-max scaling, we can transform and scale our feature values such that each value falls within the range [0, 1]. The MinMaxScaler class in scikit-learn also allows you to specify your own upper and lower bounds for the scaled value range using the feature_range parameter. Mathematically we can represent this scaler as

MMS(Xi) = (Xi - min(X)) / (max(X) - min(X))

where we scale each value in the feature X by subtracting from it the minimum value of the feature, min(X), and dividing the resultant by the range of the feature, max(X) - min(X). The following snippet helps us compute this.

In [5]: mms = MinMaxScaler()
   ...: views['minmax'] = mms.fit_transform(views[['views']])
   ...: views
Out[5]:
     views    zscore    minmax
0   1295.0 -0.307214  0.068109
1     25.0 -0.489306  0.001263
2  19000.0  2.231317  1.000000
3      5.0 -0.492173  0.000211
4      1.0 -0.492747  0.000000
5    300.0 -0.449877  0.015738

The preceding output shows the min-max scaled values in the minmax column; as expected, the most viewed video at row index 2 has a value of 1, and the least viewed video at row index 4 has a value of 0. You can also compute this mathematically using the following code (a sample computation for the first row).

In [6]: (vw[0] - np.min(vw)) / (np.max(vw) - np.min(vw))
Out[6]: 0.068108847834096528

Robust Scaling

The disadvantage of min-max scaling is that the presence of outliers often affects the scaled values of a feature. Robust scaling uses specific statistical measures to scale features without being affected by outliers.
Mathematically, this scaler can be represented as

RS(Xi) = (Xi - median(X)) / IQR(1,3)(X)

where we scale each value of feature X by subtracting the median of X and dividing the resultant by the IQR, the Inter-Quartile Range of X, which is the difference between the first quartile (25th percentile) and the third quartile (75th percentile). The following code performs robust scaling on our sample feature.

In [7]: rs = RobustScaler()
   ...: views['robust'] = rs.fit_transform(views[['views']])
   ...: views
Out[7]:
     views    zscore    minmax     robust
0   1295.0 -0.307214  0.068109   1.092883
1     25.0 -0.489306  0.001263  -0.132690
2  19000.0  2.231317  1.000000  18.178528
3      5.0 -0.492173  0.000211  -0.151990
4      1.0 -0.492747  0.000000  -0.155850
5    300.0 -0.449877  0.015738   0.132690

The scaled values are depicted in the robust column, and you can compare them with the scaled features in the other columns. You can also compute the same result using the mathematical equation we formulated for the robust scaler, as depicted in the following snippet (for the first row index value).

In [8]: quartiles = np.percentile(vw, (25., 75.))
   ...: iqr = quartiles[1] - quartiles[0]
   ...: (vw[0] - np.median(vw)) / iqr
Out[8]: 1.0928829915560916

There are several other techniques for feature scaling and normalization, but these should be sufficient to get you started and are used extensively in building Machine Learning systems. Always check whether you need to scale and standardize features whenever you are dealing with numerical features.

Feature Selection

While it is good to engineer features that capture latent representations and patterns in the underlying data, it is not always a good thing to deal with feature sets having thousands of features or more. Dealing with a large number of features brings us to the concept of the curse of dimensionality, which we mentioned earlier in the "Bin Counting" section under "Feature Engineering on Categorical Data". More features tend to make models more complex and difficult to interpret. Besides this, they can often lead to models overfitting the training data: a very specialized model tuned only to the data it was trained on, which will perform very poorly on new, previously unseen data even if it shows high performance during training. The ultimate objective is to select an optimal number of features to train and build models that generalize well on the data and prevent overfitting. Feature selection strategies can be divided into three main areas, based on the type of strategy and techniques employed. They are described briefly as follows.

• Filter methods: These techniques select features purely based on metrics like correlation, mutual information, and so on. They do not depend on results obtained from any model and usually check the relationship of each feature with the response variable to be predicted. Popular methods include threshold based methods and statistical tests.

• Wrapper methods: These techniques try to capture interactions between multiple features by using a recursive approach to build multiple models on feature subsets and selecting the subset that gives the best performing model. Methods like forward selection and backward elimination are popular wrapper based methods.

• Embedded methods: These techniques try to combine the benefits of the other two methods by leveraging Machine Learning models themselves to rank and score feature variables based on their importance. Tree based methods like decision trees and ensemble methods like random forests are popular examples of embedded methods.
The benefits of feature selection include better performing models, less overfitting, more generalized models, shorter computation and model training times, and better insight into the importance of the various features in your data. In this section, we look at some of the most widely used techniques in feature selection. You can load feature_selection.py directly and start running the examples or use the jupyter notebook, Feature Selection.ipynb, for a more interactive experience. Let's start by loading the following dependencies and configurations.

In [1]: import numpy as np
   ...: import pandas as pd
   ...: np.set_printoptions(suppress=True)
   ...: pt = np.get_printoptions()['threshold']

We will now look at various ways of selecting features, including statistical and model based techniques, by using some sample datasets.

Threshold-Based Methods

This is a filter based feature selection strategy, where you can use some form of cut-off or thresholding to limit the total number of features during feature selection. Thresholds can take various forms. Some of them can be used during the feature engineering process itself, where you can specify threshold parameters. A simple example of this would be to limit feature terms in the Bag of Words model, which we used for text based feature engineering earlier. The scikit-learn framework provides parameters like min_df and max_df, which can be used to specify thresholds for ignoring terms that have document frequency below and above user-specified thresholds. The following snippet depicts a way to do this.
In [2]: from sklearn.feature_extraction.text import CountVectorizer
   ...:
   ...: cv = CountVectorizer(min_df=0.1, max_df=0.85, max_features=2000)
   ...: cv
Out[2]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.85, max_features=2000, min_df=0.1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

This builds a count vectorizer that ignores feature terms occurring in less than 10% of the total corpus as well as terms occurring in more than 85% of the total corpus. Besides this, we also put a hard limit of 2,000 maximum features in the feature set. Another way of using thresholds is variance based thresholding, where features having low variance (below a user-specified threshold) are removed. This signifies that we want to remove features whose values are more or less constant across all the observations in our datasets. We can apply this to our Pokémon dataset, which we used earlier in this chapter. First we convert the Generation feature to a categorical feature as follows.

In [3]: df = pd.read_csv('datasets/Pokemon.csv')
   ...: poke_gen = pd.get_dummies(df['Generation'])
   ...: poke_gen.head()
Out[3]:
   Gen 1  Gen 2  Gen 3  Gen 4  Gen 5  Gen 6
0      1      0      0      0      0      0
1      1      0      0      0      0      0
2      1      0      0      0      0      0
3      1      0      0      0      0      0
4      1      0      0      0      0      0

Next, we want to remove features from the one hot encoded features where the variance is less than 0.15. We can do this using the following snippet.
In [4]: from sklearn.feature_selection import VarianceThreshold
   ...: vt = VarianceThreshold(threshold=.15)
   ...: vt.fit(poke_gen)
Out[4]: VarianceThreshold(threshold=0.15)

To view the variances as well as which features were finally selected by this algorithm, we can use the variances_ property and the get_support(...) function, respectively. The following snippet depicts this clearly in a formatted dataframe.

In [5]: pd.DataFrame({'variance': vt.variances_,
   ...:               'select_feature': vt.get_support()},
   ...:             index=poke_gen.columns).T
Out[5]:
                   Gen 1     Gen 2 Gen 3     Gen 4     Gen 5      Gen 6
select_feature      True     False  True     False      True      False
variance        0.164444  0.114944  0.16  0.128373  0.163711  0.0919937

We can clearly see which features have been selected based on their True values and their variance being above 0.15. To get the final subset of selected features, you can use the following code.

In [6]: poke_gen_subset = poke_gen.iloc[:,vt.get_support()].head()
   ...: poke_gen_subset
Out[6]:
   Gen 1  Gen 3  Gen 5
0      1      0      0
1      1      0      0
2      1      0      0
3      1      0      0
4      1      0      0

The preceding feature subset depicts that the features Gen 1, Gen 3, and Gen 5 have been finally selected out of the original six.

Statistical Methods

Another widely used filter based feature selection method, which is slightly more sophisticated, is to select features based on univariate statistical tests. You can use several statistical tests for regression and classification based models, including mutual information, ANOVA (analysis of variance), and chi-square tests. Based on the scores obtained from these statistical tests, you can select the best features. Let's now load a sample dataset with 30 features.
This dataset is known as the Wisconsin Diagnostic Breast Cancer dataset, which is also available in its native or raw format in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). We will use scikit-learn to load the data features and the response class variable.

In [7]: from sklearn.datasets import load_breast_cancer
   ...:
   ...: bc_data = load_breast_cancer()
   ...: bc_features = pd.DataFrame(bc_data.data, columns=bc_data.feature_names)
   ...: bc_classes = pd.DataFrame(bc_data.target, columns=['IsMalignant'])
   ...:
   ...: # build featureset and response class labels
   ...: bc_X = np.array(bc_features)
   ...: bc_y = np.array(bc_classes).T[0]
   ...: print('Feature set shape:', bc_X.shape)
   ...: print('Response class shape:', bc_y.shape)
Feature set shape: (569, 30)
Response class shape: (569,)

We can clearly see that, as we mentioned before, there are a total of 30 features in this dataset and a total of 569 rows of observations. To get more detail on the feature names and take a peek at the data points, you can use the following code.

In [8]: np.set_printoptions(threshold=30)
   ...: print('Feature set data [shape: '+str(bc_X.shape)+']')
   ...: print(np.round(bc_X, 2), '\n')
   ...: print('Feature names:')
   ...: print(np.array(bc_features.columns), '\n')
   ...: print('Response Class label data [shape: '+str(bc_y.shape)+']')
   ...: print(bc_y, '\n')
   ...: print('Response variable name:', np.array(bc_classes.columns))
   ...: np.set_printoptions(threshold=pt)
Feature set data [shape: (569, 30)]
[[  17.99   10.38  122.8  ...,    0.27    0.46    0.12]
 [  20.57   17.77  132.9  ...,    0.19    0.28    0.09]
 [  19.69   21.25  130.   ...,    0.24    0.36    0.09]
 ...,
 [  16.6    28.08  108.3  ...,    0.14    0.22    0.08]
 [  20.6    29.33  140.1  ...,    0.26    0.41    0.12]
 [   7.76   24.54   47.92 ...,    0.      0.29    0.07]]

Feature names:
['mean radius' 'mean texture' 'mean perimeter' 'mean area' 'mean smoothness'
 'mean compactness' 'mean concavity' 'mean concave points' 'mean symmetry'
 'mean fractal dimension' 'radius error' 'texture error' 'perimeter error'
 'area error' 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']

Response Class label data [shape: (569,)]
[0 0 0 ..., 0 0 1]

Response variable name: ['IsMalignant']

This gives us a better perspective on the data we are dealing with. The response class variable is binary, where 1 indicates the detected tumor was benign and 0 indicates it was malignant. The 30 features are real-valued numbers that describe characteristics of the cell nuclei present in digitized images of breast mass. Let's now use the chi-square test on this feature set and select the top 15 best features out of the 30 features. The following snippet helps us achieve this.

In [9]: from sklearn.feature_selection import chi2, SelectKBest
   ...: skb = SelectKBest(score_func=chi2, k=15)
   ...: skb.fit(bc_X, bc_y)
Out[9]: SelectKBest(k=15, score_func=<function chi2 at 0x...>)

You can see that we passed our input features (bc_X) and corresponding response class outputs (bc_y) to the fit(...) function when computing the necessary metrics. The chi-square test computes statistics between each feature and the class variable (univariate tests). Selecting the top K features removes the features with low scores; such features are most likely to be independent of the class variable and hence not useful in building models. We sort the scores to see the most relevant features using the following code.
In [10]: feature_scores = [(item, score) for item, score in zip(bc_data.feature_names,
                                                                skb.scores_)]
    ...: sorted(feature_scores, key=lambda x: -x[1])[:10]
Out[10]:
[('worst area', 112598.43156405364),
 ('mean area', 53991.655923750892),
 ('area error', 8758.5047053344697),
 ('worst perimeter', 3665.0354163405909),
 ('mean perimeter', 2011.1028637679051),
 ('worst radius', 491.68915743332195),
 ('mean radius', 266.10491719517802),
 ('perimeter error', 250.57189635982184),
 ('worst texture', 174.44939960571074),
 ('mean texture', 93.897508098633352)]

We can now create a subset of the 15 selected features obtained from our original feature set of 30 features with the help of the chi-square test by using the following code.

In [11]: select_features_kbest = skb.get_support()
    ...: feature_names_kbest = bc_data.feature_names[select_features_kbest]
    ...: feature_subset_df = bc_features[feature_names_kbest]
    ...: bc_SX = np.array(feature_subset_df)
    ...: print(bc_SX.shape)
    ...: print(feature_names_kbest)
(569, 15)
['mean radius' 'mean texture' 'mean perimeter' 'mean area' 'mean concavity'
 'radius error' 'perimeter error' 'area error' 'worst radius'
 'worst texture' 'worst perimeter' 'worst area' 'worst compactness'
 'worst concavity' 'worst concave points']

Thus, from the preceding output, you can see that our new feature subset bc_SX has 569 observations of 15 features instead of 30; we also printed the names of the selected features for ease of understanding. To view the new feature set, you can use the following snippet.

In [12]: np.round(feature_subset_df.iloc[20:25], 2)

Figure 4-39. Selected feature subset of the Wisconsin Diagnostic Breast Cancer dataset using chi-square tests

The dataframe with the top scoring features is depicted in Figure 4-39.
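As a sanity check on what SelectKBest is scoring, the chi-square statistic for non-negative, count-like features can be reproduced by hand: compare the observed per-class feature totals against the totals expected if each feature were independent of the class. The tiny matrix below is made up purely for illustration, and the manual formula is a sketch of how scikit-learn's chi2 behaves rather than its exact source code.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Tiny non-negative feature matrix and binary labels (made-up data)
X = np.array([[1., 2.],
              [3., 0.],
              [0., 4.],
              [2., 2.]])
y = np.array([0, 1, 0, 1])

# One-hot encode the labels, then compare observed per-class feature
# totals with the totals expected under feature/class independence
Y = np.column_stack([(y == c).astype(float) for c in np.unique(y)])
observed = Y.T.dot(X)                                  # (2 classes, 2 features)
expected = np.outer(Y.mean(axis=0), X.sum(axis=0))
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)

sklearn_scores, p_values = chi2(X, y)
print(np.allclose(manual_scores, sklearn_scores))
```

A large score means a feature's totals differ strongly across classes, which is exactly why the size-related features dominate the ranking above.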
Let’s now build a simple classification model using logistic regression on the original feature set of 30 features and compare its accuracy with that of a model built using our selected 15 features. For model evaluation, we will use the accuracy metric (percent of correct predictions) and a five-fold cross-validation scheme. We will cover model evaluation and tuning strategies in detail in Chapter 5, so do not despair if you cannot understand some of the terminology right now. The main idea here is to compare the model prediction performance between models trained on different feature sets.

In [13]: from sklearn.linear_model import LogisticRegression
    ...: from sklearn.model_selection import cross_val_score
    ...:
    ...: # build logistic regression model
    ...: lr = LogisticRegression()
    ...:
    ...: # evaluating accuracy for model built on full featureset
    ...: full_feat_acc = np.average(cross_val_score(lr, bc_X, bc_y, scoring='accuracy', cv=5))
    ...: # evaluating accuracy for model built on selected featureset
    ...: sel_feat_acc = np.average(cross_val_score(lr, bc_SX, bc_y, scoring='accuracy', cv=5))
    ...:
    ...: print('Model accuracy statistics with 5-fold cross validation')
    ...: print('Model accuracy with complete feature set', bc_X.shape, ':', full_feat_acc)
    ...: print('Model accuracy with selected feature set', bc_SX.shape, ':', sel_feat_acc)
Model accuracy statistics with 5-fold cross validation
Model accuracy with complete feature set (569, 30) : 0.950904193921
Model accuracy with selected feature set (569, 15) : 0.952643324356

The accuracy metrics clearly show that we actually built a better model with an accuracy of 95.26% when trained on the selected 15-feature subset, as compared to the model built with the original 30 features, which had an accuracy of 95.09%. Try this out on your own datasets! Do you see any improvements?
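One caveat worth noting: in the comparison just shown, the 15 features were selected using the entire dataset before cross-validation ran, which can leak a little information into the evaluation folds. A safer pattern, sketched here as an assumption-free variant of the same experiment (the max_iter value is an illustrative choice to ensure convergence), is to chain the selector and the classifier in a Pipeline so selection is re-fit on the training folds only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

bc = load_breast_cancer()

# Chaining selector and classifier makes SelectKBest re-fit on the
# training folds only, so the CV estimate is free of selection bias
pipe = Pipeline([('select', SelectKBest(score_func=chi2, k=15)),
                 ('clf', LogisticRegression(max_iter=5000))])
pipe_acc = np.average(cross_val_score(pipe, bc.data, bc.target,
                                      scoring='accuracy', cv=5))
print(round(pipe_acc, 4))
```

On a dataset this well behaved the difference is usually small, but on smaller or noisier datasets selecting features outside the cross-validation loop can noticeably inflate the estimate.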
Recursive Feature Elimination

You can also rank and score features with the help of a Machine Learning based model estimator, recursively eliminating lower scored features until you arrive at a specific feature subset count. Recursive Feature Elimination, also known as RFE, is a popular wrapper based feature selection technique that allows you to use this strategy. The basic idea is to start off with a specific Machine Learning estimator, like the Logistic Regression algorithm we used for our classification needs. Next we take the entire feature set of 30 features and the corresponding response class variables. RFE assigns weights to these features based on the model fit. The features with the smallest weights are pruned out and a model is then fit again on the remaining features to obtain the new weights or scores. This process is carried out recursively, each time eliminating the features with the lowest scores/weights, until the pruned feature subset contains the desired number of features that the user wanted to select (this is taken as an input parameter at the start). This strategy is also popularly known as backward elimination. Let's select the top 15 features on our breast cancer dataset now using RFE.

In [14]: from sklearn.feature_selection import RFE
    ...:
    ...: lr = LogisticRegression()
    ...: rfe = RFE(estimator=lr, n_features_to_select=15, step=1)
    ...: rfe.fit(bc_X, bc_y)
Out[14]:
RFE(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
  n_features_to_select=15, step=1, verbose=0)

We can now use the get_support(...) function to obtain the final 15 selected features. This is depicted in the following snippet.
In [15]: select_features_rfe = rfe.get_support()
    ...: feature_names_rfe = bc_data.feature_names[select_features_rfe]
    ...: print(feature_names_rfe)
['mean radius' 'mean texture' 'mean perimeter' 'mean smoothness'
 'mean concavity' 'mean concave points' 'mean symmetry' 'texture error'
 'worst radius' 'worst texture' 'worst smoothness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']

Can we compare this feature subset with the one we obtained using statistical tests in the previous section and see which features are common to both? Of course we can! Let's use set operations to get the list of features that were selected by both techniques.

In [16]: set(feature_names_kbest) & set(feature_names_rfe)
Out[16]:
{'mean concavity', 'mean perimeter', 'mean radius', 'mean texture',
 'worst concave points', 'worst concavity', 'worst radius', 'worst texture'}

Thus we can see that 8 out of 15 features are common and have been chosen by both feature selection techniques, which is definitely interesting!

Model-Based Selection

Tree based models like decision trees and ensemble models like random forests (ensembles of trees) can be utilized not just for modeling but also for feature selection. These models can compute feature importances when building the model, which in turn can be used for selecting the best features and discarding irrelevant features with lower scores. Random forest is an ensemble model that can be used as an embedded feature selection method, where each decision tree model in the ensemble is built on a bootstrap sample (a sample taken with replacement) of the training data. Splits at any node are made by choosing the best split from a random subset of the features rather than considering all the features.
This randomness tends to reduce the variance of the model at the cost of slightly increasing the bias. Overall this produces a better and more generalized model. We will cover the bias-variance tradeoff in more detail in Chapter 5. Let's now use the random forest model to score and rank features based on their importance.

In [17]: from sklearn.ensemble import RandomForestClassifier
    ...:
    ...: rfc = RandomForestClassifier()
    ...: rfc.fit(bc_X, bc_y)
Out[17]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

The following code uses this random forest estimator to score the features based on their importance, and we display the top 10 most important features based on this score.

In [18]: importance_scores = rfc.feature_importances_
    ...: feature_importances = [(feature, score) for feature, score in zip(bc_data.feature_names, importance_scores)]
    ...: sorted(feature_importances, key=lambda x: -x[1])[:10]
Out[18]:
[('worst area', 0.25116985146898885),
 ('worst radius', 0.16995187376059454),
 ('worst concavity', 0.1164662504282163),
 ('worst concave points', 0.11253251729478526),
 ('mean concave points', 0.10839170432994949),
 ('mean concavity', 0.063554137255925847),
 ('mean area', 0.023771318604377804),
 ('worst perimeter', 0.020636790800076958),
 ('worst texture', 0.019171556030722112),
 ('mean radius', 0.014908508522792335)]

You can now use a threshold based parameter to filter out the top n features as needed, or you can make use of the SelectFromModel meta-transformer provided by scikit-learn by using it as a wrapper on top of this model.
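A hedged sketch of that SelectFromModel wrapper follows; the 'median' threshold and the n_estimators and random_state values are illustrative choices, not the book's.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

bc = load_breast_cancer()

# Keep only the features whose importance is at least the median
# importance; with 30 features this retains roughly the top half
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
sfm = SelectFromModel(rfc, threshold='median')
bc_reduced = sfm.fit_transform(bc.data, bc.target)
print(bc.data.shape, '->', bc_reduced.shape)
```

The threshold can also be a plain float or a string like 'mean', so the same wrapper adapts to whichever importance cut-off suits your problem.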
Can you find out how many of the higher ranked features from the random forest model are in common with the previous two feature selectors?

Dimensionality Reduction

Dealing with a lot of features can lead to issues like model overfitting and overly complex models, which all roll up to what we have called the curse of dimensionality. Refer to the section "Dimensionality Reduction" in Chapter 1 to refresh your memory. Dimensionality reduction is the process of reducing the total number of features in our feature set using strategies like feature selection or feature extraction. We have already talked about feature selection extensively in the previous section. We now cover feature extraction, where the basic objective is to create new features from the existing set of features such that a higher-dimensional dataset with many features can be reduced to a lower-dimensional dataset of these newly created features. A very popular technique for linear transformation of data from higher to lower dimensions is Principal Component Analysis, also known as PCA. Let's try to understand more about PCA and how we can use it for feature extraction in the following sections.

Feature Extraction with Principal Component Analysis

Principal component analysis, popularly known as PCA, is a statistical method that uses a linear, orthogonal transformation to convert a higher-dimensional set of possibly correlated features into a lower-dimensional set of linearly uncorrelated features. These transformed and newly created features are also known as Principal Components (PCs). In any PCA transformation, the total number of PCs is always less than or equal to the initial number of features. The first principal component tries to capture the maximum variance of the original set of features.
Each of the succeeding components captures as much of the remaining variance as possible while being orthogonal to the preceding components. An important point to remember is that PCA is sensitive to feature scaling. Our main task is to take a set of initial features of dimension, let's say, D and reduce it to a subset of extracted principal components of a lower dimension d. The matrix decomposition process of Singular Value Decomposition is extremely useful in helping us obtain the principal components. You can quickly refresh your memory on SVD by referring to the sub-section "Singular Value Decomposition" under "Important Concepts" in the "Mathematics" section of Chapter 1 to check out the necessary mathematical formulas and concepts. Considering we have a data matrix F(n x D), where we have n observations and D dimensions (features), we can depict the SVD of the feature matrix as F(n x D) = USV^T, such that all the principal components are contained in the component V^T, which can be depicted as follows:

V^T(D x D) = [ PC1(1 x D)
               PC2(1 x D)
               ...
               PCD(1 x D) ]

The principal components are represented by {PC1, PC2, ..., PCD}, which are all one-dimensional vectors of dimension (1 x D). To obtain the principal components as column vectors, we can transpose this matrix, which gives the following representation.

PC(D x D) = (V^T)^T = [ PC1(D x 1)  PC2(D x 1)  ...  PCD(D x 1) ]

Now we can extract the first d principal components, such that d ≤ D, and the reduced principal component set can be depicted as follows.

PC(D x d) = [ PC1(D x 1)  PC2(D x 1)  ...  PCd(D x 1) ]

Finally, to perform dimensionality reduction, we can get the reduced feature set using the mathematical transformation F(n x d) = F(n x D) · PC(D x d), where the dot product between the original feature matrix and the reduced subset of principal components gives us a reduced feature set of d features.
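The derivation above can be checked numerically on a small random matrix; note also that the projection F(n x D)·PC(D x d) equals the first d columns of U scaled by the corresponding singular values, since F·V = US. The dimensions below are chosen arbitrarily for illustration.

```python
import numpy as np

# Small random data matrix: n=6 observations, D=4 features, reduce to d=2
rng = np.random.RandomState(0)
F = rng.randn(6, 4)
F = F - F.mean(axis=0)                  # center the data, as PCA assumes

U, S, VT = np.linalg.svd(F, full_matrices=False)
PC = VT.T                               # columns of PC are the principal components
F_reduced = F.dot(PC[:, :2])            # F(n x D) . PC(D x d)

# Since F = U S V^T and V is orthogonal, F . V equals U scaled by S,
# so the same projection can be read off the SVD directly
print(F_reduced.shape)                           # (6, 2)
print(np.allclose(F_reduced, U[:, :2] * S[:2]))  # True
```

This equivalence is one reason PCA implementations are typically built on top of SVD rather than an explicit covariance eigendecomposition.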
A very important point to remember here is that you might need to center your initial feature matrix by removing the mean, because by default PCA assumes that your data is centered around the origin. Let's try to extract the first three principal components from our breast cancer feature set of 30 features using SVD. We first center our feature matrix and then use SVD and subsetting to extract the first three PCs using the following code.

In [19]: # center the feature set
    ...: bc_XC = bc_X - bc_X.mean(axis=0)
    ...:
    ...: # decompose using SVD
    ...: U, S, VT = np.linalg.svd(bc_XC)
    ...:
    ...: # get principal components
    ...: PC = VT.T
    ...:
    ...: # get first 3 principal components
    ...: PC3 = PC[:, 0:3]
    ...: PC3.shape
Out[19]: (30, 3)

We can now get the reduced feature set of three features by using the dot product operation we discussed earlier. The following snippet gives us the final reduced feature set that can be used for modeling.

In [20]: # reduce feature set dimensionality
    ...: np.round(bc_XC.dot(PC3), 2)
Out[20]:
array([[-1160.14,  -293.92,   -48.58],
       [-1269.12,    15.63,    35.39],
       [ -995.79,    39.16,     1.71],
       ...,
       [ -314.5 ,    47.55,    10.44],
       [-1124.86,    34.13,    19.74],
       [  771.53,   -88.64,   -23.89]])

Thus you can see how powerful SVD and PCA can be in helping us reduce dimensionality by extracting the necessary features. Of course, in Machine Learning systems and pipelines you can use utilities from scikit-learn instead of writing unnecessary code and equations. The following code enables us to perform PCA on our breast cancer feature set leveraging scikit-learn's APIs.
In [21]: from sklearn.decomposition import PCA
    ...: pca = PCA(n_components=3)
    ...: pca.fit(bc_X)
Out[21]:
PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

To understand how much of the variance is explained by each of these principal components, you can use the following code.

In [22]: pca.explained_variance_ratio_
Out[22]: array([ 0.98204467,  0.01617649,  0.00155751])

From the preceding output, as expected, we can see that the maximum variance is explained by the first principal component. To obtain the reduced feature set, we can use the following snippet.

In [23]: bc_pca = pca.transform(bc_X)
    ...: np.round(bc_pca, 2)
Out[23]:
array([[ 1160.14,  -293.92,    48.58],
       [ 1269.12,    15.63,   -35.39],
       [  995.79,    39.16,    -1.71],
       ...,
       [  314.5 ,    47.55,   -10.44],
       [ 1124.86,    34.13,   -19.74],
       [ -771.53,   -88.64,    23.89]])

If you compare the values of this reduced feature set with the values obtained from our mathematical implementation, you will see they are exactly the same except for sign inversions in some cases. The reason for the sign inversion in some of the principal component values is that the direction of these principal components is unstable; the sign indicates direction. Even if two principal components point in opposite directions, they lie along the same axis, so this has no effect when modeling with the data. Let's now quickly build a logistic regression model as before and use model accuracy and five-fold cross-validation to evaluate the model quality using these three features.
In [24]: np.average(cross_val_score(lr, bc_pca, bc_y, scoring='accuracy', cv=5))
Out[24]: 0.92808003078106949

We can see from the preceding output that even though we used only three features derived from the principal components instead of the original 30 features, we still obtained a model accuracy close to 93%, which is quite decent!

Summary

This was a content-packed chapter with a lot of hands-on examples based on real-world datasets. The main intent of this chapter is to familiarize you with the essential concepts, tools, techniques, and strategies used for feature extraction, engineering, scaling, and selection. One of the toughest tasks that data scientists face day in and day out is data processing and feature engineering. Hence it is of paramount importance that you understand the various aspects involved in deriving features from raw data. This chapter is intended to be used both as a starting ground and as a reference guide for understanding which techniques and strategies should be applied when engineering features on your own datasets. We cover the basic concepts of feature engineering, scaling, and selection, and also the importance of each of these processes. Feature engineering techniques are covered extensively for diverse data types, including numerical, categorical, text, temporal, and images. Multiple feature scaling techniques are also covered, which are useful for toning down the scale and magnitude of features before modeling. Finally, we cover feature selection techniques in detail, with emphasis on the three different strategies of feature selection, namely filter, wrapper, and embedded methods. Special sections on dimensionality reduction and automated feature extraction using Deep Learning have also been included, since they have gained a lot of prominence in both research as well as the industry.
I want to conclude this chapter by leaving you with the following quote by Peter Norvig, renowned computer scientist and director at Google, which should reinforce the importance of feature engineering.

"More data beats clever algorithms, but better data beats more data."
—Peter Norvig

CHAPTER 5

Building, Tuning, and Deploying Models

A very popular saying in the Machine Learning community is "70% of Machine Learning is data processing" and, going by the structure of this book, the quote seems quite apt. In the preceding chapters, you saw how you can extract, process, and transform data to convert it into a form suitable for learning using Machine Learning algorithms. This chapter deals with the most important part of using that processed data: learning a model that you can then use to solve real-world problems. You also learned about the CRISP-DM methodology for developing data solutions and projects; the step involving building and tuning these models is the final step in the iterative cycle of Machine Learning. If you followed all the steps prescribed in the earlier chapters, by now you must have a cleaned and processed data/feature set. This data will mostly be numeric, in the form of arrays or dataframes (feature set). Most Machine Learning algorithms require the data to be in a numeric format because, at the heart of any Machine Learning algorithm, we have mathematical equations and an optimization problem to either minimize error/loss or maximize profit. Hence Machine Learning algorithms always work on numeric data. Check out Chapter 4 for feature engineering techniques to convert structured as well as unstructured data into ready-to-use numeric formats. We start this chapter by learning about the different types of algorithms you can use. Then you will learn how to choose a relevant algorithm for the data that you have, and you will be introduced to the concept of hyperparameters and learn how to tune the hyperparameters of any algorithm.
The chapter also covers a novel approach to interpreting models using open source frameworks. Besides this, you will also learn about persisting and deploying the developed models so you can start using them for your own needs and benefits. Based on the preceding topics, the chapter is divided into the following five major sections:

• Building models
• Model evaluation techniques
• Model tuning
• Model interpretation
• Deploying models in action

You should be fully acquainted with the material of the earlier chapters, since it will help in a better understanding of the various aspects of this chapter. All the code snippets and examples used in this chapter are available in the GitHub repository for this book at https://github.com/dipanjanS/practical-machine-learning-with-python under the directory/folder for Chapter 5. You can refer to the Python file named model_build_tune_deploy.py for all the examples used in this chapter and try the examples as you read this chapter, or you can refer to the jupyter notebook named Building, Tuning and Deploying Models.ipynb for a more interactive experience.

© Dipanjan Sarkar, Raghav Bali and Tushar Sharma 2018
D. Sarkar et al., Practical Machine Learning with Python, https://doi.org/10.1007/978-1-4842-3207-1_5

Building Models

Before we get on with the process of building models, we should understand what a model represents. In the simplest of terms, a model can be described as a relationship between output or response variables and the corresponding input or independent variables in a dataset. Sometimes this relationship can just be among input variables (in the case of datasets with no defined output or dependent variables). This relationship among variables can be expressed in terms of mathematical equations, functions, and rules, which link the output of the model to its inputs.
Consider the case of linear regression analysis; the output in that case is a set of parameters, also known as weights or coefficients (we explore this later in the chapter), and those parameters define the relationship between the input and output variables. The idea is to build a model using a learning process, such that you can learn the necessary parameters (coefficients) in the model that help translate the input variables (independent) into the corresponding output variable (dependent) with the least error for a dataset (leveraging validation metrics like mean squared error). The idea is not to predict a correct output value for every input data point (that leads to model over-fitting) but to generalize well over lots of data points, such that the error is minimal and stays minimal when you use this model on new data points. This is done by learning the right values of the coefficients\parameters during the model building process. So when we say we are learning a linear regression model, this is the set of important considerations implicit in that statement. See Figure 5-1.

Figure 5-1. A high-level representation of model building

When we specify linear regression as the candidate model, we define the nature of the relationship between our dependent and independent variables. The candidate model then becomes all the possible combinations of parameters for our model (more on this later). The learning algorithm is the way to determine the most optimal value of those parameters using some optimization process and validating the performance with some metric (such as mean squared error, to reduce the overall error). The final model is nothing but the most optimal value of our parameters as selected by our learning algorithm. So in the case of simple linear regression, the final model is nothing but a tuple containing the values of our two parameters, a and b. A point to remember here is that the term parameter is analogous to coefficients or weights in a model.
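To make the parameter-tuple idea concrete, here is a minimal sketch (the synthetic data and names are illustrative, not from the book's datasets) that fits y ≈ a·x + b by ordinary least squares with NumPy and recovers the learned tuple (a, b):

```python
import numpy as np

# synthetic data generated from a known relationship: y = 3x + 2 plus noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, 100)

# design matrix [x, 1]; lstsq finds the (a, b) minimizing squared error
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# the "final model" is just this learned parameter tuple
print(a, b)  # values close to the true a=3, b=2
```

Using the model on a new data point then simply means computing a*x_new + b, which is exactly what "translating inputs into outputs" amounts to for this representation.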
There are some other types of parameters called hyperparameters, which represent higher-level meta-parameters of the model and do not depend on the underlying data. They usually need to be set before we start the building or learning process. Usually these hyperparameters are tuned to get their optimal values as a part of the model-tuning phase (a part of the learning phase itself). Another important point to remember is that the output model is generally dependent on the learning algorithm we choose for our data.

Model Types

Models can be differentiated across a variety of categories and nomenclatures. A lot of this is based on the learning algorithm or method itself, which is used to build the model. Examples include whether the model is linear or nonlinear, what the output of the model is, whether it is a parametric or a non-parametric model, whether it is supervised, unsupervised, or semi-supervised, and whether it is an ensemble model or even a Deep Learning based model. Refer to the section “Machine Learning Methods” in Chapter 1 to refresh your memory of possible Machine Learning methods used for building models on datasets. In this section, we focus on some of the most popular models from supervised and unsupervised learning methods.

Classification Models

Classification is one of the most readily recognizable Machine Learning tasks and it's covered in detail in Chapter 1. It is a subset of a broader class of Machine Learning problems known as supervised learning. Supervised learning is the set of Machine Learning problems\tasks in which we have a labeled dataset with input attributes and corresponding output labels or classes (discrete). These inputs and corresponding outputs are then used in learning a generalized system, which can be used to predict results (output class labels) for previously unseen data points. Classification is one major part of the overall supervised learning domain.
The output of a classification model is normally a label or a category to which the input data point may belong. The task of solving a classification (or, in general, any supervised) problem involves a training set of data in which we have the data points labeled with their correct classes/categories. We then use supervised Machine Learning algorithms specific to classification problems to generalize something similar to a classification function for our problem. The input to this classification function is exactly similar to the data that we used to train our model. This input is typically data attributes or features that are generated in the feature engineering step. Typical classification models include the following major types of methods; however, the list is not exhaustive.

• Linear models like logistic regression, Naïve Bayes, and support vector machines
• Non-parametric models like K-nearest neighbors
• Tree based methods like decision trees
• Ensemble methods like random forests (bagging) and gradient boosted machines (boosting)
• Neural networks (MLPs)

Classification models can be further broken down based on the type of output variables and the number of output variables produced by them. This nomenclature is extremely important for understanding the type of classification problem you are dealing with by looking at the dataset attributes and the objective to be solved.

• Binary classification: When we have a total of two categories to differentiate between in the output response variable in the data, the problem is termed a binary classification problem. Hence you would need an appropriate model that performs binary classification (known as a binary classification model). A popular binary classification problem is the “Email classification problem”.
In this problem, the candidate e-mails need to be classified and labeled into either of two different categories: “Spam” or “Non Spam” (also known as “Ham”).

• Multi-Class classification: This is an extension of the binary classification problem. In this case we have more than two categories or classes into which our data can be classified. An example of the multi-class classification problem is predicting handwritten digits, where the response variable can have any value ranging from 0 to 9. This becomes a 10-class classification problem. Multi-class classification is a tough problem to solve, and the general scheme for solving it mostly involves some modification of the binary classification problem.

• Multi-Label classification: These classification problems typically involve data where the output variable is not always a single value but a vector having multiple values or labels. A simple example is predicting categories of news articles, where each news article can have multiple labels, like science, politics, religion, and so on.

Classification models often output the actual class labels or probabilities for each possible class label, which give a confidence level for each class in the prediction. The following are the major output formats from classification models.

• Category classification output: In some classification models, the output for any unknown data point is the predicted category or class label. These models usually calculate the probabilities of all the categories, but report only the one class label having the maximum probability or confidence.

• Category probability classification output: In these classification models, the output is the probability value of each possible class label. These models are important when we want to further use the output produced by our classification model for detailed analysis or to make complex decisions. A very simple example can be a typical marketing candidate selection problem.
In this problem, by getting the probability output of a potential conversion, we can narrow down our marketing expenses.

Regression Models

In classification models, we saw that the output variable predicted by the model was a discrete value; even when we got the output as a probability value, those probability values were tied to the discrete class label values of the possible categories. Regression models are another subset of the supervised learning family of models. In these models, the input data is generally labeled with a real valued output variable (continuous instead of discrete). Regression analysis is an important part of statistical learning and it has a very similar utility in the field of Machine Learning. In statistical learning, regression analysis is used to find relationships between the dependent and the independent variables (of which there can be one or more than one). In the case of regression models, when we feed our new data points to our learned\trained regression model, the output of the model is a continuous value.

Based on the number of variables, the probability distribution of the output variable, and the form of the relationship (linear versus nonlinear), we have different types of regression models. The following are some of the major categories of regression models.

• Simple linear regression: This is the simplest of all the regression models, but it is very effective and widely used for practical purposes. In this case, we only have a single independent variable and a single dependent variable. The dependent variable is a real value and is assumed to follow a normal distribution. In linear regression, while developing the model, we assume a linear relationship between the independent and dependent variables.

• Multiple linear regression: This is the extension of the simple linear regression model to include more than one independent variable. The other assumptions remain the same, i.e.
the dependent variable is still a real value and follows a normal distribution.

• Nonlinear regression: A regression model in which the dependent variable depends on a nonlinear transformation of the parameters\coefficients is termed a nonlinear regression model. This is slightly different from models in which we use a nonlinear transformation of the independent variables. Let's consider an example to make this point clear. Consider the model y = β0 + β1x² + ε. In this model we have used the square of the independent variable, but the parameters of the model (the betas, or coefficients) still enter linearly. Hence this model is still an example of a linear regression model, or to be more specific, a polynomial regression model. A model in which the coefficients do not enter linearly is what can be termed a nonlinear regression model. Consider an example that fulfills this criterion and hence can be termed a nonlinear regression model: y = β0 + (log β1)x² + ε. These models are quite hard to learn and hence not as widely used in practice. In most cases, a linear model with nonlinear transformations applied to the input variables usually suffices.

Regression models are a very important part of both statistics and Machine Learning, and we encourage you to refresh your memory by checking out the “Regression” section in Chapter 1 as well as to read some standard literature on regression models to dive deeper into further detailed concepts as necessary. We will be looking at regression in a future chapter dealing with a real-world case study.

Clustering Models

We briefly talked about clustering in Chapter 1, in case you might want to refresh your memory. Clustering is a member of a different class of Machine Learning methods, known as unsupervised learning. The simplest definition of clustering is the process of grouping similar data points together, where the data points do not have any prelabeled classes or categories.
The output of a typical clustering process is a set of segregated groups of data points, such that the data points in the same group are similar to each other but dissimilar from the members (data points) of other groups. The major difference between the two classes of methods is that, unlike supervised learning, we don't have a pre-labeled set of data that we can use to train and build our model. The input set for unsupervised learning problems is generally the whole dataset itself. Another important hallmark of the unsupervised learning set of problems is that they are quite hard to evaluate, as we will see in the later part of this chapter.

Clustering models can be of different types on the basis of their clustering methodologies and principles. We will briefly introduce the different types of clustering algorithms, which are as follows.

• Partition based clustering: A partition based clustering method is the most natural way to imagine the process of clustering. A partition based clustering method will define a notion of similarity. This can be any measure that can be derived from the attributes of the data points by applying mathematical functions on those attributes (features). Then, on the basis of this similarity measure, we can group data points that are similar to each other into a single group and separate the ones that are different. A partition based clustering model is usually developed using a recursive technique, i.e. we start with some arbitrary partitions of the data and, based on the similarity measure, we keep reassigning data points until we reach a stable stopping criterion. Examples include techniques like K-means, K-medoids, CLARANS, and so on.

• Hierarchical clustering: A hierarchical clustering model is different from the partition based clustering model in the way it is developed and the way it works.
In a hierarchical clustering paradigm, we either start with all the data points in one group (divisive clustering) or with every data point in its own group (agglomerative clustering). Based on the starting point, we can either keep dividing the big group into smaller groups or clusters based on some accepted similarity criterion, or we can keep merging different groups or clusters into bigger ones based on the same criterion. This process is normally stopped when a decided stopping condition is achieved. A similarity criterion could be the inter-data point distance within a cluster as compared to the data points of other clusters. Examples include Ward's minimum variance criterion based agglomerative hierarchical clustering.

• Density based clustering: Both the clustering models mentioned previously are quite dependent on the notion of distance. This leads to these algorithms primarily finding spherical clusters in the data. This can become a problem when we have arbitrarily shaped clusters in the data. This limitation can be addressed by doing away with the concept of distance metric based clustering. We can define a notion of “density” of data and use that to develop our clusters. The cluster development methodology then changes from finding points in the vicinity of some points to finding areas where we have some data points. This approach is not as straightforward to interpret as the distance metric approach, but it leads to clusters that need not necessarily be spherical. This is a very desirable trait, as it is unlikely that all the clusters of interest will be spherical in shape. Examples include DBSCAN and OPTICS.

Learning a Model

We have been talking about building models, learning parameters, and so on since the very start of this chapter. In this section, we explain what we actually mean by the term building a model from the perspective of Machine Learning.
In the following section, we briefly discuss the mathematical aspects of learning a model by taking a specific model as an example to make things clearer. We try to go light on the math in this section, so that you don't get overwhelmed with excess information. However, interested readers are recommended to check out any standard book on the theoretical and conceptual details of Machine Learning models and their implementations (we recommend An Introduction to Statistical Learning by Tibshirani et al., http://www.springer.com/in/book/9781461471370).

Three Stages of Machine Learning

Machine Learning can often be a complex field. We have different types of problems and tasks and different algorithms to solve them. We also have complex math, stats, and logic that form the very backbone of this diverse field. If you remember, you learned in the first chapter that Machine Learning is a combination of statistics, mathematics, optimization, linear algebra, and a bunch of other topics. But despair not; you do not need to start learning all of them right away! This diverse set of Machine Learning practices can mostly be unified by a simple three-stage paradigm. These three stages are:

• Representation
• Evaluation
• Optimization

Let's now discuss each of these stages separately to understand how almost all Machine Learning algorithms or methods work.

Representation

The first stage in any Machine Learning problem is the representation of the problem in a formal language. This is where we usually define the Machine Learning task to be performed based on the data and the business objective or problem to be solved. Usually this stage of the problem is masked as another stage, which is the selection of the ML algorithm or algorithms (you might have multiple possible model representations at this phase). When we select a target algorithm, we are implicitly deciding on the representation that we want to use for our problem.
This stage is akin to deciding on the set of hypothesis models, any of which can be the solution to our problem. For example, suppose we decide, looking at our dataset, that the Machine Learning task to be performed is regression, and we then select linear regression as our regression model. We have then decided on a linear combination based relationship between the dependent and the independent variables. Another implicit selection made in this stage is deciding on the parameters/weights/coefficients of the model that we need to learn.

Evaluation

Once we decide on the representation of our problem and a possible set of models, we need some judging criterion or criteria that will help us choose one model over the others, or the best model from a set of candidate models. The idea is to define a metric for evaluation or a scoring function\loss function that will enable this. This evaluation metric is generally provided in terms of an objective or an evaluation function (it can also be called a loss function). What these objective functions normally do is provide a numerical performance value that helps us decide on the effectiveness of any candidate model. The objective function depends on the type of problem we are solving, the representation we selected, and other things. A simple example: the lower the loss or error rate, the better the model is performing.

Optimization

The final stage in the learning process is optimization. Optimization in this case can simply be described as searching through the hypothesis model space representations to find the one that gives us the most optimal value of our evaluation function. While this description of optimization hides the vast majority of complexities involved in the process, it is a good way to understand the core principles. The optimization method that we will normally use is dependent on the choice of representations and the evaluation function or functions.
Fortunately, we already have a huge set of robust optimizers we can use once we have decided on the representation and the evaluation aspects. Optimization methods can be techniques like gradient descent and even meta-heuristic methods like genetic algorithms.

The Three Stages of Logistic Regression

The best way to understand the nuances of a complex process is to explain it using an example. In this section, we trace the three stages of the Machine Learning process using the logistic regression model. Logistic regression is an extension of linear regression to solve classification problems. We will see how a simple logistic regression problem is solved using gradient descent based optimization, which is one of the most popular optimization methods.

Representation

The representation of logistic regression is obtained by applying the logit (sigmoid) function to the representation of the linear regression model. The linear regression representation is given by this hypothesis function:

h(θ) = θ^T x

Here, θ represents the parameters of the model and x is the input vector. The logit (sigmoid) function is given by:

σ(t) = 1 / (1 + e^(−t))

Applying the logit function to the representation of linear regression gives us the representation of logistic regression:

h(θ) = 1 / (1 + e^(−θ^T x))

This is our representation for the logistic regression model. As the value of the logit function ranges between 0 and 1, we can decide between the two categories by supplying an input vector x and a set of parameters θ and calculating the value of h(θ). If it is less than 0.5, then typically the label is 0; otherwise, the label is 1 (binary classification problems leverage this).

Evaluation

The next step in the process is specifying an evaluation or cost function. The cost function in our case is dependent on the actual class of the data point.
Suppose the output of the logit function is 0.75 for a data point whose class is 1; then the error or loss for that case is 0.25. But if that data point is of category 0, then the error is 0.75. Using this analogy, we can define the cost function for one data point as follows:

cost(hθ(x), y) = −log(hθ(x))        if y = 1
cost(hθ(x), y) = −log(1 − hθ(x))    if y = 0

Leveraging the previous logic, the cost function for the whole dataset is given by:

cost(θ) = l(θ) = Σ_{i=1}^{m} [ y^(i) log(h(x^(i))) + (1 − y^(i)) log(1 − h(x^(i))) ]

Optimization

The cost function we described earlier is a function of θ, and hence we need to find the set of θ that optimizes it (normally we would minimize a cost function, but note that l(θ) drops the minus signs from the per-point costs, making it the log-likelihood, which we maximize instead). The value of θ that we thereby obtain represents the model (parameters) that we wanted to learn. The basic idea of maximizing or minimizing a function is that you differentiate the function and find the point where the gradient is zero. That is the point where the function takes either a minimum or a maximum value. But we have to keep in mind that the function we have is a nonlinear function in the parameter θ. Hence we won't be able to directly solve for the optimal values of θ. This is where we introduce the concept of the gradient descent method. In the simplest terms, gradient descent is the process in which we calculate the gradient of the function we want to optimize at each point and then keep moving in the direction of negative gradient values (or positive gradient values when maximizing, as here). Here, by moving, we mean to update the values of θ according to the gradient that we calculate.
We can calculate the gradient of the cost function with respect to each component of the parameter vector as follows:

∂l(θ)/∂θ_j = (y − hθ(x)) x_j

By repeating this calculation for each component of the parameter vector, we can calculate the gradient of the function with respect to the whole parameter vector. Once we get the gradient, the next step is to update the new set of parameter vector values using this equation:

θ_j := θ_j + α (y^(i) − hθ(x^(i))) x_j^(i)

Here, α represents the small step we want to take in the direction of the gradient. α is a hyperparameter of the optimization process (you can think of it as a learning rate or learning step size) and its value can determine whether we reach a global optimum or a local one. If we keep iterating the process, we will reach a point where our cost function does not change much irrespective of any small update that we make to the values of θ. Using this method, we can obtain the optimal set of parameter values. Keep in mind that this is a simplified description of gradient descent to make things easy to understand and interpret. Usually there are many other considerations involved in solving an optimization problem and a vast set of challenges. The main intent of this section is to make you aware of how optimization is an essential part of any Machine Learning problem.

Model Building Examples

The future chapters of this book are dedicated to building and tuning models on real-world datasets. So we will be doing a lot of model building, tuning, and evaluation in general. In this section, we want to depict some examples of each category of models that we discussed in the previous section. This will serve as a ready reckoner starting guide for our model building exploits in the future.
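Before moving on to library-based examples, the three stages just described can be sketched end to end in a few lines of NumPy. This is an illustrative toy implementation of the update rule derived above (function names and the data are our own, not from any library):

```python
import numpy as np

def sigmoid(t):
    # representation: squashes theta^T x into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, alpha=0.1, n_iter=1000):
    Xb = np.column_stack([np.ones(len(X)), X])  # add a bias column
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        h = sigmoid(Xb @ theta)
        # optimization: step of size alpha along the gradient (y - h) * x,
        # averaged over the dataset, to maximize the log-likelihood
        theta += alpha * Xb.T @ (y - h) / len(y)
    return theta

def predict(theta, X):
    Xb = np.column_stack([np.ones(len(X)), X])
    return (sigmoid(Xb @ theta) >= 0.5).astype(int)  # 0.5 decision threshold
```

Evaluation here would be, for example, the fraction of correct predictions on held-out data; libraries like scikit-learn wrap all three stages behind their fit and score methods.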
Classification

In all classification (or supervised learning) problems, the first step after preparing the whole dataset is to segregate the data into a testing and a training set, and optionally a validation set. The idea is to make the model learn by training it on the train dataset, evaluate and tune it on the validation dataset (or use techniques like cross validation), and finally check its performance on the test dataset. You will learn in the model evaluation section of this chapter that evaluating a model is a critical part of any Machine Learning solution. Hence, as a rule of thumb, we must always remember that the actual evaluation of a Machine Learning algorithm is always on data that it has not previously seen (even cross validation on the training dataset will use a part of the train data for model building and the rest for evaluation). Sometimes people use the whole dataset to train the model and then use some subset of it as a test set; this is a common mistake in Machine Learning. To accurately analyze a model, it must generalize well and perform well on data that it has never seen before. A good evaluation metric on training data but bad performance on unseen (validation or test) data means that the algorithm has failed to produce a generalized solution for the problem (more on this later).

For our classification example, we will use a popular multi-class classification problem we talked about earlier: handwritten digit recognition. The data for the same is available as part of the scikit-learn library. The problem here is to predict the actual digit value from a handwritten image of a digit. In its original form, the problem comes from the domain of image based classification and computer vision. In the dataset, each data point is a 1x64 feature vector, which is the flattened representation of an 8x8 grayscale image of a handwritten digit.
Before we proceed to building any model, let's first see how both the data and the image we intend to analyze look. The following code will load the data for the image at index 10 and plot it.

In [2]: from sklearn import datasets
   ...: import matplotlib.pyplot as plt
   ...: %matplotlib inline
   ...: digits = datasets.load_digits()
   ...:
   ...: plt.figure(figsize=(3, 3))
   ...: plt.imshow(digits.images[10], cmap=plt.cm.gray_r)

The image generated by the code is depicted in Figure 5-2. Any guesses as to which number it represents?

Figure 5-2. Handwritten digit data representing the digit zero

We can examine the raw pixel matrix, its flattened vector representation, and the number (class label) represented by the image using the following code.

# actual image pixel matrix
In [3]: digits.images[10]
Out[3]:
array([[  0.,   0.,   1.,   9.,  15.,  11.,   0.,   0.],
       [  0.,   0.,  11.,  16.,   8.,  14.,   6.,   0.],
       [  0.,   2.,  16.,  10.,   0.,   9.,   9.,   0.],
       [  0.,   1.,  16.,   4.,   0.,   8.,   8.,   0.],
       [  0.,   4.,  16.,   4.,   0.,   8.,   8.,   0.],
       [  0.,   1.,  16.,   5.,   1.,  11.,   3.,   0.],
       [  0.,   0.,  12.,  12.,  10.,  10.,   0.,   0.],
       [  0.,   0.,   1.,  10.,  13.,   3.,   0.,   0.]])

# flattened vector
In [4]: digits.data[10]
Out[4]:
array([  0.,   0.,   1.,   9.,  15.,  11.,   0.,   0.,   0.,   0.,  11.,
        16.,   8.,  14.,   6.,   0.,   0.,   2.,  16.,  10.,   0.,   9.,
         9.,   0.,   0.,   1.,  16.,   4.,   0.,   8.,   8.,   0.,   0.,
         4.,  16.,   4.,   0.,   8.,   8.,   0.,   0.,   1.,  16.,   5.,
         1.,  11.,   3.,   0.,   0.,   0.,  12.,  12.,  10.,  10.,   0.,
         0.,   0.,   0.,   1.,  10.,  13.,   3.,   0.,   0.])

# image class label
In [5]: digits.target[10]
Out[5]: 0

We will later see that we can frame this problem in a variety of ways.
But for this tutorial, we will use a logistic regression model to do this classification. Before we proceed to model building, we will split the dataset into separate test and train sets. The size of the test set is generally dependent on the total amount of data available. In our example, we will use a test set that is 30% of the overall dataset. The total number of data points in each set is printed for ease of understanding.

In [12]: X_digits = digits.data
    ...: y_digits = digits.target
    ...:
    ...: num_data_points = len(X_digits)
    ...:
    ...: X_train = X_digits[:int(.7 * num_data_points)]
    ...: y_train = y_digits[:int(.7 * num_data_points)]
    ...: X_test = X_digits[int(.7 * num_data_points):]
    ...: y_test = y_digits[int(.7 * num_data_points):]
    ...: print(X_train.shape, X_test.shape)
(1257, 64) (540, 64)

From the preceding output, we can see our train dataset has 1257 data points and the test dataset has 540 data points. The next step in the process is specifying the model that we will be using and the hyperparameter values that we want to use. The values of these hyperparameters do not depend on the underlying data; they are usually set prior to model training and are fine tuned to extract the best model. You will learn about tuning later in this chapter. For the time being, we will use the default values, as depicted when we initialize the model estimator, and fit our model on the training set.
In [14]: from sklearn import linear_model
    ...:
    ...: logistic = linear_model.LogisticRegression()
    ...: logistic.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

You can see the various hyperparameters and parameters of the model depicted in the preceding output. Let's now test the accuracy of this model on the test dataset.

In [15]: print('Logistic Regression mean accuracy: %f' % logistic.score(X_test, y_test))
Logistic Regression mean accuracy: 0.900000

This is all it takes in scikit-learn to fit a model like logistic regression. In the first step, we identified the model that we wanted to use, which in our case was a linear model called logistic regression. Then we called the fit method of that object with our training data and its output labels. The fit method updates the model object with the learned parameters of the model. We then used the score method of the object to determine the accuracy of the fitted model on our test set. So the model we developed, without any intensive tuning, is 90% accurate at predicting handwritten digits. This concludes our very basic example of fitting a classification model on a dataset. Note that our dataset was in a fully processed and cleaned format. You need to ensure your data is prepared in the same way before you proceed to fit any models on it when solving any problem.

Clustering

In this section, you will learn how we can fit a clustering model on another dataset. In the example we pick, we will use a labeled dataset to help us see the results of the clustering model and compare them with the actual labels.
A point to remember here is that labeled data is usually not available in the real world, which is why we choose unsupervised methods like clustering. We will cover two different algorithms, one each from partition based clustering and hierarchical clustering. The data we will use for our clustering example is the very popular Wisconsin Diagnostic Breast Cancer dataset, which we covered in detail in Chapter 4 in the section "Feature Selection and Dimensionality Reduction". Do check out those sections to refresh your memory. This dataset has 30 attributes or features and a corresponding label for each data point (breast mass) depicting if it has cancer (malignant: label value 0) or no cancer (benign: label value 1). Let's load the data using the following code.

import numpy as np
from sklearn.datasets import load_breast_cancer

# load data
data = load_breast_cancer()
X = data.data
y = data.target
print(X.shape, data.feature_names)

(569, 30) ['mean radius' 'mean texture' 'mean perimeter' ... 'worst fractal dimension']

It is evident that we have a total of 569 observations and 30 attributes or features for each observation.

Partition Based Clustering

We will choose the simplest yet most popular partition based clustering model for our example, the K-means algorithm. This is a centroid based clustering algorithm, which starts with an assumption about the total number of clusters in the data and with random centers assigned to each of the clusters. It then assigns each data point to the center closest to it, using Euclidean distance as the distance metric. After each reassignment, it recalculates the center of each cluster as the mean of its assigned points. The whole process is repeated iteratively and stops when reassignment of data points no longer changes the cluster centers. Variants include algorithms like K-medoids.
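The assign-and-update loop just described can be sketched in a few lines of NumPy. This is an illustrative toy, with hand-made points and fixed initial centers instead of random ones to keep it deterministic, and not scikit-learn's optimized implementation:

```python
import numpy as np

# Toy 2-D data with two obvious groups
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 9.0], [7.8, 8.2]])

# Fixed initial centers instead of random ones, to keep the sketch deterministic
centers = np.array([[0.0, 0.0], [10.0, 10.0]])

for _ in range(10):
    # Assignment step: each point goes to its nearest center (Euclidean distance)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each center moves to the mean of its assigned points
    new_centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centers, centers):   # converged: centers stopped moving
        break
    centers = new_centers

print(labels)   # [0 0 0 1 1 1]
```

The first three points settle into one cluster and the last three into the other, with each final center sitting at the mean of its group.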
Since we already know from the data labels that we have two possible categories, 0 or 1, the following code tries to determine these two clusters from the data by leveraging K-means clustering. In the real world this is rarely the case, since we will not know the likely number of clusters. This is one of the most important downsides of K-means clustering.

from sklearn.cluster import KMeans

km = KMeans(n_clusters=2)
km.fit(X)

labels = km.labels_
centers = km.cluster_centers_
print(labels[:10])

[0 0 0 1 0 1 0 1 1 1]

Once the fit process is complete, we can get the labels and centers of the two clusters in the dataset by using the preceding attributes. The centers here are numerical values along the dimensions of the data (the 30 attributes in the dataset) around which the data is clustered. But can we visualize and compare the clusters with the actual labels? Remember, we are dealing with 30 features, and visualizing the clusters in a 30-dimensional feature space would be impossible to interpret or even perform. Hence, we will leverage PCA to reduce the input dimensions to two principal components and visualize the clusters on top of them. Refer to Chapter 4 to learn more about principal component analysis.

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
bc_pca = pca.fit_transform(X)

The following code helps visualize the clusters on the reduced 2D feature space for the actual labels as well as the clustered output labels.
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
fig.suptitle('Visualizing breast cancer clusters')
fig.subplots_adjust(top=0.85, wspace=0.5)
ax1.set_title('Actual Labels')
ax2.set_title('Clustered Labels')

for i in range(len(y)):
    if y[i] == 0:
        c1 = ax1.scatter(bc_pca[i, 0], bc_pca[i, 1], c='g', marker='.')
    if y[i] == 1:
        c2 = ax1.scatter(bc_pca[i, 0], bc_pca[i, 1], c='r', marker='.')
    if labels[i] == 0:
        c3 = ax2.scatter(bc_pca[i, 0], bc_pca[i, 1], c='g', marker='.')
    if labels[i] == 1:
        c4 = ax2.scatter(bc_pca[i, 0], bc_pca[i, 1], c='r', marker='.')

l1 = ax1.legend([c1, c2], ['0', '1'])
l2 = ax2.legend([c3, c4], ['0', '1'])

Figure 5-3. Visualizing clusters in the breast cancer dataset

From Figure 5-3, you can clearly see that the clustering has worked quite well: it shows distinct separation between the clusters with labels 0 and 1 and is quite similar to the plot of actual labels. However, we do have some overlap where we have mislabeled some instances, which is evident in the plot on the right. Remember that in an actual real-world scenario you will not have the actual labels to compare with; the main idea is to find structures or patterns in your data in the form of these clusters. Another very important point to remember is that the cluster label values have no significance. The labels 0 and 1 are just values to distinguish cluster data points from each other. If you run this process again, you can easily obtain the same plot with the labels reversed. Hence, even when dealing with labeled data, do not compare clustered label values directly with the actual labels to try to measure accuracy. Also note that if we had asked for more than two clusters, the algorithm would have readily supplied more clusters, but they would have been hard to interpret and many of them would not make sense.
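When the number of clusters is unknown up front, a common heuristic, not covered in the book's code and shown here only as a supplementary sketch, is the "elbow" method: fit K-means for several values of k and look for the point where the inertia (within-cluster sum of squared distances) stops improving sharply.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer().data

# Fit K-means for a range of k values and record the inertia
# (within-cluster sum of squared distances) for each
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Inertia keeps decreasing as k grows; the "elbow" is where the
# marginal improvement drops off sharply
print(np.round(inertias, 0))
```

Plotting k against inertia and picking the bend in the curve gives a rough, data-driven starting value for the number of clusters.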
Hence, one of the caveats of using the K-means algorithm is that it is best used when we have some idea about the total number of clusters that may exist in the data.

Hierarchical Clustering

We can use the same data to perform hierarchical clustering and see if the results change much as compared to K-means clustering and the actual labels. In scikit-learn we have interfaces like the AgglomerativeClustering class to perform hierarchical clustering. Based on what we discussed earlier in this chapter as well as in Chapter 1, agglomerative clustering is hierarchical clustering using a bottom-up approach, i.e. each observation starts in its own cluster and clusters are successively merged together. The merging criterion is chosen from a candidate set of linkages; the selection of linkage governs the merge strategy. Some examples of linkage criteria are Ward, complete linkage, and average linkage. We will instead leverage low-level functions from SciPy, because the AgglomerativeClustering interface requires us to specify the number of clusters up front, which we want to avoid. Since we already have the breast cancer feature set in variable X, the following code helps us compute the linkage matrix using Ward's minimum variance criterion.

from scipy.cluster.hierarchy import dendrogram, linkage
import numpy as np

np.set_printoptions(suppress=True)
Z = linkage(X, 'ward')
print(Z)

[[   287.            336.              3.81596727      2.        ]
 [   106.            420.              4.11664267      2.        ]
 [    55.            251.              4.93361024      2.        ]
 ...,
 [  1130.           1132.           6196.07482529     86.        ]
 [  1131.           1133.           8368.99225244    483.        ]
 [  1134.           1135.          18371.10293626    569.        ]]

On seeing the preceding output, you might wonder what this linkage matrix indicates.
You can think of the linkage matrix as a complete historical map, keeping track of which data points were merged into which cluster during each iteration. If you have n data points, the linkage matrix Z will have a shape of (n - 1) x 4, where Z[i] tells us which clusters were merged at the i-th iteration. Each row has four elements: the first two elements are either data point identifiers or cluster labels (in the later parts of the matrix, once multiple data points have been merged), the third element is the cluster distance between the first two elements (either data points or clusters), and the last element is the total number of elements/data points in the cluster once the merge is complete. We recommend you refer to https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html, which explains this in detail. The best way to visualize these distance-based merges is to use a dendrogram, as shown in Figure 5-4.

plt.figure(figsize=(8, 3))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data point')
plt.ylabel('Distance')
dendrogram(Z)
plt.axhline(y=10000, c='k', ls='--', lw=0.5)
plt.show()

Figure 5-4. Visualizing the hierarchical clustering dendrogram

In the dendrogram depicted in Figure 5-4, we can see how each data point starts as an individual cluster and slowly gets merged with other data points to form clusters. On a high level, from the colors and the dendrogram, you can see that the model has correctly identified two major clusters if you consider a distance of around 10000 or above. Leveraging this distance, we can get the cluster labels using the following code.

from scipy.cluster.hierarchy import fcluster

max_dist = 10000
hc_labels = fcluster(Z, max_dist, criterion='distance')

Let's compare how the cluster outputs look based on the PCA reduced dimensions as compared to the original label distribution (detailed code is in the notebook).
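To make the structure of the linkage matrix concrete, here is a small supplementary sketch on hand-made 1-D points (not the breast cancer data), where the merge history is easy to follow:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four 1-D points: two tight pairs far apart from each other
pts = np.array([[0.0], [1.0], [10.0], [11.0]])

Z = linkage(pts, 'ward')
print(Z)
# Each row of Z: [cluster id 1, cluster id 2, merge distance, merged cluster size].
# The two tight pairs merge first at small distances; the final row joins
# them into one cluster of size 4 at a much larger distance.

# Cutting the tree at a distance threshold between the two merge levels
# recovers the two flat clusters
flat_labels = fcluster(Z, 5.0, criterion='distance')
print(flat_labels)
```

Cutting the dendrogram at a threshold that separates the small merge distances from the large one is exactly what the fcluster call with criterion='distance' does on the breast cancer data above.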
See Figure 5-5.

Figure 5-5. Visualizing hierarchical clusters in the breast cancer dataset

We definitely see two distinct clusters, but there is more overlap between them than with the K-means method and we have more mislabeled instances. However, do take note of the label numbers; here we have 1 and 2 as the label values. This is just to reinforce the fact that the label values only distinguish the clusters and don't mean anything. The advantage of this method is that you do not need to input the number of clusters beforehand; the model tries to find it from the underlying data.

Model Evaluation

We have seen the process of data retrieval, processing, wrangling, and modeling based on various requirements. A logical question that follows is: how can we judge whether a model is good or bad? Just because we have developed something fancy using a renowned algorithm doesn't guarantee its performance will be great. Model evaluation is the answer to these questions and is an essential part of the whole Machine Learning pipeline. We have mentioned quite a number of times in the past that model development is an iterative process, and model evaluation is the part that makes it iterative. Based on model evaluation and subsequent comparisons, we can decide whether to continue our efforts in model enhancement or cease them, and which model should be selected as the final model to be used/deployed. Model evaluation also helps us with the very important process of tuning the hyperparameters of the model, and with deciding questions like whether the intelligent feature we just developed adds any value to our model. Combining all these arguments makes a compelling case for having a defined process for model evaluation and for metrics that can be used to measure and evaluate models. So how can we evaluate a model?
How can we decide whether Model A or Model B performs better? The ideal way is to have some numerical measure or metric of a model's effectiveness and use that measure to rank and select models. This will be one of the primary ways for us to evaluate models, but we should also keep in mind that a lot of times these evaluation metrics may not capture the required success criteria of the problem we are trying to solve. In these cases, we will be required to get imaginative and adapt these metrics to our problem, using things like business constraints and objectives. Model evaluation metrics are highly dependent on the type of model we have, so metrics for regression models will be different from those for classification models or clustering models. Considering this dependency, we will break this section down into three sub-sections covering the major model evaluation metrics for these three categories of models.

Evaluating Classification Models

Classification models are among the most popular models used by Machine Learning practitioners. Due to their popularity, it is essential to know how to build good quality, generalized models. A varied set of metrics can be used to evaluate classification models. In this section, we target a small subset of those metrics that are essential. We use the models that we developed in the previous section to illustrate them in detail. For this, let's first prepare train and test datasets to build our classification models. We will be leveraging the X and y variables from before, which hold the data and labels for the breast cancer dataset observations.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)

(398, 30) (171, 30)

From the preceding output, it is clear that we have 398 observations in our train dataset and 171 observations in our test dataset. We will be leveraging a nifty module we have created for model evaluation. It is named model_evaluation_utils and you can find it along with the code files and notebooks for this chapter. We recommend you check out the code, which leverages the scikit-learn metrics module to compute most of the evaluation metrics and plots.

Confusion Matrix

The confusion matrix is one of the most popular ways to evaluate a classification model. Although the matrix by itself is not a metric, the matrix representation can be used to define a variety of metrics, all of which become important in some specific case or scenario. A confusion matrix can be created for a binary classification model as well as a multi-class classification model. It is created by comparing the predicted class label of each data point with its actual class label. This comparison is repeated for the whole dataset, and the results of the comparison are compiled in a matrix or tabular format. This resultant matrix is our confusion matrix. Before we go any further, let's build a logistic regression model on our breast cancer dataset and look at the confusion matrix for the model predictions on the test dataset.
from sklearn import linear_model

# train and build the model
logistic = linear_model.LogisticRegression()
logistic.fit(X_train, y_train)

# predict on test data and view confusion matrix
import model_evaluation_utils as meu
y_pred = logistic.predict(X_test)
meu.display_confusion_matrix(true_labels=y_test, predicted_labels=y_pred, classes=[0, 1])

          Predicted:
                0    1
Actual: 0      59    4
        1       2  106

The preceding output depicts the confusion matrix with the necessary annotations. We can see that out of 63 observations with label 0 (malignant), our model has correctly predicted 59 observations. Similarly, out of 108 observations with label 1 (benign), our model has correctly predicted 106 observations. More detailed analysis is coming right up!

Understanding the Confusion Matrix

While the name itself sounds pretty overwhelming, understanding the confusion matrix is not that confusing once you have the basics right! To reiterate what you learned in the previous section, the confusion matrix is a tabular structure for keeping track of correct classifications as well as misclassifications. This is useful for evaluating the performance of a classification model for which we know the true data labels and can compare them with the predicted data labels. Each column in the confusion matrix represents classified instance counts based on predictions from the model, and each row represents instance counts based on the actual/true class labels. This structure can also be reversed, i.e. predictions depicted by rows and true labels by columns. In a typical binary classification problem, one class label is defined as the positive class, which is basically the class of our interest. For instance, in our breast cancer dataset, let's say we are interested in detecting or predicting when the patient does not have breast cancer (benign).
Then label 1 is our positive class. However, if our class of interest were to detect cancer (malignant), we could have chosen label 0 as our positive class. Figure 5-6 shows a typical confusion matrix for a binary classification problem, where p denotes the positive class and n denotes the negative class.

Figure 5-6. Typical structure of a confusion matrix

Figure 5-6 should make things clearer with regard to the structure of confusion matrices. In general, we usually have a positive class, as discussed earlier, and the other class is the negative class. Based on this structure, we can clearly see four terms of importance.

• True Positive (TP): This is the count of the total number of instances from the positive class where the true class label was equal to the predicted class label, i.e. the total instances where we correctly predicted the positive class label with our model.

• False Positive (FP): This is the count of the total number of instances from the negative class where our model misclassified them by predicting them as positive. Hence the name, false positive.

• True Negative (TN): This is the count of the total number of instances from the negative class where the true class label was equal to the predicted class label, i.e. the total instances where we correctly predicted the negative class label with our model.

• False Negative (FN): This is the count of the total number of instances from the positive class where our model misclassified them by predicting them as negative. Hence the name, false negative.

Thus, based on this information, can you compute these four counts for our confusion matrix based on the model predictions on the breast cancer test data?

positive_class = 1
TP = 106
FP = 4
TN = 59
FN = 2

Performance Metrics

The confusion matrix by itself is not a performance measure for classification models.
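These four counts can also be pulled programmatically from scikit-learn's confusion_matrix function. Here is a quick supplementary sketch on tiny hand-made labels (hypothetical values, not the breast cancer predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Tiny hand-made example: positive class is 1
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 1])

# With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tp, fp, tn, fn)   # 5 1 3 1
```

Note the ravel() ordering: scikit-learn lays out the matrix with true labels as rows and predicted labels as columns, so TN comes first when the negative class is listed first.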
It can, however, be used to calculate several metrics that are useful measures for different scenarios. We will describe how the major metrics can be calculated from the confusion matrix, compute them manually using the necessary formulae, compare the results with the functions provided by scikit-learn on our predicted results, and give an intuition of scenarios where each of these metrics can be used.

Accuracy: This is one of the most popular measures of classifier performance. It is defined as the overall proportion of correct predictions made by the model. The formula for computing accuracy from the confusion matrix is:

Accuracy = (TP + TN) / (TP + FP + TN + FN)

The accuracy measure is normally used when our classes are almost balanced and correct predictions of those classes are equally important. The following code computes accuracy on our model predictions.

fw_acc = round(meu.metrics.accuracy_score(y_true=y_test, y_pred=y_pred), 5)
mc_acc = round((TP + TN) / (TP + TN + FP + FN), 5)
print('Framework Accuracy:', fw_acc)
print('Manually Computed Accuracy:', mc_acc)

Framework Accuracy: 0.96491
Manually Computed Accuracy: 0.96491

Precision: Precision, also known as positive predictive value, is another metric that can be derived from the confusion matrix. It is defined as the proportion of predictions based on the positive class that are actually correct or relevant. The formula for precision is as follows:

Precision = TP / (TP + FP)

A model with high precision makes fewer false positive errors: a larger fraction of the instances it labels as positive actually belong to the positive class. Precision becomes important in cases where we are more concerned about the correctness of our positive predictions, i.e. when the cost of a false positive is high. The following code computes precision on our model predictions.
fw_prec = round(meu.metrics.precision_score(y_true=y_test, y_pred=y_pred), 5)
mc_prec = round((TP) / (TP + FP), 5)
print('Framework Precision:', fw_prec)
print('Manually Computed Precision:', mc_prec)

Framework Precision: 0.96364
Manually Computed Precision: 0.96364

Recall: Recall, also known as sensitivity, measures a model's ability to identify the relevant data points. It is defined as the proportion of instances of the positive class that were correctly predicted. This is also known as hit rate, coverage, or sensitivity. The formula for recall is:

Recall = TP / (TP + FN)

Recall becomes an important measure of classifier performance in scenarios where we want to catch the maximum number of instances of a particular class, even when doing so increases our false positives. For example, consider the case of bank fraud: a model with high recall will flag a higher number of the actual fraud cases, helping us raise alarms for most of the suspicious cases, even at the cost of some false alarms. The following code computes recall on our model predictions.

fw_rec = round(meu.metrics.recall_score(y_true=y_test, y_pred=y_pred), 5)
mc_rec = round((TP) / (TP + FN), 5)
print('Framework Recall:', fw_rec)
print('Manually Computed Recall:', mc_rec)

Framework Recall: 0.98148
Manually Computed Recall: 0.98148

F1 Score: There are some cases in which we want a balanced optimization of both precision and recall. The F1 score is the harmonic mean of precision and recall and helps us optimize a classifier for balanced precision and recall performance. The formula for the F1 score is:

F1 Score = (2 × Precision × Recall) / (Precision + Recall)

Let's compute the F1 score on the predictions made by our model using the following code.
fw_f1 = round(meu.metrics.f1_score(y_true=y_test, y_pred=y_pred), 5)
mc_f1 = round((2*mc_prec*mc_rec) / (mc_prec+mc_rec), 5)
print('Framework F1-Score:', fw_f1)
print('Manually Computed F1-Score:', mc_f1)

Framework F1-Score: 0.97248
Manually Computed F1-Score: 0.97248

Thus you can see how our manually computed metrics match the results obtained from the scikit-learn functions. This should give you a good idea of how to evaluate classification models with these metrics.

Receiver Operating Characteristic Curve

ROC, which stands for Receiver Operating Characteristic, is a concept from the early days of radar. It can be extended to the evaluation of binary classifiers as well as multi-class classifiers. (Note that to adapt the ROC curve for multi-class classifiers, we have to use a one-vs-all scheme and averaging techniques like macro and micro averaging.) It can be interpreted as the effectiveness with which the model can distinguish between the actual signal and the noise in the data. The ROC curve is created by plotting the fraction of true positives versus the fraction of false positives, i.e. it is a plot of the True Positive Rate (TPR) versus the False Positive Rate (FPR). It is applicable mostly to scoring classifiers, the type of classifiers that return a probability value or score for each class label, from which a class label can be deduced (based on the maximum probability value). TPR, also known as sensitivity or recall, is the fraction of correct positive predictions among all the positive samples in the dataset. FPR, also known as the false alarm rate or (1 - specificity), is the fraction of incorrect positive predictions among all the negative samples in the dataset.
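These two rates can be computed directly from the confusion-matrix counts obtained earlier (TP=106, FP=4, TN=59, FN=2 on our test set):

```python
# TPR and FPR from the four confusion-matrix counts of our model
TP, FP, TN, FN = 106, 4, 59, 2

tpr = TP / (TP + FN)   # true positive rate: sensitivity / recall
fpr = FP / (FP + TN)   # false positive rate: 1 - specificity
print(round(tpr, 3), round(fpr, 3))   # 0.981 0.063
```

A single classification threshold yields a single (FPR, TPR) point like this one; sweeping the threshold over the classifier's scores traces out the full ROC curve.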
Although we will rarely plot the ROC curve manually, it is always a good idea to understand how it can be plotted. The following steps can be followed to plot a ROC curve, given the class label probabilities of each data point and their correct or true labels.

1. Order the outputs of the classifier by their scores (or the probability of being the positive class).

2. Start at the (0, 0) coordinate.

3. For each example x in the sorted order:
   • If x is positive, move 1/pos up.
   • If x is negative, move 1/neg right.

Here pos and neg are the counts of positive and negative examples, respectively. The idea is that typically, in any ROC curve, the ROC space is between the points (0, 0) and (1, 1). Each prediction result from the confusion matrix occupies one point in this ROC space. Ideally, the best prediction model would give a point in the top left corner (0, 1), indicating perfect classification (100% sensitivity and specificity). A diagonal line depicts a classifier that makes random guesses. If your ROC curve occurs in the top half of the graph, you have a decent classifier, which is better than random. You can always leverage the roc_curve function provided by scikit-learn to generate the necessary data for an ROC curve. Refer to http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html for further details. Figure 5-7 shows a sample ROC curve from the link we just mentioned.

Figure 5-7. Sample ROC Curve (Source: http://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics)

Figure 5-7 depicts the sample ROC curve. In general, the ROC curve is an important tool for visually interpreting classification models, but it doesn't directly provide a numerical value that we can use to compare models. The metric that does that is the Area Under Curve, popularly known as AUC.
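The staircase construction above is exactly what scikit-learn's roc_curve automates. Here is a small supplementary sketch on hand-made scores (hypothetical values), with the AUC computed via the auc helper:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hand-made true labels and classifier scores for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.3])

# roc_curve sweeps the decision threshold over the scores and returns
# the resulting (FPR, TPR) points; auc integrates the area under them
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)
print(round(roc_auc, 3))   # 0.938
```

For these scores only one positive/negative pair is ranked incorrectly (the positive with score 0.35 versus the negative with score 0.4), which is why the AUC is 15/16 = 0.9375 rather than a perfect 1.0.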
In the ROC plot in Figure 5-7, the area under the orange line is the area under the classifier's ROC curve. An ideal classifier has an area under the curve of 1. Based on this value we can compare two models; generally, the model with the higher AUC score is the better one. We have built a generic function for plotting ROC curves with AUC scores for binary as well as multi-class classification problems in our model_evaluation_utils module. Do check out the function plot_model_roc_curve(...) to learn more about it. The following code plots the ROC curve for our breast cancer logistic regression model leveraging the same function.

meu.plot_model_roc_curve(clf=logistic, features=X_test, true_labels=y_test)

Figure 5-8. ROC curve for our logistic regression model

Considering our model has an accuracy and F1 score of around 97%, Figure 5-8 makes sense, as we see a near perfect ROC curve! Check out Chapter 9 to see a multi-class classifier ROC curve in action!

Evaluating Clustering Models

We discussed some of the popular ways to evaluate classification models in the previous section. The confusion matrix alone provided us with a bunch of metrics that we can use to compare classification models. The tables are turned drastically when it comes to evaluating clustering models (or unsupervised models in general). This difficulty arises from the lack of a validated ground truth for unsupervised models, i.e. the absence of true labels in the data. In this section, you learn about some of the methods/metrics we can use to evaluate the performance of our clustering models. To illustrate the evaluation metrics with a real-world example, we will leverage the breast cancer dataset available in the variables X (data) and y (observation labels). We will also use the K-means algorithm to fit two models on this data—one with two clusters and the second one with five clusters—and then evaluate their performance.
km2 = KMeans(n_clusters=2, random_state=42).fit(X)
km2_labels = km2.labels_

km5 = KMeans(n_clusters=5, random_state=42).fit(X)
km5_labels = km5.labels_

External Validation

External validation means validating the clustering model when we have some ground truth available as labeled data. The presence of external labels removes most of the complexity of model evaluation, as the clustering (unsupervised) model can be validated in a fashion similar to classification models. Recall the breast cancer dataset example from the first section of this chapter, where we ran the labeled data through a clustering algorithm. In that case we had two classes and we got two clusters from our algorithm. However, evaluating the performance is not as straightforward as it is for classification algorithms. If you remember our earlier discussion on cluster labels, they are just indicators used to distinguish data points from each other based on which cluster or group they fall into. Hence we cannot compare a cluster with label 0 directly with the true class label 0. It is possible that all data points with a true class label of 0 were actually clustered with label 1 during the clustering process. Keeping this in mind, we can leverage several metrics to validate clustering performance when the true labels are available. Three popular metrics can be used in this scenario:

• Homogeneity: A clustering result satisfies homogeneity if all of its clusters contain only data points that are members of a single class (based on the true class labels).

• Completeness: A clustering result satisfies completeness if all the data points of a specific ground truth class are also elements of the same cluster.

• V-measure: The harmonic mean of the homogeneity and completeness scores gives us the V-measure value. Values are typically bounded between 0 and 1, and higher values are usually better.
Let's compute these metrics on our two K-means clustering models.

from sklearn import metrics

km2_hcv = np.round(metrics.homogeneity_completeness_v_measure(y, km2_labels), 3)
km5_hcv = np.round(metrics.homogeneity_completeness_v_measure(y, km5_labels), 3)
print('Homogeneity, Completeness, V-measure metrics for num clusters=2: ', km2_hcv)
print('Homogeneity, Completeness, V-measure metrics for num clusters=5: ', km5_hcv)

Homogeneity, Completeness, V-measure metrics for num clusters=2:  [ 0.422  0.517  0.465]
Homogeneity, Completeness, V-measure metrics for num clusters=5:  [ 0.602  0.298  0.398]

We can see that the V-measure for the first model with two clusters is better than that of the one with five clusters, the reason being its higher completeness score. Another metric you can try out is the Fowlkes-Mallows score.

Internal Validation

Internal validation means validating a clustering model by defining metrics that capture the expected behavior of a good clustering model. A good clustering model can be identified by two very desirable traits:

• Compact groups, i.e. the data points in one cluster occur close to each other.

• Well separated groups, i.e. two groups/clusters have as large a distance between them as possible.

We can define metrics that mathematically calculate the goodness of these two major traits and use them to evaluate clustering models. Most such metrics use some concept of distance between data points, which can be defined using any candidate distance metric, such as the Euclidean distance, the Manhattan distance, or any measure that meets the criteria for being a distance metric.

Silhouette Coefficient

The silhouette coefficient is a metric that tries to combine the two requirements of a good clustering model.
The silhouette coefficient is defined for each sample as a combination of its similarity to the data points in its own cluster and its dissimilarity to the data points not in its cluster. The silhouette coefficient for a clustering model with n data points is given by:

SC = (1/n) × Σ (i=1 to n) SC_i

Here, SC_i is the silhouette coefficient for the i-th sample, given by:

SC_i = (b - a) / max(a, b)

Here,
a = mean distance between the sample and all other points in the same cluster
b = mean distance between the sample and all other points in the next nearest cluster

The silhouette coefficient is bounded between -1 (incorrect clustering) and +1 (excellent quality, dense clusters). A higher value of the silhouette coefficient generally means that the clustering model is producing clusters that are dense, well separated, and distinguishable from each other. Lower scores indicate overlapping clusters. In scikit-learn, we can compute the silhouette coefficient by using the silhouette_score function. The function also allows for different options for the distance metric.

from sklearn import metrics

km2_silc = metrics.silhouette_score(X, km2_labels, metric='euclidean')
km5_silc = metrics.silhouette_score(X, km5_labels, metric='euclidean')
print('Silhouette Coefficient for num clusters=2: ', km2_silc)
print('Silhouette Coefficient for num clusters=5: ', km5_silc)

Silhouette Coefficient for num clusters=2:  0.697264615606
Silhouette Coefficient for num clusters=5:  0.510229299791

Based on the preceding output, it seems we have better cluster quality with two clusters than with five clusters.

Calinski-Harabaz Index

The Calinski-Harabaz index is another metric that we can use to evaluate clustering models when the ground truth is not known.
The Calinski-Harabaz score is given as the ratio of the between-clusters dispersion mean and the within-cluster dispersion. The mathematical formula of the score for $k$ clusters is

$$s(k) = \frac{\mathrm{Tr}(B_k)}{\mathrm{Tr}(W_k)} \times \frac{N-k}{k-1}$$

Here,

$$W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^T$$

$$B_k = \sum_{q} n_q (c_q - c)(c_q - c)^T$$

with $\mathrm{Tr}$ being the trace operator of a matrix, $N$ the number of data points in our data, $C_q$ the set of points in cluster $q$, $c_q$ the center of cluster $q$, $c$ the center of the entire dataset, and $n_q$ the number of points in cluster $q$. Thankfully, we can calculate this index without evaluating this complex formula by hand, by leveraging scikit-learn. A higher score normally indicates that the clusters are dense and well separated, which relates to the general principles of clustering models.

km2_chi = metrics.calinski_harabaz_score(X, km2_labels)
km5_chi = metrics.calinski_harabaz_score(X, km5_labels)
print('Calinski-Harabaz Index for num clusters=2: ', km2_chi)
print('Calinski-Harabaz Index for num clusters=5: ', km5_chi)

Calinski-Harabaz Index for num clusters=2:  1300.20822689
Calinski-Harabaz Index for num clusters=5:  1621.01105301

We can see that both scores are pretty high, with the result for five clusters being even higher. This goes to show that relying on a single metric alone is not sufficient; you must try multiple evaluation methods coupled with feedback from data scientists as well as domain experts.

Evaluating Regression Models

Regression models are an example of supervised learning methods, and owing to the availability of the correct measures (real-valued numeric response variables), their evaluation is relatively easier than for unsupervised models. Usually in the case of supervised models, we are spoilt for choice of metrics, and the important decision is choosing the right one for our use case.
Regression models, like classification models, have a varied set of metrics that can be used for evaluating them. In this section, we go through a small subset of these metrics that are essential.

Coefficient of Determination or R²

The coefficient of determination measures the proportion of variance in the dependent variable that is explained by the independent variables. A coefficient of determination score of 1 denotes a perfect regression model, indicating that all of the variance is explained by the independent variables. It also provides a measure of how well future samples are likely to be predicted by the model. The mathematical formula for calculating $R^2$ is given as follows, where $\bar{y}$ is the mean of the dependent variable, $y_i$ indicates the actual true response values, and $\hat{y}_i$ indicates the model-predicted outputs:

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=0}^{n_{samples}-1} (y_i - \hat{y}_i)^2}{\sum_{i=0}^{n_{samples}-1} (y_i - \bar{y})^2}$$

In the scikit-learn package, this can be calculated by using the r2_score function, supplying it the true values and the predicted values (of the output/response variable).

Mean Squared Error

Mean squared error calculates the average of the squares of the errors or deviations between the actual values and the values predicted by a regression model. The mean squared error or MSE can be used to evaluate a regression model, with lower values meaning a better regression model with fewer errors. Taking the square root of the MSE yields the root mean squared error or RMSE, which can also be used as an evaluation metric for regression models. The mathematical formula for calculating MSE is quite simple and is given as follows:

$$MSE(y, \hat{y}) = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} (y_i - \hat{y}_i)^2$$

In the scikit-learn package, the MSE can be calculated by invoking the mean_squared_error function from the metrics module.
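Both formulas translate directly into a few lines of code. The following pure-Python sketch mirrors what the r2_score and mean_squared_error functions in sklearn.metrics compute (the helper definitions and toy values here are our own, for illustration only):

```python
from math import sqrt

def mean_squared_error(y_true, y_pred):
    """MSE: average of the squared deviations between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """R^2: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 2.0, 5.0]
y_pred = [2.5, 2.0, 5.5]
print('MSE: ', mean_squared_error(y_true, y_pred))
print('RMSE:', sqrt(mean_squared_error(y_true, y_pred)))
print('R^2: ', r2_score(y_true, y_pred))
```

Taking the square root at the end gives the RMSE, which is often preferred in practice because it is expressed in the same units as the response variable.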
Regression models have many more metrics that can be used for evaluating them, including median absolute error, mean absolute error, explained variance score, and so on. They are easy to calculate using the functions provided by the scikit-learn library. Their mathematical formulae are easy to interpret and have an intuitive understanding associated with them. We have only introduced two of them here, but you are encouraged to explore the other metrics that can be used for regression models. We will look at regression models in more detail in the next chapter.

Model Tuning

In the first two sections of this chapter, you learned how to fit models on our processed data and how to evaluate those models. We will build further upon the concepts introduced so far. In this section, you will learn about an important set of entities associated with all Machine Learning algorithms (which we have been glossing over until now), their importance, and how to find their optimal values. Model tuning is one of the most important concepts of Machine Learning, and it does require some knowledge of the underlying math and logic of the algorithm in focus. Although we cannot dive deep into the extensive theoretical aspects of the algorithms that we discuss, we will try to give some intuition about them so that you are empowered to tune them better and learn the essential concepts needed for the same.

The models we have developed until now were mostly the default models provided to us by the scikit-learn package. By default, we mean models with the default configurations and settings, if you remember seeing some of the model estimator object parameters. Since the datasets we were analyzing were essentially not tough-to-analyze datasets, even models with default configurations turned up decent solutions. The situation is not that rosy when it comes to actual real-world datasets that have a lot of features, noise, and missing data.
You will see in the subsequent chapters how actual datasets are often tough to process and wrangle, and even harder to model. Hence, it is unlikely that we will always use the default configured models out of the box. Instead, we will delve deeper into the models that we are targeting and look at the knobs that can be tuned and set to extract the best performance out of any given model. This process of iterative experimentation with the dataset, model parameters, and features is the very core of the model tuning process. We start this section by introducing these so-called parameters that are associated with ML algorithms, then we try to justify why it is hard to have a perfect model, and in the last section we discuss some strategies that we can pursue to tune our models.

Introduction to Hyperparameters

What are hyperparameters? The simplest definition is that hyperparameters are meta parameters that are associated with any Machine Learning algorithm and are usually set before the model training and building process. We can do this because model hyperparameters do not need to be derived from the underlying dataset on which a model is trained. Hyperparameters are extremely important for tuning the performance of learning algorithms. Hyperparameters are often confused with model parameters, but we must keep in mind that hyperparameters are different from model parameters in that they do not depend on the data. In simple terms, model hyperparameters represent some high-level concepts or knobs that a data scientist can tweak and tune during the model training and building process to improve its performance. Let’s take an example to illustrate this in case you still have difficulty interpreting them.

Decision Trees

Decision trees are one of the simplest and easiest to interpret classification algorithms (also sometimes used in regression; check out CART models).
First you will learn how a decision tree is created, since hyperparameters are often tightly coupled with the actual intricacies of the algorithm. The decision tree algorithm is based on a greedy recursive partitioning of the initial dataset (features). It leverages a decision-tree-based structure for deciding how to perform the partitions. The steps involved in learning a decision tree are as follows:

1. Start with the whole dataset and find the attribute (feature) that best differentiates between the classes. This best attribute is found using metrics such as information gain or Gini impurity.
2. Once the best attribute is found, separate the dataset into two (or more) parts based on the values of that attribute.
3. If any one part of the dataset contains only labels of one class, we can stop the process for that part and label it as a leaf node of that class.
4. We repeat the whole process until every leaf node contains data points of one class only.

The final model returned by the decision tree algorithm can be represented as a flow chart (the core decision tree structure). Consider a sample decision tree for the Titanic survival prediction problem depicted in Figure 5-9.

Figure 5-9. A sample decision tree model

The decision tree is easy to interpret by following the path matching the values of an unknown data point. The leaf node where you end up is the predicted class for the data point. The model parameters in this case are the attributes on which we are splitting (here sex, age, and sibsp) and the values of those attributes. For example, if a person was female, it is likely she survived based on this model. However, young males aged less than 9 years and 6 months are likely to have perished.
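Step 1 of the algorithm above, greedily scoring candidate splits, can be sketched in a few lines. This toy implementation (our own helper names, not scikit-learn's internals) scores every (feature, threshold) split by weighted Gini impurity and returns the best one:

```python
def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Greedy step 1: pick the (feature, threshold) split with lowest weighted Gini."""
    best, best_score = None, float('inf')
    for f in range(len(rows[0])):
        for v in set(r[f] for r in rows):
            left = [l for r, l in zip(rows, labels) if r[f] <= v]
            right = [l for r, l in zip(rows, labels) if r[f] > v]
            if not left or not right:
                continue  # skip degenerate splits
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best_score, best = score, (f, v)
    return best, best_score

# a perfectly separable toy dataset: splitting feature 0 at <= 2 yields two pure parts
split, score = best_split([[1], [2], [8], [9]], ['a', 'a', 'b', 'b'])
print(split, score)  # -> (0, 2) 0.0
```

A real decision tree learner applies this search recursively to each resulting partition until a stopping criterion is met.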
In the algorithm, the decision whether to continue splitting the dataset at a node or stop the splitting process is governed by one of the hyperparameters of the algorithm, named min_samples_leaf. This is a hyperparameter associated with the decision tree algorithm. The default value of this parameter is 1, which means that we can potentially keep splitting the data until each leaf node holds a single data point (with a unique class label). This leads to a lot of overfitting, as potentially each data point can end up in its own leaf node and the model will memorize the training data rather than learn anything generalizable. Suppose we instead want to stop the splitting process once a leaf node holds 3-4% of the whole dataset, and label that node with its majority class. This can be achieved by setting a different value for the specified hyperparameter. Doing so allows us to control overfitting and helps us develop a generalized model. This is just one of the hyperparameters associated with the algorithm; there are many more, like the splitting criterion (criterion), maximum depth of the tree (max_depth), number of features (max_features), and so on, which can have different effects on the quality of the overall model.

Similar hyperparameters exist for each learning algorithm. Examples include the learning rate in logistic regression, the kernel in SVMs, and the dropout rate in neural networks. Hyperparameters are generally closely related to the learning algorithm; hence, we require some understanding of the algorithm to have intuition about setting the value of a particular hyperparameter. In the later sections of this chapter and the book, we deal with datasets and models that will require some level of hyperparameter tuning.

The Bias-Variance Tradeoff

So far, we have covered the concepts that lead up to tuning our models.
But before we go into the process of putting it all together and actually tuning our models, we must understand a potential tradeoff that puts some restriction on the best model we can develop. This tradeoff is called the bias versus variance tradeoff. The obvious question that arises is: what are bias and variance in the context of Machine Learning models?

• Bias: This is the error that arises due to the model (learning algorithm) making wrong assumptions about the parameters in the underlying data. The bias error is the difference between the expected or predicted value of the model estimator and the true or actual value that we are trying to predict. If you remember, model building is an iterative process. If you imagine building a model multiple times over a dataset, every time you get some new observations, then due to the underlying noise and randomness in the data, predictions will not always be what is expected, and bias tries to measure this difference, or error, between actual and predicted values. It can also be specified as the average approximation error that the models have over all possible training datasets. The last part here, all possible training datasets, needs some explanation. The dataset that we observe and develop our models on is one of the possible combinations of data that exist. All the possible combinations of each of the attributes/features that we have in our data would give rise to a different dataset. For example, if we have a dataset with 50 binary (categorical) features, then the size of that entire dataset would be $2^{50}$ data points. The dataset that we model on will obviously be a subset of this huge data. So bias is the average approximation error that we can expect over subsets of this entire dataset. Bias is mostly affected by our assumptions (or the model’s assumptions) about the underlying data and patterns.
For example, consider a simple linear regression model; it makes the assumption that the dependent variable is linearly dependent on the independent variables. A decision tree model, in contrast, makes no such assumption about the structure of the data and purely learns patterns from the data. Hence, in a relative sense, a linear model may tend to have a higher bias than a decision tree model. High bias makes a model miss relevant relationships between features and the output variables in the data.

• Variance: This error arises due to model sensitivity to fluctuations in the dataset, which can arise due to new data points, features, randomness, noise, and so on. It is the variance of our approximation function over all possible datasets. It represents the sensitivity of the model’s prediction results to the particular set of data points it was trained on. Suppose you could learn the model on different subsets of all possible datasets; then variance would quantify how the results of the model change with the change in the dataset. If the results stay quite stable, the model is said to have low variance, but if the results vary considerably each time, the model is said to have high variance. Consider the same example of contrasting a linear model against a decision tree model, under the assumption that a clear linear relationship exists between the dependent and the independent data variables. Then, for a sufficiently large dataset, our linear model will always capture that relationship, whereas the capability of the decision tree model depends on the dataset; if we get a dataset that contains a lot of outliers, we are likely to get a bad decision tree model. Hence we can say that the decision tree model will have a higher variance than the linear regression model, depending on the data and its underlying noise and randomness. High variance makes a model too sensitive to outliers or random noise instead of generalizing well.
An effective way to get a clearer idea of this somewhat confusing concept is through a visual representation of bias and variance, as depicted in Figure 5-10.

Figure 5-10. The bias-variance tradeoff

In Figure 5-10, the inner red circle represents the perfect model that we could have, considering all the combinations of the data that we can get. Each blue dot (⋅) marks a model that we have learned on the basis of the combinations of the dataset and features that we get.

• Models with low bias and low variance, represented by the top-left image, will learn a good general structure of the underlying data patterns and relationships; they will be close to the hypothetical perfect model, and their predictions will be consistent and hit the bull’s eye!
• Models with low bias and high variance, represented by the top-right image, are models that generalize to some extent (learn proper relationships/patterns) and perform decently on average due to low bias, but are sensitive to the data they are trained on, leading to high variance, and hence their predictions keep fluctuating.
• Models with high bias and low variance will tend to make consistent predictions irrespective of the datasets on which the models are built, leading to low variance; but due to high bias, they will not learn the necessary patterns/relationships in the data that are required for correct predictions and hence miss the mark on average, as depicted in the bottom-left image.
• Models with high bias and high variance are the worst sort of models possible, as they will not learn the necessary data attribute relationships that are essential for correlation with the output responses. They will also be extremely sensitive to data, outliers, and noise, leading to highly fluctuating predictions and high variance, as depicted in the bottom-right image.
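Two of these regimes are easy to reproduce numerically. In the NumPy sketch below (toy data of our own), a constant model (degree-0 polynomial) is too rigid to track noisy linear data, a high-bias fit, while a degree n-1 polynomial threads through every training point exactly, a high-variance fit that would swing wildly on any other sample of the data:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 8)
y = 2 * x + rng.normal(0, 0.2, size=x.size)  # noisy linear data, n = 8 points

# high bias: a constant model (degree-0 polynomial, i.e. just the mean)
underfit = np.polyval(np.polyfit(x, y, 0), x)

# high variance: a degree n-1 polynomial interpolates every training point
overfit = np.polyval(np.polyfit(x, y, x.size - 1), x)

print('max train error, constant model:     ', np.max(np.abs(underfit - y)))
print('max train error, degree-7 polynomial:', np.max(np.abs(overfit - y)))
```

The degree-7 fit drives the training error to essentially zero, but that apparent perfection is exactly the overfitting symptom discussed next: the curve has fit the noise, not the underlying linear relationship.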
Extreme Cases of Bias-Variance

In real-world modeling, we will always have a tradeoff between decreasing bias and variance simultaneously. To understand why we have this tradeoff, we must first consider the two possible extreme cases of bias and variance.

Underfitting

Consider a lazy linear model that always predicts a constant value. This model will have extremely low variance (in fact, it will be a zero-variance model), as it does not depend at all on which subset of data it gets. It will always predict a constant and hence have stable performance. But on the other hand, it will have extremely high bias, as it has not learned anything from the data and has made a very rigid and erroneous assumption about it. This is the case of model underfitting, in which we fail to learn anything about the data, its underlying patterns, and relationships.

Overfitting

Consider the opposite case, in which we have a model that attempts to fit every data point it encounters (the closest example would be fitting an (n-1)th order polynomial curve to an n-observation dataset so that the curve passes through each point). In this case, we get a model that has low bias, as no assumption about the structure of the data was made (even when there was some structure), but the variance will be very high, as we have tightly fit the model to one particular subset of the data (focusing too much on the training data). Any subset different from the training set will lead to a lot of error. This is the case of overfitting, where we have built our model so specific to the data at hand that it fails to generalize to other subsets of data.

The Tradeoff

The total generalization error of any model is the sum of its bias error, variance error, and irreducible error, as depicted in the following equation.
Generalization Error = Bias Error + Variance Error + Irreducible Error

The irreducible error is the error that gets introduced due to noise in the training data itself, something that is common in real-world datasets and about which not much can be done. The idea is to focus on the other two errors. Every model needs to make a tradeoff between two choices: making assumptions about the structure of the data, or fitting itself too closely to the data at hand. Either choice taken to the extreme will lead to one of the extreme cases described earlier. The idea is to balance model complexity by making an optimal tradeoff between bias and variance, as depicted in Figure 5-11.

Figure 5-11. Test and train errors as a function of model complexity (Source: The Elements of Statistical Learning, Hastie, Tibshirani, and Friedman, Springer)

Figure 5-11 should give you more clarity on the tradeoff that needs to be made to prevent an increase in model errors. We will need to make some assumptions about the underlying structure of the data, but they must be reasonable. At the same time, the model must ensure that it learns from the data at hand and generalizes well instead of overfitting to each and every data point. This tradeoff can be controlled by making sure that our model is not overly complex and by ensuring reasonable performance on unseen validation data. We will cover more on cross-validation in the next section. We recommend checking out the section on model selection and the bias-variance tradeoff in The Elements of Statistical Learning, Hastie, Tibshirani, and Friedman, Springer.

Cross Validation

In the initial sections of this chapter, when we were learning to fit different models, we followed the practice of partitioning the data into a training set and a test set. We built the model on the training set and reported its performance on the test set.
Although that way of building models works, when tuning models intensively we need to consider some other strategies around validation datasets. In this section we will discuss how we can use the same data to build different models and also tune their hyperparameters using a simple data partitioning strategy. This strategy is one of the most prevalent practices in the Data Science domain, irrespective of the type of model, and it is called cross-validation, or just CV. It is extremely useful when you have few data observations and cannot set aside a specific partition of the data as a validation set (more on this shortly!). You can then leverage a cross-validation strategy to use parts of the training data itself for validation, in such a way that you don’t end up overfitting the model.

The main intention of any model building activity is to develop a model on the available data that generalizes well and performs well on unseen data. But to estimate a model’s performance on unseen data, we need to simulate that unseen data using the data we have available. This is achieved by splitting our available data into training and testing sets. By following this simple principle, we ensure that we don’t evaluate the model on data it has already seen and been trained on. The story would end here if we were completely satisfied with the model we developed, but initial models are seldom satisfactory enough for deployment.

In theory, we could extend the same principle for tuning our algorithm: evaluate the performance of particular values of the model hyperparameters on the test set, then retrain the model on a different partition of training and test data with different values of the hyperparameters. If the new values perform better than the old ones, we keep them, and we repeat the same process until we have the optimal values of the hyperparameters.
This scheme of things is simple, but it suffers from a serious flaw: it induces a bias into the model development process. Although the test set is changed in every iteration, its data is being seen by the model to make choices about the model development process (as we tune and build the model). Hence, the models we develop end up being biased and not well generalized, and their measured performance may or may not reflect their performance on truly unseen data.

A simple change in the data splitting process can help us avoid this leakage of unseen data. Suppose we initially made three different subsets of data instead of the original two. One is the usual training set, the second is the test set, and the last one is called a validation set. We can then train our models on the training data and evaluate their performance on the validation data to tune model hyperparameters (or even to select among different models). Once we are done with the tuning process, we can evaluate the final model on the truly unseen test set and report that performance as the approximate performance of the model on real-world unseen data. In essence, this is the basic principle behind the process of cross-validation, as depicted in Figure 5-12.

Figure 5-12. Building toward the cross-validation process for model building and tuning

Figure 5-12 gives us an idea of how the whole process works. We divide the original dataset into a train and a test set. The test set is completely set aside from the learning process. The train set so obtained is again split into an actual train set and a validation set. Then we learn different models on the train set. A point worth noting here is that the models are general, i.e., all of them can be of a single type, for example logistic regression, but with different hyperparameters. They can also be models using other algorithms, like tree-based methods, support vector machines, and so on.
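The three-way partition described above can be sketched in a few lines of plain Python (the helper name and fractions here are our own; in practice you would typically call scikit-learn's train_test_split twice to the same effect):

```python
import random

def three_way_split(data, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle the data once, then carve it into train / validation / test."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)       # reproducible shuffle
    n_train = int(train_frac * len(data))
    n_val = int(val_frac * len(data))
    train = [data[i] for i in idx[:n_train]]
    val = [data[i] for i in idx[n_train:n_train + n_val]]
    test = [data[i] for i in idx[n_train + n_val:]]
    return train, val, test

train, val, test = three_way_split(list(range(100)))
print(len(train), len(val), len(test))  # -> 60 20 20
```

The test partition is set aside untouched until the very end; only the train and validation partitions participate in the tuning loop.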
The process of model selection is similar irrespective of whether we are assessing completely different models or trying out different hyperparameter values for the same type of model. Once we have the models developed, we assess their performance on the validation set and select the model with the best performance as the final model. We leverage model evaluation metrics for this based on the type of model (accuracy, F1 score, RMSE, silhouette coefficient, and so on).

The previously described process seems good. We have described the validation part of the process, but we haven’t touched on the cross part of it. So where is the cross-validation? To understand that intricacy of the CV process, we have to discuss why we need it in the first place. The need arises from the fact that by dividing the data into a test and a validation set, we have lost a decent amount of data that we could have used to further refine our modeling process. Another important point is that if we take a model’s error on a single iteration to be its overall error, we are making a serious mistake. Instead, we want to take some average measure of error by building multiple iterations of the same model. But if we keep rebuilding the model on the same dataset, we are not going to get much difference in the model’s performance.

We address these two issues by introducing the cross concept of cross-validation. The idea of cross-validation is to get different splits of train and validation sets (different observations in each set, each time) using some strategy (which we will elaborate on later) and then build multiple iterations of each model on these different splits. The average error on these splits is then reported as the error of the model in question, and the final decision is made on this averaged error metric.
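The split-and-average loop just described can be sketched in plain Python (both helper names are our own; model_error stands in for the step of training on the train indices and measuring error on the validation indices):

```python
def k_fold_splits(n_samples, k):
    """Yield (train_idx, val_idx) pairs: each fold serves as validation once."""
    indices = list(range(n_samples))
    # distribute any remainder so every sample lands in exactly one fold
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

def cross_val_error(model_error, n_samples, k=5):
    """Average the model's error over the k different train/validation splits."""
    errors = [model_error(train, val) for train, val in k_fold_splits(n_samples, k)]
    return sum(errors) / len(errors)
```

scikit-learn's KFold and cross_val_score implement this same pattern (with shuffling, stratification, and parallelism on top), so you would rarely write this loop by hand.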
This strategy has a brilliant effect on the estimated error of each model, as it ensures that the averaged error is a close approximation of the model’s error on truly unseen data (here, our test set), and we can also leverage the complete training dataset for building the model. This process is explained pictorially in Figure 5-13.

Figure 5-13. The final cross-validation process for model building and tuning

The various strategies by which these different train and validation sets can be generated give rise to different kinds of cross-validation strategies. The common idea in each of these strategies remains the same; the only difference is in the way the original train set is split into a train and validation set for each iteration of model building.

Cross-Validation Strategies

We explained the basic principle of cross-validation in the previous section. In this section, we see the different strategies by which we can split the training data into training and validation data. Apart from the manner of this split, as mentioned before, the process for each of these strategies remains the same. The major types of cross-validation strategies are described as follows.

Leave One Out CV

In this strategy for cross-validation, we select a single random data point from the initial training dataset, and that becomes our validation set. So we have only a single point in our validation set, and the rest of the n-1 observations become our training set. This means that if we have 1,000 data points in a training set, then we will be developing 1,000 iterations of each model with a different training set and validation set each time, such that the validation set has one observation and the rest (999) go into the training set. This may become infeasible if the dataset size is large, but in practice the error can be estimated by performing some small number of iterations.
Due to the computational complexity of this strategy, it is mostly suitable for small datasets and is rarely used in practice.

K-Fold CV

The other strategy for cross-validation is to split the training dataset into k equal subsets. Out of these k subsets, we train the model on k-1 subsets and keep one subset as a validation set. This process is repeated k times, and the error is averaged over the k models obtained by developing different iterations of the model. We keep changing the validation set in each of these iterations, which ensures that in each iteration the model is trained on a different subset of data. This practice of cross-validation is quite effective, both for model selection and hyperparameter optimization. A natural question for this strategy is how to select the appropriate number of folds, as it controls both our error approximation and the computational runtime of our CV process. There are mathematical ways to select the most appropriate k, but in practice a good choice of k ranges from 5 to 10. So, in most cases, we can do a 5-fold or 10-fold cross-validation and be confident of the results we obtain.

Hyperparameter Tuning Strategies

Based on our discussions until now, we have all the prerequisites for tuning our models. We know what hyperparameters are, how the performance of a model can be evaluated, and how we can use cross-validation to search through the parameter space for optimal values of our algorithm’s hyperparameters. In this section, we discuss two major strategies that tie all this together to determine the most optimal hyperparameters. Fortunately, the scikit-learn library has excellent built-in support for performing hyperparameter search with cross-validation. There are two major ways in which we can search our parameter space for an optimal model; the two methods differ in how the search is carried out: systematic versus random. In this section we will discuss these two methods along with hands-on examples.
The takeaway from this section is to understand the processes so that you can start leveraging them on your own datasets. Also note that even if we don’t mention it explicitly, we will always be using cross-validation to perform any of these searches.

Grid Search

This is the simplest of the hyperparameter optimization methods. In this method, we specify a grid of hyperparameter values that we want to try out in order to find the best parameter combination. We then build models on each combination of parameter values, using cross-validation of course, and report the best-performing combination in the whole grid. The output is the model using that best combination from the grid. Although it is quite simple, grid search suffers from one serious drawback: the user has to manually supply the candidate values, which may or may not contain the most optimal parameters.

In scikit-learn, grid search can be done using the GridSearchCV class. We go through an example by performing grid search on a support vector machine (SVM) model on the breast cancer dataset from earlier. The SVM model is another example of a supervised Machine Learning algorithm that can be used for classification. It is an example of a maximum margin classifier, where it tries to learn a representation of the data points such that the separate categories/labels are divided by a gap that is as large as possible. We won’t go into further details here, since the intent is to run grid search, but we recommend checking out some of the standard literature on SVMs if you are interested.

Let’s first split our breast cancer dataset variables X and y into train and test datasets and build an SVM model with default parameters. Then we’ll evaluate its performance on the test dataset by leveraging our model_evaluation_utils module.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# prepare datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# build default SVM model
def_svc = SVC(random_state=42)
def_svc.fit(X_train, y_train)

# predict and evaluate performance
def_y_pred = def_svc.predict(X_test)
print('Default Model Stats:')
meu.display_model_performance_metrics(true_labels=y_test, predicted_labels=def_y_pred,
                                      classes=[0,1])

Figure 5-14. Model performance metrics for default SVM model on the breast cancer dataset

Would you look at that, our model gives an overall F1 Score of only 49% and a model accuracy of 63%, as depicted in Figure 5-14. Also, looking at the confusion matrix, you can clearly see that it is predicting every data point as benign (label 1). Basically, our model learned nothing! Let's try tuning this model to see if we get something better. Since we have chosen an SVM model, we specify some hyperparameters specific to it, including the parameter C (which deals with the margin in the SVM), the kernel function (used for transforming data into a higher dimensional feature space), and gamma (which determines the influence a single training data point has). There are a lot of other hyperparameters to tune, which you can check out at http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html for further details. We build a grid by supplying some pre-set values. The next choice is selecting the score or metric we want to maximize; here we have chosen to maximize the accuracy of the model. Once that is done, we use five-fold cross-validation to build multiple models over this grid and evaluate them to get the best model. Detailed code and outputs are depicted as follows.
from sklearn.model_selection import GridSearchCV

# setting the parameter grid
grid_parameters = {'kernel': ['linear', 'rbf'],
                   'gamma': [1e-3, 1e-4],
                   'C': [1, 10, 50, 100]}

# perform hyperparameter tuning
print("# Tuning hyper-parameters for accuracy\n")
clf = GridSearchCV(SVC(random_state=42), grid_parameters, cv=5, scoring='accuracy')
clf.fit(X_train, y_train)

# view accuracy scores for all the models
print("Grid scores for all the models based on CV:\n")
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.5f (+/-%0.05f) for %r" % (mean, std * 2, params))

# check out best model performance
print("\nBest parameters set found on development set:", clf.best_params_)
print("Best model validation accuracy:", clf.best_score_)

# Tuning hyper-parameters for accuracy

Grid scores for all the models based on CV:

0.95226 (+/-0.06310) for {'C': 1, 'gamma': 0.001, 'kernel': 'linear'}
0.91206 (+/-0.04540) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.95226 (+/-0.06310) for {'C': 1, 'gamma': 0.0001, 'kernel': 'linear'}
0.92462 (+/-0.02338) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.96231 (+/-0.04297) for {'C': 10, 'gamma': 0.001, 'kernel': 'linear'}
0.90201 (+/-0.04734) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.96231 (+/-0.04297) for {'C': 10, 'gamma': 0.0001, 'kernel': 'linear'}
0.92965 (+/-0.03425) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.95729 (+/-0.05989) for {'C': 50, 'gamma': 0.001, 'kernel': 'linear'}
0.90201 (+/-0.04734) for {'C': 50, 'gamma': 0.001, 'kernel': 'rbf'}
0.95729 (+/-0.05989) for {'C': 50, 'gamma': 0.0001, 'kernel': 'linear'}
0.93467 (+/-0.02975) for {'C': 50, 'gamma': 0.0001, 'kernel': 'rbf'}
0.95477 (+/-0.05772) for {'C': 100, 'gamma': 0.001, 'kernel': 'linear'}
0.90201 (+/-0.04734) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
0.95477 (+/-0.05772) for {'C': 100, 'gamma': 0.0001, 'kernel': 'linear'}
0.93216 (+/-0.04674) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}

Best parameters set found on development set: {'C': 10, 'gamma': 0.001, 'kernel': 'linear'}
Best model validation accuracy: 0.962311557789

Thus, from the preceding code and output, you can see how the best model parameters were obtained based on cross-validation accuracy, and we get a pretty awesome validation accuracy of 96%. Let's take this optimized and tuned model and put it to the test on our test data!

gs_best = clf.best_estimator_
tuned_y_pred = gs_best.predict(X_test)

print('\n\nTuned Model Stats:')
meu.display_model_performance_metrics(true_labels=y_test, predicted_labels=tuned_y_pred,
                                      classes=[0,1])

Figure 5-15. Model performance metrics for tuned SVM model on the breast cancer dataset

Well, things are certainly looking great now! Our model gives an overall F1 Score and model accuracy of 97% on the test dataset too, as depicted in Figure 5-15. This should give you a clear indication of the power of hyperparameter tuning! This scheme of things can be extended to different models and their respective hyperparameters. We can also play around with the evaluation measure we want to optimize. The scikit-learn framework provides different scoring values that we can optimize, such as adjusted_rand_score, average_precision, f1, recall, and so on.

Randomized Search

Grid search is a very popular method for optimizing hyperparameters in practice, owing to its simplicity and the fact that it is embarrassingly parallelizable. This becomes important when the dataset we are dealing with is large. But it suffers from some major shortcomings, the most important one being that the user has to manually specify the grid.
This brings a human element into a process that could benefit from a purely automatic mechanism. Randomized parameter search is a modification of the traditional grid search. It takes input for grid elements as in normal grid search, but it can also take distributions as input. For example, consider the parameter gamma, whose values we supplied explicitly in the last section; instead, we can supply a distribution from which to sample gamma. The efficacy of randomized parameter search is based on the result, proven both empirically and mathematically, that hyperparameter optimization functions normally have low effective dimensionality, i.e., certain parameters matter more than others. We control the number of random parameter samples drawn by specifying the number of iterations we want to run (n_iter). Normally, a higher number of iterations means a more granular parameter search but a higher computation time. To illustrate the use of randomized parameter search, we will reuse the earlier example but replace the gamma and C values with distributions. The results in our example may not be very different from the grid search, but we establish the process that can be followed for future reference.
import scipy.stats
from sklearn.model_selection import RandomizedSearchCV

param_grid = {'C': scipy.stats.expon(scale=10),
              'gamma': scipy.stats.expon(scale=.1),
              'kernel': ['rbf', 'linear']}

random_search = RandomizedSearchCV(SVC(random_state=42), param_distributions=param_grid,
                                   n_iter=50, cv=5)
random_search.fit(X_train, y_train)

print("Best parameters set found on development set:")
random_search.best_params_

Best parameters set found on development set:
{'C': 12.020578954763398, 'gamma': 0.036384519279056469, 'kernel': 'linear'}

# get best model, predict and evaluate performance
rs_best = random_search.best_estimator_
rs_y_pred = rs_best.predict(X_test)
meu.get_metrics(true_labels=y_test, predicted_labels=rs_y_pred)

Accuracy: 0.9649
Precision: 0.9649
Recall: 0.9649
F1 Score: 0.9649

In this example, the values of the parameters C and gamma are sampled from exponential distributions, and we control the number of iterations of the model search with the parameter n_iter. While the overall model performance is similar to grid search, the intent is to be aware of the different strategies available for model tuning.

Model Interpretation

The objective of Data Science or Machine Learning is to solve real-world problems, automate complex tasks, and make our lives easier and better. While data scientists spend a huge amount of time building, tuning, and deploying models, one must ask the questions, "What is this going to be used for?" and "How does this really work?" and, most important of all, "Why should I trust your model?". A business or organization is more concerned about business objectives: generating profits and minimizing losses by leveraging analytics and Machine Learning. Hence there is often a disconnect between analytics teams and key stakeholders, customers, clients, or management when trying to explain how models really work.
Most of the time, explaining complex theoretical and mathematical concepts to non-experts can be really difficult; they may not have the background or, worse, might not be interested in knowing all the gory details. This brings us back to the main objective: can we explain and interpret Machine Learning models in an easy-to-understand way, such that anyone, even without thorough knowledge of Machine Learning, can understand them? The benefit would be two-fold: Machine Learning models will not just stop at being research projects or proofs-of-concept, and it will pave the way for higher adoption of Machine Learning based solutions in enterprises. Some Machine Learning models use interpretable algorithms; for example, a decision tree will give you the importance of all the variables as an output. Also, the prediction path of any new data point can be analyzed using a decision tree, hence we can learn which variable played a crucial role in a prediction. Unfortunately, this can't be said for a lot of models, especially for the ones that have no notion of variable importance. Some Machine Learning models are interpretable in nature by default, e.g., generative models such as the Bayesian Rule List (Letham et al., https://arxiv.org/abs/1511.01644), while other simple black box models such as simple decision trees can be made interpretable by using the feature importances as an output. Also, the prediction path for a single tree, from its root to its leaves, can be visualized, capturing the contribution of each feature to the estimator's decision policies. But this intuitiveness may not be possible for complex non-linear models like Random Forests and Deep Neural Networks, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This lack of understanding of the complex nature of machine-learned decision policies means predictive models are still often viewed as black boxes. Model interpretations can help a data scientist and an end user in a variety of ways.
It will help bridge the gap that often exists between the technology teams and the business. For example, it can help identify the reason why a particular prediction is being made, which the end user can then verify against their domain knowledge by leveraging that easy-to-understand interpretation. It can also help data scientists understand the interactions among features, which can lead to better feature engineering and enhanced performance. It can further help in model comparisons and in explaining the results better to business stakeholders. The simplest approach to having interpretable models is to use algorithms that lead to interpretable models, like decision trees, logistic regression, and others. But we have no guarantee that an interpretable model will provide the best performance, hence we cannot always resort to such models. A recent, much better approach is to explain model predictions in an easy-to-interpret manner by learning an interpretable model locally around the prediction. This topic in fact gained extensive attention in 2016. Refer to the original research paper by M.T. Ribeiro, S. Singh & C. Guestrin, titled "Why Should I Trust You?": Explaining the Predictions of Any Classifier, at https://arxiv.org/pdf/1602.04938.pdf to understand more about model interpretation and the LIME framework, which proposes to solve this. The LIME framework attempts to explain any black box model locally (note that we need to define the scope of interpretation: global versus local), and you can check out its GitHub repository at https://github.com/marcotcr/lime. We will be leveraging another library named Skater, an open source Python library designed to demystify the inner workings of predictive models. Skater defines the scope of interpreting models in two ways: (1) globally (on the basis of a complete dataset) and (2)
locally (on the basis of an individual prediction). For global explanations, Skater makes use of model-agnostic variable importance and partial dependence plots to judge the bias of a model and understand its general behavior. To validate a model's decision policies for a single prediction, on the other hand, the library currently embraces a novel technique called Local Interpretable Model-agnostic Explanation (LIME, Ribeiro et al., 2016), which uses local surrogate models to assess performance. The library was authored by Aaron Kramer, Pramit Choudhary, and the DataScience.com team, and Skater is now a mainstream project and an excellent framework for model interpretation. We would like to acknowledge and thank the folks at DataScience.com (Ian Swanson, Pramit Choudhary, and Aaron Kramer) for developing this amazing framework, and especially Pramit for taking the time to explain to us in detail the features and vision for the Skater project. Some advantages of leveraging Skater are mentioned as follows; some of them are still actively being worked on and improved.

• Production-ready code using functional-style programming (declarative programming paradigm)

• Enables interpretation for both classification and regression based models for supervised learning problems to start with, to be gradually extended to support interpretation for unsupervised learning problems as well. This includes computationally efficient partial dependence plots and model-independent feature importance plots.

• Workflow abstraction: a common interface to perform local interpretation for in-memory models (model is under development) as well as deployed models (model has been deployed in production)

• Extending LIME: added support for interpreting regression based models, better sampling distributions for generating samples around a local prediction, and research into the ability to include non-linear models for local evaluation

• Enabling support for rule-based interpretable models, e.g., Letham et
al. (https://arxiv.org/abs/1511.01644)

• Better support for model evaluation for NLP based models, e.g., Bach et al., Layer-wise Relevance Propagation (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0130140)

• Better support for image interpretability, e.g., Batra et al., Gradient-weighted Class Activation Mapping (https://arxiv.org/abs/1610.02391)

Besides this, since the time this project started, the team has committed some improvements, namely support for regression, back into the original LIME repository, and they have other aspects of interpretation on their roadmap along with further improvements to LIME planned for the future. You can easily install Skater by running the pip install -U skater command from your prompt or terminal. For more information, you can check out the GitHub repository at https://github.com/datascienceinc/Skater or join the chat group at https://gitter.im/datascienceinc-skater/Lobby.

Understanding Skater

Skater is an open source Python framework that aims to provide model-agnostic interpretation of predictive models. It is an active project on GitHub at https://github.com/datascienceinc/Skater, with many of the previously mentioned features being actively worked upon. The idea behind Skater is to understand black box Machine Learning models by querying them and interpreting their learned decision policies. Its philosophy is that all models should be evaluated as black boxes, with the decision criteria of the models inferred and interpreted by perturbing the inputs and observing the corresponding output predictions. Model interpretation with Skater enables us to do both global and local interpretation, as depicted in Figure 5-16.

Figure 5-16.
Scope of Model Interpretation (source: DataScience.com)

Using the Skater library, we can explore feature importances, partial dependence plots over features, and the global and local fidelity of the predictions made by the model. The fidelity of a model can be described as the reasons on the basis of which the model calculated and predicted a particular class. For example, suppose we have a model that predicts whether a particular user transaction should be tagged as fraudulent or not. The output of the model will be much more trustworthy if we can identify, interpret, and depict that the model marked the prediction as fraud because the amount is larger than the user's maximum transaction in the last six months and the transaction location is 1,000 km away from the user's normal transaction areas. Contrast this with the case where we are only given a prediction label without any justifying explanation. The general workflow within the Skater package is to create an interpretation object, create a model object, and run interpretation algorithms. An Interpretation object takes a dataset as input, and optionally some metadata like feature names and row identifiers. Internally, the Interpretation object will generate a DataManager to handle data requests and sampling. While we can interpret any model by leveraging the model estimator objects, to ensure consistency and proper functionality across all of Skater's interfaces, model objects need to be wrapped in Skater's Model object, which can either be an InMemoryModel object over an actual model or a DeployedModel object for a model behind an API or web service. Figure 5-17 depicts a standard Machine Learning workflow and how Skater can be leveraged for interpreting the two different types of models we just mentioned.
Let's use our logistic regression model from earlier to do some model interpretation on our breast cancer dataset!

Figure 5-17. Model Interpretation in a standard Machine Learning Workflow (source: DataScience.com)

Model Interpretation in Action

We will be using the train and test datasets from the breast cancer dataset that we have been using in this chapter, for consistency. We will leverage the X_train and X_test variables and also the logistic model object (logistic regression model) that we created previously, and try to run some model interpretations on this model object. The standard workflow for model interpretation is to create Skater interpretation and model objects.

from skater.core.explanations import Interpretation
from skater.model import InMemoryModel

interpreter = Interpretation(X_test, feature_names=data.feature_names)
model = InMemoryModel(logistic.predict_proba, examples=X_train, target_names=logistic.classes_)

Once this is complete, we are ready to run model interpretation algorithms. We will start by trying to generate feature importances. This will give us an idea of the degree to which our predictive model relies on particular features. The Skater framework's feature importance implementation is based on an information theory criterion: it measures the entropy in the change of predictions, given a perturbation of a specific feature. The idea is that the more a model's decision making criteria depend on a feature, the more the predictions will change as a function of perturbing that feature.

plots = interpreter.feature_importance.plot_feature_importance(model, ascending=False)

Figure 5-18. Feature importances obtained from our logistic regression model

We can clearly observe from Figure 5-18 that the most important feature in our model is worst area, followed by mean perimeter and area error.
Let's now consider the most important feature, worst area, and think about ways it might influence the model's decision making process during predictions. Partial dependence plots are an excellent tool to visualize this. In general, partial dependence plots help describe the marginal impact of a specific feature on model prediction by holding the other features in the model constant. The derivative of the partial dependence describes the impact of the feature. The following code builds the partial dependence plot for the worst area feature in our model.

p = interpreter.partial_dependence.plot_partial_dependence(['worst area'], model,
                                                           grid_resolution=50,
                                                           with_variance=True, figsize=(6, 4))

Figure 5-19. One-way partial dependence plot for our logistic regression model predictor based on worst area

From the plot in Figure 5-19, we can see that the worst area feature has a strong influence on the model's decision making process. Based on the plot, if the worst area value decreases below 800, the model is more prone to classify the data point as benign (label 1), which indicates no cancer. This is definitely interesting! Let's try to interpret some actual predictions now. We will predict two data points, one not having cancer (label 1) and one having cancer (label 0), and try to interpret the prediction making process.

from skater.core.local_interpretation.lime.lime_tabular import LimeTabularExplainer

exp = LimeTabularExplainer(X_train, feature_names=data.feature_names,
                           discretize_continuous=True, class_names=['0', '1'])

# explain prediction for data point having no cancer, i.e. label 1
exp.explain_instance(X_test[0], logistic.predict_proba).show_in_notebook()

Figure 5-20.
Model interpretation for our logistic regression model's prediction for a data point having no cancer (benign)

The results depicted in Figure 5-20 show the features that were primarily responsible for the model predicting the data point as label 1, i.e. having no cancer. We can also see that the feature most influential in this decision was worst area! Let's run a similar interpretation on a data point with malignant cancer.

# explain prediction for data point having malignant cancer, i.e. label 0
exp.explain_instance(X_test[1], logistic.predict_proba).show_in_notebook()

Figure 5-21. Model interpretation for our logistic regression model's prediction for a data point having cancer (malignant)

The results depicted in Figure 5-21 once again show us the features that were primarily responsible for the model predicting the data point as label 0, i.e. having malignant cancer. The feature worst area was again the most influential one, and you can notice the stark difference in its value as compared to the previous data point. Hopefully this gives you some insight into how model interpretation works. A point to remember here is that model interpretation has only recently started gaining mainstream interest, since 2016, but it is going to be a good and worthwhile journey toward making models easy to understand for anyone!

Model Deployment

The tough part of the whole modeling process is mostly the iterative cycle of feature engineering, model building, tuning, and evaluation. Once we are done with this iterative process of model development, we can breathe a sigh of relief, but not for long! The final piece of the Machine Learning modeling puzzle is deploying the model in production so that we actually start using it. In this section, you will learn about the various ways you can deploy your models and the necessary dependencies that must be taken care of in this process.
Model Persistence

Model persistence is the simplest way of deploying a model. In this scheme of things, we persist our final model on permanent media like our hard drive and use this persisted version for making predictions in the future. This simple scheme is a good way to deploy models with minimal effort. Model development is generally done on a static data source, but once deployed, the model is typically used on a constant stream of data, either in real time/near real time or in batches. For example, consider a bank fraud detection model; at the time of model development, we will have data collected over some historical time span. We will use this data for the model development process and come up with a model with good performance, i.e. a model that is very good at flagging potentially fraudulent transactions. The model then needs to be deployed over all of the future transactions that the bank (or any other financial entity) conducts. This means that for every transaction, we need to extract the data required by our model and feed that data to it. The model's prediction is attached to the transaction, and on that basis the transaction is flagged as either fraudulent or clean. In the simplest scheme of things, we can write a standalone Python script that is given the new data as soon as it arrives. It performs the necessary data transformations on the raw data and then reads our model from the permanent data store. Once we have the data and the model, we can make a prediction, and this prediction can be integrated with the required operations. These required operations are often tied to the business needs of the model. In our case of tagging fraudulent transactions, it can involve notifying the fraud department or simply denying the transaction.
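Such a standalone scoring script could be sketched as follows. Note that the file name fraud_model.pkl, the extract_features helper, and the 0.5 threshold are hypothetical placeholders, not part of the book's code, and we load the model with the standalone joblib package here.

```python
import pandas as pd
from joblib import load

def extract_features(raw_df):
    # hypothetical feature engineering step; in a real pipeline this must
    # mirror the exact transformations used during model development
    return raw_df.select_dtypes(include='number')

def score_batch(raw_df, model_path='fraud_model.pkl', threshold=0.5):
    model = load(model_path)              # read the persisted model from disk
    features = extract_features(raw_df)
    fraud_prob = model.predict_proba(features)[:, 1]
    return fraud_prob >= threshold        # True => flag transaction for review

# downstream, each flagged transaction would trigger the business action,
# e.g. notifying the fraud department or denying the transaction
```

In practice this script would be scheduled (or triggered per batch of arriving transactions), and its flags fed to whatever alerting or transaction-blocking mechanism the business requires.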
Most of the steps involved in this process, like data acquisition/retrieval, extraction, feature engineering, and the actions to be taken upon prediction, are related to the software or data engineering process and require custom software development and tinkering with data engineering processes like ETL (extract-transform-load). For persisting our model to disk, we can leverage libraries like pickle or joblib, which is also available with scikit-learn. This allows us to deploy and use the model in the future without having to retrain it each time we want to use it.

from sklearn.externals import joblib
joblib.dump(logistic, 'lr_model.pkl')

This code will persist our model on the disk as a file named lr_model.pkl, so whenever we load this object into memory again, we will get the logistic regression model object.

lr = joblib.load('lr_model.pkl')
lr

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

We can now use this lr object, which is our model loaded from disk, to make predictions. A sample is depicted as follows.

print(lr.predict(X_test[10:11]), y_test[10:11])

[1] [1]

Remember that once you have a persisted model, you can easily integrate it with a Python based script or application that can be scheduled to predict on new data in real time or in batches. However, proper engineering of the solution is necessary to ensure the right data reaches the model, and the prediction output should also be broadcast to the right channels.

Custom Development

Another option to deploy a model is to develop the implementation of the model's prediction method separately. The output of most Machine Learning algorithms is just the values of the parameters that were learned.
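As an illustration of this reduction, a fitted logistic regression boils down to a coefficient vector and an intercept. The following minimal sketch (synthetic data and an illustrative model, not the chapter's logistic object) reproduces the library's predictions from nothing but those learned parameters:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic data and an illustrative model (not the chapter's logistic object)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=42)
clf = LogisticRegression().fit(X_demo, y_demo)

# the learned parameters: a coefficient vector and an intercept
coef, intercept = clf.coef_.ravel(), clf.intercept_[0]

# score = w . x + b, squashed through the sigmoid/logistic function
scores = X_demo @ coef + intercept
probs = 1.0 / (1.0 + np.exp(-scores))
manual_pred = (probs > 0.5).astype(int)

print(np.array_equal(manual_pred, clf.predict(X_demo)))  # prints True
```

Once the coefficients and intercept are exported as plain configuration, this dot-product-plus-sigmoid computation can be reimplemented in any language or system, with no scikit-learn dependency at prediction time.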
Once we have extracted these parameter values, the prediction process is pretty straightforward. For example, the prediction of a logistic regression can be computed by multiplying the coefficient vector with the input data vector. This simple calculation gives us the score for the data vector, which we can feed to the sigmoid/logistic function to extract the prediction for our input data. This method has more of its roots in the software development process, as the developed model is reduced to a set of configurations and parameters, and the main focus is on engineering the data and the necessary mathematical computations using some programming language. This configuration can be used to develop a custom implementation pipeline in which the prediction process is just a simple mathematical operation.

In-House Model Deployment

A lot of enterprises and organizations will not want to expose the private and confidential data on which models need to be built and deployed. Hence they will leverage their own software and Data Science expertise to build and deploy custom solutions on their own infrastructure. This can involve leveraging commercial off-the-shelf tools to deploy models or using custom open source tools and frameworks. Python based models can be easily integrated with frameworks like Flask or Django to create REST APIs or microservices on top of the prediction models, and these API endpoints can then be exposed and integrated with any other solutions or applications that might need them.

Model Deployment as a Service

The computational world is seeing a surge of the cloud and the XAAS (anything as a service) model in all areas. This is also true for model development and deployment.
Major providers like Google, Microsoft, and Amazon Web Services (AWS) provide the facility of developing Machine Learning models using their cloud services, as well as the facility of deploying those models as a service on the cloud. This is very beneficial to end users due to the reliability and ease of scaling offered by these service providers. A major downside to custom development or deploying models in-house is the extra work and maintenance required. The scalability of the solution is another problem that may exist for some kinds of models, like fraud prediction, due to the sheer prediction volumes required. Model deployment as a service takes care of these issues, as in most cases the model prediction can be accessed via a request made to a cloud based API endpoint (by supplying the necessary data, of course). This capability removes the burden of maintaining an extra system from the developers of the application that will be consuming the outputs of our model. In most cases, if the developers can take care of passing the required data to the model deployment APIs, they don't have to deal with the computational requirements of the prediction system or its maintenance. Another advantage of cloud deployment comes from how easy it is to update the models. Model development is an iterative process, and deployed models need to be updated from time to time to maintain their relevance. By maintaining the models at a single endpoint in the cloud, we simplify the process of model updating, as only a single replacement is required, which can happen with the push of a button and syncs with all downstream applications.

Summary

This chapter concludes the second part of this book, which focused on the Machine Learning pipeline. We learned the most important aspects of the model building process, which include model training, tuning, evaluation, interpretation, and deployment.
Details of various types of models were discussed in the model building section, including classification, regression, and clustering models. We also covered the three vital stages of any Machine Learning process, with an example of the logistic regression model and how gradient descent is an important optimization process. Hands-on examples of classification and clustering model building processes were depicted on real datasets. Various strategies for evaluating classification, regression, and clustering models were also covered, with detailed metrics for each of them depicted with real examples. A section of this book has been completely dedicated to the tuning of models, including strategies for hyperparameter tuning and cross-validation, with detailed depictions of tuning on real models. A nascent field in Machine Learning is model interpretation, where we try to understand and explain how model predictions really work. Various aspects of model interpretation have also been covered in detail, including feature importances, partial dependence plots, and prediction explanations. Finally, we also looked at some aspects pertaining to model deployment and the various options for deploying models. This should give you a good idea of how to start building and tuning models. We will reinforce these concepts and methodologies in the third part of this book, where we will be working on real-world case studies.

PART III

Real-World Case Studies

CHAPTER 6

Analyzing Bike Sharing Trends

"All work and no play" is a well-known proverb and we certainly do not want to be dull. So far, we have covered the theoretical concepts, frameworks, workflows, and tools required to solve Data Science problems. The use case driven theme begins with this chapter. In this section of the book, we cover a wide range of Machine Learning/Data Science concepts through real-life case studies.
Through this and subsequent chapters, we will discuss and apply concepts learned so far to solve some exciting real-world problems. This chapter discusses regression based models to analyze data and predict outcomes. In particular, we will utilize the Capital Bike Sharing dataset from the UCI Machine Learning Repository to understand regression models and predict bike usage demand. Through this chapter, we cover the following topics:

• The Bike Sharing dataset, to understand the dataset available from the UCI Machine Learning Repository
• Problem statement, to formally define the problem to be solved
• Exploratory data analysis, to explore and understand the dataset at hand
• Regression analysis, to understand regression modeling concepts and apply them to solve the problem

The Bike Sharing Dataset

The CRISP-DM model introduced in the initial chapters talks about a typical workflow associated with a Data Science problem/project. The workflow diagram has data at its center for a reason. Before we get started on different techniques to understand and play with the data, let's understand its origins. The Bike Sharing dataset is available from the UCI Machine Learning Repository. It is one of the largest and probably also the longest standing online repository of datasets used in all sorts of studies and research from across the world. The dataset we will be utilizing is one such dataset from among the hundreds available on the web site. The dataset was donated by the University of Porto, Portugal in 2013. More information is available at https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset.

■■Note We encourage you to check out the UCI Machine Learning Repository and particularly the Bike Sharing Data Set page. We thank Fanaee et al. for their work and for sharing the dataset through the UCI Machine Learning Repository. Fanaee-T, Hadi, and Gama, Joao, "Event labeling combining ensemble detectors and background knowledge," Progress in Artificial Intelligence (2013): pp.
1-15, Springer Berlin Heidelberg.

© Dipanjan Sarkar, Raghav Bali and Tushar Sharma 2018
D. Sarkar et al., Practical Machine Learning with Python, https://doi.org/10.1007/978-1-4842-3207-1_6

Problem Statement

With environmental issues and health becoming trending topics, the use of bicycles as a mode of transportation has gained traction in recent years. To encourage bike usage, cities across the world have successfully rolled out bike sharing programs. Under such schemes, riders can rent bicycles using manual/automated kiosks spread across the city for defined periods. In most cases, riders can pick up bikes from one location and return them to any other designated place. Bike sharing platforms from across the world are hotspots of all sorts of data, ranging from travel time, start and end location, demographics of riders, and so on. This data, along with alternate sources of information such as weather, traffic, terrain, and so on, makes it an attractive proposition for different research areas. The Capital Bike Sharing dataset contains information related to one such bike sharing program underway in Washington DC. Given this augmented (bike sharing details along with weather information) dataset, can we forecast bike rental demand for this program?

Exploratory Data Analysis

Now that we have an overview of the business case and a formal problem statement, the very next stage is to explore and understand the data. This is also called the Exploratory Data Analysis (EDA) step. In this section, we will load the data into our analysis environment and explore its properties. It is worth mentioning again that EDA is one of the most important phases in the whole workflow and can help not just with understanding the dataset, but also with surfacing certain fine points that will be useful in the coming steps.

■■Note The bike sharing dataset contains day level and hour level data.
We will be concentrating only on the hourly data available in hour.csv.

Preprocessing

The EDA process begins with loading the data into the environment and getting a quick look at it, along with the count of records and number of attributes. We will be making heavy use of pandas and numpy to perform data manipulation and related tasks. For visualization purposes, we will use matplotlib and seaborn, along with pandas' visualization capabilities wherever possible. We begin by loading hour.csv and checking the shape of the loaded dataframe. The following snippet does the same.

In [2]: hour_df = pd.read_csv('hour.csv')
   ...: print("Shape of dataset::{}".format(hour_df.shape))
Shape of dataset::(17379, 17)

The dataset contains more than 17k records with 17 attributes. Let's check the top few rows to see how the data looks. We use the head() utility from pandas for the same to get the output in Figure 6-1.

Figure 6-1. Sample rows from Bike Sharing dataset

The data seems to have loaded correctly. Next, we need to check what data types pandas has inferred and whether any of the attributes require type conversion. The following snippet helps us check the data types of all attributes.

In [3]: hour_df.dtypes
Out[3]:
instant         int64
dteday          object
season          int64
yr              int64
mnth            int64
hr              int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp            float64
atemp           float64
hum             float64
windspeed       float64
casual          int64
registered      int64
cnt             int64
dtype: object

As mentioned in the documentation for the dataset, there are bike sharing as well as weather attributes available. The attribute dteday requires type conversion from object (or string type) to timestamp.
Attributes like season, holiday, weekday, and so on are inferred as integers by pandas, and they require conversion to categoricals for proper handling. Before jumping into type casting the attributes, the following snippet cleans up the attribute names to make them more understandable and pythonic.

In [4]: hour_df.rename(columns={'instant':'rec_id',
   ...:                         'dteday':'datetime',
   ...:                         'holiday':'is_holiday',
   ...:                         'workingday':'is_workingday',
   ...:                         'weathersit':'weather_condition',
   ...:                         'hum':'humidity',
   ...:                         'mnth':'month',
   ...:                         'cnt':'total_count',
   ...:                         'hr':'hour',
   ...:                         'yr':'year'},inplace=True)

Now that we have the attribute names cleaned up, we perform type casting of attributes using utilities like pd.to_datetime() and astype(). The following snippet gets the attributes into proper data types.

In [5]: # date time conversion
   ...: hour_df['datetime'] = pd.to_datetime(hour_df.datetime)
   ...:
   ...: # categorical variables
   ...: hour_df['season'] = hour_df.season.astype('category')
   ...: hour_df['is_holiday'] = hour_df.is_holiday.astype('category')
   ...: hour_df['weekday'] = hour_df.weekday.astype('category')
   ...: hour_df['weather_condition'] = hour_df.weather_condition.astype('category')
   ...: hour_df['is_workingday'] = hour_df.is_workingday.astype('category')
   ...: hour_df['month'] = hour_df.month.astype('category')
   ...: hour_df['year'] = hour_df.year.astype('category')
   ...: hour_df['hour'] = hour_df.hour.astype('category')

Distribution and Trends

The dataset after preprocessing (which we performed in the previous step) is ready for some visual inspection. We begin by visualizing hourly ridership counts across the seasons.
The following snippet uses seaborn's pointplot to visualize the same.

In [6]: fig,ax = plt.subplots()
   ...: sn.pointplot(data=hour_df[['hour',
   ...:                            'total_count',
   ...:                            'season']],
   ...:              x='hour',y='total_count',
   ...:              hue='season',ax=ax)
   ...: ax.set(title="Season wise hourly distribution of counts")

Figure 6-2. Season wise hourly data distribution

The plot in Figure 6-2 shows similar trends for all seasons, with counts peaking in the morning between 7-9 am and in the evening between 4-6 pm, possibly due to high movement at the start and end of office hours. The counts are lowest for the spring season, while fall sees the highest ridership across all 24 hours.

Similarly, the distribution of ridership across days of the week also presents interesting trends: higher usage during afternoon hours over weekends, while weekdays see higher usage during mornings and evenings. The code for the same is available in the jupyter notebook bike_sharing_eda.ipynb. The plot is shown in Figure 6-3.

Figure 6-3. Day-wise hourly data distribution

Having observed the hourly distribution of data across different categoricals, let's see if there are any aggregated trends. The following snippet helps us visualize monthly ridership trends using seaborn's barplot().

In [7]: fig,ax = plt.subplots()
   ...: sn.barplot(data=hour_df[['month',
   ...:                          'total_count']],
   ...:            x="month",y="total_count")
   ...: ax.set(title="Monthly distribution of counts")

The generated barplot showcases a definite trend in ridership based on the month of the year. The months June-September see the highest ridership. Looks like fall is a good season for bike sharing programs in Washington, D.C. The plot is shown in Figure 6-4.

Figure 6-4.
Month-wise ridership distribution

We encourage you to try to plot the four seasons across different subplots as an exercise to employ plotting concepts and see the trends for each season separately. Moving up the aggregation level, let's look at the distribution at the year level. Our dataset contains a year value of 0 representing 2011 and 1 representing 2012. We use a violin plot to understand multiple facets of this distribution in a crisp format.

■■Note Violin plots are similar to boxplots. Like boxplots, violin plots also visualize the inter-quartile range and other summary statistics like mean/median. Yet these plots are more powerful than standard boxplots due to their ability to visualize the probability density of the data. This is particularly helpful if the data is multimodal.

The following snippet plots the yearly distribution on violin plots.

In [8]: sn.violinplot(data=hour_df[['year',
   ...:                             'total_count']],
   ...:               x="year",y="total_count")

Figure 6-5 clearly helps us understand the multimodal distribution in both 2011 and 2012 ridership counts, with 2011 having peaks at lower values as compared to 2012. The spread of counts is also much greater for 2012, although the maximum density for both years is between 100-200 rides.

Figure 6-5. Violin plot showcasing year-wise ridership distribution

Outliers

While exploring and learning about any dataset, it is imperative that we check for extreme and unlikely values. Though we handle missing and incorrect information while preprocessing the dataset, outliers are usually caught during EDA. Outliers can severely and adversely impact downstream steps like modeling and the results. We usually utilize boxplots to check for outliers in the data. In the following snippet, we analyze outliers for numeric attributes like total_count, temperature, and wind_speed.
In [9]: fig,(ax1,ax2) = plt.subplots(ncols=2)
   ...: sn.boxplot(data=hour_df[['total_count',
   ...:                          'casual','registered']],ax=ax1)
   ...: sn.boxplot(data=hour_df[['temp','windspeed']],ax=ax2)

The generated plot is shown in Figure 6-6. We can easily mark out that the three count related attributes all seem to have a sizable number of outlier values, though the casual rider distribution has lower numbers overall. For the weather attributes of temperature and wind speed, we find outliers only in the case of wind speed.

Figure 6-6. Outliers in the dataset

We can similarly check for outliers at different granularity levels, like hourly, monthly, and so on. The visualization in Figure 6-7 showcases boxplots at the hourly level (the code is available in the bike_sharing_eda.ipynb jupyter notebook).

Figure 6-7. Outliers in hourly distribution of ridership

Correlations

Correlation helps us understand relationships between different attributes of the data. Since this chapter focuses on forecasting, correlations can help us understand and exploit relationships to build better models.

■■Note It is important to understand that correlation does not imply causation. We strongly encourage you to explore more on the same.

The following snippet first prepares a correlation matrix using the pandas utility function corr(). It then uses a heat map to plot the correlation matrix.
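Boxplots conventionally flag outliers using the 1.5 × IQR rule. As a quick aside (not from the book; the array below is synthetic), the same rule can be applied programmatically to count the points a boxplot would mark:

```python
import numpy as np

rng = np.random.default_rng(42)
# synthetic, right-skewed counts standing in for a column like total_count
counts = np.concatenate([rng.poisson(100, 1000), [800, 900, 950]])

# quartiles and inter-quartile range
q1, q3 = np.percentile(counts, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# points a boxplot would draw as individual outlier markers
outliers = counts[(counts < lower) | (counts > upper)]
print("IQR bounds: ({:.1f}, {:.1f}), outliers found: {}".format(lower, upper, len(outliers)))
```

The same computation applied per group (per hour, per month) reproduces what Figures 6-6 and 6-7 show visually.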
In [10]: corrMatt = hour_df[["temp","atemp",
    ...:                     "humidity","windspeed",
    ...:                     "casual","registered",
    ...:                     "total_count"]].corr()
    ...: mask = np.array(corrMatt)
    ...: mask[np.tril_indices_from(mask)] = False
    ...: sn.heatmap(corrMatt, mask=mask,
    ...:            vmax=.8, square=True, annot=True)

Figure 6-8 shows the output correlation matrix (heat map), showing values in lower triangular form on a blue to red gradient (negative to positive correlation).

Figure 6-8. Correlation matrix

The two count variables, registered and casual, show an obvious strong correlation to total_count. Similarly, temp and atemp show high correlation. wind_speed and humidity have a slight negative correlation. Overall, none of the other attributes show high correlation statistics.

Regression Analysis

Regression analysis is a statistical modeling technique used by statisticians and Data Scientists alike. It is the process of investigating relationships between dependent and independent variables. Regression itself includes a variety of techniques for modeling and analyzing relationships between variables. It is widely used for predictive analysis, forecasting, and time series analysis. The dependent or target variable is estimated as a function of independent or predictor variables. The estimation function is called the regression function.

■■Note In a very abstract sense, regression refers to the estimation of continuous response/target variables, as opposed to classification, which estimates discrete targets.

The height-weight relationship is a classic example to get started with regression analysis. The example states that the weight of a person is dependent on his/her height. Thus, we can formulate a regression function to estimate the weight (dependent variable) given the height (independent variable) of a person, provided we have enough training examples.
We discuss more on this in the coming section. Regression analysis models the relationship between dependent and independent variables. It should be kept in mind that correlation between dependent and independent variables does not imply causation!

Types of Regression

There are multiple techniques that have evolved over the years to help us perform regression analysis. In general, all regression modeling techniques involve the following:

• The independent variable X
• The dependent or target variable Y
• Unknown parameter(s), denoted as β

Thus, a regression function relates these entities as:

Y = f(X, β)

The function f() needs to be specified or learned from the dataset available. Depending upon the data and the use case at hand, the following are commonly used regression techniques:

• Linear regression: As the name suggests, it maps linear relationships between dependent and independent variables. The regression line is a straight line in this technique. The aim here is to minimize the error (the sum of squared errors, for instance).

• Logistic regression: In cases where the dependent variable is binary (0/1 or Yes/No), this technique is utilized. It helps us determine the probability of the binary target variable. It derives its name from the logit function used by the technique. The aim here is to maximize the likelihood of the observed values. This technique has more in common with classification techniques than regression.

• Non-linear regression: Used in cases where the dependent variable is related polynomially to the independent variable, i.e., the regression function involves powers of the independent variables greater than 1. It is also termed polynomial regression.

Regression techniques may also be classified as non-parametric.

Assumptions

Regression analysis has a few general assumptions, while specific analysis techniques have added (or reduced) assumptions as well.
The following are important general assumptions for regression analysis:

• The training dataset needs to be representative of the population being modeled.
• The independent variables are linearly independent, i.e., one independent variable cannot be explained as a linear combination of the others. In other words, there should be no multicollinearity.
• Homoscedasticity of error, i.e., the variance of the error is consistent across the sample.

Evaluation Criteria

Evaluation of model performance is an important aspect of Data Science use cases. We should be able to not just understand the outcomes but also evaluate how models compare to each other, and whether their performance is acceptable or not. Although evaluation metrics and performance guidelines are in general quite use case and domain specific, regression analysis relies on a few standard metrics.

Residual Analysis

Regression is an estimation of the target variable using the regression function on the explanatory variables. Since the output is an approximation, there will be some difference between the predicted value of the target and the observed value. A residual is the difference between the observed value and the predicted value (the output of the regression function). Mathematically, the residual for the ith data point is given as:

e_i = y_i − f(x_i, β)

A regression model that fits the data nicely will have residuals that display randomness (i.e., lack of any pattern). This follows from the homoscedasticity assumption of regression modeling. Typically, scatter plots between residuals and predictors are used to confirm the assumption. Any pattern results in a violation of this property and points toward a poorly fitting model.

Normality Test (Q-Q Plot)

This is a visual/graphical test to check the normality of the data. This test helps us identify outliers, skewness, and so on. The test is performed by plotting the data against theoretical quantiles.
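To make the residual definition concrete, the following short sketch (not from the book; the data and fitted coefficients are synthetic) computes residuals e_i = y_i − f(x_i, β) for a toy linear fit and confirms they average out to zero, as ordinary least squares with an intercept guarantees:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)  # true line plus noise

# fit a straight line; polyfit returns [slope, intercept], so reverse it
b, a = np.polyfit(x, y, deg=1)
predicted = a + b * x

# residual: e_i = y_i - f(x_i, beta)
residuals = y - predicted
print("mean residual: {:.6f}".format(residuals.mean()))
```

Plotting `residuals` against `predicted` (or against each predictor) is exactly the randomness check described above: any visible structure signals a violation of homoscedasticity.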
The same data is also plotted on a histogram to confirm normality. Figure 6-9 shows sample plots of data passing the normality test.

Figure 6-9. Normal plot (Q-Q plot) on the left and histogram to confirm normality on the right

Any deviation from the straight line in the normal plot, or skewness/multi-modality in the histogram, shows that the data does not pass the normality test.

R-Squared: Goodness of Fit

R-squared, or the coefficient of determination, is another measure used to check goodness of fit for regression analysis. It measures the extent to which the regression line is able to account for the variance in the dependent variable as explained by the independent variable(s). R-squared is a numeric value between 0 and 1, with 1 indicating that the independent variable(s) fully explain the variance in the dependent variable. Values closer to 0 are indicative of poorly fitting models.

Cross Validation

As discussed in Chapter 5, model generalization is also an important aspect of working on Data Science problems. A model that overfits its training set may perform poorly on unseen data and lead to all sorts of problems and business impacts. Hence, we employ k-fold cross validation on regression models as well to make sure there is no overfitting.

Modeling

The stage is now set to start modeling our Bike Sharing dataset and solve the business problem of predicting bike demand for a given date-time. We will utilize the concepts of regression analysis discussed in the previous section to model and evaluate the performance of our models. The dataset was analyzed and certain transformations, like renaming attributes and type casting, were performed earlier in the chapter. Since the dataset contains multiple categorical variables, it is imperative that we encode the nominal ones before we use them in our modeling process.
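The coefficient of determination can be computed directly from its definition, R² = 1 − SS_res / SS_tot. A minimal sketch (not from the book; the observed and predicted values are made up for illustration):

```python
import numpy as np

y_observed = np.array([10.0, 12.0, 15.0, 18.0, 21.0])
y_predicted = np.array([10.5, 12.5, 14.0, 18.5, 20.5])

ss_res = np.sum((y_observed - y_predicted) ** 2)        # residual sum of squares
ss_tot = np.sum((y_observed - y_observed.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot

print("R-squared: {:.4f}".format(r_squared))  # prints "R-squared: 0.9746"
```

A value this close to 1 means the predictions track almost all of the variance around the mean of the observed values; values near 0 mean the model does no better than always predicting the mean.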
The following snippet showcases the function to one hot encode categorical variables, based on methodologies we discussed in detail in Chapter 4: Feature Engineering and Selection.

def fit_transform_ohe(df,col_name):
    """This function performs one hot encoding for the specified column.

    Args:
        df(pandas.DataFrame): the data frame containing the mentioned column name
        col_name: the column to be one hot encoded

    Returns:
        tuple: label_encoder, one_hot_encoder, transformed column as pandas Series
    """
    # label encode the column
    le = preprocessing.LabelEncoder()
    le_labels = le.fit_transform(df[col_name])
    df[col_name+'_label'] = le_labels

    # one hot encoding
    ohe = preprocessing.OneHotEncoder()
    feature_arr = ohe.fit_transform(df[[col_name+'_label']]).toarray()
    feature_labels = [col_name+'_'+str(cls_label) for cls_label in le.classes_]
    features_df = pd.DataFrame(feature_arr, columns=feature_labels)

    return le,ohe,features_df

We use the fit_transform_ohe() function along with transform_ohe() to encode the categoricals. The Label and One Hot encoders are available as part of scikit-learn's preprocessing module.

■■Note We will be using scikit and sklearn interchangeably in the coming sections.

As discussed in the earlier chapters, we usually divide the dataset at hand into training and testing sets to evaluate the performance of our models. In this case as well, we use scikit-learn's train_test_split() function, available through the model_selection module. We split our dataset into 67% train and 33% test. The following snippet showcases the same.
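The companion transform_ohe() function is called later in the chapter to apply the already-fitted encoders to the test set, but its body does not appear in this excerpt. A sketch consistent with how it is invoked (this is our reconstruction, not the book's code) might look like the following, with a hypothetical season column as usage example:

```python
import pandas as pd
from sklearn import preprocessing

def transform_ohe(df, le, ohe, col_name):
    """Apply a fitted LabelEncoder and OneHotEncoder to a column of an
    unseen dataframe (reconstruction of the book's helper)."""
    # reuse the label encoding learned on the training data
    df[col_name + '_label'] = le.transform(df[col_name])

    # reuse the one hot encoding learned on the training data
    feature_arr = ohe.transform(df[[col_name + '_label']]).toarray()
    feature_labels = [col_name + '_' + str(cls_label) for cls_label in le.classes_]
    return pd.DataFrame(feature_arr, columns=feature_labels)

# tiny usage example with a hypothetical 'season' column
train = pd.DataFrame({'season': [1, 2, 3, 4, 2, 3]})
le = preprocessing.LabelEncoder()
le.fit(train['season'])
ohe = preprocessing.OneHotEncoder()
ohe.fit(le.transform(train['season']).reshape(-1, 1))

test = pd.DataFrame({'season': [3, 1]})
encoded = transform_ohe(test, le, ohe, 'season')
print(encoded.shape)  # one column per season category seen in training
```

Reusing the encoders fitted on the training data, rather than refitting on the test set, is what keeps the train and test feature spaces consistent.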
In [11]: X, X_test, y, y_test = train_test_split(hour_df.iloc[:,0:-3],
    ...:                                         hour_df.iloc[:,-1],
    ...:                                         test_size=0.33,
    ...:                                         random_state=42)
    ...:
    ...: X.reset_index(inplace=True)
    ...: y = y.reset_index()
    ...:
    ...: X_test.reset_index(inplace=True)
    ...: y_test = y_test.reset_index()

The following snippet loops through the list of categorical variables to transform them and prepare a list of encoded attributes.

In [12]: cat_attr_list = ['season','is_holiday',
    ...:                  'weather_condition','is_workingday',
    ...:                  'hour','weekday','month','year']
    ...:
    ...: encoded_attr_list = []
    ...: for col in cat_attr_list:
    ...:     return_obj = fit_transform_ohe(X,col)
    ...:     encoded_attr_list.append({'label_enc':return_obj[0],
    ...:                               'ohe_enc':return_obj[1],
    ...:                               'feature_df':return_obj[2],
    ...:                               'col_name':col})

■■Note Though we have transformed all categoricals into their one hot encodings, note that ordinal attributes such as hour, weekday, and so on do not require such encoding.

Next, we merge the numeric and one hot encoded categoricals into a dataframe that we will use for our modeling purposes. The following snippet helps us prepare the required dataset.
In [13]: feature_df_list = [X[numeric_feature_cols]]
    ...: feature_df_list.extend([enc['feature_df'] \
    ...:                         for enc in encoded_attr_list \
    ...:                         if enc['col_name'] in subset_cat_features])
    ...:
    ...: train_df_new = pd.concat(feature_df_list, axis=1)
    ...: print("Shape::{}".format(train_df_new.shape))

We prepared a new dataframe using the numeric and one hot encoded categorical attributes from the original training dataframe. The original dataframe had 10 such attributes (including both numeric and categoricals). After this transformation, the new dataframe has 19 attributes due to the one hot encoding of the categoricals.

Linear Regression

One of the simplest regression analysis techniques is linear regression. As discussed earlier in the chapter, linear regression is the analysis of the relationship between dependent and independent variables. Linear regression assumes a linear relationship between the two variables. Extending the general regression analysis notation, linear regression takes the following form:

Y = a + bX

In this equation, Y is the dependent variable and X is the independent variable. The symbol a denotes the intercept of the regression line and b is its slope. Numerous lines can be fitted to a given dataset based on different combinations of the intercept (a) and slope (b). The aim is to find the best fitting line to model our data.
It uses the squared error form, shown as follows:

q = Σ (y_observed − y_predicted)²

where q is the total squared error. We minimize the total error to get the slope and intercept of the best fitting line.

Training

Now that we have the background on linear regression and OLS, we'll get started with the model building. The linear regression model is exposed through scikit-learn's linear_model module. Like all Machine Learning algorithms in scikit, this also works on the familiar fit() and predict() theme. The following snippet prepares the linear regression object for us.

In [14]: X = train_df_new
    ...: y = y.total_count.values.reshape(-1,1)
    ...:
    ...: lin_reg = linear_model.LinearRegression()

One simple way of proceeding would be to call the fit() function to build our linear regression model and then call the predict() function on the test dataset to get the predictions for evaluation. But we also want to keep overfitting in check and obtain a generalizable model. As discussed in the previous section and in earlier chapters, cross validation is one method of keeping overfitting in check. We thus use k-fold cross validation (specifically, 10-fold), as shown in the following snippet.

In [15]: predicted = cross_val_predict(lin_reg, X, y, cv=10)

The function cross_val_predict() is exposed through the model_selection module of sklearn. This function takes the model object, predictors, and targets as inputs. We specify the k in k-fold using the cv parameter. In our example, we use 10-fold cross validation. This function returns cross validated prediction values as fitted by the model object. We use a scatter plot to analyze our predictions. The following snippet uses matplotlib to generate a scatter plot between the residuals and observed values.
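For simple linear regression, minimizing q with respect to the slope and intercept has a well-known closed-form solution: b = cov(x, y) / var(x) and a = ȳ − b·x̄. The short sketch below (not from the book; the data is synthetic with a known true line) recovers that line with it:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 500)
y = 4.0 + 1.5 * x + rng.normal(scale=0.3, size=x.size)  # true a=4.0, b=1.5

# Ordinary Least Squares in closed form for one predictor
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print("intercept a ~ {:.2f}, slope b ~ {:.2f}".format(a, b))
```

scikit-learn's LinearRegression solves the multivariate generalization of exactly this minimization, which is why no iterative tuning is needed for OLS.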
In [16]: fig, ax = plt.subplots()
    ...: ax.scatter(y, y-predicted)
    ...: ax.axhline(lw=2,color='black')
    ...: ax.set_xlabel('Observed')
    ...: ax.set_ylabel('Residual')
    ...: plt.show()

Figure 6-10. Residual plot

The plot in Figure 6-10 clearly violates the homoscedasticity assumption, which requires residuals to be random and not follow any pattern. To further quantify our findings related to the model, we plot the cross validation scores. We use the cross_val_score() function, again available as part of the model_selection module; the results are shown in the visualization in Figure 6-11.

Figure 6-11. Cross validation scores

The r-squared, or coefficient of determination, is 0.39 on average for 10-fold cross validation. This means the predictors are only able to explain 39% of the variance in the target variable. You are encouraged to plot and confirm the normality of the data; it is important to understand whether the data can be modeled by a linear model at all. This is left as an exercise for you to explore.

Testing

The linear regression model prepared and evaluated in the training phase needs to be checked for its performance on a completely unseen dataset: the testing dataset. At the beginning of this section, we used the train_test_split() function to keep a dataset specifically for testing purposes.
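As an aside (not from the book; the data here is synthetic), cross_val_score() returns one score per fold, and for regressors its default scorer is R-squared, so the per-fold values behind a plot like Figure 6-11 can be obtained directly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(300, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=300)

# one R-squared score per fold (R^2 is the default scorer for regressors)
scores = cross_val_score(LinearRegression(), X, y, cv=10)
print("mean R-squared: {:.3f}".format(scores.mean()))
```

Averaging these per-fold scores is exactly how the 0.39 figure quoted above for the bike sharing model is obtained.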
In [17]: test_encoded_attr_list = []
    ...: for enc in encoded_attr_list:
    ...:     col_name = enc['col_name']
    ...:     le = enc['label_enc']
    ...:     ohe = enc['ohe_enc']
    ...:     test_encoded_attr_list.append({'feature_df':transform_ohe(X_test,
    ...:                                                               le,ohe,
    ...:                                                               col_name),
    ...:                                    'col_name':col_name})
    ...:
    ...: test_feature_df_list = [X_test[numeric_feature_cols]]
    ...: test_feature_df_list.extend([enc['feature_df'] \
    ...:                              for enc in test_encoded_attr_list \
    ...:                              if enc['col_name'] in subset_cat_features])
    ...:
    ...: test_df_new = pd.concat(test_feature_df_list, axis=1)
    ...: print("Shape::{}".format(test_df_new.shape))

The transformed test dataset is shown in Figure 6-12.

Figure 6-12. Test dataset after transformations

The final piece of the puzzle is to fit the model, use the predict() function of the LinearRegression object, and compare our results/predictions. The following snippet performs the said actions.

In [18]: X_test = test_df_new
    ...: y_test = y_test.total_count.values.reshape(-1,1)
    ...:
    ...: lin_reg.fit(X, y)  # fit on the full training set before predicting
    ...: y_pred = lin_reg.predict(X_test)
    ...: residuals = y_test-y_pred

We also calculate the residuals and use them to prepare a residual plot, similar to the one we created during the training step. The following snippet plots the residual plot for the test dataset, annotating it with the test R-squared computed via sklearn's r2_score() metric.

In [19]: fig, ax = plt.subplots()
    ...: ax.scatter(y_test, residuals)
    ...: ax.axhline(lw=2,color='black')
    ...: ax.set_xlabel('Observed')
    ...: ax.set_ylabel('Residuals')
    ...: ax.title.set_text("Residual Plot with R-Squared={}".format(
    ...:                       r2_score(y_test, y_pred)))
    ...: plt.show()

The generated plot shows an R-squared that is comparable to the training performance.
The plot is shown in Figure 6-13.

Figure 6-13. Residual plot for test dataset

It is clearly evident from our evaluation that the linear regression model is unable to model the data well enough to generate decent results, though it should be noted that the model performs equally on both the training and testing datasets. It seems like a case where we need to model this data using methods that can capture non-linear relationships.

EXERCISE

In this section, we used training and testing datasets with 19 attributes (including both numeric and one hot encoded categoricals). The performance is dismal due to non-linearity and other factors. Experiment with different combinations of attributes (use only a subset, use only numerical attributes, or any combination of them) and prepare different linear regression models. Follow the same steps as outlined in this section. Check the performance against the model prepared in this section and analyze whether a better performing model is possible.

Decision Tree Based Regression

Decision trees are supervised learning algorithms used for both regression and classification problems. They are simple yet powerful in modeling non-linear relationships. Being a non-parametric model, the aim of this algorithm is to learn a model that can predict outcomes based on simple decision rules (for instance, if-else conditions) inferred from the features. The interpretability of decision trees makes them even more attractive, as we can visualize the rules the model has inferred from the data.

We explain the concepts and terminologies related to decision trees using an example. Suppose we have a hypothetical dataset of car models from different manufacturers. Assume each data point has features like fuel_capacity, engine_capacity, price, year_of_purchase, miles_driven, and mileage. Given this data, we need a model that can predict the mileage given the other attributes.
Since decision trees are supervised learning algorithms, we have a certain number of data points with actual mileage values available. A decision tree starts off at the root and divides the dataset into two or more non-overlapping subsets, each represented as a child node of the root. It divides the root into subsets based on a specific attribute, and it goes on performing splits at every node until a leaf node is reached where the target value is available. There might be a lot of questions on how this all happens, and we will get to each of them in a bit. For better understanding, assume Figure 6-14 is the structure of the decision tree inferred from the dataset at hand.

Figure 6-14. Sample decision tree (root split on year of purchase; internal nodes split on miles driven, price, engine capacity, and fuel capacity; leaves hold mileage values such as 14.2, 20, 23.2, 30.5, 35.3, and 55 mi/gal)

The visualization depicted in Figure 6-14 showcases a sample decision tree with leaf nodes pointing toward target values. The tree starts off by splitting the dataset at the root based on year of purchase, with the left child representing purchases before 2010 and the right child purchases in or after 2010, and similarly for other nodes. When presented with a new/unseen data point, we simply traverse the tree and arrive at a leaf node, which determines the target value. Even though this example is simple, it clearly brings out the interpretability of the model as well as its ability to learn simple rules.

Node Splitting

Decision trees work in a top-down manner, and node splitting is an important concept for any decision tree algorithm. Most algorithms follow a greedy approach to divide the input space into subsets. The basic process, in simple terms, is to try to split the data points using different attributes/features and test each candidate split against a cost function. The split resulting in the least cost is selected at every step.
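To make the greedy split search concrete, here is a minimal sketch (not the book's code) of finding the threshold on a single feature that minimizes the summed squared error of the two child nodes:

```python
import numpy as np

def best_split(x, y):
    """Greedily pick the threshold on one feature that minimizes
    the total squared error of the two resulting child nodes."""
    best_t, best_cost = None, np.inf
    for t in np.unique(x)[1:]:          # candidate thresholds
        left, right = y[x < t], y[x >= t]
        cost = ((left - left.mean()) ** 2).sum() + \
               ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# two clusters of target values: the best split lands between them
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 5.5, 5.2, 20.0, 21.0, 19.5])
print(best_split(x, y))
```

A real decision tree repeats this search over every feature at every node, which is exactly the recursive top-down procedure described above.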
Classification and regression problems use different sets of cost functions. Some of the most common ones are as follows:

• Mean Squared Error (MSE): Used mainly for regression trees, it is calculated as the mean of the squared differences between observed and predicted values.

• Mean Absolute Error (MAE): Also used for regression trees, it is similar to MSE except that we use the mean of the absolute differences between observed and predicted values.

• Variance Reduction: First introduced with the CART algorithm, it uses the standard formula of variance; we choose the split that results in the least variance.

• Gini Impurity/Index: Mainly used by classification trees, it measures how often a randomly chosen data point would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the node.

• Information Gain: Again used mainly for classification problems, it is based on the concept of entropy. We choose splits based on the amount of information gained; the higher the information gain, the better.

■ Note These are some of the most commonly used cost functions. There are many more which are used under specific scenarios.

Stopping Criteria

As mentioned, decision trees follow greedy recursive splitting of nodes, but how or when do they stop? There are many strategies applied to define the stopping criteria. The most common is a minimum count of data points: node splitting is stopped if further splitting would violate this constraint. Another constraint used is the depth of the tree. The stopping criteria, together with other parameters, help us achieve trees that can generalize well. A tree that is very deep or has too many non-leaf nodes often results in overfitting.

Hyperparameters

Hyperparameters are the knobs and controls we set with the aim of optimizing the model's performance on unseen data. These hyperparameters are different from parameters, which are learned by our learning algorithm over the course of the training process.
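The distinction is visible directly in scikit-learn: hyperparameters are supplied at construction time, while the learned parameters (the tree structure itself) only appear on the fitted object. A small sketch with synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# hypothetical random data, just to have something to fit
rng = np.random.RandomState(0)
X, y = rng.rand(100, 2), rng.rand(100)

# hyperparameters: chosen by us before training
dtr = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5)
print("max_depth hyperparameter:", dtr.get_params()['max_depth'])

# parameters: learned from the data during fit (the tree structure)
dtr.fit(X, y)
print("learned tree has", dtr.tree_.node_count, "nodes")
```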
Hyperparameters help us achieve objectives such as avoiding overfitting. Decision trees provide us with quite a few hyperparameters to play with, some of which we discussed in Chapter 5. Maximum depth, minimum samples for leaf nodes, minimum samples to split internal nodes, and maximum leaf nodes are some of the hyperparameters actively used to improve the performance of decision trees. We will use techniques like grid search (refresh your memory from Chapter 5) to identify optimal values for these hyperparameters in the coming sections.

Decision Tree Algorithms

Decision trees have been around for quite some time now. They have evolved with improvements in algorithms based on different techniques over the years. Some of the most commonly used algorithms are listed as follows:

• CART or Classification And Regression Tree
• ID3 or Iterative Dichotomizer 3
• C4.5

Now that we have a decent understanding of decision trees, let's see if we can achieve improvements by using them for our regression problem of predicting the bike sharing demand.

Training

Similar to the process with linear regression, we will use the same preprocessed dataframe train_df_new, with categoricals transformed into one hot encoded form along with the other numerical attributes. We also split the dataset into train and test sets again using the train_test_split() utility from scikit-learn. The training process for decision trees is a bit more involved and different as compared to linear regression. Even though we performed cross validation while training our linear regression model, we did not have any hyperparameters to tune. In the case of decision trees, we have quite a handful of them (some of which we discussed in the previous section). Before we get into the specifics of obtaining optimal hyperparameters, let's look at the DecisionTreeRegressor from sklearn's tree module.
We do so by instantiating a regressor object with some of the hyperparameters set as follows.

In [1]: dtr = DecisionTreeRegressor(max_depth=4,
   ...:                             min_samples_split=5,
   ...:                             max_leaf_nodes=10)

This code snippet prepares a DecisionTreeRegressor object set to have a maximum depth of 4, a maximum of 10 leaf nodes, and a minimum of 5 samples required to split a node. Though there can be many more, this example outlines how hyperparameters are utilized in algorithms.

■ Note You are encouraged to try and fit the default DecisionTreeRegressor on the training data and observe its performance on the testing dataset.

As mentioned, decision trees have the added advantage of being interpretable. We can visualize the model object using the Graphviz and pydot libraries, as shown in the following snippet.

In [2]: dot_data = tree.export_graphviz(dtr, out_file=None)
   ...: graph = pydotplus.graph_from_dot_data(dot_data)
   ...: graph.write_pdf("bikeshare.pdf")

The output is a PDF file showcasing a decision tree with the hyperparameters set in the previous step. The plot, as depicted in Figure 6-15, shows the root node being split on attribute 3, with splitting continuing until a depth of 4 is reached; there are some leaves at a depth of less than 4 as well. Each node clearly marks out the attributes associated with it.

Figure 6-15. Decision tree with defined hyperparameters on the Bike Sharing dataset

Now we start with the actual training process. As must be evident from our workflow so far, we will train our regressor using k-fold cross validation. Since we also have hyperparameters to worry about in the case of decision trees, we need a method to fine-tune them as well. There are many ways of fine-tuning hyperparameters; the most common are grid search and random search, with grid search being the more popular one.
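Conceptually, a grid search simply enumerates the Cartesian product of the candidate values, while random search samples from that product. A sketch with the standard library (the grid values here are illustrative):

```python
from itertools import product

# illustrative grid: every combination of these values gets evaluated
param_grid = {"max_depth": [2, 6, 8], "max_leaf_nodes": [5, 20, 100]}

combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]
print(len(combinations), "candidate settings")
print(combinations[0])
```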
As the name suggests, random search randomly samples combinations of hyperparameters in search of the best one; grid search, on the other hand, is a more systematic approach where all combinations are tried before the best is identified. To make our lives easier, sklearn provides a utility to grid search the hyperparameters while cross validating the model, using the GridSearchCV() method from the model_selection module. GridSearchCV() takes the regressor/classifier object as an input parameter, along with a dictionary of hyperparameters, the number of cross validation folds required, and a few more settings. We use the following dictionary to define our grid of hyperparameters.

In [3]: param_grid = {"criterion": ["mse", "mae"],
   ...:               "min_samples_split": [10, 20, 40],
   ...:               "max_depth": [2, 6, 8],
   ...:               "min_samples_leaf": [20, 40, 100],
   ...:               "max_leaf_nodes": [5, 20, 100, 500, 800],
   ...:               }

The dictionary basically provides a list of feasible values for each of the hyperparameters that we want to fine-tune. The hyperparameters are the keys, while the values are lists of possible values for those hyperparameters. For instance, our dictionary provides max_depth with possible values of 2, 6, and 8 levels. GridSearchCV() will in turn search this defined list of possible values to arrive at the best value. The following snippet prepares a GridSearchCV object and fits our training dataset to it.

In [4]: grid_cv_dtr = GridSearchCV(dtr, param_grid, cv=5)

The grid search of hyperparameters with k-fold cross validation is an iterative process wrapped, optimized, and standardized by the GridSearchCV() function. The training process takes time for this reason and results in quite a few useful attributes to analyze. The best_score_ attribute helps us get the best cross validation score our Decision Tree Regressor could achieve.
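The same pattern can be exercised end to end on synthetic data (a smaller, hypothetical grid to keep the run fast; a sketch, not the book's exact search):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# synthetic non-linear regression problem standing in for the real data
rng = np.random.RandomState(1)
X = rng.rand(300, 2)
y = np.sin(4 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.05, size=300)

param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [5, 20]}
grid = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid, cv=5)
grid.fit(X, y)

print("Best CV R-squared:", round(grid.best_score_, 3))
print("Best hyperparameters:", grid.best_params_)
```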
We can view the hyperparameters for the model that generates the best score using best_params_, and we can view the detailed information for each iteration of GridSearchCV() using the cv_results_ attribute. The following snippet showcases some of these attributes.

In [5]: print("R-Squared::{}".format(grid_cv_dtr.best_score_))
   ...: print("Best Hyperparameters::\n{}".format(grid_cv_dtr.best_params_))

R-Squared::0.85891903233008
Best Hyperparameters::
{'min_samples_split': 10, 'max_depth': 8, 'max_leaf_nodes': 500, 'min_samples_leaf': 20, 'criterion': 'mse'}

The results are decent and show a dramatic improvement over our linear regression model. Let's first try to understand the model fitting results across the different settings of this search. To get to the different models prepared during our grid search, we use the cv_results_ attribute of our GridSearchCV object. The cv_results_ attribute is a dictionary of numpy arrays that we can easily convert to a pandas dataframe. The dataframe is shown in Figure 6-16.

Figure 6-16. Dataframe showcasing tuning results with a few attributes of grid search with CV

It is important to understand that grid search with cross validation was optimized toward finding the best set of hyperparameters for a generalizable Decision Tree Regressor; further optimizations may still be possible. We use seaborn to plot the impact of the depth of the tree on the overall score, along with the number of leaf nodes. The following snippet uses the same dataframe we prepared from the cv_results_ attribute of the GridSearchCV object discussed previously.
In [6]: fig, ax = plt.subplots()
   ...: sn.pointplot(data=df[['mean_test_score',
   ...:                       'param_max_leaf_nodes',
   ...:                       'param_max_depth']],
   ...:              y='mean_test_score', x='param_max_depth',
   ...:              hue='param_max_leaf_nodes', ax=ax)
   ...: ax.set(title="Effect of Depth and Leaf Nodes on Model Performance")

The output shows a sudden improvement in score as depth increases from 2 to 6, and a more gradual improvement from 6 to 8. The impact of the number of leaf nodes is rather interesting: the difference in scores between 100 and 800 leaf nodes is strikingly small. This is a clear indicator that further fine-tuning is possible. Figure 6-17 depicts the visualization showcasing these results.

Figure 6-17. Mean test score and impact of tree depth and count of leaf nodes

As mentioned, there is still scope for fine-tuning to further improve the results. Whether to pursue it is a decision that is use case and cost dependent, where cost can be measured in effort, time, and the corresponding improvement achieved. For now, we'll proceed with the best model our GridSearchCV object has helped us identify.

Testing

Once we have a model trained with optimized hyperparameters, we can begin our usual workflow to test performance on an unseen dataset. We will utilize the same preprocessing discussed while preparing the test set for linear regression (you may refer to the "Testing" section of linear regression and/or the jupyter notebook decision_tree_regression.ipynb). The following snippet predicts the output values for the test dataset using the best estimator achieved during the training phase.

In [7]: y_pred = best_dtr_model.predict(X_test)
   ...: residuals = y_test.flatten() - y_pred

The final step is to view the R-squared score on this dataset. A well-fit model should have comparable performance on this set as well; the same is evident from the following snippet.
In [1]: print("R-squared::{}".format(r2_score))
R-squared::0.8722

As is evident from the R-squared value, the performance is quite comparable to our training performance. We can conclude that the Decision Tree Regressor was better at forecasting bike demand than linear regression.

Next Steps

Decision trees helped us achieve better performance than linear regression based models, yet further improvements are possible. The following are a few next steps to ponder and keep in mind:

• Model fine-tuning: We achieved a drastic improvement by using decision trees, yet this can be improved further by analyzing the results of training, cross validation, and so on. Acceptable model performance is use case dependent and is usually agreed upon while formalizing the problem statement. In our case, an R-squared of 0.8 might be very good or might be missing the mark; hence the results need to be discussed with all stakeholders involved.

• Other models and ensembling: If our current models do not achieve the performance criteria, we need to evaluate other algorithms and even ensembles. There are many other regression models to explore. Ensembles are also quite useful and are used extensively.

• Machine Learning pipeline: The workflow shared in this chapter was verbose to assist in understanding and exploring concepts and techniques. Once a sequence of preprocessing and modeling steps has stabilized, standardized pipelines are used to maintain the consistency and validity of the entire process. You are encouraged to explore sklearn's pipeline module for further details.

Summary

This chapter introduced the Bike Sharing dataset from the UCI Machine Learning repository. We followed the Machine Learning workflow discussed in detail in the previous section of the book. We started off with a brief discussion of the dataset, followed by formally defining the problem statement.
Once we had a mission to forecast the bike demand, the next step was exploratory data analysis to understand and uncover properties and patterns in the data. We utilized pandas, numpy, and seaborn/matplotlib to manipulate, transform, and visualize the dataset at hand. The line plots, bar charts, box plots, violin plots, and so on all helped us understand various aspects of the dataset.

We then took a detour and explored regression analysis. The important concepts, assumptions, and types of regression analysis techniques were discussed. We briefly touched on various performance/evaluation criteria typically used for regression problems, like residual plots, normal plots, and the coefficient of determination.

Equipped with a better understanding of the dataset and of regression itself, we started off with a simple algorithm called linear regression. It is not only a simple algorithm but one of the most well studied and extensively used algorithms for regression use cases. We utilized sklearn's linear_model module to build and test our model for the problem at hand. We also utilized the model_selection module to split our dataset and cross validate our model. The next step saw us graduating to a decision tree based regression model to improve on the performance of linear regression. We touched upon the concepts and important aspects related to decision trees before using them to model our dataset. The same set of preprocessing steps was used for both linear regression and decision trees. Finally, we concluded the chapter by briefly discussing next steps for improvement and enhancement. This chapter sets the flow and context for the coming chapters, which will build on the concepts and workflows from here on.

CHAPTER 7

Analyzing Movie Reviews Sentiment

In this chapter, we continue with our focus on case-study oriented chapters, where we will focus on specific real-world problems and scenarios and how we can use Machine Learning to solve them.
We will cover aspects pertaining to natural language processing (NLP), text analytics, and Machine Learning in this chapter. The problem at hand is sentiment analysis or opinion mining, where we want to analyze some textual documents and predict their sentiment or opinion based on the content of these documents. Sentiment analysis is perhaps one of the most popular applications of natural language processing and text analytics, with a vast number of websites, books, and tutorials on this subject. Typically, sentiment analysis seems to work best on subjective text, where people express opinions, feelings, and their mood. From a real-world industry standpoint, sentiment analysis is widely used to analyze corporate surveys, feedback surveys, social media data, and reviews for movies, places, commodities, and many more. The idea is to analyze and understand the reactions of people toward a specific entity and take insightful actions based on their sentiment.

A text corpus consists of multiple text documents, and each document can be as simple as a single sentence or as complex as a document with multiple paragraphs. Textual data, in spite of being highly unstructured, can be classified into two major types of documents. Factual documents typically depict some form of statements or facts with no specific feelings or emotion attached to them; these are also known as objective documents. Subjective documents, on the other hand, have text that expresses feelings, moods, emotions, and opinions.

Sentiment analysis is also popularly known as opinion analysis or opinion mining. The key idea is to use techniques from text analytics, NLP, Machine Learning, and linguistics to extract important information or data points from unstructured text. This in turn can help us derive qualitative outputs, like the overall sentiment being on a positive, neutral, or negative scale, and quantitative outputs, like the sentiment polarity and the subjectivity and objectivity proportions.
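One simple way to turn such quantitative outputs into a qualitative label is to threshold a numeric polarity score. A toy sketch (the cutoff and label names are illustrative):

```python
def polarity_to_sentiment(score, threshold=0.0):
    """Map a numeric polarity score to a qualitative sentiment label."""
    if score > threshold:
        return 'positive'
    if score < -threshold:
        return 'negative'
    return 'neutral'

print([polarity_to_sentiment(s) for s in [0.8, -0.4, 0.0]])
```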
Sentiment polarity is typically a numeric score assigned to both the positive and negative aspects of a text document, based on subjective parameters like specific words and phrases expressing feelings and emotion. Neutral sentiment typically has 0 polarity, since it does not express any specific sentiment; positive sentiment has polarity > 0 and negative sentiment polarity < 0. Of course, you can always change these thresholds based on the type of text you are dealing with; there are no hard constraints on this.

In this chapter, we focus on trying to analyze a large corpus of movie reviews and derive the sentiment. We cover a wide variety of techniques for analyzing sentiment, including the following:

• Unsupervised lexicon-based models
• Traditional supervised Machine Learning models
• Newer supervised Deep Learning models
• Advanced supervised Deep Learning models

© Dipanjan Sarkar, Raghav Bali and Tushar Sharma 2018
D. Sarkar et al., Practical Machine Learning with Python, https://doi.org/10.1007/978-1-4842-3207-1_7

Besides looking at various approaches and models, we also focus on important aspects of the Machine Learning pipeline, including text pre-processing, normalization, and in-depth analysis of models, including model interpretation and topic models. The key idea here is to understand how we tackle a problem like sentiment analysis on unstructured text, learn various techniques and models, and understand how to interpret the results. This will enable you to use these methodologies in the future on your own datasets. Let's get started!

Problem Statement

The main objective in this chapter is to predict the sentiment for a number of movie reviews obtained from the Internet Movie Database (IMDb). This dataset contains 50,000 movie reviews that have been pre-labeled with "positive" and "negative" sentiment class labels based on the review content. Besides this, there are additional movie reviews that are unlabeled.
The dataset can be obtained from http://ai.stanford.edu/~amaas/data/sentiment/, courtesy of Stanford University and Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. This dataset was also used in their famous paper, "Learning Word Vectors for Sentiment Analysis," published in the proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011). They provide datasets in the form of raw text as well as an already processed bag of words format. We will only be using the raw labeled movie reviews for our analyses in this chapter. Hence our task will be to predict the sentiment of 15,000 labeled movie reviews and use the remaining 35,000 reviews for training our supervised models. We will still predict sentiment for only 15,000 reviews in the case of unsupervised models, to maintain consistency and enable ease of comparison.

Setting Up Dependencies

We will be using several Python libraries and frameworks specific to text analytics, NLP, and Machine Learning. While most of them will be mentioned in each section, you need to make sure you have pandas, numpy, scipy, and scikit-learn installed, which will be used for data processing and Machine Learning. Deep Learning frameworks used in this chapter include keras with the tensorflow backend, but you can also use theano as the backend if you choose to do so. NLP libraries which will be used include spacy, nltk, and gensim. Do remember to check that your installed nltk version is at least >= 3.2.4; otherwise the ToktokTokenizer class may not be present. If you want to use a lower nltk version for some reason, you can use any other tokenizer, like the default word_tokenize() based on the TreebankWordTokenizer. The version for gensim should be at least 2.3.0, and for spacy the version used was 1.9.0. We recommend using the latest version of spacy, which was recently released (version 2.x), as it has fixed several bugs and added several improvements.
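One quick way to verify that your installed versions meet these minimums is the following sketch, using only the standard library (package names are as published on PyPI):

```python
from importlib import metadata

# report installed versions of the NLP dependencies mentioned above
versions = {}
for pkg in ('nltk', 'gensim', 'spacy'):
    try:
        versions[pkg] = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        versions[pkg] = 'not installed'

print(versions)
```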
You also need to download the necessary dependencies and corpora for spacy and nltk in case you are installing them for the first time. The following snippets should get this done. For nltk, you need to type the following code from a Python or ipython shell after installing nltk using either pip or conda.

import nltk
nltk.download('all', halt_on_error=False)

For spacy, you need to type the following commands in a Unix shell/Windows command prompt to install the library (use pip install spacy if you don't want to use conda) and also get the English model dependency.

$ conda config --add channels conda-forge
$ conda install spacy
$ python -m spacy download en

We also use our custom developed text pre-processing and normalization module, which you will find in the files named contractions.py and text_normalizer.py. Utilities related to supervised model fitting, prediction, and evaluation are present in model_evaluation_utils.py, so make sure you have these modules in the same directory as the other Python files and jupyter notebooks for this chapter.

Getting the Data

The dataset will be available along with the code files for this chapter in the GitHub repository for this book at https://github.com/dipanjanS/practical-machine-learning-with-python, under the filename movie_reviews.csv, containing 50,000 labeled IMDb movie reviews. You can also download the same data from http://ai.stanford.edu/~amaas/data/sentiment/ if needed. Once you have the CSV file, you can easily load it in Python using the read_csv(...) utility function from pandas.

Text Pre-Processing and Normalization

One of the key steps before diving into the process of feature engineering and modeling involves cleaning, pre-processing, and normalizing text to bring text components like phrases and words to some standard format.
This enables standardization across a document corpus, which helps build meaningful features and reduces noise that can be introduced due to many factors like irrelevant symbols, special characters, XML and HTML tags, and so on. The file named text_normalizer.py has all the necessary utilities we will be using for our text normalization needs. You can also refer to the jupyter notebook named Text Normalization Demo.ipynb for a more interactive experience. The main components in our text normalization pipeline are described in this section.

• Cleaning text: Our text often contains unnecessary content like HTML tags, which do not add much value when analyzing sentiment. Hence we need to make sure we remove them before extracting features. The BeautifulSoup library does an excellent job of providing the necessary functions for this. Our strip_html_tags(...) function handles cleaning and stripping out HTML code.

• Removing accented characters: In our dataset, we are dealing with reviews in the English language, so we need to make sure that characters in any other format, especially accented characters, are converted and standardized into ASCII characters. A simple example would be converting é to e. Our remove_accented_chars(...) function helps us in this respect.

• Expanding contractions: In the English language, contractions are basically shortened versions of words or syllables. These shortened versions of existing words or phrases are created by removing specific letters and sounds; more often than not, vowels are removed. Examples are do not to don't and I would to I'd. Contractions pose a problem in text normalization because we have to deal with special characters like the apostrophe, and we also have to convert each contraction to its expanded, original form. Our expand_contractions(...) function uses regular expressions and the various contractions mapped in our contractions.py module to expand all contractions in our text corpus.
• Removing special characters: Another important task in text cleaning and normalization is to remove special characters and symbols that often add extra noise to unstructured text. Simple regexes can be used to achieve this. Our remove_special_characters(...) function helps us remove special characters. In our code we have retained numbers, but you can also remove numbers if you do not want them in your normalized corpus.

• Stemming and lemmatization: Word stems are usually the base forms of possible words; new words are created by attaching affixes like prefixes and suffixes to the stem, a process known as inflection. The reverse process of obtaining the base form of a word is known as stemming. A simple example is the words WATCHES, WATCHING, and WATCHED, which all have the root stem WATCH as their base form. The nltk package offers a wide range of stemmers, like the PorterStemmer and LancasterStemmer. Lemmatization is very similar to stemming, where we remove word affixes to get to the base form of a word. However, the base form in this case is known as the root word, not the root stem. The difference is that the root word is always a lexicographically correct word (present in the dictionary), while the root stem may not be. We will be using lemmatization only in our normalization pipeline, to retain lexicographically correct words. The lemmatize_text(...) function helps us with this aspect.

• Removing stopwords: Words which have little or no significance, especially when constructing meaningful features from text, are known as stopwords or stop words. These are usually the words that end up having the maximum frequency if you compute a simple term or word frequency over a document corpus. Words like a, an, the, and so on are considered stopwords. There is no universal stopword list, but we use a standard English language stopwords list from nltk.
You can also add your own domain specific stopwords if needed. The remove_stopwords(...) function helps us remove stopwords and retain the words having the most significance and context in a corpus.

We use all these components and tie them together in the following function called normalize_corpus(...), which takes a document corpus as input and returns the same corpus with cleaned and normalized text documents.

def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True,
                     text_lemmatization=True, special_char_removal=True,
                     stopword_removal=True):
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        # expand contractions
        if contraction_expansion:
            doc = expand_contractions(doc)
        # lowercase the text
        if text_lower_case:
            doc = doc.lower()
        # remove extra newlines
        doc = re.sub(r'[\r|\n|\r\n]+', ' ', doc)
        # insert spaces between special characters to isolate them
        special_char_pattern = re.compile(r'([{.(-)!}])')
        doc = special_char_pattern.sub(" \\1 ", doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # remove special characters
        if special_char_removal:
            doc = remove_special_characters(doc)
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
        normalized_corpus.append(doc)
    return normalized_corpus

The
following snippet depicts a small demo of text normalization on a sample document using our normalization module.

In [1]: from text_normalizer import normalize_corpus

In [2]: document = """<p>Héllo! Héllo! can you hear me! I just heard about Python!<br/>\r\n
   ...:               It's an amazing language which can be used for Scripting, Web development,\r\n\r\n
   ...:               Information Retrieval, Natural Language Processing, Machine Learning & Artificial Intelligence!\n
   ...:               What are you waiting for? Go and get started.<br/> He's learning, she's learning, they've already\n\n
   ...:               got a headstart!</p>
   ...:            """

In [3]: document
Out[3]: "<p>Héllo! Héllo! can you hear me! I just heard about Python!<br/>\r\n \n              It's an amazing language which can be used for Scripting, Web development,\r\n\r\n\n              Information Retrieval, Natural Language Processing, Machine Learning & Artificial Intelligence!\n\n              What are you waiting for? Go and get started.<br/> He's learning, she's learning, they've already\n\n\n              got a headstart!</p>\n           "

In [4]: normalize_corpus([document], text_lemmatization=False, stopword_removal=False,
                         text_lower_case=False)
Out[4]: ['Hello Hello can you hear me I just heard about Python It is an amazing language which can be used for Scripting Web development Information Retrieval Natural Language Processing Machine Learning Artificial Intelligence What are you waiting for Go and get started He is learning she is learning they have already got a headstart ']

In [5]: normalize_corpus([document])
Out[5]: ['hello hello hear hear python amazing language use scripting web development information retrieval natural language processing machine learning artificial intelligence wait go get start learn learn already get headstart']

Now that we have our normalization module ready, we can start modeling and analyzing our corpus. NLP and text analytics enthusiasts who might be interested in more in-depth details of text normalization can refer to the section "Text Normalization" in Chapter 3, page 115, of Text Analytics with Python (Apress; Dipanjan Sarkar, 2016).

Unsupervised Lexicon-Based Models

We have talked about unsupervised learning methods in the past, which refer to specific modeling methods that can be applied directly on data features without the presence of labeled data. One of the major challenges in any organization is getting labeled datasets, due to the lack of time as well as resources to do this tedious task. Unsupervised methods are very useful in this scenario, and we will be looking at some of these methods in this section. Even though we have labeled data, this section should give you a good idea of how lexicon based models work, and you can apply the same methods to your own datasets when you do not have labeled data.
Unsupervised sentiment analysis models use well-curated knowledgebases, ontologies, lexicons, and databases that have detailed information pertaining to subjective words and phrases, including sentiment, mood, polarity, objectivity, subjectivity, and so on. A lexicon model typically uses a lexicon, also known as a dictionary or vocabulary of words, specifically aligned toward sentiment analysis. Usually these lexicons contain a list of words associated with positive and negative sentiment, polarity (magnitude of the negative or positive score), parts of speech (POS) tags, subjectivity classifiers (strong, weak, neutral), mood, modality, and so on. You can use these lexicons to compute the sentiment of a text document by matching the presence of specific words from the lexicon, looking at additional factors like the presence of negation parameters, surrounding words, overall context, and phrases, and aggregating the overall sentiment polarity scores to decide the final sentiment score. There are several popular lexicon models used for sentiment analysis. Some of them are mentioned as follows.

• Bing Liu's Lexicon
• MPQA Subjectivity Lexicon
• Pattern Lexicon
• AFINN Lexicon
• SentiWordNet Lexicon
• VADER Lexicon

This is not an exhaustive list of lexicon models, but it definitely includes the most popular ones available today. We will be covering the last three lexicon models in more detail with hands-on code and examples using our movie review dataset. We will use the last 15,000 reviews, predict their sentiment, and see how well our models perform based on model evaluation metrics like accuracy, precision, recall, and F1-score, which we covered in detail in Chapter 5. Since we have labeled data, it will be easy for us to see how well the actual sentiment values for these movie reviews match our lexicon model-based predicted sentiment values.
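The matching-and-aggregation idea just described can be sketched in a few lines of Python. Note that the word scores, the negation list, and the simple polarity flip below are illustrative inventions for demonstration only; they are not taken from any of the lexicons listed above.

```python
# Toy lexicon-based scorer: the scores and negation handling here are
# made-up illustrations, not values from any real sentiment lexicon.
TOY_LEXICON = {'good': 2, 'great': 3, 'bad': -2, 'terrible': -3, 'boring': -2}
NEGATIONS = {'not', 'no', 'never'}

def toy_sentiment(text, threshold=0):
    tokens = text.lower().split()
    score = 0
    for i, token in enumerate(tokens):
        polarity = TOY_LEXICON.get(token, 0)
        # flip the polarity if the preceding token is a negation word
        if i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    return 'positive' if score >= threshold else 'negative', score

print(toy_sentiment('a great movie with a good story'))  # ('positive', 5)
print(toy_sentiment('not good and quite boring'))        # ('negative', -4)
```

Real lexicon models refine this core loop with carefully curated scores, wider context windows, intensity modifiers, and other heuristics, but the match-then-aggregate skeleton stays the same.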
You can refer to the Python file titled unsupervised_sentiment_analysis.py for all the code used in this section or use the jupyter notebook titled Sentiment Analysis Unsupervised Lexical.ipynb for a more interactive experience. Before we start our analysis, let's load the necessary dependencies and configuration settings using the following snippet.

In [1]: import pandas as pd
   ...: import numpy as np
   ...: import text_normalizer as tn
   ...: import model_evaluation_utils as meu
   ...:
   ...: np.set_printoptions(precision=2, linewidth=80)

Now, we can load our IMDb review dataset, subset out the last 15,000 reviews, which will be used for our analysis, and normalize them using the following snippet.

In [2]: dataset = pd.read_csv(r'movie_reviews.csv')
   ...: reviews = np.array(dataset['review'])
   ...: sentiments = np.array(dataset['sentiment'])
   ...:
   ...: # extract data for model evaluation
   ...: test_reviews = reviews[35000:]
   ...: test_sentiments = sentiments[35000:]
   ...: sample_review_ids = [7626, 3533, 13010]
   ...:
   ...: # normalize dataset
   ...: norm_test_reviews = tn.normalize_corpus(test_reviews)

We also extract some sample reviews so that we can run our models on them and interpret their results in detail.

Bing Liu's Lexicon

This lexicon contains over 6,800 words divided into two files: positive-words.txt, containing around 2,000+ words/phrases, and negative-words.txt, containing 4,800+ words/phrases. The lexicon has been developed and curated by Bing Liu over several years and is also explained in detail in the original paper by Nitin Jindal and Bing Liu, "Identifying Comparative Sentences in Text Documents", proceedings of the 29th Annual International ACM SIGIR, Seattle, 2006.
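A minimal sketch of how these two word lists can be used for scoring follows. The loader assumes a plain one-word-per-line file format with ';' comment lines (verify this against the actual download), and the tiny inline word sets are illustrative stand-ins for the real positive-words.txt and negative-words.txt contents.

```python
# Sketch: sentiment from two opinion word lists in the style of Bing Liu's
# lexicon. Assumption: each file holds one word per line, with header or
# comment lines starting with ';' -- check the downloaded files before use.
def load_word_list(path):
    with open(path, encoding='latin-1') as f:
        return {line.strip() for line in f
                if line.strip() and not line.startswith(';')}

def liu_sentiment(text, positive_words, negative_words):
    tokens = text.lower().split()
    # +1 for every positive-list hit, -1 for every negative-list hit
    score = sum((t in positive_words) - (t in negative_words) for t in tokens)
    return 'positive' if score >= 0 else 'negative'

# tiny illustrative subsets standing in for the real word lists
pos_words = {'amazing', 'good', 'wonderful'}
neg_words = {'stupid', 'worse', 'worst'}
print(liu_sentiment('stupid movie acting average or worse', pos_words, neg_words))
# negative
```

In practice you would call load_word_list(...) on the two downloaded files instead of using the inline sets, and could add negation handling as discussed earlier.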
If you want to use this lexicon, you can get it from https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon, which also includes a link to download it as an archive (RAR format).

MPQA Subjectivity Lexicon

The term MPQA stands for Multi-Perspective Question Answering, and the project contains a diverse set of resources pertaining to opinion corpora, a subjectivity lexicon, subjectivity sense annotations, an argument lexicon, debate corpora, an opinion finder, and much more. It is developed and maintained by the University of Pittsburgh, and their official web site http://mpqa.cs.pitt.edu/ contains all the necessary information. The subjectivity lexicon is a part of their opinion finder framework and contains subjectivity clues and contextual polarity. Details can be found in the paper by Theresa Wilson, Janyce Wiebe, and Paul Hoffmann, "Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis", proceedings of HLT-EMNLP-2005. You can download the subjectivity lexicon from their official web site at http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/; the subjectivity clues are present in the dataset named subjclueslen1HLTEMNLP05.tff. The following snippet shows some sample lines from the lexicon.

type=weaksubj len=1 word1=abandonment pos1=noun stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=abandon pos1=verb stemmed1=y priorpolarity=negative
...
...
type=strongsubj len=1 word1=zenith pos1=noun stemmed1=n priorpolarity=positive
type=strongsubj len=1 word1=zest pos1=noun stemmed1=n priorpolarity=positive

Each line consists of a specific word and its associated polarity, POS tag information, length (right now only words of length 1 are present), subjective context, and stem information.

Pattern Lexicon

The pattern package is a complete natural language processing framework available in Python which can be used for text processing, sentiment analysis, and more.
This has been developed by CLiPS (Computational Linguistics & Psycholinguistics), a research center associated with the Linguistics Department of the Faculty of Arts of the University of Antwerp. Pattern has its own sentiment module, which internally uses a lexicon that you can access from the official GitHub repository at https://github.com/clips/pattern/blob/master/pattern/text/en/en-sentiment.xml; this contains the complete subjectivity-based lexicon database. Each line in the lexicon carries important metadata information like WordNet corpus identifiers, polarity scores, word sense, POS tags, intensity, subjectivity scores, and so on. These can in turn be used to compute sentiment over a text document based on polarity and subjectivity scores. Unfortunately, pattern has still not been officially ported to Python 3.x; it works on Python 2.7.x. However, you can still load this lexicon and do your own modeling as needed.

AFINN Lexicon

The AFINN lexicon is perhaps one of the simplest and most popular lexicons that can be used extensively for sentiment analysis. Developed and curated by Finn Årup Nielsen, it is described in more detail in the paper by Finn Årup Nielsen, "A new ANEW: evaluation of a word list for sentiment analysis in microblogs", proceedings of the ESWC2011 Workshop. The current version of the lexicon is AFINN-en-165.txt and it contains over 3,300+ words, each with an associated polarity score. You can find this lexicon at the author's official GitHub repository, along with previous versions including AFINN-111, at https://github.com/fnielsen/afinn/blob/master/afinn/data/. The author has also created a nice wrapper library on top of this in Python called afinn, which we will be using for our analysis. You can import the library and instantiate an object using the following code.
In [3]: from afinn import Afinn
   ...:
   ...: afn = Afinn(emoticons=True)

We can now use this object to compute the polarity of our chosen sample reviews using the following snippet.

In [4]: for review, sentiment in zip(test_reviews[sample_review_ids], test_sentiments[sample_review_ids]):
   ...:     print('REVIEW:', review)
   ...:     print('Actual Sentiment:', sentiment)
   ...:     print('Predicted Sentiment polarity:', afn.score(review))
   ...:     print('-'*60)
REVIEW: no comment - stupid movie, acting average or worse... screenplay - no sense at all... SKIP IT!
Actual Sentiment: negative
Predicted Sentiment polarity: -7.0
------------------------------------------------------------
REVIEW: I don't care if some people voted this movie to be bad. If you want the Truth this is a Very Good Movie! It has every thing a movie should have. You really should Get this one.
Actual Sentiment: positive
Predicted Sentiment polarity: 3.0
------------------------------------------------------------
REVIEW: Worst horror film ever but funniest film ever rolled in one you have got to see this film it is so cheap it is unbelievable but you have to see it really!!!! P.S. Watch the carrot
Actual Sentiment: positive
Predicted Sentiment polarity: -3.0
------------------------------------------------------------

We can compare the actual sentiment label for each review and also check out the predicted sentiment polarity score. A negative polarity typically denotes negative sentiment. To predict sentiment on our complete test dataset of 15,000 reviews (I used the raw text documents because AFINN takes into account other aspects like emoticons and exclamations), we can now use the following snippet. I used a threshold of >= 1.0 to determine if the overall sentiment is positive, else negative. You can choose your own threshold based on analyzing your own corpora in the future.
In [5]: sentiment_polarity = [afn.score(review) for review in test_reviews]
   ...: predicted_sentiments = ['positive' if score >= 1.0 else 'negative'
   ...:                         for score in sentiment_polarity]

Now that we have our predicted sentiment labels, we can evaluate our model performance based on standard performance metrics using our utility function. See Figure 7-1.

In [6]: meu.display_model_performance_metrics(true_labels=test_sentiments,
   ...:                                       predicted_labels=predicted_sentiments,
   ...:                                       classes=['positive', 'negative'])

Figure 7-1. Model performance metrics for AFINN lexicon based model

We get an overall F1-score of 71%, which is quite decent considering it's an unsupervised model. Looking at the confusion matrix, we can clearly see that quite a number of negative sentiment reviews have been misclassified as positive (3,189), and this leads to the lower recall of 57% for the negative sentiment class. Performance for the positive class is better with regard to recall or hit rate, where we correctly predicted 6,376 out of 7,510 positive reviews, but precision is 67% because of the many wrong positive predictions made on negative sentiment reviews.

SentiWordNet Lexicon

The WordNet corpus is definitely one of the most popular corpora for the English language, used extensively in natural language processing and semantic analysis. WordNet gave us the concept of synsets, or synonym sets. The SentiWordNet lexicon is based on WordNet synsets and can be used for sentiment analysis and opinion mining. The SentiWordNet lexicon typically assigns three sentiment scores to each WordNet synset: a positive polarity score, a negative polarity score, and an objectivity score.
Further details are available on the official web site http://sentiwordnet.isti.cnr.it, including research papers and download links for the lexicon. We will be using the nltk library, which provides a Pythonic interface into SentiWordNet. Consider the adjective awesome. We can get the sentiment scores associated with its synset using the following snippet.

In [8]: from nltk.corpus import sentiwordnet as swn
   ...:
   ...: awesome = list(swn.senti_synsets('awesome', 'a'))[0]
   ...: print('Positive Polarity Score:', awesome.pos_score())
   ...: print('Negative Polarity Score:', awesome.neg_score())
   ...: print('Objective Score:', awesome.obj_score())
Positive Polarity Score: 0.875
Negative Polarity Score: 0.125
Objective Score: 0.0

Let's now build a generic function to extract and aggregate sentiment scores for a complete textual document based on the matched synsets in that document.

def analyze_sentiment_sentiwordnet_lexicon(review,
                                           verbose=False):
    # tokenize and POS tag text tokens
    tagged_text = [(token.text, token.tag_) for token in tn.nlp(review)]
    pos_score = neg_score = token_count = obj_score = 0
    # get wordnet synsets based on POS tags
    # get sentiment scores if synsets are found
    for word, tag in tagged_text:
        ss_set = None
        if 'NN' in tag and list(swn.senti_synsets(word, 'n')):
            ss_set = list(swn.senti_synsets(word, 'n'))[0]
        elif 'VB' in tag and list(swn.senti_synsets(word, 'v')):
            ss_set = list(swn.senti_synsets(word, 'v'))[0]
        elif 'JJ' in tag and list(swn.senti_synsets(word, 'a')):
            ss_set = list(swn.senti_synsets(word, 'a'))[0]
        elif 'RB' in tag and list(swn.senti_synsets(word, 'r')):
            ss_set = list(swn.senti_synsets(word, 'r'))[0]
        # if senti-synset is found
        if ss_set:
            # add scores for all found synsets
            pos_score += ss_set.pos_score()
            neg_score += ss_set.neg_score()
            obj_score += ss_set.obj_score()
            token_count += 1
    # aggregate final scores
    final_score = pos_score - neg_score
    norm_final_score = round(float(final_score) / token_count, 2)
    final_sentiment = 'positive' if norm_final_score >= 0 else 'negative'
    if verbose:
        norm_obj_score = round(float(obj_score) / token_count, 2)
        norm_pos_score = round(float(pos_score) / token_count, 2)
        norm_neg_score = round(float(neg_score) / token_count, 2)
        # to display results in a nice table
        sentiment_frame = pd.DataFrame([[final_sentiment, norm_obj_score, norm_pos_score,
                                         norm_neg_score, norm_final_score]],
                                       columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'],
                                                        ['Predicted Sentiment', 'Objectivity',
                                                         'Positive', 'Negative', 'Overall']],
                                                        labels=[[0,0,0,0,0],[0,1,2,3,4]]))
        print(sentiment_frame)
    return final_sentiment

Our function basically takes in a movie review, tags each word with its corresponding POS tag, extracts sentiment scores for any matched synset token based on its POS tag, and finally aggregates the scores. This will be clearer when we run it on our sample documents.

In [10]: for review, sentiment in zip(test_reviews[sample_review_ids], test_sentiments[sample_review_ids]):
    ...:     print('REVIEW:', review)
    ...:     print('Actual Sentiment:', sentiment)
    ...:     pred = analyze_sentiment_sentiwordnet_lexicon(review, verbose=True)
    ...:     print('-'*60)
REVIEW: no comment - stupid movie, acting average or worse... screenplay - no sense at all... SKIP IT!
Actual Sentiment: negative
     SENTIMENT STATS:
  Predicted Sentiment Objectivity Positive Negative Overall
0            negative        0.76     0.09     0.15   -0.06
------------------------------------------------------------
REVIEW: I don't care if some people voted this movie to be bad. If you want the Truth this is a Very Good Movie! It has every thing a movie should have. You really should Get this one.
Actual Sentiment: positive
     SENTIMENT STATS:
  Predicted Sentiment Objectivity Positive Negative Overall
0            positive        0.74      0.2     0.06    0.14
------------------------------------------------------------
REVIEW: Worst horror film ever but funniest film ever rolled in one you have got to see this film it is so cheap it is unbelievable but you have to see it really!!!! P.S. watch the carrot
Actual Sentiment: positive
     SENTIMENT STATS:
  Predicted Sentiment Objectivity Positive Negative Overall
0            positive         0.8     0.14     0.07    0.07
------------------------------------------------------------

We can clearly see the predicted sentiment along with the sentiment polarity scores and an objectivity score for each sample movie review, depicted in formatted dataframes. Let's use this model now to predict the sentiment of all our test reviews and evaluate its performance. A threshold of >= 0 has been used to classify the overall sentiment polarity as positive and < 0 as negative. See Figure 7-2.
In [11]: predicted_sentiments = [analyze_sentiment_sentiwordnet_lexicon(review, verbose=False)
    ...:                         for review in norm_test_reviews]
    ...: meu.display_model_performance_metrics(true_labels=test_sentiments,
    ...:                                       predicted_labels=predicted_sentiments,
    ...:                                       classes=['positive', 'negative'])

Figure 7-2. Model performance metrics for SentiWordNet lexicon based model

We get an overall F1-score of 68%, which is definitely a step down from our AFINN based model. While fewer negative sentiment reviews are misclassified as positive, the other aspects of the model's performance have been affected.

VADER Lexicon

The VADER lexicon, developed by C.J. Hutto, is based on a rule-based sentiment analysis framework specifically tuned to analyze sentiments in social media. VADER stands for Valence Aware Dictionary and Sentiment Reasoner. Details about this framework can be read in the original paper by Hutto, C.J. & Gilbert, E.E. (2014) titled "VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text", proceedings of the Eighth International Conference on Weblogs and Social Media (ICWSM-14). You can use the library through nltk's interface under the nltk.sentiment.vader module. Besides this, you can also download the actual lexicon or install the framework from https://github.com/cjhutto/vaderSentiment, which also contains detailed information about VADER. The lexicon, present in the file titled vader_lexicon.txt, contains the sentiment scores associated with words, emoticons, and slang (like wtf, lol, nah, and so on). There were a total of over 9,000 lexical features, from which over 7,500 curated lexical features with properly validated valence scores were finally selected for the lexicon.
Each feature was rated on a scale from "[-4] Extremely Negative" to "[4] Extremely Positive", with allowance for "[0] Neutral (or Neither, N/A)". Lexical features were kept if they had a non-zero mean rating and a standard deviation of less than 2.5, as determined by the aggregate of ten independent raters. A sample from the VADER lexicon follows.

:(     -1.9     1.13578 [-2, -3, -2, 0, -1, -1, -2, -3, -1, -4]
:)      2.0     1.18322 [2, 2, 1, 1, 1, 1, 4, 3, 4, 1]
...
terrorizing     -3.0    1.0     [-3, -1, -4, -4, -4, -3, -2, -3, -2, -4]
thankful         2.7    0.78102 [4, 2, 2, 3, 2, 4, 3, 3, 2, 2]

Each line in the preceding lexicon sample depicts a unique term, which can either be an emoticon or a word. The first token indicates the word/emoticon, the second token the mean sentiment polarity score, the third token the standard deviation, and the final token the list of scores given by ten independent raters. Now let's use VADER to analyze our movie reviews! We build our own modeling function as follows.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def analyze_sentiment_vader_lexicon(review,
                                    threshold=0.1,
                                    verbose=False):
    # pre-process text
    review = tn.strip_html_tags(review)
    review = tn.remove_accented_chars(review)
    review = tn.expand_contractions(review)
    # analyze the sentiment for review
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)
    # get aggregate scores and final sentiment
    agg_score = scores['compound']
    final_sentiment = 'positive' if agg_score >= threshold\
                                 else 'negative'
    if verbose:
        # display detailed sentiment statistics
        positive = str(round(scores['pos'], 2)*100)+'%'
        final = round(agg_score, 2)
        negative = str(round(scores['neg'], 2)*100)+'%'
        neutral = str(round(scores['neu'], 2)*100)+'%'
        sentiment_frame = pd.DataFrame([[final_sentiment, final, positive,
                                         negative, neutral]],
                                       columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'],
                                                    ['Predicted Sentiment', 'Polarity Score',
                                                     'Positive', 'Negative', 'Neutral']],
                                                    labels=[[0,0,0,0,0],[0,1,2,3,4]]))
        print(sentiment_frame)
    return final_sentiment

In our modeling function, we do some basic pre-processing but keep the punctuation and emoticons intact. Besides this, we use VADER to get the sentiment polarity as well as the proportions of the review text that are positive, neutral, and negative. We also predict the final sentiment based on a user-input threshold for the aggregated sentiment polarity. Typically, VADER recommends using positive sentiment
for aggregated polarity >= 0.5, neutral between [-0.5, 0.5], and negative for polarity < -0.5. We use a threshold of >= 0.4 for positive and < 0.4 for negative in our corpus. The following is the analysis of our sample reviews.

In [13]: for review, sentiment in zip(test_reviews[sample_review_ids], test_sentiments[sample_review_ids]):
    ...:     print('REVIEW:', review)
    ...:     print('Actual Sentiment:', sentiment)
    ...:     pred = analyze_sentiment_vader_lexicon(review, threshold=0.4, verbose=True)
    ...:     print('-'*60)
REVIEW: no comment - stupid movie, acting average or worse... screenplay - no sense at all... SKIP IT!
Actual Sentiment: negative
     SENTIMENT STATS:
  Predicted Sentiment Polarity Score Positive Negative Neutral
0            negative           -0.8     0.0%    40.0%   60.0%
------------------------------------------------------------
REVIEW: I don't care if some people voted this movie to be bad. If you want the Truth this is a Very Good Movie! It has every thing a movie should have. You really should Get this one.
Actual Sentiment: positive
     SENTIMENT STATS:
  Predicted Sentiment Polarity Score Positive Negative Neutral
0            negative          -0.16    16.0%    14.0%   69.0%
------------------------------------------------------------
REVIEW: Worst horror film ever but funniest film ever rolled in one you have got to see this film it is so cheap it is unbelievable but you have to see it really!!!! P.S.
Watch the carrot
Actual Sentiment: positive
     SENTIMENT STATS:
  Predicted Sentiment Polarity Score Positive Negative Neutral
0            positive           0.49    11.0%    11.0%   77.0%
------------------------------------------------------------

We can see the detailed statistics pertaining to the sentiment and polarity of each sample movie review. Let's try out our model on the complete test movie review corpus now and evaluate the model performance. See Figure 7-3.

In [14]: predicted_sentiments = [analyze_sentiment_vader_lexicon(review, threshold=0.4,
    ...:                                                         verbose=False) for review in test_reviews]
    ...: meu.display_model_performance_metrics(true_labels=test_sentiments,
    ...:                                       predicted_labels=predicted_sentiments,
    ...:                                       classes=['positive', 'negative'])

Figure 7-3. Model performance metrics for VADER lexicon based model

We get an overall F1-score and model accuracy of 71%, which is quite similar to the AFINN based model. The AFINN based model only wins out on average precision by 1%; otherwise, both models have a similar performance.

Classifying Sentiment with Supervised Learning

Another way to build a model that understands text content and predicts the sentiment of text-based reviews is to use supervised Machine Learning. To be more specific, we will be using classification models to solve this problem. We have already covered the concepts relevant to supervised learning and classification in Chapter 1, under the section "Supervised Learning". For details on building and evaluating classification models, you can head over to Chapter 5 and refresh your memory if needed. We will be building an automated sentiment text classification system in the subsequent sections. The major steps to achieve this are mentioned as follows.
1. Prepare train and test datasets (optionally a validation dataset)
2. Pre-process and normalize text documents
3. Feature engineering
4. Model training
5. Model prediction and evaluation

These are the major steps for building our system. Optionally, the last step would be to deploy the model on your server or on the cloud. Figure 7-4 shows a detailed workflow for building a standard text classification system with supervised learning (classification) models.

Figure 7-4. Blueprint for building an automated text classification system (Source: Text Analytics with Python, Apress 2016)

In our scenario, documents indicate the movie reviews and classes indicate the review sentiments, which can either be positive or negative, making this a binary classification problem. We will build models using both traditional Machine Learning methods and newer Deep Learning methods in the subsequent sections. You can refer to the Python file titled supervised_sentiment_analysis.py for all the code used in this section or use the jupyter notebook titled Sentiment Analysis - Supervised.ipynb for a more interactive experience. Let's load the necessary dependencies and settings before getting started.

In [1]: import pandas as pd
   ...: import numpy as np
   ...: import text_normalizer as tn
   ...: import model_evaluation_utils as meu
   ...:
   ...: np.set_printoptions(precision=2, linewidth=80)

We can now load our IMDb movie reviews dataset, use the first 35,000 reviews for training models, and use the remaining 15,000 reviews as the test dataset to evaluate model performance. Besides this, we will also use our normalization module to normalize our review datasets (Steps 1 and 2 in our workflow).
In [2]: dataset = pd.read_csv(r'movie_reviews.csv')
   ...:
   ...: # take a peek at the data
   ...: print(dataset.head())
   ...: reviews = np.array(dataset['review'])
   ...: sentiments = np.array(dataset['sentiment'])
   ...:
   ...: # build train and test datasets
   ...: train_reviews = reviews[:35000]
   ...: train_sentiments = sentiments[:35000]
   ...: test_reviews = reviews[35000:]
   ...: test_sentiments = sentiments[35000:]
   ...:
   ...: # normalize datasets
   ...: norm_train_reviews = tn.normalize_corpus(train_reviews)
   ...: norm_test_reviews = tn.normalize_corpus(test_reviews)
                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive

Our datasets are now prepared and normalized, so we can proceed from Step 3 of the text classification workflow described earlier to build our classification system.

Traditional Supervised Machine Learning Models

We will be using traditional classification models in this section to classify the sentiment of our movie reviews. Our feature engineering techniques (Step 3) will be based on the Bag of Words model and the TF-IDF model, which we discussed extensively in the section titled "Feature Engineering on Text Data" in Chapter 4. The following snippet helps us engineer features using both these models on our train and test datasets.

In [3]: from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
   ...:
   ...: # build BOW features on train reviews
   ...: cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0, ngram_range=(1,2))
   ...: cv_train_features = cv.fit_transform(norm_train_reviews)
   ...: # build TFIDF features on train reviews
   ...: tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0, ngram_range=(1,2),
   ...:                      sublinear_tf=True)
   ...: tv_train_features = tv.fit_transform(norm_train_reviews)
   ...:
   ...: # transform test reviews into features
   ...: cv_test_features = cv.transform(norm_test_reviews)
   ...: tv_test_features = tv.transform(norm_test_reviews)
   ...:
   ...: print('BOW model:> Train features shape:', cv_train_features.shape,
   ...:       ' Test features shape:', cv_test_features.shape)
   ...: print('TFIDF model:> Train features shape:', tv_train_features.shape,
   ...:       ' Test features shape:', tv_test_features.shape)
BOW model:> Train features shape: (35000, 2114021)  Test features shape: (15000, 2114021)
TFIDF model:>
Train features shape: (35000, 2114021)  Test features shape: (15000, 2114021)

We take into account single words as well as bi-grams for our feature sets. We can now use some traditional supervised Machine Learning algorithms which work very well on text classification. We recommend using logistic regression, support vector machines, and multinomial Naïve Bayes models when you work on your own datasets in the future. In this chapter, we build models using Logistic Regression as well as SVM. The following snippet helps initialize these classification model estimators.

In [4]: from sklearn.linear_model import SGDClassifier, LogisticRegression
   ...:
   ...: lr = LogisticRegression(penalty='l2', max_iter=100, C=1)
   ...: svm = SGDClassifier(loss='hinge', n_iter=100)

Without going into too many theoretical complexities, the logistic regression model is a supervised linear Machine Learning model used for classification, regardless of its name. In this model, we try to predict the probability that a given movie review belongs to one of the discrete classes (binary classes in our scenario). The functions used by the model for learning are represented here.

P(y = positive | X) = σ(θᵀX)
P(y = negative | X) = 1 − σ(θᵀX)

Here the model tries to predict the sentiment class using the feature vector X, and σ(z) = 1 / (1 + e⁻ᶻ) is popularly known as the sigmoid function, logistic function, or logit function. The main objective of this model is to search for an optimal value of θ such that the probability of the positive sentiment class is maximum when the feature vector X is for a positive movie review and small when it is for a negative movie review. The logistic function helps model the probability to describe the final prediction class. The optimal value of θ can be obtained by minimizing an appropriate cost/loss function using standard methods like
The optimal value of θ can be obtained by minimizing an appropriate cost/loss function using standard methods like gradient descent (refer to the section "The Three Stages of Logistic Regression" in Chapter 5 if you are interested in more details). Logistic regression is also popularly known as logit regression or the MaxEnt (maximum entropy) classifier.

We will now use our utility function train_predict_model(...) from our model_evaluation_utils module to build a logistic regression model on our training features and evaluate the model performance on our test features (Steps 4 and 5). See Figure 7-5.

In [5]: # Logistic Regression model on BOW features
   ...: lr_bow_predictions = meu.train_predict_model(classifier=lr,
   ...:                        train_features=cv_train_features, train_labels=train_sentiments,
   ...:                        test_features=cv_test_features, test_labels=test_sentiments)
   ...: meu.display_model_performance_metrics(true_labels=test_sentiments,
   ...:                                       predicted_labels=lr_bow_predictions,
   ...:                                       classes=['positive', 'negative'])

Figure 7-5. Model performance metrics for logistic regression on Bag of Words features

We get an overall F1-score and model accuracy of 91%, as depicted in Figure 7-5, which is really excellent! We can now build a logistic regression model similarly on our TF-IDF features using the following snippet. See Figure 7-6.
In [6]: # Logistic Regression model on TF-IDF features
   ...: lr_tfidf_predictions = meu.train_predict_model(classifier=lr,
   ...:                         train_features=tv_train_features, train_labels=train_sentiments,
   ...:                         test_features=tv_test_features, test_labels=test_sentiments)
   ...: meu.display_model_performance_metrics(true_labels=test_sentiments,
   ...:                                       predicted_labels=lr_tfidf_predictions,
   ...:                                       classes=['positive', 'negative'])

Figure 7-6. Model performance metrics for logistic regression on TF-IDF features

We get an overall F1-score and model accuracy of 90%, as depicted in Figure 7-6, which is great, but our previous model is still slightly better. You can similarly use the Support Vector Machine model estimator object svm, which we created earlier, with the same snippet to train and predict using an SVM model. We obtained a maximum accuracy and F1-score of 90% with the SVM model (refer to the jupyter notebook for step-by-step code snippets). Thus you can see how effective and accurate these supervised Machine Learning classification algorithms are in building a text sentiment classifier.

Newer Supervised Deep Learning Models

We have already mentioned multiple times in previous chapters how Deep Learning has revolutionized the Machine Learning landscape over the last decade. In this section, we will build some deep neural networks and train them on advanced text features based on word embeddings to build a text sentiment classification system, similar to what we did in the previous section. Let's load the following necessary dependencies before we start our analysis.
In [7]: import gensim
   ...: import keras
   ...: from keras.models import Sequential
   ...: from keras.layers import Dropout, Activation, Dense
   ...: from sklearn.preprocessing import LabelEncoder
Using TensorFlow backend.

If you remember, in Chapter 4 we talked about encoding categorical class labels and also the one-hot encoding scheme. So far, our models in scikit-learn directly accepted the sentiment class labels as positive and negative and internally performed these operations. However, for our Deep Learning models we need to do this explicitly. The following snippet helps us tokenize our movie reviews and also converts the text-based sentiment class labels into one-hot encoded vectors (forming a part of Step 2).

In [8]: le = LabelEncoder()
   ...: num_classes=2
   ...: # tokenize train reviews & encode train labels
   ...: tokenized_train = [tn.tokenizer.tokenize(text)
   ...:                    for text in norm_train_reviews]
   ...: y_tr = le.fit_transform(train_sentiments)
   ...: y_train = keras.utils.to_categorical(y_tr, num_classes)
   ...: # tokenize test reviews & encode test labels
   ...: tokenized_test = [tn.tokenizer.tokenize(text)
   ...:                   for text in norm_test_reviews]
   ...: y_ts = le.fit_transform(test_sentiments)
   ...: y_test = keras.utils.to_categorical(y_ts, num_classes)
   ...:
   ...: # print class label encoding map and encoded labels
   ...: print('Sentiment class label map:', dict(zip(le.classes_, le.transform(le.classes_))))
   ...: print('Sample test label transformation:\n'+'-'*35,
   ...:       '\nActual Labels:', test_sentiments[:3], '\nEncoded Labels:', y_ts[:3],
   ...:       '\nOne hot encoded Labels:\n', y_test[:3])

Sentiment class label map: {'positive': 1, 'negative': 0}
Sample test label transformation:
-----------------------------------
Actual Labels: ['negative' 'positive' 'negative']
Encoded Labels: [0 1 0]
One hot encoded Labels:
[[ 1.  0.]
 [ 0.
 1.]
 [ 1.  0.]]

Thus, we can see from the preceding sample outputs how our sentiment class labels have been encoded into numeric representations, which in turn have been converted into one-hot encoded vectors. The feature engineering techniques we will be using in this section (Step 3) are slightly more advanced word vectorization techniques based on the concept of word embeddings. We will be using the word2vec and GloVe models to generate embeddings. The word2vec model was built by Google and we have covered it in detail in Chapter 4 under the section "Word Embeddings". We will be choosing the size parameter to be 500 in this scenario, representing a feature vector size of 500 for each word.

In [9]: # build word2vec model
   ...: w2v_num_features = 500
   ...: w2v_model = gensim.models.Word2Vec(tokenized_train, size=w2v_num_features, window=150,
   ...:                                    min_count=10, sample=1e-3)

We will be using the document word vector averaging scheme on this model from Chapter 4 to represent each movie review as an averaged vector of all the word vector representations for the different words in the review. The following function helps us compute averaged word vector representations for any corpus of text documents.

def averaged_word2vec_vectorizer(corpus, model, num_features):
    vocabulary = set(model.wv.index2word)

    def average_word_vectors(words, model, vocabulary, num_features):
        feature_vector = np.zeros((num_features,), dtype="float64")
        nwords = 0.
        for word in words:
            if word in vocabulary:
                nwords = nwords + 1.
                feature_vector = np.add(feature_vector, model.wv[word])

        if nwords:
            feature_vector = np.divide(feature_vector, nwords)
        return feature_vector

    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                for tokenized_sentence in corpus]
    return np.array(features)

We can now use the previous function to generate averaged word vector representations on our two movie review datasets.

In [10]: # generate averaged word vector features from word2vec model
    ...: avg_wv_train_features = averaged_word2vec_vectorizer(corpus=tokenized_train,
    ...:                                                      model=w2v_model, num_features=500)
    ...: avg_wv_test_features = averaged_word2vec_vectorizer(corpus=tokenized_test,
    ...:                                                     model=w2v_model, num_features=500)

The GloVe model, which stands for Global Vectors, is an unsupervised model for obtaining word vector representations. Created at Stanford University, this model is trained on various corpora like Wikipedia, Common Crawl, and Twitter, and corresponding pre-trained word vectors are available that can be used for our analysis needs. You can refer to the original paper by Jeffrey Pennington, Richard Socher, and Christopher D. Manning, 2014, called GloVe: Global Vectors for Word Representation, for more details. The spacy library provides 300-dimensional word vectors trained on the Common Crawl corpus using the GloVe model. It offers a simple standard interface to get feature vectors of size 300 for each word, as well as the averaged feature vector of a complete text document. The following snippet leverages spacy to get the GloVe embeddings for our two datasets.
Do note that you can also build your own GloVe model by leveraging other pre-trained models or by building a model on your own corpus using the resources available at https://nlp.stanford.edu/projects/glove, which contains pre-trained word embeddings, code, and examples.

In [11]: # feature engineering with GloVe model
    ...: train_nlp = [tn.nlp(item) for item in norm_train_reviews]
    ...: train_glove_features = np.array([item.vector for item in train_nlp])
    ...: test_nlp = [tn.nlp(item) for item in norm_test_reviews]
    ...: test_glove_features = np.array([item.vector for item in test_nlp])

You can check the feature vector dimensions for our datasets based on each of the previous models using the following code.

In [12]: print('Word2Vec model:> Train features shape:', avg_wv_train_features.shape,
    ...:       ' Test features shape:', avg_wv_test_features.shape)
    ...: print('GloVe model:> Train features shape:', train_glove_features.shape,
    ...:       ' Test features shape:', test_glove_features.shape)
Word2Vec model:> Train features shape: (35000, 500)  Test features shape: (15000, 500)
GloVe model:> Train features shape: (35000, 300)  Test features shape: (15000, 300)

We can see from the preceding output that, as expected, the word2vec model features are of size 500 and the GloVe features are of size 300. We can now proceed to Step 4 of our classification system workflow, where we will build and train a deep neural network on these features. We have already briefly covered the various aspects and architectures of deep neural networks in Chapter 1 under the section "Deep Learning". We will be using a fully-connected four layer deep neural network (multi-layer perceptron or deep ANN) for our model.
We do not usually count the input layer in any deep architecture, hence our model will consist of three hidden layers of 512 neurons or units and one output layer with two units that will be used to predict a positive or negative sentiment based on the input layer features. Figure 7-7 depicts our deep neural network model for sentiment classification.

Figure 7-7. Fully connected deep neural network model for sentiment classification

We call this a fully connected deep neural network (DNN) because neurons or units in each pair of adjacent layers are fully pairwise connected. These networks are also known as deep artificial neural networks (ANNs) or Multi-Layer Perceptrons (MLPs) since they have more than one hidden layer. The following function leverages keras on top of tensorflow to build the desired DNN model.

def construct_deepnn_architecture(num_input_features):
    dnn_model = Sequential()
    dnn_model.add(Dense(512, activation='relu', input_shape=(num_input_features,)))
    dnn_model.add(Dropout(0.2))
    dnn_model.add(Dense(512, activation='relu'))
    dnn_model.add(Dropout(0.2))
    dnn_model.add(Dense(512, activation='relu'))
    dnn_model.add(Dropout(0.2))
    dnn_model.add(Dense(2))
    dnn_model.add(Activation('softmax'))

    dnn_model.compile(loss='categorical_crossentropy', optimizer='adam',
                      metrics=['accuracy'])
    return dnn_model

From the preceding function, you can see that we accept a parameter num_input_features, which decides the number of units needed in the input layer (500 for word2vec and 300 for GloVe features). We build a Sequential model, which helps us linearly stack our hidden and output layers. We use 512 units for all our hidden layers and the activation function relu, which indicates a rectified linear unit. This function is typically defined as relu(x) = max(0, x), where x is typically the input to a neuron.
This is also popularly known as the ramp function in electronics and electrical engineering. This function is now preferred over the previously popular sigmoid function because it mitigates the vanishing gradient problem: as the magnitude of the input to a sigmoid grows, its gradient becomes really small (almost vanishing), whereas relu maintains a constant gradient for x > 0 and prevents this from happening. Besides this, it also helps with faster convergence of gradient descent. We also use regularization in the network in the form of Dropout layers. By adding a dropout rate of 0.2, each input feature unit is randomly set to 0 with 20% probability at each update during training of the model. This form of regularization helps prevent overfitting the model.

The final output layer consists of two units with a softmax activation function. The softmax function is basically a generalization of the logistic function we saw earlier, which can be used to represent a probability distribution over n possible class outcomes. In our case n = 2, where the class can either be positive or negative, and the softmax probabilities will help us determine the same. The binary softmax classifier is also interchangeably known as the binary logistic regression function. The compile(...) method is used to configure the learning or training process of the DNN model before we actually train it. This involves providing a cost or loss function in the loss parameter; this will be the goal or objective that the model will try to minimize. There are various loss functions based on the type of problem you want to solve, for example mean squared error for regression and categorical crossentropy for classification. Check out https://keras.io/losses/ for a list of possible loss functions. We will be using categorical_crossentropy, which helps us minimize the error or loss from the softmax output.
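To make these two activation functions concrete, here is a minimal numpy sketch (an illustration of the math above, not the keras internals):

```python
import numpy as np

def relu(x):
    # ramp function: elementwise max(0, x)
    return np.maximum(0.0, x)

def softmax(z):
    # generalization of the logistic function over n class scores
    e = np.exp(z - np.max(z))  # shift scores for numerical stability
    return e / e.sum()

print(relu(np.array([-2.0, 0.0, 3.0])))  # negative inputs are clipped to 0
probs = softmax(np.array([2.0, 0.5]))    # two units, as in our output layer
print(probs, probs.sum())                # a valid probability distribution
```

The higher of the two softmax outputs decides the predicted sentiment class, which is exactly what our two-unit output layer does.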
We need an optimizer to help us converge our model and minimize the loss or error function. Gradient descent or stochastic gradient descent is a popular optimizer. We will be using the adam optimizer, which requires only first order gradients and very little memory. Adam also uses momentum, where each update is based not only on the gradient computation of the current point but also includes a fraction of the previous update. This helps with faster convergence. You can refer to the original paper at https://arxiv.org/pdf/1412.6980v8.pdf for further details on the ADAM optimizer. Finally, the metrics parameter is used to specify model performance metrics that are used to evaluate the model when training (but not used to modify the training loss itself). Let's now build a DNN model based on our word2vec input feature representations for our training reviews.

In [13]: w2v_dnn = construct_deepnn_architecture(num_input_features=500)

You can also visualize the DNN model architecture with the help of keras, similar to what we did in Chapter 4, by using the following code. See Figure 7-8.

In [14]: from IPython.display import SVG
    ...: from keras.utils.vis_utils import model_to_dot
    ...:
    ...: SVG(model_to_dot(w2v_dnn, show_shapes=True, show_layer_names=False,
    ...:                  rankdir='TB').create(prog='dot', format='svg'))

Figure 7-8. Visualizing the DNN model architecture using keras

We will now train our model on our training reviews dataset of word2vec features represented by avg_wv_train_features (Step 4). We will be using the fit(...) function from keras for the training process, and there are some parameters you should be aware of. An epoch indicates one complete forward and backward pass of all the training examples through the network, and the epochs parameter sets how many such passes to perform.
The batch_size parameter indicates the total number of samples propagated through the DNN model at a time for one forward and backward pass when training the model and updating the gradient. Thus, if you have 1,000 observations and your batch size is 100, each epoch will consist of 10 iterations where 100 observations are passed through the network at a time and the weights on the hidden layer units are updated. We also specify a validation_split of 0.1 to extract 10% of the training data and use it as a validation dataset for evaluating performance at each epoch. The shuffle parameter helps shuffle the samples in each epoch when training the model.

In [18]: batch_size = 100
    ...: w2v_dnn.fit(avg_wv_train_features, y_train, epochs=5, batch_size=batch_size,
    ...:             shuffle=True, validation_split=0.1, verbose=1)
Train on 31500 samples, validate on 3500 samples
Epoch 1/5
31500/31500 - 11s - loss: 0.3097 - acc: 0.8720 - val_loss: 0.3159 - val_acc: 0.8646
Epoch 2/5
31500/31500 - 11s - loss: 0.2869 - acc: 0.8819 - val_loss: 0.3024 - val_acc: 0.8743
Epoch 3/5
31500/31500 - 11s - loss: 0.2778 - acc: 0.8857 - val_loss: 0.3012 - val_acc: 0.8763
Epoch 4/5
31500/31500 - 11s - loss: 0.2708 - acc: 0.8901 - val_loss: 0.3041 - val_acc: 0.8734
Epoch 5/5
31500/31500 - 11s - loss: 0.2612 - acc: 0.8920 - val_loss: 0.3023 - val_acc: 0.8763

The preceding snippet tells us that we have trained our DNN model on the training data for five epochs with 100 as the batch size. We get a validation accuracy of close to 88%, which is quite good. Time now to put our model to the real test! Let's evaluate our model performance on the test review word2vec features (Step 5).
In [19]: y_pred = w2v_dnn.predict_classes(avg_wv_test_features)
    ...: predictions = le.inverse_transform(y_pred)
    ...: meu.display_model_performance_metrics(true_labels=test_sentiments,
    ...:             predicted_labels=predictions, classes=['positive', 'negative'])

Figure 7-9. Model performance metrics for deep neural networks on word2vec features

The results depicted in Figure 7-9 show us that we have obtained a model accuracy and F1-score of 88%, which is great! You can use a similar workflow to build and train a DNN model for our GloVe based features and evaluate the model performance. The following snippet depicts the workflow for Steps 4 and 5 of our text classification system blueprint.

# build DNN model
glove_dnn = construct_deepnn_architecture(num_input_features=300)
# train DNN model on GloVe training features
batch_size = 100
glove_dnn.fit(train_glove_features, y_train, epochs=5, batch_size=batch_size,
              shuffle=True, validation_split=0.1, verbose=1)
# get predictions on test reviews
y_pred = glove_dnn.predict_classes(test_glove_features)
predictions = le.inverse_transform(y_pred)
# Evaluate model performance
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=predictions,
                                      classes=['positive', 'negative'])

We obtained an overall model accuracy and F1-score of 85% with the GloVe features, which is still good but not better than what we obtained using our word2vec features. You can refer to the Sentiment Analysis Supervised.ipynb jupyter notebook to see the step-by-step outputs obtained for the previous code. This concludes our discussion on building text sentiment classification systems leveraging newer Deep Learning models and methodologies. Onwards to learning about advanced Deep Learning models!
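Before moving on, the epoch, batch size, and validation split arithmetic used in the training runs above can be sketched in a few lines (a standalone illustration using the figures quoted in this section):

```python
# training setup quoted above: 35,000 reviews, 10% validation split,
# batch size of 100, five epochs
n_train = 35000
validation_split = 0.1
batch_size = 100
epochs = 5

n_fit = int(n_train * (1 - validation_split))  # samples the model is fit on
n_val = n_train - n_fit                        # samples held out for validation
updates_per_epoch = n_fit // batch_size        # gradient updates in one epoch

print(n_fit, n_val, updates_per_epoch)         # 31500 3500 315
```

This matches the "Train on 31500 samples, validate on 3500 samples" line that keras printed during training.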
Advanced Supervised Deep Learning Models

We used fully connected deep neural networks and word embeddings in the previous section. Another new and interesting approach toward supervised Deep Learning is the use of recurrent neural networks (RNNs) and long short term memory networks (LSTMs), which also consider the sequence of data (words, events, and so on). These are more advanced models than your regular fully connected deep networks and usually take more time to train. We will leverage keras on top of tensorflow and try to build an LSTM-based classification model here, using word embeddings as our features. You can refer to the Python file titled sentiment_analysis_adv_deep_learning.py for all the code used in this section, or use the jupyter notebook titled Sentiment Analysis - Advanced Deep Learning.ipynb for a more interactive experience.

We will be working on our normalized and pre-processed train and test review datasets, norm_train_reviews and norm_test_reviews, which we created in our previous analyses. Assuming you have them loaded up, we will first tokenize these datasets such that each text review is decomposed into its corresponding tokens (workflow Step 2).

In [1]: tokenized_train = [tn.tokenizer.tokenize(text) for text in norm_train_reviews]
   ...: tokenized_test = [tn.tokenizer.tokenize(text) for text in norm_test_reviews]

For feature engineering (Step 3), we will be creating word embeddings. However, we will create them ourselves using keras instead of using pre-built ones like word2vec or GloVe, which we used earlier. Word embeddings tend to vectorize text documents into fixed sized vectors such that these vectors try to capture contextual and semantic information. For generating embeddings, we will use the Embedding layer from keras, which requires documents to be represented as tokenized and numeric vectors. We already have tokenized text vectors in our tokenized_train and tokenized_test variables.
However, we need to convert them into numeric representations. Besides this, we also need the vectors to be of uniform size, even though the tokenized text reviews will be of variable length due to the difference in the number of tokens in each review. For this, one strategy could be to take the length of the longest review (with the maximum number of tokens/words) and set it as the vector size; let's call this max_len. Reviews of shorter length can be padded with a PAD term at the beginning to increase their length to max_len.

We need to create a word to index vocabulary mapping for representing each tokenized text review in numeric form. Do note that you also need to create a numeric mapping for the padding term, which we shall call PAD_INDEX and assign the numeric index of 0. For unknown terms, in case they are encountered later on in the test dataset or newer, previously unseen reviews, we need to assign them to some index too. This is because we will vectorize, engineer features, and build models only on the training data. Hence, if some new term should come up in the future (which was originally not a part of the model training), we will consider it an out of vocabulary (OOV) term and assign it to a constant index (we will name this term NOT_FOUND_INDEX and assign it the index of vocab_size+1). The following snippet helps us create this vocabulary from our tokenized_train corpus of training text reviews.
In [2]: from collections import Counter
   ...:
   ...: # build word to index vocabulary
   ...: token_counter = Counter([token for review in tokenized_train for token in review])
   ...: vocab_map = {item[0]: index+1
   ...:              for index, item in enumerate(dict(token_counter).items())}
   ...: max_index = np.max(list(vocab_map.values()))
   ...: vocab_map['PAD_INDEX'] = 0
   ...: vocab_map['NOT_FOUND_INDEX'] = max_index+1
   ...: vocab_size = len(vocab_map)
   ...: # view vocabulary size and part of the vocabulary map
   ...: print('Vocabulary Size:', vocab_size)
   ...: print('Sample slice of vocabulary map:', dict(list(vocab_map.items())[10:20]))
Vocabulary Size: 82358
Sample slice of vocabulary map: {'martyrdom': 6, 'palmira': 7, 'servility': 8, 'gardening': 9, 'melodramatically': 73505, 'renfro': 41282, 'carlin': 41283, 'overtly': 41284, 'rend': 47891, 'anticlimactic': 51}

In this case we have used all the terms in our vocabulary; you can easily filter down to more relevant terms (based on their frequency) by using the most_common(count) function from Counter and taking the first count terms from the list of unique terms in the training corpus. We will now encode the tokenized text reviews based on the previous vocab_map. Besides this, we will also encode the text sentiment class labels into numeric representations.
In [3]: from keras.preprocessing import sequence
   ...: from sklearn.preprocessing import LabelEncoder
   ...:
   ...: # get max length of train corpus and initialize label encoder
   ...: le = LabelEncoder()
   ...: num_classes=2 # positive -> 1, negative -> 0
   ...: max_len = np.max([len(review) for review in tokenized_train])
   ...:
   ...: ## Train reviews data corpus
   ...: # Convert tokenized text reviews to numeric vectors
   ...: train_X = [[vocab_map[token] for token in tokenized_review]
   ...:            for tokenized_review in tokenized_train]
   ...: train_X = sequence.pad_sequences(train_X, maxlen=max_len) # pad
   ...: ## Train prediction class labels
   ...: # Convert text sentiment labels (negative/positive) to binary encodings (0/1)
   ...: train_y = le.fit_transform(train_sentiments)
   ...:
   ...: ## Test reviews data corpus
   ...: # Convert tokenized text reviews to numeric vectors
   ...: test_X = [[vocab_map[token] if vocab_map.get(token) else vocab_map['NOT_FOUND_INDEX']
   ...:            for token in tokenized_review]
   ...:           for tokenized_review in tokenized_test]
   ...: test_X = sequence.pad_sequences(test_X, maxlen=max_len)
   ...: ## Test prediction class labels
   ...: # Convert text sentiment labels (negative/positive) to binary encodings (0/1)
   ...: test_y = le.transform(test_sentiments)
   ...:
   ...: # view vector shapes
   ...: print('Max length of train review vectors:', max_len)
   ...: print('Train review vectors shape:', train_X.shape,
   ...:       ' Test review vectors shape:', test_X.shape)
Max length of train review vectors: 1442
Train review vectors shape: (35000, 1442)  Test review vectors shape: (15000, 1442)

From the preceding code snippet and the output, it is clear that we encoded each text review into a numeric sequence vector so that the size of each review vector is 1442, which is basically the maximum length of reviews from the
training dataset. We pad shorter reviews and truncate extra tokens from longer reviews such that the shape of each review is constant, as depicted in the output. We can now proceed with Step 3 and a part of Step 4 of the classification workflow by introducing the Embedding layer and coupling it with the deep network architecture based on LSTMs.

from keras.models import Sequential
from keras.layers import Dense, Embedding, Dropout, SpatialDropout1D
from keras.layers import LSTM

EMBEDDING_DIM = 128 # dimension for dense embeddings for each token
LSTM_DIM = 64 # total LSTM units

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=EMBEDDING_DIM, input_length=max_len))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(LSTM_DIM, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation="sigmoid"))

model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])

The Embedding layer helps us generate the word embeddings from scratch. This layer is initialized with some weights, and these get updated by our optimizer, similar to the weights on the neuron units in other layers, as the network tries to minimize the loss in each epoch. Thus, the embedding layer tries to optimize its weights such that we get the best word embeddings, which will generate minimum error in the model and also capture semantic similarity and relationships among words. How do we get the embeddings? Let's consider we have a review with 3 terms ['movie', 'was', 'good'] and a vocab_map consisting of word to index mappings for 82358 words. The word embeddings would be generated somewhat similar to Figure 7-10.

Figure 7-10.
Understanding how word embeddings are generated

Based on our model architecture, the Embedding layer takes in three parameters: input_dim, which is equal to the vocabulary size (vocab_size) of 82358; output_dim, which is 128, representing the dimension of each dense embedding (depicted by rows in the EMBEDDING LAYER in Figure 7-10); and input_length, which specifies the length of the input sequences (movie review sequence vectors), which is 1442. In the example depicted in Figure 7-10, since we have one review, the dimension is (1, 3). This review is converted into a numeric sequence [2, 57, 121] based on the VOCAB_MAP. Then the specific columns representing the indices in the review sequence are selected from the EMBEDDING LAYER (vectors at column indices 2, 57, and 121, respectively) to generate the final word embeddings. This gives us an embedding vector of dimension (1, 128, 3), also represented as (1, 3, 128) when each row holds the embedding vector of the corresponding sequence word. Many Deep Learning frameworks like keras represent the embedding dimensions as (m, n), where m represents all the unique terms in our vocabulary (82358) and n represents the output_dim, which is 128 in this case. Consider a transposed version of the layer depicted in Figure 7-10 and you are good to go! Usually, if you have the encoded review terms sequence vector represented in one-hot encoded format (3, 82358) and do a matrix multiplication with the EMBEDDING LAYER represented as (82358, 128), where each row represents the embedding for a word in the vocabulary, you will directly obtain the word embeddings for the review sequence vector as (3, 128). The weights in the embedding layer get updated and optimized in each epoch based on the input data when propagated through the whole network, as we mentioned earlier, such that overall loss and error is minimized to get maximum model performance. These dense word embeddings are then passed to the LSTM layer having 64 units.
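The one-hot times embedding-matrix view described above is easy to verify with toy numbers (a sketch with a hypothetical 5-word vocabulary and 4-dimensional embeddings, instead of 82358 and 128):

```python
import numpy as np

vocab_size, embed_dim = 5, 4
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, embed_dim))  # one row per vocabulary word

review = np.array([2, 0, 3])  # a toy numeric review sequence of 3 tokens

# direct lookup: pick the embedding row for each token index
looked_up = embedding[review]                 # shape (3, 4)

# equivalent view: one-hot encode the sequence, then matrix-multiply
one_hot = np.eye(vocab_size)[review]          # shape (3, 5)
multiplied = one_hot @ embedding              # shape (3, 4)

print(np.allclose(looked_up, multiplied))     # True: both give the same embeddings
```

This is why an Embedding layer is usually implemented as a plain index lookup: it gives the same result as the one-hot matrix multiplication at a fraction of the cost.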
We already introduced you to the LSTM architecture briefly in Chapter 1 in the subsection titled "Long Short Term Memory Networks" in the "Important Concepts" section under "Deep Learning". LSTMs basically try to overcome the shortcomings of RNN models, especially with regard to handling long term dependencies, and the problems that occur when the weight matrix associated with the units (neurons) becomes too small (leading to vanishing gradients) or too large (leading to exploding gradients). These architectures are more complex than regular deep networks, and going into the detailed internals and math concepts would be out of the current scope, but we will try to cover the essentials here without making it math heavy. If you're interested in researching the internals of LSTMs, check out the original paper that inspired it all: Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. We depict the basic architecture of RNNs and compare it with LSTMs in Figure 7-11.

Figure 7-11. Basic structure of RNN and LSTM units (Source: Christopher Olah's blog: colah.github.io)

RNN units usually have a chain of repeating modules (this happens when we unroll the loop; refer to Figure 1-13 in Chapter 1, where we talk about this) such that each module has a simple structure of maybe one layer with the tanh activation. LSTMs are also a special type of RNN with a similar structure, but the LSTM unit has four neural network layers instead of just one. The detailed architecture of the LSTM cell is shown in Figure 7-12.

Figure 7-12. Detailed architecture of an LSTM cell (Source: Christopher Olah's blog: colah.github.io)

The detailed architecture of an LSTM cell is depicted in Figure 7-12. The notation t indicates one time step, C depicts the cell states, and h indicates the hidden states.
The gates i, f, and o, together with the candidate values C̃, help in removing information from or adding information to the cell state. The gates i, f, and o represent the input, forget, and output gates, respectively, and each of them is modulated by a sigmoid layer that outputs numbers between 0 and 1, controlling how much of the output from these gates should pass through. This helps in protecting and controlling the cell state. The detailed workflow of how information flows through the LSTM cell is depicted in Figure 7-13 in four steps.

1. The first step involves the forget gate layer f, which helps us decide what information should be thrown away from the cell state. This is done by looking at the previous hidden state h_{t-1} and the current input x_t, giving f_t = σ(W_f · [h_{t-1}, x_t] + b_f). The sigmoid layer helps control how much of this should be kept or forgotten.

2. The second step involves the input gate layer i, which helps decide what information will be stored in the current cell state. The sigmoid layer in the input gate decides which values will be updated, i_t = σ(W_i · [h_{t-1}, x_t] + b_i), again based on h_{t-1} and x_t. A tanh layer creates a vector of new candidate values, C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C), which can be added to the current cell state. Thus the tanh layer creates the candidate values and the input gate's sigmoid layer helps choose which of those values should be updated.

3. The third step involves updating the old cell state C_{t-1} to the new cell state C_t by leveraging what we obtained in the first two steps. We multiply the old cell state by the forget gate (f_t × C_{t-1}) and then add the new candidate values scaled by the input gate's sigmoid layer (i_t × C̃_t), giving C_t = f_t × C_{t-1} + i_t × C̃_t.

4. The fourth and final step helps us decide the final output, which is basically a filtered version of the cell state. The output gate's sigmoid layer, o_t = σ(W_o · [h_{t-1}, x_t] + b_o), helps us select which parts of the cell state will pass to the final output. This is multiplied with the cell state values passed through a tanh layer, giving the final hidden state values h_t = o_t × tanh(C_t).
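The four steps above can be sketched directly in NumPy (a toy single-cell illustration with tiny made-up dimensions and our own variable names, not the Keras LSTM implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step following the four steps in the text.

    W is a dict of weight matrices and b a dict of bias vectors for the
    forget (f), input (i), candidate (C), and output (o) layers -- these
    names are our own shorthand, not from any particular library.
    """
    concat = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W['f'] @ concat + b['f'])       # step 1: forget gate
    i_t = sigmoid(W['i'] @ concat + b['i'])       # step 2: input gate
    C_tilde = np.tanh(W['C'] @ concat + b['C'])   # step 2: candidate values
    C_t = f_t * C_prev + i_t * C_tilde            # step 3: new cell state
    o_t = sigmoid(W['o'] @ concat + b['o'])       # step 4: output gate
    h_t = o_t * np.tanh(C_t)                      # step 4: new hidden state
    return h_t, C_t

# Toy dimensions for illustration only
input_dim, hidden_dim = 3, 2
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hidden_dim, hidden_dim + input_dim)) for k in 'fiCo'}
b = {k: np.zeros(hidden_dim) for k in 'fiCo'}
h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):         # run 5 time steps
    h, C = lstm_cell_step(x, h, C, W, b)
```

Because o_t lies in (0, 1) and tanh(C_t) in (-1, 1), the hidden state stays bounded at every step, which is part of what makes LSTMs easier to train over long sequences than plain RNNs.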
All these steps in this detailed workflow are depicted in Figure 7-13 with the necessary annotations and equations. We would like to thank our good friend Christopher Olah for providing us detailed information as well as the images depicting the internal workings of LSTM networks. We recommend checking out Christopher's blog at http://colah.github.io/posts/2015-08-Understanding-LSTMs for more details. A shout out also goes to Edwin Chen for explaining RNNs and LSTMs in an easy-to-understand format. We recommend referring to Edwin's blog at http://blog.echen.me/2017/05/30/exploring-lstms for more information on the workings of RNNs and LSTMs.

Figure 7-13. Walkthrough of data flow in an LSTM cell (Source: Christopher Olah's blog: colah.github.io)

The final layer in our deep network is the Dense layer with 1 unit and the sigmoid activation function. We use the binary_crossentropy loss function with the adam optimizer since this is a binary classification problem and the model will ultimately predict a 0 or a 1, which we can decode back to a negative or positive sentiment prediction with our label encoder. You could also use the categorical_crossentropy loss function here, but you would then need a Dense layer with 2 units and a softmax activation instead. Now that our model is compiled and ready, we can head on to Step 4 of our classification workflow: actually training the model. We use a strategy similar to our previous deep network models, training on the training data for five epochs with a batch size of 100 reviews and a 10% validation split of the training data to measure validation accuracy.
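To make the choice of loss concrete, this is essentially what binary_crossentropy computes (a minimal NumPy sketch with made-up example predictions, not output from the actual model):

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy: -[y*log(p) + (1-y)*log(1-p)]."""
    p = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([1, 0, 1, 1])          # 1 = positive, 0 = negative sentiment
y_pred = np.array([0.9, 0.2, 0.8, 0.6])  # illustrative sigmoid outputs from the final Dense layer
loss = binary_crossentropy(y_true, y_pred)
# confident, correct predictions drive the loss toward zero; the optimizer
# (adam) updates all network weights, embeddings included, to reduce this number
```

With two output units and one-hot labels, categorical_crossentropy reduces to the same quantity, which is why the two setups are interchangeable for binary problems.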
In [4]: batch_size = 100
   ...: model.fit(train_X, train_y, epochs=5, batch_size=batch_size,
   ...:           shuffle=True, validation_split=0.1, verbose=1)

Train on 31500 samples, validate on 3500 samples
Epoch 1/5
31500/31500 - 2491s - loss: 0.4081 - acc: 0.8184 - val_loss: 0.3006 - val_acc: 0.8751
Epoch 2/5
31500/31500 - 2489s - loss: 0.2253 - acc: 0.9158 - val_loss: 0.3209 - val_acc: 0.8780
Epoch 3/5
31500/31500 - 2656s - loss: 0.1431 - acc: 0.9493 - val_loss: 0.3483 - val_acc: 0.8671
Epoch 4/5
31500/31500 - 2604s - loss: 0.1023 - acc: 0.9658 - val_loss: 0.3803 - val_acc: 0.8729
Epoch 5/5
31500/31500 - 2701s - loss: 0.0694 - acc: 0.9761 - val_loss: 0.4430 - val_acc: 0.8706

Training LSTMs on a CPU is notoriously slow and, as you can see, my model took approximately 3.6 hours to train for just five epochs on an i5 3rd Gen Intel CPU with 8 GB of memory. Of course, on a GPU in a cloud-based environment like Google Cloud Platform or AWS, training the same model took less than an hour. So I would recommend you choose a GPU-based Deep Learning environment, especially when working with RNN or LSTM based network architectures. Based on the preceding output, we can see that with just five epochs we have decent validation accuracy, but the training accuracy keeps shooting up while validation accuracy stalls, indicating some over-fitting might be happening. Ways to overcome this include adding more data or increasing the dropout rate. Do give it a shot and see if it works! Time to put our model to the test! Let's see how well it predicts the sentiment for our test reviews, using the same model evaluation framework we have used for our previous models (Step 5).
In [5]: # predict sentiments on test data
   ...: pred_test = model.predict_classes(test_X)
   ...: predictions = le.inverse_transform(pred_test.flatten())
   ...: # evaluate model performance
   ...: meu.display_model_performance_metrics(true_labels=test_sentiments,
   ...:                   predicted_labels=predictions, classes=['positive', 'negative'])

Figure 7-14. Model performance metrics for LSTM based Deep Learning model on word embeddings

The results depicted in Figure 7-14 show us that we have obtained a model accuracy and F1-score of 88%, which is quite good! With more quality data, you can expect to get even better results. Try experimenting with different architectures and see if you get better results!

Analyzing Sentiment Causation

We built both supervised and unsupervised models to predict the sentiment of movie reviews based on the review text content. While feature engineering and modeling are definitely the need of the hour, you also need to know how to analyze and interpret the root causes behind your models' predictions. In this section, we analyze sentiment causation. The idea is to determine the root cause or key factors causing positive or negative sentiment. The first area of focus will be model interpretation, where we will try to understand, interpret, and explain the mechanics behind predictions made by our classification models. The second area of focus is to apply topic modeling and extract key topics from positive and negative sentiment reviews.

Interpreting Predictive Models

One of the challenges with Machine Learning models is the transition from a pilot or proof-of-concept phase to the production phase. Business and key stakeholders often perceive Machine Learning models as complex black boxes and pose the question, why should I trust your model? Explaining complex mathematical or theoretical concepts to them doesn't serve the purpose.
Is there some way in which we can explain these models in an easy-to-interpret manner? This topic has in fact gained extensive attention, starting around 2016. Refer to the original research paper by M.T. Ribeiro, S. Singh, and C. Guestrin titled “Why Should I Trust You?: Explaining the Predictions of Any Classifier” at https://arxiv.org/pdf/1602.04938.pdf to understand more about model interpretation and the LIME framework. Also check out Chapter 5, where we cover the skater framework in detail, which does an excellent job of interpreting various models. There are various ways to interpret the predictions made by our predictive sentiment classification models. We want to understand why a positive review was correctly predicted as having positive sentiment, or a negative review as having negative sentiment. Besides this, no model is 100% accurate all the time, so we would also want to understand the reasons for mis-classifications or wrong predictions. The code used in this section is available in the file named sentiment_causal_model_interpretation.py, or you can also refer to the jupyter notebook named Sentiment Causal Analysis - Model Interpretation.ipynb for an interactive experience. Let's first build a basic text classification pipeline for the model that has worked best for us so far: the Logistic Regression model based on the Bag of Words feature model. We will leverage the pipeline module from scikit-learn to build this Machine Learning pipeline using the following code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# build BOW features on train reviews
cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0, ngram_range=(1,2))
cv_train_features = cv.fit_transform(norm_train_reviews)
# build Logistic Regression model
lr = LogisticRegression()
lr.fit(cv_train_features, train_sentiments)

# Build Text Classification Pipeline
lr_pipeline = make_pipeline(cv, lr)

# save the list of prediction classes (positive, negative)
classes = list(lr_pipeline.classes_)

We build our model based on norm_train_reviews, which contains the normalized training reviews that we have used in all our earlier analyses. Now that we have our classification pipeline ready, you can actually deploy the model by using pickle or joblib to save the classifier and feature objects, similar to what we discussed in the “Model Deployment” section in Chapter 5. Assuming our pipeline is in production, how do we use it for new movie reviews? Let's try to predict the sentiment for two new sample reviews (which were not used in training the model).

In [3]: lr_pipeline.predict(['the lord of the rings is an excellent movie',
   ...:                      'i hated the recent movie on tv, it was so bad'])
Out[3]: array(['positive', 'negative'], dtype=object)

Our classification pipeline predicts the sentiment of both reviews correctly! This is a good start, but how do we interpret the model predictions? One way is to use the model's prediction class probabilities as a measure of confidence. You can use the following code to get the prediction probabilities for our sample reviews.
In [4]: pd.DataFrame(lr_pipeline.predict_proba(['the lord of the rings is an excellent movie',
   ...:                      'i hated the recent movie on tv, it was so bad']), columns=classes)
Out[4]:
   negative  positive
0  0.169653  0.830347
1  0.730814  0.269186

Thus we can say that the first movie review has a prediction confidence or probability of 83% of having positive sentiment, while the second movie review has a 73% probability of having negative sentiment. Let's now kick it up a notch: instead of playing around with toy examples, we will run the same analysis on actual reviews from the test_reviews dataset (we will use norm_test_reviews, which has the normalized text reviews). Besides prediction probabilities, we will be using the skater framework for easy interpretation of the model decisions, similar to what we did in Chapter 5 under the section “Model Interpretation”. You need to load the following dependencies from the skater package first. We also define a helper function that takes in a document index, a corpus, its response predictions, and an explainer object, and helps us with our model interpretation analysis.
from skater.core.local_interpretation.lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=classes)

# helper function for model interpretation
def interpret_classification_model_prediction(doc_index, norm_corpus, corpus,
                                              prediction_labels, explainer_obj):
    # display model prediction and actual sentiments
    print("Test document index: {index}\nActual sentiment: {actual}\nPredicted sentiment: {predicted}"
          .format(index=doc_index, actual=prediction_labels[doc_index],
                  predicted=lr_pipeline.predict([norm_corpus[doc_index]])))
    # display actual review content
    print("\nReview:", corpus[doc_index])
    # display prediction probabilities
    print("\nModel Prediction Probabilities:")
    for probs in zip(classes, lr_pipeline.predict_proba([norm_corpus[doc_index]])[0]):
        print(probs)
    # display model prediction interpretation
    exp = explainer_obj.explain_instance(norm_corpus[doc_index],
                                         lr_pipeline.predict_proba, num_features=10,
                                         labels=[1])
    exp.show_in_notebook()

The preceding snippet leverages skater to explain our text classifier and analyze its decision-making process in an easy-to-interpret form. Even though the model might be complex from a global perspective, it is easier to explain and approximate its behavior on local instances. This is done by learning a simple model around the vicinity of the data point of interest X, by sampling instances around X and assigning them weights based on their proximity to X. These locally learned linear models thus help explain complex models in a more interpretable way, with class probabilities and the contributions of top features to those class probabilities aiding the decision-making process.
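The sample-and-weight idea can be illustrated with a self-contained toy sketch (our own made-up black-box model and kernel, not the actual LIME/skater internals): perturb a bag-of-words instance, weight each perturbation by its proximity to the original, and fit a weighted linear surrogate whose coefficients act as local feature contributions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Pretend black-box classifier: probability of positive sentiment from a
# binary bag-of-words vector (the weights are invented for illustration)
true_w = np.array([2.0, -3.0, 0.5, 1.5, -0.2])
def black_box(X):
    return 1.0 / (1.0 + np.exp(-(X @ true_w)))

x = np.ones(5)                          # instance to explain: all 5 words present
Z = rng.integers(0, 2, size=(200, 5))   # perturbations: randomly drop words

# proximity weights: perturbations closer to x count more (exponential kernel)
dist = np.sum(np.abs(Z - x), axis=1)    # how many words were dropped
weights = np.exp(-(dist ** 2) / 2.0)

# fit a weighted linear surrogate model locally around x
sw = np.sqrt(weights)[:, None]
coef, *_ = np.linalg.lstsq(sw * Z, sw.ravel() * black_box(Z), rcond=None)
# coef ranks each word's local contribution; its signs typically track the
# black box's own weights, which is the explanation LIME-style tools display
```

In the real framework, the perturbations are generated by masking words in the text and the surrogate's top coefficients are what you see in outputs such as Figure 7-15.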
Let's take a movie review from our test dataset where both the actual and predicted sentiment is negative, and analyze it with the helper function we created in the preceding snippet.

In [6]: doc_index = 100
   ...: interpret_classification_model_prediction(doc_index=doc_index, norm_corpus=norm_test_reviews,
   ...:                                           corpus=test_reviews, prediction_labels=test_sentiments,
   ...:                                           explainer_obj=explainer)

Test document index: 100
Actual sentiment: negative
Predicted sentiment: ['negative']

Review: Worst movie, (with the best reviews given it) I've ever seen. Over the top dialog, acting, and direction. more slasher flick than thriller. With all the great reviews this movie got I'm appalled that it turned out so silly. shame on you Martin Scorsese

Model Prediction Probabilities:
('negative', 0.8099323456145181)
('positive', 0.19006765438548187)

Figure 7-15. Model interpretation for our classification model's correct prediction for a negative review

The results depicted in Figure 7-15 show us the class prediction probabilities and also the top 10 features that contributed the most to the prediction decision-making process. These key features are also highlighted in the normalized movie review text. Our model performs quite well in this scenario and we can see the key features that contributed to the negative sentiment of this review, including bad, silly, dialog, and shame, which make sense. Besides this, the word great contributed the most to the positive probability of 0.19; in fact, if we had removed this word from our review text, the positive probability would have dropped significantly. The following code runs a similar analysis on a test movie review where both the actual and predicted sentiment are positive.
In [7]: doc_index = 2000
   ...: interpret_classification_model_prediction(doc_index=doc_index, norm_corpus=norm_test_reviews,
   ...:                                           corpus=test_reviews, prediction_labels=test_sentiments,
   ...:                                           explainer_obj=explainer)

Test document index: 2000
Actual sentiment: positive
Predicted sentiment: ['positive']

Review: I really liked the Movie "JOE." It has really become a cult classic among certain age groups.

The Producer of this movie is a personal friend of mine. He is my Stepsons Father-In-Law. He lives in Manhattan's West side, and has a Bungalow. in Southampton, Long Island. His son-in-law live next door to his Bungalow.

Presently, he does not do any Producing, But dabbles in a business with HBO movies.

As a person, Mr. Gil is a real gentleman and I wish he would have continued in the production business of move making.

Model Prediction Probabilities:
('negative', 0.020629181561415355)
('positive', 0.97937081843858464)

Figure 7-16. Model interpretation for our classification model's correct prediction for a positive review

The results depicted in Figure 7-16 show the top features responsible for the model deciding to predict this review as positive. Based on the content, the reviewer really liked this movie, which has also become a real cult classic among certain age groups. In our final analysis, we will look at the model interpretation of an example where the model makes a wrong prediction.

In [8]: doc_index = 347
   ...: interpret_classification_model_prediction(doc_index=doc_index, norm_corpus=norm_test_reviews,
   ...:                                           corpus=test_reviews, prediction_labels=test_sentiments,
   ...:                                           explainer_obj=explainer)

Test document index: 347
Actual sentiment: negative
Predicted sentiment: ['positive']

Review: When I first saw this film in cinema 11 years ago, I loved it. I still think the directing and cinematography are excellent, as is the music. But it's really the script that has over the time started to bother me more and more. I find Emma Thompson's writing selfabsorbed and unfaithful to the original book; she has reduced Marianne to a side-character, a second fiddle to her much too old, much too severe Elinor - she in the movie is given many sort of 'focus moments', and often they appear to be there just to show off Thompson herself.

I do understand her cutting off several characters from the book, but leaving out the one scene where Willoughby in the book is redeemed? For someone who red and cherished the book long before the movie, those are the things always difficult to digest.

As for the actors, I love Kate Winslet as Marianne. She is not given the best script in the world to work with but she still pulls it up gracefully, without too much sentimentality. Alan Rickman is great, a bit old perhaps, but he plays the role beautifully. And Elizabeth Spriggs, she is absolutely fantastic as always.

Model Prediction Probabilities:
('negative', 0.067198213044844413)
('positive', 0.93280178695515559)

Figure 7-17. Model interpretation for our classification model's incorrect prediction

The preceding output tells us that our model predicted positive sentiment for this movie review when in fact the actual sentiment label is negative. The results depicted in Figure 7-17 tell us that the reviewer does show signs of positive sentiment in the movie review, especially in parts where they tell us that “I loved it. I still think the directing and cinematography are excellent, as is the music... Alan Rickman is great, a bit old perhaps, but he plays the role beautifully. And Elizabeth Spriggs, she is absolutely fantastic as always.”, and feature words from these parts appear among the top features contributing to positive sentiment. The model interpretation also correctly identifies the aspects of the review contributing to negative sentiment, like “But it's really the script that has over the time started to bother me more and more.”. Hence, this is one of the more complex reviews that indicates both positive and negative sentiment, and the final interpretation lies in the reader's hands. You can now use this same framework to interpret your own classification models in the future and understand where your model might be performing well and where it might need improvements!

Analyzing Topic Models

Another way of analyzing key terms, concepts, or topics responsible for sentiment is to use a different approach known as topic modeling.
We have already covered some of the basics of topic modeling in the section titled “Topic Models” under “Feature Engineering on Text Data” in Chapter 4. The main aim of topic models is to extract and depict key topics or concepts that are otherwise latent and not very prominent in huge corpora of text documents. We have already seen the use of Latent Dirichlet Allocation (LDA) for topic modeling in Chapter 4. In this section, we use another topic modeling technique called Non-Negative Matrix Factorization. Refer to the Python file named sentiment_causal_topic_models.py or the jupyter notebook titled Sentiment Causal Analysis - Topic Models.ipynb for a more interactive experience. The first step in this analysis is to combine all our normalized train and test reviews and separate these reviews into positive and negative sentiment reviews. Once we do this, we will extract features from these two datasets using the TF-IDF feature vectorizer. The following snippet helps us achieve this.

In [11]: from sklearn.feature_extraction.text import TfidfVectorizer
    ...:
    ...: # consolidate all normalized reviews
    ...: norm_reviews = norm_train_reviews + norm_test_reviews
    ...: # get tf-idf features for only positive reviews
    ...: positive_reviews = [review for review, sentiment in zip(norm_reviews, sentiments)
    ...:                         if sentiment == 'positive']
    ...: ptvf = TfidfVectorizer(use_idf=True, min_df=0.05, max_df=0.95,
    ...:                        ngram_range=(1,1), sublinear_tf=True)
    ...: ptvf_features = ptvf.fit_transform(positive_reviews)
    ...: # get tf-idf features for only negative reviews
    ...: negative_reviews = [review for review, sentiment in zip(norm_reviews, sentiments)
    ...:                         if sentiment == 'negative']
    ...: ntvf = TfidfVectorizer(use_idf=True, min_df=0.05, max_df=0.95,
    ...:                        ngram_range=(1,1), sublinear_tf=True)
    ...: ntvf_features = ntvf.fit_transform(negative_reviews)
    ...: # view feature set dimensions
    ...: print(ptvf_features.shape, ntvf_features.shape)
(25000, 331) (25000, 331)

From the preceding output dimensions, you can see that we have filtered out a lot of the features we used previously when building our classification models, by setting min_df to 0.05 and max_df to 0.95. This is to speed up the topic modeling process and remove features that occur either too frequently or too rarely. Let's now import the necessary dependencies for the topic modeling process.

In [12]: import pyLDAvis
    ...: import pyLDAvis.sklearn
    ...: from sklearn.decomposition import NMF
    ...: import topic_model_utils as tmu
    ...:
    ...: pyLDAvis.enable_notebook()
    ...: total_topics = 10

The NMF class from scikit-learn will help us with topic modeling. We also use pyLDAvis for building interactive visualizations of topic models. The core principle behind Non-Negative Matrix Factorization (NNMF) is to apply matrix decomposition (similar to SVD) to a non-negative feature matrix X such that the decomposition can be represented as X ≈ WH, where W and H are both non-negative matrices whose product approximately reconstructs the feature matrix X. A cost function like the L2 norm of the reconstruction error can be minimized to obtain this approximation. Let's now apply NNMF to get 10 topics from our positive sentiment reviews. We will also leverage some utility functions from our topic_model_utils module to display the results in a clean format.
In [13]: # build topic model on positive sentiment review features
    ...: pos_nmf = NMF(n_components=total_topics,
    ...:               random_state=42, alpha=0.1, l1_ratio=0.2)
    ...: pos_nmf.fit(ptvf_features)
    ...: # extract features and component weights
    ...: pos_feature_names = ptvf.get_feature_names()
    ...: pos_weights = pos_nmf.components_
    ...: # extract and display topics and their components
    ...: pos_topics = tmu.get_topics_terms_weights(pos_weights, pos_feature_names)
    ...: tmu.print_topics_udf(topics=pos_topics, total_topics=total_topics,
    ...:                      num_terms=15, display_weights=False)

Topic #1 without weights
['like', 'not', 'think', 'really', 'say', 'would', 'get', 'know', 'thing', 'much', 'bad', 'go', 'lot', 'could', 'even']
Topic #2 without weights
['movie', 'see', 'watch', 'great', 'good', 'one', 'not', 'time', 'ever', 'enjoy', 'recommend', 'make', 'acting', 'like', 'first']
Topic #3 without weights
['show', 'episode', 'series', 'tv', 'watch', 'dvd', 'first', 'see', 'time', 'one', 'good', 'year', 'remember', 'ever', 'would']
Topic #4 without weights
['performance', 'role', 'play', 'actor', 'cast', 'good', 'well', 'great', 'character', 'excellent', 'give', 'also', 'support', 'star', 'job']
...
Topic #10 without weights
['love', 'fall', 'song', 'wonderful', 'beautiful', 'music', 'heart', 'girl', 'would', 'watch', 'great', 'favorite', 'always', 'family', 'woman']

We depict some of the 10 topics generated in the preceding output. You can now leverage pyLDAvis to visualize these topics in an interactive visualization. See Figure 7-18.

In [14]: pyLDAvis.sklearn.prepare(pos_nmf, ptvf_features, ptvf, R=15)

Figure 7-18. Visualizing topic models on positive sentiment movie reviews

The visualization depicted in Figure 7-18 shows us the 10 topics from positive movie reviews, and we can see the top relevant terms for Topic 6 highlighted in the output.
From the topics and the terms, we can see that terms like cast, actors, performance, play, characters, music, wonderful, good, and so on have contributed toward positive sentiment in various topics. This is quite interesting and gives you good insight into the components of the reviews that contribute toward the positive sentiment of the reviews. This visualization is completely interactive if you are using the jupyter notebook: you can click on any of the bubbles representing topics in the Intertopic Distance Map on the left and see the most relevant terms for that topic in the bar chart on the right. The plot on the left is rendered using Multi-dimensional Scaling (MDS); similar topics should be close to one another and dissimilar topics should be far apart. The size of each topic bubble is based on the frequency of that topic and its components in the overall corpus. The visualization on the right shows the top terms. When no topic is selected, it shows the top 15 most salient terms in the corpus. A term's saliency is defined as a measure of how frequently the term appears in the corpus and how distinctive it is in distinguishing between topics. When a topic is selected, the chart changes to show the top 15 most relevant terms for that topic, as shown for Topic 6 in Figure 7-18. The relevance metric is controlled by λ, which can be changed using a slider on top of the bar chart (refer to the notebook to interact with this). If you're interested in the mathematical theory behind these visualizations, you are encouraged to check out more details at https://cran.r-project.org/web/packages/LDAvis/vignettes/details.pdf, which is a vignette for the R package LDAvis that has been ported to Python as pyLDAvis. Let's now extract topics and run this same analysis on our negative sentiment reviews from the movie reviews dataset.
In [15]: # build topic model on negative sentiment review features
    ...: neg_nmf = NMF(n_components=10,
    ...:               random_state=42, alpha=0.1, l1_ratio=0.2)
    ...: neg_nmf.fit(ntvf_features)
    ...: # extract features and component weights
    ...: neg_feature_names = ntvf.get_feature_names()
    ...: neg_weights = neg_nmf.components_
    ...: # extract and display topics and their components
    ...: neg_topics = tmu.get_topics_terms_weights(neg_weights, neg_feature_names)
    ...: tmu.print_topics_udf(topics=neg_topics,
    ...:                      total_topics=total_topics,
    ...:                      num_terms=15,
    ...:                      display_weights=False)

Topic #1 without weights
['get', 'go', 'kill', 'guy', 'scene', 'take', 'end', 'back', 'start', 'around', 'look', 'one', 'thing', 'come', 'first']
Topic #2 without weights
['bad', 'movie', 'ever', 'acting', 'see', 'terrible', 'one', 'plot', 'effect', 'awful', 'not', 'even', 'make', 'horrible', 'special']
...
Topic #10 without weights
['waste', 'time', 'money', 'watch', 'minute', 'hour', 'movie', 'spend', 'not', 'life', 'save', 'even', 'worth', 'back', 'crap']

In [16]: pyLDAvis.sklearn.prepare(neg_nmf, ntvf_features, ntvf, R=15)

Figure 7-19. Visualizing topic models on negative sentiment movie reviews

The visualization depicted in Figure 7-19 shows us the 10 topics from negative movie reviews, and we can see the top relevant terms for Topic 8 highlighted in the output. From the topics and the terms, we can see that terms like waste, time, money, crap, plot, terrible, acting, and so on have contributed toward negative sentiment in various topics. Of course, there is a high chance of overlap between topics from positive and negative sentiment reviews, but there will be distinguishable, distinct topics that further help us with interpretation and causal analysis.
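Before moving on, the X ≈ WH factorization at the heart of this section can be made concrete with a from-scratch sketch (a toy illustration on random data using the classic multiplicative update rules of Lee and Seung, not the solver scikit-learn's NMF uses internally):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((20, 12))   # stand-in for a (documents x terms) TF-IDF matrix
k = 3                      # number of topics

# random non-negative initialization
W = rng.random((20, k))    # document-topic weights
H = rng.random((k, 12))    # topic-term weights

eps = 1e-9
initial_error = np.linalg.norm(X - W @ H)
for _ in range(200):
    # multiplicative updates minimizing ||X - WH||^2; since they only ever
    # multiply non-negative numbers, W and H stay non-negative by construction
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)

error = np.linalg.norm(X - W @ H)
# each row of H plays the role of one topic: sorting it gives the
# top-terms-per-topic lists printed by print_topics_udf above
top_terms_topic0 = np.argsort(H[0])[::-1][:5]
```

The reconstruction error shrinks monotonically with these updates, and the non-negativity is exactly what makes the rows of H readable as additive "topics" rather than arbitrary signed components as in SVD.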
Summary

This case-study oriented chapter introduces the IMDb movie review dataset with the objective of predicting the sentiment of the reviews based on their textual content. We covered concepts and techniques from natural language processing (NLP), text analytics, Machine Learning, and Deep Learning in this chapter. We covered multiple aspects of NLP including text pre-processing, normalization, and feature engineering, as well as text classification. Unsupervised learning techniques using sentiment lexicons like Afinn, SentiWordNet, and VADER were covered in extensive detail, to show how we can analyze sentiment in the absence of labeled training data, which is a very real problem in today's organizations. Detailed workflow diagrams depicting text classification as a supervised Machine Learning problem helped us relate NLP to Machine Learning, so that we can use Machine Learning techniques and methodologies to solve the problem of predicting sentiment when labeled data is available. The focus on supervised methods was two-fold. This included traditional Machine Learning approaches and models like Logistic Regression and Support Vector Machines, and newer Deep Learning models including deep neural networks, RNNs, and LSTMs. Detailed concepts, workflows, hands-on examples, and comparative analyses with multiple supervised models and different feature engineering techniques were covered for the purpose of predicting sentiment from movie reviews with maximum model performance. The final section of this chapter covered a very important aspect of Machine Learning that is often neglected in our analyses. We looked at ways to analyze and interpret the causes of positive or negative sentiment. Analyzing and visualizing model interpretations and topic models was covered with several examples, to give you good insight into how you can reuse these frameworks on your own datasets.
The frameworks and methodologies used in this chapter should be useful for you in tackling similar problems on your own text data in the future.

CHAPTER 8

Customer Segmentation and Effective Cross Selling

Money makes the world go round, and in the current ecosystem of data intensive business practices, it is safe to claim that data also makes the world go round. A very important skill for data scientists is to match the technical aspects of analytics with their business value, i.e., their monetary value. This can be done in a variety of ways and is very much dependent on the type of business and the data available. In the earlier chapters, we covered problems that can be framed as business problems (leveraging the CRISP-DM model) and linked to revenue generation. In this chapter we will directly focus on two very important problems that can have a direct positive impact on the revenue streams of businesses and establishments, particularly in the retail domain. This chapter is also unique in that we address a different paradigm of Machine Learning algorithms altogether, focusing more on tasks pertaining to pattern recognition and unsupervised learning. In this chapter, we first digress from our usual technical focus and try to gather some business and domain knowledge. This knowledge is quite important, as it is often the stumbling block for many data scientists in scenarios where a perfectly developed Machine Learning solution is not productionalized due to a lack of focus on the actual value obtained from the solution based on business demands. A firm grasp of the underlying business (and monetary) motivation helps data scientists define the value aspect of their solutions, and hence ensures that they are deployed and contribute to the generation of realizable value for their employers.
To achieve this objective, we will start with a retail transactions dataset sourced from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/online+retail) and use this dataset to target two fairly simple but important problems. The detailed code, notebooks, and datasets used in this chapter are available in the respective directory for Chapter 8 in the GitHub repository at https://github.com/dipanjanS/practical-machine-learning-with-python.

• Customer segmentation: Customer segmentation is the problem of uncovering information about a firm's customer base, based on their interactions with the business. In most cases this interaction is in terms of their purchase behavior and patterns. We explore some of the ways in which this information can be used.

• Market basket analysis: Market basket analysis is a method to gain insights into the granular behavior of customers. It is helpful in devising strategies based on a deeper understanding of the purchase decisions taken by customers. This is interesting because often even the customers themselves are unaware of such biases or trends in their purchasing behavior.

Online Retail Transactions Dataset

The online retail transactions dataset is available from the UCI Machine Learning Repository. We have already used some datasets from this repository in earlier chapters, which should underline its importance to its users. The dataset we will be using for our analysis is quite simple. Based on its description on the UCI web site, it contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer.
From the web site, we also learn that the company sells unique all-occasion gift items and that many of its customers are wholesalers. This last piece of information is particularly important, as it gives us an opportunity to explore the purchase behaviors of large-scale customers instead of normal retail customers only. The dataset does not have any information that would help us distinguish between a wholesale purchase and a retail purchase. Before we get started, make sure you load the following dependencies:

import pandas as pd
import datetime
import math
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
%matplotlib inline

■ Note We encourage you to check out the UCI Machine Learning Repository and the page for this particular dataset at http://archive.ics.uci.edu/ml/datasets/online+retail. On the web site, you can find some research papers that use the same dataset. We believe the papers, along with the analysis performed in this chapter, will make an interesting read for all our readers.

Exploratory Data Analysis

We have always maintained that irrespective of the actual use case or the algorithm we intend to implement, the standard analysis workflow should always start with exploratory data analysis (EDA). So, following tradition, we will start with EDA on our dataset. The first thing you should notice about the dataset is its format: unlike most of the datasets we have handled in this book, it is not a CSV file and instead comes as an Excel file. In some other languages (or even frameworks) this could be a cause of problems, but with Python, and particularly pandas, we face no such problem and can read the dataset using the read_excel function provided by the pandas library. We also take a look at some of the lines in the dataset.
In [3]: cs_df = pd.read_excel(io=r'Online Retail.xlsx')

The first few lines of the dataset give us information about its attributes, as shown in Figure 8-1.

Figure 8-1. Sample transactions from the Retail Transactions dataset

The attributes of the dataset are easily identifiable from their names; we know right away what each of these fields might mean. For the sake of completeness, we include the description of each column here:

• InvoiceNo: A unique identifier for the invoice. An invoice number shared across rows means that those transactions were performed in a single invoice (multiple purchases).
• StockCode: Identifier for items contained in an invoice.
• Description: Textual description of each stock item.
• Quantity: The quantity of the item purchased.
• InvoiceDate: Date of purchase.
• UnitPrice: Value of each item.
• CustomerID: Identifier for the customer making the purchase.
• Country: Country of the customer.

Let's analyze this data and first determine the top countries the retailer is shipping its items to, along with the sales volumes for those countries.

In [5]: cs_df.Country.value_counts().reset_index().head(n=10)
Out[5]:
            index  Country
0  United Kingdom   495478
1         Germany     9495
2          France     8557
3            EIRE     8196
4           Spain     2533
5     Netherlands     2371
6         Belgium     2069
7     Switzerland     2002
8        Portugal     1519
9       Australia     1259

This shows that the bulk of the ordering takes place in the retailer's home country, which is not surprising. We also notice the odd country name EIRE, which is a little concerning at first, but a quick web search indicates that it is just an old name for Ireland, so no harm done! Interestingly, Australia is also in the top-ten list of sales by country.
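The value_counts pattern used above, which sorts categories by descending count, can be sketched on an invented country column (names and counts here are made up, not taken from the dataset):

```python
import pandas as pd

# Invented stand-in for cs_df.Country: most rows come from the home market
countries = pd.Series(["UK", "UK", "UK", "UK", "Germany", "Germany", "France"])

# value_counts returns categories sorted by descending frequency
counts = countries.value_counts()
print(counts.head(3))
```

In the chapter's call, reset_index() additionally turns this Series into a two-column dataframe for display.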
Next, we might be interested in how many unique customers the retailer has and how they stack up in the number of orders they make. We are also interested in what percentage of orders is made by the top customers of the retailer. This information is interesting, as it tells us whether the firm's user base is distributed relatively uniformly.

In [7]: cs_df.CustomerID.unique().shape
Out[7]: (4373,)

In [8]: (cs_df.CustomerID.value_counts()/sum(cs_df.CustomerID.value_counts())*100).head(n=13).cumsum()
Out[8]:
17841.0    1.962249
14911.0    3.413228
14096.0    4.673708
12748.0    5.814728
14606.0    6.498553
15311.0    7.110850
14646.0    7.623350
13089.0    8.079807
13263.0    8.492020
14298.0    8.895138
15039.0    9.265809
14156.0    9.614850
18118.0    9.930462
Name: CustomerID, dtype: float64

This tells us that we have 4,373 unique customers, but almost 10% of total sales are contributed by only 13 of them (based on the cumulative percentage aggregation in the preceding output). This is expected, given that we have both wholesale and retail customers. The next thing we want to determine is how many unique items the firm is selling, and whether we have an equal number of descriptions for them.

In [9]: cs_df.StockCode.unique().shape
Out[9]: (4070,)

In [10]: cs_df.Description.unique().shape
Out[10]: (4224,)

We have a mismatch between the number of StockCode and Description values: there are more item descriptions than stock codes, which means that some stock codes have multiple descriptions. Although this is not going to interfere with our analysis, we would like to dig a little deeper to find out what may have caused this issue, or what kind of duplicated descriptions are present in the data.
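The chained expression above (per-customer share of transactions followed by a running total) can be sketched on a tiny invented set of customer IDs:

```python
import pandas as pd

# Invented per-transaction customer IDs (stand-in for cs_df.CustomerID)
customer_ids = pd.Series([1, 1, 1, 1, 2, 2, 2, 3, 3, 4])

# Percentage share of all transactions contributed by each customer
counts = customer_ids.value_counts()
share_pct = counts / counts.sum() * 100

# Running total of the shares, largest customers first
cumulative_share = share_pct.cumsum()
print(cumulative_share)
```

The last entry of the running total is always 100%, and the head of the series shows how much of the business the top few customers account for.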
cat_des_df = cs_df.groupby(["StockCode","Description"]).count().reset_index()
cat_des_df.StockCode.value_counts()[cat_des_df.StockCode.value_counts()>1].reset_index().head()

   index  StockCode
0  20713          8
1  23084          7
2  85175          6
3  21830          6
4  21181          5

In [13]: cs_df[cs_df['StockCode']
    ...:     ==cat_des_df.StockCode.value_counts()[cat_des_df.StockCode.value_counts()>1].
    ...:     reset_index()['index'][6]]['Description'].unique()
Out[13]: array(['MISTLETOE HEART WREATH CREAM', 'MISTLETOE HEART WREATH WHITE',
       'MISTLETOE HEART WREATH CREAM', '?', 'had been put aside', nan], dtype=object)

This shows the multiple descriptions for one of those items, and we witness the simple ways in which data quality can be corrupted in any dataset: a simple spelling mistake can end up reducing data quality and producing an erroneous analysis. In an enterprise-level scenario, dedicated people work toward restoring data quality manually over time. Since the intent of this section is to focus on customer segmentation, we will skip this tedious activity for now. Let's now verify the sanity of the Quantity and UnitPrice attributes, as those are the attributes we will be using in our analysis.
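A compact way to surface stock codes carrying more than one description is groupby plus nunique. The stock codes below echo those in the output above, but the description strings are invented for illustration:

```python
import pandas as pd

# Toy item table: one stock code carries two spellings of its description
items = pd.DataFrame({
    "StockCode":   ["20713", "20713", "23084", "23084", "85175"],
    "Description": ["DOLLY GIRL BAG", "DOLLY GIRL BAG.",   # same code, two variants
                    "RABBIT NIGHT LIGHT", "RABBIT NIGHT LIGHT",
                    "CACTI T-LIGHT CANDLES"],
})

# Count distinct descriptions per stock code; anything above 1 is suspect
desc_counts = items.groupby("StockCode")["Description"].nunique()
inconsistent = desc_counts[desc_counts > 1]
print(inconsistent)
```

This flags exactly the codes worth inspecting manually, as we do for code 20713 above.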
In [14]: cs_df.Quantity.describe()
Out[14]:
count    541909.000000
mean          9.552250
std         218.081158
min      -80995.000000
25%           1.000000
50%           3.000000
75%          10.000000
max       80995.000000
Name: Quantity, dtype: float64

In [15]: cs_df.UnitPrice.describe()
Out[15]:
count    541909.000000
mean          4.611114
std          96.759853
min      -11062.060000
25%           1.250000
50%           2.080000
75%           4.130000
max       38970.000000
Name: UnitPrice, dtype: float64

We can observe from the preceding output that both of these attributes have negative values, which suggests that we may have some return transactions in our data as well. This scenario is quite common for any retailer, but we need to handle these before we proceed with our analysis. These are some of the data quality issues we found in our dataset. In the real world, datasets are generally messy and have considerable data quality issues, so it is always good practice to explicitly verify the information at hand before performing any kind of analysis. We encourage you to try to find similar issues in any dataset you might want to analyze in the future.

Customer Segmentation

Segmentation is the process of segregating any aggregated entity into separate parts, or segments. These parts may or may not share something in common. Customer segmentation, similarly, is the process of dividing an organization's customer base into different sections or segments based on various customer attributes. It is driven by the belief that customers are inherently different and that this difference is exemplified by their behavior. A deep understanding of an organization's customer base and its behavior is often the focus of any customer segmentation project. The process of customer segmentation is based on the premise of finding differences among customers' behavior and patterns.
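As a preview of the cleaning step performed later in the chapter, here is a hedged sketch of how rows with negative amounts (returns) can be screened out, on invented data:

```python
import pandas as pd

# Invented transactions; negative quantities represent returns
tx = pd.DataFrame({
    "Quantity":  [6, -2, 3, -1, 10],
    "UnitPrice": [2.55, 2.55, 1.25, 4.00, 0.85],
})
tx["amount"] = tx["Quantity"] * tx["UnitPrice"]

# Keep only buying transactions (non-negative total amount)
buys = tx[~(tx["amount"] < 0)]
print(len(buys))
```

The negated comparison mirrors the filter style used later on the real dataframe.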
These differences can be on the basis of their buying behavior, their demographic information, their geographic information, their psychographic attributes, and so on.

Objectives

Customer segmentation can help an organization in a multitude of ways. Before we describe the various ways it can be done, we want to enumerate the major objectives and benefits behind the motivation for customer segmentation.

Customer Understanding

One of the primary objectives of a customer segmentation process is a deeper understanding of a firm's customers and their attributes and behavior. These insights into the customer base can be used in different ways, as we will discuss shortly, but the information is useful by itself. One of the most widely accepted business paradigms is "know your customer," and segmentation of the customer base allows for a precise dissection of this paradigm. This understanding and its exploitation form the basis of the other benefits of customer segmentation.

Target Marketing

The most visible reason for customer segmentation is the ability to focus marketing efforts effectively and efficiently. If a firm knows the different segments of its customer base, it can devise better marketing campaigns that are tailor-made for each segment. Consider the example of a travel company: if it knows that the major segments of its customers are budget travelers and luxury travelers, it can run two separate marketing campaigns, one for each group. One can focus on the value aspects of the company's offerings relevant to budget deals, while the other deals with luxurious offerings. Although the example seems quite trivial, the same logic can be extended in a number of ways to arrive at better marketing practices. A good segmentation model allows for a better understanding of customer requirements and hence increases the chances of success for any marketing campaign developed by the organization.
Optimal Product Placement

A good customer segmentation strategy can also help the firm develop or offer new products. This benefit is highly dependent on the way the segmentation process is leveraged. Consider a very simple example in which an online retailer finds out that a major section of its customers buy certain makeup products together. This may prompt the retailer to bundle those products together as a combined offering, which may increase sales margins for the retailer and make the buying process more streamlined for the customer.

Finding Latent Customer Segments

A customer segmentation process helps the organization know its customer base. An obvious side effect of any such exercise is discovering which segments of customers it might be missing. This can help in identifying untapped customer segments through focused marketing campaigns or new product development.

Higher Revenue

This is the most obvious expected outcome of any customer segmentation project, since customer segmentation can lead to higher revenue through the combined effects of all the advantages identified in this section.

Strategies

The easy answer to the question "How do you do customer segmentation?" would be "In any way you deem fit," and it would be a perfectly acceptable answer. The reason this is the right answer lies in the original definition of customer segmentation: it is just a way of differentiating between customers. It can be as simple as manually grouping customers based on age groups or other attributes, or as complex as using a sophisticated algorithm for finding those segments in an automated fashion. Since our book is all about Machine Learning, we describe how customer segmentation can be translated into a core Machine Learning task. The detailed code for this section is available in the notebook titled Customer Segmentation.ipynb.
Clustering

We are dealing with an unlabeled transactional dataset from which we want to find customer segments, so the most obvious approach is to use unsupervised Machine Learning methods like clustering. This is the method we use for customer segmentation in this chapter. The approach is straightforward: collect as much data about the customers as possible in the form of features or attributes, find the different clusters that can be obtained from that data, and finally identify the traits of customer segments by analyzing the characteristics of the clusters.

Exploratory Data Analysis

Exploratory data analysis is another way of finding customer segments. This is usually done by analysts who have good knowledge of the domain relevant to both products and customers. It can be done flexibly to include the top decision points in an analysis. For example, finding out the range of customer spends can give rise to spend-based customer segments. We can proceed likewise on important attributes of customers until we get segments with interesting characteristics.

Clustering vs. Customer Segmentation

In our use case, we will use a clustering-based model to find interesting segments of customers. Before we go on to prepare our data for the model, there is an interesting point we would like to clarify. A lot of people think that clustering is equivalent to customer segmentation. Although it is true that clustering is one of the most suitable techniques for segmentation, it is not the only technique; besides, it is just a method that is "applied" to extract segments.
Customer segmentation is simply the task of segmenting customers. It can be solved in several ways, and it need not always involve a complex model. Clustering provides a mathematical framework that can be leveraged for finding such segment boundaries in the data. It is especially useful when we have many customer attributes on which to base the segments. Also, it is often observed that a clustering-based segmentation will be superior to an arbitrary segmentation process, and it will often encompass the segments that can be devised using such an arbitrary process.

Clustering Strategy

Now that we know what customer segmentation is, its various strategies, and how it can be useful, we can start the process of finding customer segments in our online retail dataset. The dataset consists only of the customers' sales transactions and no other information about them, i.e., no additional attributes. Usually, in larger organizations, we will have more customer attributes that can help with clustering; however, it will be interesting, and definitely a challenge, to work with this limited-attribute dataset! So we will use an RFM (Recency, Frequency, Monetary value) based model of customer value for finding our customer segments.

RFM Model for Customer Value

The RFM model is a popular model in marketing and customer segmentation for determining a customer's value. The RFM model takes the transactions of a customer and calculates three important informational attributes about each customer:

• Recency: How recently the customer purchased at the establishment
• Frequency: How frequently the customer transacts at the establishment
• Monetary value: The dollar (or pound, in our case) value of all the transactions the customer made at the establishment

A combination of these three values can be used to assign a value to the customer, and we can directly think of some desirable segments we would want from such a model.
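Before computing these attributes on the real data, the three RFM values can be sketched end to end on an invented two-customer transaction log (the dates, IDs, and amounts below are made up for illustration):

```python
import pandas as pd

# Hypothetical transaction log for two customers
tx = pd.DataFrame({
    "CustomerID":  [1, 1, 1, 2, 2],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-11-15", "2011-06-01", "2011-01-10", "2011-02-01"]),
    "amount":      [50.0, 30.0, 20.0, 5.0, 10.0],
})

# Reference date: one day after the latest transaction, as in the chapter
reference_date = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

# Recency: days since last purchase; Frequency: transaction count;
# Monetary value: total spend
rfm = tx.groupby("CustomerID").agg(
    recency=("InvoiceDate", lambda d: (reference_date - d.max()).days),
    frequency=("InvoiceDate", "count"),
    monetary=("amount", "sum"),
).reset_index()
print(rfm)
```

Customer 1 comes out recent, frequent, and high-spending, exactly the kind of high-value profile described above, while customer 2 lapsed months ago with a small total spend.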
For example, a high-value customer is one who buys frequently, bought something recently, and spends a high amount whenever they shop!

Data Cleaning

We hinted in the "Exploratory Data Analysis" section at the return transactions present in our dataset. Before proceeding with our analysis workflow, we will find all such transactions and remove them. Another possibility would be to also remove the matching buying transactions from the dataset, but we will assume that those transactions are still important and keep them intact. Another data cleaning operation is to restrict the transactions to a particular geographical region only, as we don't want data from, say, Germany's customers to affect the analysis for another country's customers. The following snippet of code achieves both these tasks. We focus on UK customers, who form by far the largest segment (based on country!).

In [16]: # Separate data for one geography
    ...: cs_df = cs_df[cs_df.Country == 'United Kingdom']
    ...:
    ...: # Separate attribute for total amount
    ...: cs_df['amount'] = cs_df.Quantity*cs_df.UnitPrice
    ...:
    ...: # Remove negative or return transactions
    ...: cs_df = cs_df[~(cs_df.amount<0)]
    ...: cs_df.head()
Out[16]:
  InvoiceNo StockCode                          Description  Quantity         InvoiceDate  UnitPrice  CustomerID         Country  amount
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6 2010-12-01 08:26:00       2.55     17850.0  United Kingdom   15.30
1    536365     71053                  WHITE METAL LANTERN         6 2010-12-01 08:26:00       3.39     17850.0  United Kingdom   20.34
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8 2010-12-01 08:26:00       2.75     17850.0  United Kingdom   22.00
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6 2010-12-01 08:26:00       3.39     17850.0  United Kingdom   20.34
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6 2010-12-01 08:26:00       3.39     17850.0  United Kingdom   20.34

The data now contains only buying transactions from the United Kingdom. Next, we remove all the transactions that have a missing value for the CustomerID field, as all our subsequent aggregations will be based on the customer entities.

In [17]: cs_df = cs_df[~(cs_df.CustomerID.isnull())]

In [18]: cs_df.shape
Out[18]: (354345, 9)

The next step is creating the recency, frequency, and monetary value features for each of the customers in our dataset.

Recency

To create the recency feature variable, we need to decide on a reference date for our analysis. For our use case, we define the reference date as one day after the last transaction in our dataset.

In [19]: reference_date = cs_df.InvoiceDate.max()
    ...: reference_date = reference_date + datetime.timedelta(days = 1)
    ...: reference_date
Out[19]: Timestamp('2011-12-10 12:49:00')

We construct the recency variable as the number of days before the reference date on which a customer last made a purchase. The following snippet of code creates this variable for us.

In [21]: cs_df['days_since_last_purchase'] = reference_date - cs_df.InvoiceDate
    ...: cs_df['days_since_last_purchase_num'] = cs_df['days_since_last_purchase'].astype('timedelta64[D]')

In [22]: customer_history_df = cs_df.groupby("CustomerID").min().reset_index()[['CustomerID', 'days_since_last_purchase_num']]
    ...: customer_history_df.rename(columns={'days_since_last_purchase_num':'recency'}, inplace=True)

Before we proceed, let's examine how the distribution of customer recency looks for our data (see Figure 8-2).

x = customer_history_df.recency
mu = np.mean(customer_history_df.recency)
sigma = math.sqrt(np.var(customer_history_df.recency))
n, bins, patches = plt.hist(x, 1000, facecolor='green', alpha=0.75)

# add a 'best fit' line
y = mlab.normpdf(bins, mu, sigma)
l = plt.plot(bins, y, 'r--', linewidth=2)
plt.xlabel('Recency in days')
plt.ylabel('Number of transactions')
plt.title(r'$\mathrm{Histogram\ of\ sales\ recency}\ $')
plt.grid(True)

Figure 8-2. Distribution of sales recency

The histogram in Figure 8-2 shows a skewed distribution of sales recency, with a much larger number of recent transactions and a fairly uniform number of less recent ones.

Frequency and Monetary Value

Using similar methods, we can create frequency and monetary value variables for our dataset. We will create these variables separately and then merge all the dataframes to arrive at the customer value dataset, on which we will perform our clustering-based customer segmentation. The following snippet creates both these variables and the final merged dataframe.
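The aggregation-and-merge pattern applied here (sum of amounts for monetary value, row counts for frequency, then an outer merge on the customer key) can be sketched on invented data like this:

```python
import pandas as pd

# Toy per-transaction data for three customers (IDs and amounts invented)
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "amount":     [10.0, 20.0, 5.0, 5.0, 5.0, 40.0],
})

# Monetary value: total spend per customer
monetary = tx.groupby("CustomerID")["amount"].sum().reset_index()

# Frequency: number of transactions per customer
freq = tx.groupby("CustomerID")["amount"].count().reset_index()
freq = freq.rename(columns={"amount": "frequency"})

# Outer merge keeps every customer even if one aggregate were missing
customer_df = monetary.merge(freq, on="CustomerID", how="outer")
print(customer_df)
```

The chapter's snippet follows the same shape, merging each aggregate into customer_history_df in turn.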
In [29]: customer_monetary_val = cs_df[['CustomerID', 'amount']].groupby("CustomerID").sum().reset_index()
    ...: customer_history_df = customer_history_df.merge(customer_monetary_val, how='outer')
    ...: customer_history_df.amount = customer_history_df.amount+0.001
    ...:
    ...: customer_freq = cs_df[['CustomerID', 'amount']].groupby("CustomerID").count().reset_index()
    ...: customer_freq.rename(columns={'amount':'frequency'},inplace=True)
    ...: customer_history_df = customer_history_df.merge(customer_freq, how='outer')

The input dataframe for clustering will look like the dataframe depicted in Figure 8-3. Notice that we have added a small figure, 0.001, to the customer monetary value, as we will be transforming our values to the log scale and the presence of zeroes in our data would lead to an error.

Figure 8-3. Customer value dataframe

Data Preprocessing

Once we have created our customer value dataframe, we will perform some preprocessing on the data. For our clustering, we will use the K-means clustering algorithm, which we discussed in the earlier chapters. One of the requirements for the proper functioning of the algorithm is the mean centering of the variable values. Mean centering a variable means replacing its actual values with standardized values, so that the variable has a mean of 0 and a variance of 1. This ensures that all the variables are in the same range, and that differences in their ranges of values don't cause the algorithm to perform poorly. This is akin to feature scaling. Another problem to consider is the huge range of values each variable can take, which is particularly noticeable for the monetary amount variable. To take care of this problem, we will transform all the variables to the log scale.
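A minimal numpy sketch of this log-then-standardize step, together with the exact reverse mapping we will need later to interpret results on the original scale (the amounts are invented):

```python
import numpy as np

# Invented per-customer monetary values spanning a huge range
amounts = np.array([5.0, 120.0, 950.0, 14000.0])

# Forward: the log compresses the range; z-scoring gives mean 0, variance 1
log_amounts = np.log(amounts)
mu, sigma = log_amounts.mean(), log_amounts.std()
scaled = (log_amounts - mu) / sigma

# Reverse: undo the z-score, then exponentiate back to the original scale
recovered = np.exp(scaled * sigma + mu)

print(scaled.mean(), scaled.std())
```

In the chapter, the forward step is delegated to scikit-learn's StandardScaler, whose inverse_transform plays the role of the reverse mapping here.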
This transformation, along with the standardization, will ensure that the input to our algorithm is a homogeneous set of scaled and transformed values. An important point about the data preprocessing step is that sometimes we need it to be reversible. In our case, we will have the clustering results in terms of the log-transformed and scaled variables, but to make inferences in terms of the original data, we will need to reverse-transform all the variables and get back the actual RFM figures. This can be done using the preprocessing capabilities of Python.

In [30]: from sklearn import preprocessing
    ...: import math
    ...: customer_history_df['recency_log'] = customer_history_df['recency'].apply(math.log)
    ...: customer_history_df['frequency_log'] = customer_history_df['frequency'].apply(math.log)
    ...: customer_history_df['amount_log'] = customer_history_df['amount'].apply(math.log)
    ...: feature_vector = ['amount_log', 'recency_log', 'frequency_log']
    ...: X = customer_history_df[feature_vector].as_matrix()
    ...: scaler = preprocessing.StandardScaler().fit(X)
    ...: X_scaled = scaler.transform(X)

The previous code snippet creates a log-valued and mean-centered version of our dataset. We can visualize the results of our preprocessing by inspecting the variable with the widest range of values. The following code snippet helps us visualize this.

In [31]: x = customer_history_df.amount_log
    ...: n, bins, patches = plt.hist(x, 1000, facecolor='green', alpha=0.75)
    ...: plt.xlabel('Log of Sales Amount')
    ...: plt.ylabel('Probability')
    ...: plt.title(r'$\mathrm{Histogram\ of\ Log\ transformed\ Customer\ Monetary\ value}\ $')
    ...: plt.grid(True)
    ...: plt.show()

The resulting graph is a distribution resembling a normal distribution, as clearly depicted in Figure 8-4.

Figure 8-4. Scaled and log transformed sales amount

Let's try to visualize our three main features (R, F, and M) on a three-dimensional plot to see whether we can spot any interesting patterns in the data distribution.

from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
xs = customer_history_df.recency_log
ys = customer_history_df.frequency_log
zs = customer_history_df.amount_log
ax.scatter(xs, ys, zs, s=5)
ax.set_xlabel('Recency')
ax.set_ylabel('Frequency')
ax.set_zlabel('Monetary')

Figure 8-5. Recency, frequency, and monetary value on a three-dimensional plot

The obvious pattern we can see in the plot in Figure 8-5 is that people who buy with higher frequency and higher recency tend to spend more: monetary value shows an increasing trend as frequency increases and recency decreases. Do you notice any other interesting patterns?

Clustering for Segments

We will use the K-means clustering algorithm to find clusters (or segments) in our data. It is one of the simplest clustering algorithms we can employ and hence is widely used in practice. We give you a brief primer on the algorithm before using it to find segments in our data.

K-Means Clustering

K-means clustering belongs to the partition-based (centroid-based) family of clustering algorithms. The algorithm partitions the data with the following steps:

1. The algorithm starts with random initializations of the required number of centers (the "K" in K-means stands for the number of clusters).
2. Each data point is assigned to the center closest to it. The distance metric used in K-means clustering is the standard Euclidean distance.
3. Once the data points are assigned, the centers are recalculated by averaging the dimensions of the points belonging to each cluster.
4. The process is repeated with the new centers until the assignments become stable, at which point the algorithm terminates.

We will adapt the code given in the scikit-learn documentation at http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html and use the silhouette score for finding the optimal number of clusters during our clustering process. We leave it as an exercise for you to adapt the code and create the visualization in Figure 8-6. You are encouraged to modify the documentation's code not only to build the visualization, but also to capture the centers and the silhouette score of each cluster configuration in a dictionary, as we will need to refer to those when analyzing the customer segments obtained. Of course, in case you find it overwhelming, you can always refer to the detailed code snippet in the Customer Segmentation.ipynb notebook.

Figure 8-6. Silhouette analysis with three and five clusters

In the visualization depicted in Figure 8-6, we plot the silhouette score of each cluster along with the center of each cluster discovered. We will use this information in the next section on cluster analysis, keeping in mind that in several cases and scenarios, we may have to set aside the mathematical explanation given by the algorithm and look at the business relevance of the results obtained.

Cluster Analysis

Before we proceed to the analysis of the clusters obtained, let's look at the cluster center values after transforming them back to normal values from the log-and-scaled version. The following code helps us convert the center values back to the original scale.
In [38]: for i in range(3,6,2):
    ...:     print("for {} number of clusters".format(i))
    ...:     cent_transformed = scaler.inverse_transform(cluster_centers[i]['cluster_center'])
    ...:     print(pd.DataFrame(np.exp(cent_transformed),columns=feature_vector))
    ...:     print("Silhouette score for cluster {} is {}".format(i,
    ...:                                     cluster_centers[i]['silhouette_score']))
    ...:     print()

for 3 number of clusters
    amount_log  recency_log  frequency_log
0   843.937271    44.083222      53.920633
1   221.236034   121.766072      10.668661
2  3159.294272     7.196647     177.789098
Silhouette score for cluster 3 is 0.30437444714898737

for 5 number of clusters
    amount_log  recency_log  frequency_log
0  3905.544371     5.627973     214.465989
1  1502.519606    46.880212      92.306262
2   142.867249   126.546751       5.147370
3   408.235418   139.056216      25.530424
4   464.371885    13.386419      29.581298
Silhouette score for cluster 5 is 0.27958641427323727

When we look at the results of the clustering process, we can infer some interesting insights. Consider the three-cluster configuration:

• We get three clusters with stark differences in the monetary value of the customers.
• Cluster 2 is the cluster of high-value customers who shop frequently, certainly an important segment for any business.
• Similarly, we obtain customer groups with low and medium spends in the clusters labeled 1 and 0, respectively.
• Frequency and Recency correlate with Monetary value following the trend we observed in Figure 8-5 (high Monetary value, low Recency, high Frequency).

The five-cluster configuration results are more surprising!
When we go looking for more segments, we find that our high-value customer base comprises two subgroups:

• Those who shop often and spend high amounts (represented by cluster 0).
• Those who have a decent spend but are not as frequent (represented by cluster 1).

This is in direct conflict with the result we obtain from the silhouette score metric, which says the five-cluster segmentation is less optimal than the three-cluster segmentation. Of course, remember that you must not strictly go after mathematical metrics all the time; think about the business aspects too. Besides this, there could be more insights uncovered as you visualize the data based on these segments, which might prove that in fact the three-cluster segmentation was far better. For instance, if you check the right-side plot in Figure 8-6, you can see that the segments with five clusters have too much overlap among them, as compared to the segments with three clusters.

Cluster Descriptions

By eyeballing the cluster centers, we can figure out that we have a good difference in customer value across the segments as defined in terms of recency, amount, and frequency. To further drill down on this point and find out the quality of these differences, we can label our data with the corresponding cluster label and then visualize those differences. We will do this visualization for probably one of the most important customer value indicators, the total dollar value of sales. To arrive at such distinction-based summary computations, we will first label each data row in our customer summary dataframe with the corresponding label as returned by our clustering algorithm. Note that you have to modify the code if you want to try a different configuration of, let's say, two or four clusters. We will have to make changes so that we capture the labels for each different cluster configuration.
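One minimal way to capture the centers, labels, and silhouette scores of each cluster configuration in a dictionary, as the text suggests, is sketched below. The feature matrix `X` here is a hypothetical toy stand-in for the scaled RFM features, not the notebook's actual data, but the dictionary structure mirrors the `cluster_centers[i]['cluster_center']` lookups used in the analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical toy stand-in for the log-transformed, scaled RFM feature matrix
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 3)) for loc in (0, 3, 6)])

# Capture centers, labels, and silhouette score for each configuration
cluster_centers = {}
for n_clusters in (3, 5):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X)
    cluster_centers[n_clusters] = {
        'cluster_center': km.cluster_centers_,
        'labels': km.labels_,
        'silhouette_score': silhouette_score(X, km.labels_),
    }

print({k: round(v['silhouette_score'], 3) for k, v in cluster_centers.items()})
```

With three well-separated toy blobs, the three-cluster configuration scores higher than the five-cluster one, which is the kind of comparison the chapter makes on the real data.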
We encourage you to try other cluster configurations to see if you get even better segments! The following code will extract the clustering labels and attach them to our customer summary dataframe.

labels = cluster_centers[5]['labels']
customer_history_df['num_cluster5_labels'] = labels
labels = cluster_centers[3]['labels']
customer_history_df['num_cluster3_labels'] = labels

Once we have the labels assigned to each of the customers, our task is simple. Now we want to find out how the summary of customers in each group varies. If we can visualize that information, we will be able to find out the differences among the clusters of customers, and we can modify our strategy on the basis of those differences. We have used a lot of matplotlib and seaborn so far; we will be using plotly in this section for creating some interactive plots that you can play around with in your jupyter notebook!

■■Note  While plotly provides excellent interactive visualizations in jupyter notebooks, you might come across some notebook rendering issues that come up as popups when you open the notebook. To fix this problem, upgrade your nbformat library by running the conda update nbformat command and re-open the notebook. The problem should disappear.

The following code leverages plotly and will take the cluster labels we got for the configuration of five clusters and create boxplots that show how the median, minimum, and maximum values vary across the five groups. Note that we want to avoid the extremely high outlier values of each group, as they will interfere with making a good observation (due to noise) around the central tendencies of each cluster. So we will restrict the data such that only data points below the 80th percentile of the cluster are used. This will give us good information about the majority of the users in that cluster segment. The following code will help us create this plot for the total sales value.
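The percentile-based trimming described above can be sketched in isolation. The array below is a hypothetical stand-in for one cluster's `amount` values; the pattern matches the `cutoff_quantile` filtering used in the plotting code.

```python
import numpy as np

cutoff_quantile = 80
# Hypothetical 'amount' values for one cluster, with one extreme outlier
y0 = np.array([10., 20., 30., 40., 50., 60., 70., 80., 90., 5000.])

# Keep only values below the cluster's 80th percentile to tame outliers
y0_trimmed = y0[y0 < np.percentile(y0, cutoff_quantile)]
print(y0_trimmed)
```

The 5000.0 outlier (and the top of the distribution) is dropped, so the boxplot reflects the central tendency of the majority of customers in the segment.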
import plotly as py
import plotly.graph_objs as go
py.offline.init_notebook_mode()

x_data = ['Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4', 'Cluster 5']
cutoff_quantile = 80
field_to_plot = 'amount'
y0 = customer_history_df[customer_history_df['num_cluster5_labels']==0][field_to_plot].values
y0 = y0[y0 < np.percentile(y0, cutoff_quantile)]
...

...
                rule_dict = {'support' : ex_rule_frm_rule_stat[2],
                             'confidence' : ex_rule_frm_rule_stat[3],
                             'coverage' : ex_rule_frm_rule_stat[4],
                             'strength' : ex_rule_frm_rule_stat[5],
                             'lift' : ex_rule_frm_rule_stat[6],
                             'leverage' : ex_rule_frm_rule_stat[7],
                             'antecedent': ante_rule,
                             'consequent': named_cons[:-2]}
                rule_list_df.append(rule_dict)
    rules_df = pd.DataFrame(rule_list_df)
    print("Raw rules data frame of {} rules generated".format(rules_df.shape[0]))
    if not rules_df.empty:
        pruned_rules_df = rules_df.groupby(['antecedent','consequent']).max().reset_index()
    else:
        print("Unable to generate any rule")

Raw rules data frame of 16628 rules generated

Chapter 8 ■ Customer Segmentation and Effective Cross Selling

The output of this code snippet is the association rules dataframe that we can use for our analysis. You can play around with the item number, consequent, antecedent, support, and confidence values to generate different rules. Let's take some sample rules generated using transactions that explain 40% of total sales, a min-support of 1% (required number of transactions >= 45), and confidence greater than 30%. Here, we have collected the rules having the maximum lift for each of the items that can be a consequent (i.e., that appear on the right side) by using the following code.
(pruned_rules_df[['antecedent','consequent',
                  'support','confidence','lift']].groupby('consequent')
                                                 .max()
                                                 .reset_index()
                                                 .sort_values(['lift', 'support','confidence'],
                                                              ascending=False))

Figure 8-13. Association rules on the grocery dataset

Let's interpret the first rule, which states that:

{yogurt, whole milk, tropical fruit → root vegetables}

The pattern the rule states is easy to understand: people who bought yogurt, whole milk, and tropical fruit also tend to buy root vegetables. Let's try to understand the metrics. The support of the rule is 228, which means that all the items together appear in 228 transactions in the dataset. The confidence of the rule is 46%, which means that 46% of the time the antecedent items occurred, we also had the consequent in the transaction (i.e., 46% of the time, customers who bought the left-side items also bought root vegetables).

Another important metric in Figure 8-13 is lift. A lift of 2.23 means that the probability of finding root vegetables in the transactions that contain yogurt, whole milk, and tropical fruit is 2.23 times greater than the baseline probability of finding root vegetables across all transactions. Typically, a lift value of 1 indicates that the occurrences of the antecedent and the consequent together are independent of each other. Hence, the idea is to look for rules having a lift much greater than 1. In our case, all the previously mentioned rules are good quality rules. This is a significant piece of information, as it can prompt a retailer to bundle specific products like these together or run a marketing scheme that offers a discount on buying root vegetables along with these other three products.
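The three metrics interpreted above can be computed directly from transaction counts. The toy transaction list below is hypothetical; it only illustrates the definitions of support, confidence, and lift, not the groceries dataset itself.

```python
# Toy transaction list (hypothetical) to illustrate support, confidence, lift
transactions = [
    {'yogurt', 'whole milk', 'tropical fruit', 'root vegetables'},
    {'yogurt', 'whole milk', 'tropical fruit'},
    {'whole milk', 'root vegetables'},
    {'yogurt', 'soda'},
    {'root vegetables'},
]

antecedent = {'yogurt', 'whole milk', 'tropical fruit'}
consequent = {'root vegetables'}

n = len(transactions)
n_ante = sum(antecedent <= t for t in transactions)              # antecedent occurrences
n_both = sum((antecedent | consequent) <= t for t in transactions)
n_cons = sum(consequent <= t for t in transactions)

support = n_both                      # absolute support, as reported in Figure 8-13
confidence = n_both / n_ante          # P(consequent | antecedent)
lift = confidence / (n_cons / n)      # ratio to the baseline P(consequent)
print(support, confidence, lift)
```

Here the antecedent appears in two transactions and the full rule in one, so the confidence is 0.5; since root vegetables appear in 60% of all transactions, the lift is below 1, i.e., this toy rule would not be a good one.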
We encourage you to try similar analyses with your own datasets in the future and also with the online retail transactions dataset that we used for our market segmentation case study. Considering the Online Retail dataset from market segmentation, the workflow for that particular analysis will be very similar. The only difference between these two datasets is the way in which they are represented. You can leverage the following code snippets to analyze patterns from the United Kingdom in that dataset.

cs_mba = pd.read_excel(io=r'Online Retail.xlsx')
cs_mba_uk = cs_mba[cs_mba.Country == 'United Kingdom']
# remove returned items
cs_mba_uk = cs_mba_uk[~(cs_mba_uk.InvoiceNo.str.contains("C") == True)]
cs_mba_uk = cs_mba_uk[cs_mba_uk.Quantity > 0]

# create transactional database
items = list(cs_mba_uk.Description.unique())
grouped = cs_mba_uk.groupby('InvoiceNo')
transaction_level_df_uk = grouped.aggregate(lambda x: tuple(x)).reset_index()[['InvoiceNo',
                                                                               'Description']]

transaction_dict = {item:0 for item in items}
output_dict = dict()
temp = dict()
for rec in transaction_level_df_uk.to_dict('records'):
    invoice_num = rec['InvoiceNo']
    items_list = rec['Description']
    transaction_dict = {item:0 for item in items}
    transaction_dict.update({item:1 for item in items if item in items_list})
    temp.update({invoice_num:transaction_dict})

new = [v for k,v in temp.items()]
tranasction_df = pd.DataFrame(new)
del(tranasction_df[tranasction_df.columns[0]])

Once you build the transactional dataset, you can choose your own configuration based on which you want to extract and mine rules.
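The invoice-by-item 0/1 matrix built by the loop above can also be produced with a pandas one-liner. This is a sketch on hypothetical toy data, not the Online Retail file; the column and value names are placeholders.

```python
import pandas as pd

# Hypothetical toy version of the invoice-level data
df = pd.DataFrame({
    'InvoiceNo':   ['A1', 'A1', 'A2', 'A3', 'A3'],
    'Description': ['milk', 'bread', 'milk', 'bread', 'eggs'],
})

# One row per invoice, one 0/1 column per item: the transactional database
basket = (df.groupby(['InvoiceNo', 'Description']).size()
            .unstack(fill_value=0)
            .clip(upper=1))
print(basket)
```

`clip(upper=1)` turns repeat purchases of the same item within an invoice into a simple presence flag, matching the 0/1 encoding the loop produces.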
For instance, the following code mines for patterns in the top 15 most sold products with a min-support of 0.01 (minimum 49 transactions) and a minimum confidence of 0.3.

output_df_uk_n, item_counts_n = prune_dataset(input_df=tranasction_df, length_trans=2,
                                              start_item=0, end_item=15)
input_assoc_rules = output_df_uk_n
domain_transac = Domain([DiscreteVariable.make(name=item, values=['0', '1'])
                         for item in input_assoc_rules.columns])
data_tran_uk = Orange.data.Table.from_numpy(domain=domain_transac,
                                            X=input_assoc_rules.as_matrix(), Y=None)
data_tran_uk_en, mapping = OneHot.encode(data_tran_uk, include_class=True)

support = 0.01
num_trans = input_assoc_rules.shape[0]*support
itemsets = dict(frequent_itemsets(data_tran_uk_en, support))
confidence = 0.3
rules_df = pd.DataFrame()
...  # rest of the code similar to what we did earlier

The rest of the analysis can be performed using the same workflow we used for the groceries dataset. Feel free to check out the Cross Selling.ipynb notebook in case you get stuck. Figure 8-14 shows some patterns from the previous analysis on our Online Retail dataset.

Figure 8-14. Association rules on the online retail dataset for UK customers

It is quite evident from the metrics in Figure 8-14 that these are excellent quality rules. We can see that items relevant to baking are purchased together and items like bags are purchased together. Try changing the previously mentioned parameters and see if you can find more interesting patterns!

Summary

In this chapter, we covered some simple yet high-value case studies. The crux of the chapter was to realize that the most important part of any analytics or Machine Learning-based solution is the value it can deliver to the organization.
As analytics or data science professionals, we must always try to balance the value aspect of our work against its technical complexity. We learned some important methods that have the potential to directly contribute to the revenue generation of organizations and retail establishments. We looked at ideas pertaining to customer segmentation and its impact, and we explored a novel way of using unsupervised learning to find customer segments and view interesting patterns and behavior. Cross selling introduced us to the world of pattern mining and rule-based frameworks like association rule-mining and principles like market basket analysis. We utilized a framework that was entirely different from the ones we have used until now and understood the value of data parsing and pre-processing besides regular modeling and analysis. In subsequent chapters of this book, we increase the technical complexity of our case studies, but we urge you to always have an eye out for defining the value and impact of these solutions. Stay tuned!

CHAPTER 9

Analyzing Wine Types and Quality

In the last chapter, we looked at specific case studies leveraging unsupervised Machine Learning techniques like clustering and rule-mining frameworks. In this chapter, we focus on some more case studies relevant to supervised Machine Learning algorithms and predictive analytics. We looked at classification-based problems in Chapter 7, where we built sentiment classifiers based on text reviews to predict the sentiment of movie reviews. In this chapter, the problem at hand is to analyze, model, and predict the type and quality of wine using its physicochemical attributes. Wine is a pleasant-tasting alcoholic beverage, loved by millions across the globe. Indeed, many of us love to celebrate our achievements, or even unwind at the end of a tough day, with a glass of wine! The following quote from Francis Bacon should whet your appetite about wine and its significance.
“Age appears best in four things: old wood to burn, old wine to drink, old friends to trust, and old authors to read.”
—Francis Bacon

Regardless of whether you like and consume wine or not, it will definitely be interesting to analyze the physicochemical attributes of wine and understand their relationships and significance with respect to wine quality and type. Since we will be trying to predict wine types and quality, the supervised Machine Learning task involved here is classification. In this chapter, we look at various ways to analyze and visualize wine data attributes and features. We focus on univariate as well as multivariate analyses. For predicting wine types and quality, we will be building classifiers based on state-of-the-art supervised Machine Learning techniques, including logistic regression, deep neural networks, decision trees, and ensemble models like random forests and gradient boosting, to name a few. Special emphasis is on analyzing, visualizing, and modeling data such that you can emulate similar principles on your own classification-based real-world problems in the future. We would like to thank the UC Irvine ML repository for the dataset. A special mention also goes to DataCamp and Karlijn Willems, a notable data science journalist, who has done some excellent work in analyzing the wine quality dataset and has written an article on her findings at https://www.datacamp.com/community/tutorials/deep-learning-python, which you can check out for more details. We have taken a couple of analyses and explanations from this article as inspiration for our chapter, and Karlijn has been more than helpful in sharing the same with us.

Problem Statement

“Given a dataset, or in this case two datasets that deal with the physicochemical properties of wine, can you guess the wine type and quality?” This is the main objective of this chapter. Of course, this doesn't mean the entire focus will be only on leveraging Machine Learning to build predictive models.
We will process, analyze, visualize, and model our dataset based on standard Machine Learning and data mining workflow models like the CRISP-DM model.

© Dipanjan Sarkar, Raghav Bali and Tushar Sharma 2018
D. Sarkar et al., Practical Machine Learning with Python, https://doi.org/10.1007/978-1-4842-3207-1_9

Chapter 9 ■ Analyzing Wine Types and Quality

The datasets used in this chapter are available in the very popular UCI Machine Learning Repository under the name Wine Quality Data Set. You can access more details at https://archive.ics.uci.edu/ml/datasets/wine+quality, which gives you access to the raw datasets as well as details about the various features in the datasets. There are two datasets, one for red wines and the other for white wines. To be more specific, the wine datasets are related to red and white vinho verde wine samples from the north of Portugal. Another file on the same web page talks about the details of the datasets, including attribute information. Credits for the datasets go to P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis, and you can get more details in their paper, “Modeling wine preferences by data mining from physicochemical properties,” Decision Support Systems, Elsevier, 47(4): 547-553, 2009. To summarize our main objectives, we will be trying to solve the following major problems by leveraging Machine Learning and data analysis on our wine quality dataset.

• Predict if each wine sample is a red or white wine.
• Predict the quality of each wine sample, which can be low, medium, or high.

Let's get started by setting up the necessary dependencies before moving on to accessing and analyzing our data!

Setting Up Dependencies

We will be using several Python libraries and frameworks specific to Machine Learning and Deep Learning. Just like in our previous chapters, you need to make sure you have pandas, numpy, scipy, and scikit-learn installed, which will be used for data processing and Machine Learning.
We will also use matplotlib and seaborn extensively for exploratory data analysis and visualizations. Deep Learning frameworks used in this chapter include keras with the tensorflow backend, but you can also use theano as the backend if you choose to do so. We also use the xgboost library for the gradient boosting ensemble model. Utilities related to supervised model fitting, prediction, and evaluation are present in model_evaluation_utils.py, so make sure you have this module in the same directory as the other Python files and jupyter notebooks for this chapter, which you can obtain from the relevant directory for this chapter on GitHub at https://github.com/dipanjanS/practical-machine-learning-with-python.

Getting the Data

The datasets will be available along with the code files for this chapter in the GitHub repository for this book at https://github.com/dipanjanS/practical-machine-learning-with-python under the respective folder for Chapter 9. The following files refer to the datasets of interest.

• The file named winequality-red.csv contains the dataset pertaining to 1599 records of red wine samples.
• The file named winequality-white.csv contains the dataset pertaining to 4898 records of white wine samples.
• The file named winequality.names consists of detailed information and the data dictionary pertaining to the datasets.

You can also download the same data from https://archive.ics.uci.edu/ml/datasets/wine+quality if needed. Once you have the CSV files, you can easily load them in Python using the read_csv(...) utility function from pandas.

Exploratory Data Analysis

Standard Machine Learning and analytics workflows recommend processing, cleaning, analyzing, and visualizing your data before moving on to modeling it. We will follow the same workflow we used in all our other chapters.
You can refer to the Python file titled exploratory_data_analysis.py for all the code used in this section or use the jupyter notebook titled Exploratory Data Analysis.ipynb for a more interactive experience.

Process and Merge Datasets

Let's load the following necessary dependencies and configuration settings.

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import seaborn as sns

%matplotlib inline

We will now process the datasets (red and white wine) and add some additional variables that we will want to predict in future sections. The first variable we will add is wine_type, which will be either red or white based on the dataset the wine sample comes from. The second variable we will add is quality_label, which is a qualitative measure of the quality of the wine sample based on the quality variable score. The rules used for mapping quality to quality_label are described as follows.

• Wine quality scores of 3, 4, and 5 are mapped to low quality wines under the quality_label attribute.
• Wine quality scores of 6 and 7 are mapped to medium quality wines under the quality_label attribute.
• Wine quality scores of 8 and 9 are mapped to high quality wines under the quality_label attribute.

After adding these attributes, we also merge the two datasets for red and white wine to create a single dataset, and we use pandas to merge and shuffle the records of the data frame. The following snippet helps us achieve this.
In [3]: white_wine = pd.read_csv('winequality-white.csv', sep=';')
   ...: red_wine = pd.read_csv('winequality-red.csv', sep=';')
   ...:
   ...: # store wine type as an attribute
   ...: red_wine['wine_type'] = 'red'
   ...: white_wine['wine_type'] = 'white'
   ...:
   ...: # bucket wine quality scores into qualitative quality labels
   ...: red_wine['quality_label'] = red_wine['quality'].apply(lambda value: 'low'
   ...:                                                       if value <= 5 else 'medium'
   ...:                                                       if value <= 7 else 'high')
   ...: red_wine['quality_label'] = pd.Categorical(red_wine['quality_label'],
   ...:                                            categories=['low', 'medium', 'high'])
   ...: white_wine['quality_label'] = white_wine['quality'].apply(lambda value: 'low'
   ...:                                                          if value <= 5 else 'medium'
   ...:                                                          if value <= 7 else 'high')
   ...: white_wine['quality_label'] = pd.Categorical(white_wine['quality_label'],
   ...:                                              categories=['low', 'medium', 'high'])
   ...:
   ...: # merge red and white wine datasets
   ...: wines = pd.concat([red_wine, white_wine])
   ...: # re-shuffle records just to randomize data points
   ...: wines = wines.sample(frac=1, random_state=42).reset_index(drop=True)

Our objective in future sections will be to predict wine_type and quality_label based on the other features in the wines dataset. Let's now try to understand more about our dataset and its features.

Understanding Dataset Features

The wines dataframe we obtained in the previous section is the final dataset we will be using for our analysis and modeling. We will also use the red_wine and white_wine dataframes where necessary for basic exploratory analysis and visualizations.
Let's start by looking at the total number of data samples we are dealing with and also the different features in our dataset.

In [3]: print(white_wine.shape, red_wine.shape)
   ...: print(wines.info())
(4898, 14) (1599, 14)
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 14 columns):
fixed acidity           6497 non-null float64
volatile acidity        6497 non-null float64
citric acid             6497 non-null float64
residual sugar          6497 non-null float64
chlorides               6497 non-null float64
free sulfur dioxide     6497 non-null float64
total sulfur dioxide    6497 non-null float64
density                 6497 non-null float64
pH                      6497 non-null float64
sulphates               6497 non-null float64
alcohol                 6497 non-null float64
quality                 6497 non-null int64
wine_type               6497 non-null object
quality_label           6497 non-null category
dtypes: category(1), float64(11), int64(1), object(1)
memory usage: 666.4+ KB

This information tells us that we have 4898 white wine data points and 1599 red wine data points. The merged dataset contains a total of 6497 data points, and we also get an idea of the numeric and categorical attributes. Let's take a peek at our dataset to see some sample data points.

In [4]: wines.head()

Figure 9-1. Sample data points from the wine quality dataset

The output depicted in Figure 9-1 shows us sample wine records from our wine quality dataset. Looking at the values, we can get an idea of the numeric as well as categorical features. Let's now try to gain some domain knowledge about wine and its attributes. Domain knowledge is essential and always recommended, especially if you are trying to analyze and model data from diverse domains. Wine is an alcoholic beverage made by the fermentation of grapes, without the addition of sugars, acids, enzymes, water, or other nutrients.
Red and white wine are two variants. Usually, red wine is made from dark red and black grapes. The color ranges from various shades of red, brown, and violet. It is produced from whole grapes, including the skin, which adds to the color and gives red wine its rich flavor. White wine is made from white grapes with no skins or seeds. The color is usually straw-yellow, yellow-green, or yellow-gold. Most white wines have a light and fruity flavor as compared to richer red wines. Let's now dive into the details of each feature in the dataset. Credits once again go to Karlijn for some of the attribute descriptions. Our dataset has a total of 14 attributes, described as follows.

• fixed acidity: Acids are one of the fundamental properties of wine and contribute greatly to its taste. Reducing acids significantly might lead to wines tasting flat. Fixed acids include tartaric, malic, citric, and succinic acids, which are found in grapes (except succinic). This variable is expressed in g(tartaric acid)/dm³ in the dataset.

• volatile acidity: These acids are to be distilled out from the wine before completing the production process. Volatile acidity consists primarily of acetic acid, though other acids like lactic, formic, and butyric acids might also be present. An excess of volatile acids is undesirable and leads to unpleasant flavor. In the United States, the legal limits of volatile acidity are 1.2 g/L for red table wine and 1.1 g/L for white table wine. The volatile acidity is expressed in g(acetic acid)/dm³ in the dataset.

• citric acid: This is one of the fixed acids that gives a wine its freshness. Usually most of it is consumed during the fermentation process, and sometimes it is added separately to give the wine more freshness. It is expressed in g/dm³ in the dataset.

• residual sugar: This typically refers to the natural sugar from grapes that remains after the fermentation process stops, or is stopped.
It is usually expressed in g/dm³ in the dataset.

• chlorides: This is usually a major contributor to saltiness in wine. It is expressed in g(sodium chloride)/dm³ in the dataset.

• free sulfur dioxide: This is the part of the sulfur dioxide that, when added to a wine, remains free after the rest binds. Winemakers will always try to get the highest proportion of free sulfur to bind. Sulfur dioxide compounds are also known as sulfites, and too much is undesirable and gives a pungent odor. This variable is expressed in mg/dm³ in the dataset.

• total sulfur dioxide: This is the sum total of the bound and the free sulfur dioxide (SO2). Here, it is expressed in mg/dm³. It is mainly added to kill harmful bacteria and preserve quality and freshness. There are usually legal limits for sulfur levels in wines, and an excess of it can even kill good yeast and produce an undesirable odor.

• density: This can be represented as a comparison of the weight of a specific volume of wine to an equivalent volume of water. It is generally used as a measure of the conversion of sugar to alcohol. Here, it is expressed in g/cm³.

• pH: Also known as the potential of hydrogen, this is a numeric scale to specify the acidity or basicity of the wine. Fixed acidity contributes the most toward the pH of wines. As you might know, solutions with a pH less than 7 are acidic, while solutions with a pH greater than 7 are basic. With a pH of 7, pure water is neutral. Most wines have a pH between 2.9 and 3.9 and are therefore acidic.

• sulphates: These are mineral salts containing sulfur. Sulphates are to wine as gluten is to food. They are a regular part of winemaking around the world and are considered essential. They are connected to the fermentation process and affect the wine's aroma and flavor. Here, they are expressed in g(potassium sulphate)/dm³ in the dataset.

• alcohol: Wine is an alcoholic beverage.
Alcohol is formed as a result of yeast converting sugar during the fermentation process. The percentage of alcohol can vary from wine to wine, so it is no surprise that this attribute is a part of the dataset. It is usually measured in % vol or alcohol by volume (ABV).

• quality: Wine experts graded the wine quality between 0 (very bad) and 10 (very excellent). The eventual quality score is the median of at least three evaluations made by the same wine experts.

• wine_type: Since we originally had two datasets for red and white wine, we introduced this attribute in the final merged dataset; it indicates the type of wine for each data point. A wine can be a red or a white wine. One of the predictive models we will build in this chapter will predict the type of wine by looking at the other wine attributes.

• quality_label: This is an attribute derived from the quality attribute. We bucket or group wine quality scores into three qualitative buckets, namely low, medium, and high. Wines with a quality score of 3, 4, or 5 are low quality; scores of 6 and 7 are medium quality; and scores of 8 and 9 are high quality wines. We will also build another model in this chapter to predict this wine quality label based on the other wine attributes.

Now that you have a solid foundation on the dataset as well as its features, let's analyze and visualize various features and their interactions.

Descriptive Statistics

We will start by computing some descriptive statistics for various features of interest in our dataset. This involves computing aggregation metrics like mean, median, standard deviation, and so on. If you remember, one of our primary objectives is to build a model that can correctly predict if a wine is a red or white wine based on its attributes. Let's build a descriptive summary table of various wine attributes separated by wine type.
In [5]: subset_attributes = ['residual sugar', 'total sulfur dioxide', 'sulphates',
                             'alcohol', 'volatile acidity', 'quality']
   ...: rs = round(red_wine[subset_attributes].describe(),2)
   ...: ws = round(white_wine[subset_attributes].describe(),2)
   ...: pd.concat([rs, ws], axis=1, keys=['Red Wine Statistics', 'White Wine Statistics'])

Figure 9-2. Descriptive statistics for wine attributes separated by wine type

The summary table depicted in Figure 9-2 shows descriptive statistics for various wine attributes. Do you notice any interesting properties? For starters, the mean residual sugar and total sulfur dioxide content in white wine seem to be much higher than in red wine. Also, the mean values of sulphates and volatile acidity seem to be higher in red wine as compared to white wine. Try including other features too and see if you can find more interesting comparisons! Considering wine quality levels as data subsets, let's build some descriptive summary statistics with the following snippet.

In [6]: subset_attributes = ['alcohol', 'volatile acidity', 'pH', 'quality']
   ...: ls = round(wines[wines['quality_label'] == 'low'][subset_attributes].describe(),2)
   ...: ms = round(wines[wines['quality_label'] == 'medium'][subset_attributes].describe(),2)
   ...: hs = round(wines[wines['quality_label'] == 'high'][subset_attributes].describe(),2)
   ...: pd.concat([ls, ms, hs], axis=1, keys=['Low Quality Wine', 'Medium Quality Wine',
                                              'High Quality Wine'])

Figure 9-3. Descriptive statistics for wine attributes separated by wine quality

The summary table depicted in Figure 9-3 shows descriptive statistics for various wine attributes subset by wine quality ratings. Interestingly, mean alcohol levels seem to increase with the rating of the wine quality. We also see that pH levels are almost consistent across the wine samples of varying quality.
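The side-by-side summary tables above rely on pd.concat with the keys parameter, which builds MultiIndex columns. A minimal sketch on hypothetical toy frames shows how the resulting table is indexed:

```python
import pandas as pd

# Toy stand-ins (hypothetical) for the red and white wine frames
red = pd.DataFrame({'alcohol': [9.4, 9.8, 10.0], 'quality': [5, 5, 6]})
white = pd.DataFrame({'alcohol': [8.8, 9.5, 11.0], 'quality': [6, 6, 7]})

rs = round(red.describe(), 2)
ws = round(white.describe(), 2)

# keys=... labels each block of summary statistics with a top-level column
summary = pd.concat([rs, ws], axis=1,
                    keys=['Red Wine Statistics', 'White Wine Statistics'])
print(summary[('Red Wine Statistics', 'alcohol')]['mean'])
```

Selecting with a (key, column) tuple drills into one block of the combined table, which is handy when you want to compare a single statistic across the groups programmatically rather than by eyeballing the printed table.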
Is there any way to statistically prove this? We will see that in the following section.

Inferential Statistics

The general notion of inferential statistics is to draw inferences and propositions about a population using a data sample. The idea is to use statistical methods and models to draw statistical inferences from a given hypothesis. Each hypothesis consists of a null hypothesis and an alternative hypothesis. If the test result is statistically significant based on a pre-set significance level (e.g., if the obtained p-value is less than a 5% significance level), we reject the null hypothesis in favor of the alternative hypothesis. Otherwise, if the result is not statistically significant, we fail to reject the null hypothesis.

Coming back to our problem from the previous section, given multiple data groups or subsets of wine samples based on wine quality rating, is there any way to prove that mean alcohol levels or pH levels vary significantly among the data groups? A great statistical test to prove or disprove a difference in means among subsets of data is the one-way ANOVA test. ANOVA stands for "analysis of variance", a nifty statistical model that can be used to analyze statistically significant differences among the means or averages of various groups. This is basically achieved using a statistical test that helps us determine whether or not the means of several groups are equal. Usually the null hypothesis is represented as

H0: μ1 = μ2 = μ3 = ... = μn

where n is the number of data groups or subsets, and it states that all group means are equal. The alternative hypothesis, HA, states that at least two group means are statistically significantly different from each other. Usually the F-statistic and the p-value associated with it are used to determine statistical significance.
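To make the F-statistic concrete: it is the ratio of the between-group mean square to the within-group mean square, so large values indicate group means that differ more than within-group noise would explain. Below is a minimal pure-Python sketch of this computation on three made-up groups of alcohol levels (the helper name one_way_anova_f and the sample values are ours, not from the book's code); scipy's stats.f_oneway, used later in the chapter, computes the same statistic along with its p-value.

```python
def one_way_anova_f(*groups):
    """One-way ANOVA F-statistic: ratio of between-group to
    within-group mean squares."""
    all_vals = [x for g in groups for x in g]
    n_total = len(all_vals)
    k = len(groups)
    grand_mean = sum(all_vals) / n_total
    # Between-group sum of squares (each group mean vs. the grand mean)
    ssb = sum(len(g) * ((sum(g) / len(g)) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares (each value vs. its own group mean)
    ssw = sum(sum((x - (sum(g) / len(g))) ** 2 for x in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (n_total - k))

# Hypothetical alcohol % values for low/medium/high quality wines
low = [9.0, 9.5, 10.0]
medium = [10.5, 11.0, 11.5]
high = [12.0, 12.5, 13.0]
print(one_way_anova_f(low, medium, high))  # 27.0
```

Here the group means are well separated relative to the spread within each group, so the F-statistic is large, mirroring what we will see for alcohol levels in the real data.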
Typically a p-value less than 0.05 is taken to be a statistically significant result, in which case we reject the null hypothesis in favor of the alternative. We recommend reading a standard book on inferential statistics to gain more in-depth knowledge of these concepts.

For our scenario, three data subsets or groups are created from the data based on wine quality ratings. The means in the first test are based on the wine alcohol content and in the second test on the wine pH levels. In each case, the null hypothesis is that the group means for low, medium, and high quality wine are the same, and the alternative hypothesis is that there is a statistically significant difference between at least two group means. The following snippet helps us perform the one-way ANOVA test.

In [7]: from scipy import stats
   ...:
   ...: F, p = stats.f_oneway(wines[wines['quality_label'] == 'low']['alcohol'],
   ...:                       wines[wines['quality_label'] == 'medium']['alcohol'],
   ...:                       wines[wines['quality_label'] == 'high']['alcohol'])
   ...: print('ANOVA test for mean alcohol levels across wine samples with different quality ratings')
   ...: print('F Statistic:', F, '\tp-value:', p)
   ...:
   ...: F, p = stats.f_oneway(wines[wines['quality_label'] == 'low']['pH'],
   ...:                       wines[wines['quality_label'] == 'medium']['pH'],
   ...:                       wines[wines['quality_label'] == 'high']['pH'])
   ...: print('\nANOVA test for mean pH levels across wine samples with different quality ratings')
   ...: print('F Statistic:', F, '\tp-value:', p)

ANOVA test for mean alcohol levels across wine samples with different quality ratings
F Statistic: 673.074534723      p-value: 2.27153374506e-266

ANOVA test for mean pH levels across wine samples with different quality ratings
F Statistic: 1.23638608035      p-value: 0.290500277977
From the preceding results we can clearly see that the p-value is much less than 0.05 in the first test and greater than 0.05 in the second test. This tells us that there is a statistically significant difference in mean alcohol levels for at least two of the three groups (rejecting the null hypothesis in favor of the alternative). In the case of mean pH levels, however, we do not reject the null hypothesis, and thus we conclude that the pH level means across the three groups are not statistically significantly different. We can even visualize these two features and observe the means using the following snippet.

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
f.suptitle('Wine Quality - Alcohol Content/pH', fontsize=14)
f.subplots_adjust(top=0.85, wspace=0.3)

sns.boxplot(x="quality_label", y="alcohol", data=wines, ax=ax1)
ax1.set_xlabel("Wine Quality Class", size=12, alpha=0.8)
ax1.set_ylabel("Wine Alcohol %", size=12, alpha=0.8)

sns.boxplot(x="quality_label", y="pH", data=wines, ax=ax2)
ax2.set_xlabel("Wine Quality Class", size=12, alpha=0.8)
ax2.set_ylabel("Wine pH", size=12, alpha=0.8)

Figure 9-4. Visualizing wine alcohol content and pH level distributions based on quality ratings

The boxplots depicted in Figure 9-4 show stark differences in wine alcohol content distributions based on wine quality, whereas pH levels stay between roughly 3.1 and 3.3. In fact, if you look at the mean and median pH values across the three groups, they are approximately 3.2 in each, while alcohol % varies significantly. Can you find more interesting patterns and hypotheses with other features from this data? Give it a try!

Univariate Analysis

This is perhaps one of the easiest yet most foundational steps in exploratory data analysis. Univariate analysis involves analyzing data such that at any instance of analysis we are only dealing with one variable or feature.
No relationships or correlations are analyzed among multiple variables. The simplest way to easily visualize all the variables in your data is to build some histograms. The following snippet helps visualize distributions of data values for all features. While a histogram may not be an appropriate visualization in many cases, it is a good one to start with for numeric data.

red_wine.hist(bins=15, color='red', edgecolor='black', linewidth=1.0,
              xlabelsize=8, ylabelsize=8, grid=False)
plt.tight_layout(rect=(0, 0, 1.2, 1.2))
rt = plt.suptitle('Red Wine Univariate Plots', x=0.65, y=1.25, fontsize=14)

white_wine.hist(bins=15, color='white', edgecolor='black', linewidth=1.0,
                xlabelsize=8, ylabelsize=8, grid=False)
plt.tight_layout(rect=(0, 0, 1.2, 1.2))
wt = plt.suptitle('White Wine Univariate Plots', x=0.65, y=1.25, fontsize=14)

Figure 9-5. Univariate plots depicting feature distributions for the wine quality dataset

The power of packages like matplotlib and pandas enables you to easily plot variable distributions, as depicted in Figure 9-5, using minimal code. Do you notice any interesting patterns across the two wine types? Let's take the feature named residual sugar and plot the distributions across data pertaining to red and white wine samples.
fig = plt.figure(figsize=(10, 4))
title = fig.suptitle("Residual Sugar Content in Wine", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax1 = fig.add_subplot(1, 2, 1)
ax1.set_title("Red Wine")
ax1.set_xlabel("Residual Sugar")
ax1.set_ylabel("Frequency")
ax1.set_ylim([0, 2500])
ax1.text(8, 1000, r'$\mu$='+str(round(red_wine['residual sugar'].mean(), 2)),
         fontsize=12)
r_freq, r_bins, r_patches = ax1.hist(red_wine['residual sugar'], color='red', bins=15,
                                     edgecolor='black', linewidth=1)

ax2 = fig.add_subplot(1, 2, 2)
ax2.set_title("White Wine")
ax2.set_xlabel("Residual Sugar")
ax2.set_ylabel("Frequency")
ax2.set_ylim([0, 2500])
ax2.text(30, 1000, r'$\mu$='+str(round(white_wine['residual sugar'].mean(), 2)),
         fontsize=12)
w_freq, w_bins, w_patches = ax2.hist(white_wine['residual sugar'], color='white', bins=15,
                                     edgecolor='black', linewidth=1)

Figure 9-6. Residual sugar distribution for red and white wine samples

We can easily notice from the visualization in Figure 9-6 that residual sugar content in white wine samples seems to be higher than in red wine samples. You can reuse the plotting template in the preceding code snippet to visualize more features. Some plots are depicted as follows (detailed code is present in the jupyter notebook).

Figure 9-7. Distributions for sulphate content and alcohol content for red and white wine samples

The plots depicted in Figure 9-7 show us that the sulphate content is slightly higher in red wine samples than in white wine samples, and alcohol content is about the same in both types on average. Of course, frequency counts are higher in all cases for white wine because we have more white wine sample records than red wine.
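Under the hood, a histogram like the ones above simply counts values into equal-width bins over a range. A minimal pure-Python sketch of that binning logic follows (the helper name histogram_counts and the sample sugar values are ours, purely illustrative; matplotlib's hist does this, plus edge handling and drawing, for you):

```python
def histogram_counts(values, bins, lo, hi):
    """Count values into `bins` equal-width bins over [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
        elif v == hi:
            # by convention the top edge falls into the last bin
            counts[-1] += 1
    return counts

# Hypothetical residual sugar values (g/L)
sugar = [1.2, 1.9, 2.1, 2.6, 3.4, 6.0, 9.8]
print(histogram_counts(sugar, bins=5, lo=0.0, hi=10.0))  # [2, 3, 0, 1, 1]
```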
Next, we plot the distributions of the quality and quality_label categorical features to get an idea of the class distributions, which we will be predicting later on.

Figure 9-8. Distributions for wine quality for red and white wine samples

The bar plots depicted in Figure 9-8 show us the distribution of wine samples based on type and quality. It is quite evident that high quality wine samples are far fewer than low and medium quality wine samples.

Multivariate Analysis

Analyzing multiple feature variables and their relationships is what multivariate analysis is all about. We want to see if there are any interesting patterns and relationships among the physicochemical attributes of our wine samples, which might be helpful in our modeling process later. One of the best ways to analyze features is to build a pairwise correlation plot depicting the correlation coefficient between each pair of features in the dataset. The following snippet helps us build a correlation matrix and plot it in the form of an easy-to-interpret heatmap.

f, ax = plt.subplots(figsize=(10, 5))
corr = wines.corr()
hm = sns.heatmap(round(corr, 2), annot=True, ax=ax, cmap="coolwarm", fmt='.2f',
                 linewidths=.05)
f.subplots_adjust(top=0.93)
t = f.suptitle('Wine Attributes Correlation Heatmap', fontsize=12)

Figure 9-9. Correlation heatmap for features in the wine quality dataset

While most of the correlations are weak, as observed in Figure 9-9, we can see a strong negative correlation between density and alcohol and a strong positive correlation between total and free sulfur dioxide, which is expected. You can also visualize patterns and relationships among multiple variables using pairwise plots, using different hues for the wine types, essentially plotting three variables at a time.
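Each cell in such a heatmap is a Pearson correlation coefficient: the covariance of two features divided by the product of their standard deviations, always between -1 and 1. A minimal pure-Python sketch follows, using made-up density and alcohol vectors chosen to show a perfect negative relationship (the helper name pearson_r and the values are ours; wines.corr() computes this for every pair of columns):

```python
def pearson_r(x, y):
    """Pearson correlation: covariance over the product of std devs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical density vs. alcohol values: a perfectly linear
# negative relationship, so r is exactly -1
density = [0.995, 0.996, 0.997, 0.998]
alcohol = [12.0, 11.0, 10.0, 9.0]
print(round(pearson_r(density, alcohol), 2))  # -1.0
```

In the real data the density-alcohol correlation is strongly negative but not exactly -1, since the relationship is close to linear rather than perfectly so.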
The following snippet depicts a sample pairwise plot for some features in our dataset.

cols = ['wine_type', 'quality', 'sulphates', 'volatile acidity']
pp = sns.pairplot(wines[cols], hue='wine_type', size=1.8, aspect=1.8,
                  palette={"red": "#FF9999", "white": "#FFE888"},
                  plot_kws=dict(edgecolor="black", linewidth=0.5))
fig = pp.fig
fig.subplots_adjust(top=0.93, wspace=0.3)
t = fig.suptitle('Wine Attributes Pairwise Plots', fontsize=14)

Figure 9-10. Pairwise plots by wine type for features in the wine quality dataset

From the plots in Figure 9-10, we can notice several interesting patterns, which are in alignment with some insights we obtained earlier. These observations include the following:

• Presence of higher sulphate levels in red wines as compared to white wines
• Lower sulphate levels in wines with high quality ratings
• Lower levels of volatile acids in wines with high quality ratings
• Presence of higher volatile acid levels in red wines as compared to white wines

You can use similar plots on other variables and features to discover more patterns and relationships. To observe relationships among features with a more microscopic view, joint plots are excellent visualization tools, specifically for multivariate visualizations. The following snippet depicts the relationship between wine types, sulphates, and quality ratings.
rj = sns.jointplot(x='quality', y='sulphates', data=red_wine,
                   kind='reg', ylim=(0, 2),
                   color='red', space=0, size=4.5, ratio=4)
rj.ax_joint.set_xticks(list(range(3, 9)))
fig = rj.fig
fig.subplots_adjust(top=0.9)
t = fig.suptitle('Red Wine Sulphates - Quality', fontsize=12)

wj = sns.jointplot(x='quality', y='sulphates', data=white_wine,
                   kind='reg', ylim=(0, 2),
                   color='#FFE160', space=0, size=4.5, ratio=4)
wj.ax_joint.set_xticks(list(range(3, 10)))
fig = wj.fig
fig.subplots_adjust(top=0.9)
t = fig.suptitle('White Wine Sulphates - Quality', fontsize=12)

Figure 9-11. Visualizing relationships between wine types' sulphates and quality with joint plots

While there seems to be some pattern depicting lower sulphate levels for higher quality rated wine samples, the correlation is quite weak (see Figure 9-11). However, we do see clearly that sulphate levels in red wine are much higher than in white wine. In this case we have visualized three features (type, quality, and sulphates) with the help of two plots. What if we wanted to visualize a higher number of features and determine patterns from them? The seaborn framework provides facet grids that help us visualize a higher number of variables in two-dimensional plots. Let's try to visualize relationships between wine type, quality ratings, volatile acidity, and alcohol volume levels.
g = sns.FacetGrid(wines, col="wine_type", hue='quality_label',
                  col_order=['red', 'white'], hue_order=['low', 'medium', 'high'],
                  aspect=1.2, size=3.5, palette=sns.light_palette('navy', 3))
g.map(plt.scatter, "volatile acidity", "alcohol", alpha=0.9,
      edgecolor='white', linewidth=0.5)
fig = g.fig
fig.subplots_adjust(top=0.8, wspace=0.3)
fig.suptitle('Wine Type - Alcohol - Quality - Acidity', fontsize=14)
l = g.add_legend(title='Wine Quality Class')

Figure 9-12. Visualizing relationships between wine types: alcohol, quality, and acidity levels

The plot in Figure 9-12 shows us some interesting patterns. Not only are we able to successfully visualize four variables, but we can also see meaningful relationships among them. Higher quality wine samples (depicted by darker shades) have lower levels of volatile acidity and higher levels of alcohol content as compared to wine samples with medium and low ratings. Besides this, we can also see that volatile acidity levels are slightly lower in white wine samples as compared to red wine samples. Let's now build a similar visualization. However, in this scenario, we want to analyze patterns in wine types, quality, sulfur dioxide, and acidity levels. We can use the same framework as our last code snippet to achieve this.

g = sns.FacetGrid(wines, col="wine_type", hue='quality_label',
                  col_order=['red', 'white'], hue_order=['low', 'medium', 'high'],
                  aspect=1.2, size=3.5, palette=sns.light_palette('green', 3))
g.map(plt.scatter, "volatile acidity", "total sulfur dioxide", alpha=0.9,
      edgecolor='white', linewidth=0.5)
fig = g.fig
fig.subplots_adjust(top=0.8, wspace=0.3)
fig.suptitle('Wine Type - Sulfur Dioxide - Acidity - Quality', fontsize=14)
l = g.add_legend(title='Wine Quality Class')

Figure 9-13.
Visualizing relationships between wine types: quality, sulfur dioxide, and acidity levels

We can easily interpret from Figure 9-13 that volatile acidity as well as total sulfur dioxide is considerably lower in high quality wine samples. Also, total sulfur dioxide is considerably higher in white wine samples than in red wine samples. However, volatile acidity levels are slightly lower in white wine samples as compared to red wine samples, which we also observed in the previous plot.

A nice way to visualize numerical features segmented by groups (categorical variables) is to use box plots. In our dataset, we have already discussed the relationship of higher alcohol levels with higher quality ratings for wine samples in the "Inferential Statistics" section. Let's try to visualize the relationship between wine alcohol levels grouped by wine quality ratings. We will generate two plots for wine alcohol content versus both wine quality and quality_label.

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
f.suptitle('Wine Type - Quality - Alcohol Content', fontsize=14)

sns.boxplot(x="quality", y="alcohol", hue="wine_type",
            data=wines, palette={"red": "#FF9999", "white": "white"}, ax=ax1)
ax1.set_xlabel("Wine Quality", size=12, alpha=0.8)
ax1.set_ylabel("Wine Alcohol %", size=12, alpha=0.8)

sns.boxplot(x="quality_label", y="alcohol", hue="wine_type",
            data=wines, palette={"red": "#FF9999", "white": "white"}, ax=ax2)
ax2.set_xlabel("Wine Quality Class", size=12, alpha=0.8)
ax2.set_ylabel("Wine Alcohol %", size=12, alpha=0.8)
l = plt.legend(loc='best', title='Wine Type')

Figure 9-14. Visualizing relationships between wine types: quality and alcohol content

Based on our earlier analysis of wine quality versus alcohol volume in the "Inferential Statistics" section, these results look consistent.
Each box plot in Figure 9-14 depicts the distribution of alcohol levels for a particular wine quality rating, separated by wine type. The box itself depicts the inter-quartile range and the line inside depicts the median value of alcohol. Whiskers indicate the minimum and maximum values, with outliers often depicted as individual points. We can clearly observe that the wine alcohol by volume distribution has an increasing trend for higher quality rated wine samples. Similarly, we can also use violin plots to visualize distributions of numeric features over categorical features. Let's build a visualization for analyzing the acidity of wine samples by quality ratings.

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
f.suptitle('Wine Type - Quality - Acidity', fontsize=14)

sns.violinplot(x="quality", y="volatile acidity", hue="wine_type",
               data=wines, split=True, inner="quart", linewidth=1.3,
               palette={"red": "#FF9999", "white": "white"}, ax=ax1)
ax1.set_xlabel("Wine Quality", size=12, alpha=0.8)
ax1.set_ylabel("Wine Fixed Acidity", size=12, alpha=0.8)

sns.violinplot(x="quality_label", y="volatile acidity", hue="wine_type",
               data=wines, split=True, inner="quart", linewidth=1.3,
               palette={"red": "#FF9999", "white": "white"}, ax=ax2)
ax2.set_xlabel("Wine Quality Class", size=12, alpha=0.8)
ax2.set_ylabel("Wine Fixed Acidity", size=12, alpha=0.8)
l = plt.legend(loc='upper right', title='Wine Type')

Figure 9-15. Visualizing relationships between wine types: quality and acidity
In Figure 9-15, each violin plot depicts the inter-quartile range with the median, shown with dotted lines in this figure. The width of each violin depicts the density (frequency) of data values. Thus, in addition to the information you get from box plots, you can also visualize the distribution of the data with violin plots. In fact, we have built split-violin plots in this case, depicting both types of wine in each violin. It is quite evident that red wine samples have higher acidity compared to their white wine counterparts. We can also see an overall decrease in acidity with higher quality for red wine samples, but not so much for white wine samples. These code snippets and examples should give you some good frameworks and blueprints to perform effective exploratory data analysis on your own datasets in the future.

Predictive Modeling

We will now focus on our main objective of building predictive models to predict the wine types and quality ratings based on other features. We will follow the standard classification Machine Learning pipeline in this case. There are two main classification systems we will build in this section:

• Prediction system for wine type (red or white wine)
• Prediction system for wine quality rating (low, medium, or high)

We will be using the wines data frame from the previous sections. The entire code for this section is available in the Python file titled predictive_analytics.py, or you can use the jupyter notebook titled Predictive Analytics.ipynb for a more interactive experience. To start with, let's load the following necessary dependencies and settings.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import model_evaluation_utils as meu

from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

%matplotlib inline

Do remember to have the model_evaluation_utils.py module in the same directory where you are running your code, since we will be using it for evaluating our predictive models. Let's briefly look at the workflow we will follow for our predictive systems. We will focus on two major phases: model training, and model predictions and evaluation.
Figure 9-16. Workflow blueprint for our wine type and quality classification system

From Figure 9-16, we can see that training data and testing data refer to the wine quality dataset features. Since we already have the necessary wine attributes, we won't be building additional hand-crafted features. Labels can be either wine types or quality ratings, depending on the classification system. In the training phase, feature selection mostly involves selecting all the necessary wine physicochemical attributes; then, after the necessary scaling, we train our predictive models, which are used for prediction and evaluation in the prediction phase.

Predicting Wine Types

In our wine quality dataset, we have two variants or types of wine: red and white. The main task of our classification system in this section is to predict the wine type based on other features. To start with, we will first select our necessary features, separate out the prediction class labels, and prepare train and test datasets. We use the prefix wtp_ in our variables to easily identify them as needed, where wtp stands for wine type prediction.
In [5]: wtp_features = wines.iloc[:, :-3]
   ...: wtp_feature_names = wtp_features.columns
   ...: wtp_class_labels = np.array(wines['wine_type'])
   ...:
   ...: wtp_train_X, wtp_test_X, wtp_train_y, wtp_test_y = train_test_split(wtp_features,
   ...:                                wtp_class_labels, test_size=0.3, random_state=42)
   ...:
   ...: print(Counter(wtp_train_y), Counter(wtp_test_y))
   ...: print('Features:', list(wtp_feature_names))

Counter({'white': 3418, 'red': 1129}) Counter({'white': 1480, 'red': 470})
Features: ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']

The numbers show us the wine samples for each class, and we can also see the feature names that will be used in our feature set. Let's move on to scaling our features. We will be using a standard scaler in this scenario.

In [6]: # Define the scaler
   ...: wtp_ss = StandardScaler().fit(wtp_train_X)
   ...: # Scale the train set
   ...: wtp_train_SX = wtp_ss.transform(wtp_train_X)
   ...: # Scale the test set
   ...: wtp_test_SX = wtp_ss.transform(wtp_test_X)

Since we are dealing with a binary classification problem, one of the traditional Machine Learning algorithms we can use is the logistic regression model. If you remember, we talked about this in detail in Chapter 7. Feel free to skim through the "Traditional Supervised Machine Learning Models" section in Chapter 7 to refresh your memory on logistic regression, or you can refer to any standard text book or material on classification models.
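It is worth being clear about what the scaler above does: it learns the mean and standard deviation of each feature on the training set only, and applies those same parameters to both the train and test sets, so no information from the test set leaks into the scaling. A minimal numpy sketch of that math follows (the tiny single-column train/test arrays are made up for illustration; sklearn's StandardScaler does this per column with the same population standard deviation):

```python
import numpy as np

# Hypothetical single-feature train/test data
train = np.array([[10.0], [12.0], [14.0]])
test = np.array([[13.0]])

# "Fit": learn mean and std on the training data only
mu = train.mean(axis=0)
sigma = train.std(axis=0)

# "Transform": apply the same parameters to both sets
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma

print(train_scaled.ravel())  # roughly [-1.22, 0.0, 1.22]
print(test_scaled.ravel())   # roughly [0.61]
```

After this transform, the training column has mean 0 and standard deviation 1, while the test column is expressed in the training set's units, which is exactly what distance- and gradient-based models expect.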
Let's now train a model on our training dataset and labels using logistic regression.

In [7]: from sklearn.linear_model import LogisticRegression
   ...:
   ...: wtp_lr = LogisticRegression()
   ...: wtp_lr.fit(wtp_train_SX, wtp_train_y)
Out[7]: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Now that our model is ready, let's predict the wine types for our test data samples and evaluate the performance.

In [8]: wtp_lr_predictions = wtp_lr.predict(wtp_test_SX)
   ...: meu.display_model_performance_metrics(true_labels=wtp_test_y,
   ...:                  predicted_labels=wtp_lr_predictions, classes=['red', 'white'])

Figure 9-17. Model performance metrics for logistic regression for wine type predictive model

We get an overall F1 score and model accuracy of 99.2%, as depicted in Figure 9-17, which is really amazing! Despite the low number of red wine samples, we seem to do pretty well. In case your models do not perform well on other datasets due to a class imbalance problem, you can consider over-sampling or under-sampling techniques, including sample selection as well as SMOTE. Coming back to our classification problem, we have a really good model, but can we do better? While that seems like a far-fetched dream, let's try modeling the data using a fully connected deep neural network (DNN) with three hidden layers. Refer to the "Newer Supervised Deep Learning Models" section in Chapter 7 to refresh your memory on fully-connected DNNs and MLPs. Deep Learning frameworks like keras on top of tensorflow prefer your output response labels to be encoded in numeric form, which is easier to work with. The following snippet encodes our wine type class labels.
In [9]: le = LabelEncoder()
   ...: le.fit(wtp_train_y)
   ...: # encode wine type labels
   ...: wtp_train_ey = le.transform(wtp_train_y)
   ...: wtp_test_ey = le.transform(wtp_test_y)

Let's build the architecture for our three-hidden-layer DNN, where each hidden layer has 16 units (the input layer has 11 units for the 11 features) and the output layer has 1 unit to predict a 0 or 1, which maps back to red or white wine.

In [10]: from keras.models import Sequential
    ...: from keras.layers import Dense
    ...:
    ...: wtp_dnn_model = Sequential()
    ...: wtp_dnn_model.add(Dense(16, activation='relu', input_shape=(11,)))
    ...: wtp_dnn_model.add(Dense(16, activation='relu'))
    ...: wtp_dnn_model.add(Dense(16, activation='relu'))
    ...: wtp_dnn_model.add(Dense(1, activation='sigmoid'))
    ...:
    ...: wtp_dnn_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Using TensorFlow backend.

You can see that we are using keras on top of tensorflow, and for our optimizer we have chosen the adam optimizer with a binary cross-entropy loss. You can also use categorical cross-entropy if needed, which is especially useful when you have more than two classes. The following snippet helps train our DNN.

In [11]: history = wtp_dnn_model.fit(wtp_train_SX, wtp_train_ey, epochs=10, batch_size=5,
    ...:                             shuffle=True, validation_split=0.1, verbose=1)
Train on 4092 samples, validate on 455 samples
Epoch 1/10
4092/4092 - 1s - loss: 0.1266 - acc: 0.9467 - val_loss: 0.0115 - val_acc: 0.9978
Epoch 2/10
4092/4092 - 1s - loss: 0.0315 - acc: 0.9934 - val_loss: 0.0046 - val_acc: 1.0000
...
Epoch 9/10
4092/4092 - 1s - loss: 0.0112 - acc: 0.9973 - val_loss: 0.0029 - val_acc: 1.0000
Epoch 10/10
4092/4092 - 1s - loss: 0.0098 - acc: 0.9978 - val_loss: 0.0013 - val_acc: 1.0000

We use 10% of the training data as a validation set while training the model to see how it performs at each epoch.
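The binary cross-entropy loss reported at each epoch above is just the average of -(y*log(p) + (1-y)*log(1-p)) over the samples, where y is the true 0/1 label and p is the sigmoid output. A minimal pure-Python sketch with made-up labels and predictions (the function name binary_crossentropy and the values are ours, not keras internals):

```python
import math

def binary_crossentropy(y_true, y_pred):
    """Mean binary cross-entropy: -(y*log(p) + (1-y)*log(1-p))."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical encoded labels (e.g., 0 = red, 1 = white) and sigmoid outputs
y_true = [1, 0, 1]
y_pred = [0.9, 0.1, 0.8]
print(round(binary_crossentropy(y_true, y_pred), 4))  # 0.1446
```

Confident, correct predictions (p near the true label) contribute little to the loss, while confident wrong ones are penalized heavily, which is why the loss falling toward 0 over the epochs tracks the accuracy rising toward 1.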
Let's now predict and evaluate our model on the actual test dataset.

In [15]: wtp_dnn_ypred = wtp_dnn_model.predict_classes(wtp_test_SX)
    ...: wtp_dnn_predictions = le.inverse_transform(wtp_dnn_ypred)
    ...: meu.display_model_performance_metrics(true_labels=wtp_test_y,
    ...:                     predicted_labels=wtp_dnn_predictions, classes=['red', 'white'])

Figure 9-18. Model performance metrics for deep neural network for wine type predictive model

We get an overall F1 score and model accuracy of 99.5%, as depicted in Figure 9-18, which is even better than our previous model! This goes to show that you don't always need big data; good quality data and features matter even for Deep Learning models. The loss and accuracy measures at each epoch are depicted in Figure 9-19, with the detailed code present in the notebook.

Figure 9-19. Model performance metrics for DNN model per epoch

Now that we have a working wine type classification system, let's try to interpret one of these predictive models. One of the key aspects of model interpretation is understanding the importance of each feature in the dataset. We will be using the skater package, which we used in previous chapters, for our model interpretation needs. The following code helps visualize feature importances for our logistic regression model.

In [16]: from skater.core.explanations import Interpretation
    ...: from skater.model import InMemoryModel
    ...:
    ...: wtp_interpreter = Interpretation(wtp_test_SX, feature_names=wtp_features.columns)
    ...: wtp_im_model = InMemoryModel(wtp_lr.predict_proba, examples=wtp_train_SX,
                                      target_names=wtp_lr.classes_)
    ...: plots = wtp_interpreter.feature_importance.plot_feature_importance(wtp_im_model,
                                                                           ascending=False)

Figure 9-20.
Feature importances obtained from our logistic regression model

We can see in Figure 9-20 that density, total sulfur dioxide, and residual sugar are the top three features that contributed toward classifying wine samples as red or white. Another way of understanding how well a model is performing, besides looking at metrics, is to plot a receiver operating characteristic curve, popularly known as a ROC curve. This curve can be plotted using the true positive rate (TPR) and the false positive rate (FPR) of a classifier. TPR, also known as sensitivity or recall, is the proportion of correct positive predictions among all positive samples in the dataset. FPR, also known as the false alarm rate or (1 - specificity), is the proportion of incorrect positive predictions among all negative samples in the dataset. The ROC curve is hence sometimes also known as a sensitivity versus (1 - specificity) plot. The following code uses our model evaluation utilities module to plot the ROC curve for our logistic regression model in the ROC space.

In [17]: meu.plot_model_roc_curve(wtp_lr, wtp_test_SX, wtp_test_y)

Typically, in any ROC curve, the ROC space is between points (0, 0) and (1, 1). Each prediction result from the confusion matrix occupies one point in this ROC space. Ideally, the best prediction model would give a point in the top-left corner (0, 1), indicating perfect classification (100% sensitivity and specificity). A diagonal line depicts a classifier that makes random guesses. If your ROC curve occurs in the top half of the graph, you have a decent classifier, which is better than average. Figure 9-21 makes this clearer.

Figure 9-21. ROC curve for our logistic regression model

If you remember, we achieved almost 100% accuracy with this model, hence the ROC curve is almost perfect, and we also see that the area under the curve (AUC) is 1, which is perfect.
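A useful way to internalize the AUC is its probabilistic reading: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, which can be computed directly by comparing all (positive, negative) score pairs. A minimal pure-Python sketch on made-up labels and scores follows (the helper name auc_from_scores is ours; it is not how plot_model_roc_curve computes its value, though the result agrees with trapezoidal integration of the ROC curve):

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs where the
    positive sample gets the higher score (ties count as half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical true labels and classifier scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_from_scores(labels, scores))  # 0.75
```

An AUC of 0.5 corresponds to the random-guess diagonal, and 1.0 to a perfect ranking of all positives above all negatives, which is essentially what our wine type classifier achieves.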
Finally, using the feature importance ranks we obtained earlier, let’s see if we can visualize the model’s decision surface, or decision boundary, which gives us a visual depiction of how well the model has learned data points pertaining to each class and how it separates points belonging to different classes. This surface is the hypersurface that separates the underlying vector space of data samples based on their features (the feature space). If this surface is linear, the classification problem is linear and the hypersurface is also known as a hyperplane. Our model evaluation utilities module helps us plot this with an easy-to-use function (do note that this works only for scikit-learn estimators at the moment, since a clone function for keras-based estimators was only published recently as of writing this book; we might push a change sometime in the future once it is stable).

In [18]: feature_indices = [i for i, feature in enumerate(wtp_feature_names)
    ...:                        if feature in ['density', 'total sulfur dioxide']]
    ...: meu.plot_model_decision_surface(clf=wtp_lr,
    ...:                                 train_features=wtp_train_SX[:, feature_indices],
    ...:                                 train_labels=wtp_train_y, plot_step=0.02,
    ...:                                 cmap=plt.cm.Wistia_r, markers=[',', 'o'],
    ...:                                 alphas=[0.9, 0.6], colors=['r', 'y'])

Since we want to plot the decision surface on the underlying feature space, visualization becomes extremely difficult with more than two features. Hence, for the sake of simplicity and ease of interpretation, we will use the top two most important features (density and total sulfur dioxide) to visualize the model decision surface. This is done by fitting a clone of the original model estimator on those two features and then plotting the decision surface based on what it has learned.
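Under the hood, a decision-surface plot of this kind boils down to predicting over a dense grid spanning the two chosen features and contouring the results. A minimal, self-contained sketch on synthetic data; this is an illustration of the general technique, not the book's plot_model_decision_surface(...) implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(42)
# Two 2D blobs standing in for two wine classes
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
clf = LogisticRegression().fit(X, y)

# Build a grid spanning the feature space, then predict on every grid point
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
# Z can now be drawn with plt.contourf(xx, yy, Z) plus a scatter of X
print(Z.shape == xx.shape)
```

The plot_step parameter seen in the call above plays the same role as the 0.02 grid spacing here: smaller steps give smoother boundaries at higher cost.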
Check out the plot_model_decision_surface(...) function for more low-level details on how we visualize the decision surfaces.

Figure 9-22. Visualizing the model decision surface for our logistic regression model

The plot depicted in Figure 9-22 reinforces the fact that our model has learned the underlying patterns quite well based on just the two most important features, which it has used to separate out the majority of the red wine samples from the white wine samples, depicted by the scatter dots. There are very few misclassifications here and there, which are evident from the statistics we obtained earlier in the confusion matrices.

Predicting Wine Quality

In our wine quality dataset, we have several quality rating classes ranging from 3 to 9. We will be focusing on the quality_label variable, which classifies wine into low, medium, and high ratings derived from the underlying quality variable, using the mapping we created in the “Exploratory Data Analysis” section. This is done because several rating scores have very few wine samples, hence similar quality ratings were clubbed together into one quality class rating. We use the prefix wqp_ for all variables and models involved in the prediction of wine quality to distinguish them from our other analyses; wqp stands for wine quality prediction. We will evaluate tree based classification models as well as ensemble models in this section. The following code helps us prepare our train and test datasets for modeling.
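Before preparing the data, here is a quick sketch of the kind of bucketing described above. The thresholds below are illustrative only; the authoritative mapping is the one defined in the “Exploratory Data Analysis” section of this chapter.

```python
# Collapse raw quality scores (3-9) into three class labels.
# Thresholds are an assumption for illustration; see the EDA section
# for the actual mapping used to create quality_label.
def quality_to_label(score):
    if score <= 5:
        return 'low'
    elif score <= 7:
        return 'medium'
    return 'high'

print([quality_to_label(s) for s in [3, 5, 6, 7, 8, 9]])
# ['low', 'low', 'medium', 'medium', 'high', 'high']
```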
In [19]: wqp_features = wines.iloc[:,:-3]
    ...: wqp_class_labels = np.array(wines['quality_label'])
    ...: wqp_label_names = ['low', 'medium', 'high']
    ...: wqp_feature_names = list(wqp_features.columns)
    ...: wqp_train_X, wqp_test_X, wqp_train_y, wqp_test_y = train_test_split(wqp_features,
    ...:                                     wqp_class_labels, test_size=0.3, random_state=42)
    ...:
    ...: print(Counter(wqp_train_y), Counter(wqp_test_y))
    ...: print('Features:', wqp_feature_names)

Counter({'medium': 2737, 'low': 1666, 'high': 144}) Counter({'medium': 1178, 'low': 718, 'high': 54})
Features: ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']

From the preceding output, it is evident we use the same physicochemical wine features. The number of samples in each quality rating class is also depicted. It is quite evident we have very few wine samples of high class rating and a lot of medium quality wine samples. We move on to the next step of feature scaling.

In [20]: # Define the scaler
    ...: wqp_ss = StandardScaler().fit(wqp_train_X)
    ...: # Scale the train set
    ...: wqp_train_SX = wqp_ss.transform(wqp_train_X)
    ...: # Scale the test set
    ...: wqp_test_SX = wqp_ss.transform(wqp_test_X)

Let’s train a tree based model on this data. The Decision Tree Classifier is an excellent example of a classic tree model. It is based on the concept of decision trees, which focus on using a tree-like graph or flowchart to model decisions and their possible outcomes. Each decision node in the tree represents a decision test on a specific data attribute. Edges or branches from each node represent possible outcomes of the decision test. Each leaf node represents a predicted class label.
To get all the end-to-end classification rules, you need to consider the paths from the root node to the leaf nodes. Decision tree models in the context of Machine Learning are non-parametric supervised learning methods, which use these decision tree based structures for classification and regression tasks. The core objective is to build a model such that we can predict the value of a target response variable by leveraging decision tree based structures to learn decision rules from the input data features. The main advantage of decision tree based models is model interpretability, since it is quite easy to understand and interpret the decision rules that led to a specific model prediction. Other advantages include the model’s ability to handle both categorical and numeric data with ease, as well as multi-class classification problems. Trees can even be visualized to understand and interpret decision rules better. The following snippet leverages the DecisionTreeClassifier estimator to build a decision tree model and predict the wine quality ratings of our wine samples.

In [21]: from sklearn.tree import DecisionTreeClassifier
    ...: # train the model
    ...: wqp_dt = DecisionTreeClassifier()
    ...: wqp_dt.fit(wqp_train_SX, wqp_train_y)
    ...: # predict and evaluate performance
    ...: wqp_dt_predictions = wqp_dt.predict(wqp_test_SX)
    ...: meu.display_model_performance_metrics(true_labels=wqp_test_y,
    ...:                     predicted_labels=wqp_dt_predictions, classes=wqp_label_names)

Figure 9-23. Model performance metrics for decision tree for wine quality predictive model

We get an overall F1 Score and model accuracy of approximately 73%, as depicted in Figure 9-23, which is not bad for a start. Looking at the class based statistics, we can see that the recall for the high quality wine samples is pretty bad, since a lot of them have been misclassified into the medium and low quality ratings.
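The class based recall statistics referenced here can be computed directly with scikit-learn. A toy sketch on made-up labels (not the actual wine predictions):

```python
from sklearn.metrics import recall_score

# Toy ground truth vs. predictions for a 3-class problem (illustrative only)
true_labels = ['high', 'high', 'medium', 'medium', 'medium', 'low', 'low']
predictions = ['medium', 'low', 'medium', 'medium', 'low', 'low', 'low']

# average=None returns one recall value per class, in the given label order
per_class = recall_score(true_labels, predictions,
                         labels=['low', 'medium', 'high'], average=None)
print(dict(zip(['low', 'medium', 'high'], per_class)))
```

Here both true 'high' samples were misclassified, so the 'high' recall is 0.0, mirroring the pattern the chapter describes for the minority class.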
This is expected, since we do not have a lot of training samples for high quality wine, if you remember our training sample sizes from earlier. Considering the low and high quality rated wine samples, we should at least try to see if we can prevent our model from predicting a low quality wine as high, and similarly prevent it from predicting a high quality wine as low. To interpret this model, you can use the following code to look at the feature importance scores based on the patterns learned by our model.

In [22]: wqp_dt_feature_importances = wqp_dt.feature_importances_
    ...: wqp_dt_feature_names, wqp_dt_feature_scores = zip(*sorted(zip(wqp_feature_names,
    ...:                                    wqp_dt_feature_importances), key=lambda x: x[1]))
    ...: y_position = list(range(len(wqp_dt_feature_names)))
    ...: plt.barh(y_position, wqp_dt_feature_scores, height=0.6, align='center')
    ...: plt.yticks(y_position, wqp_dt_feature_names)
    ...: plt.xlabel('Relative Importance Score')
    ...: plt.ylabel('Feature')
    ...: t = plt.title('Feature Importances for Decision Tree')

Figure 9-24. Feature importances obtained from our decision tree model

We can clearly observe from Figure 9-24 that the most important features have changed as compared to our previous model. Alcohol and volatile acidity occupy the top two ranks, and total sulfur dioxide seems to be one of the most important features for classifying both wine type and quality (as observed in Figure 9-20). If you remember, we mentioned earlier that you can also easily visualize the decision tree structure from decision tree models and check out the decision rules it learned from the underlying features used in prediction for new data samples. The following code helps us visualize decision trees.
In [23]: from graphviz import Source
    ...: from sklearn import tree
    ...: from IPython.display import Image
    ...:
    ...: graph = Source(tree.export_graphviz(wqp_dt, out_file=None, class_names=wqp_label_names,
    ...:                                     filled=True, rounded=True, special_characters=False,
    ...:                                     feature_names=wqp_feature_names, max_depth=3))
    ...: png_data = graph.pipe(format='png')
    ...: with open('dtree_structure.png','wb') as f:
    ...:     f.write(png_data)
    ...: Image(png_data)

Figure 9-25. Visualizing our decision tree model

Our decision tree model has a huge number of nodes and branches, hence we visualized our tree for a max depth of three based on the preceding snippet. You can start observing the decision rules from the tree in Figure 9-25, where the starting split is determined by the rule alcohol <= -0.1277 and, with each yes\no decision branch split, we have further decision nodes as we descend into the tree at each depth level. The class variable is what we are trying to predict, i.e., wine quality being low, medium, or high, and value determines the total number of samples of each class present in the current decision node. The gini parameter is the criterion used to determine and measure the quality of the split at each decision node. Best splits can be determined by metrics like the gini impurity\gini index or information gain. Just to give you some context, the gini impurity is a metric that helps in minimizing the probability of misclassification. It is usually mathematically denoted as

I_G(p) = \sum_{i=1}^{C} p_i (1 - p_i) = 1 - \sum_{i=1}^{C} p_i^2

where we have C classes to predict, p_i is the fraction of items labeled as class i, or the probability measure of an instance with class label i being chosen, and (1 - p_i) is the mistake in categorizing that item, or the misclassification measure.
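As a quick numeric check of the formula, here is a minimal implementation over raw class counts at a node:

```python
import numpy as np

def gini_impurity(class_counts):
    """I_G(p) = 1 - sum_i p_i^2 over the class fractions at a node."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()                 # class fractions p_i
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([50, 50]))      # maximally mixed 2-class node -> 0.5
print(gini_impurity([100, 0]))      # pure node -> 0.0
```

A pure node scores 0, and a split is preferred when the (weighted) impurity of the children is lower than that of the parent.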
The gini impurity\index is computed by summing the squares of the fractions of classified instances for each of the C class labels and subtracting the result from 1. Interested readers can check out some standard literature on decision trees to dive deeper into the differences between entropy and gini, or to understand the more intricate mathematical details.

Moving forward with our mission of improving our wine quality predictive model, let’s look at some ensemble modeling methods. Ensemble models are typically Machine Learning models that combine, or take a weighted (average\majority) vote of, the predictions of individual base model estimators that have been built using supervised methods of their own. The ensemble is expected to generalize better over the underlying data, be more robust, and make superior predictions as compared to each individual base model. Ensemble models can be categorized under three major families.

• Bagging methods: The term bagging stands for bootstrap aggregating, where the ensemble model tries to improve prediction accuracy by combining predictions of individual base models trained over randomly generated training samples. Bootstrap samples, i.e., independent samples drawn with replacement, are taken from the original training dataset and several base models are built on these sampled datasets. At prediction time, an average of the predictions from the individual estimators is taken for the ensemble model to make its final prediction. Random sampling tries to reduce model variance, reduce overfitting, and boost prediction accuracy. Examples include the very popular random forests.

• Boosting methods: In contrast to bagging methods, which operate on the principle of combining or averaging, in boosting methods we build the ensemble model incrementally by training each base model estimator sequentially.
Training each model involves putting special emphasis on learning the instances which the ensemble previously misclassified. The idea is to combine several weak base learners to form a powerful ensemble. Weak learners are trained sequentially over multiple iterations of the training data with weight modifications inserted at each retrain phase. At each re-training of a weak base learner, higher weights are assigned to those training instances which were misclassified previously. Thus, these methods try to focus on the training instances which were wrongly predicted in the previous training sequence. Boosted models are prone to overfitting, so one should be very careful. Examples include Gradient Boosting, AdaBoost, and the very popular XGBoost.

• Stacking methods: In stacking based methods, we first build multiple base models over the training data. Then the final ensemble model is built by taking the output predictions from these models as additional inputs for training to make the final prediction.

Let’s now try building a model using random forests, a very popular bagging method. In the random forest model, each base learner is a decision tree model trained on a bootstrap sample of the training data. Besides this, when we want to split a decision node in the tree, the split is chosen from a random subset of the features instead of taking the best split over all the features. Due to the introduction of this randomness, the bias of individual trees increases slightly, but when we average the results from all the trees in the forest, the overall variance decreases, giving us a robust ensemble model that generalizes well. We will be using the RandomForestClassifier from scikit-learn, which averages the probabilistic predictions from all the trees in the forest for the final prediction, instead of taking the actual prediction votes and then averaging them.
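This probability averaging (often called "soft voting") can be sketched in isolation with hypothetical per-tree outputs, contrasted against plain majority ("hard") voting:

```python
import numpy as np

# Hypothetical class-probability outputs of three base trees for one sample
# (illustrative numbers, not from the wine model)
tree_probs = np.array([[0.6, 0.3, 0.1],
                       [0.4, 0.5, 0.1],
                       [0.7, 0.2, 0.1]])

# Soft voting: average probabilities across trees, then pick the best class
avg = tree_probs.mean(axis=0)
soft_vote = int(np.argmax(avg))

# Hard voting, for contrast: each tree votes with its own argmax class
votes = np.argmax(tree_probs, axis=1)
hard_vote = int(np.bincount(votes).argmax())
print(soft_vote, hard_vote)    # 0 0
```

Here both schemes agree, but they can differ when a tree is confidently wrong; soft voting lets well-calibrated probabilities carry more weight.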
In [24]: from sklearn.ensemble import RandomForestClassifier
    ...: # train the model
    ...: wqp_rf = RandomForestClassifier()
    ...: wqp_rf.fit(wqp_train_SX, wqp_train_y)
    ...: # predict and evaluate performance
    ...: wqp_rf_predictions = wqp_rf.predict(wqp_test_SX)
    ...: meu.display_model_performance_metrics(true_labels=wqp_test_y,
    ...:                                     predicted_labels=wqp_rf_predictions,
    ...:                                     classes=wqp_label_names)

Figure 9-26. Model performance metrics for random forest for wine quality predictive model

The model prediction results on the test dataset depict an overall F1 Score and model accuracy of approximately 77%, as seen in Figure 9-26. This is definitely an improvement of 4% over what we obtained with just decision trees, proving that ensemble learning is working better. Another way to further improve on this result is model tuning. To be more specific, models have hyperparameters that can be tuned, as we discussed previously in detail in the “Model Tuning and Hyperparameter Tuning” sections in Chapter 5. Hyperparameters are also known as meta-parameters and are usually set before we start the model training process. These hyperparameters do not have any dependency on being derived from the underlying data on which the model is trained. They usually represent high-level concepts or knobs, which can be used to tweak and tune the model during training to improve its performance. Our random forest model has several hyperparameters; you can view their default values as follows.
In [25]: print(wqp_rf.get_params())

{'bootstrap': True, 'random_state': None, 'verbose': 0, 'min_samples_leaf': 1, 'min_weight_fraction_leaf': 0.0, 'max_depth': None, 'class_weight': None, 'max_leaf_nodes': None, 'oob_score': False, 'criterion': 'gini', 'n_estimators': 10, 'max_features': 'auto', 'min_impurity_split': 1e-07, 'n_jobs': 1, 'warm_start': False, 'min_samples_split': 2}

From the preceding output, you can see a number of hyperparameters. We recommend checking out the official documentation at http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html to learn more about each parameter. For hyperparameter tuning, we will keep things simple and focus our attention on n_estimators, which represents the total number of base tree models in the forest ensemble, and max_features, which represents the number of features to consider when looking for each best split. We use a standard grid search method with five-fold cross validation to select the best hyperparameters.

In [26]: from sklearn.model_selection import GridSearchCV
    ...:
    ...: param_grid = {
    ...:                 'n_estimators': [100, 200, 300, 500],
    ...:                 'max_features': ['auto', None, 'log2']
    ...:             }
    ...:
    ...: wqp_clf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5,
    ...:                        scoring='accuracy')
    ...: wqp_clf.fit(wqp_train_SX, wqp_train_y)
    ...: print(wqp_clf.best_params_)

{'max_features': 'auto', 'n_estimators': 200}

We can see the chosen values of the hyperparameters obtained after the grid search in the preceding output. We have 200 estimators and auto maximum features, which represents the square root of the total number of features to be considered during the best split operations. The scoring parameter was set to accuracy to evaluate the model for best accuracy.
You can set it to other parameters to evaluate the model on other metrics like F1 score, precision, recall, and so on. Check out http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter for further details. You can view the grid search results for all the hyperparameter combinations as follows.

In [27]: results = wqp_clf.cv_results_
    ...: for param, score_mean, score_sd in zip(results['params'], results['mean_test_score'],
    ...:                                        results['std_test_score']):
    ...:     print(param, round(score_mean, 4), round(score_sd, 4))

{'max_features': 'auto', 'n_estimators': 100} 0.7928 0.0119
{'max_features': 'auto', 'n_estimators': 200} 0.7955 0.0101
{'max_features': 'auto', 'n_estimators': 300} 0.7941 0.0086
{'max_features': 'auto', 'n_estimators': 500} 0.795 0.0094
{'max_features': None, 'n_estimators': 100} 0.7847 0.0144
{'max_features': None, 'n_estimators': 200} 0.781 0.0149
{'max_features': None, 'n_estimators': 300} 0.784 0.0128
{'max_features': None, 'n_estimators': 500} 0.7858 0.0107
{'max_features': 'log2', 'n_estimators': 100} 0.7928 0.0119
{'max_features': 'log2', 'n_estimators': 200} 0.7955 0.0101
{'max_features': 'log2', 'n_estimators': 300} 0.7941 0.0086
{'max_features': 'log2', 'n_estimators': 500} 0.795 0.0094

The preceding output depicts the selected hyperparameter combinations and their corresponding mean accuracy and standard deviation values across the grid. Let’s train a new random forest model with the tuned hyperparameters and evaluate its performance on the test data.

In [28]: wqp_rf = RandomForestClassifier(n_estimators=200, max_features='auto', random_state=42)
    ...: wqp_rf.fit(wqp_train_SX, wqp_train_y)
    ...:
    ...: wqp_rf_predictions = wqp_rf.predict(wqp_test_SX)
    ...: meu.display_model_performance_metrics(true_labels=wqp_test_y,
    ...:                        predicted_labels=wqp_rf_predictions, classes=wqp_label_names)

Figure 9-27.
Model performance metrics for tuned random forest for wine quality predictive model

The model prediction results on the test dataset depict an overall F1 Score and model accuracy of approximately 81%, as seen in Figure 9-27. This is quite good, considering we got an improvement of 4% over the initial random forest model before tuning, and an overall improvement of 8% over the base decision tree model. Also, we can see that no low quality wine sample has been misclassified as high; similarly, no high quality wine sample has been misclassified as low. There is a considerable overlap between the medium and high\low quality wine samples, but that is expected given the nature of the data and the class distribution.

Another way of building ensemble based models is boosting. A very popular method is XGBoost, which stands for Extreme Gradient Boosting. It is a variant of the Gradient Boosting Machines (GBM) model. This model is extremely popular in the Data Science community owing to its superior performance in several Data Science challenges and competitions, especially on Kaggle. To use this model, you can install the xgboost package in Python. For details on this framework, feel free to check out the official web site at http://xgboost.readthedocs.io/en/latest, which offers detailed documentation on installation, model tuning, and much more. Credit goes to the Distributed Machine Learning Community, popularly known as DMLC, for creating the XGBoost framework along with the popular MXNet Deep Learning framework. Gradient boosting uses the principles of the boosting methodology for ensembling, which we discussed earlier, and applies gradient descent to minimize the error or loss when adding new weak base learners.
Going into the details of the model internals would be out of the current scope, but we recommend checking out http://xgboost.readthedocs.io/en/latest/model.html for a nice introduction to boosted trees and the principles behind XGBoost. We trained a basic XGBoost model first on our data and obtained an overall accuracy of around 74%. After tuning the model with grid search, we trained the model with the following parameter values and evaluated its performance on the test data (detailed step-by-step snippets are available in the jupyter notebook).

In [29]: import os
    ...: mingw_path = r'C:\mingw-w64\mingw64\bin'
    ...: os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']
    ...: import xgboost as xgb
    ...:
    ...: # train the model on tuned hyperparameters
    ...: wqp_xgb_model = xgb.XGBClassifier(seed=42, max_depth=10, learning_rate=0.3,
    ...:                                   n_estimators=100)
    ...: wqp_xgb_model.fit(wqp_train_SX, wqp_train_y)
    ...: # evaluate and predict performance
    ...: wqp_xgb_predictions = wqp_xgb_model.predict(wqp_test_SX)
    ...: meu.display_model_performance_metrics(true_labels=wqp_test_y,
    ...:                        predicted_labels=wqp_xgb_predictions, classes=wqp_label_names)

Figure 9-28. Model performance metrics for tuned XGBoost model for wine quality predictive model

The model prediction results on the test dataset depict an overall F1 Score and model accuracy of approximately 79%, as seen in Figure 9-28. Though the random forest performs slightly better, the XGBoost model definitely performs better than a basic model like a decision tree. Try adding more hyperparameters to tune the model and see if you can get a better model. If you do find one, feel free to send a pull request to our repository! We have successfully built a decent wine quality classifier using several techniques and also seen the importance of model tuning and validation.
Let’s take our best model and run some model interpretation tasks on it to try to understand it better. To start with, we can look at the feature importance ranks given to the various features in the dataset. The following snippet shows comparative feature importance plots using skater as well as the default feature importances obtained from the scikit-learn model itself. We build an Interpretation and an InMemoryModel object using skater, which will be useful for our future analyses in model interpretation.

In [31]: from skater.core.explanations import Interpretation
    ...: from skater.model import InMemoryModel
    ...: # leveraging skater for feature importances
    ...: interpreter = Interpretation(wqp_test_SX, feature_names=wqp_feature_names)
    ...: wqp_im_model = InMemoryModel(wqp_rf.predict_proba, examples=wqp_train_SX,
    ...:                              target_names=wqp_rf.classes_)
    ...: # retrieving feature importances from the scikit-learn estimator
    ...: wqp_rf_feature_importances = wqp_rf.feature_importances_
    ...: wqp_rf_feature_names, wqp_rf_feature_scores = zip(*sorted(zip(wqp_feature_names,
    ...:                                        wqp_rf_feature_importances), key=lambda x: x[1]))
    ...: # plot the feature importance plots
    ...: f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
    ...: t = f.suptitle('Feature Importances for Random Forest', fontsize=12)
    ...: f.subplots_adjust(top=0.85, wspace=0.6)
    ...: y_position = list(range(len(wqp_rf_feature_names)))
    ...: ax1.barh(y_position, wqp_rf_feature_scores, height=0.6, align='center',
    ...:          tick_label=wqp_rf_feature_names)
    ...: ax1.set_title("Scikit-Learn")
    ...: ax1.set_xlabel('Relative Importance Score')
    ...: ax1.set_ylabel('Feature')
    ...: plots = interpreter.feature_importance.plot_feature_importance(wqp_im_model,
    ...:                                        ascending=False, ax=ax2)
    ...: ax2.set_title("Skater")
    ...: ax2.set_xlabel('Relative Importance Score')
    ...: ax2.set_ylabel('Feature')

Figure 9-29. Comparative feature importance analysis obtained from our tuned random forest model

We can clearly observe from Figure 9-29 that the most important features are consistent across the two plots, which is expected considering we are just using different interfaces on the same model. The top two features are alcohol by volume and the volatile acidity content. We will be using them shortly for further analysis. But right now, let’s look at the model’s ROC curve and the area under curve (AUC) statistics. Plotting a binary classifier’s ROC curve is easy, but what do you do when you are dealing with a multi-class classifier (3-class in our case)? There are several ways to do this. You need to binarize the output; once that is done, you can plot one ROC curve per class label. Besides this, you can also use two aggregation schemes for computing average ROC measures. Micro-averaging involves plotting an ROC curve over the entire prediction space by considering each predicted element as a binary prediction, hence giving equal weight to each classification decision. Macro-averaging involves giving equal weight to each class label when averaging. Our model_evaluation_utils module has a nifty customizable function, plot_model_roc_curve(...), which can help plot multi-class classifier ROC curves with both micro- and macro-averaging capabilities. We recommend you check out the code, which is pretty self-explanatory. Let’s now plot the ROC curve for our random forest classifier.

In [32]: meu.plot_model_roc_curve(wqp_rf, wqp_test_SX, wqp_test_y)

Figure 9-30. ROC curve for our tuned random forest model

You can see the various ROC plots (per-class and averaged) in Figure 9-30 for our tuned random forest model.
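The binarize-then-average recipe described above can be sketched with scikit-learn's label_binarize and roc_auc_score. The scores below are hypothetical and cleanly separable, so every AUC comes out as 1.0:

```python
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score

classes = ['low', 'medium', 'high']
y_true = ['low', 'medium', 'high', 'medium', 'low', 'high']
# Hypothetical predicted probabilities, one column per class
y_score = np.array([[0.8, 0.1, 0.1],
                    [0.2, 0.7, 0.1],
                    [0.1, 0.2, 0.7],
                    [0.3, 0.5, 0.2],
                    [0.6, 0.3, 0.1],
                    [0.2, 0.3, 0.5]])

# One indicator column per class, in the order given by `classes`
Y = label_binarize(y_true, classes=classes)

# One AUC per class (the per-class ROC curves), then a micro-average that
# treats every element of the binarized matrix as one binary prediction
per_class_auc = [roc_auc_score(Y[:, i], y_score[:, i]) for i in range(3)]
micro_auc = roc_auc_score(Y.ravel(), y_score.ravel())
print(per_class_auc, micro_auc)
```

A macro-average would instead be the plain mean of the per-class AUCs, giving each class equal weight regardless of its sample count.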
The AUC is pretty good based on what we see. The dotted lines indicate the per-class ROC curves and the lines in bold are the macro- and micro-average ROC curves. Let’s now revisit our top two most important features, alcohol and volatile acidity. Let’s use them to try to plot the decision surface\boundary of our random forest model, similar to what we did earlier for our logistic regression based wine type classifier in Figure 9-22.

In [33]: feature_indices = [i for i, feature in enumerate(wqp_feature_names)
    ...:                        if feature in ['alcohol', 'volatile acidity']]
    ...: meu.plot_model_decision_surface(clf=wqp_rf,
    ...:                     train_features=wqp_train_SX[:, feature_indices],
    ...:                     train_labels=wqp_train_y, plot_step=0.02, cmap=plt.cm.RdYlBu,
    ...:                     markers=[',', 'd', '+'], alphas=[1.0, 0.8, 0.5],
    ...:                     colors=['r', 'b', 'y'])

Figure 9-31. Visualizing the model decision surface for our tuned random forest model

The plot depicted in Figure 9-31 shows us that the three classes are definitely not as easily distinguishable as the red and white classes were for our wine type classifier. Of course, visualizing hypersurfaces over many features is difficult compared to visualizing just the two most important features, but the plot should give you a good idea that the model is able to distinguish between the classes reasonably well, although there is a certain amount of overlap, especially between the medium quality wine samples and the high and low quality rated samples. Let’s look at some model prediction interpretations similar to what we did in Chapter 7, where we analyzed movie review sentiments. For this, we will be leveraging skater and looking at model predictions. We will try to interpret why the model predicted a class label and which features were influential in its decision.
First we build a LimeTabularExplainer object using the following snippet, which will help us in interpreting and explaining predictions.

from skater.core.local_interpretation.lime.lime_tabular import LimeTabularExplainer

exp = LimeTabularExplainer(wqp_train_SX, feature_names=wqp_feature_names,
                           discretize_continuous=True,
                           class_names=wqp_rf.classes_)

Let’s now look at two wine sample instances from our test dataset. The first instance is a wine of low quality rating. We show the interpretation for the predicted class with maximum probability\confidence using the top_labels parameter. You can set it to 3 to view the same for all three class labels.

exp.explain_instance(wqp_test_SX[10], wqp_rf.predict_proba, top_labels=1).show_in_notebook()

Figure 9-32. Model interpretation for our wine quality model's prediction for a low quality wine

The results depicted in Figure 9-32 show us the features that were primarily responsible for the model predicting the wine quality as low. We can see that the most important feature was alcohol, which makes sense considering what we have obtained so far from our feature importance and model decision surface interpretations. The feature values depicted here are the scaled values obtained after feature scaling. Let’s interpret another prediction, this time for a wine of high quality.

exp.explain_instance(wqp_test_SX[747], wqp_rf.predict_proba, top_labels=1).show_in_notebook()

Figure 9-33. Model interpretation for our wine quality model's prediction for a high quality wine

From the interpretation in Figure 9-33, we can see the features responsible for the model correctly predicting the wine quality as high; the primary feature was again alcohol by volume (besides other features like density, volatile acidity, and so on).
Also, you can notice a stark difference in the scaled values of alcohol for the two instances depicted in Figure 9-32 and Figure 9-33.

To wrap up our discussion on model interpretation, we will talk about partial dependence plots and how they are useful in our scenario. In general, partial dependence describes the marginal impact or influence of a feature on the model's prediction decision while holding the other features constant. Because it is very difficult to visualize high dimensional feature spaces, typically one or two influential and important features are used to visualize partial dependence plots. The scikit-learn framework has functions like partial_dependence(...) and plot_partial_dependence(...), but unfortunately, as of the time of writing this book, these functions only work on boosting models like GBM. The beauty of skater is that we can build partial dependence plots for any model, including our tuned random forest model. We will leverage skater's Interpretation object, interpreter, and the InMemoryModel object, wqp_im_model, based on our random forest model, which we created earlier when we computed the feature importances. The following code depicts one-way partial dependence plots for our model prediction function based on the most important feature, alcohol.

In [36]: axes_list = interpreter.partial_dependence.plot_partial_dependence(['alcohol'],
    ...:             wqp_im_model, grid_resolution=100, with_variance=True, figsize=(4, 3))
    ...: axs = axes_list[0][3:]
    ...: [ax.set_ylim(0, 1) for ax in axs];

Figure 9-34.
One-way partial dependence plots for our random forest model predictor based on alcohol

From the plots in Figure 9-34, we can see that as the alcohol content increases, the confidence/probability of the model predicting the wine to be of medium or high quality increases, and the probability of the wine being of low quality correspondingly decreases. This shows there is definitely some relationship between the class predictions and the alcohol content. The influence of alcohol on predictions for the high class is quite low, which is expected considering that training samples for high quality wine are scarce. Let's now plot two-way partial dependence plots for interpreting our random forest predictor's dependence on alcohol and volatile acidity, the top two influential features.

In [42]: plots_list = interpreter.partial_dependence.plot_partial_dependence([('alcohol',
    ...:                                                             'volatile acidity')],
    ...:                    wqp_im_model, n_samples=1000, figsize=(10, 5), grid_resolution=100)
    ...: axs = plots_list[0][3:]
    ...: [ax.set_zlim(0, 1) for ax in axs];

Figure 9-35. Two-way partial dependence plots for our random forest model predictor based on alcohol and volatile acidity

The plots in Figure 9-35 bear some resemblance to the plots in Figure 9-34. There is some dependency of the high quality class prediction on increasing alcohol and a corresponding decrease in volatile acidity, but due to the lack of training data it is quite weak, as we can see in the leftmost plot. There is also a strong dependency of the low quality class prediction on decreasing alcohol and increasing volatile acidity levels; this is clearly visible in the rightmost plot. The plot in the middle shows medium wine quality class predictions.
We can observe that these predictions have a strong dependency on a corresponding increase in alcohol and decrease in volatile acidity levels. This should give you a good foundation for leveraging partial dependence plots to dive deeper into model interpretation.

Summary

In this case study oriented chapter, we processed, analyzed, and modeled a dataset pertaining to wine samples, with a focus on type and quality ratings. Special emphasis was placed on exploratory data analysis, which is often overlooked by data scientists in a hurry to build and deploy models. Some background in the domain was explored with regard to the various features in our wine dataset, which were explained in detail. We recommend always exploring the domain in which you are solving problems and taking the help of subject matter experts whenever needed, besides focusing on math, Machine Learning, and analysis. We looked at multiple ways to analyze and visualize our data and its features, including descriptive and inferential statistics as well as univariate and multivariate analysis. Special techniques for visualizing categorical and multi-dimensional data were explained in detail. The intent is for you to build on these principles and re-use similar principles and code to visualize attributes and relationships on your own datasets in the future. The two main objectives of this chapter were to build predictive models for wine types and quality based on various physicochemical wine attributes. We covered a variety of predictive models, from linear models like logistic regression to complex models like deep neural networks. Besides this, we also covered tree based models like decision trees and ensemble models like random forests and the very popular extreme gradient boosting model. Various aspects of model training, prediction, evaluation, tuning, and interpretation were covered in detail.
We recommend that you not only build models but also evaluate them thoroughly with validation metrics, use hyperparameter tuning where necessary, and leverage ensemble modeling to build robust, generalized, and superior models. Special focus has also been given to concepts and techniques for interpreting models, including analyzing feature importances, visualizing model ROC curves and decision surfaces, explaining model predictions, and visualizing partial dependence plots.

CHAPTER 10

Analyzing Music Trends and Recommendations

Recommendation engines are probably one of the most popular and well known Machine Learning applications. A lot of people who don't belong to the Machine Learning community often assume that recommendation engines are its only use. Although we know that Machine Learning has a vast subspace where recommendation engines are just one of the candidates, there is no denying their popularity. One of the reasons for it is their ubiquitous nature; anyone who is online, in any way, has been in touch with a recommendation engine in some form or the other. They are used for recommending products on ecommerce sites, travel destinations on travel portals, songs/videos on streaming sites, restaurants on food aggregator portals, etc. The long list underlines their universal application. The popularity of recommendation engines stems from two very important points about them:

• They are easy to implement: Recommendation engines are easy to integrate into an already existing workflow. All we need is to collect some data regarding user trends and patterns, which can normally be extracted from the business' transactional database.

• They work: This statement is equally true for all the other Machine Learning solutions that we have discussed. But an important distinction comes from the fact that they have a very limited downside.
For example, consider a travel portal that suggests a set of the most popular locations from its dataset. The recommendation system may be a trivial one, but its mere presence in front of the user is likely to generate user interest. The firm can definitely gain by developing a sophisticated recommendation engine, but even a very simple one is guaranteed to pay some dividends with minimal investment. This makes them a very attractive proposition.

This chapter studies how we can use transactional data to develop different types of recommendation engines. We will work with an auxiliary dataset derived from a very interesting dataset, the Million Song Dataset. We will go through user listening history and then use it to develop multiple recommendation engines with varying levels of sophistication.

© Dipanjan Sarkar, Raghav Bali and Tushar Sharma 2018 D. Sarkar et al., Practical Machine Learning with Python, https://doi.org/10.1007/978-1-4842-3207-1_10

The Million Song Dataset Taste Profile

The million song dataset is a very popular dataset and is available at https://labrosa.ee.columbia.edu/millionsong/. The original dataset contained quantified audio features of around a million songs ranging over multiple years. The dataset was created as a collaborative project between The Echonest (http://the.echonest.com/) and LabROSA (http://labrosa.ee.columbia.edu/). Although we will not be using this dataset directly, we will be using some parts of it. Several other datasets were spawned from the original million song dataset. One of these was called The Echonest Taste Profile Subset. This particular dataset was created by The Echonest with some undisclosed partners. It contains play counts by anonymous users for songs contained in the million song dataset. The taste profile dataset is quite big, as it contains around 48 million lines of triplets.
The triplets contain the following information:

(user id, song id, play count)

Each row gives the play count of a song, identified by the song ID, for the user identified by the user ID. The overall dataset contains around a million unique users, and around 384,000 songs from the million song dataset appear in it. You can download the dataset from http://labrosa.ee.columbia.edu/millionsong/sites/default/files/challenge/train_triplets.txt.zip. The size of the compressed dataset is around 500MB and, upon decompression, you need around 3.5GB of space. Once you have the data downloaded and uncompressed, you will see how to subset the dataset to reduce its size.

■■Note The million song dataset has several other useful auxiliary datasets. We will not be covering them in detail here, but we encourage you to explore these datasets and use your imagination to develop innovative use cases.

Exploratory Data Analysis

Exploratory data analysis is an important part of any data analysis workflow; by this time, we have established this fact firmly. It becomes even more important in the case of large datasets, as it will often lead us to information that we can use to trim down the dataset a little. As we will see, sometimes we will also go beyond the traditional data access tools to bypass the problems posed by the large size of the data.

Loading and Trimming Data

The first step in the process is loading the data from the uncompressed files. As the data size is around 3GB, we will not load the complete data; we will only load a specified number of rows from the dataset. This can be achieved by using the nrows parameter of the read_csv function provided by pandas.
In [2]: triplet_dataset = pd.read_csv(filepath_or_buffer=data_home+'train_triplets.txt',
   ...:                               nrows=10000, sep='\t', header=None,
   ...:                               names=['user','song','play_count'])

Since the dataset doesn't have a header, we also provided the column names to the function. A subset of the data is shown in Figure 10-1.

Figure 10-1. Sample rows from The Echonest taste profile dataset

The first thing we may want to do with a dataset of this size is determine how many unique users (or songs) we should consider. In the original dataset, we have around a million users, but we want to determine the number of users we actually need to take into account. For example, if 20% of all the users account for around 80% of total play counts, then it would be a good idea to focus our analysis on those 20% of users. Usually this can be done by summarizing the dataset by users (or by songs) and taking a cumulative sum of the play counts. Then we can find out how many users account for 80% of the play counts, and so on. But due to the size of the data, the cumulative summation function provided by pandas will run into trouble, so we will write code that reads the file line by line and extracts play count information per user (or song). This will also serve as a possible method you can use in case the dataset size exceeds the memory available on your system. The code snippet that follows reads the file line by line, extracts the total play count of every user, and persists that information for later use.
In [2]: output_dict = {}
   ...: with open(data_home+'train_triplets.txt') as f:
   ...:     for line_number, line in enumerate(f):
   ...:         user = line.split('\t')[0]
   ...:         play_count = int(line.split('\t')[2])
   ...:         if user in output_dict:
   ...:             play_count += output_dict[user]
   ...:         output_dict.update({user: play_count})
   ...: output_list = [{'user': k, 'play_count': v} for k, v in output_dict.items()]
   ...: play_count_df = pd.DataFrame(output_list)
   ...: play_count_df = play_count_df.sort_values(by='play_count', ascending=False)
   ...: play_count_df.to_csv(path_or_buf='user_playcount_df.csv', index=False)

The persisted dataframe can then be loaded and used based on our requirements. We can use a similar strategy to extract play counts for each of the songs. A few lines from the dataset are shown in Figure 10-2.

Figure 10-2. Play counts for some users

The first thing we want to find out about our dataset is the number of users needed to account for around 40% of the play counts. We have arbitrarily chosen a value of 40% to keep the dataset size manageable; you can experiment with these figures to get different sized datasets, and even leverage big data processing and analysis frameworks like Spark on top of Hadoop to analyze the complete dataset! The following code snippet determines the subset of users that accounts for this percentage of the data. In our case, around 100,000 users account for 40% of the play counts, hence we will subset those users.

In [2]: total_play_count = sum(play_count_df.play_count)
   ...: (float(play_count_df.head(n=100000).play_count.sum())/total_play_count)*100
   ...: play_count_subset = play_count_df.head(n=100000)

In a similar way, we can determine the number of unique songs required to explain 80% of the total play count.
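The "how many top users (or songs) cover X% of plays" question used for both of these subsets reduces to a cumulative sum over the sorted totals. A minimal, self-contained sketch of that calculation on toy data (the column names mirror play_count_df, but the numbers here are made up):

```python
import pandas as pd

# Toy per-user totals, mimicking the persisted play_count_df
play_count_df = pd.DataFrame({'user': ['a', 'b', 'c', 'd', 'e'],
                              'play_count': [500, 250, 150, 70, 30]})
play_count_df = play_count_df.sort_values('play_count', ascending=False)

# Cumulative share of total plays, heaviest user first
cum_share = play_count_df.play_count.cumsum() / play_count_df.play_count.sum()

# Number of top users needed to cover at least 40% of all plays
n_users = int((cum_share < 0.40).sum()) + 1
```

On the real data the same arithmetic is applied to the totals accumulated line by line, since the full file never needs to sit in memory.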
In our case, we will find that around 30,000 songs account for around 80% of the play count. This information is already a great find, as around 10% of the songs contribute 80% of the play count. Using a code snippet similar to the one given previously, we can determine the subset of such songs. With these song and user subsets, we can filter our original dataset so that it contains only the filtered users and songs. The code snippet that follows uses these dataframes to filter the original dataset and then persists the resultant dataset for future use.

In [2]: triplet_dataset = pd.read_csv(filepath_or_buffer=data_home+'train_triplets.txt',
   ...:                               sep='\t', header=None,
   ...:                               names=['user','song','play_count'])
   ...: triplet_dataset_sub = triplet_dataset[triplet_dataset.user.isin(user_subset)]
   ...: del(triplet_dataset)
   ...: triplet_dataset_sub_song = triplet_dataset_sub[triplet_dataset_sub.song.isin(song_subset)]
   ...: del(triplet_dataset_sub)
   ...: triplet_dataset_sub_song.to_csv(path_or_buf=data_home+'triplet_dataset_sub_song.csv',
   ...:                                 index=False)

This subsetting gives us a dataframe with around 10 million rows of tuples. We will use this as the starting dataset for all our future analyses. You can play around with these numbers to arrive at different datasets and possibly different results.

Enhancing the Data

The data we loaded is just the triplet data, so we cannot see the song title, artist name, or album name. We can enhance our data by adding this information about the songs, which is part of the million song database. This data is provided as a SQLite database file. First we will download the track_metadata.db file from the web page at https://labrosa.ee.columbia.edu/millionsong/pages/getting-dataset#subset.
The next step is to read this SQLite database into a dataframe and extract track information by merging it with our triplet dataframe. We will also drop some extra columns that we won't be using for our analysis. The code snippet that follows loads the table, joins it with our subsetted triplet data, and drops the extra columns.

In [2]: conn = sqlite3.connect(data_home+'track_metadata.db')
   ...: cur = conn.cursor()
   ...: cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
   ...: cur.fetchall()
Out[2]: [('songs',)]

The output of the above snippet shows that the database contains a table named songs. We will get all the rows from this table, read them into a dataframe (track_metadata_df_sub below), and merge the result with our triplet data.

In [5]: del(track_metadata_df_sub['track_id'])
   ...: del(track_metadata_df_sub['artist_mbid'])
   ...: track_metadata_df_sub = track_metadata_df_sub.drop_duplicates(['song_id'])
   ...: triplet_dataset_sub_song_merged = pd.merge(triplet_dataset_sub_song, track_metadata_df_sub,
   ...:                                            how='left', left_on='song', right_on='song_id')
   ...: triplet_dataset_sub_song_merged.rename(columns={'play_count':'listen_count'}, inplace=True)
   ...: del(triplet_dataset_sub_song_merged['song_id'])
   ...: del(triplet_dataset_sub_song_merged['artist_id'])
   ...: del(triplet_dataset_sub_song_merged['duration'])
   ...: del(triplet_dataset_sub_song_merged['artist_familiarity'])
   ...: del(triplet_dataset_sub_song_merged['artist_hotttnesss'])
   ...: del(triplet_dataset_sub_song_merged['track_7digitalid'])
   ...: del(triplet_dataset_sub_song_merged['shs_perf'])
   ...: del(triplet_dataset_sub_song_merged['shs_work'])

The final dataset, merged with the triplets dataframe, looks similar to the depiction in Figure 10-3. This will form the starting dataframe for our exploratory data analysis.

Figure 10-3.
Play counts dataset merged with songs metadata

Visual Analysis

Before we start developing various recommendation engines, let's do some visual analysis of our dataset. We will try to see the different trends regarding songs, albums, and releases.

Most Popular Songs

The first thing we can plot for our data concerns the popularity of the different songs in the dataset. We will try to determine the top 20 songs in our dataset. A slight modification of this popularity measure will also serve as our most basic recommendation engine. The following code snippet gives us the most popular songs in our dataset.

In [7]: import matplotlib.pyplot as plt; plt.rcdefaults()
   ...: import numpy as np
   ...:
   ...: popular_songs = triplet_dataset_sub_song_merged[['title','listen_count']].groupby('title').sum().reset_index()
   ...: popular_songs_top_20 = popular_songs.sort_values('listen_count', ascending=False).head(n=20)
   ...: objects = (list(popular_songs_top_20['title']))
   ...: y_pos = np.arange(len(objects))
   ...: performance = list(popular_songs_top_20['listen_count'])
   ...: plt.bar(y_pos, performance, align='center', alpha=0.5)
   ...: plt.xticks(y_pos, objects, rotation='vertical')
   ...: plt.ylabel('Item count')
   ...: plt.title('Most popular songs')
   ...: plt.show()

The plot generated by the code snippet is shown in Figure 10-4. It shows that the most popular song in our dataset is "You're the One". We can also search through our track dataframe to see that the band responsible for that particular track is The Black Keys.

Figure 10-4. Most popular songs

Most Popular Artists

The next thing you may be interested in is, who are the most popular artists in the dataset?
The code to plot this information is quite similar to the code given previously, so we will not include the exact snippet. The resultant graph is shown in Figure 10-5.

Figure 10-5. Most popular artists

We can read from the plot that Coldplay is one of the most popular artists according to our dataset. A keen music enthusiast will notice that we don't have a lot of representation from classic artists like U2 or The Beatles, except for maybe Metallica and Radiohead. This underlines two key points: first, the data is mostly sourced from a generation that's not always listening to the classic artists online; and second, very rarely does an artist who is not of the present generation still score high when it comes to online plays. Surprisingly, the only examples of such behavior in our dataset are Metallica and Radiohead; they have their origins in the 1980s but are still pretty popular when it comes to online play counts. Diverse music genres are nevertheless represented in the top artists, with popular rap artists like Eminem, alternative rock bands like Linkin Park and The Killers, and even pop/rock bands like Train and OneRepublic, besides classic rock or metal bands like Radiohead and Metallica!

Another slightly odd reading is that Coldplay, the most popular artist, doesn't have a candidate in the most popular songs list. This indirectly hints at an even distribution of their play counts across all of their tracks. You can take it as an exercise to determine the song play distribution for each of the artists who appear in the plot in Figure 10-5. This will give you a clue as to whether an artist holds a skewed or uniform popularity, and the idea can be further developed into a full-blown recommendation engine.

User vs. Songs Distribution

The last information we can seek from our dataset concerns the distribution of song counts across users.
This tells us how the number of songs that users listen to is distributed. We can use this information to create different categories of users and modify the recommendation engine on that basis. Users who listen to a very select number of songs can be served by simple recommendation engines, while users who give us lots of insight into their behavior can be candidates for developing complex recommendation engines. Before we plot that distribution, let's find some statistical information about it. The following code calculates the distribution and then shows summary statistics.

In [11]: user_song_count_distribution = triplet_dataset_sub_song_merged[['user','title']].groupby('user').count().reset_index().sort_values(by='title', ascending=False)
    ...: user_song_count_distribution.title.describe()
Out[11]:
count    99996.000000
mean       107.752160
std         79.741555
min          1.000000
25%         53.000000
50%         89.000000
75%        141.000000
max       1189.000000
Name: title, dtype: float64

This gives us some important information about how the song counts are distributed across users. We see that, on average, a user listens to 100+ songs, but some users have a much more voracious appetite for song diversity. Let's try to visualize this distribution. The code that follows plots the distribution of play counts across our dataset. We have intentionally kept the number of bins small, as that gives approximate information about the number of classes of users we have.
In [12]: x = user_song_count_distribution.title
    ...: n, bins, patches = plt.hist(x, 50, facecolor='green', alpha=0.75)
    ...: plt.xlabel('Play Counts')
    ...: plt.ylabel('Probability')
    ...: plt.title(r'$\mathrm{Histogram\ of\ User\ Play\ Count\ Distribution}\ $')
    ...: plt.grid(True)
    ...: plt.show()

The distribution plot generated by the code is shown in Figure 10-6. The image clearly shows that, although we have a huge variance between the minimum and maximum play counts, the bulk of the distribution's mass is centered on 100+ song counts.

Figure 10-6. Distribution of play counts for users

Given the nature of the data, we can perform a large variety of interesting visualizations; for example, plotting how the top tracks of artists are played, the play count distribution on a yearly basis, and so on. But we believe by now you are sufficiently skilled in both the art of asking pertinent questions and answering them through visualizations. So we will conclude our exploratory data analysis and move on to the major focus of the chapter, the development of recommendation engines. Feel free to try out additional analysis and visualizations on this data and, if you find something cool, as always, feel free to send a pull request to the book's code repository!

Recommendation Engines

The work of a recommendation engine is succinctly captured in its name: all it needs to do is make recommendations. But we must admit that this description is deceptively simple. Recommendation engines are a way of modeling and rearranging the information available about user preferences and then using that information to provide informed recommendations. The basis of a recommendation engine is always the recorded interaction between users and products.
For example, a movie recommendation engine will be based on the ratings provided to different movies by users; a news article recommender will take into account the articles the user has read in the past; and so on. This section uses the user-song play count dataset to uncover different ways in which we can recommend new tracks to users. We will start with a very basic system and try to evolve linearly into a sophisticated recommendation system. Before we build those systems, we will examine their utility and the various types of recommendation engines.

Types of Recommendation Engines

The major area of distinction among recommendation engines is the entity that they assume is the most important for generating recommendations. There are different options for choosing this central entity, and that choice determines the type of recommendation engine we will develop.

• User-based recommendation engines: Here the user is the central entity. The algorithm looks for similarities among users and, on the basis of those similarities, comes up with recommendations.

• Content-based recommendation engines: On the other end of the spectrum, we have content-based recommendation engines. Here the central entity is the content we are trying to recommend; in our case, the songs. These algorithms attempt to find features of the content and use them to find similar content. The similarities are then used to make recommendations to the end users.

• Hybrid recommendation engines: These engines take into account both the features of the users and of the content to develop recommendations.
These are also sometimes termed collaborative filtering recommendation engines, as they "collaborate" by using the similarities of content as well as users. They are one of the most effective classes of recommendation engines, as they take the best features of the other two classes.

Utility of Recommendation Engines

In the previous chapter, we discussed an important requirement of any organization: understanding the customer. This requirement is even more important for online businesses, which have almost no physical interaction with their customers. Recommendation engines provide wonderful opportunities for these organizations to not only understand their clientele but also to use that information to increase their revenues. Another important advantage of recommendation engines is that they have a very limited potential downside; the worst thing a user can do is ignore the recommendation made to them. The organization can easily integrate a crude recommendation engine in its interaction with users and then, on the basis of its performance, decide whether to develop a more sophisticated version. Although unverified claims are often made about the impact of recommendation engines on the sales of major online service providers like Netflix, Amazon, and YouTube, an interesting insight into their effectiveness is provided by several papers. We encourage you to read one such paper at http://ai2-s2-pdfs.s3.amazonaws.com/ba21/7822b81c3c9449014cb92e197d8a6baa4914.pdf. The study claims that a good recommendation engine tends to increase sales volume by around 35% and also leads customers to discover more products, which in turn adds to a positive customer experience.
Before we start discussing the various recommendation engines, we would like to thank our friend and fellow Data Scientist, author, and course instructor Siraj Raval for helping us with a major chunk of the code used in this chapter pertaining to recommendation engines, as well as for sharing his codebase (check out Siraj's GitHub at https://github.com/llSourcell). We will be modifying some of his code samples to develop the recommendation engines discussed in the subsequent sections. Interested readers can also check out Siraj's YouTube channel at www.youtube.com/c/sirajology, where he makes excellent videos on machine learning, deep learning, artificial intelligence, and other fun, educational content.

Popularity-Based Recommendation Engine

The simplest recommendation engine is naturally the easiest to develop. The driving logic is that if some item is liked (or listened to) by a vast majority of our user base, then it is a good idea to recommend that item to users who have not interacted with it. The code to develop this kind of recommendation is extremely easy and is effectively just a summarization procedure. We determine which songs in our dataset have the most users listening to them, and that becomes our standard recommendation set for every user. The code that follows defines a function that does this summarization and returns the resultant dataframe.
In [1]: def create_popularity_recommendation(train_data, user_id, item_id):
   ...:     # Get a count of user_ids for each unique song as recommendation score
   ...:     train_data_grouped = train_data.groupby([item_id]).agg({user_id: 'count'}).reset_index()
   ...:     train_data_grouped.rename(columns={user_id: 'score'}, inplace=True)
   ...:     # Sort the songs based upon recommendation score
   ...:     train_data_sort = train_data_grouped.sort_values(['score', item_id], ascending=[0,1])
   ...:     # Generate a recommendation rank based upon score
   ...:     train_data_sort['Rank'] = train_data_sort['score'].rank(ascending=0, method='first')
   ...:     # Get the top 20 recommendations
   ...:     popularity_recommendations = train_data_sort.head(20)
   ...:     return popularity_recommendations

In [2]: recommendations = create_popularity_recommendation(triplet_dataset_sub_song_merged, 'user', 'title')

In [3]: recommendations

We can use this function on our dataset to generate the top 20 recommendations for each of our users. The output of our plain vanilla recommendation system is shown in Figure 10-7. You can see that the recommendations are very similar to the list of most popular songs from the last section, which is expected, as the logic behind both is the same; only the output is different.

Figure 10-7. Recommendation by the popularity recommendation engine

Item Similarity Based Recommendation Engine

In the last section, we witnessed one of the simplest recommendation engines; in this section we deal with a slightly more complex solution. This recommendation engine is based on calculating similarities between a user's items and the other items in our dataset. Before we proceed with the development effort, let's describe how we plan to calculate the "item-item" similarity that is central to this engine. Usually, to define similarity among a set of items, we need a feature set on the basis of which the items can be described.
In our case it will mean features of the songs on the basis of which one song can be differentiated from another. Although we don't have ready access to these features (or do we?), we will define the similarity in terms of the users who listen to these songs. Confused? Consider this mathematical formula, which should give you a little more insight into the metric.

similarity(i, j) = intersection(users(i), users(j)) / union(users(i), users(j))

This similarity metric is known as the Jaccard index (https://en.wikipedia.org/wiki/Jaccard_index) and in our case we can use it to define the similarity between two songs. The basic idea remains that if two songs are being listened to by a large fraction of common users out of the total listeners, the two songs can be said to be similar to each other. On the basis of this similarity metric, we can define the steps that the algorithm will take to recommend a song to a user k:

1. Determine the songs listened to by the user k.
2. Calculate the similarity of each song in the user's list to those in our dataset, using the similarity metric defined previously.
3. Determine the songs that are most similar to the songs already listened to by the user.
4. Select a subset of these songs as recommendations based on the similarity score.

Since Step 2 can become computation-intensive when we have a large number of songs, we will subset our data to the 5,000 most popular songs to make the computation more feasible; this way it is quite unlikely that we will miss out on any important recommendations.

In [4]: song_count_subset = song_count_df.head(n=5000)
   ...: user_subset = list(play_count_subset.user)
   ...: song_subset = list(song_count_subset.song)
   ...: triplet_dataset_sub_song_merged_sub = triplet_dataset_sub_song_merged[triplet_dataset_sub_song_merged.song.isin(song_subset)]

This code will subset our dataset to contain only the most popular 5,000 songs.
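The Jaccard metric defined above can be computed directly from the listener sets of two songs. Here is a minimal sketch; the listener sets are hypothetical illustrations, not drawn from our dataset:

```python
def jaccard_similarity(users_i, users_j):
    """Jaccard index between the listener sets of two songs."""
    users_i, users_j = set(users_i), set(users_j)
    union = users_i | users_j
    if not union:
        # No listeners at all: define the similarity as 0
        return 0.0
    return len(users_i & users_j) / len(union)

# Hypothetical listener sets for two songs
song_a_listeners = {'u1', 'u2', 'u3', 'u4'}
song_b_listeners = {'u3', 'u4', 'u5'}

print(jaccard_similarity(song_a_listeners, song_b_listeners))  # 2 common / 5 total = 0.4
```

Two songs with identical listener sets score 1.0, while songs with no common listeners score 0.0, which is exactly the behavior we want from an implicit similarity measure.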
We will then create our similarity based recommendation engine and generate a recommendation for a random user. We leverage Siraj's Recommenders module here for the item similarity based recommendation system.

In [5]: train_data, test_data = train_test_split(triplet_dataset_sub_song_merged_sub, test_size=0.30, random_state=0)
   ...: is_model = Recommenders.item_similarity_recommender_py()
   ...: is_model.create(train_data, 'user', 'title')
   ...: user_id = list(train_data.user)[7]
   ...: user_items = is_model.get_user_items(user_id)
   ...: is_model.recommend(user_id)

The recommendations for this random user are shown in Figure 10-8. Notice the stark difference from the popularity based recommendation engine; it is quite likely that this particular user would not have been served well by simply recommending the most popular songs in our dataset.

Figure 10-8. Recommendation by the item similarity based recommendation engine

■■Note At the start of this section, we mentioned that we don't readily have access to song features that we can use to define similarity. As part of the million song database, we have those features available for each of the songs in the dataset. You are encouraged to replace this implicit similarity, based on common users, with an explicit similarity based on the features of the songs, and see how the recommendations change.

Matrix Factorization Based Recommendation Engine

Matrix factorization based recommendation engines are probably the most used recommendation engines when it comes to production implementations. In this section, we give an intuition-based introduction to matrix factorization based recommendation engines. We avoid going into a heavily mathematical discussion, as from a practitioner's perspective the intent is to see how we can leverage this to get valuable recommendations from real data.
Matrix factorization refers to the identification of two or more matrices such that when these matrices are multiplied we get the original matrix. Matrix factorization can be used to discover latent features between two different kinds of entities. What are these latent features? Let's discuss that for a moment before we go for a mathematical explanation.

Consider for a moment why you like a certain song: the answer may range from the soulful lyrics, to catchy music, to it being melodious, and many more. We can try to explain a song in mathematical terms by measuring its beats, tempo, and other such features and then define similar features in terms of the user. For example, from a user's listening history we may know that he likes songs with beats that are a bit on the higher side, and so on. Once we have consciously defined such "features," we can use them to find matches for a user based on some similarity criteria. But more often than not, the tough part in this process is defining these features, as we have no handbook for what makes a good feature. It is mostly based on domain experts and a bit of experimentation. But as you will see in this section, you can use matrix factorization to discover these latent features, and they seem to work great.

The starting point of any matrix factorization based method is the utility matrix, as shown in Table 10-1. The utility matrix is a matrix of user x item dimensions, in which each row represents a user and each column stands for an item.

Table 10-1. Example of a Utility Matrix

          Item 1   Item 2   Item 3   Item 4   Item 5
User A      2                          5
User B               1                 5
User C               1                          5

Notice that we have a lot of missing values in the matrix; these are the items that the user hasn't rated, either because he hasn't interacted with them or because he doesn't want to.
We can right away make a guess: say, Item 4 could be a recommendation for User C, because User B and User C both don't like Item 2, so it is likely they may end up liking the same items, in this case Item 4. The process of matrix factorization means finding a low rank approximation of the utility matrix. So we want to break down the utility matrix R into two low rank matrices so that we can recreate R by multiplying those two matrices. Mathematically,

R = U * I^T

Here R is our original rating matrix, U is our user matrix, and I is our item matrix. Assuming the process helps us identify K latent features, our aim is to find two matrices X and Y such that their product (matrix multiplication) approximates R:

X = |U| x K matrix (a matrix with dimensions num_users * factors)
Y = |I| x K matrix (a matrix with dimensions num_items * factors)

Figure 10-9. Matrix Factorization

We can also try to explain the concept of matrix factorization as an image. Based on Figure 10-9, we can regenerate the original matrix by multiplying the two matrices together. To make a recommendation to a user, we can multiply the corresponding user's row from the first matrix by the item matrix and determine the items from the resulting row with maximum ratings. Those will become our recommendations for the user. The first matrix represents the association between the users and the latent features, while the second matrix takes care of the associations between items (songs in our case) and the latent features. Figure 10-9 depicts a typical matrix factorization operation for a movie recommender system, but the intent is to understand the methodology and extend it to build a music recommendation system in this scenario.

Matrix Factorization and Singular Value Decomposition

There are multiple algorithms available for determining the factorization of any matrix.
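As a concrete illustration of this factorization idea, the following minimal numpy sketch (the small ratings matrix and the choice of K = 2 are hypothetical, purely for illustration) factorizes a fully observed matrix with SVD, folds the square roots of the top K singular values into the two factors, and verifies that their product approximates the original matrix:

```python
import numpy as np

# A small, fully observed ratings matrix (hypothetical values):
# rows are users, columns are items
R = np.array([[5.0, 3.0, 1.0, 4.0],
              [4.0, 2.0, 1.0, 4.0],
              [1.0, 5.0, 5.0, 2.0]])

K = 2  # number of latent factors to keep
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Low rank factors: X associates users with latent features,
# Y associates items with latent features
X = U[:, :K] * np.sqrt(s[:K])        # num_users x K
Y = Vt[:K, :].T * np.sqrt(s[:K])     # num_items x K

R_approx = X @ Y.T                   # reconstructed ratings
print(np.round(R_approx, 1))
```

In practice the utility matrix is sparse and far larger; this sketch only demonstrates that the two low rank factors reproduce the original ratings closely, and that a user's predicted ratings are the dot products of that user's row of X with the rows of Y.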
We use one of the simplest algorithms, which is the singular value decomposition, or SVD. Remember that we discussed the mathematics behind SVD in Chapter 1. Here, we explain how the decomposition provided by SVD can be used for matrix factorization. You may remember from Chapter 1 that singular value decomposition of a matrix returns three different matrices: U, S, and V. You can follow these steps to determine the factorization of a matrix using the output of the SVD function:

• Factorize the matrix to obtain the U, S, and V matrices.
• Reduce the matrix S to its first k components. (The function we are using will only provide k dimensions, so we can skip this step.)
• Compute the square root of the reduced matrix Sk to obtain the matrix Sk^(1/2).
• Compute the two resultant matrices U*Sk^(1/2) and Sk^(1/2)*V, as these will serve as our two factorized matrices, as depicted in Figure 10-9.

We can then generate the prediction of user i for product j by taking the dot product of the ith row of the first matrix with the jth column of the second matrix. This information gives us all the knowledge required to build a matrix factorization based recommendation engine for our data.

Building a Matrix Factorization Based Recommendation Engine

After the discussion of the mechanics of matrix factorization based recommendation engines, let's try to create such a recommendation engine on our data. The first thing that we notice is that we have no concept of "rating" in our data; all we have are the play counts of various songs. This is a well-known problem in the case of recommendation engines and is called the "implicit feedback" problem. There are many ways to solve this problem, but we will look at a very simple and intuitive solution. We will replace the play count with a fractional play count, the logic being that this will measure the strength of "likeness" for a song in the range [0, 1]. We can argue about better methods to address this problem, but this is an acceptable solution to our problem.
The following code will complete the task.

In [7]: triplet_dataset_sub_song_merged_sum_df = triplet_dataset_sub_song_merged[['user','listen_count']].groupby('user').sum().reset_index()
   ...: triplet_dataset_sub_song_merged_sum_df.rename(columns={'listen_count':'total_listen_count'}, inplace=True)
   ...: triplet_dataset_sub_song_merged = pd.merge(triplet_dataset_sub_song_merged, triplet_dataset_sub_song_merged_sum_df)
   ...: triplet_dataset_sub_song_merged['fractional_play_count'] = triplet_dataset_sub_song_merged['listen_count']/triplet_dataset_sub_song_merged['total_listen_count']

The modified dataframe is shown in Figure 10-10.

Figure 10-10. Dataset with implicit feedback

The next transformation required is to convert our dataframe into a numpy matrix in the format of a utility matrix. We will convert our dataframe into a sparse matrix, as we will have a lot of missing values and sparse matrices are suitable for representing such a matrix. Since we won't be able to transform our song IDs and user IDs into a numpy matrix directly, we will convert these identifiers into numerical indices. Then we will use these transformed indices to create our sparse numpy matrix. The following code creates such a matrix.
In [8]: from scipy.sparse import coo_matrix
   ...: small_set = triplet_dataset_sub_song_merged
   ...: user_codes = small_set.user.drop_duplicates().reset_index()
   ...: song_codes = small_set.song.drop_duplicates().reset_index()
   ...: user_codes.rename(columns={'index':'user_index'}, inplace=True)
   ...: song_codes.rename(columns={'index':'song_index'}, inplace=True)
   ...: song_codes['so_index_value'] = list(song_codes.index)
   ...: user_codes['us_index_value'] = list(user_codes.index)
   ...: small_set = pd.merge(small_set, song_codes, how='left')
   ...: small_set = pd.merge(small_set, user_codes, how='left')
   ...: mat_candidate = small_set[['us_index_value','so_index_value','fractional_play_count']]
   ...: data_array = mat_candidate.fractional_play_count.values
   ...: row_array = mat_candidate.us_index_value.values
   ...: col_array = mat_candidate.so_index_value.values
   ...: data_sparse = coo_matrix((data_array, (row_array, col_array)), dtype=float)

In [8]: data_sparse
Out   : <99996x30000 sparse matrix of type '<class 'numpy.float64'>'
        with 10774785 stored elements in COOrdinate format>

Once we have converted our matrix into a sparse matrix, we will use the svds function provided by the scipy library to break down our utility matrix into three different matrices. We can specify the number of latent factors we want to factorize our data into. In this example we will use 50 latent factors, but you are encouraged to experiment with different numbers of latent factors and observe how the recommendations change as a result. The following code creates the decomposition of our matrix and predicts recommendations for the same user as in the item similarity recommendation use case. We leverage our compute_svd(...) function to perform the SVD operation and then use compute_estimated_matrix(...) for the low rank matrix approximation after factorization. Detailed steps, with function implementations, are as always present in the jupyter notebook.
In [9]: K = 50
   ...: #Initialize a sample user rating matrix
   ...: urm = data_sparse
   ...: MAX_PID = urm.shape[1]
   ...: MAX_UID = urm.shape[0]
   ...:
   ...: #Compute SVD of the input user ratings matrix
   ...: U, S, Vt = compute_svd(urm, K)
   ...: uTest = [27513]
   ...:
   ...: #Get estimated ratings for the test user
   ...: print("Predicted ratings:")
   ...: uTest_recommended_items = compute_estimated_matrix(urm, U, S, Vt, uTest, K, True)
   ...: for user in uTest:
   ...:     print("Recommendation for user with user id {}".format(user))
   ...:     rank_value = 1
   ...:     for i in uTest_recommended_items[user, 0:10]:
   ...:         song_details = small_set[small_set.so_index_value == i].drop_duplicates('so_index_value')[['title','artist_name']]
   ...:         print("The number {} recommended song is {} BY {}".format(rank_value, list(song_details['title'])[0], list(song_details['artist_name'])[0]))
   ...:         rank_value += 1

The recommendations made by the matrix factorization based system are shown in Figure 10-11. If you refer to the jupyter notebook for the code, you will observe that the user with ID 27513 is the same one for whom we performed the item similarity based recommendations earlier. Refer to the notebook for further details on how the different functions we used earlier have been implemented and used for our recommendations. Note how the recommendations have changed from the previous two systems. It looks like this user might like listening to some Coldplay and Radiohead; definitely interesting!

Figure 10-11. Recommendation using the matrix factorization based recommender

We used one of the simplest matrix factorization algorithms for our use case; you can try to find other, more sophisticated implementations of the matrix factorization routine, which will lead to a different recommendation system. Another topic that we have touched on only briefly is the conversion of song play counts into a measure of "implicit feedback". The system that we chose is acceptable, but is far from perfect.
There is a lot of literature that discusses the handling of this issue. We encourage you to find different ways of handling it and experiment with various measures!

A Note on Recommendation Engine Libraries

You might have noticed that we did not use any readily available packages for building our recommendation system. As with every possible task, Python has multiple libraries available for building recommendation engines too. But we have refrained from using such libraries because we wanted to give you a taste of what goes on behind a recommendation engine. Most of these libraries take the sparse matrix format as input data and allow you to develop recommendation engines. We encourage you to experiment with at least one of these libraries, as it will give you an idea of the different possible implementations of the same problem and the differences those implementations can make. Some libraries that you can use for such exploration include scikit-surprise, lightfm, crab, rec_sys, etc.

Summary

In this chapter, we learned about recommendation systems, which are an important and widely known Machine Learning application. We discovered a very popular data source that allowed us to peek inside a small section of online music listeners. We then started on the learning curve of building different types of recommendation engines. We started with a simple vanilla version based on the popularity of songs. After that, we upped the complexity quotient of our recommendation engines by developing an item similarity based recommendation engine. We urge you to extend that recommendation engine using the metadata provided as part of the million song database. Finally, we concluded the chapter by building a recommendation engine that takes an entirely different point of view.
We explored some basics of matrix factorization and learned how a very basic factorization method like singular value decomposition can be used for developing recommendations. We ended the chapter by mentioning the different libraries that you can use to develop sophisticated engines from the dataset that we created. We would like to close this chapter by further underlining the importance of recommendation engines, especially in the context of online content delivery. An uncorroborated claim that's often associated with recommendation systems is that "60% of Netflix movie viewings are through recommendations". According to Wikipedia, Netflix has an annual revenue of around $8 billion. Even if half of that comes from movies, and even if the figure is 30% instead of 60%, it means that around $1 billion of Netflix revenue can be attributed to recommendation engines. Although we will never be able to verify these figures, they are definitely a strong argument for recommendation engines and the value they can generate.

CHAPTER 11

Forecasting Stock and Commodity Prices

In the chapters so far, we covered a variety of concepts and solved diverse real-world problems. In this chapter, we will dive into forecasting/prediction use cases. Predictive analytics or modeling involves concepts from data mining, advanced statistics, Machine Learning, and so on to model historical data to forecast future events. Predictive modeling has use cases across domains such as financial services, healthcare, telecommunications, etc. There are a number of techniques developed over the years to understand temporal data and model patterns to score future events and behaviors. Time series analysis forms the descriptive aspect of such data, and this understanding helps in modeling and forecasting it. Traditional approaches like regression analysis (see Chapter 6) and Box-Jenkins methodologies have been deeply studied and applied over the years.
More recently, improvements in computation and Machine Learning algorithms have seen Machine Learning techniques like neural networks, or to be more specific, Deep Learning, making headway in forecasting use cases with some amazing results. This chapter discusses forecasting using stock and commodity price datasets. We will utilize traditional time series models as well as deep learning models like recurrent neural networks to forecast the prices. Through this chapter, we will cover the following topics:

• Brief overview of time series analysis
• Forecasting commodity prices using traditional approaches like ARIMA
• Forecasting stock prices with newer deep learning approaches like RNNs and LSTMs

The code samples, jupyter notebooks, and sample datasets for this chapter are available in the GitHub repository for this book at https://github.com/dipanjanS/practical-machine-learning-with-python under the directory/folder for Chapter 11.

Time Series Data and Analysis

A time series is a sequence of values observed in a time-ordered manner. The observations are recorded at equally spaced time intervals. Time series data is available and utilized in the fields of statistics, economics, finance, weather modeling, pattern recognition, and many more. Time series analysis is the study of the underlying structure and forces that produced the observations. It provides descriptive frameworks to analyze characteristics of the data and other meaningful statistics. It also provides techniques to then utilize this understanding to fit a model for forecasting, monitoring, and control. There is another school of thought that separates the descriptive and modeling components of time series work. Here, time series analysis is typically concerned with only the descriptive analysis of time series data, to understand its various components and underlying structure. The modeling aspect that utilizes time series for prediction/forecasting use cases is termed time series forecasting. In general, though, both schools of thought utilize the same set of tools and techniques; it is more about how the concepts are grouped for structured learning and application.

© Dipanjan Sarkar, Raghav Bali and Tushar Sharma 2018. D. Sarkar et al., Practical Machine Learning with Python, https://doi.org/10.1007/978-1-4842-3207-1_11

■■Note Time series and its analysis is a complete field of study on its own, and we merely discuss certain concepts and techniques for our use cases here. This chapter is by no means a complete guide to time series and its analysis.

Time series can be analyzed in both the frequency and time domains. Frequency domain analysis includes spectral and wavelet analysis techniques, while time domain analysis includes auto- and cross-correlation analysis. In this chapter, we primarily focus on time series forecasting in the time domain, with a brief discussion around the descriptive characteristics of time series data. Moreover, we concentrate on univariate time series analysis (time is an implicit variable). To better understand concepts related to time series, we utilize a sample dataset. The following snippet loads web site visit data with a daily frequency using pandas. You can refer to the jupyter notebook notebook_getting_started_time_series.ipynb for the necessary code snippets and examples. The dataset is available at http://openmv.net/info/website-traffic.

In [1]: import pandas as pd
   ...:
   ...: # load data
   ...: input_df = pd.read_csv(r'website-traffic.csv')
   ...:
   ...: input_df['date_of_visit'] = pd.to_datetime(input_df.MonthDay.\
   ...:                                            str.cat(input_df.Year.astype(str),
   ...:                                                    sep=' '))

The snippet first creates a new attribute called date_of_visit using a combination of the MonthDay and Year values available in the base dataframe.
Since the dataset is about web site visits per day, the variable of interest is the visit count for the day, with the time dimension (date_of_visit) being the implicit variable. The output plot of visits per day is shown in Figure 11-1.

Figure 11-1. Web site visits per day

Time Series Components

The time series at hand is data related to web site visits per day for a given web site. As mentioned earlier, time series analysis deals with understanding the underlying structure and forces that result in the series as we see it. Let's now try to deconstruct the various components that make up this time series. A time series is said to be comprised of the following three major components:

• Seasonality: These are the periodic fluctuations in the observed data. For example, weather patterns or sale patterns.
• Trend: This is the increasing or decreasing behavior of the series with time. For example, population growth patterns.
• Residual: This is the remaining signal after removing the seasonality and trend signals. It can be further decomposed to remove the noise component as well.

It is interesting to note that most real-world time series data have a combination of these components, or all of them. Yet it is mostly the noise that is always present, with trend and seasonality being optional in certain cases. In the following snippet, we utilize statsmodels to decompose our web site visit time series into its three constituents and then plot the same.
In [2]: from statsmodels.tsa.seasonal import seasonal_decompose
   ...:
   ...: # extract visits as a series from the dataframe
   ...: ts_visits = pd.Series(input_df.Visits.values,
   ...:                       index=pd.date_range(input_df.date_of_visit.min(),
   ...:                                           input_df.date_of_visit.max(),
   ...:                                           freq='D'))
   ...:
   ...: decompose = seasonal_decompose(ts_visits.interpolate(), freq=24)
   ...: decompose.plot()

We first create a pandas Series object, with additional care taken to set the frequency of the time series index. It is important to note that statsmodels has a number of time series modeling modules available, and they rely on the underlying data structures (such as pandas, numpy, etc.) to specify the frequency of the time series. In this case, since the data is at a daily level, we set the frequency of the ts_visits object to 'D', denoting a daily frequency. We then simply use the seasonal_decompose() function from statsmodels to get the required constituents. The decomposed series is shown in the plot in Figure 11-2.

Figure 11-2. Web site visit time series and its constituent signals

As is apparent from Figure 11-2, the time series at hand has both upward and downward trends in it. It shows a gradually increasing trend until October, after which it starts a downward behavior. The series certainly has a monthly periodicity or seasonality to it. The remaining signal is what is marked as residual in Figure 11-2.

Smoothing Techniques

As discussed in the previous chapters, preprocessing the raw data depends upon the data as well as the use case requirements. Yet there are certain standard preprocessing techniques for each type of data.
Unlike the datasets we have seen so far, where we consider each observation to be independent of other (past or future) observations, time series have an inherent dependency on historical observations. As seen from the decomposition of the web site visits series, there are multiple factors impacting each observation. It is an inherent property of time series data to have random variation apart from its other constituents. To better understand, model, and utilize time series for prediction related tasks, we usually perform a preprocessing step termed smoothing. Smoothing helps reduce the effect of random variation and helps clearly reveal the seasonality, trend, and residual components of the series. There are various methods to smooth out a time series. They are broadly categorized as follows.

Moving Average

Instead of taking an average of the complete time series to summarize it (which we do in the case of non-temporal data), a moving average makes use of a rolling windowed approach. In this case, we compute the mean of each successive smaller window of past data to smooth out the impact of random variation. The following is the general formula for the moving average calculation:

MA_t = (x_t + x_{t-1} + x_{t-2} + ... + x_{t-n+1}) / n

where MA_t is the moving average for time period t; x_t, x_{t-1}, and so on denote observed values at particular time periods; and n is the window size. For example, the following snippet calculates the moving average of visits with a window size of 3.
In [3]: # moving average
   ...: input_df['moving_average'] = input_df['Visits'].rolling(window=3,
   ...:                                                         center=False).mean()
   ...:
   ...: print(input_df[['Visits','moving_average']].head(10))
   ...:
   ...: plt.plot(input_df.Visits, '-', color='black', alpha=0.3)
   ...: plt.plot(input_df.moving_average, color='b')
   ...: plt.title('Website Visit and Moving Average Smoothing')
   ...: plt.legend()
   ...: plt.show()

The moving average calculated using a window size of 3 has the following results. It should be pretty clear that for a window size of 3, the first two observations would not have any moving averages available, hence the NaN values.

Out[3]:
   Visits  moving_average
0      27             NaN
1      31             NaN
2      38       32.000000
3      38       35.666667
4      31       35.666667
5      24       31.000000
6      21       25.333333
7      29       24.666667
8      30       26.666667
9      22       27.000000

Figure 11-3. Smoothing using moving average

The plot in Figure 11-3 shows the smoothed visits time series. The smoothed series captures the overall structure of the original series while reducing the random variation in it. You are encouraged to explore and experiment with different window sizes and compare the results. Depending on the use case and data at hand, we also try different variations of the moving average, like the centered moving average, double moving average, and so on, apart from different window sizes. We will utilize some of these concepts when we deal with actual use cases in the coming sections of the chapter.

Exponential Smoothing

Moving average based smoothing is effective, yet it is a pretty simple preprocessing technique. In the case of the moving average, all past observations in the window are given equal weight.
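This equal weighting can be seen by computing the same moving average with plain numpy, reusing the ten visit values shown above:

```python
import numpy as np

# The first ten visit counts from the dataset shown above
visits = np.array([27, 31, 38, 38, 31, 24, 21, 29, 30, 22], dtype=float)

window = 3
weights = np.ones(window) / window  # every observation in the window gets weight 1/n
ma = np.convolve(visits, weights, mode='valid')
print(np.round(ma, 6))
```

Note that np.convolve with mode='valid' only emits averages for complete windows, which mirrors the NaN entries produced by rolling() for the first two observations.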
Unlike the previous method, exponential smoothing techniques apply exponentially decreasing weights to older observations. In simple words, exponential smoothing methods give more weight to recent observations as compared to older observations. Depending on the level of smoothing required, there may be one or more smoothing parameters to set in the case of exponential smoothing. Exponential smoothing is also called the exponentially weighted moving average, or EWMA for short. Single exponential smoothing is one of the simplest methods to get started with. The general formula is given as:

E_t = α * y_{t-1} + (1 - α) * E_{t-1}

where E_t is the t-th smoothed observation, y_{t-1} is the actual observed value at the t-1 instance, and α is a smoothing constant between 0 and 1. There are different methods to bootstrap the value of E_2 (the time period from which smoothing begins). It can be done by setting it to y_1, or to an average of the first n time periods, and so on. Also, the value of α determines how much of the past is accounted for: a value closer to 1 dampens out past observations quickly, while values closer to 0 dampen them out slowly. The following snippet uses the pandas ewm() function to calculate the smoothed series for visits. The parameter halflife is used to calculate α in this case.

In [4]: input_df['ewma'] = input_df['Visits'].ewm(halflife=3,
   ...:                                           ignore_na=False,
   ...:                                           min_periods=0,
   ...:                                           adjust=True).mean()
   ...:
   ...: plt.plot(input_df.Visits, '-', color='black', alpha=0.3)
   ...: plt.plot(input_df.ewma, color='g')
   ...: plt.title('Website Visit and Exponential Smoothing')
   ...: plt.legend()
   ...: plt.show()

The plot depicted in Figure 11-4 showcases the EWMA smoothed series along with the original one.

Figure 11-4. Smoothing using EWMA

In the coming sections, we will apply our understanding of time series and preprocessing techniques to solve stock and commodity price forecasting problems using different forecasting methods.

Forecasting Gold Price

Gold, the yellow shiny metal, has been the fancy of mankind for ages. From being made into jewelry to being used as an investment, gold covers a huge spectrum of use cases. Gold, like other metals, is also traded on commodity indexes across the world. To better understand time series in a real-world scenario, we will work with historically collected gold prices and predict their future value. Let's begin by first formally stating the problem statement.

Problem Statement

Metals such as gold have been traded for years across the world. Prices of gold are determined and used for trading the metal on commodity exchanges on a daily basis using a variety of factors. Using this daily price-level information only, our task is to predict the future price of gold.

Dataset

For any problem, first and foremost comes the data. Stock and commodity exchanges do a wonderful job of storing and sharing daily level pricing data. For the purpose of this use case, we will utilize gold pricing from Quandl. Quandl is a platform for financial, economic, and alternative datasets. You can refer to the jupyter notebook notebook_gold_forecast_arima.ipynb for the necessary code snippets and examples. To access publicly shared datasets on Quandl, we can use the pandas-datareader library as well as quandl (the library from Quandl itself). For this use case, we will depend upon quandl. Kindly install it using pip or conda. The following snippet shows a quick one-liner to get your hands on gold pricing information since the 1970s.
In [5]: import quandl
   ...: gold_df = quandl.get("BUNDESBANK/BBK01_WT5511", end_date="2017-07-31")

The get() function takes the stock/commodity identifier as its first parameter, followed by the date until which we need the data. Note that not all datasets are public; for some of them, API access must be obtained.

Traditional Approaches

Time series analysis and forecasting have long been studied in detail, and a mature and extensive set of modeling techniques is available. Of the many, the following are a few of the most commonly used and explored techniques:

• Simple moving average and exponential smoothing based forecasting
• Holt's and Holt-Winters exponential smoothing based forecasting
• Box-Jenkins methodology (AR, MA, ARIMA, S-ARIMA, etc.)

■■Note Causal or cross-sectional forecasting/modeling is where the target variable has a relationship with one or more predictor variables, for example regression models (see Chapter 6). Time series forecasting is about forecasting variable(s) that change over time. Both these techniques are grouped under quantitative techniques.

As mentioned, there are quite a handful of techniques available, each a deep topic of research and study in itself. For the scope of this section and chapter, we will focus on ARIMA models (from the Box-Jenkins methodology) to forecast gold prices. Before we move ahead and discuss ARIMA, let's look at a few key concepts.

Key Concepts

• Stationarity: One of the key assumptions behind the ARIMA models we will be discussing. Stationarity refers to the property that a time series' mean, variance, and autocorrelation are time invariant; in other words, they do not change with time. For instance, a time series with an upward (or downward) trend is a clear indicator of non-stationarity, because its mean changes with time (see the web site visit data example in the previous section).
• Differencing: One of the methods of stationarizing a series. Though other transformations exist, differencing is widely used to stabilize the mean of a time series. We simply compute the difference between consecutive observations to obtain a differenced series, and can then apply tests to confirm whether the resulting series is stationary. We can also perform second-order differencing, seasonal differencing, and so on, depending on the time series at hand.

• Unit Root Tests: Statistical tests that help us understand whether a given series is stationary. The Augmented Dickey-Fuller (ADF) test begins with a null hypothesis that the series is non-stationary, while the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test has a null hypothesis that the series is stationary. We then perform a regression fit to reject or fail to reject the null hypothesis.

ARIMA

The Box-Jenkins methodology consists of a wide range of statistical models that are widely used to model time series for forecasting. In this section, we will concentrate on one such model, ARIMA, which stands for Auto Regressive Integrated Moving Average. Sounds pretty complex, right? Let's look at the basics and constituents of this model and then build on our understanding to forecast gold prices.

• Auto Regressive (AR) modeling: A simple linear regression model where the current observation is regressed upon one or more prior observations. The model is denoted as

X_t = δ + θ_1·X_{t−1} + … + θ_p·X_{t−p} + ε_t

where X_t is the observation at time t, ε_t is the noise, and

δ = (1 − Σ_{i=1}^{p} θ_i)·μ

The dependency on prior values is denoted by p, the order of the AR model.

• Moving Average (MA) modeling: Again essentially a linear regression model, one that models the impact of noise/error from prior observations on the current one.
The model is denoted as

X_t = μ + ε_t − φ_1·ε_{t−1} − … − φ_q·ε_{t−q}

where μ is the series mean, the ε_i are the noise terms, and q is the order of the model.

The AR and MA models were known long before the Box-Jenkins methodology was presented, yet this methodology provided a systematic approach to identifying and applying these models for forecasting.

■■Note Box, Jenkins, and Reinsel presented this methodology in their book titled Time Series Analysis: Forecasting and Control. You are encouraged to go through it for a deeper understanding.

The ARIMA model is a logical progression and combination of the two models: if we combine AR and MA with a differenced series, what we get is called an ARIMA(p,d,q) model, where:

• p is the order of autoregression
• d is the order of differencing
• q is the order of the moving average

Thus, for a stationary time series, ARIMA models combine autoregressive and moving average concepts to model the behavior of a long-running time series and help in forecasting. Let's now apply these concepts to model gold price forecasting.

Modeling

While describing the dataset, we extracted the gold price information using quandl. Let's first plot and see how this time series looks. The following snippet uses pandas to plot it.

In [6]: gold_df.plot(figsize=(15, 6))
   ...: plt.show()

The plot in Figure 11-5 shows a general upward trend, with sudden rises in the 1980s and then near 2010.

Figure 11-5. Gold prices over the years

Since stationarity is one of the primary assumptions of ARIMA models, we will utilize the Augmented Dickey-Fuller test to check our series for stationarity. The following snippet helps us calculate the AD Fuller test statistics and plot the rolling characteristics of the series.
In [7]: # Dickey-Fuller test for stationarity
   ...: def ad_fuller_test(ts):
   ...:     dftest = adfuller(ts, autolag='AIC')
   ...:     dfoutput = pd.Series(dftest[0:4], index=['Test Statistic',
   ...:                                              'p-value',
   ...:                                              '#Lags Used',
   ...:                                              'Number of Observations Used'])
   ...:     for key, value in dftest[4].items():
   ...:         dfoutput['Critical Value (%s)' % key] = value
   ...:     print(dfoutput)
   ...:
   ...: # Plot rolling stats for a time series
   ...: def plot_rolling_stats(ts):
   ...:     rolling_mean = ts.rolling(window=12, center=False).mean()
   ...:     rolling_std = ts.rolling(window=12, center=False).std()
   ...:
   ...:     # Plot rolling statistics:
   ...:     orig = plt.plot(ts, color='blue', label='Original')
   ...:     mean = plt.plot(rolling_mean, color='red', label='Rolling Mean')
   ...:     std = plt.plot(rolling_std, color='black', label='Rolling Std')
   ...:     plt.legend(loc='best')
   ...:     plt.title('Rolling Mean & Standard Deviation')
   ...:     plt.show(block=False)

If the test statistic of the AD Fuller test is less than the critical value(s), we reject the null hypothesis of non-stationarity. The AD Fuller test is available as part of the statsmodels library. Since it is quite evident that our original series of gold prices is non-stationary, we will perform a log transformation and see if we are able to obtain stationarity. The following snippet uses the rolling stats plots and the AD Fuller test to check this.

In [8]: log_series = np.log(gold_df.Value)
   ...:
   ...: ad_fuller_test(log_series)
   ...: plot_rolling_stats(log_series)

The test statistic of -1.8 is greater than any of the critical values, hence we fail to reject the null hypothesis, i.e., the series is non-stationary even after the log transformation.
The output and plot depicted in Figure 11-6 confirm the same.

Test Statistic                    -1.849748
p-value                            0.356057
#Lags Used                        29.000000
Number of Observations Used    17520.000000
Critical Value (1%)               -3.430723
Critical Value (5%)               -2.861705
Critical Value (10%)              -2.566858
dtype: float64

Figure 11-6. Rolling mean and standard deviation plot for log transformed gold price

The plot points out a time-varying mean of the series and hence the non-stationarity. As discussed in the key concepts, differencing a series helps in achieving stationarity. In the following snippet, we prepare a first-order differenced log series and perform the same tests.

In [9]: log_series_shift = log_series - log_series.shift()
   ...: log_series_shift = log_series_shift[~np.isnan(log_series_shift)]
   ...:
   ...: ad_fuller_test(log_series_shift)
   ...: plot_rolling_stats(log_series_shift)

The test statistic at -23.91 is lower than even the 1% critical value, thus we reject the null hypothesis of the AD Fuller test. The following are the test results.

Test Statistic                   -23.917175
p-value                            0.000000
#Lags Used                        28.000000
Number of Observations Used    17520.000000
Critical Value (1%)               -3.430723
Critical Value (5%)               -2.861705
Critical Value (10%)              -2.566858
dtype: float64

Figure 11-7. Rolling mean and standard deviation plot for log differenced gold price series

This exercise points us to the fact that we need to use a log differenced series for ARIMA to model the dataset at hand (see Figure 11-7). Yet we still need to figure out the orders of the autoregression and moving average components, i.e., p and q. Building an ARIMA model requires some experience and intuition compared to other models.
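Before moving on, the stabilizing effect of the differencing we just applied can be seen in a tiny sketch (pure numpy; a deterministic trend stands in for the gold series):

```python
import numpy as np

# A trending (non-stationary) toy series: its mean rises with time.
trend = np.array([100.0, 102.0, 104.0, 106.0, 108.0])

# First-order differencing: subtract each observation from the next --
# exactly what log_series - log_series.shift() does above.
diff = np.diff(trend)
print(diff)  # [2. 2. 2. 2.] -- the linear trend is gone

# Second-order differencing removes a quadratic trend in the same way.
diff2 = np.diff(trend, n=2)
print(diff2)  # [0. 0. 0.]
```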
Identifying the p, d, and q parameters of the model can be done using different methods, though arriving at the right set of numbers depends both on requirements and on experience. One of the commonly used methods is plotting the ACF and PACF to determine the p and q values. The ACF (Auto Correlation Function) plot and the PACF (Partial Auto Correlation Function) plot help us narrow down the search space for p and q, with a few caveats: certain rules and heuristics have been developed over the years to best utilize these plots, but they do not guarantee the best possible values.

The ACF plot helps us understand the correlation of an observation with its lags (previous values) and is used to determine the MA order, q: the lag at which the ACF drops off is the order of the MA model. Along the same lines, the PACF captures the correlation between an observation and a specific lagged value, excluding the effect of other lags; the lag at which the PACF drops off points toward the order of the AR model, the p in ARIMA(p,d,q).

■■Note Further details on ACF and PACF are available at http://www.itl.nist.gov/div898/handbook/eda/section3/autocopl.htm.

Let's again utilize statsmodels to generate ACF and PACF plots for our series and try to determine the p and q values. The following snippet uses the log differenced series to generate the required plots.

In [10]: fig = plt.figure(figsize=(12,8))
    ...: ax1 = fig.add_subplot(211)
    ...: fig = sm.graphics.tsa.plot_acf(log_series_shift.squeeze(), lags=40, ax=ax1)
    ...: ax2 = fig.add_subplot(212)
    ...: fig = sm.graphics.tsa.plot_pacf(log_series_shift, lags=40, ax=ax2)

The output plots (Figure 11-8) show a sudden drop at lag 1 for both the ACF and the PACF, pointing toward possible values of 1 each for q and p, respectively.

Figure 11-8. ACF and PACF plots

The ACF and PACF plots also help us understand whether a series is stationary.
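As a rough illustration of what the ACF measures, here is a naive sample autocorrelation (our own simplified estimator, not statsmodels' exact implementation):

```python
import numpy as np

def sample_acf(x, lag):
    """Naive sample autocorrelation at a given lag: correlation of the
    demeaned series with itself shifted by `lag` steps."""
    x = np.asarray(x, dtype=float)
    xm = x - x.mean()
    num = np.dot(xm[:-lag], xm[lag:]) if lag else np.dot(xm, xm)
    return float(num / np.dot(xm, xm))

x = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
print(sample_acf(x, 0))  # 1.0 -- a series is perfectly correlated with itself
print(sample_acf(x, 1))  # negative: consecutive values move in opposite directions
```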
If a series has gradually decreasing ACF and PACF values, it points toward non-stationarity in the series.

■■Note Identifying the p, d, and q values for any ARIMA model is as much science as it is art. More details on this are available at https://people.duke.edu/~rnau/411arim.htm.

Another method to derive the p, d, and q parameters is to perform a grid search of the parameter space. This is more in tune with the Machine Learning way of hyperparameter tuning. Though statsmodels does not provide such a utility (for obvious reasons), we can write our own utility to identify the best-fitting model. Also, much like any other Machine Learning/Data Science use case, we need to split our dataset into train and test sets; we utilize scikit-learn's TimeSeriesSplit utility to get proper training and testing sets.

We write a utility function arima_gridsearch_cv() to grid search and cross validate the results using the gold prices at hand. The function is available in the arima_utils.py module for reference. The following snippet performs a five-fold cross validation with auto-ARIMA to find the best-fitting model.

In [11]: results_dict = arima_gridsearch_cv(gold_df.log_series, cv_splits=5)

Note that we are passing the log transformed series as input to the arima_gridsearch_cv() function. As we saw earlier, it was the log differenced series that helped us achieve stationarity, hence we use the log transformation as our starting point and fit ARIMA models with d set to 1. The function call generates a detailed output for each train-test split (we have five of them lined up), with each iteration performing a grid search over p, d, and q. Figure 11-9 shows the output of the first iteration, where the training set included only 2924 observations.

Figure 11-9. Auto ARIMA

Similar to our findings using ACF-PACF, auto ARIMA suggests that the best-fitting model is ARIMA(1,1,1) based upon the AIC criterion.
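The core of such a grid search is small. The sketch below is our own (the internals of arima_gridsearch_cv() are not shown here); the fit_aic callable stands in for fitting one ARIMA order and returning its AIC, e.g. via statsmodels:

```python
import itertools

def arima_order_search(fit_aic, p_values, d_values, q_values):
    """Try every (p, d, q) combination and keep the one with the lowest
    AIC. `fit_aic` fits a model of the given order and returns its AIC;
    orders that fail to fit are simply skipped."""
    best_order, best_aic = None, float('inf')
    for order in itertools.product(p_values, d_values, q_values):
        try:
            aic = fit_aic(order)
        except Exception:
            continue  # non-converging/non-invertible orders are common
        if aic < best_aic:
            best_order, best_aic = order, aic
    return best_order, best_aic

# Toy stand-in scoring function, just to show the mechanics:
best = arima_order_search(lambda o: sum(o), [0, 1, 2], [1], [0, 1, 2])
print(best)  # ((0, 1, 0), 1)
```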
Note that AIC, or the Akaike Information Criterion, measures goodness of fit and parsimony. It is a relative metric and does not speak to the quality of models in an absolute sense: if all the models being compared are poor, AIC will not be able to point that out. Thus, AIC should be used as a heuristic; a low value points toward a better-fitting model. The summary generated by the ARIMA(1,1,1) fit is shown in Figure 11-10.

Figure 11-10. Summary of ARIMA(1,1,1)

The summary is quite self-explanatory. The top section shows details about the training sample, AIC, and other metrics. The middle section lists the coefficients of the fitted model; in the case of ARIMA(1,1,1) for iteration 1, both the AR and MA coefficients are statistically significant. The forecast plot for iteration 1 with ARIMA(1,1,1) fitted is shown in Figure 11-11.

Figure 11-11. The forecast plot for ARIMA(1,1,1)

As evident from the plot in Figure 11-11, the model captures the overall upward trend, though it misses the sudden jump in values around 1980. Still, it gives a pretty good idea of what can be achieved with this methodology. The arima_gridsearch_cv() function produces similar statistics and plots for the five different train-test splits. We observe that ARIMA(1,1,1) provides a decent enough fit, although we could define additional performance and error criteria to select a particular model.

In this case, we generated forecasts for time periods for which we already had data. This helps us visualize and understand how the model is performing, and is also called back testing. Out-of-sample forecasting is also supported by statsmodels through its forecast() method. Also, the plot in Figure 11-11 showcases values in the transformed scale, i.e., the log scale; the inverse transformation can easily be applied to get the data back in its original form.
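Putting the earlier AR and MA definitions together, a one-step ARIMA(1,1,1) forecast works roughly like the sketch below. This is our own illustration of the mechanics, not statsmodels code, and the coefficient values are made up:

```python
import math

def arima_111_one_step(levels, ar_coef, ma_coef, last_err, const=0.0):
    """One-step ARIMA(1,1,1) forecast sketch: with d=1 we model the
    differenced series as
        diff_t = const + ar_coef*diff_{t-1} - ma_coef*err_{t-1} + err_t
    (matching the sign convention of the MA formula given earlier), then
    integrate the predicted difference back onto the last observed level."""
    diffs = [b - a for a, b in zip(levels, levels[1:])]
    next_diff = const + ar_coef * diffs[-1] - ma_coef * last_err
    return levels[-1] + next_diff

# Log-scale toy series; the forecast is also on the log scale ...
log_prices = [math.log(p) for p in (1200.0, 1210.0, 1225.0)]
log_fc = arima_111_one_step(log_prices, ar_coef=0.4, ma_coef=0.3, last_err=0.001)
# ... so the inverse transformation (exp) brings it back to price units.
price_fc = math.exp(log_fc)
```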
You should also note that commodity prices are impacted by many other factors, such as global demand and economic conditions like recessions. Hence, what we showcased here was in some ways a naive modeling of a complex process; we would need more features and attributes for sophisticated forecasts.

Stock Price Prediction

Stocks and financial instrument trading is a lucrative proposition. Stock markets across the world facilitate such trades, and thus wealth changes hands. Stock prices move up and down all the time, and the ability to predict their movement has immense potential to make one rich. Stock price prediction has kept people interested for a long time. There are hypotheses, like the Efficient Market Hypothesis, that say it is almost impossible to beat the market consistently, and there are others that disagree.

There are a number of known approaches, and new research going on, to find the magic formula to make you rich. One of the traditional methods is time series forecasting, which we saw in the previous section. Fundamental analysis is another method, in which numerous performance ratios are analyzed to assess a given stock. On the emerging front, there are neural networks, genetic algorithms, and ensembling techniques.

■■Note Stock price prediction (along with the gold price prediction in the previous section) is an attempt to explain concepts and techniques to model real-world data and use cases. This chapter is by no means an extensive guide to algorithmic trading, which is a complete field of study in its own right that you may explore further. Knowledge from this chapter alone would not be sufficient to perform trading of any sort, and doing so is beyond both the scope and intent of this book.

In this section, we learn how to apply recurrent neural networks (RNNs) to the problem of stock price prediction and understand the intricacies.
Problem Statement

Stock price prediction is the task of forecasting the future value of a given stock. Given the historical daily close prices of the S&P 500 index, prepare and compare forecasting solutions. The S&P 500, or Standard & Poor's 500, is an index comprising 500 stocks from different sectors of the US economy and is an indicator of US equities. Other such indices are the Dow 30, NIFTY 50, Nikkei 225, and so on. For the purpose of understanding, we utilize the S&P 500 index; the concepts and knowledge can be applied to other stocks as well.

Dataset

Similar to the gold price dataset, historical stock price information is also publicly available. For our current use case, we will utilize the pandas_datareader library to get the required S&P 500 index history using the Yahoo Finance databases. We will utilize the closing price information from the dataset, although other information such as opening price, adjusted closing price, etc., is also available. We prepare a utility function get_raw_data() to extract the required information into a pandas dataframe; it takes the index ticker name as input. For the S&P 500 index, the ticker name is ^GSPC. The following snippet uses the utility function to get the required data.

In [1]: sp_df = get_raw_data('^GSPC')
   ...: sp_close_series = sp_df.Close
   ...: sp_close_series.plot()

The plot of the closing price is depicted in Figure 11-12.

Figure 11-12. The S&P 500 index

The plot in Figure 11-12 shows that we have closing price information available from 2010 up until recently. Kindly note that the same information is also available through quandl (which we used in the previous section to get gold price information); you may use it to get this data as well.

Recurrent Neural Networks: LSTM

Artificial neural networks are being employed to solve a number of use cases in a variety of domains.
Recurrent neural networks are a class of neural networks capable of modeling sequential data. LSTMs, or Long Short-Term Memory networks, are an RNN architecture useful for modeling information over arbitrary intervals. RNNs, and particularly LSTMs, were discussed in Chapter 1, while a practical use case was explored in Chapter 7 for analyzing textual data from movie reviews for sentiment analysis.

Figure 11-13. Basic structure of RNN and LSTM units. (Source: Christopher Olah's blog: colah.github.io)

As a quick refresher, Figure 11-13 shows the general architecture of an RNN along with the internals of a typical LSTM unit. An LSTM comprises three major gates: the input, output, and forget gates. These gates work in tandem to learn and store long and short term sequence-related information. For more details, refer to the Advanced Supervised Deep Learning Models section of Chapter 7.

For stock price prediction, we will utilize LSTM units to implement an RNN model. RNNs are typically useful in sequence modeling applications, some of which are as follows:

• Sequence classification tasks, like sentiment analysis of a given corpus (see Chapter 7 for the detailed use case)
• Sequence tagging tasks, like POS tagging of a given sentence
• Sequence mapping tasks, like speech recognition

Unlike traditional forecasting approaches (such as ARIMA), which require preprocessing of time series information to conform to stationarity and other assumptions, along with parameter identification (p, d, and q, for instance), neural networks (particularly RNNs) impose far fewer restrictions. Since stock price information is also time series data, we will explore the application of LSTMs to this use case and generate forecasts. There are a number of ways this problem can be modeled to forecast values; the following sections cover two such approaches.
Regression Modeling

We introduced regression modeling in Chapter 6 to analyze bike demand based on certain predictor variables. In essence, regression modeling refers to the process of investigating the relationship between dependent and independent variables. To model our current use case as a regression problem, we state that the stock price at timestamp t+1 (the dependent variable) is a function of the stock prices at timestamps t, t−1, t−2, …, t−n, where n is the past window of stock prices.

Now that we have a framework for modeling our time series, we need to transform the data into windowed form; see Figure 11-14.

Figure 11-14. Transformation of stock price time series into windowed format

The windowed transformation is outlined in Figure 11-15, where a window size of 4 is used: the value at time t+1 is forecast using the past four values. Since we have data for the S&P 500 index since 2010, we apply this windowed transformation in a rolling fashion to create multiple such sequences; thus, for a time series of length M and a window size of n, there are M−n−1 total windows generated.

Figure 11-15. Rolling/sliding windows from original time series

For the hands-on examples in this section, you can refer to the jupyter notebook notebook_stock_prediction_regression_modeling_lstm.ipynb for the necessary code snippets and examples.

To use LSTMs to model our time series, we need to apply one more level of transformation to our input data. LSTMs accept 3D tensors as input, so we transform each of the windows (or sequences) into the (N, W, F) format. Here, N is the number of samples or windows from the original time series, W is the size of each window or the number of historical time steps, and F is the number of features per time step. In our case, since we are only using the closing price, F equals 1, while N and W are configurable.
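The windowing plus the (N, W, F) reshape can be sketched in a few lines of numpy (a minimal illustration with our own function name):

```python
import numpy as np

def make_lstm_windows(series, window_size):
    """Slide a window over a 1-D series and reshape the result into the
    (N, W, F) tensor an LSTM expects; F is 1 for a single feature."""
    windows = [series[i:i + window_size]
               for i in range(len(series) - window_size + 1)]
    arr = np.asarray(windows, dtype=float)
    return arr.reshape(arr.shape[0], window_size, 1)

tensor = make_lstm_windows(list(range(10)), window_size=4)
print(tensor.shape)  # (7, 4, 1): 7 windows, 4 time steps, 1 feature
```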
The following function performs the windowing and 3D tensor transformations using pandas and numpy.

def get_reg_train_test(timeseries, sequence_length=51,
                       train_size=0.9, roll_mean_window=5,
                       normalize=True, scale=False):
    # smoothen out series
    if roll_mean_window:
        timeseries = timeseries.rolling(roll_mean_window).mean().dropna()

    # create windows
    result = []
    for index in range(len(timeseries) - sequence_length):
        result.append(timeseries[index: index + sequence_length])

    # normalize data as a variation of 0th index
    if normalize:
        normalised_data = []
        for window in result:
            normalised_window = [((float(p) / float(window[0])) - 1)
                                 for p in window]
            normalised_data.append(normalised_window)
        result = normalised_data

    # identify train-test splits
    result = np.array(result)
    row = round(train_size * result.shape[0])

    # split train and test sets
    train = result[:int(row), :]
    test = result[int(row):, :]

    # scale data in 0-1 range
    scaler = None
    if scale:
        scaler = MinMaxScaler(feature_range=(0, 1))
        train = scaler.fit_transform(train)
        test = scaler.transform(test)

    # split independent and dependent variables
    x_train = train[:, :-1]
    y_train = train[:, -1]
    x_test = test[:, :-1]
    y_test = test[:, -1]

    # Transforms for LSTM input
    x_train = np.reshape(x_train, (x_train.shape[0],
                                   x_train.shape[1],
                                   1))
    x_test = np.reshape(x_test, (x_test.shape[0],
                                 x_test.shape[1],
                                 1))

    return x_train, y_train, x_test, y_test, scaler

The get_reg_train_test() function also performs a number of other optional preprocessing steps.
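The normalization step inside get_reg_train_test() (each value expressed as its change relative to the window's first value) is easy to invert later, which matters when turning normalized forecasts back into prices. A small sketch, with function names of our own choosing:

```python
def normalize_window(window):
    """Each value becomes its percentage change from the window's first
    value -- the same transformation get_reg_train_test() applies when
    normalize=True."""
    base = float(window[0])
    return [float(p) / base - 1 for p in window]

def denormalize_window(norm_window, base):
    """Invert the normalization, given the window's original first value."""
    return [base * (v + 1) for v in norm_window]

norm = normalize_window([100.0, 110.0, 95.0])   # roughly [0.0, 0.1, -0.05]
restored = denormalize_window(norm, 100.0)      # recovers the original prices
```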
The function allows us to smoothen the time series using a rolling mean before the windowing is applied, and we can also normalize and/or scale the data based on requirements. Neural networks are sensitive to input values, and it is generally advised to scale inputs before training a network. For this use case, we will utilize normalization of the time series wherein, for each window, every time step is the percentage change from the first value in that window (we could also use scaling, or both, and repeat the process). We begin with a window size of six days (you can experiment with smaller or larger windows and observe the difference). The following snippet uses the get_reg_train_test() function with normalization set to true.

In [2]: WINDOW = 6
   ...: PRED_LENGTH = int(WINDOW/2)
   ...: x_train, y_train, x_test, y_test, scaler = get_reg_train_test(sp_close_series,
   ...:                                               sequence_length=WINDOW+1,
   ...:                                               roll_mean_window=None,
   ...:                                               normalize=True,
   ...:                                               scale=False)

This snippet creates a seven-day window comprising six days of historical data (x_train) and a one-day forecast (y_train). The shapes of the train and test variables are as follows.

In [3]: print("x_train shape={}".format(x_train.shape))
   ...: print("y_train shape={}".format(y_train.shape))
   ...: print("x_test shape={}".format(x_test.shape))
   ...: print("y_test shape={}".format(y_test.shape))

x_train shape=(2516, 6, 1)
y_train shape=(2516,)
x_test shape=(280, 6, 1)
y_test shape=(280,)

The x_train and x_test tensors conform to the (N, W, F) format we discussed earlier, as required for input to our RNN.
We have 2516 sequences in our training set, each with six time steps and one value to forecast; similarly, we have 280 sequences in our test set. Now that we have our datasets preprocessed and ready, we build an RNN using keras. The keras framework provides high-level abstractions for working with neural networks over theano and tensorflow backends. The following snippet shows the model prepared using the get_reg_model() function.

In [4]: lstm_model = get_reg_model(layer_units=[50,100],
   ...:                            window_size=WINDOW)

The generated LSTM model architecture has two hidden LSTM layers stacked on each other, the first with 50 LSTM units and the second with 100. The output layer is a Dense layer with a linear activation function, and we use mean squared error as the loss function to optimize. Since we are stacking LSTM layers, we need to set return_sequences to true in order for the subsequent layer to get the required values. As is evident, keras abstracts most of the heavy lifting and makes it pretty intuitive to build even complex architectures with just a few lines of code.

The next step is to train our LSTM network. We use a batch size of 16 with 20 epochs and a validation set of 5%. The following snippet uses the fit() function to train the model.

In [5]: # use early stopping to avoid overfitting
   ...: callbacks = [keras.callbacks.EarlyStopping(monitor='val_loss',
   ...:                                            patience=2,
   ...:                                            verbose=0)]
   ...: lstm_model.fit(x_train, y_train,
   ...:                epochs=20, batch_size=16,
   ...:                verbose=1, validation_split=0.05,
   ...:                callbacks=callbacks)

The model reports training and validation loss for every epoch it runs. The early-stopping callback lets us stop training if no further improvement is observed for two consecutive epochs.
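The patience rule that EarlyStopping applies can be sketched in plain Python (our own simplified version of the logic; the real callback also tracks and can restore model weights):

```python
def should_stop(val_losses, patience=2):
    """Return True once the loss has failed to improve for `patience`
    consecutive epochs -- the rule behind the EarlyStopping callback."""
    best = float('inf')
    epochs_without_improvement = 0
    for loss in val_losses:
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return True
    return False

print(should_stop([0.9, 0.7, 0.71, 0.72], patience=2))  # True: two flat epochs
print(should_stop([0.9, 0.7, 0.6, 0.5], patience=2))    # False: still improving
```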
We start with a batch size of 16; you may experiment with larger batch sizes and observe the difference. Once the model is fit, the next step is to forecast using the predict() function. Since we have modeled this as a regression problem with a fixed window size, we generate forecasts for every sequence. To do so, we write another utility function called predict_reg_multiple(), which takes the lstm model, the windowed dataset, and the window and prediction lengths as input parameters, and returns a list of predictions for every input window. The predict_reg_multiple() function works as follows:

1. For every sequence in the list of windowed sequences, repeat Steps a-c:
   a. Use keras's predict() function to generate one output value.
   b. Append this output value to the end of the input sequence and remove the first value to maintain the window size.
   c. Repeat this process (Steps a and b) until the required prediction length is achieved.
2. The function thus utilizes predicted values to forecast subsequent ones.

The function is available in the script lstm_utils.py. The following snippet uses the predict_reg_multiple() function to get predictions on the test set.

In [6]: test_pred_seqs = predict_reg_multiple(lstm_model,
   ...:                                       x_test,
   ...:                                       window_size=WINDOW,
   ...:                                       prediction_len=PRED_LENGTH)

To analyze performance, we calculate the RMSE of the fitted sequence using sklearn's metrics module. The following snippet calculates the RMSE score.

In [7]: test_rmse = math.sqrt(mean_squared_error(y_test[1:],
   ...:                                          np.array(test_pred_seqs).\
   ...:                                          flatten()))
   ...: print('Test Score: %.2f RMSE' % (test_rmse))

Test Score: 0.01 RMSE

The output is an RMSE of 0.01.
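The rolling loop in Steps a-c can be sketched independently of keras (predict_one below is a stand-in for the trained model's one-step prediction; the names are ours):

```python
def rolling_forecast(predict_one, first_window, window_size, prediction_len):
    """Forecast `prediction_len` steps ahead from one window by feeding
    each prediction back into the window, as in Steps a-c above."""
    window = list(first_window)
    predictions = []
    for _ in range(prediction_len):
        next_value = predict_one(window[-window_size:])  # Step a
        predictions.append(next_value)
        window.append(next_value)                        # Step b: slide forward
    return predictions                                   # Step c: repeated above

# Stand-in "model": predicts the last value plus one.
print(rolling_forecast(lambda w: w[-1] + 1, [1, 2, 3],
                       window_size=3, prediction_len=3))  # [4, 5, 6]
```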
As an exercise, you may compare RMSEs for different window sizes and prediction lengths and observe the overall model performance. To visualize our predictions, we plot them against the normalized testing data using the function plot_reg_results(), which is also available for reference in lstm_utils.py. The following snippet generates the required plot.

In [8]: plot_reg_results(test_pred_seqs, y_test, prediction_len=PRED_LENGTH)

Figure 11-16. Forecast plot with LSTM, window size 6 and prediction length 3

In Figure 11-16, the gray line is the original/true test data (normalized) and the black lines denote the predicted/forecast values in three-day periods; the dotted line shows the overall flow of the predicted series. As is evident, the forecasts are off the mark from the actual data to some extent, yet they show some similarity to the actual trends.

Before we conclude this section, there are a few important points to keep in mind. LSTMs are vastly powerful units with memory to store and use past information. In our current scenario, we utilized a windowed approach with a stacked LSTM architecture (two LSTM layers in our model). This is also termed a many-to-one architecture, where multiple input values are used to generate a single output. Another important point is that the window size, along with other hyperparameters of the network (epochs, batch size, LSTM units, etc.), has an impact on the final results (this is left as an exercise for you to explore). Thus, we should be careful before deploying such models in production.

Sequence Modeling

In the previous section we modeled our time series as a regression use case. In essence, although that problem formulation utilized a past window to forecast, it did not use the time step information.
In this section, we will solve the same stock prediction problem using LSTMs by modeling it as a sequence. For the hands-on examples in this section, you can refer to the jupyter notebook notebook_stock_prediction_sequence_modeling_lstm.ipynb for the necessary code snippets and examples.

Recurrent neural networks are naturally suited for sequence modeling tasks like machine translation, speech recognition, and so on. RNNs utilize memory (unlike normal feed forward neural networks) to keep track of context and utilize it to generate outputs. In general, feed forward neural networks assume inputs are independent of each other. This independence may not hold in many scenarios (such as time series data). RNNs apply the same transformations to each element of the sequence, with outcomes being dependent upon previous values.

In the case of our stock price time series, we would now like to model it as a sequence where the value at each time step is a function of previous values. Unlike the regression-like modeling, here we do not divide the time series into windows of fixed size; rather, we utilize the LSTMs to learn from the data and determine which past values to utilize for forecasting. To do so, we need to tweak the way we processed our data in the previous case, as well as how we build our RNN.

In the previous section, we utilized the (N, W, F) format as input. In the current setting, the format remains the same, with the following changes.

• N (number of sequences): This will be set to 1, since we are dealing with only one stock's price information.
• W (length of sequence): This will be set to the total number of days' worth of price information we have. Here we use the whole series as one big sequence.
• F (features per timestamp): This is again 1, as we are only dealing with the closing stock value per timestamp.

Talking about the output, in the previous section we had one output for every window/sequence in consideration.
While modeling our data as a sequence/time series, we expect our output to be a sequence as well. Thus, the output is also a 3D tensor following the same format as the input tensor. We write a small utility function get_seq_train_test() to help us scale and generate train and test datasets out of our time series. We use a 70-30 split in this case. We then use numpy to reshape our time series into 3D tensors. The following snippet utilizes the get_seq_train_test() function to do the same.

In [1]: train, test, scaler = get_seq_train_test(sp_close_series,
   ...:                                          scaling=True,
   ...:                                          train_size=TRAIN_PERCENT)
   ...:
   ...: train = np.reshape(train, (1, train.shape[0], 1))
   ...: test = np.reshape(test, (1, test.shape[0], 1))
   ...:
   ...: train_x = train[:, :-1, :]
   ...: train_y = train[:, 1:, :]
   ...:
   ...: test_x = test[:, :-1, :]
   ...: test_y = test[:, 1:, :]
   ...:
   ...: print("Data Split Complete")
   ...:
   ...: print("train_x shape={}".format(train_x.shape))
   ...: print("train_y shape={}".format(train_y.shape))
   ...: print("test_x shape={}".format(test_x.shape))
   ...: print("test_y shape={}".format(test_y.shape))
Data Split Complete
train_x shape=(1, 1964, 1)
train_y shape=(1, 1964, 1)
test_x shape=(1, 842, 1)
test_y shape=(1, 842, 1)

Having prepared the datasets, we'll now move on to setting up the RNN. Since we are planning on generating a sequence as output, as opposed to a single output in the previous case, we need to tweak our network architecture. The requirement in this case is to apply similar transformations/processing for every time step, and to be able to get output for every input timestamp rather than waiting for the whole sequence to be processed. To enable such scenarios, keras provides a wrapper over dense layers called TimeDistributed.
This wrapper applies the same task to every time step and provides hooks to get the output after each such time step. We use the TimeDistributed wrapper over a Dense layer to get output from each of the time steps being processed. The following snippet showcases the get_seq_model() function that generates the required model.

def get_seq_model(hidden_units=4, input_shape=(1,1), verbose=False):
    # create and fit the LSTM network
    model = Sequential()

    # input shape = timesteps*features
    model.add(LSTM(input_shape=input_shape,
                   units=hidden_units,
                   return_sequences=True
    ))

    # TimeDistributed(Dense) applies the same processing at all time steps
    model.add(TimeDistributed(Dense(1)))

    start = time.time()
    model.compile(loss="mse", optimizer="rmsprop")

    if verbose:
        print("> Compilation Time : ", time.time() - start)
        print(model.summary())

    return model

This function returns a single hidden layer RNN with four LSTM units and a TimeDistributed Dense output layer. We again use mean squared error as our loss function.

■ Note TimeDistributed is a powerful yet tricky utility available through keras. You may explore more on this at https://github.com/fchollet/keras/issues/1029 and https://datascience.stackexchange.com/questions/10836/the-difference-between-dense-and-timedistributeddense-of-keras.

We now have our dataset preprocessed and split into train and test, along with a model object from the function get_seq_model(). The next step is to simply train the model using the fit() function. While modeling stock price information as a sequence, we are treating the whole time series as one big sequence. Hence, while training the model, we set the batch size to 1, as there is only one stock to train on in this case. The following snippet gets the model object and then trains it using the fit() function.
In [2]: # get the model
   ...: seq_lstm_model = get_seq_model(input_shape=(train_x.shape[1],1),
   ...:                               verbose=VERBOSE)
   ...:
   ...: # train the model
   ...: seq_lstm_model.fit(train_x, train_y,
   ...:                    epochs=150, batch_size=1,
   ...:                    verbose=2)

This snippet returns a model object along with its summary. We also see the output of each of the 150 epochs while the model trains on the training data.

Figure 11-17. RNN summary

Figure 11-17 shows the total number of parameters the RNN tries to learn: 101 in all. We urge you to explore the summary of the model prepared in the previous section; the comparison should surprise most readers (hint: this model has far fewer parameters to learn!). The summary also points toward an important fact: the shape of the first LSTM layer. This clearly shows that the model expects the inputs to adhere to this shape (the shape of the training dataset) for training as well as predicting.

Since our test dataset is smaller (shape: (1, 842, 1)), we need some way to match the required shape. While modeling sequences with RNNs, it is common practice to pad sequences in order to match a given shape. Usually, in cases where there are multiple sequences to train upon (for example, text generation), the size of the longest sequence is used and the shorter ones are padded to match it. We do so only for programmatic reasons and discard the padded values otherwise (see keras masking for more on this). The padding utility is available from the keras.preprocessing.sequence module. The following snippet pads the test dataset with 0s after the actual data (you can choose between pre-padding and post-padding) and then uses the padded sequence to predict/forecast. We also calculate and print the RMSE score of the forecast.
In [3]: # pad input sequence
   ...: testPredict = pad_sequences(test_x,
   ...:                             maxlen=train_x.shape[1],
   ...:                             padding='post',
   ...:                             dtype='float64')
   ...:
   ...: # forecast values
   ...: testPredict = seq_lstm_model.predict(testPredict)
   ...:
   ...: # evaluate performance
   ...: testScore = math.sqrt(mean_squared_error(test_y[0],
   ...:                       testPredict[0][:test_x.shape[1]]))
   ...: print('Test Score: %.2f RMSE' % (testScore))
Test Score: 0.07 RMSE

We can perform the same steps on the training set as well and check the performance. While generating the train and test datasets, the function get_seq_train_test() also returned the scaler object. We next use this scaler object to perform an inverse transformation to get the prediction values on the original scale. The following snippet performs the inverse transformation and then plots the series.
In [4]: # inverse transformation
   ...: trainPredict = scaler.inverse_transform(trainPredict.\
   ...:                        reshape(trainPredict.shape[1]))
   ...: testPredict = scaler.inverse_transform(testPredict.\
   ...:                        reshape(testPredict.shape[1]))
   ...:
   ...: train_size = len(trainPredict)+1
   ...:
   ...: # plot the true and forecasted values
   ...: plt.plot(sp_close_series.index,
   ...:          sp_close_series.values, c='black',
   ...:          alpha=0.3, label='True Data')
   ...:
   ...: plt.plot(sp_close_series.index[1:train_size],
   ...:          trainPredict,
   ...:          label='Training Fit', c='g')
   ...:
   ...: plt.plot(sp_close_series.index[train_size+1:],
   ...:          testPredict[:test_x.shape[1]],
   ...:          label='Forecast')
   ...: plt.title('Forecast Plot')
   ...: plt.legend()
   ...: plt.show()

Figure 11-18. Forecast for S&P 500 using LSTM based sequence modeling

The forecast plot in Figure 11-18 shows a promising picture. We can see that the training fit is nearly perfect, which is to be expected. The testing performance, or the forecast, is also decent. Even though the forecast deviates from the actual values in places, the overall performance, both in terms of RMSE and fit, seems to have worked. Through the use of the TimeDistributed layer wrapper, we achieved the goal of modeling this data as a time series. Not only did the model have better performance in terms of overall fit, it also required far less feature engineering and a much simpler model (in terms of number of training parameters). In this model, we also truly utilized the power of LSTMs by allowing the network to learn and figure out what and how much past information impacts the forecast (as compared to the regression modeling case, where we had restricted the window sizes).
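As a sanity check on that claim of simplicity, the 101 parameters reported in the model summary can be reproduced by hand: an LSTM layer has four gates, each with input weights, recurrent weights, and a bias, and the TimeDistributed Dense layer adds one weight per LSTM unit plus a bias. A short sketch applying the standard parameter-count formulas to our 4-unit, 1-feature model:

```python
# LSTM parameter count: 4 gates, each with an (input + recurrent) weight
# matrix of shape (units, units + input_dim) and a bias vector of size units.
def lstm_params(units, input_dim):
    return 4 * (units * (units + input_dim) + units)

# Dense parameter count: one weight per input, plus a bias per output unit.
def dense_params(units, input_dim):
    return units * input_dim + units

total = lstm_params(4, 1) + dense_params(1, 4)
print(total)  # -> 101, matching the keras summary
```

Compare this against the two-layer stacked LSTM from the regression section, which has orders of magnitude more parameters.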
Two important points before we conclude this section. First, both models have their own advantages and disadvantages; the aim of this section was to chalk out potential ways of modeling a given problem, and the actual choice mostly depends upon the requirements of the use case. Second, and more importantly, both models are for learning/demonstration purposes only. Actual stock price forecasting requires far more rigor and knowledge; we have just scratched the surface.

Upcoming Techniques: Prophet

The Data Science landscape is ever evolving, and new algorithms, tweaks, and tools come up at a rapid pace. One such tool is Prophet, a framework open sourced by Facebook's Data Science team for analyzing and forecasting time series. Prophet uses an additive model that can work with trending and seasonal data. The aim of this tool is to enable forecasting at scale. It is still in beta, yet has some really useful features. More on this is available at https://facebookincubator.github.io/prophet/. The research and intuition behind the tool is described in the paper available at https://facebookincubator.github.io/prophet/static/prophet_paper_20170113.pdf.

The installation steps are outlined on the web site and are straightforward through pip and conda. Prophet also uses scikit style APIs of fit() and predict(), with additional utilities to better handle time series data. For the hands-on examples in this section, you can refer to the jupyter notebook notebook_stock_prediction_fbprophet.ipynb for the necessary code snippets and examples.

■ Note Prophet is still in beta and is undergoing changes. Also, its installation on the Windows platform is known to cause issues. Kindly use conda install (steps mentioned on the web site) with the Anaconda distribution to avoid issues.

Since we already have the S&P 500 index price information available in a dataframe/series, we now test how we can use this tool to forecast.
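Before doing so, it helps to recall what an additive model means in this context: the series is treated as a sum of trend, seasonality, and noise components, which can then be estimated separately. A toy numpy illustration of this idea (not Prophet's actual fitting procedure, which uses Stan under the hood):

```python
import numpy as np

t = np.arange(365, dtype=float)
trend = 0.05 * t                               # slow upward drift
seasonality = 2.0 * np.sin(2 * np.pi * t / 7)  # weekly cycle
rng = np.random.default_rng(0)
noise = rng.normal(scale=0.1, size=t.size)

# the additive composition: y(t) = trend(t) + seasonality(t) + noise(t)
y = trend + seasonality + noise

# Recover the trend with a centered 7-day moving average; a full-period
# average cancels the weekly component exactly, leaving trend + small noise.
kernel = np.ones(7) / 7
trend_est = np.convolve(y, kernel, mode='valid')
err = np.abs(trend_est - trend[3:-3]).max()
print(err)
```

Because the components simply add, each one can be modeled and inspected independently, which is what makes Prophet's component plots (trend, weekly, yearly) possible.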
We begin by converting the time series index into a column of its own (simply how prophet expects the data), followed by splitting the series into training and testing sets (90-10 split). The following snippet performs the required actions.

In [1]: # reset index to get date_time as a column
   ...: prophet_df = sp_df.reset_index()
   ...:
   ...: # prepare the required dataframe
   ...: prophet_df.rename(columns={'index':'ds','Close':'y'}, inplace=True)
   ...: prophet_df = prophet_df[['ds','y']]
   ...:
   ...: # prepare train and test sets
   ...: train_size = int(prophet_df.shape[0]*0.9)
   ...: train_df = prophet_df.ix[:train_size]
   ...: test_df = prophet_df.ix[train_size+1:]

Once we have the datasets prepared, we create an object of the Prophet class and simply fit the model using the fit() function. Kindly note that the model expects the time series values to be in a column named y and the timestamps in a column named ds. To make forecasts, prophet requires the set of dates for which we need to forecast. For this, it provides a clean utility called make_future_dataframe(), which takes the number of days required for the forecast as input. The following snippet uses this dataframe to forecast values.

In [2]: # prepare a future dataframe
   ...: test_dates = pro_model.make_future_dataframe(periods=test_df.shape[0])
   ...:
   ...: # forecast values
   ...: forecast_df = pro_model.predict(test_dates)

The output of the predict() function is a dataframe that includes both in-sample predictions as well as forecasted values. The dataframe also includes the confidence interval values. All of this can easily be plotted using the plot() function of the model object. The following snippet plots the forecasted values against the original time series, along with the confidence intervals.
In [3]: # plot against true data
   ...: plt.plot(forecast_df.yhat, c='r', label='Forecast')
   ...: plt.plot(forecast_df.yhat_lower.iloc[train_size+1:],
   ...:          linestyle='--', c='b', alpha=0.3,
   ...:          label='Confidence Interval')
   ...: plt.plot(forecast_df.yhat_upper.iloc[train_size+1:],
   ...:          linestyle='--', c='b', alpha=0.3,
   ...:          label='Confidence Interval')
   ...: plt.plot(prophet_df.y, c='g', label='True Data')
   ...: plt.legend()
   ...: plt.title('Prophet Model Forecast Against True Data')
   ...: plt.show()

Figure 11-19. Forecasts from prophet against true/observed values

The model's forecasts are a bit off the mark (see Figure 11-19), but the exercise clearly demonstrates the possibilities here. The forecast dataframe provides even more details about seasonality, weekly trends, and so on. You are encouraged to explore this further. Prophet is based upon Stan, a statistical modeling language/framework that provides algorithms exposed through interfaces for all major languages, including Python. You may explore more on this at http://mc-stan.org/.

Summary

This chapter introduced the concepts of time series forecasting and analysis using stock and commodity price information. Through this chapter, we covered the basic components of a time series along with common techniques for preprocessing such data. We then worked on the gold price prediction use case. This use case utilized the quandl library to get daily gold price information. We then discussed traditional time series analysis techniques and introduced key concepts related to the Box-Jenkins methodology and ARIMA in particular. We also discussed techniques for identifying non-stationary time series and transforming them into stationary ones using AD Fuller tests, ACF, and PACF plots.
We modeled the gold price information using ARIMA based on statsmodels APIs while developing some key utility functions like auto_arima() and arima_gridsearch_cv(). Key insights and caveats were also discussed. The next section of the chapter introduced the stock price prediction use case. Here, we utilized pandas_datareader to get S&P 500 daily closing price information. To solve this use case, we utilized RNN based models. Primarily, we provided two alternative perspectives on formulating the forecasting problem, both using LSTMs. The first formulation closely imitated the regression concepts discussed in earlier chapters; a two-layer stacked LSTM network was used to forecast stock price information. The second perspective utilized the TimeDistributed layer wrapper from keras to enable sequence modeling of the stock price information. Various utilities and key concepts were discussed while working on the use case. Finally, an upcoming tool (still in beta), Prophet from Facebook, was discussed. The tool is made available by Facebook's Data Science team to perform forecasting at scale. We utilized the framework to quickly evaluate its performance on the same stock price information and shared the results. A multitude of techniques and concepts were introduced in this chapter, along with the intuition on how to formulate certain time series problems. Stay tuned for some more exciting use cases in the next chapter.

CHAPTER 12

Deep Learning for Computer Vision

Deep Learning is not just a keyword abuzz in industry and academia; it has thrown wide open a whole new field of possibilities. Deep Learning models are being employed in all sorts of use cases and domains, some of which we saw in the previous chapters. Deep neural networks have tremendous potential to learn complex non-linear functions, patterns, and representations.
Their power is driving research in multiple fields, including computer vision, audio-visual analysis, chatbots, and natural language understanding, to name a few. In this chapter, we touch on some of the advanced areas in the field of computer vision that have recently come into prominence with the advent of Deep Learning. This includes real-world applications like image categorization and classification, and the very popular concept of image artistic style transfer. Computer vision is all about the art and science of making machines understand high-level, useful patterns and representations from images and videos, so that they can make intelligent decisions similar to what a human would make upon observing their surroundings. Building on core concepts like convolutional neural networks and transfer learning, this chapter provides you with a glimpse into the forefront of Deep Learning research with several real-world case studies from computer vision. This chapter discusses convolutional neural networks through the task of image classification using publicly available datasets like CIFAR, ImageNet, and MNIST. We will then utilize our understanding of CNNs to take on the task of style transfer and understand how neural networks can be used to learn high-level features. Through this chapter, we cover the following topics in detail:

• Brief overview of convolutional neural networks
• Image classification using CNNs from scratch
• Transfer learning: image classification using pretrained models
• Neural style transfer using CNNs

The code samples, jupyter notebooks, and sample datasets for this chapter are available in the GitHub repository for this book at https://github.com/dipanjanS/practical-machine-learning-with-python under the directory/folder for Chapter 12.

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are similar to the general neural networks we have discussed over the course of this book.
The additional explicit assumption of the input being an image (tensor) is what makes CNNs optimized and different from the usual neural networks. This explicit assumption is what allows us to design deep CNNs while keeping the number of trainable parameters in check (in comparison to general neural networks). We touched upon the concepts of CNNs in Chapter 1 (in the section "Deep Learning") and Chapter 4 (in the section "Feature Engineering on Image Data"). However, as a quick refresher, the following are the key concepts worth reiterating:

• Convolutional Layer: This is the key differentiating component of a CNN as compared to other neural networks. A convolutional layer, or conv layer, is a set of learnable filters. These filters help capture spatial features. They are usually small (along the width and height) but cover the full depth (color range) of the image. During the forward pass, we slide each filter across the width and the height of the image while computing the dot product between the filter attributes and the input at every position. The output is a two-dimensional activation map from each filter; these maps are then stacked to get the final output.

• Pooling Layer: These are basically down-sampling layers used to reduce the spatial size and the number of parameters. These layers also help in controlling overfitting. Pooling layers are inserted in between conv layers and can perform down-sampling using functions such as max, average, L2-norm, and so on.

• Fully Connected Layer: Also known as the FC layer. These are similar to fully connected layers in general neural networks, with full connections to all neurons in the previous layer. This layer helps perform the tasks of classification.
• Parameter Sharing: The unique thing about CNNs, apart from the conv layer itself, is parameter sharing. Conv layers use the same set of weights across a filter's spatial positions, thus reducing the overall number of parameters required.

A typical CNN architecture with all the components is depicted in Figure 12-1, which shows a LeNet CNN model.

Figure 12-1. LeNet CNN model (source: deeplearning.net)

CNNs have been studied in depth and are being constantly improved and experimented with. For an in-depth understanding of CNNs, refer to courses such as the one from Stanford available at http://cs231n.github.io/convolutional-networks.

Image Classification with CNNs

Convolutional Neural Networks are prime examples of the potential and power of neural networks to learn detailed feature representations and patterns from images and perform complex tasks, ranging from object recognition to image classification and many more. CNNs have gone through tremendous research, and advancements have led to more complex and powerful architectures like VGG-16, VGG-19, Inception V3, and many more interesting models. We begin by getting some hands-on experience with CNNs, working on an image classification problem. We shared an example of CNN based classification in Chapter 4 through the notebook Bonus Classifying handwritten digits using Deep CNNs.ipynb, which talks about classifying and predicting human handwritten digits by leveraging CNN based Deep Learning. In case you haven't gone through it, do not worry, as we will go through a detailed example here. For our Deep Learning needs, we will be utilizing the keras framework with the tensorflow backend, similar to what we used in the previous chapters.

Problem Statement

Given a set of images containing real-world objects, it is fairly easy for humans to recognize them.
Our task here is to build a multiclass (10 classes or categories) image classifier that can identify the correct class label of a given image. For this task, we will be utilizing the CIFAR10 dataset.

Dataset

The CIFAR10 dataset is a collection of tiny labeled images spanning 10 different classes. The dataset was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton and is available at https://www.cs.toronto.edu/~kriz/cifar.html, as well as through the datasets module in keras. The dataset contains tiny images of size 32 x 32, with 50,000 training and 10,000 test samples. Each image can fall into one and only one of the following classes:

• Automobile
• Airplane
• Bird
• Cat
• Deer
• Dog
• Frog
• Horse
• Ship
• Truck

The classes are mutually exclusive. There is another, larger version of the dataset called CIFAR100. For the purpose of this section, we will consider the CIFAR10 dataset, which we access through the keras.datasets module; the required files are downloaded if they are not already present.

CNN Based Deep Learning Classifier from Scratch

Similar to any Machine Learning algorithm, neural networks require the input data to be of a certain shape, size, and type. So, before we reach the modeling step, the first thing to do is preprocess the data itself. The following snippet gets the dataset and then performs one hot encoding of the labels. Remember, there are 10 classes to work with, and hence we are dealing with a multiclass classification problem.
In [1]: import keras
   ...: from keras.datasets import cifar10
   ...:
   ...: num_classes = 10
   ...:
   ...: (x_train, y_train), (x_test, y_test) = cifar10.load_data()
   ...:
   ...: # convert class vectors to binary class matrices
   ...: y_train = keras.utils.to_categorical(y_train, num_classes)
   ...: y_test = keras.utils.to_categorical(y_test, num_classes)

The dataset, if not already present locally, is downloaded automatically. The following are the shapes of the objects obtained.

In [2]: print('x_train shape:', x_train.shape)
   ...: print(x_train.shape[0], 'train samples')
   ...: print(x_test.shape[0], 'test samples')
x_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples

Now we have training and test datasets. The next step is to build the CNN model. Since we have two-dimensional images (the third dimension is the channel information), we will be using Conv2D layers. As discussed in the previous section, CNNs use a combination of convolutional and pooling layers, followed by a fully connected end, to identify/classify the data. The model architecture is built as follows.

In [3]: model = Sequential()
   ...: model.add(Conv2D(32, kernel_size=(3, 3),
   ...:                  activation='relu',
   ...:                  input_shape=input_shape))
   ...: model.add(Conv2D(64, (3, 3), activation='relu'))
   ...: model.add(MaxPooling2D(pool_size=(2, 2)))
   ...: model.add(Dropout(0.25))
   ...: model.add(Flatten())
   ...: model.add(Dense(128, activation='relu'))
   ...: model.add(Dropout(0.5))
   ...: model.add(Dense(num_classes, activation='softmax'))

The model starts off with a convolutional layer with 32 3 x 3 filters and the rectified linear unit (relu) activation function. The input shape matches each image's size, i.e., 32 x 32 x 3 (a color image has three channels: RGB). This is followed by another convolutional layer and a max-pooling layer. Finally, we have the fully connected dense layers.
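As an aside, the spatial dimensions flowing through this architecture can be verified with the standard output-size formula for convolutions, (W - F + 2P)/S + 1, where W is the input size, F the filter size, P the padding, and S the stride. A quick sketch applying it to the layers above:

```python
def conv_out(size, kernel, stride=1, padding=0):
    # standard conv output-size formula: (W - F + 2P) / S + 1
    return (size - kernel + 2 * padding) // stride + 1

s = 32               # CIFAR10 images are 32 x 32
s = conv_out(s, 3)   # Conv2D(32, 3x3, 'valid') -> 30 x 30
s = conv_out(s, 3)   # Conv2D(64, 3x3)          -> 28 x 28
s = s // 2           # MaxPooling2D(2x2)        -> 14 x 14
flattened = s * s * 64   # Flatten() feeds this many values to Dense(128)
print(s, flattened)
```

So by the time the data reaches the Flatten layer, the 32 x 32 image has become a 14 x 14 x 64 volume of feature maps.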
Since we have 10 classes to choose from, the final output layer has a softmax activation.

The next step involves compiling. We use categorical_crossentropy as our loss function, since we are dealing with multiple classes. Besides this, we use the Adadelta optimizer, and then train the classifier on the training data. The following snippet showcases the same.

In [4]: model.compile(loss=keras.losses.categorical_crossentropy,
   ...:               optimizer=keras.optimizers.Adadelta(),
   ...:               metrics=['accuracy'])
   ...:
   ...: model.fit(x_train, y_train,
   ...:           batch_size=batch_size,
   ...:           epochs=epochs,
   ...:           verbose=1)
Epoch 1/10
50000/50000 [==============================] - 256s - loss: 7.3118 - acc: 0.1798
Epoch 2/10
50000/50000 [==============================] - 250s - loss: 1.7923 - acc: 0.3564
Epoch 3/10
50000/50000 [==============================] - 252s - loss: 1.5781 - acc: 0.4383
...
Epoch 9/10
50000/50000 [==============================] - 251s - loss: 1.1019 - acc: 0.6163
Epoch 10/10
50000/50000 [==============================] - 254s - loss: 1.0584 - acc: 0.6284

From the preceding output, it is clear that we trained the model for 10 epochs. Each epoch takes anywhere between 200-400 seconds on a CPU; performance improves manifold when using a GPU. Based on the last epoch, the training accuracy is around 63%. We now evaluate the testing performance using the evaluate function of the model object. The results are as follows.

Test loss: 1.10143025074
Test accuracy: 0.6354

Thus, we can see that our very simple CNN based Deep Learning model achieved an accuracy of 63.5%, given the fact that we built a very simple model and did not do much preprocessing or model tuning. You are encouraged to try different CNN architectures and experiment with hyperparameter tuning to see how the results can be improved. The initial few conv layers of the model work toward feature extraction, while the last couple of (fully connected) layers help in classifying the data. Thus, it is interesting to see how the image data is manipulated by the conv-net we just created. Luckily, keras provides hooks to extract information at intermediate steps in the model. The extracted activations depict how various regions of the image activate the conv layers and how the corresponding feature representations and patterns are extracted.
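To build intuition for what such an activation map is, the sliding dot product a conv layer computes can be sketched in a few lines of numpy. This is a toy single-channel cross-correlation with a hand-crafted edge filter, purely for illustration; it is not code from the book's notebook:

```python
import numpy as np

def activation_map(image, kernel):
    """Slide `kernel` over `image` ('valid' cross-correlation), producing
    a 2D activation map, as a conv layer does for each learned filter."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product of the filter with the image patch at (i, j)
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# a vertical-edge filter "lights up" where intensity changes left-to-right
image = np.zeros((6, 6))
image[:, 3:] = 1.0  # dark left half, bright right half
edge_filter = np.array([[1.0, -1.0],
                        [1.0, -1.0]])
amap = activation_map(image, edge_filter)
```

The resulting map is zero everywhere except along the column where the intensity changes, which is exactly the kind of localized response the activation visualizations in the notebook display.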
You are encouraged to try different CNN architectures and experiment with hyperparameter tuning to see how the results can be improved. The initial few conv layers of the model kind of work toward feature extraction while the last couple of layers (fully connected) help in classifying the data. Thus, it would be interesting to see how the image data is manipulated by the conv-net we just created. Luckily, keras provides hooks to extract information at intermediate steps in the model. They depict how various regions of the image activate the conv layers and how the corresponding feature representations and patterns are extracted. 503 Chapter 12 ■ Deep Learning for Computer Vision Figure 12-2. Sample image from the CIFAR10 dataset A sample flow of how an image is viewed by the CNN is explained in the notebook notebook_cnn_cifar10_classifier.ipynb. It contains rest of the code discussed in this section. Figure 12-2 shows an image from the test dataset. It looks like a ship and the model correctly identifies the same as well as depicted in this snippet. # actual image id img_idx = 999 # actual image label In [5]: y_test[img_idx] array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.]) # predict label with our model In [6]: test_image =np.expand_dims(x_test[img_idx], axis=0)    ...: model.predict_classes(test_image,batch_size=1) 1/1 [==============================] - 0s Out[16]: array([8], dtype=int64) You can extract and view the activation maps of the image based on what representations are learned and extracted by the conv layers using the get_activations(...) and display_activations(...) functions in the notebook. Figure 12-3 shows the activations of initial conv layers of the CNN model we just built. Figure 12-3. 
Sample image through a CNN layer

We also recommend you go through the section "Automated Feature Engineering with Deep Learning" in Chapter 4 to learn more about extracting feature representations from images using convolutional layers.

CNN Based Deep Learning Classifier with Pretrained Models

Building a classifier from scratch has its own set of pros and cons. A more refined approach is to leverage pre-trained models trained over large, complex datasets. There are many famous CNN architectures like LeNet, ResNet, VGG-16, VGG-19, and so on. These models have deep and complex architectures that have been fine-tuned and trained over diverse, large datasets. Hence, these models have been proven to have amazing performance on complex object recognition tasks. Since obtaining large labeled datasets and training highly complex and deep neural networks is a time-consuming task (training a complex CNN like VGG-19 could take a few weeks, even using GPUs), in practice we utilize a concept formally termed transfer learning. Transfer learning helps us leverage existing models for our tasks. The core idea is to leverage the learning the model acquired from being trained over a large dataset, and then transfer this learning by re-using the same model to extract feature representations from new images. There are several strategies for performing transfer learning, some of which are mentioned as follows:

• Pre-trained model as feature extractor: The pre-trained model is used to extract features for our dataset. We build a fully connected classifier on top of these features. In this case, we only need to train the fully connected classifier, which does not take much time.

• Fine-tuning pre-trained models: It is possible to fine-tune an existing pre-trained model by fixing some of the layers and allowing others to learn/update weights, apart from the fully connected layers.
Usually, it is observed that initial layers capture generic features while the deeper ones become more specific in terms of feature extraction. Thus, depending upon the requirements, we freeze certain layers and fine-tune the rest.

In this section, we will utilize a pre-trained conv-net as a feature extractor, build a fully connected layer based classifier on top of it, and train the model. We will not train the feature extraction layers, and hence we leverage the principles of transfer learning by using the pre-trained conv layers for feature extraction. The VGG-19 model from the Visual Geometry Group of the University of Oxford is a state-of-the-art convolutional neural network that has been shown to perform extremely well on various benchmarks and competitions. VGG-19 is a 19-layer conv-net trained on the ImageNet dataset, a visual database of over 10 million hand-annotated images spanning 9,000+ categories. This model has been widely studied and used in tasks such as transfer learning.

■■Note  More details on this and other research by the VGG group are available at http://www.robots.ox.ac.uk/~vgg/research/very_deep/.

This pretrained model is available through the keras.applications module. As mentioned, we will utilize VGG-19 as a feature extractor to help us build a classifier on the CIFAR10 dataset. Since we will use VGG-19 only for feature extraction, we do not need the top (fully connected) layers of this model; keras makes leaving them out as simple as setting a single flag to False. The following snippet loads the VGG-19 model architecture consisting of the conv layers and leaves out the fully connected layers.

In [1]: from keras import applications
   ...:
   ...: vgg_model = applications.VGG19(include_top=False, weights='imagenet')

Now that the pre-trained model is available, we will utilize it to extract features from our training dataset.
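The "feature extractor" strategy can be illustrated with a deliberately tiny numpy sketch: a frozen, untrained random projection stands in for the pre-trained conv layers, and only the simple classifier on top (here, an ordinary least-squares fit) is trained. All names and data below are illustrative and are not part of the book's code.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy data: two well-separated Gaussian blobs (stand-ins for images)
X = np.vstack([rng.randn(50, 8) + 2, rng.randn(50, 8) - 2])
y = np.array([0] * 50 + [1] * 50)

# "Pre-trained" feature extractor: a frozen random projection + ReLU.
# In the real workflow this role is played by the VGG-19 conv layers,
# whose weights are likewise never updated.
W_frozen = rng.randn(8, 16)

def extract_features(X):
    return np.maximum(X @ W_frozen, 0)   # weights stay fixed

# Train ONLY the classifier on top of the extracted features
F = extract_features(X)
F1 = np.hstack([F, np.ones((F.shape[0], 1))])   # add a bias column
w, *_ = np.linalg.lstsq(F1, y, rcond=None)

preds = (F1 @ w > 0.5).astype(int)
accuracy = np.mean(preds == y)
assert accuracy > 0.9   # the blobs are easily separated
```

The point of the sketch is the division of labor: the extractor is applied but never trained, so all the (cheap) optimization effort goes into the small classifier on top.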
Remember, VGG-19 is trained on ImageNet, while we will be using CIFAR10 to build a classifier. Since ImageNet contains over 10 million images spanning 9,000+ categories, it is safe to assume that CIFAR10's categories would be a subset of them. Before moving on to feature extraction using the VGG-19 model, it is a good idea to check out the model's architecture.

In [1]: vgg_model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, None, None, 3)     0
_________________________________________________________________
block1_conv1 (Conv2D)        (None, None, None, 64)    1792
_________________________________________________________________
block1_conv2 (Conv2D)        (None, None, None, 64)    36928
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, None, None, 64)    0
_________________________________________________________________
block2_conv1 (Conv2D)        (None, None, None, 128)   73856
_________________________________________________________________
block2_conv2 (Conv2D)        (None, None, None, 128)   147584
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, None, None, 128)   0
_________________________________________________________________
block3_conv1 (Conv2D)        (None, None, None, 256)   295168
_________________________________________________________________
block3_conv2 (Conv2D)        (None, None, None, 256)   590080
_________________________________________________________________
block3_conv3 (Conv2D)        (None, None, None, 256)   590080
_________________________________________________________________
block3_conv4 (Conv2D)        (None, None, None, 256)   590080
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, None, None, 256)   0
_________________________________________________________________
block4_conv1 (Conv2D)        (None, None, None, 512)   1180160
_________________________________________________________________
block4_conv2 (Conv2D)        (None, None, None, 512)   2359808
_________________________________________________________________
block4_conv3 (Conv2D)        (None, None, None, 512)   2359808
_________________________________________________________________
block4_conv4 (Conv2D)        (None, None, None, 512)   2359808
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, None, None, 512)   0
_________________________________________________________________
block5_conv1 (Conv2D)        (None, None, None, 512)   2359808
_________________________________________________________________
block5_conv2 (Conv2D)        (None, None, None, 512)   2359808
_________________________________________________________________
block5_conv3 (Conv2D)        (None, None, None, 512)   2359808
_________________________________________________________________
block5_conv4 (Conv2D)        (None, None, None, 512)   2359808
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, None, None, 512)   0
=================================================================
Total params: 20,024,384
Trainable params: 20,024,384
Non-trainable params: 0
_________________________________________________________________

From the preceding output, you can see that the architecture is huge, with a lot of layers. Figure 12-4 depicts the same layers in an easier-to-understand visual. Remember that we do not use the fully connected layers depicted at the extreme right of Figure 12-4.
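The parameter counts in this summary are easy to verify by hand: a Conv2D layer with 3×3 kernels holds (3 · 3 · in_channels + 1) · out_channels parameters, the +1 being the per-filter bias. A quick check against a few rows of the table:

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    # weights per filter = k*k*in_channels, plus one bias per filter
    return (kernel_size * kernel_size * in_channels + 1) * out_channels

assert conv2d_params(3, 3, 64) == 1792        # block1_conv1
assert conv2d_params(3, 64, 64) == 36928      # block1_conv2
assert conv2d_params(3, 64, 128) == 73856     # block2_conv1
assert conv2d_params(3, 256, 512) == 1180160  # block4_conv1
assert conv2d_params(3, 512, 512) == 2359808  # block5_conv1
```

The pooling and input layers, which have no learnable weights, correspondingly report zero parameters.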
We recommend checking out the paper Very Deep Convolutional Networks for Large-Scale Image Recognition by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group, Department of Engineering Science, University of Oxford. The paper is available at https://arxiv.org/abs/1409.1556 and discusses the architecture of these models in detail.

Figure 12-4. Visual depiction of the VGG-19 architecture

Loading the CIFAR10 training and test datasets is the same as discussed in the previous section, and we perform similar one hot encoding of the labels as well. Since the VGG-19 model has been loaded without the final fully connected layers, the predict(...) function of the model gives us the extracted features for our dataset. The following snippet extracts the features for both the training and test datasets.

In [2]: bottleneck_features_train = vgg_model.predict(x_train, verbose=1)
   ...: bottleneck_features_test = vgg_model.predict(x_test, verbose=1)

These features are widely known as bottleneck features, due to the fact that there is an overall decrease in the volume of the input data points. It is worth exploring the model summary to understand how the VGG model transforms the data. The output of this stage (the bottleneck features) is used as input to the classifier we build next. The following snippet builds a simple fully connected two-layer classifier.
In [3]: clf_model = Sequential()
   ...: clf_model.add(Flatten(input_shape=bottleneck_features_train.shape[1:]))
   ...: clf_model.add(Dense(512, activation='relu'))
   ...: clf_model.add(Dropout(0.5))
   ...: clf_model.add(Dense(256, activation='relu'))
   ...: clf_model.add(Dropout(0.5))
   ...: clf_model.add(Dense(num_classes, activation='softmax'))
   ...: clf_model.compile(loss=keras.losses.categorical_crossentropy,
   ...:                   optimizer=keras.optimizers.Adadelta(),
   ...:                   metrics=['accuracy'])

The model's input layer matches the dimensions of the bottleneck features (for obvious reasons). As with the CNN model we built from scratch, this model also has a dense output layer with a softmax activation function. Training this model, as opposed to a complete VGG-19, is fairly simple and fast, as depicted in the following snippet.

In [4]: clf_model.fit(bottleneck_features_train, y_train,
   ...:               batch_size=batch_size, epochs=epochs, verbose=1)
Epoch 1/50
50000/50000 [==============================] - 8s - loss: 7.2495 - acc: 0.2799
Epoch 2/50
50000/50000 [==============================] - 7s - loss: 2.2513 - acc: 0.2768
Epoch 3/50
50000/50000 [==============================] - 7s - loss: 1.9096 - acc: 0.3521
...
Epoch 48/50
50000/50000 [==============================] - 8s - loss: 0.9368 - acc: 0.6814
Epoch 49/50
50000/50000 [==============================] - 8s - loss: 0.9223 - acc: 0.6832
Epoch 50/50
50000/50000 [==============================] - 8s - loss: 0.9197 - acc: 0.6830

We can add hooks to stop the training early based on early stopping criteria, but for now we keep things simple. The complete code for this section is available in the notebook notebook_pretrained_cnn_cifar10_classifier.ipynb. Overall, we achieve an accuracy of 68% on the training dataset and around 64% on the test dataset.
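As a brief aside, the one hot encoding applied to the labels (keras provides keras.utils.to_categorical for this), and its decoding via np.nonzero, can be sketched in plain numpy. This is a minimal sketch assuming the standard CIFAR10 label order:

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Integer class labels -> one hot vectors (what keras's to_categorical does)."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

y = np.array([8, 0, 5])        # CIFAR10: ship, airplane, dog
y_enc = one_hot(y)

# Decoding a one hot row back to its integer label
decoded = np.nonzero(y_enc[0])[0][0]
assert decoded == 8
assert y_enc.shape == (3, 10)
```

This is also exactly the label array format shown earlier for y_test[img_idx], where the single 1 sits at index 8 (ship).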
Now, let's look at the performance of this classifier, built on top of a pre-trained model, on the test dataset. The following snippet showcases a utility function that takes the index of an image in the test dataset as input and compares the actual and predicted labels.

def predict_label(img_idx, show_proba=True):
    plt.imshow(x_test[img_idx], aspect='auto')
    plt.title("Image to be Labeled")
    plt.show()
    print("Actual Class:{}".format(np.nonzero(y_test[img_idx])[0][0]))
    test_image = np.expand_dims(x_test[img_idx], axis=0)
    bf = vgg_model.predict(test_image, verbose=0)
    pred_label = clf_model.predict_classes(bf, batch_size=1, verbose=0)
    print("Predicted Class:{}".format(pred_label[0]))
    if show_proba:
        print("Predicted Probabilities")
        print(clf_model.predict_proba(bf))

The following is the output of the predict_label(...) function when tested against a couple of images from the test dataset. As depicted in Figure 12-5, we correctly predict that the images belong to class labels 5 (dog) and 9 (truck)!

Figure 12-5. Predicted labels from the pre-trained CNN based classifier

This section demonstrated the power and advantages of transfer learning. Instead of spending time reinventing the wheel, with a few lines of code we were able to leverage a state-of-the-art neural network for our classification task. The concept of transfer learning also forms the basis of neural style transfer, which we discuss in the next section.

Artistic Style Transfer with CNNs

Paintings (or, for that matter, any form of art) require special skill that only a few have mastered. Paintings present a complex interplay of content and style. Photographs, on the other hand, are a combination of perspective and light. When the two are combined, the results are spectacular and surprising. One such example is shown in Figure 12-6.
Figure 12-6. Left image: the original photograph depicting the Neckarfront in Tübingen, Germany. Right image: the painting (inset: The Starry Night by Vincent van Gogh) that provided the style for the generated image. Source: A Neural Algorithm of Artistic Style, Gatys et al. (arXiv:1508.06576v2)

The results in Figure 12-6 showcase how a painting's style (Van Gogh's The Starry Night) has been transferred to a photograph of the Neckarfront. At first glance, the process seems to have picked up the content from the photograph and the style, colors, and stroke patterns from the painting to generate the final outcome. The results are amazing, but what is more surprising is: how was it done? Figure 12-6 showcases a process termed artistic style transfer, an outcome of research by Gatys et al. presented in their paper A Neural Algorithm of Artistic Style. In this section, we discuss the intricacies of this paper from an implementation point of view and see how we can perform this technique ourselves.

■■Note  Prisma is an app that transforms photos into works of art using techniques of artistic style transfer based on convolutional neural networks. More about the app is available at https://prisma-ai.com/.

Background

Formally, neural style transfer is the process of applying the "style" of a reference image to a specific target image such that, in the process, the original "content" of the target image remains unchanged. Here, style is defined as the colors, patterns, and textures present in the reference image, while content is defined as the overall structure and higher-level components of the image. The main objective, then, is to retain the content of the original target image while superimposing or adopting the style of the reference image on the target image. To define this concept mathematically, consider three images: the original content (denoted as c), the reference style (denoted as s), and the generated image (denoted as g).
Thus, we need a way to measure how different the images c and g are in terms of their content: a function that tends to 0 when c and g are identical in content and grows as their content diverges. This can be concisely stated as a loss function:

Lcontent = distance(c, g)

where distance is a norm function like L2. Along the same lines, we can define another function that captures how different the images s and g are in terms of their style:

Lstyle = distance(s, g)

Thus, for the overall process of neural style transfer, we have an overall objective, which can be defined as a combination of the content and style loss functions:

Lstyle-transfer = argmin_g (α Lcontent(c, g) + β Lstyle(s, g))

where α and β are weights used to control the impact of the content and style components on the overall loss. The loss function we will actually try to minimize consists of three parts, namely the content loss, the style loss, and the total variation loss, which we will talk about later. The beauty of Deep Learning is that, by leveraging architectures like deep convolutional neural networks (CNNs), we can mathematically define the above-mentioned style and content functions. We will be using the principles of transfer learning in building our system for neural style transfer. We introduced the concept of transfer learning using a pre-trained deep CNN model like VGG-19, and we will leverage the same pre-trained model for the task of neural style transfer. The main steps are outlined as follows:

• Leverage VGG-19 to compute layer activations for the style, content, and generated images.
• Use these activations to define the specific loss functions mentioned earlier.
• Finally, use gradient descent to minimize the overall loss.

We recommend you follow along with this section using the notebook titled Neural Style Transfer.ipynb, which contains step-by-step details of the style transfer process.
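To make the weighted objective concrete, here is a toy numpy computation using the squared L2 norm as the distance function. In the actual implementation, these distances are computed over CNN feature maps rather than raw vectors, so the arrays below are purely illustrative:

```python
import numpy as np

def distance(a, b):
    # squared L2 norm of the difference
    return np.sum((a - b) ** 2)

alpha, beta = 0.025, 1.0        # content and style weights

c = np.array([1.0, 2.0, 3.0])   # "content" representation
s = np.array([0.0, 1.0, 0.0])   # "style" representation
g = np.array([1.0, 2.0, 3.0])   # "generated" representation

loss = alpha * distance(c, g) + beta * distance(s, g)
assert distance(c, g) == 0.0    # content term vanishes: g matches c exactly
assert loss == 11.0             # only the style term contributes here
```

During optimization, g is the quantity being adjusted: gradient descent nudges it to pull both distance terms down simultaneously, with α and β arbitrating the trade-off.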
We would also like to give a special mention and thanks to François Chollet as well as Harish Narayanan for providing some excellent resources on style transfer; details on these are mentioned later. We also recommend you check out the following papers (detailed links are shared later on):

• A Neural Algorithm of Artistic Style by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge
• Perceptual Losses for Real-Time Style Transfer and Super-Resolution by Justin Johnson, Alexandre Alahi, and Li Fei-Fei

Preprocessing

The first step toward implementing such a network is to preprocess the data, or images in this case. The following are quick utilities to preprocess images for size and channel adjustments.

import numpy as np
from keras.applications import vgg19
from keras.preprocessing.image import load_img, img_to_array

def preprocess_image(image_path, height=None, width=None):
    height = 400 if not height else height
    # default to a square image when width is not given
    width = width if width else height
    img = load_img(image_path, target_size=(height, width))
    img = img_to_array(img)
    img = np.expand_dims(img, axis=0)
    img = vgg19.preprocess_input(img)
    return img

def deprocess_image(x):
    # Remove zero-center by mean pixel
    x[:, :, 0] += 103.939
    x[:, :, 1] += 116.779
    x[:, :, 2] += 123.68
    # 'BGR'->'RGB'
    x = x[:, :, ::-1]
    x = np.clip(x, 0, 255).astype('uint8')
    return x

As we will be writing custom loss functions and manipulation routines, we need to define certain placeholders. keras is a high-level library that utilizes tensor manipulation backends (like tensorflow, theano, and CNTK) to perform the heavy lifting, so these placeholders provide high-level abstractions for working with the underlying tensor objects. The following snippet prepares placeholders for the style, content, and generated images, along with the input tensor for the neural network.
In [1]: # This is the path to the image you want to transform
   ...: TARGET_IMG = 'data/city_road.jpg'
   ...: # This is the path to the style image
   ...: REFERENCE_STYLE_IMG = 'data/style2.png'
   ...:
   ...: width, height = load_img(TARGET_IMG).size
   ...: img_height = 320
   ...: img_width = int(width * img_height / height)
   ...:
   ...: target_image = K.constant(preprocess_image(TARGET_IMG,
   ...:                                            height=img_height,
   ...:                                            width=img_width))
   ...: style_image = K.constant(preprocess_image(REFERENCE_STYLE_IMG,
   ...:                                           height=img_height,
   ...:                                           width=img_width))
   ...:
   ...: # Placeholder for our generated image
   ...: generated_image = K.placeholder((1, img_height, img_width, 3))
   ...:
   ...: # Combine the 3 images into a single batch
   ...: input_tensor = K.concatenate([target_image,
   ...:                               style_image,
   ...:                               generated_image], axis=0)

We will load the pre-trained VGG-19 model as we did in the previous section, i.e., without the top fully connected layers. The only difference here is that we provide the model constructor with the input tensor (and hence its size dimensions). The following snippet fetches the pretrained model.

In [2]: model = vgg19.VGG19(input_tensor=input_tensor,
   ...:                     weights='imagenet',
   ...:                     include_top=False)

You can use the summary() function to understand the architecture of the pre-trained model.

Loss Functions

As discussed in the Background subsection, the problem of neural style transfer revolves around loss functions for content and style. In this subsection, we discuss and define the required loss functions.
Content Loss

In any CNN-based model, activations from the top layers contain more global and abstract information (high-level structures like a face), while the bottom layers contain local information (low-level structures like eyes, nose, edges, and corners) about the image. We want to leverage the top layers of a CNN to capture the right representations for the content of an image. Hence, for the content loss, considering that we will be using the pre-trained VGG-19 CNN, we can define our loss function as the L2 norm (scaled and squared Euclidean distance) between the activations of a top layer (giving feature representations) computed over the target image and the activations of the same layer computed over the generated image. Assuming we get feature representations relevant to the content of images from the top layers of a CNN, the generated image is expected to look similar to the base target image. The following snippet showcases the function to compute the content loss.

def content_loss(base, combination):
    return K.sum(K.square(combination - base))

Style Loss

The original paper on neural style transfer, A Neural Algorithm of Artistic Style by Gatys et al., leverages multiple convolutional layers in the CNN (instead of one) to extract meaningful patterns and representations, capturing information pertaining to appearance or style from the reference style image across all spatial scales, irrespective of the image content. Staying true to the original paper, we will leverage the Gram matrix, computing it over the feature representations generated by the convolution layers. The Gram matrix computes the inner products between the feature maps produced in any given conv layer. The inner product terms are proportional to the covariances of the corresponding feature sets and hence capture patterns of correlation between the features of a layer that tend to activate together.
These feature correlations help capture relevant aggregate statistics of the patterns at a particular spatial scale, which correspond to the style, texture, and appearance, and not to the components and objects present in an image. The style loss is thus defined as the scaled and squared Frobenius norm of the difference between the Gram matrices of the reference style and generated images. Minimizing this loss helps ensure that the textures found at different spatial scales in the reference style image are similar in the generated image. The following snippet defines a style loss function based on this Gram matrix calculation.

def style_loss(style, combination, height, width):

    def build_gram_matrix(x):
        features = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))
        gram_matrix = K.dot(features, K.transpose(features))
        return gram_matrix

    S = build_gram_matrix(style)
    C = build_gram_matrix(combination)
    channels = 3
    size = height * width
    return K.sum(K.square(S - C)) / (4. * (channels ** 2) * (size ** 2))

Total Variation Loss

It was observed that optimizing to reduce only the style and content losses led to highly pixelated and noisy outputs. To address this, the total variation loss was introduced. The total variation loss is analogous to a regularization loss: it is introduced to ensure spatial continuity and smoothness in the generated image and to avoid noisy, overly pixelated results. It is defined in the following function.

def total_variation_loss(x):
    a = K.square(
        x[:, :img_height - 1, :img_width - 1, :] - x[:, 1:, :img_width - 1, :])
    b = K.square(
        x[:, :img_height - 1, :img_width - 1, :] - x[:, :img_height - 1, 1:, :])
    return K.sum(K.pow(a + b, 1.25))

Overall Loss Function

Having defined the components of the overall loss function for neural style transfer, the next step is to piece these building blocks together.
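Before doing so in keras, the three building blocks can be sanity-checked with a small numpy sketch: toy feature maps stand in for the VGG-19 activations, and the weights (0.025, 1.0, 1e-4) match the values used for the weighted loss in this section.

```python
import numpy as np

def content_loss(base, combination):
    return np.sum((combination - base) ** 2)

def build_gram_matrix(x):
    # x: feature map of shape (height, width, channels);
    # flatten each channel into a row, then take inner products
    features = x.transpose(2, 0, 1).reshape(x.shape[2], -1)
    return features @ features.T

def style_loss(style, combination, height, width, channels=3):
    S = build_gram_matrix(style)
    C = build_gram_matrix(combination)
    size = height * width
    return np.sum((S - C) ** 2) / (4. * (channels ** 2) * (size ** 2))

def total_variation_loss(x):
    # x: batch of shape (1, height, width, channels)
    a = (x[:, :-1, :-1, :] - x[:, 1:, :-1, :]) ** 2
    b = (x[:, :-1, :-1, :] - x[:, :-1, 1:, :]) ** 2
    return np.sum((a + b) ** 1.25)

rng = np.random.RandomState(42)
h, w = 4, 4
target_feat = rng.rand(h, w, 3)   # toy "content" activations
style_feat = rng.rand(h, w, 3)    # toy "style" activations
combo_feat = rng.rand(h, w, 3)    # toy "generated" activations
generated = rng.rand(1, h, w, 3)  # toy generated image batch

loss = (0.025 * content_loss(target_feat, combo_feat)
        + 1.0 * style_loss(style_feat, combo_feat, h, w)
        + 1e-4 * total_variation_loss(generated))

# each term vanishes for identical (or, for TV, constant) inputs
assert content_loss(target_feat, target_feat) == 0.0
assert style_loss(style_feat, style_feat, h, w) == 0.0
assert total_variation_loss(np.ones((1, h, w, 3))) == 0.0
assert loss > 0.0
```

Note also that the Gram matrix is symmetric by construction (it is features · featuresᵀ), which is why it captures which channels co-activate rather than where in the image they fire.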
Since content and style information are captured by the CNN at different depths in the network, we need to apply and calculate the loss at appropriate layers for each type of loss. Utilizing insights and research by Gatys et al. and Johnson et al. in their respective papers, we define the following utility to identify the content and style layers from the VGG-19 model. Even though Johnson et al. leverage the VGG-16 model for faster and better performance, we constrain ourselves to the VGG-19 model for ease of understanding and consistency across runs.

# define function to set layers based on source paper followed
def set_cnn_layers(source='gatys'):
    if source == 'gatys':
        # config from Gatys et al.
        content_layer = 'block5_conv2'
        style_layers = ['block1_conv1', 'block2_conv1', 'block3_conv1',
                        'block4_conv1', 'block5_conv1']
    elif source == 'johnson':
        # config from Johnson et al.
        content_layer = 'block2_conv2'
        style_layers = ['block1_conv2', 'block2_conv2', 'block3_conv3',
                        'block4_conv3', 'block5_conv3']
    else:
        # use Gatys config as the default anyway
        content_layer = 'block5_conv2'
        style_layers = ['block1_conv1', 'block2_conv1', 'block3_conv1',
                        'block4_conv1', 'block5_conv1']
    return content_layer, style_layers

The following snippet then applies the overall loss function based on the layers selected by the set_cnn_layers() function for content and style.
In [2]: # weights for the weighted average loss function
   ...: content_weight = 0.025
   ...: style_weight = 1.0
   ...: total_variation_weight = 1e-4
   ...:
   ...: # set the source research paper followed and set the content and style layers
   ...: source_paper = 'gatys'
   ...: content_layer, style_layers = set_cnn_layers(source=source_paper)
   ...:
   ...: ## build the weighted loss function
   ...:
   ...: # initialize total loss
   ...: loss = K.variable(0.)
   ...:
   ...: # add content loss
   ...: layer_features = layers[content_layer]
   ...: target_image_features = layer_features[0, :, :, :]
   ...: combination_features = layer_features[2, :, :, :]
   ...: loss += content_weight * content_loss(target_image_features,
   ...:                                       combination_features)
   ...:
   ...: # add style loss
   ...: for layer_name in style_layers:
   ...:     layer_features = layers[layer_name]
   ...:     style_reference_features = layer_features[1, :, :, :]
   ...:     combination_features = layer_features[2, :, :, :]
   ...:     sl = style_loss(style_reference_features, combination_features,
   ...:                     height=img_height, width=img_width)
   ...:     loss += (style_weight / len(style_layers)) * sl
   ...:
   ...: # add total variation loss
   ...: loss += total_variation_weight * total_variation_loss(generated_image)

Custom Optimizer

The objective is to iteratively minimize the overall loss with the help of an optimization algorithm. In the paper by Gatys et al., optimization was done using the L-BFGS algorithm, a quasi-Newton method popularly used for solving non-linear optimization and parameter estimation problems. This method usually converges faster than standard gradient descent. SciPy has an implementation available in scipy.optimize.fmin_l_bfgs_b().
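Before wiring this optimizer to the style transfer loss, here is a minimal, self-contained illustration of scipy.optimize.fmin_l_bfgs_b on a toy quadratic. Note its interface: it expects a flat 1D vector and separate callables for the loss and the gradient, which is precisely what motivates the Evaluator class introduced next.

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

# Toy objective: f(x) = sum((x - 3)^2), minimized at x = 3
def loss(x):
    return np.sum((x - 3.0) ** 2)

def grads(x):
    return 2.0 * (x - 3.0)

x0 = np.zeros(4)   # starting point must be a flat 1D vector
x_min, min_val, info = fmin_l_bfgs_b(loss, x0, fprime=grads, maxfun=20)

assert np.allclose(x_min, 3.0)   # converges to the true minimum
assert min_val < 1e-8
```

In the style transfer code, the image pixels play the role of x: they are flattened into one long vector, optimized, and reshaped back into an image.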
However, its limitations include being applicable only to flat 1D vectors (unlike the 3D image matrices we are dealing with), and the fact that the loss value and gradients need to be passed as two separate functions. We build an Evaluator class, based on patterns followed by keras creator François Chollet, to compute both loss and gradient values in one pass instead of as independent, separate computations. It returns the loss value when called the first time and caches the gradients for the next call, making it more efficient than computing both independently. The following snippet defines the Evaluator class.

class Evaluator(object):

    def __init__(self, height=None, width=None):
        self.loss_value = None
        self.grad_values = None
        self.height = height
        self.width = width

    def loss(self, x):
        assert self.loss_value is None
        x = x.reshape((1, self.height, self.width, 3))
        outs = fetch_loss_and_grads([x])
        loss_value = outs[0]
        grad_values = outs[1].flatten().astype('float64')
        self.loss_value = loss_value
        self.grad_values = grad_values
        return self.loss_value

    def grads(self, x):
        assert self.loss_value is not None
        grad_values = np.copy(self.grad_values)
        self.loss_value = None
        self.grad_values = None
        return grad_values

The loss and gradients are retrieved as follows. The snippet also creates an object of the Evaluator class.
In [3]: # Get the gradients of the generated image wrt the loss
   ...: grads = K.gradients(loss, generated_image)[0]
   ...:
   ...: # Function to fetch the values of the current loss and the current gradients
   ...: fetch_loss_and_grads = K.function([generated_image], [loss, grads])
   ...:
   ...: # evaluator object
   ...: evaluator = Evaluator(height=img_height, width=img_width)

Style Transfer in Action

The final piece of the puzzle is to use all the building blocks and see style transfer in action. The art/style and content images are available in the data directory for reference. The following snippet outlines how the loss and gradients are evaluated. We also write outputs back at regular intervals (every 5 iterations) to later understand how the process of neural style transfer transforms the images in consideration.

In [4]: result_prefix = 'style_transfer_result_' + TARGET_IMG.split('.')[0]
   ...: result_prefix = result_prefix + '_' + source_paper
   ...: iterations = 20
   ...:
   ...: # Run scipy-based optimization (L-BFGS) over the pixels of the generated image
   ...: # so as to minimize the neural style loss.
   ...: # This is our initial state: the target image.
   ...: # Note that `scipy.optimize.fmin_l_bfgs_b` can only process flat vectors.
   ...: x = preprocess_image(TARGET_IMG, height=img_height, width=img_width)
   ...: x = x.flatten()
   ...:
   ...: for i in range(iterations):
   ...:     print('Start of iteration', (i+1))
   ...:     start_time = time.time()
   ...:     x, min_val, info = fmin_l_bfgs_b(evaluator.loss, x,
   ...:                                      fprime=evaluator.grads, maxfun=20)
   ...:     print('Current loss value:', min_val)
   ...:     if (i+1) % 5 == 0 or i == 0:
   ...:         # Save the current generated image only every 5 iterations
   ...:         img = x.copy().reshape((img_height, img_width, 3))
   ...:         img = deprocess_image(img)
   ...:         fname = result_prefix + '_at_iteration_%d.png' % (i+1)
   ...:         imsave(fname, img)
   ...:         print('Image saved as', fname)
   ...:     end_time = time.time()
   ...:     print('Iteration %d completed in %ds' % (i+1, end_time - start_time))

It must be pretty evident by now that neural style transfer is a computationally expensive task. For the set of images in consideration, each iteration took around 500 seconds on an Intel i5 CPU with 8GB RAM, and up to 1,000 seconds when running multiple networks together. You may observe speedups if the same is done using GPUs. The following is the output of some of the iterations. We print the loss and time taken for each iteration, and save the generated image every five iterations.
Start of iteration 1
Current loss value: 2.4219e+09
Image saved as style_transfer_result_city_road_gatys_at_iteration_1.png
Iteration 1 completed in 506s
Start of iteration 2
Current loss value: 9.58614e+08
Iteration 2 completed in 542s
Start of iteration 3
Current loss value: 6.3843e+08
Iteration 3 completed in 854s
Start of iteration 4
Current loss value: 4.91831e+08
Iteration 4 completed in 727s
Start of iteration 5
Current loss value: 4.03013e+08
Image saved as style_transfer_result_city_road_gatys_at_iteration_5.png
Iteration 5 completed in 878s
...
Start of iteration 19
Current loss value: 1.62501e+08
Iteration 19 completed in 836s
Start of iteration 20
Current loss value: 1.5698e+08
Image saved as style_transfer_result_city_road_gatys_at_iteration_20.png
Iteration 20 completed in 838s

Now, let's see how neural style transfer has worked out for the images in consideration. Remember that we saved checkpoint outputs after certain iterations for every pair of style and content images.

■■Note  The style we use for our first image, depicted in Figure 12-7, is named Edtaonisl, a 1913 masterpiece by Francis Picabia. Through this oil painting, Francis Picabia pioneered a new visual language. More details about this painting are available at http://www.artic.edu/aic/collections/artwork/80062.

We utilize the matplotlib and skimage libraries to load and understand the style transfer magic! The following snippet loads the city road image as our content image and the Edtaonisl painting as our style image.
In [5]: from skimage import io
   ...: from glob import glob
   ...: from matplotlib import pyplot as plt
   ...:
   ...: cr_content_image = io.imread('results/city road/city_road.jpg')
   ...: cr_style_image = io.imread('results/city road/style2.png')
   ...:
   ...: fig = plt.figure(figsize=(12, 4))
   ...: ax1 = fig.add_subplot(1, 2, 1)
   ...: ax1.imshow(cr_content_image)
   ...: t1 = ax1.set_title('City Road Image')
   ...: ax2 = fig.add_subplot(1, 2, 2)
   ...: ax2.imshow(cr_style_image)
   ...: t2 = ax2.set_title('Edtaonisl Style')

Figure 12-7. The City Road image as content and the Edtaonisl painting as style image for neural style transfer

The following snippet displays the generated (style-transferred) images as observed after the first, tenth, and twentieth iterations.

In [6]: fig = plt.figure(figsize=(20, 5))
   ...: ax1 = fig.add_subplot(1, 3, 1)
   ...: ax1.imshow(cr_iter1)
   ...: t1 = ax1.set_title('Iteration 1')
   ...: ax2 = fig.add_subplot(1, 3, 2)
   ...: ax2.imshow(cr_iter10)
   ...: t2 = ax2.set_title('Iteration 10')
   ...: ax3 = fig.add_subplot(1, 3, 3)
   ...: ax3.imshow(cr_iter20)
   ...: t3 = ax3.set_title('Iteration 20')
   ...: t = fig.suptitle('City Road Image after Style Transfer')

Figure 12-8. The City Road image style transfer at the first, tenth, and twentieth iterations

The results depicted in Figure 12-8 are striking. It is quite apparent how the generated image in the initial iterations resembles the structure of the content image and how, as the iterations progress, the style increasingly influences the texture, color, strokes, and so on.

■■Note The style used in our next example, depicted in Figure 12-9, is the famous painting The Great Wave by Katsushika Hokusai, completed in 1830–32. It is amazing to see the styles of such talented artists being transferred to everyday photographs.
More on this artwork is available at http://www.metmuseum.org/art/collection/search/36491.

Figure 12-9. The Italy Street image as content and the Wave Style painting as the style image for neural style transfer

We experimented with a few more sets of images, and the results were truly surprising and pleasant to look at. The output of neural style transfer for an image depicting an Italian street (see Figure 12-9) is shown in Figure 12-10 at different iterations.

Figure 12-10. The Italian street image style transfer at the first, tenth, and twentieth iterations

The results depicted in Figure 12-10 are definitely a pleasure to look at and give the feeling of an entire city underwater! We encourage you to try this same framework with images of your own. Also feel free to experiment with leveraging different convolution layers for the style and content feature representations, as mentioned in Gatys et al. and Johnson et al.

■■Note The concept and details of neural style transfer were introduced and explained by Gatys et al. and Johnson et al. in their respective papers, available at https://arxiv.org/abs/1508.06576 and https://arxiv.org/abs/1603.08155. You can also check out the book Deep Learning with Python by François Chollet, as well as Harish Narayanan's excellent blog, for a detailed step-by-step guide on neural style transfer: https://harishnarayanan.org/writing/artistic-style-transfer/.

Summary

This chapter presented topics from the very forefront of the Machine Learning landscape. Through this chapter we applied what we have learned about Machine Learning in general, and Deep Learning in particular, to the concepts of image classification, transfer learning, and style transfer. The chapter started off with a quick refresher on Convolutional Neural Networks and how their architectures are optimized to handle image-related data. We then worked towards developing image classifiers.
The first classifier was developed from scratch and, with the help of Keras, we were able to achieve decent results. The second classifier utilized a pre-trained VGG-19 deep CNN model as an image feature extractor; this pre-trained model based classifier helped us understand the concept of transfer learning and its benefits. The closing section of the chapter introduced the advanced topic of neural style transfer, the main highlight of this chapter. Style transfer is the process of applying the style of a reference image to a specific target image such that, in the process, the original content of the target image remains unchanged. This process leverages the potential of CNNs to understand image features at different granularities, along with transfer learning. Based on our understanding of these concepts and the research work by Gatys et al. and Johnson et al., we provided a step-by-step guide to implementing a neural style transfer system. We concluded the section by presenting some amazing results from the process of neural style transfer.

Deep Learning is opening new doors every day. Its application to different domains and problems is showcasing its potential to solve previously unsolved problems. Machine Learning is an ever-evolving and deeply involved field. Through this book, we traveled from the basics of Machine Learning frameworks and the Python ecosystem to different algorithms and concepts. We then covered multiple use cases across chapters, showcasing the different scenarios and ways in which a problem can be solved using the tools from the Machine Learning toolbox. The universe of Machine Learning is expanding at breakneck speed; our attempt here was to get you started on the right track on this wonderful journey.
