The Data Science Design Manual


User Manual:

Open the PDF directly: View PDF PDF.
Page Count: 453 [warning: Documents this large are best viewed by clicking the View PDF Link!]

Steven S. Skiena
The Data Science Design Manual
Steven S. Skiena
Computer Science Department
Stony Brook University
Stony Brook, NY
ISSN 1868-0941 ISSN 1868-095X (electronic)
Texts in Computer Science
ISBN 978-3-319-55443-3 ISBN 978-3-319-55444-0 (eBook)
DOI 10.1007/978-3-319-55444-0
Library of Congress Control Number: 2017943201
© The Author(s) 2017
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Making sense of the world around us requires obtaining and analyzing data from
our environment. Several technology trends have recently collided, providing
new opportunities to apply our data analysis savvy to greater challenges than
ever before.
Computer storage capacity has increased exponentially; indeed remembering
has become so cheap that it is almost impossible to get computer systems to for-
get. Sensing devices increasingly monitor everything that can be observed: video
streams, social media interactions, and the position of anything that moves.
Cloud computing enables us to harness the power of massive numbers of ma-
chines to manipulate this data. Indeed, hundreds of computers are summoned
each time you do a Google search, scrutinizing all of your previous activity just
to decide which is the best ad to show you next.
The result of all this has been the birth of data science, a new field devoted
to maximizing value from vast collections of information. As a discipline, data
science sits somewhere at the intersection of statistics, computer science, and
machine learning, but it is building a distinct heft and character of its own.
This book serves as an introduction to data science, focusing on the skills and
principles needed to build systems for collecting, analyzing, and interpreting
My professional experience as a researcher and instructor convinces me that
one major challenge of data science is that it is considerably more subtle than it
looks. Any student who has ever computed their grade point average (GPA) can
be said to have done rudimentary statistics, just as drawing a simple scatter plot
lets you add experience in data visualization to your resume. But meaningfully
analyzing and interpreting data requires both technical expertise and wisdom.
That so many people do these basics so badly provides my inspiration for writing
this book.
To the Reader
I have been gratified by the warm reception that my book The Algorithm Design
Manual [Ski08] has received since its initial publication in 1997. It has been
recognized as a unique guide to using algorithmic techniques to solve problems
that often arise in practice. The book you are holding covers very different
material, but with the same motivation.
In particular, here I stress the following basic principles as fundamental to
becoming a good data scientist:
Valuing doing the simple things right: Data science isn’t rocket science.
Students and practitioners often get lost in technological space, pursuing
the most advanced machine learning methods, the newest open source
software libraries, or the glitziest visualization techniques. However, the
heart of data science lies in doing the simple things right: understanding
the application domain, cleaning and integrating relevant data sources,
and presenting your results clearly to others.
Simple doesn’t mean easy, however. Indeed it takes considerable insight
and experience to ask the right questions, and sense whether you are mov-
ing toward correct answers and actionable insights. I resist the temptation
to drill deeply into clean, technical material here just because it is teach-
able. There are plenty of other books which will cover the intricacies of
machine learning algorithms or statistical hypothesis testing. My mission
here is to lay the groundwork of what really matters in analyzing data.
Developing mathematical intuition: Data science rests on a foundation of
mathematics, particularly statistics and linear algebra. It is important to
understand this material on an intuitive level: why these concepts were
developed, how they are useful, and when they work best. I illustrate
operations in linear algebra by presenting pictures of what happens to
matrices when you manipulate them, and statistical concepts by exam-
ples and reducto ad absurdum arguments. My goal here is transplanting
intuition into the reader.
But I strive to minimize the amount of formal mathematics used in pre-
senting this material. Indeed, I will present exactly one formal proof in
this book, an incorrect proof where the associated theorem is obviously
false. The moral here is not that mathematical rigor doesn’t matter, be-
cause of course it does, but that genuine rigor is impossible until after
there is comprehension.
Think like a computer scientist, but act like a statistician: Data science
provides an umbrella linking computer scientists, statisticians, and domain
specialists. But each community has its own distinct styles of thinking and
action, which gets stamped into the souls of its members.
In this book, I emphasize approaches which come most naturally to com-
puter scientists, particularly the algorithmic manipulation of data, the use
of machine learning, and the mastery of scale. But I also seek to transmit
the core values of statistical reasoning: the need to understand the appli-
cation domain, proper appreciation of the small, the quest for significance,
and a hunger for exploration.
No discipline has a monopoly on the truth. The best data scientists incor-
porate tools from multiple areas, and this book strives to be a relatively
neutral ground where rival philosophies can come to reason together.
Equally important is what you will not find in this book. I do not emphasize
any particular language or suite of data analysis tools. Instead, this book pro-
vides a high-level discussion of important design principles. I seek to operate at
a conceptual level more than a technical one. The goal of this manual is to get
you going in the right direction as quickly as possible, with whatever software
tools you find most accessible.
To the Instructor
This book covers enough material for an “Introduction to Data Science” course
at the undergraduate or early graduate student levels. I hope that the reader
has completed the equivalent of at least one programming course and has a bit
of prior exposure to probability and statistics, but more is always better than
I have made a full set of lecture slides for teaching this course available online
at Data resources for projects and assignments
are also available there to aid the instructor. Further, I make available online
video lectures using these slides to teach a full-semester data science course. Let
me help teach your class, through the magic of the web!
Pedagogical features of this book include:
War Stories: To provide a better perspective on how data science tech-
niques apply to the real world, I include a collection of “war stories,” or
tales from our experience with real problems. The moral of these stories is
that these methods are not just theory, but important tools to be pulled
out and used as needed.
False Starts: Most textbooks present methods as a fait accompli, ob-
scuring the ideas involved in designing them, and the subtle reasons why
other approaches fail. The war stories illustrate my reasoning process on
certain applied problems, but I weave such coverage into the core material
as well.
Take-Home Lessons: Highlighted “take-home” lesson boxes scattered
through each chapter emphasize the big-picture concepts to learn from
each chapter.
Homework Problems: I provide a wide range of exercises for home-
work and self-study. Many are traditional exam-style problems, but there
are also larger-scale implementation challenges and smaller-scale inter-
view questions, reflecting the questions students might encounter when
searching for a job. Degree of difficulty ratings have been assigned to all
In lieu of an answer key, a Solution Wiki has been set up, where solutions to
all even numbered problems will be solicited by crowdsourcing. A similar
system with my Algorithm Design Manual produced coherent solutions,
or so I am told. As a matter of principle I refuse to look at them, so let
the buyer beware.
Kaggle Challenges: Kaggle ( provides a forum for data
scientists to compete in, featuring challenging real-world problems on fas-
cinating data sets, and scoring to test how good your model is relative to
other submissions. The exercises for each chapter include three relevant
Kaggle challenges, to serve as a source of inspiration, self-study, and data
for other projects and investigations.
Data Science Television: Data science remains mysterious and even
threatening to the broader public. The Quant Shop is an amateur take
on what a data science reality show should be like. Student teams tackle
a diverse array of real-world prediction problems, and try to forecast the
outcome of future events. Check it out at
A series of eight 30-minute episodes has been prepared, each built around
a particular real-world prediction problem. Challenges include pricing art
at an auction, picking the winner of the Miss Universe competition, and
forecasting when celebrities are destined to die. For each, we observe as a
student team comes to grips with the problem, and learn along with them
as they build a forecasting model. They make their predictions, and we
watch along with them to see if they are right or wrong.
In this book, The Quant Shop is used to provide concrete examples of
prediction challenges, to frame discussions of the data science modeling
pipeline from data acquisition to evaluation. I hope you find them fun, and
that they will encourage you to conceive and take on your own modeling
Chapter Notes: Finally, each tutorial chapter concludes with a brief notes
section, pointing readers to primary sources and additional references.
My bright and loving daughters Bonnie and Abby are now full-blown teenagers,
meaning that they don’t always process statistical evidence with as much alacrity
as I would I desire. I dedicate this book to them, in the hope that their analysis
skills improve to the point that they always just agree with me.
And I dedicate this book to my beautiful wife Renee, who agrees with me
even when she doesn’t agree with me, and loves me beyond the support of all
creditable evidence.
My list of people to thank is large enough that I have probably missed some.
I will try to do enumerate them systematically to minimize omissions, but ask
those I’ve unfairly neglected for absolution.
First, I thank those who made concrete contributions to help me put this
book together. Yeseul Lee served as an apprentice on this project, helping with
figures, exercises, and more during summer 2016 and beyond. You will see
evidence of her handiwork on almost every page, and I greatly appreciate her
help and dedication. Aakriti Mittal and Jack Zheng also contributed to a few
of the figures.
Students in my Fall 2016 Introduction to Data Science course (CSE 519)
helped to debug the manuscript, and they found plenty of things to debug. I
particularly thank Rebecca Siford, who proposed over one hundred corrections
on her own. Several data science friends/sages reviewed specific chapters for
me, and I thank Anshul Gandhi, Yifan Hu, Klaus Mueller, Francesco Orabona,
Andy Schwartz, and Charles Ward for their efforts here.
I thank all the Quant Shop students from Fall 2015 whose video and mod-
eling efforts are so visibly on display. I particularly thank Jan (Dini) Diskin-
Zimmerman, whose editing efforts went so far beyond the call of duty I felt like
a felon for letting her do it.
My editors at Springer, Wayne Wheeler and Simon Rees, were a pleasure to
work with as usual. I also thank all the production and marketing people who
helped get this book to you, including Adrian Pieron and Annette Anlauf.
Several exercises were originated by colleagues or inspired by other sources.
Reconstructing the original sources years later can be challenging, but credits
for each problem (to the best of my recollection) appear on the website.
Much of what I know about data science has been learned through working
with other people. These include my Ph.D. students, particularly Rami al-Rfou,
Mikhail Bautin, Haochen Chen, Yanqing Chen, Vivek Kulkarni, Levon Lloyd,
Andrew Mehler, Bryan Perozzi, Yingtao Tian, Junting Ye, Wenbin Zhang, and
postdoc Charles Ward. I fondly remember all of my Lydia project masters
students over the years, and remind you that my prize offer to the first one who
names their daughter Lydia remains unclaimed. I thank my other collaborators
with stories to tell, including Bruce Futcher, Justin Gardin, Arnout van de Rijt,
and Oleksii Starov.
I remember all members of the General Sentiment/Canrock universe, partic-
ularly Mark Fasciano, with whom I shared the start-up dream and experienced
what happens when data hits the real world. I thank my colleagues at Yahoo
Labs/Research during my 2015–2016 sabbatical year, when much of this book
was conceived. I single out Amanda Stent, who enabled me to be at Yahoo
during that particularly difficult year in the company’s history. I learned valu-
able things from other people who have taught related data science courses,
including Andrew Ng and Hans-Peter Pfister, and thank them all for their help.
If you have a procedure with ten parameters, you probably missed
– Alan Perlis
It is traditional for the author to magnanimously accept the blame for whatever
deficiencies remain. I don’t. Any errors, deficiencies, or problems in this book
are somebody else’s fault, but I would appreciate knowing about them so as to
determine who is to blame.
Steven S. Skiena
Department of Computer Science
Stony Brook University
Stony Brook, NY 11794-2424
May 2017
1 What is Data Science? 1
1.1 Computer Science, Data Science, and Real Science . . . . . . . . 2
1.2 Asking Interesting Questions from Data . . . . . . . . . . . . . . 4
1.2.1 The Baseball Encyclopedia . . . . . . . . . . . . . . . . . 5
1.2.2 The Internet Movie Database (IMDb) . . . . . . . . . . . 7
1.2.3 GoogleNgrams........................ 10
1.2.4 New York Taxi Records . . . . . . . . . . . . . . . . . . . 11
1.3 PropertiesofData .......................... 14
1.3.1 Structured vs. Unstructured Data . . . . . . . . . . . . . 14
1.3.2 Quantitative vs. Categorical Data . . . . . . . . . . . . . 15
1.3.3 Big Data vs. Little Data . . . . . . . . . . . . . . . . . . . 15
1.4 Classification and Regression . . . . . . . . . . . . . . . . . . . . 16
1.5 Data Science Television: The Quant Shop . . . . . . . . . . . . . 17
1.5.1 Kaggle Challenges . . . . . . . . . . . . . . . . . . . . . . 19
1.6 About the War Stories . . . . . . . . . . . . . . . . . . . . . . . . 19
1.7 War Story: Answering the Right Question . . . . . . . . . . . . . 21
1.8 ChapterNotes ............................ 22
1.9 Exercises ............................... 23
2 Mathematical Preliminaries 27
2.1 Probability .............................. 27
2.1.1 Probability vs. Statistics . . . . . . . . . . . . . . . . . . . 29
2.1.2 Compound Events and Independence . . . . . . . . . . . . 30
2.1.3 Conditional Probability . . . . . . . . . . . . . . . . . . . 31
2.1.4 Probability Distributions . . . . . . . . . . . . . . . . . . 32
2.2 Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.1 Centrality Measures . . . . . . . . . . . . . . . . . . . . . 34
2.2.2 Variability Measures . . . . . . . . . . . . . . . . . . . . . 36
2.2.3 Interpreting Variance . . . . . . . . . . . . . . . . . . . . 37
2.2.4 Characterizing Distributions . . . . . . . . . . . . . . . . 39
2.3 Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3.1 Correlation Coefficients: Pearson and Spearman Rank . . 41
2.3.2 The Power and Significance of Correlation . . . . . . . . . 43
2.3.3 Correlation Does Not Imply Causation! . . . . . . . . . . 45
2.3.4 Detecting Periodicities by Autocorrelation . . . . . . . . . 46
2.4 Logarithms .............................. 47
2.4.1 Logarithms and Multiplying Probabilities . . . . . . . . . 48
2.4.2 Logarithms and Ratios . . . . . . . . . . . . . . . . . . . . 48
2.4.3 Logarithms and Normalizing Skewed Distributions . . . . 49
2.5 War Story: Fitting Designer Genes . . . . . . . . . . . . . . . . . 50
2.6 ChapterNotes ............................ 52
2.7 Exercises ............................... 53
3 Data Munging 57
3.1 Languages for Data Science . . . . . . . . . . . . . . . . . . . . . 57
3.1.1 The Importance of Notebook Environments . . . . . . . . 59
3.1.2 Standard Data Formats . . . . . . . . . . . . . . . . . . . 61
3.2 CollectingData............................ 64
3.2.1 Hunting............................ 64
3.2.2 Scraping............................ 67
3.2.3 Logging ............................ 68
3.3 CleaningData ............................ 69
3.3.1 Errors vs. Artifacts . . . . . . . . . . . . . . . . . . . . . 69
3.3.2 Data Compatibility . . . . . . . . . . . . . . . . . . . . . . 72
3.3.3 Dealing with Missing Values . . . . . . . . . . . . . . . . . 76
3.3.4 Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . 78
3.4 War Story: Beating the Market . . . . . . . . . . . . . . . . . . . 79
3.5 Crowdsourcing ............................ 80
3.5.1 The Penny Demo . . . . . . . . . . . . . . . . . . . . . . . 81
3.5.2 When is the Crowd Wise? . . . . . . . . . . . . . . . . . . 82
3.5.3 Mechanisms for Aggregation . . . . . . . . . . . . . . . . 83
3.5.4 Crowdsourcing Services . . . . . . . . . . . . . . . . . . . 84
3.5.5 Gamication ......................... 88
3.6 ChapterNotes ............................ 90
3.7 Exercises ............................... 90
4 Scores and Rankings 95
4.1 The Body Mass Index (BMI) . . . . . . . . . . . . . . . . . . . . 96
4.2 Developing Scoring Systems . . . . . . . . . . . . . . . . . . . . . 99
4.2.1 Gold Standards and Proxies . . . . . . . . . . . . . . . . . 99
4.2.2 Scores vs. Rankings . . . . . . . . . . . . . . . . . . . . . 100
4.2.3 Recognizing Good Scoring Functions . . . . . . . . . . . . 101
4.3 Z-scores and Normalization . . . . . . . . . . . . . . . . . . . . . 103
4.4 Advanced Ranking Techniques . . . . . . . . . . . . . . . . . . . 104
4.4.1 EloRankings ......................... 104
4.4.2 Merging Rankings . . . . . . . . . . . . . . . . . . . . . . 108
4.4.3 Digraph-based Rankings . . . . . . . . . . . . . . . . . . . 109
4.4.4 PageRank...........................111
4.5 War Story: Clyde’s Revenge . . . . . . . . . . . . . . . . . . . . . 111
4.6 Arrow’s Impossibility Theorem . . . . . . . . . . . . . . . . . . . 114
4.7 War Story: Who’s Bigger? . . . . . . . . . . . . . . . . . . . . . . 115
4.8 ChapterNotes ............................118
4.9 Exercises ............................... 119
5 Statistical Analysis 121
5.1 Statistical Distributions . . . . . . . . . . . . . . . . . . . . . . . 122
5.1.1 The Binomial Distribution . . . . . . . . . . . . . . . . . . 123
5.1.2 The Normal Distribution . . . . . . . . . . . . . . . . . . 124
5.1.3 Implications of the Normal Distribution . . . . . . . . . . 126
5.1.4 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . 127
5.1.5 Power Law Distributions . . . . . . . . . . . . . . . . . . . 129
5.2 Sampling from Distributions . . . . . . . . . . . . . . . . . . . . . 132
5.2.1 Random Sampling beyond One Dimension . . . . . . . . . 133
5.3 Statistical Significance . . . . . . . . . . . . . . . . . . . . . . . . 135
5.3.1 The Significance of Significance . . . . . . . . . . . . . . . 135
5.3.2 The T-test: Comparing Population Means . . . . . . . . . 137
5.3.3 The Kolmogorov-Smirnov Test . . . . . . . . . . . . . . . 139
5.3.4 The Bonferroni Correction . . . . . . . . . . . . . . . . . . 141
5.3.5 False Discovery Rate . . . . . . . . . . . . . . . . . . . . . 142
5.4 War Story: Discovering the Fountain of Youth? . . . . . . . . . . 143
5.5 Permutation Tests and P-values . . . . . . . . . . . . . . . . . . . 145
5.5.1 Generating Random Permutations . . . . . . . . . . . . . 147
5.5.2 DiMaggio’s Hitting Streak . . . . . . . . . . . . . . . . . . 148
5.6 Bayesian Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.7 ChapterNotes ............................151
5.8 Exercises ............................... 151
6 Visualizing Data 155
6.1 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . . 156
6.1.1 Confronting a New Data Set . . . . . . . . . . . . . . . . 156
6.1.2 Summary Statistics and Anscombe’s Quartet . . . . . . . 159
6.1.3 Visualization Tools . . . . . . . . . . . . . . . . . . . . . . 160
6.2 Developing a Visualization Aesthetic . . . . . . . . . . . . . . . . 162
6.2.1 Maximizing Data-Ink Ratio . . . . . . . . . . . . . . . . . 163
6.2.2 Minimizing the Lie Factor . . . . . . . . . . . . . . . . . . 164
6.2.3 Minimizing Chartjunk . . . . . . . . . . . . . . . . . . . . 165
6.2.4 Proper Scaling and Labeling . . . . . . . . . . . . . . . . 167
6.2.5 Effective Use of Color and Shading . . . . . . . . . . . . . 168
6.2.6 The Power of Repetition . . . . . . . . . . . . . . . . . . . 169
6.3 ChartTypes ............................. 170
6.3.1 TabularData.........................170
6.3.2 Dot and Line Plots . . . . . . . . . . . . . . . . . . . . . . 174
6.3.3 ScatterPlots .........................177
6.3.4 Bar Plots and Pie Charts . . . . . . . . . . . . . . . . . . 179
6.3.5 Histograms ..........................183
6.3.6 DataMaps ..........................187
6.4 Great Visualizations . . . . . . . . . . . . . . . . . . . . . . . . . 189
6.4.1 Marey’s Train Schedule . . . . . . . . . . . . . . . . . . . 189
6.4.2 Snow’s Cholera Map . . . . . . . . . . . . . . . . . . . . . 191
6.4.3 New York’s Weather Year . . . . . . . . . . . . . . . . . . 192
6.5 ReadingGraphs............................192
6.5.1 The Obscured Distribution . . . . . . . . . . . . . . . . . 193
6.5.2 Overinterpreting Variance . . . . . . . . . . . . . . . . . . 193
6.6 Interactive Visualization . . . . . . . . . . . . . . . . . . . . . . . 195
6.7 War Story: TextMapping the World . . . . . . . . . . . . . . . . 196
6.8 ChapterNotes ............................198
6.9 Exercises ............................... 199
7 Mathematical Models 201
7.1 Philosophies of Modeling . . . . . . . . . . . . . . . . . . . . . . . 201
7.1.1 Occam’s Razor . . . . . . . . . . . . . . . . . . . . . . . . 201
7.1.2 Bias–Variance Trade-Offs . . . . . . . . . . . . . . . . . . 202
7.1.3 What Would Nate Silver Do? . . . . . . . . . . . . . . . . 203
7.2 A Taxonomy of Models . . . . . . . . . . . . . . . . . . . . . . . 205
7.2.1 Linear vs. Non-Linear Models . . . . . . . . . . . . . . . . 206
7.2.2 Blackbox vs. Descriptive Models . . . . . . . . . . . . . . 206
7.2.3 First-Principle vs. Data-Driven Models . . . . . . . . . . . 207
7.2.4 Stochastic vs. Deterministic Models . . . . . . . . . . . . 208
7.2.5 Flat vs. Hierarchical Models . . . . . . . . . . . . . . . . . 209
7.3 BaselineModels............................210
7.3.1 Baseline Models for Classification . . . . . . . . . . . . . . 210
7.3.2 Baseline Models for Value Prediction . . . . . . . . . . . . 212
7.4 Evaluating Models . . . . . . . . . . . . . . . . . . . . . . . . . . 212
7.4.1 Evaluating Classifiers . . . . . . . . . . . . . . . . . . . . 213
7.4.2 Receiver-Operator Characteristic (ROC) Curves . . . . . 218
7.4.3 Evaluating Multiclass Systems . . . . . . . . . . . . . . . 219
7.4.4 Evaluating Value Prediction Models . . . . . . . . . . . . 221
7.5 Evaluation Environments . . . . . . . . . . . . . . . . . . . . . . 224
7.5.1 Data Hygiene for Evaluation . . . . . . . . . . . . . . . . 225
7.5.2 Amplifying Small Evaluation Sets . . . . . . . . . . . . . 226
7.6 War Story: 100% Accuracy . . . . . . . . . . . . . . . . . . . . . 228
7.7 Simulation Models . . . . . . . . . . . . . . . . . . . . . . . . . . 229
7.8 War Story: Calculated Bets . . . . . . . . . . . . . . . . . . . . . 230
7.9 ChapterNotes ............................233
7.10Exercises ............................... 234
8 Linear Algebra 237
8.1 The Power of Linear Algebra . . . . . . . . . . . . . . . . . . . . 237
8.1.1 Interpreting Linear Algebraic Formulae . . . . . . . . . . 238
8.1.2 Geometry and Vectors . . . . . . . . . . . . . . . . . . . . 240
8.2 Visualizing Matrix Operations . . . . . . . . . . . . . . . . . . . . 241
8.2.1 Matrix Addition . . . . . . . . . . . . . . . . . . . . . . . 242
8.2.2 Matrix Multiplication . . . . . . . . . . . . . . . . . . . . 243
8.2.3 Applications of Matrix Multiplication . . . . . . . . . . . 244
8.2.4 Identity Matrices and Inversion . . . . . . . . . . . . . . . 248
8.2.5 Matrix Inversion and Linear Systems . . . . . . . . . . . . 250
8.2.6 MatrixRank .........................251
8.3 Factoring Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 252
8.3.1 Why Factor Feature Matrices? . . . . . . . . . . . . . . . 252
8.3.2 LU Decomposition and Determinants . . . . . . . . . . . 254
8.4 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . 255
8.4.1 Properties of Eigenvalues . . . . . . . . . . . . . . . . . . 255
8.4.2 Computing Eigenvalues . . . . . . . . . . . . . . . . . . . 256
8.5 Eigenvalue Decomposition . . . . . . . . . . . . . . . . . . . . . . 257
8.5.1 Singular Value Decomposition . . . . . . . . . . . . . . . . 258
8.5.2 Principal Components Analysis . . . . . . . . . . . . . . . 260
8.6 War Story: The Human Factors . . . . . . . . . . . . . . . . . . . 262
8.7 ChapterNotes ............................263
8.8 Exercises ............................... 263
9 Linear and Logistic Regression 267
9.1 LinearRegression........................... 268
9.1.1 Linear Regression and Duality . . . . . . . . . . . . . . . 268
9.1.2 Error in Linear Regression . . . . . . . . . . . . . . . . . . 269
9.1.3 Finding the Optimal Fit . . . . . . . . . . . . . . . . . . . 270
9.2 Better Regression Models . . . . . . . . . . . . . . . . . . . . . . 272
9.2.1 Removing Outliers . . . . . . . . . . . . . . . . . . . . . . 272
9.2.2 Fitting Non-Linear Functions . . . . . . . . . . . . . . . . 273
9.2.3 Feature and Target Scaling . . . . . . . . . . . . . . . . . 274
9.2.4 Dealing with Highly-Correlated Features . . . . . . . . . . 277
9.3 War Story: Taxi Deriver . . . . . . . . . . . . . . . . . . . . . . . 277
9.4 Regression as Parameter Fitting . . . . . . . . . . . . . . . . . . 279
9.4.1 Convex Parameter Spaces . . . . . . . . . . . . . . . . . . 280
9.4.2 Gradient Descent Search . . . . . . . . . . . . . . . . . . . 281
9.4.3 What is the Right Learning Rate? . . . . . . . . . . . . . 283
9.4.4 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . 285
9.5 Simplifying Models through Regularization . . . . . . . . . . . . 286
9.5.1 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . 286
9.5.2 LASSO Regression . . . . . . . . . . . . . . . . . . . . . . 287
9.5.3 Trade-Offs between Fit and Complexity . . . . . . . . . . 288
9.6 Classification and Logistic Regression . . . . . . . . . . . . . . . 289
9.6.1 Regression for Classification . . . . . . . . . . . . . . . . . 290
9.6.2 Decision Boundaries . . . . . . . . . . . . . . . . . . . . . 291
9.6.3 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . 292
9.7 Issues in Logistic Classification . . . . . . . . . . . . . . . . . . . 295
9.7.1 Balanced Training Classes . . . . . . . . . . . . . . . . . . 295
9.7.2 Multi-Class Classification . . . . . . . . . . . . . . . . . . 297
9.7.3 Hierarchical Classification . . . . . . . . . . . . . . . . . . 298
9.7.4 Partition Functions and Multinomial Regression . . . . . 299
9.8 ChapterNotes ............................300
9.9 Exercises ............................... 301
10 Distance and Network Methods 303
10.1 Measuring Distances . . . . . . . . . . . . . . . . . . . . . . . . . 303
10.1.1 Distance Metrics . . . . . . . . . . . . . . . . . . . . . . . 304
10.1.2 The LkDistance Metric . . . . . . . . . . . . . . . . . . . 305
10.1.3 Working in Higher Dimensions . . . . . . . . . . . . . . . 307
10.1.4 Dimensional Egalitarianism . . . . . . . . . . . . . . . . . 308
10.1.5 Points vs. Vectors . . . . . . . . . . . . . . . . . . . . . . 309
10.1.6 Distances between Probability Distributions . . . . . . . . 310
10.2 Nearest Neighbor Classification . . . . . . . . . . . . . . . . . . . 311
10.2.1 Seeking Good Analogies . . . . . . . . . . . . . . . . . . . 312
10.2.2 k-Nearest Neighbors . . . . . . . . . . . . . . . . . . . . . 313
10.2.3 Finding Nearest Neighbors . . . . . . . . . . . . . . . . . 315
10.2.4 Locality Sensitive Hashing . . . . . . . . . . . . . . . . . . 317
10.3 Graphs, Networks, and Distances . . . . . . . . . . . . . . . . . . 319
10.3.1 Weighted Graphs and Induced Networks . . . . . . . . . . 320
10.3.2 Talking About Graphs . . . . . . . . . . . . . . . . . . . . 321
10.3.3 Graph Theory . . . . . . . . . . . . . . . . . . . . . . . . 323
10.5Clustering............................... 327
10.5.1 k-means Clustering . . . . . . . . . . . . . . . . . . . . . . 330
10.5.2 Agglomerative Clustering . . . . . . . . . . . . . . . . . . 336
10.5.3 Comparing Clusterings . . . . . . . . . . . . . . . . . . . . 341
10.5.4 Similarity Graphs and Cut-Based Clustering . . . . . . . 341
10.6 War Story: Cluster Bombing . . . . . . . . . . . . . . . . . . . . 344
10.7ChapterNotes ............................ 345
10.8Exercises ............................... 346
11 Machine Learning 351
11.1.1 Formulation..........................354
11.1.2 Dealing with Zero Counts (Discounting) . . . . . . . . . . 356
11.2 Decision Tree Classifiers . . . . . . . . . . . . . . . . . . . . . . . 357
11.2.1 Constructing Decision Trees . . . . . . . . . . . . . . . . . 359
11.2.2 Realizing Exclusive Or . . . . . . . . . . . . . . . . . . . . 361
11.2.3 Ensembles of Decision Trees . . . . . . . . . . . . . . . . . 362
11.3 Boosting and Ensemble Learning . . . . . . . . . . . . . . . . . . 363
11.3.1 Voting with Classifiers . . . . . . . . . . . . . . . . . . . . 363
11.3.2 Boosting Algorithms . . . . . . . . . . . . . . . . . . . . . 364
11.4 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . 366
11.4.1 LinearSVMs .........................369
11.4.2 Non-linear SVMs . . . . . . . . . . . . . . . . . . . . . . . 369
11.4.3 Kernels ............................371
11.5 Degrees of Supervision . . . . . . . . . . . . . . . . . . . . . . . . 372
11.5.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . 372
11.5.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . 372
11.5.3 Semi-supervised Learning . . . . . . . . . . . . . . . . . . 374
11.5.4 Feature Engineering . . . . . . . . . . . . . . . . . . . . . 375
11.6DeepLearning ............................ 377
11.6.1 Networks and Depth . . . . . . . . . . . . . . . . . . . . . 378
11.6.2 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . 382
11.6.3 Word and Graph Embeddings . . . . . . . . . . . . . . . . 383
11.7 War Story: The Name Game . . . . . . . . . . . . . . . . . . . . 385
11.8ChapterNotes ............................ 387
11.9Exercises ............................... 388
12 Big Data: Achieving Scale 391
12.1WhatisBigData? .......................... 392
12.1.1 Big Data as Bad Data . . . . . . . . . . . . . . . . . . . . 392
12.1.2 TheThreeVs......................... 394
12.2 War Story: Infrastructure Matters . . . . . . . . . . . . . . . . . 395
12.3 Algorithmics for Big Data . . . . . . . . . . . . . . . . . . . . . . 397
12.3.1 Big Oh Analysis . . . . . . . . . . . . . . . . . . . . . . . 397
12.3.2 Hashing............................399
12.3.3 Exploiting the Storage Hierarchy . . . . . . . . . . . . . . 401
12.3.4 Streaming and Single-Pass Algorithms . . . . . . . . . . . 402
12.4 Filtering and Sampling . . . . . . . . . . . . . . . . . . . . . . . . 403
12.4.1 Deterministic Sampling Algorithms . . . . . . . . . . . . . 404
12.4.2 Randomized and Stream Sampling . . . . . . . . . . . . . 406
12.5Parallelism .............................. 406
12.5.1 One, Two, Many . . . . . . . . . . . . . . . . . . . . . . . 407
12.5.2 Data Parallelism . . . . . . . . . . . . . . . . . . . . . . . 409
12.5.3 GridSearch..........................409
12.5.4 Cloud Computing Services . . . . . . . . . . . . . . . . . . 410
12.6MapReduce .............................. 410
12.6.1 Map-Reduce Programming . . . . . . . . . . . . . . . . . 412
12.6.2 MapReduce under the Hood . . . . . . . . . . . . . . . . . 414
12.7 Societal and Ethical Implications . . . . . . . . . . . . . . . . . . 416
12.8ChapterNotes ............................ 419
12.9Exercises ............................... 419
13 Coda 423
13.2 Go to Graduate School! . . . . . . . . . . . . . . . . . . . . . . . 424
13.3 Professional Consulting Services . . . . . . . . . . . . . . . . . . 425
14 Bibliography 427
Chapter 1
What is Data Science?
The purpose of computing is insight, not numbers.
– Richard W. Hamming
What is data science? Like any emerging field, it hasn’t been completely defined
yet, but you know enough about it to be interested or else you wouldn’t be
reading this book.
I think of data science as lying at the intersection of computer science, statis-
tics, and substantive application domains. From computer science comes ma-
chine learning and high-performance computing technologies for dealing with
scale. From statistics comes a long tradition of exploratory data analysis, sig-
nificance testing, and visualization. From application domains in business and
the sciences comes challenges worthy of battle, and evaluation standards to
assess when they have been adequately conquered.
But these are all well-established fields. Why data science, and why now? I
see three reasons for this sudden burst of activity:
New technology makes it possible to capture, annotate, and store vast
amounts of social media, logging, and sensor data. After you have amassed
all this data, you begin to wonder what you can do with it.
Computing advances make it possible to analyze data in novel ways and at
ever increasing scales. Cloud computing architectures give even the little
guy access to vast power when they need it. New approaches to machine
learning have lead to amazing advances in longstanding problems, like
computer vision and natural language processing.
Prominent technology companies (like Google and Facebook) and quan-
titative hedge funds (like Renaissance Technologies and TwoSigma) have
proven the power of modern data analytics. Success stories applying data
to such diverse areas as sports management (Moneyball [Lew04]) and elec-
tion forecasting (Nate Silver [Sil12]) have served as role models to bring
data science to a large popular audience.
This introductory chapter has three missions. First, I will try to explain how
good data scientists think, and how this differs from the mindset of traditional
programmers and software developers. Second, we will look at data sets in terms
of the potential for what they can be used for, and learn to ask the broader
questions they are capable of answering. Finally, I introduce a collection of
data analysis challenges that will be used throughout this book as motivating
1.1 Computer Science, Data Science, and Real
Computer scientists, by nature, don’t respect data. They have traditionally
been taught that the algorithm was the thing, and that data was just meat to
be passed through a sausage grinder.
So to qualify as an effective data scientist, you must first learn to think like
a real scientist. Real scientists strive to understand the natural world, which
is a complicated and messy place. By contrast, computer scientists tend to
build their own clean and organized virtual worlds and live comfortably within
them. Scientists obsess about discovering things, while computer scientists in-
vent rather than discover.
People’s mindsets strongly color how they think and act, causing misunder-
standings when we try to communicate outside our tribes. So fundamental are
these biases that we are often unaware we have them. Examples of the cultural
differences between computer science and real science include:
Data vs. method centrism: Scientists are data driven, while computer
scientists are algorithm driven. Real scientists spend enormous amounts
of effort collecting data to answer their question of interest. They invent
fancy measuring devices, stay up all night tending to experiments, and
devote most of their thinking to how to get the data they need.
By contrast, computer scientists obsess about methods: which algorithm
is better than which other algorithm, which programming language is best
for a job, which program is better than which other program. The details
of the data set they are working on seem comparably unexciting.
Concern about results: Real scientists care about answers. They analyze
data to discover something about how the world works. Good scientists
care about whether the results make sense, because they care about what
the answers mean.
By contrast, bad computer scientists worry about producing plausible-
looking numbers. As soon as the numbers stop looking grossly wrong,
they are presumed to be right. This is because they are personally less
invested in what can be learned from a computation, as opposed to getting
it done quickly and efficiently.
Robustness: Real scientists are comfortable with the idea that data has
errors. In general, computer scientists are not. Scientists think a lot about
possible sources of bias or error in their data, and how these possible prob-
lems can effect the conclusions derived from them. Good programmers use
strong data-typing and parsing methodologies to guard against formatting
errors, but the concerns here are different.
Becoming aware that data can have errors is empowering. Computer
scientists chant “garbage in, garbage out” as a defensive mantra to ward
off criticism, a way to say that’s not my job. Real scientists get close
enough to their data to smell it, giving it the sniff test to decide whether
it is likely to be garbage.
Precision: Nothing is ever completely true or false in science, while every-
thing is either true or false in computer science or mathematics.
Generally speaking, computer scientists are happy printing floating point
numbers to as many digits as possible: 8/13 = 0.61538461538. Real
scientists will use only two significant digits: 8/13 0.62. Computer
scientists care what a number is, while real scientists care what it means.
Aspiring data scientists must learn to think like real scientists. Your job is
going to be to turn numbers into insight. It is important to understand the why
as much as the how.
To be fair, it benefits real scientists to think like data scientists as well. New
experimental technologies enable measuring systems on vastly greater scale than
ever possible before, through technologies like full-genome sequencing in biology
and full-sky telescope surveys in astronomy. With new breadth of view comes
new levels of vision.
Traditional hypothesis-driven science was based on asking specific questions
of the world and then generating the specific data needed to confirm or deny
it. This is now augmented by data-driven science, which instead focuses on
generating data on a previously unheard of scale or resolution, in the belief that
new discoveries will come as soon as one is able to look at it. Both ways of
thinking will be important to us:
Given a problem, what available data will help us answer it?
Given a data set, what interesting problems can we apply it to?
There is another way to capture this basic distinction between software en-
gineering and data science. It is that software developers are hired to build
systems, while data scientists are hired to produce insights.
This may be a point of contention for some developers. There exist an
important class of engineers who wrangle the massive distributed infrastructures
necessary to store and analyze, say, financial transaction or social media data
on a full Facebook or Twitter-level of scale. Indeed, I will devote Chapter 12
to the distinctive challenges of big data infrastructures. These engineers are
building tools and systems to support data science, even though they may not
personally mine the data they wrangle. Do they qualify as data scientists?
This is a fair question, one I will finesse a bit so as to maximize the poten-
tial readership of this book. But I do believe that the better such engineers
understand the full data analysis pipeline, the more likely they will be able to
build powerful tools capable of providing important insights. A major goal of
this book is providing big data engineers with the intellectual tools to think like
big data scientists.
1.2 Asking Interesting Questions from Data
Good data scientists develop an inherent curiosity about the world around them,
particularly in the associated domains and applications they are working on.
They enjoy talking shop with the people whose data they work with. They ask
them questions: What is the coolest thing you have learned about this field?
Why did you get interested in it? What do you hope to learn by analyzing your
data set? Data scientists always ask questions.
Good data scientists have wide-ranging interests. They read the newspaper
every day to get a broader perspective on what is exciting. They understand that
the world is an interesting place. Knowing a little something about everything
equips them to play in other people’s backyards. They are brave enough to get
out of their comfort zones a bit, and driven to learn more once they get there.
Software developers are not really encouraged to ask questions, but data
scientists are. We ask questions like:
What things might you be able to learn from a given data set?
What do you/your people really want to know about the world?
What will it mean to you once you find out?
Computer scientists traditionally do not really appreciate data. Think about
the way algorithm performance is experimentally measured. Usually the pro-
gram is run on “random data” to see how long it takes. They rarely even look
at the results of the computation, except to verify that it is correct and efficient.
Since the “data” is meaningless, the results cannot be important. In contrast,
real data sets are a scarce resource, which required hard work and imagination
to obtain.
Becoming a data scientist requires learning to ask questions about data, so
let’s practice. Each of the subsections below will introduce an interesting data
set. After you understand what kind of information is available, try to come
up with, say, five interesting questions you might explore/answer with access to
this data set.
Figure 1.1: Statistical information on the performance of Babe Ruth can be
found at
The key is thinking broadly: the answers to big, general questions often lie
buried in highly-specific data sets, which were by no means designed to contain
1.2.1 The Baseball Encyclopedia
Baseball has long had an outsized importance in the world of data science. This
sport has been called the national pastime of the United States; indeed, French
historian Jacques Barzun observed that “Whoever wants to know the heart and
mind of America had better learn baseball.” I realize that many readers are not
American, and even those that are might be completely disinterested in sports.
But stick with me for a while.
What makes baseball important to data science is its extensive statistical
record of play, dating back for well over a hundred years. Baseball is a sport of
discrete events: pitchers throw balls and batters try to hit them – that naturally
lends itself to informative statistics. Fans get immersed in these statistics as chil-
dren, building their intuition about the strengths and limitations of quantitative
analysis. Some of these children grow up to become data scientists. Indeed, the
success of Brad Pitt’s statistically-minded baseball team in the movie Moneyball
remains the American public’s most vivid contact with data science.
This historical baseball record is available at
com. There you will find complete statistical data on the performance of every
player who even stepped on the field. This includes summary statistics of each
season’s batting, pitching, and fielding record, plus information about teams
Figure 1.2: Personal information on every major league baseball player is avail-
able at
and awards as shown in Figure 1.1.
But more than just statistics, there is metadata on the life and careers of all
the people who have ever played major league baseball, as shown in Figure 1.2.
We get the vital statistics of each player (height, weight, handedness) and their
lifespan (when/where they were born and died). We also get salary information
(how much each player got paid every season) and transaction data (how did
they get to be the property of each team they played for).
Now, I realize that many of you do not have the slightest knowledge of or
interest in baseball. This sport is somewhat reminiscent of cricket, if that helps.
But remember that as a data scientist, it is your job to be interested in the
world around you. Think of this as chance to learn something.
So what interesting questions can you answer with this baseball data set?
Try to write down five questions before moving on. Don’t worry, I will wait here
for you to finish.
The most obvious types of questions to answer with this data are directly
related to baseball:
How can we best measure an individual player’s skill or value?
How fairly do trades between teams generally work out?
What is the general trajectory of player’s performance level as they mature
and age?
To what extent does batting performance correlate with position played?
For example, are outfielders really better hitters than infielders?
These are interesting questions. But even more interesting are questions
about demographic and social issues. Almost 20,000 major league baseball play-
ers have taken the field over the past 150 years, providing a large, extensively-
documented cohort of men who can serve as a proxy for even larger, less well-
documented populations. Indeed, we can use this baseball player data to answer
questions like:
Do left-handed people have shorter lifespans than right-handers? Handed-
ness is not captured in most demographic data sets, but has been diligently
assembled here. Indeed, analysis of this data set has been used to show
that right-handed people live longer than lefties [HC88]!
How often do people return to live in the same place where they were
born? Locations of birth and death have been extensively recorded in this
data set. Further, almost all of these people played at least part of their
career far from home, thus exposing them to the wider world at a critical
time in their youth.
Do player salaries generally reflect past, present, or future performance?
To what extent have heights and weights been increasing in the population
at large?
There are two particular themes to be aware of here. First, the identifiers
and reference tags (i.e. the metadata) often prove more interesting in a data set
than the stuff we are supposed to care about, here the statistical record of play.
Second is the idea of a statistical proxy, where you use the data set you have
to substitute for the one you really want. The data set of your dreams likely
does not exist, or may be locked away behind a corporate wall even if it does.
A good data scientist is a pragmatist, seeing what they can do with what they
have instead of bemoaning what they cannot get their hands on.
1.2.2 The Internet Movie Database (IMDb)
Everybody loves the movies. The Internet Movie Database (IMDb) provides
crowdsourced and curated data about all aspects of the motion picture industry,
at IMDb currently contains data on over 3.3 million movies and
TV programs. For each film, IMDb includes its title, running time, genres, date
of release, and a full list of cast and crew. There is financial data about each
production, including the budget for making the film and how well it did at the
box office.
Finally, there are extensive ratings for each film from viewers and critics.
This rating data consists of scores on a zero to ten stars scale, cross-tabulated
into averages by age and gender. Written reviews are often included, explaining
why a particular critic awarded a given number of stars. There are also links
between films: for example, identifying which other films have been watched
most often by viewers of It’s a Wonderful Life.
Every actor, director, producer, and crew member associated with a film
merits an entry in IMDb, which now contains records on 6.5 million people.
Figure 1.3: Representative film data from the Internet Movie Database.
Figure 1.4: Representative actor data from the Internet Movie Database.
These happen to include my brother, cousin, and sister-in-law. Each actor
is linked to every film they appeared in, with a description of their role and
their ordering in the credits. Available data about each personality includes
birth/death dates, height, awards, and family relations.
So what kind of questions can you answer with this movie data?
Perhaps the most natural questions to ask IMDb involve identifying the
extremes of movies and actors:
Which actors appeared in the most films? Earned the most money? Ap-
peared in the lowest rated films? Had the longest career or the shortest
What was the highest rated film each year, or the best in each genre?
Which movies lost the most money, had the highest-powered casts, or got
the least favorable reviews.
Then there are larger-scale questions one can ask about the nature of the
motion picture business itself:
How well does movie gross correlate with viewer ratings or awards? Do
customers instinctively flock to trash, or is virtue on the part of the cre-
ative team properly rewarded?
How do Hollywood movies compare to Bollywood movies, in terms of rat-
ings, budget, and gross? Are American movies better received than foreign
films, and how does this differ between U.S. and non-U.S. reviewers?
What is the age distribution of actors and actresses in films? How much
younger is the actress playing the wife, on average, than the actor playing
the husband? Has this disparity been increasing or decreasing with time?
Live fast, die young, and leave a good-looking corpse? Do movie stars live
longer or shorter lives than bit players, or compared to the general public?
Assuming that people working together on a film get to know each other,
the cast and crew data can be used to build a social network of the movie
business. What does the social network of actors look like? The Oracle of
Bacon ( posits Kevin Bacon as the center of
the Hollywood universe and generates the shortest path to Bacon from any
other actor. Other actors, like Samuel L. Jackson, prove even more central.
More critically, can we analyze this data to determine the probability that
someone will like a given movie? The technique of collaborative filtering finds
people who liked films that I also liked, and recommends other films that they
liked as good candidates for me. The 2007 Netflix Prize was a $1,000,000 com-
petition to produce a ratings engine 10% better than the proprietary Netflix
system. The ultimate winner of this prize (BellKor) used a variety of data
sources and techniques, including the analysis of links [BK07].
Figure 1.5: The rise and fall of data processing, as witnessed by Google Ngrams.
1.2.3 Google Ngrams
Printed books have been the primary repository of human knowledge since
Gutenberg’s invention of movable type in 1439. Physical objects live somewhat
uneasily in today’s digital world, but technology has a way of reducing every-
thing to data. As part of its mission to organize the world’s information, Google
undertook an effort to scan all of the world’s published books. They haven’t
quite gotten there yet, but the 30 million books thus far digitized represent over
20% of all books ever published.
Google uses this data to improve search results, and provide fresh access
to out-of-print books. But perhaps the coolest product is Google Ngrams, an
amazing resource for monitoring changes in the cultural zeitgeist. It provides
the frequency with which short phrases occur in books published each year.
Each phrase must occur at least forty times in their scanned book corpus. This
eliminates obscure words and phrases, but leaves over two billion time series
available for analysis.
This rich data set shows how language use has changed over the past 200
years, and has been widely applied to cultural trend analysis [MAV+11]. Figure
1.5 uses this data to show how the word data fell out of favor when thinking
about computing. Data processing was the popular term associated with the
computing field during the punched card and spinning magnetic tape era of the
1950s. The Ngrams data shows that the rapid rise of Computer Science did not
eclipse Data Processing until 1980. Even today, Data Science remains almost
invisible on this scale.
Check out Google Ngrams at I promise
you will enjoy playing with it. Compare hot dog to tofu,science against religion,
freedom to justice, and sex vs. marriage, to better understand this fantastic
telescope for looking into the past.
But once you are done playing, think of bigger things you could do if you
got your hands on this data. Assume you have access to the annual number
of references for all words/phrases published in books over the past 200 years.
Google makes this data freely available. So what are you going to do with it?
Observing the time series associated with particular words using the Ngrams
Viewer is fun. But more sophisticated historical trends can be captured by
aggregating multiple time series together. The following types of questions
seem particularly interesting to me:
How has the amount of cursing changed over time? Use of the four-
letter words I am most familiar with seem to have exploded since 1960,
although it is perhaps less clear whether this reflects increased cussing or
lower publication standards.
How often do new words emerge and get popular? Do these words tend
to stay in common usage, or rapidly fade away? Can we detect when
words change meaning over time, like the transition of gay from happy to
Have standards of spelling been improving or deteriorating with time,
especially now that we have entered the era of automated spell check-
ing? Rarely-occurring words that are only one character removed from a
commonly-used word are likely candidates to be spelling errors (e.g. al-
gorithm vs. algorthm). Aggregated over many different misspellings, are
such errors increasing or decreasing?
You can also use this Ngrams corpus to build a language model that captures
the meaning and usage of the words in a given language. We will discuss word
embeddings in Section 11.6.3, which are powerful tools for building language
models. Frequency counts reveal which words are most popular. The frequency
of word pairs appearing next to each other can be used to improve speech
recognition systems, helping to distinguish whether the speaker said that’s too
bad or that’s to bad. These millions of books provide an ample data set to build
representative models from.
1.2.4 New York Taxi Records
Every financial transaction today leaves a data trail behind it. Following these
paths can lead to interesting insights.
Taxi cabs form an important part of the urban transportation network. They
roam the streets of the city looking for customers, and then drive them to their
destination for a fare proportional to the length of the trip. Each cab contains
a metering device to calculate the cost of the trip as a function of time. This
meter serves as a record keeping device, and a mechanism to ensure that the
driver charges the proper amount for each trip.
The taxi meters currently employed in New York cabs can do many things
beyond calculating fares. They act as credit card terminals, providing a way
Figure 1.6: Representative fields from the New York city taxi cab data: pick up
and dropoff points, distances, and fares.
for customers to pay for rides without cash. They are integrated with global
positioning systems (GPS), recording the exact location of every pickup and
drop off. And finally, since they are on a wireless network, these boxes can
communicate all of this data back to a central server.
The result is a database documenting every single trip by all taxi cabs in
one of the world’s greatest cities, a small portion of which is shown in Figure
1.6. Because the New York Taxi and Limousine Commission is a public agency,
its non-confidential data is available to all under the Freedom of Information
Act (FOA).
Every ride generates two records: one with data on the trip, the other with
details of the fare. Each trip is keyed to the medallion (license) of each car
coupled with the identifier of each driver. For each trip, we get the time/date
of pickup and drop-off, as well as the GPS coordinates (longitude and latitude)
of the starting location and destination. We do not get GPS data of the route
they traveled between these points, but to some extent that can be inferred by
the shortest path between them.
As for fare data, we get the metered cost of each trip, including tax, surcharge
and tolls. It is traditional to pay the driver a tip for service, the amount of which
is also recorded in the data.
So I’m talking to you. This taxi data is readily available, with records of
over 80 million trips over the past several years. What are you going to do with
Any interesting data set can be used to answer questions on many different
scales. This taxi fare data can help us better understand the transportation
industry, but also how the city works and how we could make it work even
better. Natural questions with respect to the taxi industry include:
Figure 1.7: Which neighborhoods in New York city tip most generously? The
relatively remote outer boroughs of Brooklyn and Queens, where trips are
longest and supply is relatively scarce.
How much money do drivers make each night, on average? What is the
distribution? Do drivers make more on sunny days or rainy days?
Where are the best spots in the city for drivers to cruise, in order to pick
up profitable fares? How does this vary at different times of the day?
How far do drivers travel over the course of a night’s work? We can’t
answer this exactly using this data set, because it does not provide GPS
data of the route traveled between fares. But we do know the last place
of drop off, the next place of pickup, and how long it took to get between
them. Together, this should provide enough information to make a sound
Which drivers take their unsuspecting out-of-town passengers for a “ride,”
running up the meter on what should be a much shorter, cheaper trip?
How much are drivers tipped, and why? Do faster drivers get tipped
better? How do tipping rates vary by neighborhood, and is it the rich
neighborhoods or poor neighborhoods which prove more generous?
I will confess we did an analysis of this, which I will further describe in
the war story of Section 9.3. We found a variety of interesting patterns
[SS15]. Figure 1.7 shows that Manhattanites are generally cheapskates
relative to large swaths of Brooklyn, Queens, and Staten Island, where
trips are longer and street cabs a rare but welcome sight.
But the bigger questions have to do with understanding transportation in
the city. We can use the taxi travel times as a sensor to measure the level of
traffic in the city at a fine level. How much slower is traffic during rush hour
than other times, and where are delays the worst? Identifying problem areas is
the first step to proposing solutions, by changing the timing patterns of traffic
lights, running more buses, or creating high-occupancy only lanes.
Similarly we can use the taxi data to measure transportation flows across
the city. Where are people traveling to, at different times of the day? This tells
us much more than just congestion. By looking at the taxi data, we should
be able to see tourists going from hotels to attractions, executives from fancy
neighborhoods to Wall Street, and drunks returning home from nightclubs after
a bender.
Data like this is essential to designing better transportation systems. It is
wasteful for a single rider to travel from point ato point bwhen there is another
rider at point a+who also wants to get there. Analysis of the taxi data enables
accurate simulation of a ride sharing system, so we can accurately evaluate the
demands and cost reductions of such a service.
1.3 Properties of Data
This book is about techniques for analyzing data. But what is the underlying
stuff that we will be studying? This section provides a brief taxonomy of the
properties of data, so we can better appreciate and understand what we will be
working on.
1.3.1 Structured vs. Unstructured Data
Certain data sets are nicely structured, like the tables in a database or spread-
sheet program. Others record information about the state of the world, but in
a more heterogeneous way. Perhaps it is a large text corpus with images and
links like Wikipedia, or the complicated mix of notes and test results appearing
in personal medical records.
Generally speaking, this book will focus on dealing with structured data.
Data is often represented by a matrix, where the rows of the matrix represent
distinct items or records, and the columns represent distinct properties of these
items. For example, a data set about U.S. cities might contain one row for each
city, with columns representing features like state, population, and area.
When confronted with an unstructured data source, such as a collection of
tweets from Twitter, our first step is generally to build a matrix to structure
it. A bag of words model will construct a matrix with a row for each tweet, and
a column for each frequently used vocabulary word. Matrix entry M[i, j] then
denotes the number of times tweet icontains word j. Such matrix formulations
will motivate our discussion of linear algebra, in Chapter 8.
1.3.2 Quantitative vs. Categorical Data
Quantitative data consists of numerical values, like height and weight. Such data
can be incorporated directly into algebraic formulas and mathematical models,
or displayed in conventional graphs and charts.
By contrast, categorical data consists of labels describing the properties of
the objects under investigation, like gender, hair color, and occupation. This
descriptive information can be every bit as precise and meaningful as numerical
data, but it cannot be worked with using the same techniques.
Categorical data can usually be coded numerically. For example, gender
might be represented as male = 0 or f emale = 1. But things get more com-
plicated when there are more than two characters per feature, especially when
there is not an implicit order between them. We may be able to encode hair
colors as numbers by assigning each shade a distinct value like gray hair = 0,
red hair = 1, and blond hair = 2. However, we cannot really treat these val-
ues as numbers, for anything other than simple identity testing. Does it make
any sense to talk about the maximum or minimum hair color? What is the
interpretation of my hair color minus your hair color?
Most of what we do in this book will revolve around numerical data. But
keep an eye out for categorical features, and methods that work for them. Clas-
sification and clustering methods can be thought of as generating categorical
labels from numerical data, and will be a primary focus in this book.
1.3.3 Big Data vs. Little Data
Data science has become conflated in the public eye with big data, the analysis of
massive data sets resulting from computer logs and sensor devices. In principle,
having more data is always better than having less, because you can always
throw some of it away by sampling to get a smaller set if necessary.
Big data is an exciting phenomenon, and we will discuss it in Chapter 12. But
in practice, there are difficulties in working with large data sets. Throughout
this book we will look at algorithms and best practices for analyzing data. In
general, things get harder once the volume gets too large. The challenges of big
data include:
The analysis cycle time slows as data size grows: Computational opera-
tions on data sets take longer as their volume increases. Small spreadsheets
provide instantaneous response, allowing you to experiment and play what
if? But large spreadsheets can be slow and clumsy to work with, and
massive-enough data sets might take hours or days to get answers from.
Clever algorithms can permit amazing things to be done with big data,
but staying small generally leads to faster analysis and exploration.
Large data sets are complex to visualize: Plots with millions of points on
them are impossible to display on computer screens or printed images, let
alone conceptually understand. How can we ever hope to really understand
something we cannot see?
Simple models do not require massive data to fit or evaluate: A typical
data science task might be to make a decision (say, whether I should offer
this fellow life insurance?) on the basis of a small number of variables:
say age, gender, height, weight, and the presence or absence of existing
medical conditions.
If I have this data on 1 million people with their associated life outcomes, I
should be able to build a good general model of coverage risk. It probably
wouldn’t help me build a substantially better model if I had this data
on hundreds of millions of people. The decision criteria on only a few
variables (like age and martial status) cannot be too complex, and should
be robust over a large number of applicants. Any observation that is so
subtle it requires massive data to tease out will prove irrelevant to a large
business which is based on volume.
Big data is sometimes called bad data. It is often gathered as the by-product
of a given system or procedure, instead of being purposefully collected to answer
your question at hand. The result is that we might have to go to heroic efforts
to make sense of something just because we have it.
Consider the problem of getting a pulse on voter preferences among presi-
dential candidates. The big data approach might analyze massive Twitter or
Facebook feeds, interpreting clues to their opinions in the text. The small data
approach might be to conduct a poll, asking a few hundred people this specific
question and tabulating the results. Which procedure do you think will prove
more accurate? The right data set is the one most directly relevant to the tasks
at hand, not necessarily the biggest one.
Take-Home Lesson: Do not blindly aspire to analyze large data sets. Seek the
right data to answer a given question, not necessarily the biggest thing you can
get your hands on.
1.4 Classification and Regression
Two types of problems arise repeatedly in traditional data science and pattern
recognition applications, the challenges of classification and regression. As this
book has developed, I have pushed discussions of the algorithmic approaches
to solving these problems toward the later chapters, so they can benefit from a
solid understanding of core material in data munging, statistics, visualization,
and mathematical modeling.
Still, I will mention issues related to classification and regression as they
arise, so it makes sense to pause here for a quick introduction to these problems,
to help you recognize them when you see them.
Classification: Often we seek to assign a label to an item from a discrete
set of possibilities. Such problems as predicting the winner of a particular
sporting contest (team Aor team B?) or deciding the genre of a given
movie (comedy, drama, or animation?) are classification problems, since
each entail selecting a label from the possible choices.
Regression: Another common task is to forecast a given numerical quan-
tity. Predicting a person’s weight or how much snow we will get this year
is a regression problem, where we forecast the future value of a numerical
function in terms of previous values and other relevant features.
Perhaps the best way to see the intended distinction is to look at a variety
of data science problems and label (classify) them as regression or classification.
Different algorithmic methods are used to solve these two types of problems,
although the same questions can often be approached in either way:
Will the price of a particular stock be higher or lower tomorrow? (classi-
What will the price of a particular stock be tomorrow? (regression)
Is this person a good risk to sell an insurance policy to? (classification)
How long do we expect this person to live? (regression)
Keep your eyes open for classification and regression problems as you en-
counter them in your life, and in this book.
1.5 Data Science Television: The Quant Shop
I believe that hands-on experience is necessary to internalize basic principles.
Thus when I teach data science, I like to give each student team an interesting
but messy forecasting challenge, and demand that they build and evaluate a
predictive model for the task.
These forecasting challenges are associated with events where the students
must make testable predictions. They start from scratch: finding the relevant
data sets, building their own evaluation environments, and devising their model.
Finally, I make them watch the event as it unfolds, so as to witness the vindi-
cation or collapse of their prediction.
As an experiment, we documented the evolution of each group’s project
on video in Fall 2014. Professionally edited, this became The Quant Shop, a
television-like data science series for a general audience. The eight episodes of
this first season are available at, and include:
Finding Miss Universe – The annual Miss Universe competition aspires
to identify the most beautiful woman in the world. Can computational
models predict who will win a beauty contest? Is beauty just subjective,
or can algorithms tell who is the fairest one of all?
Modeling the Movies – The business of movie making involves a lot of
high-stakes data analysis. Can we build models to predict which film will
gross the most on Christmas day? How about identifying which actors
will receive awards for their performance?
Winning the Baby Pool – Birth weight is an important factor in assessing
the health of a newborn child. But how accurately can we predict junior’s
weight before the actual birth? How can data clarify environmental risks
to developing pregnancies?
The Art of the Auction – The world’s most valuable artworks sell at auc-
tions to the highest bidder. But can we predict how many millions a
particular J.W. Turner painting will sell for? Can computers develop an
artistic sense of what’s worth buying?
White Christmas – Weather forecasting is perhaps the most familiar do-
main of predictive modeling. Short-term forecasts are generally accurate,
but what about longer-term prediction? What places will wake up to a
snowy Christmas this year? And can you tell one month in advance?
Predicting the Playoffs – Sports events have winners and losers, and book-
ies are happy to take your bets on the outcome of any match. How well can
statistics help predict which football team will win the Super Bowl? Can
Google’s PageRank algorithm pick the winners on the field as accurately
as it does on the web?
The Ghoul Pool – Death comes to all men, but when? Can we apply
actuarial models to celebrities, to decide who will be the next to die?
Similar analysis underlies the workings of the life insurance industry, where
accurate predictions of lifespan are necessary to set premiums which are
both sustainable and affordable.
Figure 1.8: Exciting scenes from data science television: The Quant Shop.
Playing the Market – Hedge fund quants get rich when guessing right
about tomorrow’s prices, and poor when wrong. How accurately can we
predict future prices of gold and oil using histories of price data? What
other information goes into building a successful price model?
I encourage you to watch some episodes of The Quant Shop in tandem with
reading this book. We try to make it fun, although I am sure you will find
plenty of things to cringe at. Each show runs for thirty minutes, and maybe
will inspire you to tackle a prediction challenge of your own.
These programs will certainly give you more insight into these eight specific
challenges. I will use these projects throughout this book to illustrate important
lessons in how to do data science, both as positive and negative examples. These
projects provide a laboratory to see how intelligent but inexperienced people not
wildly unlike yourself thought about a data science problem, and what happened
when they did.
1.5.1 Kaggle Challenges
Another source of inspiration are challenges from Kaggle (,
which provides a competitive forum for data scientists. New challenges are
posted on a regular basis, providing a problem definition, training data, and
a scoring function over hidden evaluation data. A leader board displays the
scores of the strongest competitors, so you can see how well your model stacks
up in comparison with your opponents. The winners spill their modeling secrets
during post-contest interviews, to help you improve your modeling skills.
Performing well on Kaggle challenges is an excellent credential to put on your
resume to get a good job as a data scientist. Indeed, potential employers will
track you down if you are a real Kaggle star. But the real reason to participate
is that the problems are fun and inspiring, and practice helps make you a better
data scientist.
The exercises at the end of each chapter point to expired Kaggle challenges,
loosely connected to the material in that chapter. Be forewarned that Kaggle
provides a misleading glamorous view of data science as applied machine learn-
ing, because it presents extremely well-defined problems with the hard work
of data collection and cleaning already done for you. Still, I encourage you to
check it out for inspiration, and as a source of data for new projects.
1.6 About the War Stories
Genius and wisdom are two distinct intellectual gifts. Genius shows in discover-
ing the right answer, making imaginative mental leaps which overcome obstacles
and challenges. Wisdom shows in avoiding obstacles in the first place, providing
a sense of direction or guiding light that keeps us moving soundly in the right
Genius is manifested in technical strength and depth, the ability to see things
and do things that other people cannot. In contrast, wisdom comes from ex-
perience and general knowledge. It comes from listening to others. Wisdom
comes from humility, observing how often you have been wrong in the past and
figuring out why you were wrong, so as to better recognize future traps and
avoid them.
Data science, like most things in life, benefits more from wisdom than from
genius. In this book, I seek to pass on wisdom that I have accumulated the hard
way through war stories, gleaned from a diverse set of projects I have worked
Large-scale text analytics and NLP: My Data Science Laboratory at Stony
Brook University works on a variety of projects in big data, including sen-
timent analysis from social media, historical trends analysis, deep learning
approaches to natural language processing (NLP), and feature extraction
from networks.
Start-up companies: I served as co-founder and chief scientist to two
data analytics companies: General Sentiment and Thrivemetrics. General
Sentiment analyzed large-scale text streams from news, blogs, and social
media to identify trends in the sentiment (positive or negative) associated
with people, places, and things. Thrivemetrics applied this type of analysis
to internal corporate communications, like email and messaging systems.
Neither of these ventures left me wealthy enough to forgo my royalties
from this book, but they did provide me with experience on cloud-based
computing systems, and insight into how data is used in industry.
Collaborating with real scientists: I have had several interesting collab-
orations with biologists and social scientists, which helped shape my un-
derstanding of the complexities of working with real data. Experimental
data is horribly noisy and riddled with errors, yet you must do the best
you can with what you have, in order to discover how the world works.
Building gambling systems: A particularly amusing project was building
a system to predict the results of jai-alai matches so we could bet on them,
an experience recounted in my book Calculated Bets: Computers, Gam-
bling, and Mathematical Modeling to Win [Ski01]. Our system relied on
web scraping for data collection, statistical analysis, simulation/modeling,
and careful evaluation. We also have developed and evaluated predictive
models for movie grosses [ZS09], stock prices [ZS10], and football games
[HS10] using social media analysis.
Ranking historical figures: By analyzing Wikipedia to extract meaningful
variables on over 800,000 historical figures, we developed a scoring func-
tion to rank them by their strength as historical memes. This ranking
does a great job separating the greatest of the great (Jesus, Napoleon,
Shakespeare, Mohammad, and Lincoln round out the top five) from lesser
mortals, and served as the basis for our book Who’s Bigger?: Where His-
torical Figures Really Rank [SW13].
All this experience drives what I teach in this book, especially the tales that
I describe as war stories. Every one of these war stories is true. Of course, the
stories improve somewhat in the retelling, and the dialogue has been punched
up to make them more interesting to read. However, I have tried to honestly
trace the process of going from a raw problem to a solution, so you can watch
how it unfolded.
1.7 War Story: Answering the Right Question
Our research group at Stony Brook University developed an NLP-based system
for analyzing millions of news, blogs and social media messages, and reducing
this text to trends concerning all the entities under discussion. Counting the
number of mentions each name receives in a text stream (volume) is easy, in
principle. Determining whether the connotation of a particular reference is
positive or negative (sentiment analysis) is hard. But our system did a pretty
good job, particularly when aggregated over many references.
This technology served as the foundation for a social media analysis company
named General Sentiment. It was exciting living through a start-up starting up,
facing the challenges of raising money, hiring staff, and developing new products.
But perhaps the biggest problem we faced was answering the right question.
The General Sentiment system recorded trends about the sentiment and volume
for every person, place, and thing that was ever mentioned in news, blogs, and
social media: over 20 million distinct entities. We monitored the reputations of
celebrities and politicians. We monitored the fates of companies and products.
We tracked the performance of sports teams, and the buzz about movies. We
could do anything!
But it turns out that no one pays you to do anything. They pay you to do
something, to solve a particular problem they have, or eliminate a specific pain
point in their business. Being able to do anything proves to be a terrible sales
strategy, because it requires you to find that need afresh for each and every
Facebook didn’t open up to the world until September 2006. So when Gen-
eral Sentiment started in 2008, we were at the very beginning of the social media
era. We had lots of interest from major brands and advertising agencies which
knew that social media was ready to explode. They knew this newfangled thing
was important, and that they had to be there. They knew that proper analysis
of social media data could give them fresh insights into what their customers
were thinking. But they didn’t know exactly what it was they really wanted to
One aircraft engine manufacturer was very interested in learning how much
the kids talked about them on Facebook. We had to break it to them gently
that the answer was zero. Other potential customers demanded proof that we
were more accurate than the Nielsen television ratings. But of course, if you
wanted Nielsen ratings then you should buy them from Nielsen. Our system
provided different insights from a completely different world. But you had to
know what you wanted in order to use them.
We did manage to get substantial contracts from a very diverse group of
customers, including consumer brands like Toyota and Blackberry, governmental
organizations like the Hawaii tourism office, and even the presidential campaign
of Republican nominee Mitt Romney in 2012. Our analysts provided them
insights into a wide variety of business issues:
What did people think about Hawaii? (Answer: they think it is a very
nice place to visit.)
How quickly would Toyota’s sentiment recover after news of serious brake
problems in their cars? (Answer: about six months.)
What did people think about Blackberry’s new phone models? (Answer:
they liked the iPhone much better.)
How quickly would Romney’s sentiment recover after insulting 47% of the
electorate in a recorded speech? (Answer: never.)
But each sale required entering a new universe, involving considerable effort
and imagination on the part of our sales staff and research analysts. We never
managed to get two customers in the same industry, which would have let us
benefit from scale and accumulated wisdom.
Of course, the customer is always right. It was our fault that we could not
explain to them the best way to use our technology. The lesson here is that the
world will not beat a path to your door just for a new source of data. You must
be able to supply the right questions before you can turn data into money.
1.8 Chapter Notes
The idea of using historical records from baseball players to establish that left-
handers have shorter lifespans is due to Halpern and Coren [HC88, HC91],
but their conclusion remains controversial. The percentage of left-handers in
the population has been rapidly growing, and the observed effects may be a
function of survivorship bias [McM04]. So lefties, hang in there! Full disclosure:
I am one of you.
The discipline of quantitative baseball analysis is sometimes called sabermet-
rics, and its leading light is a fellow named Bill James. I recommend budding
data scientists read his Historical Baseball Abstract [Jam10] as an excellent ex-
ample of how one turns numbers into knowledge and understanding. Time
Magazine once said of James: “Much of the joy of reading him comes from the
extravagant spectacle of a first-rate mind wasting itself on baseball.” I thank for permission to use images of their website
in this book. Ditto to Amazon, the owner of IMDb.
The potential of ride-sharing systems in New York was studied by Santi et.
al. [SRS+14], who showed that almost 95% of the trips could have been shared
with no more than five minutes delay per trip.
The Lydia system for sentiment analysis is described in [GSS07]. Methods
to identify changes in word meaning through analysis of historical text corpora
like Google Ngram are reported in [KARPS15].
1.9 Exercises
Identifying Data Sets
1-1. [3] Identify where interesting data sets relevant to the following domains can be
found on the web:
(a) Books.
(b) Horse racing.
(c) Stock prices.
(d) Risks of diseases.
(e) Colleges and universities.
(f) Crime rates.
(g) Bird watching.
For each of these data sources, explain what you must do to turn this data into
a usable format on your computer for analysis.
1-2. [3] Propose relevant data sources for the following The Quant Shop prediction
challenges. Distinguish between sources of data that you are sure somebody must
have, and those where the data is clearly available to you.
(a) Miss Universe.
(b) Movie gross.
(c) Baby weight.
(d) Art auction price.
(e) White Christmas.
(f) Football champions.
(g) Ghoul pool.
(h) Gold/oil prices.
1-3. [3] Visit, and identify five data sets that sound interesting to
you. For each write a brief description, and propose three interesting things you
might do with them.
Asking Questions
1-4. [3] For each of the following data sources, propose three interesting questions
you can answer by analyzing them:
(a) Credit card billing data.
(b) Click data from
(c) White Pages residential/commercial telephone directory.
1-5. [5] Visit Entrez, the National Center for Biotechnology Information (NCBI)
portal. Investigate what data sources are available, particularly the Pubmed
and Genome resources. Propose three interesting projects to explore with each
of them.
1-6. [5] You would like to conduct an experiment to establish whether your friends
prefer the taste of regular Coke or Diet Coke. Briefly outline a design for such
a study.
1-7. [5] You would like to conduct an experiment to see whether students learn better
if they study without any music, with instrumental music, or with songs that
have lyrics. Briefly outline the design for such a study.
1-8. [5] Traditional polling operations like Gallup use a procedure called random digit
dialing, which dials random strings of digits instead of picking phone numbers
from the phone book. Suggest why such polls are conducted using random digit
Implementation Projects
1-9. [5] Write a program to scrape the best-seller rank for a book on
Use this to plot the rank of all of Skiena’s books over time. Which one of these
books should be the next item that you purchase? Do you have friends for whom
they would make a welcome and appropriate gift? :-)
1-10. [5] For your favorite sport (baseball, football, basketball, cricket, or soccer)
identify a data set with the historical statistical records for all major partici-
pants. Devise and implement a ranking system to identify the best player at
each position.
Interview Questions
1-11. [3] For each of the following questions: (1) produce a quick guess based only on
your understanding of the world, and then (2) use Google to find supportable
numbers to produce a more principled estimate from. How much did your two
estimates differ by?
(a) How many piano tuners are there in the entire world?
(b) How much does the ice in a hockey rink weigh?
(c) How many gas stations are there in the United States?
(d) How many people fly in and out of LaGuardia Airport every day?
(e) How many gallons of ice cream are sold in the United States each year?
(f) How many basketballs are purchased by the National Basketball Associa-
tion (NBA) each year?
(g) How many fish are there in all the world’s oceans?
(h) How many people are flying in the air right now, all over the world?
(i) How many ping-pong balls can fit in a large commercial jet?
(j) How many miles of paved road are there in your favorite country?
(k) How many dollar bills are sitting in the wallets of all people at Stony Brook
(l) How many gallons of gasoline does a typical gas station sell per day?
(m) How many words are there in this book?
(n) How many cats live in New York city?
(o) How much would it cost to fill a typical car’s gas tank with Starbuck’s
(p) How much tea is there in China?
(q) How many checking accounts are there in the United States?
1-12. [3] What is the difference between regression and classification?
1-13. [8] How would you build a data-driven recommendation system? What are the
limitations of this approach?
1-14. [3] How did you become interested in data science?
1-15. [3] Do you think data science is an art or a science?
Kaggle Challenges
1-16. Who survived the shipwreck of the Titanic?
1-17. Where is a particular taxi cab going?
1-18. How long will a given taxi trip take?
Chapter 2
Mathematical Preliminaries
A data scientist is someone who knows more statistics than a com-
puter scientist and more computer science than a statistician.
– Josh Blumenstock
You must walk before you can run. Similarly, there is a certain level of mathe-
matical maturity which is necessary before you should be trusted to do anything
meaningful with numerical data.
In writing this book, I have assumed that the reader has had some degree
of exposure to probability and statistics, linear algebra, and continuous math-
ematics. I have also assumed that they have probably forgotten most of it, or
perhaps didn’t always see the forest (why things are important, and how to use
them) for the trees (all the details of definitions, proofs, and operations).
This chapter will try to refresh your understanding of certain basic math-
ematical concepts. Follow along with me, and pull out your old textbooks if
necessary for future reference. Deeper concepts will be introduced later in the
book when we need them.
2.1 Probability
Probability theory provides a formal framework for reasoning about the likeli-
hood of events. Because it is a formal discipline, there are a thicket of associated
definitions to instantiate exactly what we are reasoning about:
An experiment is a procedure which yields one of a set of possible out-
comes. As our ongoing example, consider the experiment of tossing two
six-sided dice, one red and one blue, with each face baring a distinct inte-
ger {1,...,6}.
Asample space Sis the set of possible outcomes of an experiment. In our
dice example, there are 36 possible outcomes, namely
An event Eis a specified subset of the outcomes of an experiment. The
event that the sum of the dice equals 7 or 11 (the conditions to win at
craps on the first roll) is the subset
The probability of an outcome s, denoted p(s) is a number with the two
For each outcome sin sample space S, 0 p(s)1.
The sum of probabilities of all outcomes adds to one: PsSp(s) = 1.
If we assume two distinct fair dice, the probability p(s) = (1/6) ×(1/6) =
1/36 for all outcomes sS.
The probability of an event Eis the sum of the probabilities of the out-
comes of the experiment. Thus
p(E) = X
An alternate formulation is in terms of the complement of the event ¯
the case when Edoes not occur. Then
This is useful, because often it is easier to analyze P(¯
E) than P(E) di-
Arandom variable Vis a numerical function on the outcomes of a proba-
bility space. The function “sum the values of two dice” (V((a, b)) = a+b)
produces an integer result between 2 and 12. This implies a probabil-
ity distribution of the values of the random variable. The probability
P(V(s) = 7) = 1/6, as previously shown, while P(V(s) = 12) = 1/36.
The expected value of a random variable Vdefined on a sample space S,
E(V) is defined
E(V) = X
All this you have presumably seen before. But it provides the language we
will use to connect between probability and statistics. The data we see usually
comes from measuring properties of observed events. The theory of probability
and statistics provides the tools to analyze this data.
2.1.1 Probability vs. Statistics
Probability and statistics are related areas of mathematics which concern them-
selves with analyzing the relative frequency of events. Still, there are funda-
mental differences in the way they see the world:
Probability deals with predicting the likelihood of future events, while
statistics involves the analysis of the frequency of past events.
Probability is primarily a theoretical branch of mathematics, which studies
the consequences of mathematical definitions. Statistics is primarily an
applied branch of mathematics, which tries to make sense of observations
in the real world.
Both subjects are important, relevant, and useful. But they are different, and
understanding the distinction is crucial in properly interpreting the relevance
of mathematical evidence. Many a gambler has gone to a cold and lonely grave
for failing to make the proper distinction between probability and statistics.
This distinction will perhaps become clearer if we trace the thought process
of a mathematician encountering her first craps game:
If this mathematician were a probabilist, she would see the dice and think
“Six-sided dice? Each side of the dice is presumably equally likely to land
face up. Now assuming that each face comes up with probability 1/6, I
can figure out what my chances are of crapping out.”
If instead a statistician wandered by, she would see the dice and think
“How do I know that they are not loaded? I’ll watch a while, and keep
track of how often each number comes up. Then I can decide if my ob-
servations are consistent with the assumption of equal-probability faces.
Once I’m confident enough that the dice are fair, I’ll call a probabilist to
tell me how to bet.”
In summary, probability theory enables us to find the consequences of a
given ideal world, while statistical theory enables us to measure the extent to
which our world is ideal. This constant tension between theory and practice is
why statisticians prove to be a tortured group of individuals compared with the
happy-go-lucky probabilists.
Modern probability theory first emerged from the dice tables of France in
1654. Chevalier de M´er´e, a French nobleman, wondered whether the player or
the house had the advantage in a particular betting game.1In the basic version,
the player rolls four dice, and wins provided none of them are a 6. The house
collects on the even money bet if at least one 6 appears.
De M´er´e brought this problem to the attention of the French mathematicians
Blaise Pascal and Pierre de Fermat, most famous as the source of Fermat’s Last
Theorem. Together, these men worked out the basics of probability theory,
1He really shouldn’t have wondered. The house always has the advantage.
Figure 2.1: Venn diagrams illustrating set difference (left), intersection (middle),
and union (right).
along the way establishing that the house wins this dice game with probability
p= 1 (5/6)40.517, where the probability p= 0.5 would denote a fair game
where the house wins exactly half the time.
2.1.2 Compound Events and Independence
We will be interested in complex events computed from simpler events Aand B
on the same set of outcomes. Perhaps event Ais that at least one of two dice
be an even number, while event Bdenotes rolling a total of either 7 or 11. Note
that there exist certain outcomes of Awhich are not outcomes of B, specifically
This is the set difference operation. Observe that here BA={}, because
every pair adding to 7 or 11 must contain one odd and one even number.
The outcomes in common between both events Aand Bare called the in-
tersection, denoted AB. This can be written as
Outcomes which appear in either Aor Bare called the union, denoted AB.
With the complement operation ¯
A=SA, we get a rich language for combining
events, shown in Figure 2.1. We can readily compute the probability of any of
these sets by summing the probabilities of the outcomes in the defined sets.
The events Aand Bare independent if and only if
P(AB) = P(A)×P(B).
This means that there is no special structure of outcomes shared between events
Aand B. Assuming that half of the students in my class are female, and half
the students in my class are above average, we would expect that a quarter of
my students are both female and above average if the events are independent.
Probability theorists love independent events, because it simplifies their cal-
culations. But data scientists generally don’t. When building models to predict
the likelihood of some future event B, given knowledge of some previous event
A, we want as strong a dependence of Bon Aas possible.
Suppose I always use an umbrella if and only if it is raining. Assume that
the probability it is raining here (event B) is, say, p= 1/5. This implies the
probability that I am carrying my umbrella (event A) is q= 1/5. But even
more, if you know the state of the rain you know exactly whether I have my
umbrella. These two events are perfectly correlated.
By contrast, suppose the events were independent. Then
P(A|B) = P(AB)
and whether it is raining has absolutely no impact on whether I carry my pro-
tective gear.
Correlations are the driving force behind predictive models, so we will discuss
how to measure them and what they mean in Section 2.3.
2.1.3 Conditional Probability
When two events are correlated, there is a dependency between them which
makes calculations more difficult. The conditional probability of Agiven B,
P(A|B) is defined:
P(A|B) = P(AB)
Recall the dice rolling events from Section 2.1.2, namely:
Event Ais that at least one of two dice be an even number.
Event Bis the sum of the two dice is either a 7 or an 11.
Observe that P(A|B) = 1, because any roll summing to an odd value must
consist of one even and one odd number. Thus AB=B, analogous to the
umbrella case above. For P(B|A), note that P(AB)=9/36 and P(A) =
25/36, so P(B|A)=9/25.
Conditional probability will be important to us, because we are interested in
the likelihood of an event A(perhaps that a particular piece of email is spam)
as a function of some evidence B(perhaps the distribution of words within the
document). Classification problems generally reduce to computing conditional
probabilities, in one way or another.
Our primary tool to compute conditional probabilities will be Bayes theorem,
which reverses the direction of the dependencies:
P(B|A) = P(A|B)P(B)
Often it proves easier to compute probabilities in one direction than another, as
in this problem. By Bayes theorem P(B|A) = (1·9/36)/(25/36) = 9/25, exactly
Figure 2.2: The probability density function (pdf) of the sum of two dice con-
tains exactly the same information as the cumulative density function (cdf), but
looks very different.
what we got before. We will revisit Bayes theorem in Section 5.6, where it will
establish the foundations of computing probabilities in the face of evidence.
2.1.4 Probability Distributions
Random variables are numerical functions where the values are associated with
probabilities of occurrence. In our example where V(s) the sum of two tossed
dice, the function produces an integer between 2 and 12. The probability of a
particular value V(s) = Xis the sum of the probabilities of all the outcomes
which add up to X.
Such random variables can be represented by their probability density func-
tion, or pdf. This is a graph where the x-axis represents the range of values
the random variable can take on, and the y-axis denotes the probability of that
given value. Figure 2.2 (left) presents the pdf of the sum of two fair dice. Ob-
serve that the peak at X= 7 corresponds to the most frequent dice total, with
a probability of 1/6.
Such pdf plots have a strong relationship to histograms of data frequency,
where the x-axis again represents the range of value, but ynow represents the
observed frequency of exactly how many event occurrences were seen for each
given value X. Converting a histogram to a pdf can be done by dividing each
bucket by the total frequency over all buckets. The sum of the entries then
becomes 1, so we get a probability distribution.
Histograms are statistical: they reflect actual observations of outcomes. In
contrast, pdfs are probabilistic: they represent the underlying chance that the
next observation will have value X. We often use the histogram of observations
h(x) in practice to estimate the probabilities2by normalizing counts by the total
2A technique called discounting offers a better way to estimate the frequency of rare events,
and will be discussed in Section 11.1.2.
Figure 2.3: iPhone quarterly sales data presented as cumulative and incremental
(quarterly) distributions. Which curve did Apple CEO Tim Cook choose to
number of observations:
P(k=X) = h(k=X)
There is another way to represent random variables which often proves use-
ful, called a cumulative density function or cdf. The cdf is the running sum of
the probabilities in the pdf; as a function of k, it reflects the probability that
Xkinstead of the probability that X=k. Figure 2.2 (right) shows the
cdf of the dice sum distribution. The values increase monotonically from left
to right, because each term comes from adding a positive probability to the
previous total. The rightmost value is 1, because all outcomes produce a value
no greater than the maximum.
It is important to realize that the pdf P(V) and cdf C(V) of a given random
variable Vcontain exactly the same information. We can move back and forth
between them because:
P(k=X) = C(Xk+δ)C(Xk),
where δ= 1 for integer distributions. The cdf is the running sum of the pdf, so
C(Xk) = X
Just be aware of which distribution you are looking at. Cumulative distribu-
tions always get higher as we move to the right, culminating with a probability
of C(X≤ ∞) = 1. By contrast, the total area under the curve of a pdf equals
1, so the probability at any point in the distribution is generally substantially
An amusing example of the difference between cumulative and incremental
distributions is shown in Figure 2.3. Both distributions show exactly the same
data on Apple iPhone sales, but which curve did Apple CEO Tim Cook choose to
present at a major shareholder event? The cumulative distribution (red) shows
that sales are exploding, right? But it presents a misleading view of growth
rate, because incremental change is the derivative of this function, and hard to
visualize. Indeed, the sales-per-quarter plot (blue) shows that the rate of iPhone
sales actually had declined for the last two periods before the presentation.
2.2 Descriptive Statistics
Descriptive statistics provide ways of capturing the properties of a given data
set or sample. They summarize observed data, and provide a language to talk
about it. Representing a group of elements by a new derived element, like
mean, min, count, or sum reduces a large data set to a small summary statistic:
aggregation as data reduction.
Such statistics can become features in their own right when taken over natu-
ral groups or clusters in the full data set. There are two main types of descriptive
Central tendency measures, which capture the center around which the
data is distributed.
Variation or variability measures, which describe the data spread, i.e. how
far the measurements lie from the center.
Together these statistics tell us an enormous amount about our distribution.
2.2.1 Centrality Measures
The first element of statistics we are exposed to in school are the basic centrality
measures: mean, median, and mode. These are the right place to start when
thinking of a single number to characterize a data set.
Mean: You are probably quite comfortable with the use of the arithmetic
mean, where we sum values and divide by the number of observations:
We can easily maintain the mean under a stream of insertions and dele-
tions, by keeping the sum of values separate from the frequency count,
and divide only on demand.
The mean is very meaningful to characterize symmetric distributions with-
out outliers, like height and weight. That it is symmetric means the num-
ber of items above the mean should be roughly the same as the number
below. That it is without outliers means that the range of values is rea-
sonably tight. Note that a single MAXINT creeping into an otherwise
sound set of observations throws the mean wildly off. The median is a
centrality measure which proves more appropriate with such ill-behaved
Geometric mean: The geometric mean is the nth root of the product of n
a1a2. . . an
The geometric mean is always less than or equal to the arithmetic mean.
For example, the geometric mean of the sums of 36 dice rolls is 6.5201, as
opposed to the arithmetic mean of 7. It is very sensitive to values near
zero. A single value of zero lays waste to the geometric mean: no matter
what other values you have in your data, you end up with zero. This is
somewhat analogous to having an outlier of in an arithmetic mean.
But geometric means prove their worth when averaging ratios. The ge-
ometric mean of 1/2 and 2/1 is 1, whereas the mean is 1.25. There is
less available “room” for ratios to be less than 1 than there is for ratios
above 1, creating an asymmetry that the arithmetic mean overstates. The
geometric mean is more meaningful in these cases, as is the arithmetic
mean of the logarithms of the ratios.
Median: The median is the exact middle value among a data set; just as
many elements lie above the median as below it. There is a quibble about
what to take as the median when you have an even number of elements.
You can take either one of the two central candidates: in any reasonable
data set these two values should be about the same. Indeed in the dice
example, both are 7.
A nice property of the median as so defined is that it must be a genuine
value of the original data stream. There actually is someone of median
height to you can point to as an example, but presumably no one in the
world is of exactly average height. You lose this property when you average
the two center elements.
Which centrality measure is best for applications? The median typically
lies pretty close to the arithmetic mean in symmetrical distributions, but
it is often interesting to see how far apart they are, and on which side of
the mean the median lies.
The median generally proves to be a better statistic for skewed distribu-
tions or data with outliers: like wealth and income. Bill Gates adds $250
to the mean per capita wealth in the United States, but nothing to the
median. If he makes you personally feel richer, then go ahead and use the
mean. But the median is the more informative statistic here, as it will be
for any power law distribution.
Hours Hours
Figure 2.4: Two distinct probability distributions with µ= 3000 for the lifespan
of light bulbs: normal (left) and with zero variance (right).
Mode: The mode is the most frequent element in the data set. This is 7
in our ongoing dice example, because it occurs six times out of thirty-six
elements. Frankly, I’ve never seen the mode as providing much insight
as centrality measure, because it often isn’t close to the center. Samples
measured over a large range should have very few repeated elements or
collisions at any particular value. This makes the mode a matter of hap-
penstance. Indeed, the most frequently occurring elements often reveal
artifacts or anomalies in a data set, such as default values or error codes
that do not really represent elements of the underlying distribution.
The related concept of the peak in a frequency distribution (or histogram)
is meaningful, but interesting peaks only get revealed through proper buck-
eting. The current peak of the annual salary distribution in the United
States lies between $30,000 and $40,000 per year, although the mode pre-
sumably sits at zero.
2.2.2 Variability Measures
The most common measure of variability is the standard deviation σ, which
measures sum of squares differences between the individual elements and the
A related statistic, the variance V, is the square of the standard deviation,
i.e. V=σ2. Sometimes it is more convenient to talk about variance than
standard deviation, because the term is eight characters shorter. But they
measure exactly the same thing.
As an example, consider the humble light bulb, which typically comes with
an expected working life, say µ= 3000 hours, derived from some underlying dis-
tribution shown in Figure 2.4. In a conventional bulb, the chance of it lasting
longer than µis presumably about the same as that of it burning out quicker,
and this degree of uncertainty is measured by σ. Alternately, imagine a “printer
cartridge bulb,” where the evil manufacturer builds very robust bulbs, but in-
cludes a counter so they can prevent it from ever glowing after 3000 hours of
use. Here µ= 3000 and σ= 0. Both distributions have the same mean, but
substantially different variance.
The sum of squares penalty in the formula for σmeans that one outlier value
dunits from the mean contributes as much to the variance as d2points each
one unit from the mean, so the variance is very sensitive to outliers.
An often confusing matter concerns the denominator in the formula for stan-
dard deviation. Should we divide by nor n1? The difference here is technical.
The standard deviation of the full population divides by n, whereas the standard
deviation of the sample divides by n1. The issue is that sampling just one
point tells us absolutely nothing about the underlying variance in any popu-
lation, where it is perfectly reasonable to say there is zero variance in weight
among the population of a one-person island. But for reasonable-sized data sets
n(n1), so it really doesn’t matter.
2.2.3 Interpreting Variance
Repeated observations of the same phenomenon do not always produce the
same results, due to random noise or error. Sampling errors result when our
observations capture unrepresentative circumstances, like measuring rush hour
traffic on weekends as well as during the work week. Measurement errors reflect
the limits of precision inherent in any sensing device. The notion of signal
to noise ratio captures the degree to which a series of observations reflects a
quantity of interest as opposed to data variance. As data scientists, we care
about changes in the signal instead of the noise, and such variance often makes
this problem surprisingly difficult.
I think of variance as an inherent property of the universe, akin to the speed
of light or the time-value of money. Each morning you weigh yourself on a scale
you are guaranteed to get a different number, with changes reflecting when you
last ate (sampling error), the flatness of the floor, or the age of the scale (both
measurement error) as much as changes in your body mass (actual variation).
So what is your real weight?
Every measured quantity is subject to some level of variance, but the phe-
nomenon cuts much deeper than that. Much of what happens in the world is
just random fluctuations or arbitrary happenstance causing variance even when
the situation is unchanged. Data scientists seek to explain the world through
data, but distressingly often there is no real phenomena to explain, only a ghost
created by variance. Examples include:
The stock market: Consider the problem of measuring the relative “skill”
of different stock market investors. We know that Warren Buffet is much
better at investing than we are. But very few professional investors prove
consistently better than others. Certain investment vehicles wildly out-
perform the market in any given time period. However, the hot fund one
Figure 2.5: Sample variance on hitters with a real 30% success rate results in a
wide range of observed performance even over 500 trials per season.
year usually underperforms the market the year after, which shouldn’t
happen if this outstanding performance was due to skill rather than luck.
The fund managers themselves are quick to credit profitable years to their
own genius, but losses to unforeseeable circumstances. However, several
studies have shown that the performance of professional investors is es-
sentially random, meaning there is little real difference in skill. Most
investors are paying managers for previously-used luck. So why do these
entrail-readers get paid so much money?
Sports performance: Students have good semesters and bad semesters, as
reflected by their grade point average (GPA). Athletes have good and bad
seasons, as reflected by their performance and statistics. Do such changes
reflect genuine differences in effort and ability, or are they just variance?
In baseball, .300 hitters (players who hit with a 30% success rate) represent
consistency over a full season. Batting .275 is not a noteworthy season,
but hit .300 and you are a star. Hit .325 and you are likely to be the
batting champion.
Figure 2.5 shows the results of a simple simulation, where random numbers
were used to decide the outcome of each at-bat over a 500 at-bats/season.
Our synthetic player is a real .300 hitter, because we programmed it to
report a hit with probability 300/1000 (0.3). The results show that a real
.300 hitter has a 10% chance of hitting .275 or below, just by chance.
Such a season will typically be explained away by injuries or maybe the
inevitable effects of age on athletic performance. But it could just be
natural variance. Smart teams try to acquire a good hitter after a lousy
season, when the price is cheaper, trying to take advantage of this variance.
Our .300 hitter also has a 10% chance of batting above .325, but you
can be pretty sure that they will ascribe such a breakout season to their
improved conditioning or training methods instead of the fact they just
got lucky. Good or bad season, or lucky/unlucky: it is hard to tell the
signal from the noise.
Model performance: As data scientists, we will typically develop and eval-
uate several models for each predictive challenge. The models may range
from very simple to complex, and vary in their training conditions or
Typically the model with the best accuracy on the training corpus will
be paraded triumphantly before the world as the right one. But small
differences in the performance between models is likely explained by sim-
ple variance rather than wisdom: which training/evaluation pairs were
selected, how well parameters were optimized, etc.
Remember this when it comes to training machine learning models. In-
deed, when asked to choose between models with small performance dif-
ferences between them, I am more likely to argue for the simplest model
than the one with the highest score. Given a hundred people trying to
predict heads and tails on a stream of coin tosses, one of them is guar-
anteed to end up with the most right answers. But there is no reason to
believe that this fellow has any better predictive powers than the rest of
2.2.4 Characterizing Distributions
Distributions do not necessarily have much probability mass exactly at the
mean. Consider what your wealth would look like after you borrow $100 million,
and then bet it all on an even money coin flip. Heads you are now $100 million
in clear, tails you are $100 million in hock. Your expected wealth is zero, but
this mean does not tell you much about the shape of your wealth distribution.
However, taken together the mean and standard deviation do a decent job
of characterizing any distribution. Even a relatively small amount of mass
positioned far from the mean would add a lot to the standard deviation, so a
small value of σimplies the bulk of the mass must be near the mean.
To be precise, regardless of how your data is distributed, at least (1
(1/k2))th of the mass must lie within ±kstandard deviations of the mean.
This means that at least 75% of all the data must lie within 2σof the mean,
and almost 89% within 3σfor any distribution.
We will see that even tighter bounds hold when we know the distribution is
well-behaved, like the Gaussian or normal distribution. But this is why it is a
great practice to report both µand σwhenever you talk about averages. The
average height of adult women in the United States is 63.7±2.7 inches, meaning
µ= 63.7 and σ= 2.7. The average temperature in Orlando, Fl is 60.3 degrees
Fahrenheit. However, there have been many more 100 degree days at Disney
World than 100 inch (8.33 foot) women visiting to enjoy them.
Take-Home Lesson: Report both the mean and standard deviation to charac-
terize your distribution, written as µ±σ.
2.3 Correlation Analysis
Suppose we are given two variables xand y, represented by a sample of npoints
of the form (xi, yi), for 1 in. We say that xand yare correlated when the
value of xhas some predictive power on the value of y.
The correlation coefficient r(X, Y ) is a statistic that measures the degree
to which Yis a function of X, and vice versa. The value of the correlation
coefficient ranges from 1 to 1, where 1 means fully correlated and 0 implies
no relation, or independent variables. Negative correlations imply that the
variables are anti-correlated, meaning that when Xgoes up, Ygoes down.
Perfectly anti-correlated variables have a correlation of 1. Note that nega-
tive correlations are just as good for predictive purposes as positive ones. That
you are less likely to be unemployed the more education you have is an example
of a negative correlation, so the level of education can indeed help predict job
status. Correlations around 0 are useless for forecasting.
Observed correlations drives many of the predictive models we build in data
science. Representative strengths of correlations include:
Are taller people more likely to remain lean? The observed correlation
between height and BMI is r=0.711, so height is indeed negatively
correlated with body mass index (BMI).3
Do standardized tests predict the performance of students in college? The
observed correlation between SAT scores and freshmen GPA is r= 0.47,
so yes, there is some degree of predictive power. But social economic
status is just as strongly correlated with SAT scores (r= 0.42).4
Does financial status affect health? The observed correlation between
household income and the prevalence of coronary artery disease is r=
0.717, so there is a strong negative correlation. So yes, the wealthier
you are, the lower your risk of having a heart attack.5
Does smoking affect health? The observed correlation between a group’s
propensity to smoke and their mortality rate is r= 0.716, so for G-d’s
sake, don’t smoke.6
Do violent video games increase aggressive behavior? The observed cor-
relation between play and violence is r= 0.19, so there is a weak but
significant correlation.7
This section will introduce the primary measurements of correlation. Fur-
ther, we study how to appropriately determine the strength and power of any
observed correlation, to help us understand when the connections between vari-
ables are real.
2.3.1 Correlation Coefficients: Pearson and Spearman Rank
In fact, there are two primary statistics used to measure correlation. Mercifully,
both operate on the same 1 to 1 scale, although they measure somewhat
different things. These different statistics are appropriate in different situations,
so you should be aware of both of them.
The Pearson Correlation Coefficient
The more prominent of the two statistics is Pearson correlation, defined as
=Cov(X, Y )
Let’s parse this equation. Suppose Xand Yare strongly correlated. Then
we would expect that when xiis greater than the mean ¯
X, then yishould be
bigger than its mean ¯
Y. When xiis lower than its mean, yishould follow. Now
look at the numerator. The sign of each term is positive when both values are
above (1 ×1) or below (1×1) their respective means. The sign of each term
is negative ((1×1) or (1 ×1)) if they move in opposite directions, suggesting
negative correlation. If Xand Ywere uncorrelated, then positive and negative
terms should occur with equal frequency, offsetting each other and driving the
value to zero.
The numerator’s operation determining the sign of the correlation is so useful
that we give it a name, covariance, computed:
Cov(X, Y ) =
Remember covariance: we will see it again in Section 8.2.3.
The denominator of the Pearson formula reflects the amount of variance in
the two variables, as measured by their standard deviations. The covariance
between Xand Ypotentially increases with the variance of these variables, and
this denominator is the magic amount to divide it by to bring correlation to a
1 to 1 scale.
Figure 2.6: The function y=|x|does not have a linear model, but seems like it
should be easily fitted despite weak correlations.
The Spearman Rank Correlation Coefficient
The Pearson correlation coefficient defines the degree to which a linear predictor
of the form y=m·x+bcan fit the observed data. This generally does a good job
measuring the similarity between the variables, but it is possible to construct
pathological examples where the correlation coefficient between Xand Yis zero,
yet Yis completely dependent on (and hence perfectly predictable from) X.
Consider points of the form (x, |x|), where xis uniformly (or symmetrically)
sampled from the interval [1,1] as shown in Figure 2.6. The correlation will
be zero because for every point (x, x) there will be an offsetting point (x, x),
yet y=|x|is a perfect predictor. Pearson correlation measures how well the
best linear predictors can work, but says nothing about weirder functions like
absolute value.
The Spearman rank correlation coefficient essentially counts the number of
pairs of input points which are out of order. Suppose that our data set contains
points (x1, y1) and (x2, y2) where x1< x2and y1< y2. This is a vote that
the values are positively correlated, whereas the vote would be for a negative
correlation if y2< y1.
Summing up over all pairs of points and normalizing properly gives us Spear-
man rank correlation. Let rank(xi) be the rank position of xiin sorted order
among all xi, so the rank of the smallest value is 1 and the largest value n. Then
ρ= 1 6Pd2
where di=rank(xi)rank(yi).
The relationship between our two coefficients is better delineated by the
example in Figure 2.7. In addition to giving high scores to non-linear but
monotonic functions, Spearman correlation is less sensitive to extreme outlier
elements than Pearson. Let p= (x1, ymax) be the data point with largest value
Figure 2.7: A monotonic but not linear point set has a Spearman coefficient
r= 1 even though it has no good linear fit (left). Highly-correlated sequences
are recognized by both coefficients (center), but the Pearson coefficient is much
more sensitive to outliers (right).
of yin a given data set. Suppose we replace pwith p0= (x1,). The Pearson
correlation will go crazy, since the best fit now becomes the vertical line x=x1.
But the Spearman correlation will be unchanged, since all the points were under
p, just as they are now under p0.
2.3.2 The Power and Significance of Correlation
The correlation coefficient rreflects the degree to which xcan be used to predict
yin a given sample of points S. As |r| → 1, these predictions get better and
But the real question is how this correlation will hold up in the real world,
outside the sample. Stronger correlations have larger |r|, but also involve sam-
ples of enough points to be significant. There is a wry saying that if you want
to fit your data by a straight line, it is best to sample it at only two points.
Your correlation becomes more impressive the more points it is based on.
The statistical limits in interpreting correlations are presented in Figure 2.8,
based on strength and size:
Strength of correlation: R2: The square of the sample correlation coef-
ficient r2estimates the fraction of the variance in Yexplained by Xin
a simple linear regression. The correlation between height and weight is
approximately 0.8, meaning it explains about two thirds of the variance.
Figure 2.8 (left) shows how rapidly r2decreases with r. There is a pro-
found limit to how excited we should get about establishing a weak corre-
lation. A correlation of 0.5 possesses only 25% of the maximum predictive
power, and a correlation of r= 0.1 only 1%. Thus the predictive value of
correlations decreases rapidly with r.
What do we mean by “explaining the variance”? Let f(x) = mx+cbe the
Figure 2.8: Limits in interpreting significance. The r2value shows that weak
correlations explain only a small fraction of the variance (left). The level of cor-
relation necessary to be statistically significance decreases rapidly with sample
size n(right).
Figure 2.9: Plotting ri=yif(xi) shows that the residual values have lower
variance and mean zero. The original data points are on the left, with the
corresponding residuals on the right.
predictive value of yfrom x, with the parameters mand ccorresponding
to the best possible fit. The residual values ri=yif(xi) will have mean
zero, as shown in Figure 2.9. Further, the variance of the full data set
V(Y) should be much larger than V(r) if there is a good linear fit f(x).
If xand yare perfectly correlated, there should be no residual error, and
V(r) = 0. If xand yare totally uncorrelated, the fit should contribute
nothing, and V(y)V(r). Generally speaking, 1 r2=V(r)/V (y).
Consider Figure 2.9, showing a set of points (left) admitting a good linear
fit, with correlation r= 0.94. The corresponding residuals ri=yif(xi)
are plotted on the right. The variance of the yvalues on the left V(y) =
0.056, substantially greater than the variance V(r)=0.0065 on the right.
1r2= 0.116 V(r)/V (y)=0.116.
Statistical significance: The statistical significance of a correlation depends
upon its sample size nas well as r. By tradition, we say that a correlation
of npoints is significant if there is an α1/20 = 0.05 chance that we
would observe a correlation as strong as rin any random set of npoints.
This is not a particularly strong standard. Even small correlations become
significant at the 0.05 level with large enough sample sizes, as shown in
Figure 2.8 (right). A correlation of r= 0.1 becomes significant at α=
0.05 around n= 300, even though such a factor explains only 1% of the
Weak but significant correlations can have value in big data models involving
large numbers of features. Any single feature/correlation might explain/predict
only small effects, but taken together a large number of weak but independent
correlations may have strong predictive power. Maybe. We will discuss signifi-
cance again in greater detail in Section 5.3.
2.3.3 Correlation Does Not Imply Causation!
You have heard this before: correlation does not imply causation:
The number of police active in a precinct correlate strongly with the local
crime rate, but the police do not cause the crime.
The amount of medicine people take correlates with the probability they
are sick, but the medicine does not cause the illness.
At best, the implication works only one way. But many observed correlations
are completely spurious, with neither variable having any real impact on the
Still, correlation implies causation is a common error in thinking, even among
those who understand logical reasoning. Generally speaking, few statistical tools
are available to tease out whether Areally causes B. We can conduct controlled
experiments, if we can manipulate one of the variables and watch the effect on
Figure 2.10: Correlation does not imply causation. (Source https://www.xkcd.
Figure 2.11: Cyclic trends in a time series (left) are revealed through correlating
it against shifts of itself (right) .
the other. For example, the fact that we can put people on a diet that makes
them lose weight without getting shorter is convincing evidence that weight does
not cause height. But it is often harder to do these experiments the other way,
e.g. there is no reasonable way to make people shorter other than by hacking
off limbs.
2.3.4 Detecting Periodicities by Autocorrelation
Suppose a space alien was hired to analyze U.S. sales at a toy company. Instead
of a nice smooth function showing a consistent trend, they would be astonished
to see a giant bump every twelfth month, every year. This alien would have
discovered the phenomenon of Christmas.
Seasonal trends reflect cycles of a fixed duration, rising and falling in a reg-
ular pattern. Many human activities proceed with a seven-day cycle associated
with the work week. Large populations of a type of insect called a cicada emerge
on a 13-year or 17-year cycle, in an effort to prevent predators from learning to
eat them.
How can we recognize such cyclic patterns in a sequence S? Suppose we
correlate the values of Siwith Si+p, for all 1 inp. If the values are in sync
for a particular period length p, then this correlation with itself will be unusually
high relative to other possible lag values. Comparing a sequence to itself is called
an autocorrelation, and the series of correlations for all 1 kn1 is called
the autocorrelation function. Figure 2.11 presents a time series of daily sales,
and the associated autocorrelation function for this data. The peak at a shift of
seven days (and every multiple of seven days) establishes that there is a weekly
periodicity in sales: more stuff gets sold on weekends.
Autocorrelation is an important concept in predicting future events, because
it means we can use previous observations as features in a model. The heuristic
that tomorrow’s weather will be similar to today’s is based on autocorrelation,
with a lag of p= 1 days. Certainly we would expect such a model to be
more accurate than predictions made on weather data from six months ago (lag
p= 180 days).
Generally speaking, the autocorrelation function for many quantities tends
to be highest for very short lags. This is why long-term predictions are less accu-
rate than short-term forecasts: the autocorrelations are generally much weaker.
But periodic cycles do sometimes stretch much longer. Indeed, a weather fore-
cast based on a lag of p= 365 days will be much better than one of p= 180,
because of seasonal effects.
Computing the full autocorrelation function requires calculating n1 differ-
ent correlations on points of the time series, which can get expensive for large n.
Fortunately, there is an efficient algorithm based on the fast Fourier transform
(FFT), which makes it possible to construct the autocorrelation function even
for very long sequences.
2.4 Logarithms
The logarithm is the inverse exponential function y=bx, an equation that can
be rewritten as x= logby. This definition is the same as saying that
Exponential functions grow at a very fast rate: consider b={21,22,23,24, . . .}.
In contrast, logarithms grow a very slow rate: these are just the exponents of
the previous series {1,2,3,4, . . .}. They are associated with any process where
we are repeatedly multiplying by some value of b, or repeatedly dividing by b.
Just remember the definition:
y= logbxby=x.
Logarithms are very useful things, and arise often in data analysis. Here
I detail three important roles logarithms play in data science. Surprisingly,
only one of them is related to the seven algorithmic applications of logarithms
I present in The Algorithm Design Manual [Ski08]. Logarithms are indeed very
useful things.
2.4.1 Logarithms and Multiplying Probabilities
Logarithms were first invented as an aide to computation, by reducing the prob-
lem of multiplication to that of addition. In particular, to compute the product
p=x·y, we could compute the sum of the logarithms s= logbx+ logbyand
then take the inverse of the logarithm (i.e. raising bto the sth power) to get p,
This is the trick that powered the mechanical slide rules that geeks used in the
days before pocket calculators.
However, this idea remains important today, particularly when multiplying
long chains of probabilities. Probabilities are small numbers. Thus multiplying
long chains of probability yield very small numbers that govern the chances of
very rare events. There are serious numerical stability problems with floating
point multiplication on real computers. Numerical errors will creep in, and will
eventually overwhelm the true value of small-enough numbers.
Summing the logarithms of probabilities is much more numerically stable
than multiplying them, but yields an equivalent result because:
pi=bP, where P=
We can raise our sum to an exponential if we need the real probability, but
usually this is not necessary. When we just need to compare two probabilities
to decide which one is larger we can safely stay in log world, because bigger
logarithms correspond to bigger probabilities.
There is one quirk to be aware of. Recall that the log2(1
2) = 1. The
logarithms of probabilities are all negative numbers except for log(1) = 0. This
is the reason why equations with logs of probabilities often feature negative
signs in strange places. Be on the lookout for them.
2.4.2 Logarithms and Ratios
Ratios are quantities of the form a/b. They occur often in data sets either as
elementary features or values derived from feature pairs. Ratios naturally occur
in normalizing data for conditions (i.e. weight after some treatment over the
initial weight) or time (i.e. today’s price over yesterday’s price).
But ratios behave differently when reflecting increases than decreases. The
ratio 200/100 is 200% above baseline, but 100/200 is only 50% below despite
being a similar magnitude change. Thus doing things like averaging ratios is
committing a statistical sin. Do you really want a doubling followed by a halving
to average out as an increase, as opposed to a neutral change?
Figure 2.12: Plotting ratios on a scale cramps the space allocated to small ratios
relative to large ratios (left). Plotting the logarithms of ratios better represents
the underlying data (right).
One solution here would have been to use the geometric mean. But better is
taking the logarithm of these ratios, so that they yield equal displacement, since
log22 = 1 and log2(1/2) = 1. We get the extra bonus that a unit ratio maps
to zero, so positive and negative numbers correspond to improper and proper
ratios, respectively.
A rookie mistake my students often make involves plotting the value of ratios
instead of their logarithms. Figure 2.12 (left) is a graph from a student paper,
showing the ratio of new score over old score on data over 24 hours (each red
dot is the measurement for one hour) on four different data sets (each given a
row). The solid black line shows the ratio of one, where both scores give the
same result. Now try to read this graph: it isn’t easy because the points on the
left side of the line are cramped together in a narrow strip. What jumps out at
you are the outliers. Certainly the new algorithm does terrible on 7UM917 in
the top row: that point all the way to the right is a real outlier.
Except that it isn’t. Now look at Figure 2.12 (right), where we plot the
logarithms of the ratios. The space devoted to left and right of the black line
can now be equal. And it shows that this point wasn’t really such an outlier at
all. The magnitude of improvement of the leftmost points is much greater than
that of the rightmost points. This plot reveals that new algorithm generally
makes things better, only because we are showing logs of ratios instead of the
ratios themselves.
2.4.3 Logarithms and Normalizing Skewed Distributions
Variables which follow symmetric, bell-shaped distributions tend to be nice as
features in models. They show substantial variation, so they can be used to
discriminate between things, but not over such a wide range that outliers are
But not every distribution is symmetric. Consider the one in Figure 2.13
Figure 2.13: Hitting a skewed data distribution (left) with a log often yields a
more bell-shaped distribution (right).
(left). The tail on the right goes much further than the tail on the left. And
we are destined to see far more lopsided distributions when we discuss power
laws, in Section 5.1.5. Wealth is representative of such a distribution, where
the poorest human has zero or perhaps negative wealth, the average person
(optimistically) is in the thousands of dollars, and Bill Gates is pushing $100
billion as of this writing.
We need a normalization to convert such distributions into something easier
to deal with. To ring the bell of a power law distribution we need something
non-linear, that reduces large values to a disproportionate degree compared to
more modest values.
The logarithm is the transformation of choice for power law variables. Hit
your long-tailed distribution with a log and often good things happen. The
distribution in Figure 2.13 happened to be the log normal distribution, so taking
the logarithm yielded a perfect bell-curve on right. Taking the logarithm of
variables with a power law distribution brings them more in line with traditional
distributions. For example, as an upper-middle class professional, my wealth is
roughly the same number of logs from my starving students as I am from Bill
Sometimes taking the logarithm proves too drastic a hit, and a less dramatic
non-linear transformation like the square root works better to normalize a dis-
tribution. The acid test is to plot a frequency distribution of the transformed
values and see if it looks bell-shaped: grossly-symmetric, with a bulge in the
middle. That is when you know you have the right function.
2.5 War Story: Fitting Designer Genes
The word bioinformatician is life science speak for “data scientist,” the prac-
titioner of an emerging discipline which studies massive collections of DNA
sequence data looking for patterns. Sequence data is very interesting to work
with, and I have played bioinformatician in research projects since the very
beginnings of the human genome project.
DNA sequences are strings on the four letter alphabet {A, C, G, T }. Proteins
form the stuff that we are physically constructed from, and are composed of
strings of 20 different types of molecular units, called amino acids. Genes are
the DNA sequences which describe exactly how to make specific proteins, with
the units each described by a triplet of {A, C, G, T }s called codons.
For our purposes, it suffices to know that there are a huge number of possible
DNA sequences describing genes which could code for any particular desired
protein sequence. But only one of them is used. My biologist collaborators and
I wanted to know why.
Originally, it was assumed that all of these different synonymous encodings
were essentially identical, but statistics performed on sequence data made it
clear that certain codons are used more often than others. The biological con-
clusion is that “codons matter,” and there are good biological reasons why this
should be.
We became interested in whether “neighboring pairs of codon matter.” Per-
haps certain pairs of triples are like oil and water, and hate to mix. Certain
letter pairs in English have order preferences: you see the bigram gh far more
often than hg. Maybe this is true of DNA as well? If so, there would be pairs
of triples which should be underrepresented in DNA sequence data.
To test this, we needed a score comparing the number of times we actually
see a particular triple (say x=CAT ) next to another particular triple (say
y=GAG) to what we would expect by chance. Let F(xy) be the frequency
of xy, number of times we actually see codon xfollowed by codon yin the
DNA sequence database. These codons code for specific amino acids, say a
and brespectively. For amino acid a, the probability that it will be coded by
xis P(x) = F(x)/F (a), and similarly P(y) = F(y)/F (b). Then the expected
number of times of seeing xy is
Expected(xy) = F(x)
Based on this, we can compute a codon pair score for any given hexamer xy
as follows:
CP S(xy) = ln Observed(xy)
Expected(xy)= ln
Taking the logarithm of this ratio produced very nice properties. Most im-
portantly, the sign of the score distinguished over-represented pairs from under-
represented pairs. Because the magnitudes were symmetric (+1 was just as
impressive as 1) we could add or average these scores in a sensible way to give
a score for each gene. We used these scores to design genes that should be bad
for viruses, which gave an exciting new technology for making vaccines. See the
chapter notes (Section 2.6) for more details.
Figure 2.14: Patterns in DNA sequences with the lowest codon pair scores
become obvious on inspection. When interpreted in-frame, the stop symbol
TAG is substantially depleted (left). When interpreted in the other two frames,
the most avoided patterns are all very low complexity, like runs of a single base
Knowing that certain pairs of codons were bad did not explain why they were
bad. But by computing two related scores (details unimportant) and sorting
the triplets based on them, as shown in Figure 2.14, certain patterns popped
out. Do you notice the patterns? All the bad sequences on the left contain
T AG, which turns out to be a special codon that tells the gene to stop. And
all the bad sequences on the right consist of Cand Gin very simple repetitive
sequences. These explain biologically why patterns are avoided by evolution,
meaning we discovered something very meaningful about life.
There are two take-home lessons from this story. First, developing numerical
scoring functions which highlight specific aspects of items can be very useful
to reveal patterns. Indeed, Chapter 4 will focus on the development of such
systems. Second, hitting such quantities with a logarithm can make them even
more useful, enabling us to see the forest for the trees.
2.6 Chapter Notes
There are many excellent introductions to probability theory available, including
[Tij12, BT08]. The same goes for elementary statistics, with good introductory
texts including [JWHT13, Whe13]. The brief history of probability theory in
this chapter is based on Weaver [Wea82].
In its strongest form, the efficient market hypothesis states that the stock
market is essentially unpredictable using public information. My personal advice
is that you should invest in index funds that do not actively try to predict the
direction of the market. Malkiel’s A Random Walk Down Wall Street [Mal99]
is an excellent introduction to such investment thinking.
The Fast Fourier Transform (FFT) provides an O(nlog n) time algorithm to
compute the full autocorrelation function of an n-element sequence, where the
straightforward computation of ncorrelations takes O(n2). Bracewell [Bra99]
and Brigham [Bri88] are excellent introductions to Fourier transforms and the
FFT. See also the exposition in Press [PFTV07].
The comic strip in Figure 2.10 comes from Randall Munroe’s webcomic xkcd,
specifically, and is reprinted with permission.
The war story of Section 2.5 revolves around our work on how the phe-
nomenon of codon pair bias affects gene translation. Figure 2.14 comes from
my collaborator Justin Gardin. See [CPS+08, MCP+10, Ski12] for discussions
of how we exploited codon pair bias to design vaccines for viral diseases like
polio and the flu.
2.7 Exercises
2-1. [3] Suppose that 80% of people like peanut butter, 89% like jelly, and 78% like
both. Given that a randomly sampled person likes peanut butter, what is the
probability that she also likes jelly?
2-2. [3] Suppose that P(A) = 0.3 and P(B) = 0.7.
(a) Can you compute P(Aand B) if you only know P(A) and P(B)?
(b) Assuming that events Aand Barise from independent random processes:
What is P(Aand B)?
What is P(Aor B)?
What is P(A|B)?
2-3. [3] Consider a game where your score is the maximum value from two dice.
Compute the probability of each event from {1,...,6}.
2-4. [8] Prove that the cumulative distribution function of the maximum of a pair of
values drawn from random variable Xis the square of the original cumulative
distribution function of X.
2-5. [5] If two binary random variables Xand Yare independent, are ¯
X(the com-
plement of X) and Yalso independent? Give a proof or a counterexample.
2-6. [3] Compare each pair of distributions to decide which one has the greater
mean and the greater standard deviation. You do not need to calculate the
actual values of µand σ, just how they compare with each other.
(a) i. 3,5,5,5,8,11,11,11,13.
ii. 3,5,5,5,8,11,11,11,20.
(b) i. 20,0,0,0,15,25,30,30.
ii. 40,0,0,0,15,25,30,30.
(c) i. 0,2,4,6,8,10.
ii. 20,22,24,26,28,30.
(d) i. 100,200,300,400,500.
ii. 0,50,300,550,600.
2-7. [3] Construct a probability distribution where none of the mass lies within one
σof the mean.
2-8. [3] How does the arithmetic and geometric mean compare on random integers?
2-9. [3] Show that the arithmetic mean equals the geometric mean when all terms
are the same.
Correlation Analysis
2-10. [3] True or false: a correlation coefficient of 0.9 indicates a stronger linear
relationship than a correlation coefficient of 0.5. Explain why.
2-11. [3] What would be the correlation coefficient between the annual salaries of
college and high school graduates at a given company, if for each possible job
title the college graduates always made:
(a) $5,000 more than high school grads?
(b) 25% more than high school grads?
(c) 15% less than high school grads?
2-12. [3] What would be the correlation between the ages of husbands and wives if
men always married woman who were:
(a) Three years younger than themselves?
(b) Two years older than themselves?
(c) Half as old as themselves?
2-13. [5] Use data or literature found in a Google search to estimate/measure the
strength of the correlation between:
(a) Hits and walks scored for hitters in baseball.
(b) Hits and walks allowed by pitchers in baseball.
2-14. [5] Compute the Pearson and Spearman Rank correlations for uniformly drawn
samples of points (x, xk). How do these values change as a function of increasing
2-15. [3] Show that the logarithm of any number less than 1 is negative.
2-16. [3] Show that the logarithm of zero is undefined.
2-17. [5] Prove that
2-18. [5] Prove the correctness of the formula for changing a base-blogarithm to base-
a, that
loga(x) = logb(x)/logb(a).
Implementation Projects
2-19. [3] Find some interesting data sets, and compare how similar their means and
medians are. What are the distributions where the mean and median differ on
the most?
2-20. [3] Find some interesting data sets and search all pairs for interesting correla-
tions. Perhaps start with what is available at
data. What do you find?
Interview Questions
2-21. [3] What is the probability of getting exactly kheads on ntosses, where the
coin has a probability of pin coming up heads on each toss? What about kor
more heads?
2-22. [5] Suppose that the probability of getting a head on the ith toss of an ever-
changing coin is f(i). How would you efficiently compute the probability of
getting exactly kheads in ntosses?
2-23. [5] At halftime of a basketball game you are offered two possible challenges:
(a) Take three shots, and make at least two of them.
(b) Take eight shots, and make at least five of them.
Which challenge should you pick to have a better chance of winning the game?
2-24. [3] Tossing a coin ten times resulted in eight heads and two tails. How would
you analyze whether a coin is fair? What is the p-value?
2-25. [5] Given a stream of nnumbers, show how to select one uniformly at random
using only constant storage. What if you don’t know nin advance?
2-26. [5] Ak-streak starts at toss iin a sequence of ncoin flips when the outcome of the
ith flip and the next k1 flips are identical. For example, sequence HTTTHH
contains 2-streaks starting at the second, third, and fifth tosses. What are the
expected number of k-streaks that you will see in ntosses of a fair coin ?
2-27. [5] A person randomly types an eight-digit number into a pocket calculator.
What is the probability that the number looks the same even if the calculator
is turned upside down?
2-28. [3] You play a dice rolling game where you have two choices:
(a) Roll the dice once and get rewarded with a prize equal to the outcome
number (e.g, $3 for number “3”) and then stop the game.
(b) You can reject the first reward according to its outcome and roll the dice
a second time, and get rewarded in the same way.
Which strategy should you choose to maximize your reward? That is, for what
outcomes of the first roll should you chose to play the second game? What is
the statistical expectation of reward if you choose the second strategy?
2-29. [3] What is A/B testing and how does it work?
2-30. [3] What is the difference between statistical independence and correlation?
2-31. [3] We often say that correlation does not imply causation. What does this
2-32. [5] What is the difference between a skewed distribution and a uniform one?
Kaggle Challenges
2-33. Cause–effect pairs: correlation vs. causation.
2-34. Predict the next “random number” in a sequence.
2-35. Predict the fate of animals at a pet shelter.
Chapter 3
Data Munging
On two occasions I have been asked, “Pray, Mr. Babbage, if you put
into the machine wrong figures, will the right answers come out?”
. . . I am not able rightly to apprehend the kind of confusion of ideas
that could provoke such a question.
– Charles Babbage
Most data scientists spend much of their time cleaning and formatting data.
The rest spend most of their time complaining that there is no data available
to do what they want to do.
In this chapter, we will work through some of the basic mechanics of com-
puting with data. Not the high-faluting stuff like statistics or machine learning,
but the grunt work of finding data and cleaning it that goes under the moniker
of data munging.
While practical questions like “What is the best library or programming
language available?” are clearly important, the answers change so rapidly that
a book like this one is the wrong place to address them. So I will stick at the
level of general principles, instead of shaping this book around a particular set
of software tools. Still, we will discuss the landscape of available resources in
this chapter: why they exist, what they do, and how best to use them.
The first step in any data science project is getting your hands on the right
data. But this is often distressingly hard. This chapter will survey the richest
hunting grounds for data resources, and then introduce techniques for cleaning
what you kill. Wrangling your data so you that can safely analyze it is critical
for meaningful results. As Babbage himself might have said more concisely,
“garbage in, garbage out.”
3.1 Languages for Data Science
In theory, every sufficiently powerful programming language is capable of ex-
pressing any algorithm worth computing. But in practice, certain programming
languages prove much better than others at specific tasks. Better here might
denote easier for the programmer or perhaps more computationally efficient,
depending upon the mission at hand.
The primary data science programming languages to be aware of are:
Python: This is today’s bread-and-butter programming language for data
science. Python contains a variety of language features to make basic
data munging easier, like regular expressions. It is an interpreted lan-
guage, making the development process quicker and enjoyable. Python
is supported by an enormous variety of libraries, doing everything from
scraping to visualization to linear algebra and machine learning.
Perhaps the biggest strike against Python is efficiency: interpreted lan-
guages cannot compete with compiled ones for speed. But Python compil-
ers exist in a fashion, and support linking in efficient C/assembly language
libraries for computationally-intensive tasks. Bottom line, Python should
probably be your primary tool in working through the material we present
in this book.
Perl: This used to be the go to language for data munging on the web,
before Python ate it for lunch. In the TIOBE programming language pop-
ularity index (, Python first ex-
ceeded Perl in popularity in 2008 and hasn’t looked back. There are several
reasons for this, including stronger support for object-oriented program-
ming and better available libraries, but the bottom line is that there are
few good reasons to start projects in Perl at this point. Don’t be surprised
if you encounter it in some legacy project, however.
R: This is the programming language of statisticians, with the deepest
libraries available for data analysis and visualization. The data science
world is split between R and Python camps, with R perhaps more suit-
able for exploration and Python better for production use. The style of
interaction with R is somewhat of an acquired taste, so I encourage you
to play with it a bit to see whether it feels natural to you.
Linkages exist between R and Python, so you can conveniently call R
library functions in Python code. This provides access to advanced statis-
tical methods, which may not be supported by the native Python libraries.
Matlab: The Mat here stands for matrix, as Matlab is a language de-
signed for the fast and efficient manipulation of matrices. As we will see,
many machine learning algorithms reduce to operations on matrices, mak-
ing Matlab a natural choice for engineers programming at a high-level of
Matlab is a proprietary system. However, much of its functionality is
available in GNU Octave, an open-source alternative.
Java and C/C++: These mainstream programming languages for the
development of large systems are important in big data applications. Par-
allel processing systems like Hadoop and Spark are based on Java and
C++, respectively. If you are living in the world of distributed comput-
ing, then you are living in a world of Java and C++ instead of the other
languages listed here.
Mathematica/Wolfram Alpha: Mathematica is a proprietary system pro-
viding computational support for all aspects of numerical and symbolic
mathematics, built upon the less proprietary Wolfram programming lan-
guage. It is the foundation of the Wolfram Alpha computational knowl-
edge engine, which processes natural language-like queries through a mix
of algorithms and pre-digested data sources. Check it out at http://www.
I will confess a warm spot for Mathematica.1It is what I tend to reach
for when I am doing a small data analysis or simulation, but cost has
traditionally put it out of the range of many users. The release of the
Wolfram language perhaps now opens it up to a wider community.
Excel: Spreadsheet programs like Excel are powerful tools for exploratory
data analysis, such as playing with a given data set to see what it contains.
They deserve our respect for such applications.
Full featured spreadsheet programs contain a surprising amount of hidden
functionality for power users. A student of mine who rose to become a
Microsoft executive told me that 25% of all new feature requests for Excel
proposed functionality already present there. The special functions and
data manipulation features you want probably are in Excel if you look
hard enough, in the same way that a Python library for what you need
probably will be found if you search for it.
3.1.1 The Importance of Notebook Environments
The primary deliverable for a data science project should not be a program. It
should not be a data set. It should not be the results of running the program
on your data. It should not just be a written report.
The deliverable result of every data science project should be a computable
notebook tying together the code, data, computational results, and written
analysis of what you have learned in the process. Figure 3.1 presents an excerpt
from a Jupyter/IPython notebook, showing how it integrates code, graphics,
and documentation into a descriptive document which can be executed like a
The reason this is so important is that computational results are the product
of long chains of parameter selections and design decisions. This creates several
problems that are solved by notebook computing environments:
1Full disclosure: I have known Stephen Wolfram for over thirty years. Indeed, we invented
the iPad together [Bar10, MOR+88].
Figure 3.1: Jupyter/IPython notebooks tie together code, computational re-
sults, and documentation.
Computations need to be reproducible. We must be able to run the same
programs again from scratch, and get exactly the same result. This means
that data pipelines must be complete: taking raw input and producing the
final output. It is terrible karma to start with a raw data set, do some
processing, edit/format the data files by hand, and then do some more
processing – because what you did by hand cannot be readily done again
on another data set, or undone after you realize that you may have goofed
Computations must be tweakable. Often reconsideration or evaluation will
prompt a change to one or more parameters or algorithms. This requires
rerunning the notebook to produce the new computation. There is nothing
more disheartening to be given a big data product without provenance and
told that this is the final result and you can’t change anything. A notebook
is never finished until after the entire project is done.
Data pipelines need to be documented. That notebooks permit you to
integrate text and visualizations with your code provides a powerful way
to communicate what you are doing and why, in ways that traditional
programming environments cannot match.
Take-Home Lesson: Use a notebook environment like IPython or Mathematica
to build and report the results of any data science project.
3.1.2 Standard Data Formats
Data comes from all sorts of places, and in all kinds of formats. Which represen-
tation is best depends upon who the ultimate consumer is. Charts and graphs
are marvelous ways to convey the meaning of numerical data to people. Indeed,
Chapter 6 will focus on techniques for visualizing data. But these pictures are
essentially useless as a source of data to compute with. There is a long way
from printed maps to Google Maps.
The best computational data formats have several useful properties:
They are easy for computers to parse: Data written in a useful format is
destined to be used again, elsewhere. Sophisticated data formats are often
supported by APIs that govern technical details ensuring proper format.
They are easy for people to read: Eyeballing data is an essential operation
in many contexts. Which of the data files in this directory is the right one
for me to use? What do we know about the data fields in this file? What
is the gross range of values for each particular field?
These use cases speak to the enormous value of being able to open a data
file in a text editor to look at it. Typically, this means presenting the
data in a human-readable text-encoded format, with records demarcated
by separate lines, and fields separated by delimiting symbols.
They are widely used by other tools and systems: The urge to invent
proprietary data standard beats firmly in the corporate heart, and most
software developers would rather share a toothbrush than a file format.
But these are impulses to be avoided. The power of data comes from
mixing and matching it with other data resources, which is best facilitated
by using popular standard formats.
One property I have omitted from this list is conciseness, since it is generally
not a primary concern for most applications running on modern computing
systems. The quest to minimize data storage costs often works against other
goals. Cleverly packing multiple fields into the higher-order bits of integers saves
space, but at the cost of making it incompatible and unreadable.
General compression utilities like gzip prove amazingly good at removing the
redundancy of human-friendly formatting. Disk prices are unbelievably cheap:
as I write this you can buy a 4TB drive for about $100, meaning less than the
cost of one hour of developer time wasted programming a tighter format. Unless
you are operating at the scale of Facebook or Google, conciseness does not have
nearly the importance you are liable to think it does.2
The most important data formats/representations to be aware of are dis-
cussed below:
CSV (comma separated value) files: These files provide the simplest, most
popular format to exchange data between programs. That each line repre-
sents a single record, with fields separated by commas, is obvious from in-
spection. But subtleties revolve around special characters and text strings:
what if your data about names contains a comma, like “Thurston Howell,
Jr.” The csv format provides ways to escape code such characters so they
are not treated as delimiters, but it is messy. A better alternative is to
use a rarer delimiter character, as in tsv or tab separated value files.
The best test of whether your csv file is properly formatted is whether
Microsoft Excel or some other spreadsheet program can read it without
hassle. Make sure the results of every project pass this test as soon as the
first csv file has been written, to avoid pain later.
XML (eXtensible Markup Language): Structured but non-tabular data
are often written as text with annotations. The natural output of a
named-entity tagger for text wraps the relevant substrings of a text in
brackets denoting person, place, or thing. I am writing this book in La-
Tex, a formatting language with bracketing commands positioned around
mathematical expressions and italicized text. All webpages are written in
HTML, the hypertext markup language which organizes documents using
bracketing commands like <b> and </b> to enclose bold faced text.
XML is a language for writing specifications of such markup languages. A
proper XML specification enables the user to parse any document comply-
ing with the specification. Designing such specifications and fully adhering
2Indeed, my friends at Google assure me that they are often slovenly about space even at
the petabyte scale.
to them requires discipline, but is worthwhile. In the first version of our
Lydia text analysis system, we wrote our markups in a “pseudo-XML,”
read by ad hoc parsers that handled 99% of the documents correctly but
broke whenever we tried to extend them. After a painful switch to XML,
everything worked more reliably and more efficiently, because we could
deploy fast, open-source XML parsers to handle all the dirty work of en-
forcing our specifications.
SQL (structured query language) databases: Spreadsheets are naturally
structured around single tables of data. In contrast, relational databases
prove excellent for manipulating multiple distinct but related tables, using
SQL to provide a clunky but powerful query language.
Any reasonable database system imports and exports records as either csv
or XML files, as well as an internal content dump. The internal represen-
tation in databases is opaque, so it really isn’t accurate to describe them
as a data format. Still, I emphasize them here because SQL databases
generally prove a better and more powerful solution than manipulating
multiple data files in an ad hoc manner.
JSON (JavaScript Object Notation): This is a format for transmitting data
objects between programs. It is a natural way to communicate the state of
variables/data structures from one system to another. This representation
is basically a list of attribute-value pairs corresponding to variable/field
names, and the associated values:
{"firstName":"John", "lastName":"Doe"},
{"firstName":"Anna", "lastName":"Smith"},
{"firstName":"Peter", "lastName":"Jones"}
Because library functions that support reading and writing JSON objects
are readily available in all modern programming languages, it has become
a very convenient way to store data structures for later use. JSON objects
are human readable, but are quite cluttered-looking, representing arrays of
records compared to CSV files. Use them for complex structured objects,
but not simple tables of data.
Protocol buffers: These are a language/platform-neutral way of serializing
structured data for communications and storage across applications. They
are essentially lighter weight versions of XML (where you define the format
of your structured data), designed to communicate small amounts of data
across programs like JSON. This data format is used for much of the inter-
machine communication at Google. Apache Thrift is a related standard,
used at Facebook.
3.2 Collecting Data
The most critical issue in any data science or modeling project is finding the
right data set. Identifying viable data sources is an art, one that revolves around
three basic questions:
Who might actually have the data I need?
Why might they decide to make it available to me?
How can I get my hands on it?
In this section, we will explore the answers to these questions. We look at
common sources of data, and what you are likely to be able to find and why.
We then review the primary mechanisms for getting access, including APIs,
scraping, and logging.
3.2.1 Hunting
Who has the data, and how can you get it? Some of the likely suspects are
reviewed below.
Companies and Proprietary Data Sources
Large companies like Facebook, Google, Amazon, American Express, and Blue
Cross have amazing amounts of exciting data about users and transactions,
data which could be used to improve how the world works. The problem is that
getting outside access is usually impossible. Companies are reluctant to share
data for two good reasons:
Business issues, and the fear of helping their competition.
Privacy issues, and the fear of offending their customers.
A heartwarming tale of what can happen with corporate data release oc-
curred when AOL provided academics with a data set of millions of queries to
its search engine, carefully stripped of identifying information. The first thing
the academics discovered was that the most frequently-entered queries were des-
perate attempts to escape to other search engines like Google. This did nothing
to increase public confidence in the quality of AOL search.
Their second discovery was that it proved much harder to anonymize search
queries than had previously been suspected. Sure you can replace user names
with id numbers, but it is not that hard to figure out who the guy on Long Island
repeatedly querying Steven Skiena,Stony Brook, and
search?q=Skiena&src=sprv is. Indeed, as soon as it became publicized that
people’s identities had been revealed by this data release, the responsible party
was fired and the data set disappeared. User privacy is important, and ethical
issues around data science will be discussed in Section 12.7.
So don’t think you are going to sweet talk companies into releasing confiden-
tial user data. However, many responsible companies like The New York Times,
Twitter, Facebook, and Google do release certain data, typically by rate-limited
application program interfaces (APIs). They generally have two motives:
Providing customers and third parties with data that can increase sales.
For example, releasing data about query frequency and ad pricing can
encourage more people to place ads on a given platform.
It is generally better for the company to provide well-behaved APIs than
having cowboys repeatedly hammer and scrape their site.
So hunt for a public API before reading Section 3.2.2 on scraping. You won’t
find exactly the content or volume that you dream of, but probably something
that will suffice to get started. Be aware of limits and terms of use.
Other organizations do provide bulk downloads of interesting data for off-
line analysis, as with the Google Ngrams, IMDb, and the taxi fare data sets
discussed in Chapter 1. Large data sets often come with valuable metadata,
such as book titles, image captions, and edit history, which can be re-purposed
with proper imagination.
Finally, most organizations have internal data sets of relevance to their busi-
ness. As an employee, you should be able to get privileged access while you
work there. Be aware that companies have internal data access policies, so you
will still be subject to certain restrictions. Violating the terms of these policies
is an excellent way to become an ex-employee.
Government Data Sources
Collecting data is one of the important things that governments do. Indeed,
the requirement that the United States conduct a census of its population is
mandated by our constitution, and has been running on schedule every ten
years since 1790.
City, state, and federal governments have become increasingly committed
to open data, to facilitate novel applications and improve how government can
fulfill its mission. The website is an initiative by the federal
government to centrally collect its data sources, and at last count points to over
100,000 data sets!
Government data differs from industrial data in that, in principle, it belongs
to the People. The Freedom of Information Act (FOI) enables any citizen to
make a formal request for any government document or data set. Such a request
triggers a process to determine what can be released without compromising the
national interest or violating privacy.
State governments operate under fifty different sets of laws, so data that is
tightly held in one jurisdiction may be freely available in others. Major cities
like New York have larger data processing operations than many states, again
with restrictions that vary by location.
I recommend the following way of thinking about government records. If
you cannot find what you need online after some snooping around, figure out
which agency is likely to have it. Make a friendly call to them to see if they
can help you find what you want. But if they stonewall you, feel free to try for
a FOI act request. Preserving privacy is typically the biggest issue in deciding
whether a particular government data set can be released.
Academic Data Sets
There is a vast world of academic scholarship, covering all that humanity has
deemed worth knowing. An increasing fraction of academic research involves
the creation of large data sets. Many journals now require making source data
available to other researchers prior to publication. Expect to be able to find
vast amounts of economic, medical, demographic, historical, and scientific data
if you look hard enough.
The key to finding these data sets is to track down the relevant papers.
There is an academic literature on just about any topic of interest. Google
Scholar is the most accessible source of research publications. Search by topic,
and perhaps “Open Science” or “data.” Research publications will typically
provide pointers to where its associated data can be found. If not, contacting
the author directly with a request should quickly yield the desired result.
The biggest catch with using published data sets is that someone else has
worked hard to analyze them before you got to them, so these previously mined
sources may have been sucked dry of interesting new results. But bringing fresh
questions to old data generally opens new possibilities.
Often interesting data science projects involve collaborations between re-
searchers from different disciplines, such as the social and natural sciences.
These people speak different languages than you do, and may seem intimidating
at first. But they often welcome collaboration, and once you get past the jargon
it is usually possible to understand their issues on a reasonable level without
specialized study. Be assured that people from other disciplines are generally
not any smarter than you are.
Sweat Equity
Sometimes you will have to work for your data, instead of just taking it from
others. Much historical data still exists only in books or other paper documents,
thus requiring manual entry and curation. A graph or table might contain
information that we need, but it can be hard to get numbers from a graphic
locked in a PDF (portable document format) file.
I have observed that computationally-oriented people vastly over-estimate
the amount of effort it takes to do manual data entry. At one record per minute,
you can easily enter 1,000 records in only two work days. Instead, computational
people tend to devote massive efforts trying to avoid such grunt work, like
hunting in vain for optical character recognition (OCR) systems that don’t make
a mess of the file, or spending more time cleaning up a noisy scan than it would
take to just type it in again fresh.
A middle ground here comes in paying someone else to do the dirty work for
you. Crowdsourcing platforms like Amazon Turk and CrowdFlower enable you
to pay for armies of people to help you extract data, or even collect it in the
first place. Tasks requiring human annotation like labeling images or answering
surveys are particularly good use of remote workers. Crowdsourcing will be
discussed in greater detail in Section 3.5.
Many amazing open data resources have been built up by teams of contrib-
utors, like Wikipedia, Freebase, and IMDb. But there is an important concept
to remember: people generally work better when you pay them.
3.2.2 Scraping
Webpages often contain valuable text and numerical data, which we would like
to get our hands on. For example, in our project to build a gambling system
for the sport of jai-alai, we needed to feed our system the results of yesterday’s
matches and the schedule of what games were going on today. Our solution
was to scrape the websites of jai-alai betting establishments, which posted this
information for their fans.
There are two distinct steps to make this happen, spidering and scraping:
Spidering is the process of downloading the right set of pages for analysis.
Scraping is the fine art of stripping this content from each page to prepare
it for computational analysis.
The first thing to realize is that webpages are generally written in simple-to-
understand formatting languages like HTML and/or JavaScript. Your browser
knows these languages, and interprets the text of the webpage as a program
to specify what to display. By calling a function that emulates/pretends to
be a web browser, your program can download any webpage and interpret the
contents for analysis.
Traditionally, scraping programs were site-specific scripts hacked up to look
for particular HTML patterns flanking the content of interest. This exploited the
fact that large numbers of pages on specific websites are generated by programs
themselves, and hence highly predictable in their format. But such scripts tend
to be ugly and brittle, breaking whenever the target website tinkers with the
internal structure of its pages.
Today, libraries in languages like Python (see BeautifulSoup) make it easier
to write robust spiders and scrapers. Indeed, someone else probably has already
written a spider/scraper for every popular website and made it available on
SourceForge or Github, so search before you code.
Certain spidering missions may be trivial, for example, hitting a single URL
(uniform resource locator) at regular time intervals. Such patterns occur in mon-
itoring, say, the sales rank of this book from its Amazon page. Somewhat more
sophisticated approaches to spidering are based on the name regularity of the
underlying URLs. If all the pages on a site are specified by the date or product
ID number, for example,
iterating through the entire range of interesting values becomes just a matter
of counting.
The most advanced form of spidering is web crawling, where you systemat-
ically traverse all outgoing links from a given root page, continuing recursively
until you have visited every page on the target website. This is what Google does
in indexing the web. You can do it too, with enough patience and easy-to-find
web crawling libraries in Python.
Please understand that politeness limits how rapidly you should spider/crawl
a given website. It is considered bad form to hit a site more than once a second,
and indeed best practices dictate that providers block access to the people who
are hammering them.
Every major website contains a terms of service document that restricts what
you can legally do with any associated data. Generally speaking, most sites will
leave you alone provided you don’t hammer them, and do not redistribute any
data you scrape. Understand that this is an observation, not a legal opinion.
Indeed, read about the Aaron Schwartz case, where a well-known Internet fig-
ure was brought up on serious criminal charges for violating terms of services
in spidering/scraping journal articles, and literally hounded to death. If you
are attempting a web-scraping project professionally, be sure that management
understands the terms of service before you get too creative with someone else’s
3.2.3 Logging
If you own a potential data source, treat it like you own it. Internal access to
a web service, communications device, or laboratory instrument grants you the
right and responsibility to log all activity for downstream analysis.
Amazing things can be done with ambient data collection from weblogs
and sensing devices, soon destined to explode with the coming “Internet of
Things.” The accelerometers in cell phones can be used to measure the strength
of earthquakes, with the correlation of events within a region sufficient to filter
out people driving on bumpy roads or leaving their phones in a clothes dryer.
Monitoring the GPS data of a fleet of taxi cabs tracks traffic congestion on
city streets. Computational analysis of image and video streams opens the
door to countless applications. Another cool idea is to use cameras as weather
instruments, by looking at the color of the sky in the background of the millions
of photographs uploaded to photo sites daily.
The primary reason to instrument your system to collect data is because
you can. You might not know exactly what to do with it now, but any well-
constructed data set is likely to become of value once it hits a certain critical
mass of size.
Current storage costs make clear just how low a barrier it is to instrument a
system. My local Costco is currently selling three terabyte disk drive for under
$100, which is Big O of nothing. If each transaction record takes 1 kilobyte (one
thousand characters), this device in principle has room for 3 billion records,
roughly one for every two people on earth.
The important considerations in designing any logging system are:
Build it to endure with limited maintenance. Set it and forget it, by
provisioning it with enough storage for unlimited expansion, and a backup.
Store all fields of possible value, without going crazy.
Use a human-readable format or transactions database, so you can un-
derstand exactly what is in there when the time comes, months or years
later, to sit down and analyze your data.
3.3 Cleaning Data
“Garbage in, garbage out” is the fundamental principle of data analysis. The
road from raw data to a clean, analyzable data set can be a long one.
Many potential issues can arise in cleaning data for analysis. In this section,
we discuss identifying processing artifacts and integrating diverse data sets. Our
focus here is the processing before we do our real analysis, to make sure that
the garbage never gets in in the first place.
Take-Home Lesson: Savvy painting restorers only do things to the original that
are reversible. They never do harm. Similarly, data cleaning is always done
on a copy of the original data, ideally by a pipeline that makes changes in a
systematic and repeatable way.
3.3.1 Errors vs. Artifacts
Under ancient Jewish law, if a suspect on trial was unanimously found guilty
by all judges, then this suspect would be acquitted. The judges had noticed
that unanimous agreement often indicates the presence of a systemic error in
the judicial process. They reasoned that when something seems too good to be
true, a mistake has likely been made somewhere.
If we view data items as measurements about some aspect of the world,
data errors represent information that is fundamentally lost in acquisition. The
Gaussian noise blurring the resolution of our sensors represents error, precision
which has been permanently lost. The two hours of missing logs because the
server crashed represents data error: it is information which cannot be recon-
structed again.
By contrast, artifacts are generally systematic problems arising from pro-
cessing done to the raw information it was constructed from. The good news is
that processing artifacts can be corrected, so long as the original raw data set
remains available. The bad news is that these artifacts must be detected before
they can be corrected.
Figure 3.2: What artifacts can you find in this time series, counting the number
of author’s names first appearing in the scientific literature each year?
The key to detecting processing artifacts is the “sniff test,” examining the
product closely enough to get a whiff of something bad. Something bad is usually
something unexpected or surprising, because people are naturally optimists.
Surprising observations are what data scientists live for. Indeed, such insights
are the primary reason we do what we do. But in my experience, most surprises
turn out to be artifacts, so we must look at them skeptically.
Figure 3.2 presents computational results from a project where we investi-
gated the process of scientific publication. It shows a time series of the 100,000
most prolific authors, binned according to the year of their first paper appearing
in Pubmed, an essentially complete bibliography of the biomedical literature.
Study this figure closely, and see if you can discover any artifacts worth
commenting on. I see at least two of them. Extra credit will be awarded if you
can figure out what caused the problem.
The key to finding artifacts is to look for anomalies in the data, that con-
tradict what you expect to see. What should the distribution in the number of
virgin authors look like, and how should it change over time? First, construct
a prior distribution of what you expect to see, so that you can then properly
evaluate potential anomalies against it.
My intuition says that the distribution of new top scientists should be pretty
flat, because new stars are born with every successive class of graduate students.
I would also guess that there may be a gradual drift upward as population
expands, and more people enter the scientific community. But that’s not what
I see in Figure 3.2. So try to enumerate what the anomalies/potential artifacts
are. . .
I see two big bumps when I look at Figure 3.2: a left bump starting around
Figure 3.3: The cleaned data removes these artifacts, and the resulting distri-
bution looks correct.
1965, and a peak which explodes in 2002. On reflection, the leftmost bump
makes sense. This left peak occurs the year when Pubmed first started to
systematically collect bibliographic records. Although there is some very in-
complete data from 1960–1964, most older scientists who had been publishing
papers for several years would “emerge” only with the start of systematic records
in 1965. So this explains the left peak, which then settles down by 1970 to what
looks like the flat distribution we expected.
But what about that giant 2002 peak? And the decline in new authors to
almost zero in the years which precede it? A similar decline is also visible to
the right of the big peak. Were all the world’s major scientists destined to be
born in 2002?
A careful inspection of the records in the big peak revealed the source of
the anomaly: first names. In the early days of Pubmed, authors were identified
by their initials and last names. But late in 2001, SS Skiena became Steven S.
Skiena, so it looked like a new author emerging from the heavens.
But why the declines to nothingness to the left and right of this peak? Recall
that we limited this study to the 100,000 most prolific scientists. A scientific
rock star emerging in 1998 would be unlikely to appear in this ranking because
their name was doomed to change a few years later, not leaving enough time
to accumulate a full career of papers. Similar things happen at the very right
of the distribution: newly created scientists in 2010 would never be able to
achieve a full career’s work in only a couple of years. Both phenomena are
neatly explained by this first name basis.
Cleaning this data to unify name references took us a few iterations to get
right. Even after eliminating the 2002 peak, we still saw a substantial dip in
prominent scientists starting their careers in the mid 1990s. This was because
many people who had a great half career pre-first names and a second great
half career post-first names did not rise to the threshold of a great full career
in either single period. Thus we had to match all the names in the full before
identifying who were the top 100,000 scientists.
Figure 3.3 shows our final distribution of authors, which matches the pla-
tonic ideal of what we expected the distribution to be. Don’t be too quick to
rationalize away how your data looks coming out of the computer. My collabo-
rators were at one point ready to write off the 2002 bump as due to increases in
research funding or the creation of new scientific journals. Always be suspicious
of whether your data is clean enough to trust.
3.3.2 Data Compatibility
We say that a comparison of two items is “apples to apples” when it is fair com-
parison, that the items involved are similar enough that they can be meaning-
fully stood up against each other. In contrast, “apples to oranges” comparisons
are ultimately meaningless. For example:
It makes no sense to compare weights of 123.5 against 78.9, when one is
in pounds and the other is in kilograms.
It makes no sense to directly compare the movie gross of Gone with the
Wind against that of Avatar, because 1939 dollars are 15.43 times more
valuable than 2009 dollars.
It makes no sense to compare the price of gold at noon today in New York
and London, because the time zones are five hours off, and the prices
affected by intervening events.
It makes no sense to compare the stock price of Microsoft on February 17,
2003 to that of February 18, 2003, because the intervening 2-for-1 stock
split cut the price in half, but reflects no change in real value.
These types of data comparability issues arise whenever data sets are merged.
Here I hope to show you how insidious such comparability issues can be, to
sensitize you as to why you need to be aware of them. Further, for certain
important classes of conversions I point to ways to deal with them.
Take-Home Lesson: Review the meaning of each of the fields in any data set
you work with. If you do not understand what’s in there down to the units of
measurement, there is no sensible way you can use it.
Unit Conversions
Quantifying observations in physical systems requires standard units of measure-
ment. Unfortunately there exist many functionally equivalent but incompatible
systems of measurement. My 12-year old daughter and I both weigh about 70,
but one of us is in pounds and the other in kilograms.
Disastrous things like rocket explosions happen when measurements are en-
tered into computer systems using the wrong units of measurement. In par-
ticular, NASA lost the $125 million Mars Climate Orbiter space mission on
September 23, 1999 due to a metric-to-English conversion issue.
Such problems are best addressed by selecting a single system of measure-
ments and sticking to it. The metric system offers several advantages over the
traditional English system. In particular, individual measurements are naturally
expressed as single decimal quantities (like 3.28 meters) instead of incomparable
pairs of quantities (5 feet, 8 inches). This same issue arises in measuring angles
(radians vs. degrees/seconds) and weight (kilograms vs. pounds/oz).
Sticking to the metric system does not by itself solve all comparability is-
sues, since there is nothing to prevent you from mixing heights in meters and
centimeters. But it is a good start.
How can you defend yourself against incompatible units when merging data
sets? Vigilance has to be your main weapon. Make sure that you know the
intended units for each numerical column in your data set, and verify compat-
ibility when merging. Any column which does not have an associated unit or
object type should immediately be suspect.
When merging records from diverse sources, it is an excellent practice to
create a new “origin” or “source” field to identify where each record came from.
This provides at least the hope that unit conversion mistakes can be corrected
later, by systematically operating on the records from the problematic source.
A partially-automated procedure to detect such problems can be devised
from statistical significance testing, to be discussed in Section 5.3. Suppose we
were to plot the frequencies of human heights in a merged data set of English
(feet) and metric (meter) measurements. We would see one peak in the distri-
bution around 1.8 and a second around 5.5. The existence of multiple peaks in a
distribution should make us suspicious. The p-value resulting from significance
testing on the two input populations provides a rigorous measurement of the
degree to which our suspicions are validated.
Numerical Representation Conversions
Numerical features are the easiest to incorporate into mathematical models.
Indeed, certain machine learning algorithms such as linear regression and sup-
port vector machines work only with numerically-coded data. But even turning
numbers into numbers can be a subtle problem. Numerical fields might be rep-
resented in different ways: as integers (123), as decimals (123.5), or even as
fractions (123 1/2). Numbers can even be represented as text, requiring the
conversion from “ten million” to 10000000 for numerical processing.
Numerical representation issues can take credit for destroying another rocket
ship. An Ariane 5 rocket launched at a cost of $500 million on June 4, 1996
exploded forty seconds after lift-off, with the cause ultimately ascribed to an
unsuccessful conversion of a 64-bit floating point number to a 16-bit integer.
The distinction between integers and floating point (real) numbers is impor-
tant to maintain. Integers are counting numbers: quantities which are really
discrete should be represented as integers. Physically measured quantities are
never precisely quantified, because we live in a continuous world. Thus all mea-
surements should be reported as real numbers. Integer approximations of real
numbers are sometimes used in a misbegotten attempt to save space. Don’t do
this: the quantification effects of rounding or truncation introduces artifacts.
In one particularly clumsy data set we encountered, baby weights were rep-
resented as two integer fields (pounds and the remaining ounces). Much better
would have been to combine them into a single decimal quantity.
Name Unification
Integrating records from two distinct data sets requires them to share a common
key field. Names are frequently used as key fields, but they are often reported
inconsistently. Is Jos´e the same fellow as Jose? Such diacritic marks are banned
from the official birth records of several U.S. states, in an aggressive attempt to
force them to be consistent.
As another case in point, databases show my publications as authored by the
Cartesian product of my first (Steve,Steven, or S.), middle (Sol,S., or blank),
and last (Skiena) names, allowing for nine different variations. And things get
worse if we include misspellings. I can find myself on Google with a first name
of Stephen and last names of Skienna and Skeina.
Unifying records by key is a very ugly problem, which doesn’t have a magic
bullet. This is exactly why ID numbers were invented, so use them as keys if
you possibly can.
The best general technique is unification: doing simple text transformations
to reduce each name to a single canonical version. Converting all strings to
lower case increases the number of (usually correct) collisions. Eliminating
middle names or at least reducing them to an abbreviation creates even more
name matches/collisions, as does mapping first names to canonical versions (like
turning all Steves into Stevens).
Any such transformation runs the risk of creating Frankenstein-people, single
records assembled from multiple bodies. Applications differ in whether the
greater danger lies in merging too aggressively or too timidly. Figure out where
your task sits on this spectrum and act accordingly.
An important concern in merging data sets is character code unification.
Characters in text strings are assigned numerical representations, with the map-
ping between symbols and number governed by the character code standard.
Unfortunately, there are several different character code standards in common
usage, meaning that what you scrape from a webpage might not be in the same
character code as assumed by the system which will process it.
Historically, the good old 7-bit ASCII code standard was expanded to the
8-bit IS0 8859-1 Latin alphabet code, which adds characters and punctuation
marks from several European languages. UTF-8 is an encoding of all Unicode
characters using variable numbers of 8-bit blocks, which is backwards compatible
with ASCII. It is the dominant encoding for web-pages, although other systems
remain in use.
Correctly unifying character codes after merging is pretty much impossible.
You must have the discipline to pick a single code as a standard, and check the
encoding of each input file on preprocessing, converting it to the target before
further work.
Time/Date Unification
Data/time stamps are used to infer the relative order of events, and group events
by relative simultaneity. Integrating event data from multiple sources requires
careful cleaning to ensure meaningful results.
First let us consider issues in measuring time. The clocks from two computers
never exactly agree, so precisely aligning logs from different systems requires a
mix of work and guesswork. There are also time zone issues when dealing with
data from different regions, as well as diversities in local rules governing changes
in daylight saving time.
The right answer here is to align all time measurements to Coordinated Uni-
versal Time (UTC), a modern standard subsuming the traditional Greenwich
Mean Time (GMT). A related standard is UNIX time, which reports an event’s
precise time in terms of the number of elapsed seconds since 00:00:00 UTC on
Thursday, January 1, 1970.
The Gregorian calendar is common throughout the technology world, al-
though many other calendar systems are in use in different countries. Subtle
algorithms must be used to convert between calendar systems, as described in
[RD01]. A bigger problem for date alignment concerns the proper interpretation
of time zones and the international date line.
Time series unification is often complicated by the nature of the business
calendar. Financial markets are closed on weekends and holidays, making for
questions of interpretation when you are correlating, say, stock prices to local
temperature. What is the right moment over the weekend to measure tempera-
ture, so as to be consistent with other days of the week? Languages like Python
contain extensive libraries to deal with financial time series data to get issues
like this correct. Similar issues arise with monthly data, because months (and
even years) have different lengths.
Financial Unification
Money makes the world go round, which is why so many data science projects
revolve around financial time series. But money can be dirty, so this data
requires cleaning.
One issue here is currency conversion, representing international prices using
a standardized financial unit. Currency exchange rates can vary by a few percent
within a given day, so certain applications require time-sensitive conversions.
Conversion rates are not truly standardized. Different markets will each have
different rates and spreads, the gap between buying and selling prices that cover
the cost of conversion.
The other important correction is for inflation. The time value of money
implies that a dollar today is (generally) more valuable than a dollar a year
from now, with interest rates providing the right way to discount future dollars.
Inflation rates are estimated by tracking price changes over baskets of items,
and provide a way to standardize the purchasing power of a dollar over time.
Using unadjusted prices in a model over non-trivial periods of time is just
begging for trouble. A group of my students once got very excited by the
strong correlation observed between stock prices and oil prices over a thirty-
year period, and so tried to use stock prices in a commodity prediction model.
But both goods were priced in dollars, without any adjustment as they inflated.
The time series of prices of essentially any pair of items will correlate strongly
over time when you do not correct for inflation.
In fact, the most meaningful way to represent price changes over time is
probably not differences but returns, which normalize the difference by the initial
ri=pi+1 pi
This is more analogous to a percentage change, with the advantage here that
taking the logarithm of this ratio becomes symmetric to gains and losses.
Financial time series contain many other subtleties which require cleaning.
Many stocks give scheduled dividends to the shareholder on a particular date
every year. Say, for example, that Microsoft will pay a $2.50 dividend on Jan-
uary 16. If you own a share of Microsoft at the start of business that day, you
receive this check, so the value of the share then immediately drops by $2.50
the moment after the dividend is issued. This price decline reflects no real loss
to the shareholder, but properly cleaned data needs to factor the dividend into
the price of the stock. It is easy to imagine a model trained on uncorrected
price data learning to sell stocks just prior to its issuing dividends, and feeling
unjustly proud of itself for doing so.
3.3.3 Dealing with Missing Values
Not all data sets are complete. An important aspect of data cleaning is iden-
tifying fields for which data isn’t there, and then properly compensating for
What is the year of death of a living person?
What should you do with a survey question left blank, or filled with an
obviously outlandish value?
What is the relative frequency of events too rare to see in a limited-size
Numerical data sets expect a value for every element in a matrix. Setting
missing values to zero is tempting, but generally wrong, because there is always
some ambiguity as to whether these values should be interpreted as data or not.
Is someone’s salary zero because he is unemployed, or did he just not answer
the question?
The danger with using nonsense values as not-data symbols is that they
can get misinterpreted as data when it comes time to build models. A linear
regression model trained to predict salaries from age, education, and gender will
have trouble with people who refused to answer the question.
Using a value like 1 as a no-data symbol has exactly the same deficiencies
as zero. Indeed, be like the mathematician who is afraid of negative numbers:
stop at nothing to avoid them.
Take-Home Lesson: Separately maintain both the raw data and its cleaned
version. The raw data is the ground truth, and must be preserved intact for
future analysis. The cleaned data may be improved using imputation to fill in
missing values. But keep raw data distinct from cleaned, so we can investigate
different approaches to guessing.
So how should we deal with missing values? The simplest approach is to
drop all records containing missing values. This works just fine when it leaves
enough training data, provided the missing values are absent for non-systematic
reasons. If the people refusing to state their salary were generally those above
the mean, dropping these records will lead to biased results.
But typically we want to make use of records with missing fields. It can be
better to estimate or impute missing values, instead of leaving them blank. We
need general methods for filling in missing values. Candidates include:
Heuristic-based imputation: Given sufficient knowledge of the underlying
domain, we should be able to make a reasonable guess for the value of
certain fields. If I need to fill in a value for the year you will die, guessing
birth year+80 will prove about right on average, and a lot faster than
waiting for the final answer.
Mean value imputation: Using the mean value of a variable as a proxy
for missing values is generally sensible. First, adding more values with
the mean leaves the mean unchanged, so we do not bias our statistics by
such imputation. Second, fields with mean values add a vanilla flavor to
most models, so they have a muted impact on any forecast made using
the data.
But the mean might not be appropriate if there is a systematic reason
for missing data. Suppose we used the mean death-year in Wikipedia to
impute the missing value for all living people. This would prove disastrous,
with many people recorded as dying before they were actually born.
Random value imputation: Another approach is to select a random value
from the column to replace the missing value. This would seem to set us up
for potentially lousy guesses, but that is actually the point. Repeatedly
selecting random values permits statistical evaluation of the impact of
imputation. If we run the model ten times with ten different imputed
values and get widely varying results, then we probably shouldn’t have
much confidence in the model. This accuracy check is particularly valuable
when there is a substantial fraction of values missing from the data set.
Imputation by nearest neighbor: What if we identify the complete record
which matches most closely on all fields present, and use this nearest
neighbor to infer the values of what is missing? Such predictions should
be more accurate than the mean, when there are systematic reasons to
explain variance among records.
This approach requires a distance function to identify the most similar
records. Nearest neighbor methods are an important technique in data
science, and will be presented in greater detail in Section 10.2.
Imputation by interpolation: More generally, we can use a method like
linear regression (see Section 9.1) to predict the values of the target col-
umn, given the other fields in the record. Such models can be trained over
full records and then applied to those with missing values.
Using linear regression to predict missing values works best when there is
only one field missing per record. The potential danger here is creating
significant outliers through lousy predictions. Regression models can easily
turn an incomplete record into an outlier, by filling the missing fields in
with unusually high or low values. This would lead downstream analysis
to focus more attention on the records with missing values, exactly the
opposite of what we want to do.
Such concerns emphasize the importance of outlier detection, the final step
in the cleaning process that will be considered here.
3.3.4 Outlier Detection
Mistakes in data collection can easily produce outliers that can interfere with
proper analysis. An interesting example concerns the largest dinosaur vertebra
ever discovered. Measured at 1500 millimeters, it implies an individual that was
188 feet long. This is amazing, particularly because the second largest specimen
ever discovered comes in at only 122 feet.
The most likely explanation here (see [Gol16]) is that this giant fossil never
actually existed: it has been missing from the American Museum of Natural
History for over a hundred years. Perhaps the original measurement was taken
on a conventionally-sized bone and the center two digits accidentally transposed,
reducing the vertebra down to 1050 millimeters.
Outlier elements are often created by data entry mistakes, as apparently was
the case here. They can also result from errors in scraping, say an irregularity
in formatting causing a footnote number to be interpreted as a numerical value.
Just because something is written down doesn’t make it correct. As with the
dinosaur example, a single outlier element can lead to major misinterpretations.
General sanity checking requires looking at the largest and smallest values
in each variable/column to see whether they are too far out of line. This can
best be done by plotting the frequency histogram and looking at the location of
the extreme elements. Visual inspection can also confirm that the distribution
looks the way it should, typically bell-shaped.
In normally distributed data, the probability that a value is kstandard de-
viations from the mean decreases exponentially with k. This explains why there
are no 10-foot basketball players, and provides a sound threshold to identify
outliers. Power law distributions are less easy to detect outliers in: there really
is a Bill Gates worth over 10,000 times as much as the average individual.
It is too simple to just delete the rows containing outlier fields and move
on. Outliers often point to more systematic problems that one must deal with.
Consider a data set of historical figures by lifespan. It is easy to finger the
biblical Methuselah (at 969 years) as an outlier, and remove him.
But it is better to figure out whether he is indicative of other figures that we
should consider removing. Observe that Methuselah had no firmly established
birth and death dates. Perhaps the published ages of anybody without dates
should be considered suspicious enough to prune. By contrast, the person with
the shortest lifespan in Wikipedia (John I, King of France) lived only five days.
But his birth (November 15) and death (November 20) dates in 1316 convinces
me that his lifespan was accurate.
3.4 War Story: Beating the Market
Every time we met, my graduate student Wenbin told me we were making
money. But he sounded less and less confident every time I asked.
Our Lydia sentiment analysis system took in massive text feeds of news and
social media, reducing them to daily time series of frequency and sentiment
for the millions of different people, places, and organizations mentioned within.
When somebody wins a sports championship, many articles get written describ-
ing how great an athlete they are. But when this player then gets busted on drug
charges, the tone of the articles about them immediately changes. By keeping
count of the relative frequency of association with positive words (“victorious”)
vs. negative words (“arrested”) in the text stream, we can construct sentiment
signals for any news-worthy entity.
Wenbin studied how sentiment signals could be used to predict future events
like the gross for a given movie, in response to the quality of published reviews
or buzz. But he particularly wanted to use this data to play the stock market.
Stocks move up and down according to news. A missed earnings report is bad
news for a company, so the price goes down. Food and Drug Administration
(FDA) approval of a new drug is great news for the company which owns it, so
the price goes up. If Wenbin could use our sentiment signal to predict future
stock prices, well, let’s just say I wouldn’t have to pay him as a research assistant
So he simulated a strategy of buying the stocks that showed the highest
sentiment in that day’s news, and then shorting those with the lowest sentiment.
He got great results. “See,” he said. “We are making money.”
The numbers looked great, but I had one quibble. Using today’s news re-
sults to predict current price movements wasn’t really fair, because the event
described in the article may have already moved the price before we had any
chance to read about it. Stock prices should react very quickly to important
So Wenbin simulated the strategy of buying stocks based on sentiment from
the previous day’s news, to create a gap between the observed news and price
changes. The return rate went down substantially, but was still positive. “See,”
he said. “We are still making money.”
But I remained a little uncomfortable with this. Many economists believe
that the financial markets are efficient, meaning that all public news is instantly
reflected in changing prices. Prices certainly changed in response to news, but
you would not be able to get in fast enough to exploit the information. We had
to remain skeptical enough to make sure there were no data/timing problems
that could explain our results.
So I pressed Wenbin about exactly how he had performed his simulation. His
strategy bought and sold at the closing price every day. But that left sixteen
hours until the next day’s open, plenty of time for the world to react to events
that happ