# The Data Science Design Manual


Steven S. Skiena


Computer Science Department

Stony Brook University

Stony Brook, NY

USA

ISSN 1868-0941 ISSN 1868-095X (electronic)

Texts in Computer Science

ISBN 978-3-319-55443-3 ISBN 978-3-319-55444-0 (eBook)

DOI 10.1007/978-3-319-55444-0

Library of Congress Control Number: 2017943201

© The Author(s) 2017

This Springer imprint is published by Springer Nature

The registered company is Springer International Publishing AG

The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Making sense of the world around us requires obtaining and analyzing data from our environment. Several technology trends have recently collided, providing new opportunities to apply our data analysis savvy to greater challenges than ever before.

Computer storage capacity has increased exponentially; indeed remembering has become so cheap that it is almost impossible to get computer systems to forget. Sensing devices increasingly monitor everything that can be observed: video streams, social media interactions, and the position of anything that moves. Cloud computing enables us to harness the power of massive numbers of machines to manipulate this data. Indeed, hundreds of computers are summoned each time you do a Google search, scrutinizing all of your previous activity just to decide which is the best ad to show you next.

The result of all this has been the birth of data science, a new field devoted to maximizing value from vast collections of information. As a discipline, data science sits somewhere at the intersection of statistics, computer science, and machine learning, but it is building a distinct heft and character of its own. This book serves as an introduction to data science, focusing on the skills and principles needed to build systems for collecting, analyzing, and interpreting data.

My professional experience as a researcher and instructor convinces me that one major challenge of data science is that it is considerably more subtle than it looks. Any student who has ever computed their grade point average (GPA) can be said to have done rudimentary statistics, just as drawing a simple scatter plot lets you add experience in data visualization to your resume. But meaningfully analyzing and interpreting data requires both technical expertise and wisdom. That so many people do these basics so badly provides my inspiration for writing this book.
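To make the GPA aside concrete: a grade point average is just a credit-weighted mean, about the simplest statistic there is. The sketch below uses a hypothetical transcript (made-up grade points and credit hours); the `gpa` function name and values are illustrative, not from the book.

```python
def gpa(grade_points, credits):
    """Credit-weighted mean of grade points: rudimentary statistics."""
    return sum(g * c for g, c in zip(grade_points, credits)) / sum(credits)

# Hypothetical transcript: three courses, with grade points and credit hours.
grade_points = [4.0, 3.3, 3.7]
credits = [3, 4, 3]
print(round(gpa(grade_points, credits), 2))  # 3.63
```

Even this toy computation embodies a statistical decision: weighting by credits rather than averaging grades uniformly.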

To the Reader

I have been gratified by the warm reception that my book The Algorithm Design Manual [Ski08] has received since its initial publication in 1997. It has been recognized as a unique guide to using algorithmic techniques to solve problems that often arise in practice. The book you are holding covers very different material, but with the same motivation.

In particular, here I stress the following basic principles as fundamental to becoming a good data scientist:

• Valuing doing the simple things right: Data science isn't rocket science. Students and practitioners often get lost in technological space, pursuing the most advanced machine learning methods, the newest open source software libraries, or the glitziest visualization techniques. However, the heart of data science lies in doing the simple things right: understanding the application domain, cleaning and integrating relevant data sources, and presenting your results clearly to others.

  Simple doesn't mean easy, however. Indeed it takes considerable insight and experience to ask the right questions, and sense whether you are moving toward correct answers and actionable insights. I resist the temptation to drill deeply into clean, technical material here just because it is teachable. There are plenty of other books which will cover the intricacies of machine learning algorithms or statistical hypothesis testing. My mission here is to lay the groundwork of what really matters in analyzing data.

• Developing mathematical intuition: Data science rests on a foundation of mathematics, particularly statistics and linear algebra. It is important to understand this material on an intuitive level: why these concepts were developed, how they are useful, and when they work best. I illustrate operations in linear algebra by presenting pictures of what happens to matrices when you manipulate them, and statistical concepts by examples and reductio ad absurdum arguments. My goal here is transplanting intuition into the reader.

  But I strive to minimize the amount of formal mathematics used in presenting this material. Indeed, I will present exactly one formal proof in this book, an incorrect proof where the associated theorem is obviously false. The moral here is not that mathematical rigor doesn't matter, because of course it does, but that genuine rigor is impossible until after there is comprehension.

• Think like a computer scientist, but act like a statistician: Data science provides an umbrella linking computer scientists, statisticians, and domain specialists. But each community has its own distinct styles of thinking and action, which get stamped into the souls of its members.

  In this book, I emphasize approaches which come most naturally to computer scientists, particularly the algorithmic manipulation of data, the use of machine learning, and the mastery of scale. But I also seek to transmit the core values of statistical reasoning: the need to understand the application domain, proper appreciation of the small, the quest for significance, and a hunger for exploration.

  No discipline has a monopoly on the truth. The best data scientists incorporate tools from multiple areas, and this book strives to be a relatively neutral ground where rival philosophies can come to reason together.

Equally important is what you will not find in this book. I do not emphasize any particular language or suite of data analysis tools. Instead, this book provides a high-level discussion of important design principles. I seek to operate at a conceptual level more than a technical one. The goal of this manual is to get you going in the right direction as quickly as possible, with whatever software tools you find most accessible.

To the Instructor

This book covers enough material for an "Introduction to Data Science" course at the undergraduate or early graduate level. I hope that the reader has completed the equivalent of at least one programming course and has a bit of prior exposure to probability and statistics, but more is always better than less.

I have made a full set of lecture slides for teaching this course available online at http://www.data-manual.com. Data resources for projects and assignments are also available there to aid the instructor. Further, I make available online video lectures using these slides to teach a full-semester data science course. Let me help teach your class, through the magic of the web!

Pedagogical features of this book include:

• War Stories: To provide a better perspective on how data science techniques apply to the real world, I include a collection of "war stories," or tales from our experience with real problems. The moral of these stories is that these methods are not just theory, but important tools to be pulled out and used as needed.

• False Starts: Most textbooks present methods as a fait accompli, obscuring the ideas involved in designing them, and the subtle reasons why other approaches fail. The war stories illustrate my reasoning process on certain applied problems, but I weave such coverage into the core material as well.

• Take-Home Lessons: Highlighted "take-home" lesson boxes scattered through each chapter emphasize the big-picture concepts to learn from each chapter.

• Homework Problems: I provide a wide range of exercises for homework and self-study. Many are traditional exam-style problems, but there are also larger-scale implementation challenges and smaller-scale interview questions, reflecting the questions students might encounter when searching for a job. Degree of difficulty ratings have been assigned to all problems.

  In lieu of an answer key, a Solution Wiki has been set up, where solutions to all even-numbered problems will be solicited by crowdsourcing. A similar system with my Algorithm Design Manual produced coherent solutions, or so I am told. As a matter of principle I refuse to look at them, so let the buyer beware.

• Kaggle Challenges: Kaggle (www.kaggle.com) provides a forum for data scientists to compete in, featuring challenging real-world problems on fascinating data sets, and scoring to test how good your model is relative to other submissions. The exercises for each chapter include three relevant Kaggle challenges, to serve as a source of inspiration, self-study, and data for other projects and investigations.

• Data Science Television: Data science remains mysterious and even threatening to the broader public. The Quant Shop is an amateur take on what a data science reality show should be like. Student teams tackle a diverse array of real-world prediction problems, and try to forecast the outcome of future events. Check it out at http://www.quant-shop.com.

  A series of eight 30-minute episodes has been prepared, each built around a particular real-world prediction problem. Challenges include pricing art at an auction, picking the winner of the Miss Universe competition, and forecasting when celebrities are destined to die. For each, we observe as a student team comes to grips with the problem, and learn along with them as they build a forecasting model. They make their predictions, and we watch along with them to see if they are right or wrong.

  In this book, The Quant Shop is used to provide concrete examples of prediction challenges, to frame discussions of the data science modeling pipeline from data acquisition to evaluation. I hope you find them fun, and that they will encourage you to conceive and take on your own modeling challenges.

• Chapter Notes: Finally, each tutorial chapter concludes with a brief notes section, pointing readers to primary sources and additional references.

Dedication

My bright and loving daughters Bonnie and Abby are now full-blown teenagers, meaning that they don't always process statistical evidence with as much alacrity as I would desire. I dedicate this book to them, in the hope that their analysis skills improve to the point that they always just agree with me.

And I dedicate this book to my beautiful wife Renee, who agrees with me even when she doesn't agree with me, and loves me beyond the support of all creditable evidence.

Acknowledgments

My list of people to thank is large enough that I have probably missed some. I will try to enumerate them systematically to minimize omissions, but I ask those I've unfairly neglected for absolution.

First, I thank those who made concrete contributions to help me put this book together. Yeseul Lee served as an apprentice on this project, helping with figures, exercises, and more during summer 2016 and beyond. You will see evidence of her handiwork on almost every page, and I greatly appreciate her help and dedication. Aakriti Mittal and Jack Zheng also contributed to a few of the figures.

Students in my Fall 2016 Introduction to Data Science course (CSE 519) helped to debug the manuscript, and they found plenty of things to debug. I particularly thank Rebecca Siford, who proposed over one hundred corrections on her own. Several data science friends/sages reviewed specific chapters for me, and I thank Anshul Gandhi, Yifan Hu, Klaus Mueller, Francesco Orabona, Andy Schwartz, and Charles Ward for their efforts here.

I thank all the Quant Shop students from Fall 2015 whose video and modeling efforts are so visibly on display. I particularly thank Jan (Dini) Diskin-Zimmerman, whose editing efforts went so far beyond the call of duty I felt like a felon for letting her do it.

My editors at Springer, Wayne Wheeler and Simon Rees, were a pleasure to work with as usual. I also thank all the production and marketing people who helped get this book to you, including Adrian Pieron and Annette Anlauf.

Several exercises were originated by colleagues or inspired by other sources. Reconstructing the original sources years later can be challenging, but credits for each problem (to the best of my recollection) appear on the website.

Much of what I know about data science has been learned through working with other people. These include my Ph.D. students, particularly Rami al-Rfou, Mikhail Bautin, Haochen Chen, Yanqing Chen, Vivek Kulkarni, Levon Lloyd, Andrew Mehler, Bryan Perozzi, Yingtao Tian, Junting Ye, Wenbin Zhang, and postdoc Charles Ward. I fondly remember all of my Lydia project masters students over the years, and remind you that my prize offer to the first one who names their daughter Lydia remains unclaimed. I thank my other collaborators with stories to tell, including Bruce Futcher, Justin Gardin, Arnout van de Rijt, and Oleksii Starov.

I remember all members of the General Sentiment/Canrock universe, particularly Mark Fasciano, with whom I shared the start-up dream and experienced what happens when data hits the real world. I thank my colleagues at Yahoo Labs/Research during my 2015–2016 sabbatical year, when much of this book was conceived. I single out Amanda Stent, who enabled me to be at Yahoo during that particularly difficult year in the company's history. I learned valuable things from other people who have taught related data science courses, including Andrew Ng and Hans-Peter Pfister, and thank them all for their help.

If you have a procedure with ten parameters, you probably missed some.

– Alan Perlis

Caveat

It is traditional for the author to magnanimously accept the blame for whatever deficiencies remain. I don't. Any errors, deficiencies, or problems in this book are somebody else's fault, but I would appreciate knowing about them so as to determine who is to blame.

Steven S. Skiena

Department of Computer Science

Stony Brook University

Stony Brook, NY 11794-2424

http://www.cs.stonybrook.edu/~skiena

skiena@data-manual.com

May 2017

Contents

1 What is Data Science?
  1.1 Computer Science, Data Science, and Real Science
  1.2 Asking Interesting Questions from Data
    1.2.1 The Baseball Encyclopedia
    1.2.2 The Internet Movie Database (IMDb)
    1.2.3 Google Ngrams
    1.2.4 New York Taxi Records
  1.3 Properties of Data
    1.3.1 Structured vs. Unstructured Data
    1.3.2 Quantitative vs. Categorical Data
    1.3.3 Big Data vs. Little Data
  1.4 Classification and Regression
  1.5 Data Science Television: The Quant Shop
    1.5.1 Kaggle Challenges
  1.6 About the War Stories
  1.7 War Story: Answering the Right Question
  1.8 Chapter Notes
  1.9 Exercises

2 Mathematical Preliminaries
  2.1 Probability
    2.1.1 Probability vs. Statistics
    2.1.2 Compound Events and Independence
    2.1.3 Conditional Probability
    2.1.4 Probability Distributions
  2.2 Descriptive Statistics
    2.2.1 Centrality Measures
    2.2.2 Variability Measures
    2.2.3 Interpreting Variance
    2.2.4 Characterizing Distributions
  2.3 Correlation Analysis
    2.3.1 Correlation Coefficients: Pearson and Spearman Rank
    2.3.2 The Power and Significance of Correlation
    2.3.3 Correlation Does Not Imply Causation!
    2.3.4 Detecting Periodicities by Autocorrelation
  2.4 Logarithms
    2.4.1 Logarithms and Multiplying Probabilities
    2.4.2 Logarithms and Ratios
    2.4.3 Logarithms and Normalizing Skewed Distributions
  2.5 War Story: Fitting Designer Genes
  2.6 Chapter Notes
  2.7 Exercises

3 Data Munging
  3.1 Languages for Data Science
    3.1.1 The Importance of Notebook Environments
    3.1.2 Standard Data Formats
  3.2 Collecting Data
    3.2.1 Hunting
    3.2.2 Scraping
    3.2.3 Logging
  3.3 Cleaning Data
    3.3.1 Errors vs. Artifacts
    3.3.2 Data Compatibility
    3.3.3 Dealing with Missing Values
    3.3.4 Outlier Detection
  3.4 War Story: Beating the Market
  3.5 Crowdsourcing
    3.5.1 The Penny Demo
    3.5.2 When is the Crowd Wise?
    3.5.3 Mechanisms for Aggregation
    3.5.4 Crowdsourcing Services
    3.5.5 Gamification
  3.6 Chapter Notes
  3.7 Exercises

4 Scores and Rankings
  4.1 The Body Mass Index (BMI)
  4.2 Developing Scoring Systems
    4.2.1 Gold Standards and Proxies
    4.2.2 Scores vs. Rankings
    4.2.3 Recognizing Good Scoring Functions
  4.3 Z-scores and Normalization
  4.4 Advanced Ranking Techniques
    4.4.1 Elo Rankings
    4.4.2 Merging Rankings
    4.4.3 Digraph-based Rankings
    4.4.4 PageRank
  4.5 War Story: Clyde's Revenge
  4.6 Arrow's Impossibility Theorem
  4.7 War Story: Who's Bigger?
  4.8 Chapter Notes
  4.9 Exercises

5 Statistical Analysis
  5.1 Statistical Distributions
    5.1.1 The Binomial Distribution
    5.1.2 The Normal Distribution
    5.1.3 Implications of the Normal Distribution
    5.1.4 Poisson Distribution
    5.1.5 Power Law Distributions
  5.2 Sampling from Distributions
    5.2.1 Random Sampling beyond One Dimension
  5.3 Statistical Significance
    5.3.1 The Significance of Significance
    5.3.2 The T-test: Comparing Population Means
    5.3.3 The Kolmogorov-Smirnov Test
    5.3.4 The Bonferroni Correction
    5.3.5 False Discovery Rate
  5.4 War Story: Discovering the Fountain of Youth?
  5.5 Permutation Tests and P-values
    5.5.1 Generating Random Permutations
    5.5.2 DiMaggio's Hitting Streak
  5.6 Bayesian Reasoning
  5.7 Chapter Notes
  5.8 Exercises

6 Visualizing Data
  6.1 Exploratory Data Analysis
    6.1.1 Confronting a New Data Set
    6.1.2 Summary Statistics and Anscombe's Quartet
    6.1.3 Visualization Tools
  6.2 Developing a Visualization Aesthetic
    6.2.1 Maximizing Data-Ink Ratio
    6.2.2 Minimizing the Lie Factor
    6.2.3 Minimizing Chartjunk
    6.2.4 Proper Scaling and Labeling
    6.2.5 Effective Use of Color and Shading
    6.2.6 The Power of Repetition
  6.3 Chart Types
    6.3.1 Tabular Data
    6.3.2 Dot and Line Plots
    6.3.3 Scatter Plots
    6.3.4 Bar Plots and Pie Charts
    6.3.5 Histograms
    6.3.6 Data Maps
  6.4 Great Visualizations
    6.4.1 Marey's Train Schedule
    6.4.2 Snow's Cholera Map
    6.4.3 New York's Weather Year
  6.5 Reading Graphs
    6.5.1 The Obscured Distribution
    6.5.2 Overinterpreting Variance
  6.6 Interactive Visualization
  6.7 War Story: TextMapping the World
  6.8 Chapter Notes
  6.9 Exercises

7 Mathematical Models
  7.1 Philosophies of Modeling
    7.1.1 Occam's Razor
    7.1.2 Bias–Variance Trade-Offs
    7.1.3 What Would Nate Silver Do?
  7.2 A Taxonomy of Models
    7.2.1 Linear vs. Non-Linear Models
    7.2.2 Blackbox vs. Descriptive Models
    7.2.3 First-Principle vs. Data-Driven Models
    7.2.4 Stochastic vs. Deterministic Models
    7.2.5 Flat vs. Hierarchical Models
  7.3 Baseline Models
    7.3.1 Baseline Models for Classification
    7.3.2 Baseline Models for Value Prediction
  7.4 Evaluating Models
    7.4.1 Evaluating Classifiers
    7.4.2 Receiver-Operator Characteristic (ROC) Curves
    7.4.3 Evaluating Multiclass Systems
    7.4.4 Evaluating Value Prediction Models
  7.5 Evaluation Environments
    7.5.1 Data Hygiene for Evaluation
    7.5.2 Amplifying Small Evaluation Sets
  7.6 War Story: 100% Accuracy
  7.7 Simulation Models
  7.8 War Story: Calculated Bets
  7.9 Chapter Notes
  7.10 Exercises

8 Linear Algebra
  8.1 The Power of Linear Algebra
    8.1.1 Interpreting Linear Algebraic Formulae
    8.1.2 Geometry and Vectors
  8.2 Visualizing Matrix Operations
    8.2.1 Matrix Addition
    8.2.2 Matrix Multiplication
    8.2.3 Applications of Matrix Multiplication
    8.2.4 Identity Matrices and Inversion
    8.2.5 Matrix Inversion and Linear Systems
    8.2.6 Matrix Rank
  8.3 Factoring Matrices
    8.3.1 Why Factor Feature Matrices?
    8.3.2 LU Decomposition and Determinants
  8.4 Eigenvalues and Eigenvectors
    8.4.1 Properties of Eigenvalues
    8.4.2 Computing Eigenvalues
  8.5 Eigenvalue Decomposition
    8.5.1 Singular Value Decomposition
    8.5.2 Principal Components Analysis
  8.6 War Story: The Human Factors
  8.7 Chapter Notes
  8.8 Exercises

9 Linear and Logistic Regression
  9.1 Linear Regression
    9.1.1 Linear Regression and Duality
    9.1.2 Error in Linear Regression
    9.1.3 Finding the Optimal Fit
  9.2 Better Regression Models
    9.2.1 Removing Outliers
    9.2.2 Fitting Non-Linear Functions
    9.2.3 Feature and Target Scaling
    9.2.4 Dealing with Highly-Correlated Features
  9.3 War Story: Taxi Deriver
  9.4 Regression as Parameter Fitting
    9.4.1 Convex Parameter Spaces
    9.4.2 Gradient Descent Search
    9.4.3 What is the Right Learning Rate?
    9.4.4 Stochastic Gradient Descent
  9.5 Simplifying Models through Regularization
    9.5.1 Ridge Regression
    9.5.2 LASSO Regression
    9.5.3 Trade-Offs between Fit and Complexity
  9.6 Classification and Logistic Regression
    9.6.1 Regression for Classification
    9.6.2 Decision Boundaries
    9.6.3 Logistic Regression
  9.7 Issues in Logistic Classification
    9.7.1 Balanced Training Classes
    9.7.2 Multi-Class Classification
    9.7.3 Hierarchical Classification
    9.7.4 Partition Functions and Multinomial Regression
  9.8 Chapter Notes
  9.9 Exercises

10 Distance and Network Methods
  10.1 Measuring Distances
    10.1.1 Distance Metrics
    10.1.2 The Lk Distance Metric
    10.1.3 Working in Higher Dimensions
    10.1.4 Dimensional Egalitarianism
    10.1.5 Points vs. Vectors
    10.1.6 Distances between Probability Distributions
  10.2 Nearest Neighbor Classification
    10.2.1 Seeking Good Analogies
    10.2.2 k-Nearest Neighbors
    10.2.3 Finding Nearest Neighbors
    10.2.4 Locality Sensitive Hashing
  10.3 Graphs, Networks, and Distances
    10.3.1 Weighted Graphs and Induced Networks
    10.3.2 Talking About Graphs
    10.3.3 Graph Theory
  10.4 PageRank
  10.5 Clustering
    10.5.1 k-means Clustering
    10.5.2 Agglomerative Clustering
    10.5.3 Comparing Clusterings
    10.5.4 Similarity Graphs and Cut-Based Clustering
  10.6 War Story: Cluster Bombing
  10.7 Chapter Notes
  10.8 Exercises

11 Machine Learning
  11.1 Naive Bayes
    11.1.1 Formulation
    11.1.2 Dealing with Zero Counts (Discounting)
  11.2 Decision Tree Classifiers
    11.2.1 Constructing Decision Trees
    11.2.2 Realizing Exclusive Or
    11.2.3 Ensembles of Decision Trees
  11.3 Boosting and Ensemble Learning
    11.3.1 Voting with Classifiers
    11.3.2 Boosting Algorithms
  11.4 Support Vector Machines
    11.4.1 Linear SVMs
    11.4.2 Non-linear SVMs
    11.4.3 Kernels
  11.5 Degrees of Supervision
    11.5.1 Supervised Learning
    11.5.2 Unsupervised Learning
    11.5.3 Semi-supervised Learning
    11.5.4 Feature Engineering
  11.6 Deep Learning
    11.6.1 Networks and Depth
    11.6.2 Backpropagation
    11.6.3 Word and Graph Embeddings
  11.7 War Story: The Name Game
  11.8 Chapter Notes
  11.9 Exercises

12 Big Data: Achieving Scale
  12.1 What is Big Data?
    12.1.1 Big Data as Bad Data
    12.1.2 The Three Vs
  12.2 War Story: Infrastructure Matters
  12.3 Algorithmics for Big Data
    12.3.1 Big Oh Analysis
    12.3.2 Hashing
    12.3.3 Exploiting the Storage Hierarchy
    12.3.4 Streaming and Single-Pass Algorithms
  12.4 Filtering and Sampling
    12.4.1 Deterministic Sampling Algorithms
    12.4.2 Randomized and Stream Sampling
  12.5 Parallelism
    12.5.1 One, Two, Many
    12.5.2 Data Parallelism
    12.5.3 Grid Search
    12.5.4 Cloud Computing Services
  12.6 MapReduce
    12.6.1 Map-Reduce Programming
    12.6.2 MapReduce under the Hood
  12.7 Societal and Ethical Implications
  12.8 Chapter Notes
  12.9 Exercises

13 Coda
  13.1 Get a Job!
  13.2 Go to Graduate School!

13.3 Professional Consulting Services . . . . . . . . . . . . . . . . . . 425

14 Bibliography 427

Chapter 1

What is Data Science?

The purpose of computing is insight, not numbers.

– Richard W. Hamming

What is data science? Like any emerging ﬁeld, it hasn’t been completely deﬁned

yet, but you know enough about it to be interested or else you wouldn’t be

reading this book.

I think of data science as lying at the intersection of computer science, statis-

tics, and substantive application domains. From computer science comes ma-

chine learning and high-performance computing technologies for dealing with

scale. From statistics comes a long tradition of exploratory data analysis, sig-

niﬁcance testing, and visualization. From application domains in business and

the sciences come challenges worthy of battle, and evaluation standards to

assess when they have been adequately conquered.

But these are all well-established ﬁelds. Why data science, and why now? I

see three reasons for this sudden burst of activity:

•New technology makes it possible to capture, annotate, and store vast

amounts of social media, logging, and sensor data. After you have amassed

all this data, you begin to wonder what you can do with it.

•Computing advances make it possible to analyze data in novel ways and at

ever increasing scales. Cloud computing architectures give even the little

guy access to vast power when they need it. New approaches to machine

learning have led to amazing advances in longstanding problems, like

computer vision and natural language processing.

•Prominent technology companies (like Google and Facebook) and quan-

titative hedge funds (like Renaissance Technologies and TwoSigma) have

proven the power of modern data analytics. Success stories applying data

to such diverse areas as sports management (Moneyball [Lew04]) and elec-

tion forecasting (Nate Silver [Sil12]) have served as role models to bring

data science to a large popular audience.


This introductory chapter has three missions. First, I will try to explain how

good data scientists think, and how this diﬀers from the mindset of traditional

programmers and software developers. Second, we will look at data sets in terms

of the potential for what they can be used for, and learn to ask the broader

questions they are capable of answering. Finally, I introduce a collection of

data analysis challenges that will be used throughout this book as motivating

examples.

1.1 Computer Science, Data Science, and Real Science

Computer scientists, by nature, don’t respect data. They have traditionally

been taught that the algorithm was the thing, and that data was just meat to

be passed through a sausage grinder.

So to qualify as an eﬀective data scientist, you must ﬁrst learn to think like

a real scientist. Real scientists strive to understand the natural world, which

is a complicated and messy place. By contrast, computer scientists tend to

build their own clean and organized virtual worlds and live comfortably within

them. Scientists obsess about discovering things, while computer scientists in-

vent rather than discover.

People’s mindsets strongly color how they think and act, causing misunder-

standings when we try to communicate outside our tribes. So fundamental are

these biases that we are often unaware we have them. Examples of the cultural

diﬀerences between computer science and real science include:

•Data vs. method centrism: Scientists are data driven, while computer

scientists are algorithm driven. Real scientists spend enormous amounts

of eﬀort collecting data to answer their question of interest. They invent

fancy measuring devices, stay up all night tending to experiments, and

devote most of their thinking to how to get the data they need.

By contrast, computer scientists obsess about methods: which algorithm

is better than which other algorithm, which programming language is best

for a job, which program is better than which other program. The details

of the data set they are working on seem comparably unexciting.

•Concern about results: Real scientists care about answers. They analyze

data to discover something about how the world works. Good scientists

care about whether the results make sense, because they care about what

the answers mean.

By contrast, bad computer scientists worry about producing plausible-

looking numbers. As soon as the numbers stop looking grossly wrong,

they are presumed to be right. This is because they are personally less

invested in what can be learned from a computation, as opposed to getting

it done quickly and eﬃciently.


•Robustness: Real scientists are comfortable with the idea that data has

errors. In general, computer scientists are not. Scientists think a lot about

possible sources of bias or error in their data, and how these possible problems can affect the conclusions derived from them. Good programmers use

strong data-typing and parsing methodologies to guard against formatting

errors, but the concerns here are diﬀerent.

Becoming aware that data can have errors is empowering. Computer

scientists chant “garbage in, garbage out” as a defensive mantra to ward

oﬀ criticism, a way to say that’s not my job. Real scientists get close

enough to their data to smell it, giving it the sniﬀ test to decide whether

it is likely to be garbage.

•Precision: Nothing is ever completely true or false in science, while every-

thing is either true or false in computer science or mathematics.

Generally speaking, computer scientists are happy printing ﬂoating point

numbers to as many digits as possible: 8/13 = 0.61538461538. Real

scientists will use only two significant digits: 8/13 ≈ 0.62. Computer

scientists care what a number is, while real scientists care what it means.

Aspiring data scientists must learn to think like real scientists. Your job is

going to be to turn numbers into insight. It is important to understand the why

as much as the how.

To be fair, it beneﬁts real scientists to think like data scientists as well. New

experimental technologies enable measuring systems on vastly greater scale than

ever possible before, through technologies like full-genome sequencing in biology

and full-sky telescope surveys in astronomy. With new breadth of view comes

new levels of vision.

Traditional hypothesis-driven science was based on asking speciﬁc questions

of the world and then generating the speciﬁc data needed to conﬁrm or deny

it. This is now augmented by data-driven science, which instead focuses on

generating data on a previously unheard of scale or resolution, in the belief that

new discoveries will come as soon as one is able to look at it. Both ways of

thinking will be important to us:

•Given a problem, what available data will help us answer it?

•Given a data set, what interesting problems can we apply it to?

There is another way to capture this basic distinction between software en-

gineering and data science. It is that software developers are hired to build

systems, while data scientists are hired to produce insights.

This may be a point of contention for some developers. There exist an

important class of engineers who wrangle the massive distributed infrastructures

necessary to store and analyze, say, ﬁnancial transaction or social media data


on a full Facebook or Twitter-level of scale. Indeed, I will devote Chapter 12

to the distinctive challenges of big data infrastructures. These engineers are

building tools and systems to support data science, even though they may not

personally mine the data they wrangle. Do they qualify as data scientists?

This is a fair question, one I will ﬁnesse a bit so as to maximize the poten-

tial readership of this book. But I do believe that the better such engineers

understand the full data analysis pipeline, the more likely they will be able to

build powerful tools capable of providing important insights. A major goal of

this book is providing big data engineers with the intellectual tools to think like

big data scientists.

1.2 Asking Interesting Questions from Data

Good data scientists develop an inherent curiosity about the world around them,

particularly in the associated domains and applications they are working on.

They enjoy talking shop with the people whose data they work with. They ask

them questions: What is the coolest thing you have learned about this ﬁeld?

Why did you get interested in it? What do you hope to learn by analyzing your

data set? Data scientists always ask questions.

Good data scientists have wide-ranging interests. They read the newspaper

every day to get a broader perspective on what is exciting. They understand that

the world is an interesting place. Knowing a little something about everything

equips them to play in other people’s backyards. They are brave enough to get

out of their comfort zones a bit, and driven to learn more once they get there.

Software developers are not really encouraged to ask questions, but data

scientists are. We ask questions like:

•What things might you be able to learn from a given data set?

•What do you/your people really want to know about the world?

•What will it mean to you once you ﬁnd out?

Computer scientists traditionally do not really appreciate data. Think about

the way algorithm performance is experimentally measured. Usually the pro-

gram is run on “random data” to see how long it takes. They rarely even look

at the results of the computation, except to verify that it is correct and eﬃcient.

Since the “data” is meaningless, the results cannot be important. In contrast,

real data sets are a scarce resource that required hard work and imagination

to obtain.

Becoming a data scientist requires learning to ask questions about data, so

let’s practice. Each of the subsections below will introduce an interesting data

set. After you understand what kind of information is available, try to come

up with, say, ﬁve interesting questions you might explore/answer with access to

this data set.


Figure 1.1: Statistical information on the performance of Babe Ruth can be

found at http://www.baseball-reference.com.

The key is thinking broadly: the answers to big, general questions often lie

buried in highly-speciﬁc data sets, which were by no means designed to contain

them.

1.2.1 The Baseball Encyclopedia

Baseball has long had an outsized importance in the world of data science. This

sport has been called the national pastime of the United States; indeed, French

historian Jacques Barzun observed that “Whoever wants to know the heart and

mind of America had better learn baseball.” I realize that many readers are not

American, and even those who are might be completely uninterested in sports.

But stick with me for a while.

What makes baseball important to data science is its extensive statistical

record of play, dating back for well over a hundred years. Baseball is a sport of

discrete events: pitchers throw balls and batters try to hit them, a structure that naturally

lends itself to informative statistics. Fans get immersed in these statistics as chil-

dren, building their intuition about the strengths and limitations of quantitative

analysis. Some of these children grow up to become data scientists. Indeed, the

success of Brad Pitt’s statistically-minded baseball team in the movie Moneyball

remains the American public’s most vivid contact with data science.

This historical baseball record is available at http://www.baseball-reference.

com. There you will ﬁnd complete statistical data on the performance of every

player who even stepped on the ﬁeld. This includes summary statistics of each

season’s batting, pitching, and ﬁelding record, plus information about teams


Figure 1.2: Personal information on every major league baseball player is avail-

able at http://www.baseball-reference.com.

and awards as shown in Figure 1.1.

But more than just statistics, there is metadata on the life and careers of all

the people who have ever played major league baseball, as shown in Figure 1.2.

We get the vital statistics of each player (height, weight, handedness) and their

lifespan (when/where they were born and died). We also get salary information

(how much each player got paid every season) and transaction data (how did

they get to be the property of each team they played for).

Now, I realize that many of you do not have the slightest knowledge of or

interest in baseball. This sport is somewhat reminiscent of cricket, if that helps.

But remember that as a data scientist, it is your job to be interested in the

world around you. Think of this as a chance to learn something.

So what interesting questions can you answer with this baseball data set?

Try to write down ﬁve questions before moving on. Don’t worry, I will wait here

for you to ﬁnish.

The most obvious types of questions to answer with this data are directly

related to baseball:

•How can we best measure an individual player’s skill or value?

•How fairly do trades between teams generally work out?

•What is the general trajectory of a player's performance level as they mature

and age?

•To what extent does batting performance correlate with position played?

For example, are outﬁelders really better hitters than inﬁelders?

These are interesting questions. But even more interesting are questions

about demographic and social issues. Almost 20,000 major league baseball players have taken the field over the past 150 years, providing a large, extensively-

documented cohort of men who can serve as a proxy for even larger, less well-

documented populations. Indeed, we can use this baseball player data to answer

questions like:

•Do left-handed people have shorter lifespans than right-handers? Handed-

ness is not captured in most demographic data sets, but has been diligently

assembled here. Indeed, analysis of this data set has been used to show

that right-handed people live longer than lefties [HC88]!

•How often do people return to live in the same place where they were

born? Locations of birth and death have been extensively recorded in this

data set. Further, almost all of these people played at least part of their

career far from home, thus exposing them to the wider world at a critical

time in their youth.

•Do player salaries generally reﬂect past, present, or future performance?

•To what extent have heights and weights been increasing in the population

at large?

There are two particular themes to be aware of here. First, the identiﬁers

and reference tags (i.e. the metadata) often prove more interesting in a data set

than the stuﬀ we are supposed to care about, here the statistical record of play.

Second is the idea of a statistical proxy, where you use the data set you have

to substitute for the one you really want. The data set of your dreams likely

does not exist, or may be locked away behind a corporate wall even if it does.

A good data scientist is a pragmatist, seeing what they can do with what they

have instead of bemoaning what they cannot get their hands on.

1.2.2 The Internet Movie Database (IMDb)

Everybody loves the movies. The Internet Movie Database (IMDb) provides

crowdsourced and curated data about all aspects of the motion picture industry,

at www.imdb.com. IMDb currently contains data on over 3.3 million movies and

TV programs. For each ﬁlm, IMDb includes its title, running time, genres, date

of release, and a full list of cast and crew. There is ﬁnancial data about each

production, including the budget for making the ﬁlm and how well it did at the

box oﬃce.

Finally, there are extensive ratings for each ﬁlm from viewers and critics.

This rating data consists of scores on a zero to ten stars scale, cross-tabulated

into averages by age and gender. Written reviews are often included, explaining

why a particular critic awarded a given number of stars. There are also links

between ﬁlms: for example, identifying which other ﬁlms have been watched

most often by viewers of It’s a Wonderful Life.

Every actor, director, producer, and crew member associated with a ﬁlm

merits an entry in IMDb, which now contains records on 6.5 million people.


Figure 1.3: Representative ﬁlm data from the Internet Movie Database.

Figure 1.4: Representative actor data from the Internet Movie Database.


These happen to include my brother, cousin, and sister-in-law. Each actor

is linked to every ﬁlm they appeared in, with a description of their role and

their ordering in the credits. Available data about each personality includes

birth/death dates, height, awards, and family relations.

So what kind of questions can you answer with this movie data?

Perhaps the most natural questions to ask IMDb involve identifying the

extremes of movies and actors:

•Which actors appeared in the most ﬁlms? Earned the most money? Ap-

peared in the lowest rated ﬁlms? Had the longest career or the shortest

lifespan?

•What was the highest rated ﬁlm each year, or the best in each genre?

Which movies lost the most money, had the highest-powered casts, or got

the least favorable reviews?

Then there are larger-scale questions one can ask about the nature of the

motion picture business itself:

•How well does movie gross correlate with viewer ratings or awards? Do

customers instinctively ﬂock to trash, or is virtue on the part of the cre-

ative team properly rewarded?

•How do Hollywood movies compare to Bollywood movies, in terms of rat-

ings, budget, and gross? Are American movies better received than foreign

ﬁlms, and how does this diﬀer between U.S. and non-U.S. reviewers?

•What is the age distribution of actors and actresses in ﬁlms? How much

younger is the actress playing the wife, on average, than the actor playing

the husband? Has this disparity been increasing or decreasing with time?

•Live fast, die young, and leave a good-looking corpse? Do movie stars live

longer or shorter lives than bit players, or compared to the general public?

Assuming that people working together on a ﬁlm get to know each other,

the cast and crew data can be used to build a social network of the movie

business. What does the social network of actors look like? The Oracle of

Bacon (https://oracleofbacon.org/) posits Kevin Bacon as the center of

the Hollywood universe and generates the shortest path to Bacon from any

other actor. Other actors, like Samuel L. Jackson, prove even more central.
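These center-of-the-universe queries reduce to shortest paths in the co-appearance graph, which breadth-first search answers in linear time. A minimal sketch follows; the toy graph and its credits are invented for illustration, not real IMDb data:

```python
from collections import deque

def bacon_number(graph, start, target="Kevin Bacon"):
    """Breadth-first search: length of the shortest chain of
    co-star links from start to target, or None if unreachable."""
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        actor, dist = queue.popleft()
        for costar in graph.get(actor, []):
            if costar == target:
                return dist + 1
            if costar not in seen:
                seen.add(costar)
                queue.append((costar, dist + 1))
    return None

# Hypothetical collaboration graph: actor -> list of co-stars.
graph = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Alice", "Kevin Bacon"],
    "Carol": ["Alice"],
    "Kevin Bacon": ["Bob"],
}
print(bacon_number(graph, "Carol"))  # Carol -> Alice -> Bob -> Kevin Bacon, so 3
```

The same search, run from every actor, identifies who sits closest to the center of the network on average.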

More critically, can we analyze this data to determine the probability that

someone will like a given movie? The technique of collaborative ﬁltering ﬁnds

people who liked ﬁlms that I also liked, and recommends other ﬁlms that they

liked as good candidates for me. The 2007 Netﬂix Prize was a $1,000,000 com-

petition to produce a ratings engine 10% better than the proprietary Netﬂix

system. The ultimate winner of this prize (BellKor) used a variety of data

sources and techniques, including the analysis of links [BK07].
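The core intuition behind collaborative filtering can be sketched in a few lines: find the viewer whose tastes overlap mine most, and borrow their other favorites. Real systems (including the Netflix Prize winners) use far more sophisticated matrix-factorization methods; the viewers and likes below are invented:

```python
def recommend(likes, me):
    """Tiny user-based collaborative filter: pick the user whose
    liked-film set overlaps mine the most, then suggest the films
    they liked that I have not seen."""
    best, best_overlap = None, -1
    for user, films in likes.items():
        if user == me:
            continue
        overlap = len(films & likes[me])
        if overlap > best_overlap:
            best, best_overlap = user, overlap
    return sorted(likes[best] - likes[me])

# Hypothetical viewing histories: sets of films each viewer liked.
likes = {
    "me": {"It's a Wonderful Life", "Casablanca"},
    "u1": {"It's a Wonderful Life", "Casablanca", "Rear Window"},
    "u2": {"Jaws"},
}
print(recommend(likes, "me"))  # ['Rear Window']
```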


Figure 1.5: The rise and fall of data processing, as witnessed by Google Ngrams.

1.2.3 Google Ngrams

Printed books have been the primary repository of human knowledge since

Gutenberg’s invention of movable type in 1439. Physical objects live somewhat

uneasily in today’s digital world, but technology has a way of reducing every-

thing to data. As part of its mission to organize the world’s information, Google

undertook an eﬀort to scan all of the world’s published books. They haven’t

quite gotten there yet, but the 30 million books thus far digitized represent over

20% of all books ever published.

Google uses this data to improve search results, and provide fresh access

to out-of-print books. But perhaps the coolest product is Google Ngrams, an

amazing resource for monitoring changes in the cultural zeitgeist. It provides

the frequency with which short phrases occur in books published each year.

Each phrase must occur at least forty times in their scanned book corpus. This

eliminates obscure words and phrases, but leaves over two billion time series

available for analysis.

This rich data set shows how language use has changed over the past 200

years, and has been widely applied to cultural trend analysis [MAV+11]. Figure

1.5 uses this data to show how the word data fell out of favor when thinking

about computing. Data processing was the popular term associated with the

computing ﬁeld during the punched card and spinning magnetic tape era of the

1950s. The Ngrams data shows that the rapid rise of Computer Science did not

eclipse Data Processing until 1980. Even today, Data Science remains almost

invisible on this scale.

Check out Google Ngrams at http://books.google.com/ngrams. I promise

you will enjoy playing with it. Compare hot dog to tofu, science against religion,

freedom to justice, and sex vs. marriage, to better understand this fantastic

telescope for looking into the past.

But once you are done playing, think of bigger things you could do if you

got your hands on this data. Assume you have access to the annual number

of references for all words/phrases published in books over the past 200 years.


Google makes this data freely available. So what are you going to do with it?

Observing the time series associated with particular words using the Ngrams

Viewer is fun. But more sophisticated historical trends can be captured by

aggregating multiple time series together. The following types of questions

seem particularly interesting to me:

•How has the amount of cursing changed over time? Use of the four-

letter words I am most familiar with seems to have exploded since 1960,

although it is perhaps less clear whether this reﬂects increased cussing or

lower publication standards.

•How often do new words emerge and get popular? Do these words tend

to stay in common usage, or rapidly fade away? Can we detect when

words change meaning over time, like the transition of gay from happy to

homosexual?

•Have standards of spelling been improving or deteriorating with time,

especially now that we have entered the era of automated spell check-

ing? Rarely-occurring words that are only one character removed from a

commonly-used word are likely candidates to be spelling errors (e.g. al-

gorithm vs. algorthm). Aggregated over many diﬀerent misspellings, are

such errors increasing or decreasing?
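The "one character removed" test is easy to make concrete. A standard construction (along the lines of the classic spell-corrector demos) generates every string at edit distance one from a word; counting how often such neighbors appear in the Ngrams corpus, relative to the word itself, flags likely misspellings:

```python
import string

def one_edit_away(word):
    """Every string at edit distance one from word: all
    single-character deletions, substitutions, and insertions."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    subs = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    return (deletes | subs | inserts) - {word}

candidates = one_edit_away("algorithm")
print("algorthm" in candidates)  # True: the deletion error from the text
```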

You can also use this Ngrams corpus to build a language model that captures

the meaning and usage of the words in a given language. We will discuss word

embeddings in Section 11.6.3, which are powerful tools for building language

models. Frequency counts reveal which words are most popular. The frequency

of word pairs appearing next to each other can be used to improve speech

recognition systems, helping to distinguish whether the speaker said that’s too

bad or that’s to bad. These millions of books provide an ample data set to build

representative models from.
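The word-pair idea boils down to counting bigrams. A minimal sketch, using a three-sentence toy corpus as a stand-in for the millions of scanned books:

```python
from collections import Counter

def bigram_counts(corpus):
    """Count adjacent word pairs across a list of sentences."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

# A toy stand-in for the Ngrams corpus.
corpus = ["that is too bad", "it is too late", "send it to them"]
counts = bigram_counts(corpus)
# "too" follows "is" more often than "to" does: exactly the kind of
# evidence a speech recognizer uses to pick the right homophone.
print(counts[("is", "too")], counts[("is", "to")])  # 2 0
```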

1.2.4 New York Taxi Records

Every ﬁnancial transaction today leaves a data trail behind it. Following these

paths can lead to interesting insights.

Taxi cabs form an important part of the urban transportation network. They

roam the streets of the city looking for customers, and then drive them to their

destination for a fare proportional to the length of the trip. Each cab contains

a metering device to calculate the cost of the trip as a function of time. This

meter serves as a record keeping device, and a mechanism to ensure that the

driver charges the proper amount for each trip.

The taxi meters currently employed in New York cabs can do many things

beyond calculating fares. They act as credit card terminals, providing a way


Figure 1.6: Representative ﬁelds from the New York city taxi cab data: pick up

and dropoﬀ points, distances, and fares.

for customers to pay for rides without cash. They are integrated with global

positioning systems (GPS), recording the exact location of every pickup and

drop oﬀ. And ﬁnally, since they are on a wireless network, these boxes can

communicate all of this data back to a central server.

The result is a database documenting every single trip by all taxi cabs in

one of the world’s greatest cities, a small portion of which is shown in Figure

1.6. Because the New York Taxi and Limousine Commission is a public agency,

its non-conﬁdential data is available to all under the Freedom of Information

Act (FOIA).

Every ride generates two records: one with data on the trip, the other with

details of the fare. Each trip is keyed to the medallion (license) of each car

coupled with the identiﬁer of each driver. For each trip, we get the time/date

of pickup and drop-oﬀ, as well as the GPS coordinates (longitude and latitude)

of the starting location and destination. We do not get GPS data of the route

they traveled between these points, but to some extent that can be inferred from

the shortest path between them.

As for fare data, we get the metered cost of each trip, including tax, surcharge

and tolls. It is traditional to pay the driver a tip for service, the amount of which

is also recorded in the data.

So I’m talking to you. This taxi data is readily available, with records of

over 80 million trips over the past several years. What are you going to do with

it?

Any interesting data set can be used to answer questions on many diﬀerent

scales. This taxi fare data can help us better understand the transportation

industry, but also how the city works and how we could make it work even

better. Natural questions with respect to the taxi industry include:


Figure 1.7: Which neighborhoods in New York city tip most generously? The

relatively remote outer boroughs of Brooklyn and Queens, where trips are

longest and supply is relatively scarce.

•How much money do drivers make each night, on average? What is the

distribution? Do drivers make more on sunny days or rainy days?

•Where are the best spots in the city for drivers to cruise, in order to pick

up proﬁtable fares? How does this vary at diﬀerent times of the day?

•How far do drivers travel over the course of a night’s work? We can’t

answer this exactly using this data set, because it does not provide GPS

data of the route traveled between fares. But we do know the last place

of drop oﬀ, the next place of pickup, and how long it took to get between

them. Together, this should provide enough information to make a sound

estimate.

•Which drivers take their unsuspecting out-of-town passengers for a “ride,”

running up the meter on what should be a much shorter, cheaper trip?

•How much are drivers tipped, and why? Do faster drivers get tipped

better? How do tipping rates vary by neighborhood, and is it the rich

neighborhoods or poor neighborhoods which prove more generous?

I will confess we did an analysis of this, which I will further describe in

the war story of Section 9.3. We found a variety of interesting patterns

[SS15]. Figure 1.7 shows that Manhattanites are generally cheapskates

relative to large swaths of Brooklyn, Queens, and Staten Island, where

trips are longer and street cabs a rare but welcome sight.
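One of the questions above asks how far drivers travel between fares, given only the GPS coordinates of the last drop-off and the next pickup. The great-circle (haversine) distance between two coordinates gives a lower bound on the driving distance. A minimal sketch, with hypothetical Midtown and Wall Street coordinates chosen for illustration:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two GPS points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # Earth's mean radius ~= 6371 km

# Hypothetical coordinates: a Midtown drop-off and a Wall Street pickup.
gap = haversine_km(40.7549, -73.9840, 40.7074, -74.0113)
print(round(gap, 1))  # straight-line km, a lower bound on distance driven
```

Summing these gaps together with the metered trip distances yields the sound estimate of a night's mileage that the question calls for.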


But the bigger questions have to do with understanding transportation in

the city. We can use the taxi travel times as a sensor to measure the level of

traﬃc in the city at a ﬁne level. How much slower is traﬃc during rush hour

than other times, and where are delays the worst? Identifying problem areas is

the ﬁrst step to proposing solutions, by changing the timing patterns of traﬃc

lights, running more buses, or creating high-occupancy only lanes.

Similarly we can use the taxi data to measure transportation ﬂows across

the city. Where are people traveling to, at diﬀerent times of the day? This tells

us much more than just congestion. By looking at the taxi data, we should

be able to see tourists going from hotels to attractions, executives from fancy

neighborhoods to Wall Street, and drunks returning home from nightclubs after

a bender.

Data like this is essential to designing better transportation systems. It is

wasteful for a single rider to travel from point a to point b when there is another
rider at point a′ who also wants to get there. Analysis of the taxi data enables

accurate simulation of a ride sharing system, so we can accurately evaluate the

demands and cost reductions of such a service.

1.3 Properties of Data

This book is about techniques for analyzing data. But what is the underlying

stuﬀ that we will be studying? This section provides a brief taxonomy of the

properties of data, so we can better appreciate and understand what we will be

working on.

1.3.1 Structured vs. Unstructured Data

Certain data sets are nicely structured, like the tables in a database or spread-

sheet program. Others record information about the state of the world, but in

a more heterogeneous way. Perhaps it is a large text corpus with images and

links like Wikipedia, or the complicated mix of notes and test results appearing

in personal medical records.

Generally speaking, this book will focus on dealing with structured data.

Data is often represented by a matrix, where the rows of the matrix represent

distinct items or records, and the columns represent distinct properties of these

items. For example, a data set about U.S. cities might contain one row for each

city, with columns representing features like state, population, and area.
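This rows-and-columns representation can be made concrete in a few lines of Python; the cities and feature values below are invented for illustration, not taken from any real data set:

```python
# Hypothetical matrix of U.S. cities: one row per city, one column per feature.
columns = ["city", "state", "population", "area_sq_mi"]
rows = [
    ["Stony Brook", "NY", 13740, 5.9],   # illustrative values only
    ["Ithaca",      "NY", 31710, 6.1],
]

# Each row is one record; zipping it with the column names recovers
# the properties of that item.
record = dict(zip(columns, rows[0]))
print(record["state"])  # NY
```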

When confronted with an unstructured data source, such as a collection of

tweets from Twitter, our ﬁrst step is generally to build a matrix to structure

it. A bag of words model will construct a matrix with a row for each tweet, and

a column for each frequently used vocabulary word. Matrix entry M[i, j] then

denotes the number of times tweet i contains word j. Such matrix formulations

will motivate our discussion of linear algebra, in Chapter 8.
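The bag of words construction just described can be sketched in a few lines of Python; the tweets below are invented for illustration:

```python
from collections import Counter

# Hypothetical mini-corpus of tweets (invented for illustration).
tweets = [
    "the game was great great fun",
    "traffic in the city was terrible",
    "the city game tonight",
]

# Vocabulary: every word used anywhere in the corpus.
vocab = sorted({word for tweet in tweets for word in tweet.split()})

# M[i][j] counts how many times tweet i contains vocabulary word j.
M = [[Counter(tweet.split())[word] for word in vocab] for tweet in tweets]

print(M[0][vocab.index("great")])  # 2 -- "great" appears twice in tweet 0
```

A real system would restrict the vocabulary to frequently used words and store the matrix sparsely, since most entries are zero.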

1.3.2 Quantitative vs. Categorical Data

Quantitative data consists of numerical values, like height and weight. Such data

can be incorporated directly into algebraic formulas and mathematical models,

or displayed in conventional graphs and charts.

By contrast, categorical data consists of labels describing the properties of

the objects under investigation, like gender, hair color, and occupation. This

descriptive information can be every bit as precise and meaningful as numerical

data, but it cannot be worked with using the same techniques.

Categorical data can usually be coded numerically. For example, gender

might be represented as male = 0 or female = 1. But things get more

complicated when there are more than two characters per feature, especially when

there is not an implicit order between them. We may be able to encode hair

colors as numbers by assigning each shade a distinct value like gray hair = 0,

red hair = 1, and blond hair = 2. However, we cannot really treat these val-

ues as numbers, for anything other than simple identity testing. Does it make

any sense to talk about the maximum or minimum hair color? What is the

interpretation of my hair color minus your hair color?
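The point can be illustrated directly: arithmetic on such label codes is meaningless, and one common workaround is one-hot encoding, which represents each category by its own 0/1 indicator. A minimal sketch, using the chapter's hypothetical hair-color codes:

```python
# Hypothetical integer codes for hair color (illustrative only).
codes = {"gray": 0, "red": 1, "blond": 2}

# The only safe operation on these codes is identity testing;
# "blond - red" has no sensible interpretation.
print(codes["red"] == codes["red"])  # True

def one_hot(label, categories):
    """Return a 0/1 vector with a single 1 marking the given category."""
    return [1 if c == label else 0 for c in categories]

# One-hot encoding avoids implying any order among the categories.
print(one_hot("red", ["gray", "red", "blond"]))  # [0, 1, 0]
```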

Most of what we do in this book will revolve around numerical data. But

keep an eye out for categorical features, and methods that work for them. Clas-

siﬁcation and clustering methods can be thought of as generating categorical

labels from numerical data, and will be a primary focus in this book.

1.3.3 Big Data vs. Little Data

Data science has become conﬂated in the public eye with big data, the analysis of

massive data sets resulting from computer logs and sensor devices. In principle,

having more data is always better than having less, because you can always

throw some of it away by sampling to get a smaller set if necessary.

Big data is an exciting phenomenon, and we will discuss it in Chapter 12. But

in practice, there are diﬃculties in working with large data sets. Throughout

this book we will look at algorithms and best practices for analyzing data. In

general, things get harder once the volume gets too large. The challenges of big

data include:

•The analysis cycle time slows as data size grows: Computational opera-

tions on data sets take longer as their volume increases. Small spreadsheets

provide instantaneous response, allowing you to experiment and play what

if? But large spreadsheets can be slow and clumsy to work with, and

massive-enough data sets might take hours or days to get answers from.

Clever algorithms can permit amazing things to be done with big data,

but staying small generally leads to faster analysis and exploration.

•Large data sets are complex to visualize: Plots with millions of points on

them are impossible to display on computer screens or printed images, let

alone conceptually understand. How can we ever hope to really understand

something we cannot see?

•Simple models do not require massive data to ﬁt or evaluate: A typical

data science task might be to make a decision (say, whether I should oﬀer

this fellow life insurance?) on the basis of a small number of variables:

say age, gender, height, weight, and the presence or absence of existing

medical conditions.

If I have this data on 1 million people with their associated life outcomes, I

should be able to build a good general model of coverage risk. It probably

wouldn’t help me build a substantially better model if I had this data

on hundreds of millions of people. The decision criteria on only a few

variables (like age and marital status) cannot be too complex, and should

be robust over a large number of applicants. Any observation that is so

subtle it requires massive data to tease out will prove irrelevant to a large

business which is based on volume.

Big data is sometimes called bad data. It is often gathered as the by-product

of a given system or procedure, instead of being purposefully collected to answer

your question at hand. The result is that we might have to go to heroic eﬀorts

to make sense of something just because we have it.

Consider the problem of getting a pulse on voter preferences among presi-

dential candidates. The big data approach might analyze massive Twitter or

Facebook feeds, interpreting clues to their opinions in the text. The small data

approach might be to conduct a poll, asking a few hundred people this speciﬁc

question and tabulating the results. Which procedure do you think will prove

more accurate? The right data set is the one most directly relevant to the tasks

at hand, not necessarily the biggest one.

Take-Home Lesson: Do not blindly aspire to analyze large data sets. Seek the

right data to answer a given question, not necessarily the biggest thing you can

get your hands on.

1.4 Classiﬁcation and Regression

Two types of problems arise repeatedly in traditional data science and pattern

recognition applications, the challenges of classiﬁcation and regression. As this

book has developed, I have pushed discussions of the algorithmic approaches

to solving these problems toward the later chapters, so they can beneﬁt from a

solid understanding of core material in data munging, statistics, visualization,

and mathematical modeling.

Still, I will mention issues related to classiﬁcation and regression as they

arise, so it makes sense to pause here for a quick introduction to these problems,

to help you recognize them when you see them.

•Classiﬁcation: Often we seek to assign a label to an item from a discrete

set of possibilities. Such problems as predicting the winner of a particular

sporting contest (team A or team B?) or deciding the genre of a given

movie (comedy, drama, or animation?) are classiﬁcation problems, since

each entails selecting a label from the possible choices.

•Regression: Another common task is to forecast a given numerical quan-

tity. Predicting a person’s weight or how much snow we will get this year

is a regression problem, where we forecast the future value of a numerical

function in terms of previous values and other relevant features.

Perhaps the best way to see the intended distinction is to look at a variety

of data science problems and label (classify) them as regression or classiﬁcation.

Diﬀerent algorithmic methods are used to solve these two types of problems,

although the same questions can often be approached in either way:

•Will the price of a particular stock be higher or lower tomorrow? (classi-

ﬁcation)

•What will the price of a particular stock be tomorrow? (regression)

•Is this person a good risk to sell an insurance policy to? (classiﬁcation)

•How long do we expect this person to live? (regression)

Keep your eyes open for classiﬁcation and regression problems as you en-

counter them in your life, and in this book.
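The stock-price questions above show that the same raw data can feed either problem, depending on how the target is framed. A minimal Python sketch with invented prices:

```python
# Toy closing prices for a hypothetical stock (illustrative only).
prices = [10.0, 10.5, 10.2, 10.8, 11.1]

# Regression target: tomorrow's actual price, paired with today's.
regression_pairs = [(prices[i], prices[i + 1]) for i in range(len(prices) - 1)]

# Classification label: will tomorrow be higher (1) or lower/equal (0)?
classification_pairs = [
    (prices[i], 1 if prices[i + 1] > prices[i] else 0)
    for i in range(len(prices) - 1)
]

print(regression_pairs[0])      # (10.0, 10.5)
print(classification_pairs[1])  # (10.5, 0) -- the price fell the next day
```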

1.5 Data Science Television: The Quant Shop

I believe that hands-on experience is necessary to internalize basic principles.

Thus when I teach data science, I like to give each student team an interesting

but messy forecasting challenge, and demand that they build and evaluate a

predictive model for the task.

These forecasting challenges are associated with events where the students

must make testable predictions. They start from scratch: ﬁnding the relevant

data sets, building their own evaluation environments, and devising their model.

Finally, I make them watch the event as it unfolds, so as to witness the vindi-

cation or collapse of their prediction.

As an experiment, we documented the evolution of each group’s project

on video in Fall 2014. Professionally edited, this became The Quant Shop, a

television-like data science series for a general audience. The eight episodes of

this ﬁrst season are available at http://www.quant-shop.com, and include:

•Finding Miss Universe – The annual Miss Universe competition aspires

to identify the most beautiful woman in the world. Can computational

models predict who will win a beauty contest? Is beauty just subjective,

or can algorithms tell who is the fairest one of all?

•Modeling the Movies – The business of movie making involves a lot of

high-stakes data analysis. Can we build models to predict which ﬁlm will

gross the most on Christmas day? How about identifying which actors

will receive awards for their performance?

•Winning the Baby Pool – Birth weight is an important factor in assessing

the health of a newborn child. But how accurately can we predict junior’s

weight before the actual birth? How can data clarify environmental risks

to developing pregnancies?

•The Art of the Auction – The world’s most valuable artworks sell at auc-

tions to the highest bidder. But can we predict how many millions a

particular J.W. Turner painting will sell for? Can computers develop an

artistic sense of what’s worth buying?

•White Christmas – Weather forecasting is perhaps the most familiar do-

main of predictive modeling. Short-term forecasts are generally accurate,

but what about longer-term prediction? What places will wake up to a

snowy Christmas this year? And can you tell one month in advance?

•Predicting the Playoﬀs – Sports events have winners and losers, and book-

ies are happy to take your bets on the outcome of any match. How well can

statistics help predict which football team will win the Super Bowl? Can

Google’s PageRank algorithm pick the winners on the ﬁeld as accurately

as it does on the web?

•The Ghoul Pool – Death comes to all men, but when? Can we apply

actuarial models to celebrities, to decide who will be the next to die?

Similar analysis underlies the workings of the life insurance industry, where

accurate predictions of lifespan are necessary to set premiums which are

both sustainable and aﬀordable.

Figure 1.8: Exciting scenes from data science television: The Quant Shop.

•Playing the Market – Hedge fund quants get rich when guessing right

about tomorrow’s prices, and poor when wrong. How accurately can we

predict future prices of gold and oil using histories of price data? What

other information goes into building a successful price model?

I encourage you to watch some episodes of The Quant Shop in tandem with

reading this book. We try to make it fun, although I am sure you will ﬁnd

plenty of things to cringe at. Each show runs for thirty minutes, and may even

inspire you to tackle a prediction challenge of your own.

These programs will certainly give you more insight into these eight speciﬁc

challenges. I will use these projects throughout this book to illustrate important

lessons in how to do data science, both as positive and negative examples. These

projects provide a laboratory to see how intelligent but inexperienced people not

wildly unlike yourself thought about a data science problem, and what happened

when they did.

1.5.1 Kaggle Challenges

Another source of inspiration is the challenges from Kaggle (www.kaggle.com),

which provides a competitive forum for data scientists. New challenges are

posted on a regular basis, providing a problem deﬁnition, training data, and

a scoring function over hidden evaluation data. A leader board displays the

scores of the strongest competitors, so you can see how well your model stacks

up in comparison with your opponents. The winners spill their modeling secrets

during post-contest interviews, to help you improve your modeling skills.

Performing well on Kaggle challenges is an excellent credential to put on your

resume to get a good job as a data scientist. Indeed, potential employers will

track you down if you are a real Kaggle star. But the real reason to participate

is that the problems are fun and inspiring, and practice helps make you a better

data scientist.

The exercises at the end of each chapter point to expired Kaggle challenges,

loosely connected to the material in that chapter. Be forewarned that Kaggle

provides a misleadingly glamorous view of data science as applied machine learn-

ing, because it presents extremely well-deﬁned problems with the hard work

of data collection and cleaning already done for you. Still, I encourage you to

check it out for inspiration, and as a source of data for new projects.

1.6 About the War Stories

Genius and wisdom are two distinct intellectual gifts. Genius shows in discover-

ing the right answer, making imaginative mental leaps which overcome obstacles

and challenges. Wisdom shows in avoiding obstacles in the ﬁrst place, providing

a sense of direction or guiding light that keeps us moving soundly in the right

direction.

Genius is manifested in technical strength and depth, the ability to see things

and do things that other people cannot. In contrast, wisdom comes from ex-

perience and general knowledge. It comes from listening to others. Wisdom

comes from humility, observing how often you have been wrong in the past and

ﬁguring out why you were wrong, so as to better recognize future traps and

avoid them.

Data science, like most things in life, beneﬁts more from wisdom than from

genius. In this book, I seek to pass on wisdom that I have accumulated the hard

way through war stories, gleaned from a diverse set of projects I have worked

on:

•Large-scale text analytics and NLP: My Data Science Laboratory at Stony

Brook University works on a variety of projects in big data, including sen-

timent analysis from social media, historical trends analysis, deep learning

approaches to natural language processing (NLP), and feature extraction

from networks.

•Start-up companies: I served as co-founder and chief scientist to two

data analytics companies: General Sentiment and Thrivemetrics. General

Sentiment analyzed large-scale text streams from news, blogs, and social

media to identify trends in the sentiment (positive or negative) associated

with people, places, and things. Thrivemetrics applied this type of analysis

to internal corporate communications, like email and messaging systems.

Neither of these ventures left me wealthy enough to forgo my royalties

from this book, but they did provide me with experience on cloud-based

computing systems, and insight into how data is used in industry.

•Collaborating with real scientists: I have had several interesting collab-

orations with biologists and social scientists, which helped shape my un-

derstanding of the complexities of working with real data. Experimental

data is horribly noisy and riddled with errors, yet you must do the best

you can with what you have, in order to discover how the world works.

•Building gambling systems: A particularly amusing project was building

a system to predict the results of jai-alai matches so we could bet on them,

an experience recounted in my book Calculated Bets: Computers, Gam-

bling, and Mathematical Modeling to Win [Ski01]. Our system relied on

web scraping for data collection, statistical analysis, simulation/modeling,

and careful evaluation. We also have developed and evaluated predictive

models for movie grosses [ZS09], stock prices [ZS10], and football games

[HS10] using social media analysis.

•Ranking historical ﬁgures: By analyzing Wikipedia to extract meaningful

variables on over 800,000 historical ﬁgures, we developed a scoring func-

tion to rank them by their strength as historical memes. This ranking

does a great job separating the greatest of the great (Jesus, Napoleon,

Shakespeare, Mohammad, and Lincoln round out the top ﬁve) from lesser

mortals, and served as the basis for our book Who’s Bigger?: Where His-

torical Figures Really Rank [SW13].

All this experience drives what I teach in this book, especially the tales that

I describe as war stories. Every one of these war stories is true. Of course, the

stories improve somewhat in the retelling, and the dialogue has been punched

up to make them more interesting to read. However, I have tried to honestly

trace the process of going from a raw problem to a solution, so you can watch

how it unfolded.

1.7 War Story: Answering the Right Question

Our research group at Stony Brook University developed an NLP-based system

for analyzing millions of news, blogs and social media messages, and reducing

this text to trends concerning all the entities under discussion. Counting the

number of mentions each name receives in a text stream (volume) is easy, in

principle. Determining whether the connotation of a particular reference is

positive or negative (sentiment analysis) is hard. But our system did a pretty

good job, particularly when aggregated over many references.

This technology served as the foundation for a social media analysis company

named General Sentiment. It was exciting living through a start-up starting up,

facing the challenges of raising money, hiring staﬀ, and developing new products.

But perhaps the biggest problem we faced was answering the right question.

The General Sentiment system recorded trends about the sentiment and volume

for every person, place, and thing that was ever mentioned in news, blogs, and

social media: over 20 million distinct entities. We monitored the reputations of

celebrities and politicians. We monitored the fates of companies and products.

We tracked the performance of sports teams, and the buzz about movies. We

could do anything!

But it turns out that no one pays you to do anything. They pay you to do

something, to solve a particular problem they have, or eliminate a speciﬁc pain

point in their business. Being able to do anything proves to be a terrible sales

strategy, because it requires you to ﬁnd that need afresh for each and every

customer.

Facebook didn’t open up to the world until September 2006. So when Gen-

eral Sentiment started in 2008, we were at the very beginning of the social media

era. We had lots of interest from major brands and advertising agencies which

knew that social media was ready to explode. They knew this newfangled thing

was important, and that they had to be there. They knew that proper analysis

of social media data could give them fresh insights into what their customers

were thinking. But they didn’t know exactly what it was they really wanted to

know.

One aircraft engine manufacturer was very interested in learning how much

the kids talked about them on Facebook. We had to break it to them gently

that the answer was zero. Other potential customers demanded proof that we

were more accurate than the Nielsen television ratings. But of course, if you

wanted Nielsen ratings then you should buy them from Nielsen. Our system

provided diﬀerent insights from a completely diﬀerent world. But you had to

know what you wanted in order to use them.

We did manage to get substantial contracts from a very diverse group of

customers, including consumer brands like Toyota and Blackberry, governmental

organizations like the Hawaii tourism oﬃce, and even the presidential campaign

of Republican nominee Mitt Romney in 2012. Our analysts provided them

insights into a wide variety of business issues:

•What did people think about Hawaii? (Answer: they think it is a very

nice place to visit.)

•How quickly would Toyota’s sentiment recover after news of serious brake

problems in their cars? (Answer: about six months.)

•What did people think about Blackberry’s new phone models? (Answer:

they liked the iPhone much better.)

•How quickly would Romney’s sentiment recover after insulting 47% of the

electorate in a recorded speech? (Answer: never.)

But each sale required entering a new universe, involving considerable eﬀort

and imagination on the part of our sales staﬀ and research analysts. We never

managed to get two customers in the same industry, which would have let us

beneﬁt from scale and accumulated wisdom.

Of course, the customer is always right. It was our fault that we could not

explain to them the best way to use our technology. The lesson here is that the

world will not beat a path to your door just for a new source of data. You must

be able to supply the right questions before you can turn data into money.

1.8 Chapter Notes

The idea of using historical records from baseball players to establish that left-

handers have shorter lifespans is due to Halpern and Coren [HC88, HC91],

but their conclusion remains controversial. The percentage of left-handers in

the population has been rapidly growing, and the observed eﬀects may be a

function of survivorship bias [McM04]. So lefties, hang in there! Full disclosure:

I am one of you.

The discipline of quantitative baseball analysis is sometimes called sabermet-

rics, and its leading light is a fellow named Bill James. I recommend budding

data scientists read his Historical Baseball Abstract [Jam10] as an excellent ex-

ample of how one turns numbers into knowledge and understanding. Time

Magazine once said of James: “Much of the joy of reading him comes from the

extravagant spectacle of a ﬁrst-rate mind wasting itself on baseball.” I thank

http://sports-reference.com for permission to use images of their website

in this book. Ditto to Amazon, the owner of IMDb.

The potential of ride-sharing systems in New York was studied by Santi et

al. [SRS+14], who showed that almost 95% of the trips could have been shared

with no more than ﬁve minutes delay per trip.

The Lydia system for sentiment analysis is described in [GSS07]. Methods

to identify changes in word meaning through analysis of historical text corpora

like Google Ngram are reported in [KARPS15].

1.9 Exercises

Identifying Data Sets

1-1. [3] Identify where interesting data sets relevant to the following domains can be

found on the web:

(a) Books.

(b) Horse racing.

(c) Stock prices.

(d) Risks of diseases.

(e) Colleges and universities.

(f) Crime rates.

(g) Bird watching.

For each of these data sources, explain what you must do to turn this data into

a usable format on your computer for analysis.

1-2. [3] Propose relevant data sources for the following The Quant Shop prediction

challenges. Distinguish between sources of data that you are sure somebody must

have, and those where the data is clearly available to you.

(a) Miss Universe.

(b) Movie gross.

(c) Baby weight.

(d) Art auction price.

(e) White Christmas.

(f) Football champions.

(g) Ghoul pool.

(h) Gold/oil prices.

1-3. [3] Visit http://data.gov, and identify ﬁve data sets that sound interesting to

you. For each write a brief description, and propose three interesting things you

might do with them.

Asking Questions

1-4. [3] For each of the following data sources, propose three interesting questions

you can answer by analyzing them:

(a) Credit card billing data.

(b) Click data from http://www.Amazon.com.

(c) White Pages residential/commercial telephone directory.

1-5. [5] Visit Entrez, the National Center for Biotechnology Information (NCBI)

portal. Investigate what data sources are available, particularly the Pubmed

and Genome resources. Propose three interesting projects to explore with each

of them.

1-6. [5] You would like to conduct an experiment to establish whether your friends

prefer the taste of regular Coke or Diet Coke. Brieﬂy outline a design for such

a study.

1-7. [5] You would like to conduct an experiment to see whether students learn better

if they study without any music, with instrumental music, or with songs that

have lyrics. Brieﬂy outline the design for such a study.

1-8. [5] Traditional polling operations like Gallup use a procedure called random digit

dialing, which dials random strings of digits instead of picking phone numbers

from the phone book. Suggest why such polls are conducted using random digit

dialing.

Implementation Projects

1-9. [5] Write a program to scrape the best-seller rank for a book on Amazon.com.

Use this to plot the rank of all of Skiena’s books over time. Which one of these

books should be the next item that you purchase? Do you have friends for whom

they would make a welcome and appropriate gift? :-)

1-10. [5] For your favorite sport (baseball, football, basketball, cricket, or soccer)

identify a data set with the historical statistical records for all major partici-

pants. Devise and implement a ranking system to identify the best player at

each position.

Interview Questions

1-11. [3] For each of the following questions: (1) produce a quick guess based only on

your understanding of the world, and then (2) use Google to ﬁnd supportable

numbers to produce a more principled estimate from. How much did your two

estimates diﬀer by?

(a) How many piano tuners are there in the entire world?

(b) How much does the ice in a hockey rink weigh?

(c) How many gas stations are there in the United States?

(d) How many people ﬂy in and out of LaGuardia Airport every day?

(e) How many gallons of ice cream are sold in the United States each year?

(f) How many basketballs are purchased by the National Basketball Associa-

tion (NBA) each year?

(g) How many ﬁsh are there in all the world’s oceans?

(h) How many people are ﬂying in the air right now, all over the world?

(i) How many ping-pong balls can ﬁt in a large commercial jet?

(j) How many miles of paved road are there in your favorite country?

(k) How many dollar bills are sitting in the wallets of all people at Stony Brook

University?

(l) How many gallons of gasoline does a typical gas station sell per day?

(m) How many words are there in this book?

(n) How many cats live in New York city?

(o) How much would it cost to ﬁll a typical car’s gas tank with Starbuck’s

coﬀee?

(p) How much tea is there in China?

(q) How many checking accounts are there in the United States?

1-12. [3] What is the diﬀerence between regression and classiﬁcation?

1-13. [8] How would you build a data-driven recommendation system? What are the

limitations of this approach?

1-14. [3] How did you become interested in data science?

1-15. [3] Do you think data science is an art or a science?

Kaggle Challenges

1-16. Who survived the shipwreck of the Titanic?

https://www.kaggle.com/c/titanic

1-17. Where is a particular taxi cab going?

https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i

1-18. How long will a given taxi trip take?

https://www.kaggle.com/c/pkdd-15-taxi-trip-time-prediction-ii

Chapter 2

Mathematical Preliminaries

A data scientist is someone who knows more statistics than a com-

puter scientist and more computer science than a statistician.

– Josh Blumenstock

You must walk before you can run. Similarly, there is a certain level of mathe-

matical maturity which is necessary before you should be trusted to do anything

meaningful with numerical data.

In writing this book, I have assumed that the reader has had some degree

of exposure to probability and statistics, linear algebra, and continuous math-

ematics. I have also assumed that they have probably forgotten most of it, or

perhaps didn’t always see the forest (why things are important, and how to use

them) for the trees (all the details of deﬁnitions, proofs, and operations).

This chapter will try to refresh your understanding of certain basic math-

ematical concepts. Follow along with me, and pull out your old textbooks if

necessary for future reference. Deeper concepts will be introduced later in the

book when we need them.

2.1 Probability

Probability theory provides a formal framework for reasoning about the likeli-

hood of events. Because it is a formal discipline, there are a thicket of associated

deﬁnitions to instantiate exactly what we are reasoning about:

•An experiment is a procedure which yields one of a set of possible out-

comes. As our ongoing example, consider the experiment of tossing two

six-sided dice, one red and one blue, with each face bearing a distinct

integer {1,...,6}.

•A sample space S is the set of possible outcomes of an experiment. In our

dice example, there are 36 possible outcomes, namely

S = {(1,1), (1,2), (1,3), (1,4), (1,5), (1,6), (2,1), (2,2), (2,3), (2,4), (2,5), (2,6),

(3,1), (3,2), (3,3), (3,4), (3,5), (3,6), (4,1), (4,2), (4,3), (4,4), (4,5), (4,6),

(5,1), (5,2), (5,3), (5,4), (5,5), (5,6), (6,1), (6,2), (6,3), (6,4), (6,5), (6,6)}.

•An event E is a speciﬁed subset of the outcomes of an experiment. The

event that the sum of the dice equals 7 or 11 (the conditions to win at

craps on the ﬁrst roll) is the subset

E = {(1,6), (2,5), (3,4), (4,3), (5,2), (6,1), (5,6), (6,5)}.

•The probability of an outcome s, denoted p(s), is a number with the two

properties:

–For each outcome s in sample space S, 0 ≤ p(s) ≤ 1.

–The sum of probabilities of all outcomes adds to one: Σ_{s∈S} p(s) = 1.

If we assume two distinct fair dice, the probability p(s) = (1/6) × (1/6) =

1/36 for all outcomes s ∈ S.

•The probability of an event E is the sum of the probabilities of the outcomes

of the experiment. Thus

p(E) = Σ_{s∈E} p(s).

An alternate formulation is in terms of the complement of the event Ē, the

case when E does not occur. Then

P(E) = 1 − P(Ē).

This is useful, because often it is easier to analyze P(Ē) than P(E) directly.

•A random variable V is a numerical function on the outcomes of a proba-

bility space. The function “sum the values of two dice” (V((a, b)) = a + b)

produces an integer result between 2 and 12. This implies a probabil-

ity distribution of the values of the random variable. The probability

P(V(s) = 7) = 1/6, as previously shown, while P(V(s) = 12) = 1/36.

•The expected value of a random variable V deﬁned on a sample space S,

E(V), is deﬁned as

E(V) = Σ_{s∈S} p(s) · V(s).
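All of these definitions can be checked by brute-force enumeration of the two-dice sample space; a short Python sketch (not from the book):

```python
from fractions import Fraction
from itertools import product

# Sample space S: all 36 ordered outcomes of rolling two fair dice.
S = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)  # each outcome is equally likely

# Event E: the sum is 7 or 11, the first-roll win at craps.
E = [s for s in S if sum(s) in (7, 11)]
print(len(E) * p)  # p(E) = 8/36 = 2/9

# Random variable V sums the two dice: P(V = 7) and P(V = 12).
print(sum(p for s in S if sum(s) == 7))   # 1/6
print(sum(p for s in S if sum(s) == 12))  # 1/36

# Expected value E(V): sum of p(s) * V(s) over all outcomes.
print(sum(p * sum(s) for s in S))  # 7
```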

All this you have presumably seen before. But it provides the language we

will use to connect between probability and statistics. The data we see usually

comes from measuring properties of observed events. The theory of probability

and statistics provides the tools to analyze this data.

2.1.1 Probability vs. Statistics

Probability and statistics are related areas of mathematics which concern them-

selves with analyzing the relative frequency of events. Still, there are funda-

mental diﬀerences in the way they see the world:

•Probability deals with predicting the likelihood of future events, while

statistics involves the analysis of the frequency of past events.

•Probability is primarily a theoretical branch of mathematics, which studies

the consequences of mathematical deﬁnitions. Statistics is primarily an

applied branch of mathematics, which tries to make sense of observations

in the real world.

Both subjects are important, relevant, and useful. But they are diﬀerent, and

understanding the distinction is crucial in properly interpreting the relevance

of mathematical evidence. Many a gambler has gone to a cold and lonely grave

for failing to make the proper distinction between probability and statistics.

This distinction will perhaps become clearer if we trace the thought process

of a mathematician encountering her ﬁrst craps game:

•If this mathematician were a probabilist, she would see the dice and think

“Six-sided dice? Each side of the dice is presumably equally likely to land

face up. Now assuming that each face comes up with probability 1/6, I

can ﬁgure out what my chances are of crapping out.”

•If instead a statistician wandered by, she would see the dice and think

“How do I know that they are not loaded? I’ll watch a while, and keep

track of how often each number comes up. Then I can decide if my ob-

servations are consistent with the assumption of equal-probability faces.

Once I’m conﬁdent enough that the dice are fair, I’ll call a probabilist to

tell me how to bet.”

In summary, probability theory enables us to ﬁnd the consequences of a

given ideal world, while statistical theory enables us to measure the extent to

which our world is ideal. This constant tension between theory and practice is

why statisticians prove to be a tortured group of individuals compared with the

happy-go-lucky probabilists.

Modern probability theory first emerged from the dice tables of France in 1654. Chevalier de Méré, a French nobleman, wondered whether the player or the house had the advantage in a particular betting game.¹ In the basic version, the player rolls four dice, and wins provided none of them are a 6. The house collects on the even money bet if at least one 6 appears.

De Méré brought this problem to the attention of the French mathematicians Blaise Pascal and Pierre de Fermat, the latter most famous as the source of Fermat's Last Theorem. Together, these men worked out the basics of probability theory,

¹He really shouldn't have wondered. The house always has the advantage.

30 CHAPTER 2. MATHEMATICAL PRELIMINARIES

Figure 2.1: Venn diagrams illustrating set difference (left), intersection (middle), and union (right).

along the way establishing that the house wins this dice game with probability p = 1 − (5/6)⁴ ≈ 0.517, where the probability p = 0.5 would denote a fair game where the house wins exactly half the time.
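De Méré's odds can be verified exactly in a line or two (an illustrative computation, not part of the original text):

```python
from fractions import Fraction

# The house wins unless all four dice avoid showing a 6,
# which happens with probability (5/6)^4.
p_house = 1 - Fraction(5, 6) ** 4

print(p_house)         # 671/1296
print(float(p_house))  # about 0.5177, the ~0.517 quoted in the text
```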

2.1.2 Compound Events and Independence

We will be interested in complex events computed from simpler events A and B on the same set of outcomes. Perhaps event A is that at least one of two dice be an even number, while event B denotes rolling a total of either 7 or 11. Note that there exist certain outcomes of A which are not outcomes of B, specifically

A−B={(1,2),(1,4),(2,1),(2,2),(2,3),(2,4),(2,6),(3,2),(3,6),(4,1),

(4,2),(4,4),(4,5),(4,6),(5,4),(6,2),(6,3),(6,4),(6,6)}.

This is the set diﬀerence operation. Observe that here B−A={}, because

every pair adding to 7 or 11 must contain one odd and one even number.

The outcomes in common between both events A and B are called the intersection, denoted A∩B. This can be written as

A∩B = A − (S − B).

Outcomes which appear in either A or B are called the union, denoted A∪B. With the complement operation Ā = S − A, we get a rich language for combining events, shown in Figure 2.1. We can readily compute the probability of any of these sets by summing the probabilities of the outcomes in the defined sets.

The events A and B are independent if and only if

P(A∩B) = P(A)×P(B).

This means that there is no special structure of outcomes shared between events A and B. Assuming that half of the students in my class are female, and half the students in my class are above average, we would expect that a quarter of my students are both female and above average if the events are independent.
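This definition is easy to test directly on the dice events above (a sketch; the helper `prob` is our own, not the book's):

```python
from fractions import Fraction

outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]

def prob(event):
    """Probability of an event, given as a set of equally likely outcomes."""
    return Fraction(len(event), 36)

# Events from the text: A = at least one even die, B = total of 7 or 11.
A = {(a, b) for a, b in outcomes if a % 2 == 0 or b % 2 == 0}
B = {(a, b) for a, b in outcomes if a + b in (7, 11)}

# Dependent: P(A ∩ B) != P(A) * P(B).
print(prob(A & B), prob(A) * prob(B))  # 2/9 versus 1/6

# By contrast, "first die even" and "second die even" are independent.
C = {(a, b) for a, b in outcomes if a % 2 == 0}
D = {(a, b) for a, b in outcomes if b % 2 == 0}
print(prob(C & D) == prob(C) * prob(D))  # True
```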


Probability theorists love independent events, because it simplifies their calculations. But data scientists generally don't. When building models to predict the likelihood of some future event B, given knowledge of some previous event A, we want as strong a dependence of B on A as possible.

Suppose I always use an umbrella if and only if it is raining. Assume that the probability it is raining here (event B) is, say, p = 1/5. This implies the probability that I am carrying my umbrella (event A) is q = 1/5. But even more, if you know the state of the rain you know exactly whether I have my umbrella. These two events are perfectly correlated.

By contrast, suppose the events were independent. Then

P(A|B) = P(A∩B) / P(B) = P(A) P(B) / P(B) = P(A)

and whether it is raining has absolutely no impact on whether I carry my protective gear.

Correlations are the driving force behind predictive models, so we will discuss

how to measure them and what they mean in Section 2.3.

2.1.3 Conditional Probability

When two events are correlated, there is a dependency between them which makes calculations more difficult. The conditional probability of A given B, P(A|B), is defined:

P(A|B) = P(A∩B) / P(B)

Recall the dice rolling events from Section 2.1.2, namely:

• Event A is that at least one of two dice be an even number.

• Event B is the sum of the two dice is either a 7 or an 11.

Observe that P(A|B) = 1, because any roll summing to an odd value must consist of one even and one odd number. Thus A∩B = B, analogous to the umbrella case above. For P(B|A), note that P(A∩B) = 8/36 and P(A) = 27/36, so P(B|A) = 8/27.

Conditional probability will be important to us, because we are interested in the likelihood of an event A (perhaps that a particular piece of email is spam) as a function of some evidence B (perhaps the distribution of words within the document). Classification problems generally reduce to computing conditional probabilities, in one way or another.

Our primary tool to compute conditional probabilities will be Bayes theorem, which reverses the direction of the dependencies:

P(B|A) = P(A|B) P(B) / P(A)

Figure 2.2: The probability density function (pdf) of the sum of two dice contains exactly the same information as the cumulative density function (cdf), but looks very different.

Often it proves easier to compute probabilities in one direction than another, as in this problem. By Bayes theorem P(B|A) = (1 · 8/36)/(27/36) = 8/27, exactly what we got before. We will revisit Bayes theorem in Section 5.6, where it will establish the foundations of computing probabilities in the face of evidence.
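These calculations are easy to mechanize. The sketch below (ours, not the book's) recomputes P(A|B), P(B|A), and the Bayes-theorem reversal for the dice events:

```python
from fractions import Fraction

outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]

def prob(event):
    return Fraction(len(event), 36)

A = {(a, b) for a, b in outcomes if a % 2 == 0 or b % 2 == 0}  # at least one even
B = {(a, b) for a, b in outcomes if a + b in (7, 11)}          # sum is 7 or 11

# Conditional probability from the definition P(A|B) = P(A ∩ B) / P(B).
p_A_given_B = prob(A & B) / prob(B)   # 1, since B is a subset of A
p_B_given_A = prob(A & B) / prob(A)   # 8/27

# Bayes theorem reverses the direction of the dependency.
bayes = p_A_given_B * prob(B) / prob(A)
print(p_A_given_B, p_B_given_A, bayes)  # 1 8/27 8/27
```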

2.1.4 Probability Distributions

Random variables are numerical functions where the values are associated with probabilities of occurrence. In our example, where V(s) is the sum of two tossed dice, the function produces an integer between 2 and 12. The probability of a particular value V(s) = X is the sum of the probabilities of all the outcomes which add up to X.

Such random variables can be represented by their probability density function, or pdf. This is a graph where the x-axis represents the range of values the random variable can take on, and the y-axis denotes the probability of that given value. Figure 2.2 (left) presents the pdf of the sum of two fair dice. Observe that the peak at X = 7 corresponds to the most frequent dice total, with a probability of 1/6.

Such pdf plots have a strong relationship to histograms of data frequency, where the x-axis again represents the range of values, but y now represents the observed frequency of exactly how many event occurrences were seen for each given value X. Converting a histogram to a pdf can be done by dividing each bucket by the total frequency over all buckets. The sum of the entries then becomes 1, so we get a probability distribution.

Histograms are statistical: they reﬂect actual observations of outcomes. In

contrast, pdfs are probabilistic: they represent the underlying chance that the

next observation will have value X. We often use the histogram of observations

h(x) in practice to estimate the probabilities² by normalizing counts by the total number of observations:

P(k = X) = h(k = X) / Σ_x h(x = X)

Figure 2.3: iPhone quarterly sales data presented as cumulative and incremental (quarterly) distributions. Which curve did Apple CEO Tim Cook choose to present?

²A technique called discounting offers a better way to estimate the frequency of rare events, and will be discussed in Section 11.1.2.

There is another way to represent random variables which often proves useful, called a cumulative density function or cdf. The cdf is the running sum of the probabilities in the pdf; as a function of k, it reflects the probability that X ≤ k instead of the probability that X = k. Figure 2.2 (right) shows the cdf of the dice sum distribution. The values increase monotonically from left to right, because each term comes from adding a positive probability to the previous total. The rightmost value is 1, because all outcomes produce a value no greater than the maximum.

It is important to realize that the pdf P(V) and cdf C(V) of a given random variable V contain exactly the same information. We can move back and forth between them because:

P(X = k) = C(X ≤ k) − C(X ≤ k − δ),

where δ = 1 for integer distributions. The cdf is the running sum of the pdf, so

C(X ≤ k) = Σ_{x≤k} P(X = x).

Just be aware of which distribution you are looking at. Cumulative distribu-

tions always get higher as we move to the right, culminating with a probability

of C(X≤ ∞) = 1. By contrast, the total area under the curve of a pdf equals

1, so the probability at any point in the distribution is generally substantially

less.
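The histogram-to-pdf-to-cdf pipeline can be sketched in a few lines (an illustrative Python fragment, with all names our own):

```python
from collections import Counter
from itertools import accumulate

# Histogram of the sum of two fair dice over the full sample space.
hist = Counter(a + b for a in range(1, 7) for b in range(1, 7))
total = sum(hist.values())

values = sorted(hist)                      # 2 .. 12
pdf = [hist[v] / total for v in values]    # normalize so entries sum to 1
cdf = list(accumulate(pdf))                # running sum of the pdf

print(round(pdf[values.index(7)], 4))  # 0.1667, the peak at X = 7
print(round(cdf[-1], 4))               # 1.0, the rightmost cdf value

# Recover the pdf from the cdf: P(X = k) = C(X <= k) - C(X <= k - 1).
recovered = [cdf[0]] + [cdf[i] - cdf[i - 1] for i in range(1, len(cdf))]
print(all(abs(p - q) < 1e-12 for p, q in zip(pdf, recovered)))  # True
```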


An amusing example of the diﬀerence between cumulative and incremental

distributions is shown in Figure 2.3. Both distributions show exactly the same

data on Apple iPhone sales, but which curve did Apple CEO Tim Cook choose to

present at a major shareholder event? The cumulative distribution (red) shows

that sales are exploding, right? But it presents a misleading view of growth

rate, because incremental change is the derivative of this function, and hard to

visualize. Indeed, the sales-per-quarter plot (blue) shows that the rate of iPhone

sales actually had declined for the last two periods before the presentation.

2.2 Descriptive Statistics

Descriptive statistics provide ways of capturing the properties of a given data

set or sample. They summarize observed data, and provide a language to talk

about it. Representing a group of elements by a new derived element, like

mean, min, count, or sum reduces a large data set to a small summary statistic:

aggregation as data reduction.

Such statistics can become features in their own right when taken over natu-

ral groups or clusters in the full data set. There are two main types of descriptive

statistics:

•Central tendency measures, which capture the center around which the

data is distributed.

•Variation or variability measures, which describe the data spread, i.e. how

far the measurements lie from the center.

Together these statistics tell us an enormous amount about our distribution.

2.2.1 Centrality Measures

The first elements of statistics we are exposed to in school are the basic centrality measures: mean, median, and mode. These are the right place to start when thinking of a single number to characterize a data set.

• Mean: You are probably quite comfortable with the use of the arithmetic mean, where we sum values and divide by the number of observations:

µ_X = (1/n) Σ_{i=1}^{n} x_i

We can easily maintain the mean under a stream of insertions and deletions, by keeping the sum of values separate from the frequency count, and dividing only on demand.

The mean is very meaningful to characterize symmetric distributions without outliers, like height and weight. That it is symmetric means the number of items above the mean should be roughly the same as the number below. That it is without outliers means that the range of values is reasonably tight. Note that a single MAXINT creeping into an otherwise sound set of observations throws the mean wildly off. The median is a centrality measure which proves more appropriate with such ill-behaved distributions.

• Geometric mean: The geometric mean is the nth root of the product of n values:

( Π_{i=1}^{n} a_i )^{1/n} = ⁿ√(a_1 a_2 · · · a_n)

The geometric mean is always less than or equal to the arithmetic mean.

For example, the geometric mean of the sums of 36 dice rolls is 6.5201, as

opposed to the arithmetic mean of 7. It is very sensitive to values near

zero. A single value of zero lays waste to the geometric mean: no matter

what other values you have in your data, you end up with zero. This is

somewhat analogous to having an outlier of ∞in an arithmetic mean.

But geometric means prove their worth when averaging ratios. The geometric mean of 1/2 and 2/1 is 1, whereas the arithmetic mean is 1.25. There is less available "room" for ratios to be less than 1 than there is for ratios above 1, creating an asymmetry that the arithmetic mean overstates. The geometric mean is more meaningful in these cases, as is the arithmetic mean of the logarithms of the ratios.

•Median: The median is the exact middle value among a data set; just as

many elements lie above the median as below it. There is a quibble about

what to take as the median when you have an even number of elements.

You can take either one of the two central candidates: in any reasonable

data set these two values should be about the same. Indeed in the dice

example, both are 7.

A nice property of the median as so defined is that it must be a genuine value of the original data stream. There actually is someone of median height you can point to as an example, but presumably no one in the world is of exactly average height. You lose this property when you average the two center elements.

Which centrality measure is best for applications? The median typically

lies pretty close to the arithmetic mean in symmetrical distributions, but

it is often interesting to see how far apart they are, and on which side of

the mean the median lies.

The median generally proves to be a better statistic for skewed distribu-

tions or data with outliers: like wealth and income. Bill Gates adds $250

to the mean per capita wealth in the United States, but nothing to the

median. If he makes you personally feel richer, then go ahead and use the

mean. But the median is the more informative statistic here, as it will be

for any power law distribution.


Figure 2.4: Two distinct probability distributions with µ = 3000 for the lifespan of light bulbs: normal (left) and with zero variance (right).

• Mode: The mode is the most frequent element in the data set. This is 7 in our ongoing dice example, because it occurs six times out of thirty-six elements. Frankly, I've never seen the mode as providing much insight as a centrality measure, because it often isn't close to the center. Samples measured over a large range should have very few repeated elements or collisions at any particular value. This makes the mode a matter of happenstance. Indeed, the most frequently occurring elements often reveal artifacts or anomalies in a data set, such as default values or error codes that do not really represent elements of the underlying distribution.

The related concept of the peak in a frequency distribution (or histogram) is meaningful, but interesting peaks only get revealed through proper bucketing. The current peak of the annual salary distribution in the United States lies between $30,000 and $40,000 per year, although the mode presumably sits at zero.
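The centrality measures above are straightforward to implement. The sketch below (Python, not from the text) maintains a streaming mean by keeping the sum and count separately, and computes the geometric mean via logarithms; all names are illustrative:

```python
import math

class StreamingMean:
    """Maintain a mean under insertions/deletions by keeping the
    sum and the count separately, dividing only on demand."""
    def __init__(self):
        self.total, self.count = 0.0, 0
    def insert(self, x):
        self.total += x
        self.count += 1
    def delete(self, x):
        self.total -= x
        self.count -= 1
    def mean(self):
        return self.total / self.count

def geometric_mean(values):
    """Geometric mean via the mean of logarithms; note that a
    single zero collapses the result to zero."""
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))

m = StreamingMean()
for x in [3, 1, 4, 1, 5]:
    m.insert(x)
print(m.mean())   # 2.8
m.delete(5)
print(m.mean())   # 2.25

# Averaging ratios: geometric mean of 1/2 and 2/1 is 1; arithmetic is 1.25.
print(round(geometric_mean([0.5, 2.0]), 10))   # 1.0
```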

2.2.2 Variability Measures

The most common measure of variability is the standard deviation σ, which measures the sum of squared differences between the individual elements and the mean:

σ = √( (Σ_{i=1}^{n} (a_i − ā)²) / (n − 1) )

A related statistic, the variance V, is the square of the standard deviation, i.e. V = σ². Sometimes it is more convenient to talk about variance than standard deviation, because the term is eight characters shorter. But they measure exactly the same thing.
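A direct implementation of these two statistics (a sketch, using the n − 1 sample convention discussed below; the light bulb samples are made up for illustration):

```python
import math

def std_dev(sample):
    """Sample standard deviation, dividing by n - 1."""
    n = len(sample)
    mean = sum(sample) / n
    return math.sqrt(sum((a - mean) ** 2 for a in sample) / (n - 1))

def variance(sample):
    return std_dev(sample) ** 2

conventional = [2800, 2900, 3000, 3100, 3200]   # bulbs scattered around mu
cartridge = [3000, 3000, 3000, 3000, 3000]      # every bulb dies at 3000 hours

# Same mean, substantially different variance.
print(round(std_dev(conventional), 1))   # 158.1
print(std_dev(cartridge))                # 0.0
```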

As an example, consider the humble light bulb, which typically comes with an expected working life, say µ = 3000 hours, derived from some underlying distribution shown in Figure 2.4. In a conventional bulb, the chance of it lasting longer than µ is presumably about the same as that of it burning out quicker, and this degree of uncertainty is measured by σ. Alternately, imagine a "printer cartridge bulb," where the evil manufacturer builds very robust bulbs, but includes a counter so they can prevent it from ever glowing after 3000 hours of use. Here µ = 3000 and σ = 0. Both distributions have the same mean, but substantially different variance.

The sum of squares penalty in the formula for σ means that one outlier value d units from the mean contributes as much to the variance as d² points each one unit from the mean, so the variance is very sensitive to outliers.

An often confusing matter concerns the denominator in the formula for standard deviation. Should we divide by n or n − 1? The difference here is technical. The standard deviation of the full population divides by n, whereas the standard deviation of the sample divides by n − 1. The issue is that sampling just one point tells us absolutely nothing about the underlying variance in any population, although it is perfectly reasonable to say there is zero variance in weight among the population of a one-person island. But for reasonable-sized data sets n ≈ (n − 1), so it really doesn't matter.

2.2.3 Interpreting Variance

Repeated observations of the same phenomenon do not always produce the

same results, due to random noise or error. Sampling errors result when our

observations capture unrepresentative circumstances, like measuring rush hour

traﬃc on weekends as well as during the work week. Measurement errors reﬂect

the limits of precision inherent in any sensing device. The notion of signal

to noise ratio captures the degree to which a series of observations reﬂects a

quantity of interest as opposed to data variance. As data scientists, we care

about changes in the signal instead of the noise, and such variance often makes

this problem surprisingly diﬃcult.

I think of variance as an inherent property of the universe, akin to the speed

of light or the time-value of money. Each morning you weigh yourself on a scale

you are guaranteed to get a diﬀerent number, with changes reﬂecting when you

last ate (sampling error), the ﬂatness of the ﬂoor, or the age of the scale (both

measurement error) as much as changes in your body mass (actual variation).

So what is your real weight?

Every measured quantity is subject to some level of variance, but the phenomenon cuts much deeper than that. Much of what happens in the world is just random fluctuation or arbitrary happenstance, causing variance even when the situation is unchanged. Data scientists seek to explain the world through data, but distressingly often there is no real phenomenon to explain, only a ghost created by variance. Examples include:

Figure 2.5: Sample variance on hitters with a real 30% success rate results in a wide range of observed performance, even over 500 trials per season.

• The stock market: Consider the problem of measuring the relative "skill" of different stock market investors. We know that Warren Buffett is much better at investing than we are. But very few professional investors prove consistently better than others. Certain investment vehicles wildly outperform the market in any given time period. However, the hot fund one year usually underperforms the market the year after, which shouldn't happen if this outstanding performance was due to skill rather than luck.

The fund managers themselves are quick to credit profitable years to their own genius, but losses to unforeseeable circumstances. However, several studies have shown that the performance of professional investors is essentially random, meaning there is little real difference in skill. Most investors are paying managers for previously used luck. So why do these entrail-readers get paid so much money?

•Sports performance: Students have good semesters and bad semesters, as

reﬂected by their grade point average (GPA). Athletes have good and bad

seasons, as reﬂected by their performance and statistics. Do such changes

reﬂect genuine diﬀerences in eﬀort and ability, or are they just variance?

In baseball, .300 hitters (players who hit with a 30% success rate) represent

consistency over a full season. Batting .275 is not a noteworthy season,

but hit .300 and you are a star. Hit .325 and you are likely to be the

batting champion.

Figure 2.5 shows the results of a simple simulation, where random numbers were used to decide the outcome of each at-bat over a 500 at-bat season. Our synthetic player is a real .300 hitter, because we programmed it to report a hit with probability 300/1000 (0.3). The results show that a real .300 hitter has a 10% chance of hitting .275 or below, just by chance. Such a season will typically be explained away by injuries or maybe the inevitable effects of age on athletic performance. But it could just be natural variance. Smart teams try to acquire a good hitter after a lousy season, when the price is cheaper, trying to take advantage of this variance.

Our .300 hitter also has a 10% chance of batting above .325, but you can be pretty sure that they will ascribe such a breakout season to their improved conditioning or training methods instead of the fact they just got lucky. Good or bad season, or lucky/unlucky: it is hard to tell the signal from the noise.

•Model performance: As data scientists, we will typically develop and eval-

uate several models for each predictive challenge. The models may range

from very simple to complex, and vary in their training conditions or

parameters.

Typically the model with the best accuracy on the training corpus will be paraded triumphantly before the world as the right one. But small differences in the performance between models are likely explained by simple variance rather than wisdom: which training/evaluation pairs were selected, how well parameters were optimized, etc.

Remember this when it comes to training machine learning models. In-

deed, when asked to choose between models with small performance dif-

ferences between them, I am more likely to argue for the simplest model

than the one with the highest score. Given a hundred people trying to

predict heads and tails on a stream of coin tosses, one of them is guar-

anteed to end up with the most right answers. But there is no reason to

believe that this fellow has any better predictive powers than the rest of

us.
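The .300-hitter simulation from the sports example above can be reproduced in a few lines (a sketch; the seed and the number of simulated seasons are arbitrary choices of ours, not the book's):

```python
import random

random.seed(13)

def season_average(true_rate=0.3, at_bats=500):
    """Batting average of one simulated season for a true .300 hitter."""
    hits = sum(1 for _ in range(at_bats) if random.random() < true_rate)
    return hits / at_bats

# Simulate many seasons and look at the tails.
seasons = [season_average() for _ in range(10000)]

low = sum(1 for avg in seasons if avg <= 0.275) / len(seasons)
high = sum(1 for avg in seasons if avg >= 0.325) / len(seasons)

# Both tail probabilities land near 10%, purely from sampling variance.
print(round(low, 3), round(high, 3))
```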

2.2.4 Characterizing Distributions

Distributions do not necessarily have much probability mass exactly at the mean. Consider what your wealth would look like after you borrow $100 million, and then bet it all on an even money coin flip. Heads you are now $100 million in the clear, tails you are $100 million in hock. Your expected wealth is zero, but this mean does not tell you much about the shape of your wealth distribution.

However, taken together the mean and standard deviation do a decent job

of characterizing any distribution. Even a relatively small amount of mass

positioned far from the mean would add a lot to the standard deviation, so a

small value of σimplies the bulk of the mass must be near the mean.

To be precise, regardless of how your data is distributed, at least (1 − (1/k²))th of the mass must lie within ±k standard deviations of the mean. This means that at least 75% of all the data must lie within 2σ of the mean, and almost 89% within 3σ, for any distribution.
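This bound (Chebyshev's inequality) is easy to check empirically. The sketch below samples from a deliberately skewed exponential distribution, an arbitrary choice of ours, and confirms the 75% and 89% guarantees:

```python
import random

random.seed(7)

# An ill-behaved, skewed sample: exponentially distributed values.
data = [random.expovariate(1.0) for _ in range(100000)]

n = len(data)
mu = sum(data) / n
sigma = (sum((x - mu) ** 2 for x in data) / (n - 1)) ** 0.5

for k in (2, 3):
    inside = sum(1 for x in data if abs(x - mu) <= k * sigma) / n
    bound = 1 - 1 / k ** 2
    # The observed fraction always meets or beats the Chebyshev bound.
    print(k, round(inside, 3), ">=", round(bound, 3))
```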

We will see that even tighter bounds hold when we know the distribution is well-behaved, like the Gaussian or normal distribution. But this is why it is a great practice to report both µ and σ whenever you talk about averages. The average height of adult women in the United States is 63.7 ± 2.7 inches, meaning µ = 63.7 and σ = 2.7. The average temperature in Orlando, FL is 60.3 degrees Fahrenheit. However, there have been many more 100 degree days at Disney World than 100 inch (8.33 foot) women visiting to enjoy them.


Take-Home Lesson: Report both the mean and standard deviation to characterize your distribution, written as µ ± σ.

2.3 Correlation Analysis

Suppose we are given two variables x and y, represented by a sample of n points of the form (x_i, y_i), for 1 ≤ i ≤ n. We say that x and y are correlated when the value of x has some predictive power on the value of y.

The correlation coefficient r(X, Y) is a statistic that measures the degree to which Y is a function of X, and vice versa. The value of the correlation coefficient ranges from −1 to 1, where 1 means fully correlated and 0 implies no relation, or independent variables. Negative correlations imply that the variables are anti-correlated, meaning that when X goes up, Y goes down.

Perfectly anti-correlated variables have a correlation of −1. Note that nega-

tive correlations are just as good for predictive purposes as positive ones. That

you are less likely to be unemployed the more education you have is an example

of a negative correlation, so the level of education can indeed help predict job

status. Correlations around 0 are useless for forecasting.

Observed correlations drive many of the predictive models we build in data science. Representative strengths of correlations include:

• Are taller people more likely to remain lean? The observed correlation between height and BMI is r = −0.711, so height is indeed negatively correlated with body mass index (BMI).³

• Do standardized tests predict the performance of students in college? The observed correlation between SAT scores and freshman GPA is r = 0.47, so yes, there is some degree of predictive power. But socioeconomic status is just as strongly correlated with SAT scores (r = 0.42).⁴

• Does financial status affect health? The observed correlation between household income and the prevalence of coronary artery disease is r = −0.717, so there is a strong negative correlation. So yes, the wealthier you are, the lower your risk of having a heart attack.⁵

• Does smoking affect health? The observed correlation between a group's propensity to smoke and their mortality rate is r = 0.716, so for G-d's sake, don't smoke.⁶

³https://onlinecourses.science.psu.edu/stat500/node/60
⁴https://research.collegeboard.org/sites/default/files/publications/2012/9/researchreport-2009-1-socioeconomic-status-sat-freshman-gpa-analysis-data.pdf
⁵http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3457990/
⁶http://lib.stat.cmu.edu/DASL/Stories/SmokingandCancer.html


• Do violent video games increase aggressive behavior? The observed correlation between play and violence is r = 0.19, so there is a weak but significant correlation.⁷

This section will introduce the primary measurements of correlation. Fur-

ther, we study how to appropriately determine the strength and power of any

observed correlation, to help us understand when the connections between vari-

ables are real.

2.3.1 Correlation Coeﬃcients: Pearson and Spearman Rank

In fact, there are two primary statistics used to measure correlation. Mercifully,

both operate on the same −1 to 1 scale, although they measure somewhat

diﬀerent things. These diﬀerent statistics are appropriate in diﬀerent situations,

so you should be aware of both of them.

The Pearson Correlation Coeﬃcient

The more prominent of the two statistics is Pearson correlation, defined as

r = ( Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) ) / ( √(Σ_{i=1}^{n} (X_i − X̄)²) · √(Σ_{i=1}^{n} (Y_i − Ȳ)²) ) = Cov(X, Y) / (σ(X) σ(Y))

Let's parse this equation. Suppose X and Y are strongly correlated. Then we would expect that when x_i is greater than the mean X̄, then y_i should be bigger than its mean Ȳ. When x_i is lower than its mean, y_i should follow. Now look at the numerator. The sign of each term is positive when both values are above (1 × 1) or below (−1 × −1) their respective means. The sign of each term is negative ((−1 × 1) or (1 × −1)) if they move in opposite directions, suggesting negative correlation. If X and Y were uncorrelated, then positive and negative terms should occur with equal frequency, offsetting each other and driving the value to zero.

The numerator's operation determining the sign of the correlation is so useful that we give it a name, covariance, computed:

Cov(X, Y) = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ).

Remember covariance: we will see it again in Section 8.2.3.

The denominator of the Pearson formula reflects the amount of variance in the two variables, as measured by their standard deviations. The covariance between X and Y potentially increases with the variance of these variables, and this denominator is the magic amount to divide it by to bring correlation to a −1 to 1 scale.

⁷http://webspace.pugetsound.edu/facultypages/cjones/chidev/Paper/Articles/Anderson-Aggression.pdf
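Both the covariance and the full Pearson coefficient translate directly into code (an illustrative sketch, not from the text; the sample sequences are made up):

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance over the product of the
    two variables' deviations."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
print(round(pearson(xs, [2 * x + 1 for x in xs]), 6))   # 1.0, perfect linear fit
print(round(pearson(xs, [-x for x in xs]), 6))          # -1.0, anti-correlated
print(round(pearson(xs, [1, 3, 2, 5, 4]), 6))           # 0.8, strong but imperfect
```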


Figure 2.6: The function y = |x| does not have a linear model, but seems like it should be easily fitted despite weak correlations.

The Spearman Rank Correlation Coeﬃcient

The Pearson correlation coefficient defines the degree to which a linear predictor of the form y = m·x + b can fit the observed data. This generally does a good job measuring the similarity between the variables, but it is possible to construct pathological examples where the correlation coefficient between X and Y is zero, yet Y is completely dependent on (and hence perfectly predictable from) X.

Consider points of the form (x, |x|), where x is uniformly (or symmetrically) sampled from the interval [−1, 1], as shown in Figure 2.6. The correlation will be zero, because for every point (x, x) there will be an offsetting point (−x, x), yet y = |x| is a perfect predictor. Pearson correlation measures how well the best linear predictors can work, but says nothing about weirder functions like absolute value.
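We can confirm this pathology numerically (a sketch; the sampling grid is an arbitrary choice of ours):

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Symmetric sample from [-1, 1]; y = |x| is perfectly predictable from x.
xs = [i / 100 for i in range(-100, 101)]
ys = [abs(x) for x in xs]

print(round(pearson(xs, ys), 6))  # ~0.0, despite the perfect dependence
```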

The Spearman rank correlation coefficient essentially counts the number of pairs of input points which are out of order. Suppose that our data set contains points (x_1, y_1) and (x_2, y_2), where x_1 < x_2 and y_1 < y_2. This is a vote that the values are positively correlated, whereas the vote would be for a negative correlation if y_2 < y_1.

Summing up over all pairs of points and normalizing properly gives us Spearman rank correlation. Let rank(x_i) be the rank position of x_i in sorted order among all the x values, so the rank of the smallest value is 1 and that of the largest value is n. Then

ρ = 1 − (6 Σ_{i=1}^{n} d_i²) / (n(n² − 1)),

where d_i = rank(x_i) − rank(y_i).
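The rank-based formula is straightforward to implement when there are no ties in the data (a sketch; the tie-free assumption and all names are ours):

```python
def ranks(values):
    """Rank positions (1 = smallest). Assumes no ties, as in the formula."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for position, i in enumerate(order, start=1):
        r[i] = position
    return r

def spearman(xs, ys):
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - (6 * d2) / (n * (n * n - 1))

xs = [1, 2, 3, 4, 5]
print(spearman(xs, [x ** 3 for x in xs]))   # 1.0: monotonic, though non-linear
print(spearman(xs, [5, 4, 3, 2, 1]))        # -1.0: perfectly reversed
print(spearman(xs, [1, 8, 27, 64, 10**9]))  # 1.0: the huge outlier changes nothing
```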

Figure 2.7: A monotonic but not linear point set has a Spearman coefficient ρ = 1 even though it has no good linear fit (left). Highly-correlated sequences are recognized by both coefficients (center), but the Pearson coefficient is much more sensitive to outliers (right).

The relationship between our two coefficients is better delineated by the example in Figure 2.7. In addition to giving high scores to non-linear but monotonic functions, Spearman correlation is less sensitive to extreme outlier elements than Pearson. Let p = (x_1, y_max) be the data point with the largest value of y in a given data set. Suppose we replace p with p′ = (x_1, ∞). The Pearson correlation will go crazy, since the best fit now becomes the vertical line x = x_1. But the Spearman correlation will be unchanged, since all the points were under p, just as they are now under p′.

2.3.2 The Power and Signiﬁcance of Correlation

The correlation coefficient r reflects the degree to which x can be used to predict y in a given sample of points S. As |r| → 1, these predictions get better and better.

But the real question is how this correlation will hold up in the real world,

outside the sample. Stronger correlations have larger |r|, but also involve sam-

ples of enough points to be signiﬁcant. There is a wry saying that if you want

to ﬁt your data by a straight line, it is best to sample it at only two points.

Your correlation becomes more impressive the more points it is based on.

The statistical limits in interpreting correlations are presented in Figure 2.8,

based on strength and size:

•Strength of correlation: R²: The square of the sample correlation coefficient

r² estimates the fraction of the variance in Y explained by X in

a simple linear regression. The correlation between height and weight is

approximately 0.8, meaning it explains about two thirds of the variance.

Figure 2.8 (left) shows how rapidly r² decreases with r. There is a profound

limit to how excited we should get about establishing a weak correlation.

A correlation of 0.5 possesses only 25% of the maximum predictive

power, and a correlation of r = 0.1 only 1%. Thus the predictive value of

correlations decreases rapidly with r.

What do we mean by “explaining the variance”? Let f(x) = mx + c be the

Figure 2.8: Limits in interpreting significance. The r² value shows that weak

correlations explain only a small fraction of the variance (left). The level of

correlation necessary to be statistically significant decreases rapidly with sample

size n (right).

Figure 2.9: Plotting ri = yi − f(xi) shows that the residual values have lower

variance and mean zero. The original data points are on the left, with the

corresponding residuals on the right.

predictive value of y from x, with the parameters m and c corresponding

to the best possible fit. The residual values ri = yi − f(xi) will have mean

zero, as shown in Figure 2.9. Further, the variance of the full data set

V(y) should be much larger than V(r) if there is a good linear fit f(x).

If x and y are perfectly correlated, there should be no residual error, and

V(r) = 0. If x and y are totally uncorrelated, the fit should contribute

nothing, and V(y) ≈ V(r). Generally speaking, 1 − r² = V(r)/V(y).

Consider Figure 2.9, showing a set of points (left) admitting a good linear

fit, with correlation r = 0.94. The corresponding residuals ri = yi − f(xi)

are plotted on the right. The variance of the y values on the left is V(y) =

0.056, substantially greater than the variance V(r) = 0.0065 on the right.

Indeed,

1 − r² = 0.116 ⟷ V(r)/V(y) = 0.116.
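This identity is easy to check numerically. A minimal sketch with made-up synthetic data, using NumPy's least-squares fit: compute the best line, take the residuals, and compare 1 − r² against the variance ratio.

```python
import numpy as np

# Synthetic noisy linear data (parameters chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = 2.0 * x + 0.5 + rng.normal(0, 0.1, 200)

r = np.corrcoef(x, y)[0, 1]
m, c = np.polyfit(x, y, 1)          # least-squares line f(x) = m*x + c
resid = y - (m * x + c)

print(abs(resid.mean()))            # essentially zero: residuals have mean zero
print(1 - r**2, resid.var() / y.var())   # the two sides of the identity agree
```

The agreement is exact (up to floating-point error) for any least-squares line, not just for this data set.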

•Statistical significance: The statistical significance of a correlation depends

upon its sample size n as well as r. By tradition, we say that a correlation

of n points is significant if there is an α ≤ 1/20 = 0.05 chance that we

would observe a correlation as strong as r in any random set of n points.

This is not a particularly strong standard. Even small correlations become

significant at the 0.05 level with large enough sample sizes, as shown in

Figure 2.8 (right). A correlation of r = 0.1 becomes significant at α =

0.05 around n = 300, even though such a factor explains only 1% of the

variance.
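One way to get a feel for significance without formulas is a permutation test: shuffle one variable many times and count how often chance alone produces a correlation as strong as the one observed. This sketch is my own illustration, not a method from the text; the data and function names are invented.

```python
import random

def pearson(a, b):
    # Sample Pearson correlation coefficient.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = sum((u - ma) ** 2 for u in a) ** 0.5
    sb = sum((v - mb) ** 2 for v in b) ** 0.5
    return cov / (sa * sb)

def permutation_pvalue(xs, ys, trials=2000, seed=7):
    # Fraction of random shuffles of y whose |r| meets or beats the observed |r|.
    observed = abs(pearson(xs, ys))
    rng = random.Random(seed)
    ys = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(ys)
        if abs(pearson(xs, ys)) >= observed:
            hits += 1
    return hits / trials

xs = list(range(20))
ys = [x + (x * 37) % 5 for x in xs]    # strongly correlated, with bounded noise
print(permutation_pvalue(xs, ys))      # far below 0.05: clearly significant
```

With only a handful of points, even a strong observed r would be matched often enough by shuffles to fail this test, which is the n-dependence that Figure 2.8 (right) captures.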

Weak but signiﬁcant correlations can have value in big data models involving

large numbers of features. Any single feature/correlation might explain/predict

only small eﬀects, but taken together a large number of weak but independent

correlations may have strong predictive power. Maybe. We will discuss

significance again in greater detail in Section 5.3.

2.3.3 Correlation Does Not Imply Causation!

You have heard this before: correlation does not imply causation:

•The number of police active in a precinct correlates strongly with the local

crime rate, but the police do not cause the crime.

•The amount of medicine people take correlates with the probability they

are sick, but the medicine does not cause the illness.

At best, the implication works only one way. But many observed correlations

are completely spurious, with neither variable having any real impact on the

other.

Still, “correlation implies causation” is a common error in thinking, even among

those who understand logical reasoning. Generally speaking, few statistical tools

are available to tease out whether A really causes B. We can conduct controlled

experiments, if we can manipulate one of the variables and watch the eﬀect on

Figure 2.10: Correlation does not imply causation. (Source: https://www.xkcd.com/552.)

Figure 2.11: Cyclic trends in a time series (left) are revealed through correlating

it against shifts of itself (right).

the other. For example, the fact that we can put people on a diet that makes

them lose weight without getting shorter is convincing evidence that weight does

not cause height. But it is often harder to do these experiments the other way,

e.g. there is no reasonable way to make people shorter other than by hacking

oﬀ limbs.

2.3.4 Detecting Periodicities by Autocorrelation

Suppose a space alien was hired to analyze U.S. sales at a toy company. Instead

of a nice smooth function showing a consistent trend, they would be astonished

to see a giant bump every twelfth month, every year. This alien would have

discovered the phenomenon of Christmas.

Seasonal trends reﬂect cycles of a ﬁxed duration, rising and falling in a reg-

ular pattern. Many human activities proceed with a seven-day cycle associated

with the work week. Large populations of a type of insect called a cicada emerge

on a 13-year or 17-year cycle, in an eﬀort to prevent predators from learning to

eat them.

How can we recognize such cyclic patterns in a sequence S? Suppose we

correlate the values of Si with Si+p, for all 1 ≤ i ≤ n − p. If the values are in sync

for a particular period length p, then this correlation with itself will be unusually

high relative to other possible lag values. Comparing a sequence to itself is called

an autocorrelation, and the series of such correlations over all lags 1 ≤ p ≤ n − 1 is called

the autocorrelation function. Figure 2.11 presents a time series of daily sales,

and the associated autocorrelation function for this data. The peak at a shift of

seven days (and every multiple of seven days) establishes that there is a weekly

periodicity in sales: more stuﬀ gets sold on weekends.

Autocorrelation is an important concept in predicting future events, because

it means we can use previous observations as features in a model. The heuristic

that tomorrow’s weather will be similar to today’s is based on autocorrelation,

with a lag of p = 1 day. Certainly we would expect such a model to be

more accurate than predictions made on weather data from six months ago (lag

p = 180 days).

Generally speaking, the autocorrelation function for many quantities tends

to be highest for very short lags. This is why long-term predictions are less accu-

rate than short-term forecasts: the autocorrelations are generally much weaker.

But periodic cycles do sometimes stretch much longer. Indeed, a weather

forecast based on a lag of p = 365 days will be much better than one with p = 180,

because of seasonal eﬀects.

Computing the full autocorrelation function requires calculating n − 1 different

correlations on points of the time series, which can get expensive for large n.

Fortunately, there is an eﬃcient algorithm based on the fast Fourier transform

(FFT), which makes it possible to construct the autocorrelation function even

for very long sequences.
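A sketch of the FFT approach, using NumPy (the function name and the synthetic "sales" series are mine): zero-pad the series, multiply its transform by its own conjugate, and invert to recover every lag's correlation at once.

```python
import numpy as np

def autocorrelation(series):
    # All-lags autocorrelation in O(n log n) via the FFT.
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    n = len(x)
    f = np.fft.rfft(x, 2 * n)             # zero-pad to avoid circular wraparound
    acf = np.fft.irfft(f * np.conj(f))[:n]
    return acf / acf[0]                   # normalize so lag 0 scores 1

# Synthetic daily sales with a weekly cycle plus noise.
rng = np.random.default_rng(1)
t = np.arange(365)
sales = 100 + 20 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 3, 365)

acf = autocorrelation(sales)
print(int(np.argmax(acf[1:10])) + 1)      # 7: the weekly lag dominates nearby lags
```

Multiplying the transform by its conjugate is exactly computing all shifted dot products of the sequence with itself, which the padding keeps from wrapping around the ends.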

2.4 Logarithms

The logarithm is the inverse of the exponential function y = b^x, an equation that can

be rewritten as x = log_b y. This definition is the same as saying that

b^(log_b y) = y.

Exponential functions grow at a very fast rate: consider the series {2¹, 2², 2³, 2⁴, . . .}.

In contrast, logarithms grow at a very slow rate: they are just the exponents of

the previous series, {1, 2, 3, 4, . . .}. Logarithms are associated with any process where

we are repeatedly multiplying by some value b, or repeatedly dividing by b.

Just remember the definition:

y = log_b x ⟷ b^y = x.

Logarithms are very useful things, and arise often in data analysis. Here

I detail three important roles logarithms play in data science. Surprisingly,

only one of them is related to the seven algorithmic applications of logarithms

I present in The Algorithm Design Manual [Ski08]. Logarithms are indeed very

useful things.

2.4.1 Logarithms and Multiplying Probabilities

Logarithms were first invented as an aid to computation, by reducing the problem

of multiplication to that of addition. In particular, to compute the product

p = x·y, we could compute the sum of the logarithms s = log_b x + log_b y, and

then take the inverse of the logarithm (i.e. raise b to the s-th power) to get p,

because:

p = x·y = b^(log_b x + log_b y).

This is the trick that powered the mechanical slide rules that geeks used in the

days before pocket calculators.

However, this idea remains important today, particularly when multiplying

long chains of probabilities. Probabilities are small numbers. Thus multiplying

long chains of probabilities yields very small numbers that govern the chances of

very rare events. There are serious numerical stability problems with ﬂoating

point multiplication on real computers. Numerical errors will creep in, and will

eventually overwhelm the true value of small-enough numbers.

Summing the logarithms of probabilities is much more numerically stable

than multiplying them, but yields an equivalent result because:

p1 · p2 · · · pn = b^P,   where   P = log_b(p1) + log_b(p2) + · · · + log_b(pn).

We can raise our sum to an exponential if we need the real probability, but

usually this is not necessary. When we just need to compare two probabilities

to decide which one is larger we can safely stay in log world, because bigger

logarithms correspond to bigger probabilities.

There is one quirk to be aware of. Recall that log2(1/2) = −1. The

logarithms of probabilities are all negative numbers except for log(1) = 0. This

is the reason why equations with logs of probabilities often feature negative

signs in strange places. Be on the lookout for them.
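A quick sketch of the underflow problem and its fix; the specific numbers are mine, chosen to force underflow.

```python
import math

p = 1e-5                      # probability of each of n independent rare events
n = 1000

direct = 1.0
for _ in range(n):
    direct *= p               # 10^-5000 is far below the range of a float
print(direct)                 # 0.0: the true value has been lost entirely

log_total = n * math.log(p)   # perfectly representable: about -11512.9
log_other = n * math.log(2 * p)
print(log_other > log_total)  # True: bigger log means bigger probability
```

Comparing the two chains in log space gives the right answer even though neither probability survives as a float.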

2.4.2 Logarithms and Ratios

Ratios are quantities of the form a/b. They occur often in data sets either as

elementary features or values derived from feature pairs. Ratios naturally occur

in normalizing data for conditions (i.e. weight after some treatment over the

initial weight) or time (i.e. today’s price over yesterday’s price).

But ratios behave differently when reflecting increases than decreases. The

ratio 200/100 is 100% above baseline, but 100/200 is only 50% below baseline,

despite being an equal-magnitude change. Thus doing things like averaging ratios is

committing a statistical sin. Do you really want a doubling followed by a halving

to average out as an increase, as opposed to a neutral change?

Figure 2.12: Plotting ratios on a linear scale cramps the space allocated to small ratios

relative to large ratios (left). Plotting the logarithms of ratios better represents

the underlying data (right).

One solution here would have been to use the geometric mean. But better is

taking the logarithm of these ratios, so that they yield equal displacement, since

log2(2) = 1 and log2(1/2) = −1. We get the extra bonus that a unit ratio maps

to zero, so positive and negative numbers correspond to improper and proper

ratios, respectively.
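The doubling-then-halving question is worth checking in a few lines (a minimal sketch; the numbers are mine):

```python
import math

ratios = [2.0, 0.5]                      # a doubling followed by a halving

arithmetic = sum(ratios) / len(ratios)   # falsely reports a net gain
geometric = math.prod(ratios) ** (1 / len(ratios))
mean_log = sum(math.log2(r) for r in ratios) / len(ratios)

print(arithmetic)   # 1.25
print(geometric)    # 1.0: correctly neutral
print(mean_log)     # 0.0: log2(2) = 1 and log2(1/2) = -1 cancel out
```

Averaging log-ratios and taking the geometric mean agree here, since the geometric mean is exactly the exponential of the mean log.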

A rookie mistake my students often make involves plotting the value of ratios

instead of their logarithms. Figure 2.12 (left) is a graph from a student paper,

showing the ratio of new score over old score on data over 24 hours (each red

dot is the measurement for one hour) on four diﬀerent data sets (each given a

row). The solid black line shows the ratio of one, where both scores give the

same result. Now try to read this graph: it isn’t easy because the points on the

left side of the line are cramped together in a narrow strip. What jumps out at

you are the outliers. Certainly the new algorithm does terribly on 7UM917 in

the top row: that point all the way to the right is a real outlier.

Except that it isn’t. Now look at Figure 2.12 (right), where we plot the

logarithms of the ratios. The space devoted to left and right of the black line

can now be equal. And it shows that this point wasn’t really such an outlier at

all. The magnitude of improvement of the leftmost points is much greater than

that of the rightmost points. This plot reveals that the new algorithm generally

makes things better, a fact we can see only because we are showing logs of ratios

instead of the ratios themselves.

2.4.3 Logarithms and Normalizing Skewed Distributions

Variables which follow symmetric, bell-shaped distributions tend to be nice as

features in models. They show substantial variation, so they can be used to

discriminate between things, but not over such a wide range that outliers are

overwhelming.

But not every distribution is symmetric. Consider the one in Figure 2.13

Figure 2.13: Hitting a skewed data distribution (left) with a log often yields a

more bell-shaped distribution (right).

(left). The tail on the right goes much further than the tail on the left. And

we are destined to see far more lopsided distributions when we discuss power

laws, in Section 5.1.5. Wealth is representative of such a distribution, where

the poorest human has zero or perhaps negative wealth, the average person

(optimistically) is in the thousands of dollars, and Bill Gates is pushing $100

billion as of this writing.

We need a normalization to convert such distributions into something easier

to deal with. To ring the bell of a power law distribution we need something

non-linear, that reduces large values to a disproportionate degree compared to

more modest values.

The logarithm is the transformation of choice for power law variables. Hit

your long-tailed distribution with a log and often good things happen. The

distribution in Figure 2.13 happened to be the log normal distribution, so taking

the logarithm yielded a perfect bell curve on the right. Taking the logarithm of

variables with a power law distribution brings them more in line with traditional

distributions. For example, as an upper-middle class professional, my wealth is

roughly the same number of logs from my starving students as I am from Bill

Gates!

Sometimes taking the logarithm proves too drastic a hit, and a less dramatic

non-linear transformation like the square root works better to normalize a dis-

tribution. The acid test is to plot a frequency distribution of the transformed

values and see if it looks bell-shaped: grossly-symmetric, with a bulge in the

middle. That is when you know you have the right function.
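The acid test can itself be quantified with a skewness statistic, which is near zero for grossly-symmetric data. A sketch using only the standard library; the lognormal parameters are arbitrary choices of mine.

```python
import math
import random

def skewness(xs):
    # Standardized third moment: near 0 for symmetric, bell-shaped data.
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / s) ** 3 for x in xs) / n

rng = random.Random(42)
raw = [rng.lognormvariate(0, 1) for _ in range(5000)]   # long right tail
logged = [math.log(x) for x in raw]                     # back to a normal

print(skewness(raw) > 1)            # True: heavily right-skewed before the log
print(abs(skewness(logged)) < 0.2)  # True: roughly symmetric after the log
```

The same check works for milder transformations like the square root: apply, recompute the skewness, and keep whichever function leaves it closest to zero.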

2.5 War Story: Fitting Designer Genes

The word bioinformatician is life science speak for “data scientist,” the prac-

titioner of an emerging discipline which studies massive collections of DNA

sequence data looking for patterns. Sequence data is very interesting to work

with, and I have played bioinformatician in research projects since the very

beginnings of the human genome project.

DNA sequences are strings on the four letter alphabet {A, C, G, T }. Proteins

form the stuﬀ that we are physically constructed from, and are composed of

strings of 20 diﬀerent types of molecular units, called amino acids. Genes are

the DNA sequences which describe exactly how to make speciﬁc proteins, with

the units each described by a triplet of {A, C, G, T }s called codons.

For our purposes, it suﬃces to know that there are a huge number of possible

DNA sequences describing genes which could code for any particular desired

protein sequence. But only one of them is used. My biologist collaborators and

I wanted to know why.

Originally, it was assumed that all of these diﬀerent synonymous encodings

were essentially identical, but statistics performed on sequence data made it

clear that certain codons are used more often than others. The biological con-

clusion is that “codons matter,” and there are good biological reasons why this

should be.

We became interested in whether “neighboring pairs of codons matter.” Perhaps

certain pairs of triples are like oil and water, and hate to mix. Certain

letter pairs in English have order preferences: you see the bigram gh far more

often than hg. Maybe this is true of DNA as well? If so, there would be pairs

of triples which should be underrepresented in DNA sequence data.

To test this, we needed a score comparing the number of times we actually

see a particular triple (say x = CAT) next to another particular triple (say

y = GAG) to what we would expect by chance. Let F(xy) be the frequency

of xy, i.e. the number of times we actually see codon x followed by codon y in the

DNA sequence database. These codons code for specific amino acids, say a

and b respectively. For amino acid a, the probability that it will be coded by

x is P(x) = F(x)/F(a), and similarly P(y) = F(y)/F(b). Then the expected

number of times of seeing xy is

Expected(xy) = (F(x)/F(a)) · (F(y)/F(b)) · F(ab).

Based on this, we can compute a codon pair score for any given hexamer xy

as follows:

CPS(xy) = ln( Observed(xy) / Expected(xy) ) = ln( F(xy) / ( (F(x)/F(a)) · (F(y)/F(b)) · F(ab) ) ).

Taking the logarithm of this ratio produced very nice properties. Most im-

portantly, the sign of the score distinguished over-represented pairs from under-

represented pairs. Because the magnitudes were symmetric (+1 was just as

impressive as −1) we could add or average these scores in a sensible way to give

a score for each gene. We used these scores to design genes that should be bad

for viruses, which gave an exciting new technology for making vaccines. See the

chapter notes (Section 2.6) for more details.
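A sketch of the scoring computation with made-up counts; the numbers below are purely illustrative, not real genome frequencies.

```python
import math

def codon_pair_score(F_xy, F_x, F_y, F_a, F_b, F_ab):
    # CPS(xy) = ln( F(xy) / ( F(x)/F(a) * F(y)/F(b) * F(ab) ) )
    expected = (F_x / F_a) * (F_y / F_b) * F_ab
    return math.log(F_xy / expected)

# Hypothetical counts: codon x codes its amino acid a 300 of 1000 times,
# codon y codes b 200 of 800 times, and the pair ab occurs 500 times.
over = codon_pair_score(F_xy=60, F_x=300, F_y=200, F_a=1000, F_b=800, F_ab=500)
under = codon_pair_score(F_xy=20, F_x=300, F_y=200, F_a=1000, F_b=800, F_ab=500)
print(over > 0)    # True: the pair appears more often than chance predicts
print(under < 0)   # True: the pair is under-represented
```

The logarithm is what makes the sign carry the over/under distinction, and what makes scores for different pairs safe to add or average.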

Figure 2.14: Patterns in DNA sequences with the lowest codon pair scores

become obvious on inspection. When interpreted in-frame, the stop symbol

TAG is substantially depleted (left). When interpreted in the other two frames,

the most avoided patterns are all very low complexity, like runs of a single base

(right).

Knowing that certain pairs of codons were bad did not explain why they were

bad. But by computing two related scores (details unimportant) and sorting

the triplets based on them, as shown in Figure 2.14, certain patterns popped

out. Do you notice the patterns? All the bad sequences on the left contain

TAG, which turns out to be a special codon that tells the gene to stop. And

all the bad sequences on the right consist of C and G in very simple repetitive

sequences. These observations explain biologically why such patterns are avoided by evolution,

meaning we discovered something very meaningful about life.

There are two take-home lessons from this story. First, developing numerical

scoring functions which highlight speciﬁc aspects of items can be very useful

to reveal patterns. Indeed, Chapter 4 will focus on the development of such

systems. Second, hitting such quantities with a logarithm can make them even

more useful, enabling us to see the forest for the trees.

2.6 Chapter Notes

There are many excellent introductions to probability theory available, including

[Tij12, BT08]. The same goes for elementary statistics, with good introductory

texts including [JWHT13, Whe13]. The brief history of probability theory in

this chapter is based on Weaver [Wea82].

In its strongest form, the eﬃcient market hypothesis states that the stock

market is essentially unpredictable using public information. My personal advice

is that you should invest in index funds that do not actively try to predict the

direction of the market. Malkiel’s A Random Walk Down Wall Street [Mal99]

is an excellent introduction to such investment thinking.

The Fast Fourier Transform (FFT) provides an O(n log n) time algorithm to

compute the full autocorrelation function of an n-element sequence, where the

straightforward computation of n correlations takes O(n²). Bracewell [Bra99]

and Brigham [Bri88] are excellent introductions to Fourier transforms and the

FFT. See also the exposition in Press et al. [PFTV07].

The comic strip in Figure 2.10 comes from Randall Munroe’s webcomic xkcd,

speciﬁcally https://xkcd.com/552, and is reprinted with permission.

The war story of Section 2.5 revolves around our work on how the phe-

nomenon of codon pair bias aﬀects gene translation. Figure 2.14 comes from

my collaborator Justin Gardin. See [CPS+08, MCP+10, Ski12] for discussions

of how we exploited codon pair bias to design vaccines for viral diseases like

polio and the ﬂu.

2.7 Exercises

Probability

2-1. [3] Suppose that 80% of people like peanut butter, 89% like jelly, and 78% like

both. Given that a randomly sampled person likes peanut butter, what is the

probability that she also likes jelly?

2-2. [3] Suppose that P(A) = 0.3 and P(B) = 0.7.

(a) Can you compute P(A and B) if you only know P(A) and P(B)?

(b) Assuming that events A and B arise from independent random processes:

•What is P(A and B)?

•What is P(A or B)?

•What is P(A|B)?

2-3. [3] Consider a game where your score is the maximum value from two dice.

Compute the probability of each event from {1,...,6}.

2-4. [8] Prove that the cumulative distribution function of the maximum of a pair of

values drawn from random variable X is the square of the original cumulative

distribution function of X.

2-5. [5] If two binary random variables X and Y are independent, are X̄ (the

complement of X) and Y also independent? Give a proof or a counterexample.

Statistics

2-6. [3] Compare each pair of distributions to decide which one has the greater

mean and the greater standard deviation. You do not need to calculate the

actual values of µ and σ, just how they compare with each other.

(a) i. 3,5,5,5,8,11,11,11,13.

ii. 3,5,5,5,8,11,11,11,20.

(b) i. −20,0,0,0,15,25,30,30.

ii. −40,0,0,0,15,25,30,30.

(c) i. 0,2,4,6,8,10.

ii. 20,22,24,26,28,30.

(d) i. 100,200,300,400,500.

ii. 0,50,300,550,600.

2-7. [3] Construct a probability distribution where none of the mass lies within one

σ of the mean.

2-8. [3] How do the arithmetic and geometric means compare on random integers?

2-9. [3] Show that the arithmetic mean equals the geometric mean when all terms

are the same.

Correlation Analysis

2-10. [3] True or false: a correlation coeﬃcient of −0.9 indicates a stronger linear

relationship than a correlation coeﬃcient of 0.5. Explain why.

2-11. [3] What would be the correlation coeﬃcient between the annual salaries of

college and high school graduates at a given company, if for each possible job

title the college graduates always made:

(a) $5,000 more than high school grads?

(b) 25% more than high school grads?

(c) 15% less than high school grads?

2-12. [3] What would be the correlation between the ages of husbands and wives if

men always married women who were:

(a) Three years younger than themselves?

(b) Two years older than themselves?

(c) Half as old as themselves?

2-13. [5] Use data or literature found in a Google search to estimate/measure the

strength of the correlation between:

(a) Hits and walks scored for hitters in baseball.

(b) Hits and walks allowed by pitchers in baseball.

2-14. [5] Compute the Pearson and Spearman rank correlations for uniformly drawn

samples of points (x, x^k). How do these values change as a function of increasing

k?

Logarithms

2-15. [3] Show that the logarithm of any number less than 1 is negative.

2-16. [3] Show that the logarithm of zero is undeﬁned.

2-17. [5] Prove that

x·y = b^(log_b x + log_b y).

2-18. [5] Prove the correctness of the formula for changing a base-b logarithm to base

a, that

log_a(x) = log_b(x)/log_b(a).

Implementation Projects

2-19. [3] Find some interesting data sets, and compare how similar their means and

medians are. What are the distributions where the mean and median differ

the most?

2-20. [3] Find some interesting data sets and search all pairs for interesting correlations.

Perhaps start with what is available at http://www.data-manual.com/data.

What do you find?

Interview Questions

2-21. [3] What is the probability of getting exactly k heads on n tosses, where the

coin has probability p of coming up heads on each toss? What about k or

more heads?

2-22. [5] Suppose that the probability of getting a head on the i-th toss of an ever-

changing coin is f(i). How would you efficiently compute the probability of

getting exactly k heads in n tosses?

2-23. [5] At halftime of a basketball game you are oﬀered two possible challenges:

(a) Take three shots, and make at least two of them.

(b) Take eight shots, and make at least ﬁve of them.

Which challenge should you pick to have a better chance of winning the game?

2-24. [3] Tossing a coin ten times resulted in eight heads and two tails. How would

you analyze whether a coin is fair? What is the p-value?

2-25. [5] Given a stream of n numbers, show how to select one uniformly at random

using only constant storage. What if you don't know n in advance?

2-26. [5] A k-streak starts at toss i in a sequence of n coin flips when the outcomes of the

i-th flip and the next k − 1 flips are identical. For example, the sequence HTTTHH

contains 2-streaks starting at the second, third, and fifth tosses. What is the

expected number of k-streaks that you will see in n tosses of a fair coin?

2-27. [5] A person randomly types an eight-digit number into a pocket calculator.

What is the probability that the number looks the same even if the calculator

is turned upside down?

2-28. [3] You play a dice rolling game where you have two choices:

(a) Roll the dice once and get rewarded with a prize equal to the outcome

number (e.g., $3 for number “3”), and then stop the game.

(b) You can reject the ﬁrst reward according to its outcome and roll the dice

a second time, and get rewarded in the same way.

Which strategy should you choose to maximize your reward? That is, for what

outcomes of the first roll should you choose to play the second game? What is

the statistical expectation of reward if you choose the second strategy?

2-29. [3] What is A/B testing and how does it work?

2-30. [3] What is the diﬀerence between statistical independence and correlation?

2-31. [3] We often say that correlation does not imply causation. What does this

mean?

2-32. [5] What is the diﬀerence between a skewed distribution and a uniform one?

Kaggle Challenges

2-33. Cause–eﬀect pairs: correlation vs. causation.

https://www.kaggle.com/c/cause-effect-pairs

2-34. Predict the next “random number” in a sequence.

https://www.kaggle.com/c/random-number-grand-challenge

2-35. Predict the fate of animals at a pet shelter.

https://www.kaggle.com/c/shelter-animal-outcomes

Chapter 3

Data Munging

On two occasions I have been asked, “Pray, Mr. Babbage, if you put

into the machine wrong ﬁgures, will the right answers come out?”

. . . I am not able rightly to apprehend the kind of confusion of ideas

that could provoke such a question.

– Charles Babbage

Most data scientists spend much of their time cleaning and formatting data.

The rest spend most of their time complaining that there is no data available

to do what they want to do.

In this chapter, we will work through some of the basic mechanics of com-

puting with data. Not the high-faluting stuﬀ like statistics or machine learning,

but the grunt work of ﬁnding data and cleaning it that goes under the moniker

of data munging.

While practical questions like “What is the best library or programming

language available?” are clearly important, the answers change so rapidly that

a book like this one is the wrong place to address them. So I will stick at the

level of general principles, instead of shaping this book around a particular set

of software tools. Still, we will discuss the landscape of available resources in

this chapter: why they exist, what they do, and how best to use them.

The ﬁrst step in any data science project is getting your hands on the right

data. But this is often distressingly hard. This chapter will survey the richest

hunting grounds for data resources, and then introduce techniques for cleaning

what you kill. Wrangling your data so that you can safely analyze it is critical

for meaningful results. As Babbage himself might have said more concisely,

“garbage in, garbage out.”

3.1 Languages for Data Science

In theory, every suﬃciently powerful programming language is capable of ex-

pressing any algorithm worth computing. But in practice, certain programming

languages prove much better than others at speciﬁc tasks. Better here might

denote easier for the programmer or perhaps more computationally eﬃcient,

depending upon the mission at hand.

The primary data science programming languages to be aware of are:

•Python: This is today’s bread-and-butter programming language for data

science. Python contains a variety of language features to make basic

data munging easier, like regular expressions. It is an interpreted language,

making the development process quicker and more enjoyable. Python

is supported by an enormous variety of libraries, doing everything from

scraping to visualization to linear algebra and machine learning.

Perhaps the biggest strike against Python is eﬃciency: interpreted lan-

guages cannot compete with compiled ones for speed. But Python compil-

ers exist in a fashion, and support linking in eﬃcient C/assembly language

libraries for computationally-intensive tasks. Bottom line, Python should

probably be your primary tool in working through the material we present

in this book.

•Perl: This used to be the go-to language for data munging on the web,

before Python ate it for lunch. In the TIOBE programming language pop-

ularity index (http://www.tiobe.com/tiobe-index), Python ﬁrst ex-

ceeded Perl in popularity in 2008 and hasn’t looked back. There are several

reasons for this, including stronger support for object-oriented program-

ming and better available libraries, but the bottom line is that there are

few good reasons to start projects in Perl at this point. Don’t be surprised

if you encounter it in some legacy project, however.

•R: This is the programming language of statisticians, with the deepest

libraries available for data analysis and visualization. The data science

world is split between R and Python camps, with R perhaps more suit-

able for exploration and Python better for production use. The style of

interaction with R is somewhat of an acquired taste, so I encourage you

to play with it a bit to see whether it feels natural to you.

Linkages exist between R and Python, so you can conveniently call R

library functions in Python code. This provides access to advanced statis-

tical methods, which may not be supported by the native Python libraries.

•Matlab: The Mat here stands for matrix, as Matlab is a language de-

signed for the fast and eﬃcient manipulation of matrices. As we will see,

many machine learning algorithms reduce to operations on matrices, mak-

ing Matlab a natural choice for engineers programming at a high-level of

abstraction.

Matlab is a proprietary system. However, much of its functionality is

available in GNU Octave, an open-source alternative.

•Java and C/C++: These mainstream programming languages for the

development of large systems are important in big data applications. Parallel

processing systems like Hadoop and Spark are built on the Java virtual

machine (Hadoop in Java, Spark in Scala). If you are living in the world of distributed

computing, then you are living in a world of Java and C++ instead of the other

languages listed here.

•Mathematica/Wolfram Alpha: Mathematica is a proprietary system pro-

viding computational support for all aspects of numerical and symbolic

mathematics, built upon the less proprietary Wolfram programming lan-

guage. It is the foundation of the Wolfram Alpha computational knowl-

edge engine, which processes natural language-like queries through a mix

of algorithms and pre-digested data sources. Check it out at http://www.wolframalpha.com.

I will confess a warm spot for Mathematica.1 It is what I tend to reach

for when I am doing a small data analysis or simulation, but cost has

traditionally put it out of the range of many users. The release of the

Wolfram language perhaps now opens it up to a wider community.

• Excel: Spreadsheet programs like Excel are powerful tools for exploratory

data analysis, such as playing with a given data set to see what it contains.

They deserve our respect for such applications.

Full featured spreadsheet programs contain a surprising amount of hidden

functionality for power users. A student of mine who rose to become a

Microsoft executive told me that 25% of all new feature requests for Excel

proposed functionality already present there. The special functions and

data manipulation features you want probably are in Excel if you look

hard enough, in the same way that a Python library for what you need

probably will be found if you search for it.

3.1.1 The Importance of Notebook Environments

The primary deliverable for a data science project should not be a program. It

should not be a data set. It should not be the results of running the program

on your data. It should not just be a written report.

The deliverable result of every data science project should be a computable

notebook tying together the code, data, computational results, and written

analysis of what you have learned in the process. Figure 3.1 presents an excerpt

from a Jupyter/IPython notebook, showing how it integrates code, graphics,

and documentation into a descriptive document which can be executed like a

program.

The reason this is so important is that computational results are the product

of long chains of parameter selections and design decisions. This creates several

problems that are solved by notebook computing environments:

1 Full disclosure: I have known Stephen Wolfram for over thirty years. Indeed, we invented the iPad together [Bar10, MOR+88].

60 CHAPTER 3. DATA MUNGING

Figure 3.1: Jupyter/IPython notebooks tie together code, computational results, and documentation.

• Computations need to be reproducible. We must be able to run the same

programs again from scratch, and get exactly the same result. This means

that data pipelines must be complete: taking raw input and producing the

ﬁnal output. It is terrible karma to start with a raw data set, do some

processing, edit/format the data ﬁles by hand, and then do some more

processing – because what you did by hand cannot be readily done again

on another data set, or undone after you realize that you may have goofed

up.

• Computations must be tweakable. Often reconsideration or evaluation will prompt a change to one or more parameters or algorithms. This requires rerunning the notebook to produce the new computation. There is nothing more disheartening than to be given a big data product without provenance and told that this is the final result and you can't change anything. A notebook is never finished until after the entire project is done.

• Data pipelines need to be documented. That notebooks permit you to

integrate text and visualizations with your code provides a powerful way

to communicate what you are doing and why, in ways that traditional

programming environments cannot match.

Take-Home Lesson: Use a notebook environment like IPython or Mathematica

to build and report the results of any data science project.

3.1.2 Standard Data Formats

Data comes from all sorts of places, and in all kinds of formats. Which represen-

tation is best depends upon who the ultimate consumer is. Charts and graphs

are marvelous ways to convey the meaning of numerical data to people. Indeed,

Chapter 6 will focus on techniques for visualizing data. But these pictures are

essentially useless as a source of data to compute with. There is a long way

from printed maps to Google Maps.

The best computational data formats have several useful properties:

• They are easy for computers to parse: Data written in a useful format is destined to be used again, elsewhere. Sophisticated data formats are often supported by APIs that govern the technical details, ensuring proper formatting.

• They are easy for people to read: Eyeballing data is an essential operation

in many contexts. Which of the data ﬁles in this directory is the right one

for me to use? What do we know about the data ﬁelds in this ﬁle? What

is the gross range of values for each particular ﬁeld?

These use cases speak to the enormous value of being able to open a data

ﬁle in a text editor to look at it. Typically, this means presenting the

data in a human-readable text-encoded format, with records demarcated

by separate lines, and ﬁelds separated by delimiting symbols.

• They are widely used by other tools and systems: The urge to invent a proprietary data standard beats firmly in the corporate heart, and most software developers would rather share a toothbrush than a file format.

But these are impulses to be avoided. The power of data comes from

mixing and matching it with other data resources, which is best facilitated

by using popular standard formats.

One property I have omitted from this list is conciseness, since it is generally

not a primary concern for most applications running on modern computing

systems. The quest to minimize data storage costs often works against other

goals. Cleverly packing multiple ﬁelds into the higher-order bits of integers saves

space, but at the cost of making it incompatible and unreadable.

General compression utilities like gzip prove amazingly good at removing the

redundancy of human-friendly formatting. Disk prices are unbelievably cheap:

as I write this you can buy a 4TB drive for about $100, meaning less than the

cost of one hour of developer time wasted programming a tighter format. Unless

you are operating at the scale of Facebook or Google, conciseness does not have

nearly the importance you are liable to think it does.2
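The point about compression recovering the cost of human-friendly formatting is easy to check with Python's standard gzip module; the record layout below is invented for illustration:

```python
# Human-friendly formats are redundant, and gzip exploits that redundancy.
import gzip
import json

records = [{"id": i, "name": "user%05d" % i, "active": True} for i in range(1000)]
raw = json.dumps(records, indent=2).encode()  # verbose, readable encoding
packed = gzip.compress(raw)

print(len(raw), len(packed))  # the compressed copy is many times smaller
```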

The most important data formats/representations to be aware of are dis-

cussed below:

• CSV (comma separated value) files: These files provide the simplest, most popular format to exchange data between programs. That each line represents a single record, with fields separated by commas, is obvious from inspection. But subtleties revolve around special characters and text strings: what if your data about names contains a comma, like “Thurston Howell, Jr.”? The csv format provides ways to escape such characters so they are not treated as delimiters, but it is messy. A better alternative is to use a rarer delimiter character, as in tsv or tab separated value files.

The best test of whether your csv ﬁle is properly formatted is whether

Microsoft Excel or some other spreadsheet program can read it without

hassle. Make sure the results of every project pass this test as soon as the

ﬁrst csv ﬁle has been written, to avoid pain later.
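Python's standard csv module handles the comma-in-a-field problem via quoting, which is one reason to prefer it over naive split(",") parsing; the records below are made up:

```python
# Round-tripping a record whose name field contains a comma.
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerows([
    ["Thurston Howell, Jr.", "millionaire"],   # comma inside a field
    ["Gilligan", "first mate"],
])

buf.seek(0)
rows = list(csv.reader(buf))
print(rows[0])  # -> ['Thurston Howell, Jr.', 'millionaire']
```

The writer quotes the problematic field automatically, so the reader recovers it intact.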

• XML (eXtensible Markup Language): Structured but non-tabular data are often written as text with annotations. The natural output of a named-entity tagger for text wraps the relevant substrings of a text in brackets denoting person, place, or thing. I am writing this book in LaTeX, a formatting language with bracketing commands positioned around mathematical expressions and italicized text. All webpages are written in HTML, the hypertext markup language which organizes documents using bracketing commands like <b> and </b> to enclose bold faced text.

XML is a language for writing specifications of such markup languages. A proper XML specification enables the user to parse any document complying with the specification. Designing such specifications and fully adhering to them requires discipline, but is worthwhile. In the first version of our Lydia text analysis system, we wrote our markups in a “pseudo-XML,” read by ad hoc parsers that handled 99% of the documents correctly but broke whenever we tried to extend them. After a painful switch to XML, everything worked more reliably and more efficiently, because we could deploy fast, open-source XML parsers to handle all the dirty work of enforcing our specifications.

2 Indeed, my friends at Google assure me that they are often slovenly about space even at the petabyte scale.
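The payoff of deploying a standard parser instead of ad hoc scripts can be seen with Python's built-in ElementTree; the tiny tagged sentence below is a made-up example of named-entity style markup:

```python
# Parsing tagged text with a standard XML parser (no hand-rolled regexes).
import xml.etree.ElementTree as ET

doc = "<s><person>Skiena</person> works at <place>Stony Brook</place></s>"
root = ET.fromstring(doc)

print(root.find("person").text)  # -> Skiena
print(root.find("place").text)   # -> Stony Brook
```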

• SQL (structured query language) databases: Spreadsheets are naturally

structured around single tables of data. In contrast, relational databases

prove excellent for manipulating multiple distinct but related tables, using

SQL to provide a clunky but powerful query language.

Any reasonable database system imports and exports records as either csv

or XML ﬁles, as well as an internal content dump. The internal represen-

tation in databases is opaque, so it really isn’t accurate to describe them

as a data format. Still, I emphasize them here because SQL databases

generally prove a better and more powerful solution than manipulating

multiple data ﬁles in an ad hoc manner.
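Python ships with sqlite3, so the multiple-related-tables pattern is easy to sketch; the authors/papers schema here is invented for illustration:

```python
# Two related tables, joined with SQL rather than ad hoc file munging.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE papers (author_id INTEGER, title TEXT)")
con.execute("INSERT INTO authors VALUES (1, 'Skiena')")
con.execute("INSERT INTO papers VALUES (1, 'Lydia'), (1, 'Jai-Alai')")

rows = con.execute("""SELECT a.name, COUNT(*) FROM authors a
                      JOIN papers p ON a.id = p.author_id
                      GROUP BY a.id""").fetchall()
print(rows)  # -> [('Skiena', 2)]
```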

• JSON (JavaScript Object Notation): This is a format for transmitting data

objects between programs. It is a natural way to communicate the state of

variables/data structures from one system to another. This representation

is basically a list of attribute-value pairs corresponding to variable/ﬁeld

names, and the associated values:

{"employees":[

{"firstName":"John", "lastName":"Doe"},

{"firstName":"Anna", "lastName":"Smith"},

{"firstName":"Peter", "lastName":"Jones"}

]}

Because library functions that support reading and writing JSON objects

are readily available in all modern programming languages, it has become

a very convenient way to store data structures for later use. JSON objects

are human readable, but quite cluttered-looking when representing arrays of records, compared to CSV files. Use them for complex structured objects, but not simple tables of data.
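In Python, the employees object above round-trips through the standard json module:

```python
# Parsing and regenerating the employees record with the json module.
import json

text = """{"employees":[
    {"firstName":"John", "lastName":"Doe"},
    {"firstName":"Anna", "lastName":"Smith"},
    {"firstName":"Peter", "lastName":"Jones"}]}"""

data = json.loads(text)                    # JSON text -> dicts and lists
print(data["employees"][1]["lastName"])    # -> Smith
print(json.dumps(data, indent=2))          # and back to (pretty) JSON text
```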

• Protocol buffers: These are a language/platform-neutral way of serializing

structured data for communications and storage across applications. They

are essentially lighter weight versions of XML (where you deﬁne the format

of your structured data), designed to communicate small amounts of data

across programs like JSON. This data format is used for much of the inter-

machine communication at Google. Apache Thrift is a related standard,

used at Facebook.

3.2 Collecting Data

The most critical issue in any data science or modeling project is ﬁnding the

right data set. Identifying viable data sources is an art, one that revolves around

three basic questions:

• Who might actually have the data I need?

• Why might they decide to make it available to me?

• How can I get my hands on it?

In this section, we will explore the answers to these questions. We look at

common sources of data, and what you are likely to be able to ﬁnd and why.

We then review the primary mechanisms for getting access, including APIs,

scraping, and logging.

3.2.1 Hunting

Who has the data, and how can you get it? Some of the likely suspects are

reviewed below.

Companies and Proprietary Data Sources

Large companies like Facebook, Google, Amazon, American Express, and Blue

Cross have amazing amounts of exciting data about users and transactions,

data which could be used to improve how the world works. The problem is that

getting outside access is usually impossible. Companies are reluctant to share

data for two good reasons:

• Business issues, and the fear of helping their competition.

• Privacy issues, and the fear of offending their customers.

A heartwarming tale of what can happen with corporate data release oc-

curred when AOL provided academics with a data set of millions of queries to

its search engine, carefully stripped of identifying information. The ﬁrst thing

the academics discovered was that the most frequently-entered queries were des-

perate attempts to escape to other search engines like Google. This did nothing

to increase public conﬁdence in the quality of AOL search.

Their second discovery was that it proved much harder to anonymize search

queries than had previously been suspected. Sure you can replace user names

with id numbers, but it is not that hard to ﬁgure out who the guy on Long Island

repeatedly querying Steven Skiena, Stony Brook, and https://twitter.com/

search?q=Skiena&src=sprv is. Indeed, as soon as it became publicized that

people’s identities had been revealed by this data release, the responsible party

was ﬁred and the data set disappeared. User privacy is important, and ethical

issues around data science will be discussed in Section 12.7.

So don’t think you are going to sweet talk companies into releasing conﬁden-

tial user data. However, many responsible companies like The New York Times,

Twitter, Facebook, and Google do release certain data, typically by rate-limited

application program interfaces (APIs). They generally have two motives:

• Providing customers and third parties with data that can increase sales.

For example, releasing data about query frequency and ad pricing can

encourage more people to place ads on a given platform.

• It is generally better for the company to provide well-behaved APIs than

having cowboys repeatedly hammer and scrape their site.

So hunt for a public API before reading Section 3.2.2 on scraping. You won’t

ﬁnd exactly the content or volume that you dream of, but probably something

that will suﬃce to get started. Be aware of limits and terms of use.

Other organizations do provide bulk downloads of interesting data for oﬀ-

line analysis, as with the Google Ngrams, IMDb, and the taxi fare data sets

discussed in Chapter 1. Large data sets often come with valuable metadata,

such as book titles, image captions, and edit history, which can be re-purposed

with proper imagination.

Finally, most organizations have internal data sets of relevance to their busi-

ness. As an employee, you should be able to get privileged access while you

work there. Be aware that companies have internal data access policies, so you

will still be subject to certain restrictions. Violating the terms of these policies

is an excellent way to become an ex-employee.

Government Data Sources

Collecting data is one of the important things that governments do. Indeed,

the requirement that the United States conduct a census of its population is

mandated by our constitution, and has been running on schedule every ten

years since 1790.

City, state, and federal governments have become increasingly committed

to open data, to facilitate novel applications and improve how government can

fulﬁll its mission. The website http://Data.gov is an initiative by the federal

government to centrally collect its data sources, and at last count points to over

100,000 data sets!

Government data diﬀers from industrial data in that, in principle, it belongs

to the People. The Freedom of Information Act (FOI) enables any citizen to

make a formal request for any government document or data set. Such a request

triggers a process to determine what can be released without compromising the

national interest or violating privacy.

State governments operate under ﬁfty diﬀerent sets of laws, so data that is

tightly held in one jurisdiction may be freely available in others. Major cities

like New York have larger data processing operations than many states, again

with restrictions that vary by location.

I recommend the following way of thinking about government records. If

you cannot ﬁnd what you need online after some snooping around, ﬁgure out

which agency is likely to have it. Make a friendly call to them to see if they

can help you ﬁnd what you want. But if they stonewall you, feel free to try for

a FOI act request. Preserving privacy is typically the biggest issue in deciding

whether a particular government data set can be released.

Academic Data Sets

There is a vast world of academic scholarship, covering all that humanity has

deemed worth knowing. An increasing fraction of academic research involves

the creation of large data sets. Many journals now require making source data

available to other researchers prior to publication. Expect to be able to ﬁnd

vast amounts of economic, medical, demographic, historical, and scientiﬁc data

if you look hard enough.

The key to ﬁnding these data sets is to track down the relevant papers.

There is an academic literature on just about any topic of interest. Google

Scholar is the most accessible source of research publications. Search by topic,

and perhaps “Open Science” or “data.” Research publications will typically

provide pointers to where their associated data can be found. If not, contacting

the author directly with a request should quickly yield the desired result.

The biggest catch with using published data sets is that someone else has

worked hard to analyze them before you got to them, so these previously mined

sources may have been sucked dry of interesting new results. But bringing fresh

questions to old data generally opens new possibilities.

Often interesting data science projects involve collaborations between re-

searchers from diﬀerent disciplines, such as the social and natural sciences.

These people speak diﬀerent languages than you do, and may seem intimidating

at ﬁrst. But they often welcome collaboration, and once you get past the jargon

it is usually possible to understand their issues on a reasonable level without

specialized study. Be assured that people from other disciplines are generally

not any smarter than you are.

Sweat Equity

Sometimes you will have to work for your data, instead of just taking it from

others. Much historical data still exists only in books or other paper documents,

thus requiring manual entry and curation. A graph or table might contain

information that we need, but it can be hard to get numbers from a graphic

locked in a PDF (portable document format) ﬁle.

I have observed that computationally-oriented people vastly over-estimate

the amount of eﬀort it takes to do manual data entry. At one record per minute,

you can easily enter 1,000 records in only two work days. Instead, computational

people tend to devote massive eﬀorts trying to avoid such grunt work, like

hunting in vain for optical character recognition (OCR) systems that don’t make

a mess of the ﬁle, or spending more time cleaning up a noisy scan than it would

take to just type it in again fresh.

A middle ground here comes in paying someone else to do the dirty work for

you. Crowdsourcing platforms like Amazon Mechanical Turk and CrowdFlower enable you

to pay for armies of people to help you extract data, or even collect it in the

ﬁrst place. Tasks requiring human annotation like labeling images or answering

surveys are particularly good uses of remote workers. Crowdsourcing will be

discussed in greater detail in Section 3.5.

Many amazing open data resources have been built up by teams of contrib-

utors, like Wikipedia, Freebase, and IMDb. But there is an important concept

to remember: people generally work better when you pay them.

3.2.2 Scraping

Webpages often contain valuable text and numerical data, which we would like

to get our hands on. For example, in our project to build a gambling system

for the sport of jai-alai, we needed to feed our system the results of yesterday’s

matches and the schedule of what games were going on today. Our solution

was to scrape the websites of jai-alai betting establishments, which posted this

information for their fans.

There are two distinct steps to make this happen, spidering and scraping:

• Spidering is the process of downloading the right set of pages for analysis.

• Scraping is the fine art of stripping this content from each page to prepare

it for computational analysis.

The ﬁrst thing to realize is that webpages are generally written in simple-to-

understand formatting languages like HTML and/or JavaScript. Your browser

knows these languages, and interprets the text of the webpage as a program

to specify what to display. By calling a function that emulates/pretends to

be a web browser, your program can download any webpage and interpret the

contents for analysis.

Traditionally, scraping programs were site-speciﬁc scripts hacked up to look

for particular HTML patterns ﬂanking the content of interest. This exploited the

fact that large numbers of pages on speciﬁc websites are generated by programs

themselves, and hence highly predictable in their format. But such scripts tend

to be ugly and brittle, breaking whenever the target website tinkers with the

internal structure of its pages.

Today, libraries in languages like Python (see BeautifulSoup) make it easier

to write robust spiders and scrapers. Indeed, someone else probably has already

written a spider/scraper for every popular website and made it available on

SourceForge or GitHub, so search before you code.
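As a minimal illustration, here is a scraper sketch using Python's standard html.parser to pull the title out of a page; a real project would lean on a robust library like BeautifulSoup instead, and the page text here is invented:

```python
# A tiny scraper: extract the <title> text from an HTML page.
from html.parser import HTMLParser

class TitleScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title, self.title = False, ""
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:          # collect text only inside <title>...</title>
            self.title += data

page = "<html><head><title>Jai-Alai Results</title></head><body>...</body></html>"
scraper = TitleScraper()
scraper.feed(page)
print(scraper.title)  # -> Jai-Alai Results
```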

Certain spidering missions may be trivial, for example, hitting a single URL

(uniform resource locator) at regular time intervals. Such patterns occur in mon-

itoring, say, the sales rank of this book from its Amazon page. Somewhat more

sophisticated approaches to spidering are based on the name regularity of the

underlying URLs. If all the pages on a site are speciﬁed by the date or product

ID number, for example http://www.amazon.com/gp/product/1107041376/,

iterating through the entire range of interesting values becomes just a matter

of counting.
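Such counting-based spidering can be sketched in a few lines; the date-keyed URL pattern below is hypothetical:

```python
# Enumerating a week of result pages keyed by date in the URL.
from datetime import date, timedelta

base = "http://www.example.com/results/{}.html"   # hypothetical pattern
start = date(2017, 3, 1)
urls = [base.format((start + timedelta(days=d)).isoformat())
        for d in range(7)]

print(urls[0])   # -> http://www.example.com/results/2017-03-01.html
```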

The most advanced form of spidering is web crawling, where you systemat-

ically traverse all outgoing links from a given root page, continuing recursively

until you have visited every page on the target website. This is what Google does

in indexing the web. You can do it too, with enough patience and easy-to-ﬁnd

web crawling libraries in Python.

Please understand that politeness limits how rapidly you should spider/crawl

a given website. It is considered bad form to hit a site more than once a second,

and indeed best practices dictate that providers block access to the people who

are hammering them.
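A once-per-second politeness policy takes only a sleep between requests; the `fetch` argument here stands in for whatever download function you actually use:

```python
# Rate-limited fetching: never hit a site more than once per `delay` seconds.
import time

def polite_fetch_all(urls, fetch, delay=1.0):
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)     # pause between successive hits
        results.append(fetch(url))
    return results
```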

Every major website contains a terms of service document that restricts what

you can legally do with any associated data. Generally speaking, most sites will

leave you alone provided you don’t hammer them, and do not redistribute any

data you scrape. Understand that this is an observation, not a legal opinion.

Indeed, read about the Aaron Swartz case, where a well-known Internet ﬁg-

ure was brought up on serious criminal charges for violating terms of services

in spidering/scraping journal articles, and literally hounded to death. If you

are attempting a web-scraping project professionally, be sure that management

understands the terms of service before you get too creative with someone else’s

property.

3.2.3 Logging

If you have access to a potential data source, treat it like you own it. Internal access to

a web service, communications device, or laboratory instrument grants you the

right and responsibility to log all activity for downstream analysis.

Amazing things can be done with ambient data collection from weblogs

and sensing devices, soon destined to explode with the coming “Internet of

Things.” The accelerometers in cell phones can be used to measure the strength

of earthquakes, with the correlation of events within a region suﬃcient to ﬁlter

out people driving on bumpy roads or leaving their phones in a clothes dryer.

Monitoring the GPS data of a ﬂeet of taxi cabs tracks traﬃc congestion on

city streets. Computational analysis of image and video streams opens the

door to countless applications. Another cool idea is to use cameras as weather

instruments, by looking at the color of the sky in the background of the millions

of photographs uploaded to photo sites daily.

The primary reason to instrument your system to collect data is because

you can. You might not know exactly what to do with it now, but any well-

constructed data set is likely to become of value once it hits a certain critical

mass of size.

Current storage costs make clear just how low a barrier it is to instrument a

system. My local Costco is currently selling three-terabyte disk drives for under

$100, which is Big O of nothing. If each transaction record takes 1 kilobyte (one

thousand characters), this device in principle has room for 3 billion records,

roughly one for every two people on earth.

The important considerations in designing any logging system are:

• Build it to endure with limited maintenance. Set it and forget it, by

provisioning it with enough storage for unlimited expansion, and a backup.

• Store all fields of possible value, without going crazy.

• Use a human-readable format or transactions database, so you can understand exactly what is in there when the time comes, months or years

later, to sit down and analyze your data.
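These considerations suggest something as simple as an append-only log of one human-readable JSON record per line; the field names below are illustrative:

```python
# Append-only, human-readable logging: one timestamped JSON record per line.
import json
import os
import tempfile
import time

def log_record(path, record):
    """Append one timestamped record; never rewrite history."""
    record = dict(record, timestamp=time.time())
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

path = os.path.join(tempfile.mkdtemp(), "transactions.log")
log_record(path, {"user": 42, "action": "purchase"})
log_record(path, {"user": 7, "action": "refund"})
print(sum(1 for _ in open(path)))   # -> 2
```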

3.3 Cleaning Data

“Garbage in, garbage out” is the fundamental principle of data analysis. The

road from raw data to a clean, analyzable data set can be a long one.

Many potential issues can arise in cleaning data for analysis. In this section,

we discuss identifying processing artifacts and integrating diverse data sets. Our

focus here is the processing before we do our real analysis, to make sure that

the garbage never gets in in the ﬁrst place.

Take-Home Lesson: Savvy painting restorers only do things to the original that

are reversible. They never do harm. Similarly, data cleaning is always done

on a copy of the original data, ideally by a pipeline that makes changes in a

systematic and repeatable way.

3.3.1 Errors vs. Artifacts

Under ancient Jewish law, if a suspect on trial was unanimously found guilty

by all judges, then this suspect would be acquitted. The judges had noticed

that unanimous agreement often indicates the presence of a systemic error in

the judicial process. They reasoned that when something seems too good to be

true, a mistake has likely been made somewhere.

If we view data items as measurements about some aspect of the world,

data errors represent information that is fundamentally lost in acquisition. The

Gaussian noise blurring the resolution of our sensors represents error, precision

which has been permanently lost. The two hours of missing logs because the

server crashed represents data error: it is information which cannot be recon-

structed again.

By contrast, artifacts are generally systematic problems arising from pro-

cessing done to the raw information it was constructed from. The good news is

that processing artifacts can be corrected, so long as the original raw data set

remains available. The bad news is that these artifacts must be detected before

they can be corrected.

Figure 3.2: What artifacts can you find in this time series, counting the number of authors' names first appearing in the scientific literature each year?

The key to detecting processing artifacts is the “sniﬀ test,” examining the

product closely enough to get a whiﬀ of something bad. Something bad is usually

something unexpected or surprising, because people are naturally optimists.

Surprising observations are what data scientists live for. Indeed, such insights

are the primary reason we do what we do. But in my experience, most surprises

turn out to be artifacts, so we must look at them skeptically.

Figure 3.2 presents computational results from a project where we investi-

gated the process of scientiﬁc publication. It shows a time series of the 100,000

most proliﬁc authors, binned according to the year of their ﬁrst paper appearing

in Pubmed, an essentially complete bibliography of the biomedical literature.

Study this ﬁgure closely, and see if you can discover any artifacts worth

commenting on. I see at least two of them. Extra credit will be awarded if you

can ﬁgure out what caused the problem.

The key to finding artifacts is to look for anomalies in the data that contradict what you expect to see. What should the distribution in the number of

virgin authors look like, and how should it change over time? First, construct

a prior distribution of what you expect to see, so that you can then properly

evaluate potential anomalies against it.

My intuition says that the distribution of new top scientists should be pretty

ﬂat, because new stars are born with every successive class of graduate students.

I would also guess that there may be a gradual drift upward as population

expands, and more people enter the scientiﬁc community. But that’s not what

I see in Figure 3.2. So try to enumerate what the anomalies/potential artifacts

are. . .

Figure 3.3: The cleaned data removes these artifacts, and the resulting distribution looks correct.

I see two big bumps when I look at Figure 3.2: a left bump starting around 1965, and a peak which explodes in 2002. On reflection, the leftmost bump

makes sense. This left peak occurs the year when Pubmed ﬁrst started to

systematically collect bibliographic records. Although there is some very in-

complete data from 1960–1964, most older scientists who had been publishing

papers for several years would “emerge” only with the start of systematic records

in 1965. So this explains the left peak, which then settles down by 1970 to what

looks like the ﬂat distribution we expected.

But what about that giant 2002 peak? And the decline in new authors to

almost zero in the years which precede it? A similar decline is also visible to

the right of the big peak. Were all the world’s major scientists destined to be

born in 2002?

A careful inspection of the records in the big peak revealed the source of

the anomaly: ﬁrst names. In the early days of Pubmed, authors were identiﬁed

by their initials and last names. But late in 2001, SS Skiena became Steven S.

Skiena, so it looked like a new author emerging from the heavens.

But why the declines to nothingness to the left and right of this peak? Recall

that we limited this study to the 100,000 most proliﬁc scientists. A scientiﬁc

rock star emerging in 1998 would be unlikely to appear in this ranking because

their name was doomed to change a few years later, not leaving enough time

to accumulate a full career of papers. Similar things happen at the very right

of the distribution: newly created scientists in 2010 would never be able to

achieve a full career’s work in only a couple of years. Both phenomena are

neatly explained by this ﬁrst name basis.

Cleaning this data to unify name references took us a few iterations to get

right. Even after eliminating the 2002 peak, we still saw a substantial dip in

prominent scientists starting their careers in the mid 1990s. This was because

many people who had a great half career pre-ﬁrst names and a second great

half career post-ﬁrst names did not rise to the threshold of a great full career

in either period alone. Thus we had to match all the names in the full data set before identifying who were the top 100,000 scientists.

Figure 3.3 shows our ﬁnal distribution of authors, which matches the pla-

tonic ideal of what we expected the distribution to be. Don’t be too quick to

rationalize away how your data looks coming out of the computer. My collabo-

rators were at one point ready to write oﬀ the 2002 bump as due to increases in

research funding or the creation of new scientiﬁc journals. Always be suspicious

of whether your data is clean enough to trust.

3.3.2 Data Compatibility

We say that a comparison of two items is “apples to apples” when it is a fair

comparison, meaning that the items involved are similar enough that they can

be meaningfully stood up against each other. In contrast, “apples to oranges” comparisons

are ultimately meaningless. For example:

•It makes no sense to compare weights of 123.5 against 78.9, when one is

in pounds and the other is in kilograms.

•It makes no sense to directly compare the movie gross of Gone with the

Wind against that of Avatar, because 1939 dollars are 15.43 times more

valuable than 2009 dollars.

•It makes no sense to compare the price of gold at noon today in New York

and London, because the time zones are five hours off and the prices are

affected by intervening events.

•It makes no sense to compare the stock price of Microsoft on February 17,

2003 to that of February 18, 2003, because the intervening 2-for-1 stock

split cut the price in half, but reﬂects no change in real value.

These types of data comparability issues arise whenever data sets are merged.

Here I hope to show you how insidious such comparability issues can be, to

sensitize you as to why you need to be aware of them. Further, for certain

important classes of conversions I point to ways to deal with them.

Take-Home Lesson: Review the meaning of each of the ﬁelds in any data set

you work with. If you do not understand what’s in there down to the units of

measurement, there is no sensible way you can use it.

Unit Conversions

Quantifying observations in physical systems requires standard units of measure-

ment. Unfortunately there exist many functionally equivalent but incompatible

systems of measurement. My 12-year-old daughter and I both weigh about 70,

but one of us is in pounds and the other in kilograms.


Disastrous things like rocket explosions happen when measurements are en-

tered into computer systems using the wrong units of measurement. In par-

ticular, NASA lost the $125 million Mars Climate Orbiter space mission on

September 23, 1999 due to an English-to-metric conversion issue: one subsystem

reported thruster impulse in pound-force seconds while another expected

newton-seconds.

Such problems are best addressed by selecting a single system of measure-

ments and sticking to it. The metric system oﬀers several advantages over the

traditional English system. In particular, individual measurements are naturally

expressed as single decimal quantities (like 3.28 meters) instead of incomparable

pairs of quantities (5 feet, 8 inches). This same issue arises in measuring angles

(radians vs. degrees/seconds) and weight (kilograms vs. pounds/oz).

Sticking to the metric system does not by itself solve all comparability is-

sues, since there is nothing to prevent you from mixing heights in meters and

centimeters. But it is a good start.

How can you defend yourself against incompatible units when merging data

sets? Vigilance has to be your main weapon. Make sure that you know the

intended units for each numerical column in your data set, and verify compat-

ibility when merging. Any column which does not have an associated unit or

object type should immediately be suspect.

When merging records from diverse sources, it is an excellent practice to

create a new “origin” or “source” ﬁeld to identify where each record came from.

This provides at least the hope that unit conversion mistakes can be corrected

later, by systematically operating on the records from the problematic source.
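This practice can be sketched in a few lines of Python. The records and the feet-to-meters factor below are purely illustrative:

```python
# Two record sources with incompatible height units (illustrative data).
english = [{"name": "Ann", "height": 5.6}]   # feet
metric = [{"name": "Bo", "height": 1.7}]     # meters

# Tag every record with its origin when merging.
merged = ([dict(r, source="english") for r in english] +
          [dict(r, source="metric") for r in metric])

# A unit mistake discovered later can be fixed by targeting one source.
for r in merged:
    if r["source"] == "english":
        r["height"] = round(r["height"] * 0.3048, 3)  # feet -> meters

print(merged[0])  # {'name': 'Ann', 'height': 1.707, 'source': 'english'}
```

Because the origin field survives the merge, the repair touches only the problematic records, leaving the rest of the data untouched.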

A partially-automated procedure to detect such problems can be devised

from statistical signiﬁcance testing, to be discussed in Section 5.3. Suppose we

were to plot the frequencies of human heights in a merged data set of English

(feet) and metric (meter) measurements. We would see one peak in the distri-

bution around 1.8 and a second around 5.5. The existence of multiple peaks in a

distribution should make us suspicious. The p-value resulting from signiﬁcance

testing on the two input populations provides a rigorous measurement of the

degree to which our suspicions are validated.
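A rough sketch of the idea, on simulated data (the means, standard deviations, and integer bin width below are arbitrary choices for illustration): bin the merged column and look for two well-separated peaks.

```python
import random

random.seed(7)

# Simulated merged column: 1000 heights in meters, 1000 in feet,
# with no unit label to distinguish them.
merged = ([random.gauss(1.75, 0.10) for _ in range(1000)] +
          [random.gauss(5.74, 0.33) for _ in range(1000)])

# Bin to the nearest integer and find the two most populous bins.
counts = {}
for h in merged:
    b = round(h)
    counts[b] = counts.get(b, 0) + 1
peaks = sorted(sorted(counts, key=counts.get, reverse=True)[:2])

print(peaks)  # two well-separated peaks, near 2 and 6 -- suspicious
```

A proper significance test on the two subpopulations would then quantify the suspicion, as discussed in Section 5.3.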

Numerical Representation Conversions

Numerical features are the easiest to incorporate into mathematical models.

Indeed, certain machine learning algorithms such as linear regression and sup-

port vector machines work only with numerically-coded data. But even turning

numbers into numbers can be a subtle problem. Numerical ﬁelds might be rep-

resented in diﬀerent ways: as integers (123), as decimals (123.5), or even as

fractions (123 1/2). Numbers can even be represented as text, requiring the

conversion from “ten million” to 10000000 for numerical processing.

Numerical representation issues can take credit for destroying another rocket

ship. An Ariane 5 rocket launched at a cost of $500 million on June 4, 1996

exploded forty seconds after lift-oﬀ, with the cause ultimately ascribed to an

unsuccessful conversion of a 64-bit ﬂoating point number to a 16-bit integer.

The distinction between integers and ﬂoating point (real) numbers is impor-

tant to maintain. Integers are counting numbers: quantities which are really


discrete should be represented as integers. Physically measured quantities are

never precisely quantiﬁed, because we live in a continuous world. Thus all mea-

surements should be reported as real numbers. Integer approximations of real

numbers are sometimes used in a misbegotten attempt to save space. Don’t do

this: the quantiﬁcation eﬀects of rounding or truncation introduces artifacts.

In one particularly clumsy data set we encountered, baby weights were rep-

resented as two integer ﬁelds (pounds and the remaining ounces). Much better

would have been to combine them into a single decimal quantity.
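Both conversions mentioned above are easy to sketch in Python; the function names here are hypothetical:

```python
from fractions import Fraction

def baby_weight(pounds: int, ounces: int) -> float:
    """Fold a (pounds, remaining ounces) pair into one decimal quantity."""
    return pounds + ounces / 16.0

def parse_mixed(s: str) -> float:
    """Parse a mixed fraction like '123 1/2', or a plain integer string."""
    whole, _, frac = s.partition(" ")
    return int(whole) + (float(Fraction(frac)) if frac else 0.0)

print(baby_weight(7, 8))       # 7.5
print(parse_mixed("123 1/2"))  # 123.5
```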

Name Uniﬁcation

Integrating records from two distinct data sets requires them to share a common

key ﬁeld. Names are frequently used as key ﬁelds, but they are often reported

inconsistently. Is José the same fellow as Jose? Such diacritic marks are banned

from the oﬃcial birth records of several U.S. states, in an aggressive attempt to

force them to be consistent.

As another case in point, databases show my publications as authored by the

Cartesian product of my first (Steve, Steven, or S.), middle (Sol, S., or blank),

and last (Skiena) names, allowing for nine diﬀerent variations. And things get

worse if we include misspellings. I can ﬁnd myself on Google with a ﬁrst name

of Stephen and last names of Skienna and Skeina.

Unifying records by key is a very ugly problem, which doesn’t have a magic

bullet. This is exactly why ID numbers were invented, so use them as keys if

you possibly can.

The best general technique is uniﬁcation: doing simple text transformations

to reduce each name to a single canonical version. Converting all strings to

lower case increases the number of (usually correct) collisions. Eliminating

middle names or at least reducing them to an abbreviation creates even more

name matches/collisions, as does mapping ﬁrst names to canonical versions (like

turning all Steves into Stevens).

Any such transformation runs the risk of creating Frankenstein-people, single

records assembled from multiple bodies. Applications diﬀer in whether the

greater danger lies in merging too aggressively or too timidly. Figure out where

your task sits on this spectrum and act accordingly.
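A minimal unification sketch follows. The nickname table is a tiny illustrative stand-in, not a real data source:

```python
# Map common nicknames to canonical first names (illustrative only).
CANONICAL_FIRST = {"steve": "steven", "bob": "robert", "liz": "elizabeth"}

def unify(name: str) -> str:
    """Reduce a name to a canonical lowercase 'first last' key."""
    parts = name.lower().replace(".", "").split()
    first, last = parts[0], parts[-1]        # drop middle names/initials
    first = CANONICAL_FIRST.get(first, first)
    return f"{first} {last}"

# Several of the nine variations collapse to one canonical key:
print(unify("Steve S. Skiena"))    # steven skiena
print(unify("Steven Sol Skiena"))  # steven skiena
print(unify("S. Skiena"))          # s skiena -- initials stay ambiguous
```

Note the last case: a bare initial cannot be safely expanded, which is exactly the merge-too-aggressively versus merge-too-timidly trade-off described above.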

An important concern in merging data sets is character code uniﬁcation.

Characters in text strings are assigned numerical representations, with the map-

ping between symbols and number governed by the character code standard.

Unfortunately, there are several diﬀerent character code standards in common

usage, meaning that what you scrape from a webpage might not be in the same

character code as assumed by the system which will process it.

Historically, the good old 7-bit ASCII code standard was expanded to the

8-bit ISO 8859-1 Latin alphabet code, which adds characters and punctuation

marks from several European languages. UTF-8 is an encoding of all Unicode

characters using variable numbers of 8-bit blocks, which is backwards compatible

with ASCII. It is the dominant encoding for web-pages, although other systems

remain in use.


Correctly unifying character codes after merging is pretty much impossible.

You must have the discipline to pick a single code as a standard, and check the

encoding of each input ﬁle on preprocessing, converting it to the target before

further work.
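A sketch of this pick-one-standard discipline in Python, assuming (as a fallback heuristic, not a guarantee) that non-UTF-8 input is ISO 8859-1:

```python
def to_utf8(raw: bytes) -> str:
    """Decode input bytes, preferring UTF-8 and falling back to
    ISO 8859-1 (which never fails, since every byte maps to a character)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso8859-1")

latin = "José".encode("iso8859-1")  # b'Jos\xe9' -- not valid UTF-8
print(to_utf8(latin))                    # José
print(to_utf8("José".encode("utf-8")))   # José
```

Real pipelines may need a smarter detector, but the structure is the same: every input file is converted to the target encoding before any further processing.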

Time/Date Uniﬁcation

Date/time stamps are used to infer the relative order of events, and to group events

by relative simultaneity. Integrating event data from multiple sources requires

careful cleaning to ensure meaningful results.

First let us consider issues in measuring time. The clocks from two computers

never exactly agree, so precisely aligning logs from diﬀerent systems requires a

mix of work and guesswork. There are also time zone issues when dealing with

data from diﬀerent regions, as well as diversities in local rules governing changes

in daylight saving time.

The right answer here is to align all time measurements to Coordinated Uni-

versal Time (UTC), a modern standard subsuming the traditional Greenwich

Mean Time (GMT). A related standard is UNIX time, which reports an event’s

precise time in terms of the number of elapsed seconds since 00:00:00 UTC on

Thursday, January 1, 1970.
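Both standards are directly supported by Python's standard library; the sample timestamps below are arbitrary:

```python
from datetime import datetime, timezone, timedelta

# Noon in New York (UTC-5 in winter) and 5 p.m. in London (UTC)
# name the same instant, once both are aligned to UTC.
ny = datetime(2010, 1, 15, 12, 0, tzinfo=timezone(timedelta(hours=-5)))
london = datetime(2010, 1, 15, 17, 0, tzinfo=timezone.utc)
print(ny.astimezone(timezone.utc) == london)  # True

# UNIX time: seconds elapsed since 00:00:00 UTC, January 1, 1970.
unix_seconds = int(london.timestamp())
roundtrip = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
print(roundtrip == london)  # True
```

Keeping every timestamp timezone-aware, rather than naive, is what makes such comparisons safe.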

The Gregorian calendar is common throughout the technology world, al-

though many other calendar systems are in use in diﬀerent countries. Subtle

algorithms must be used to convert between calendar systems, as described in

[RD01]. A bigger problem for date alignment concerns the proper interpretation

of time zones and the international date line.

Time series uniﬁcation is often complicated by the nature of the business

calendar. Financial markets are closed on weekends and holidays, making for

questions of interpretation when you are correlating, say, stock prices to local

temperature. What is the right moment over the weekend to measure tempera-

ture, so as to be consistent with other days of the week? Languages like Python

contain extensive libraries to deal with ﬁnancial time series data to get issues

like this correct. Similar issues arise with monthly data, because months (and

even years) have diﬀerent lengths.
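One simple convention for the weekend-alignment problem is to map each weekend date back to the preceding trading day. The toy sketch below ignores market holidays, which a real implementation must also handle:

```python
from datetime import date, timedelta

def last_business_day(d: date) -> date:
    """Map weekend dates back to the preceding Friday.
    Ignores market holidays, which real calendars must handle."""
    while d.weekday() >= 5:          # 5 = Saturday, 6 = Sunday
        d -= timedelta(days=1)
    return d

print(last_business_day(date(2023, 7, 9)))  # 2023-07-07 (Sunday -> Friday)
print(last_business_day(date(2023, 7, 7)))  # 2023-07-07 (already a Friday)
```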

Financial Uniﬁcation

Money makes the world go round, which is why so many data science projects

revolve around ﬁnancial time series. But money can be dirty, so this data

requires cleaning.

One issue here is currency conversion, representing international prices using

a standardized ﬁnancial unit. Currency exchange rates can vary by a few percent

within a given day, so certain applications require time-sensitive conversions.

Conversion rates are not truly standardized. Diﬀerent markets will each have

diﬀerent rates and spreads, the gap between buying and selling prices that cover

the cost of conversion.


The other important correction is for inﬂation. The time value of money

implies that a dollar today is (generally) more valuable than a dollar a year

from now, with interest rates providing the right way to discount future dollars.

Inﬂation rates are estimated by tracking price changes over baskets of items,

and provide a way to standardize the purchasing power of a dollar over time.

Using unadjusted prices in a model over non-trivial periods of time is just

begging for trouble. A group of my students once got very excited by the

strong correlation observed between stock prices and oil prices over a thirty-

year period, and so tried to use stock prices in a commodity prediction model.

But both goods were priced in dollars, without any adjustment as they inﬂated.

The time series of prices of essentially any pair of items will correlate strongly

over time when you do not correct for inﬂation.

In fact, the most meaningful way to represent price changes over time is

probably not diﬀerences but returns, which normalize the diﬀerence by the initial

price:

	r_i = (p_{i+1} - p_i) / p_i

This is analogous to a percentage change, with the added advantage that

taking the logarithm of the price ratio treats gains and losses symmetrically.
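Both forms are one-liners over a price series (the prices below are made up):

```python
import math

prices = [100.0, 110.0, 100.0]

# r_i = (p_{i+1} - p_i) / p_i
returns = [(prices[i + 1] - prices[i]) / prices[i]
           for i in range(len(prices) - 1)]
print(returns)  # [0.1, -0.0909...]: the gain and the loss that undoes it differ

# Log returns treat that gain and its undoing loss symmetrically:
log_returns = [math.log(prices[i + 1] / prices[i])
               for i in range(len(prices) - 1)]
print(log_returns[0] + log_returns[1])  # ~0: they cancel
```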

Financial time series contain many other subtleties which require cleaning.

Many stocks give scheduled dividends to the shareholder on a particular date

every year. Say, for example, that Microsoft will pay a $2.50 dividend on Jan-

uary 16. If you own a share of Microsoft at the start of business that day, you

receive this check, so the value of the share then immediately drops by $2.50

the moment after the dividend is issued. This price decline reﬂects no real loss

to the shareholder, but properly cleaned data needs to factor the dividend into

the price of the stock. It is easy to imagine a model trained on uncorrected

price data learning to sell stocks just prior to its issuing dividends, and feeling

unjustly proud of itself for doing so.
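One common cleaning convention, sketched below on made-up prices, is to back-adjust the series by subtracting the dividend from every quote before the ex-dividend date, so the series tracks holder value rather than raw quotes:

```python
def dividend_adjust(prices, ex_index, dividend):
    """Subtract the dividend from every price before the ex-dividend
    date, removing the artificial drop from the series."""
    return [p - dividend if i < ex_index else p
            for i, p in enumerate(prices)]

raw = [52.50, 52.40, 49.90, 50.00]   # $2.50 dividend paid before day 2
adj = dividend_adjust(raw, 2, 2.50)
print(adj)  # [50.0, 49.9, 49.9, 50.0] -- the artificial "crash" is gone
```

A model trained on the adjusted series no longer sees a phantom $2.50 loss to learn from.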

3.3.3 Dealing with Missing Values

Not all data sets are complete. An important aspect of data cleaning is iden-

tifying ﬁelds for which data isn’t there, and then properly compensating for

them:

•What is the year of death of a living person?

•What should you do with a survey question left blank, or ﬁlled with an

obviously outlandish value?

•What is the relative frequency of events too rare to see in a limited-size

sample?

Numerical data sets expect a value for every element in a matrix. Setting

missing values to zero is tempting, but generally wrong, because there is always

some ambiguity as to whether these values should be interpreted as data or not.


Is someone’s salary zero because he is unemployed, or did he just not answer

the question?

The danger with using nonsense values as not-data symbols is that they

can get misinterpreted as data when it comes time to build models. A linear

regression model trained to predict salaries from age, education, and gender will

have trouble with people who refused to answer the question.

Using a value like −1 as a no-data symbol has exactly the same deﬁciencies

as zero. Indeed, be like the mathematician who is afraid of negative numbers:

stop at nothing to avoid them.

Take-Home Lesson: Separately maintain both the raw data and its cleaned

version. The raw data is the ground truth, and must be preserved intact for

future analysis. The cleaned data may be improved using imputation to ﬁll in

missing values. But keep raw data distinct from cleaned, so we can investigate

diﬀerent approaches to guessing.

So how should we deal with missing values? The simplest approach is to

drop all records containing missing values. This works just ﬁne when it leaves

enough training data, provided the missing values are absent for non-systematic

reasons. If the people refusing to state their salary were generally those above

the mean, dropping these records will lead to biased results.

But typically we want to make use of records with missing ﬁelds. It can be

better to estimate or impute missing values, instead of leaving them blank. We

need general methods for ﬁlling in missing values. Candidates include:

•Heuristic-based imputation: Given suﬃcient knowledge of the underlying

domain, we should be able to make a reasonable guess for the value of

certain ﬁelds. If I need to ﬁll in a value for the year you will die, guessing

birth year+80 will prove about right on average, and a lot faster than

waiting for the ﬁnal answer.

•Mean value imputation: Using the mean value of a variable as a proxy

for missing values is generally sensible. First, adding more values with

the mean leaves the mean unchanged, so we do not bias our statistics by

such imputation. Second, ﬁelds with mean values add a vanilla ﬂavor to

most models, so they have a muted impact on any forecast made using

the data.

But the mean might not be appropriate if there is a systematic reason

for missing data. Suppose we used the mean death-year in Wikipedia to

impute the missing value for all living people. This would prove disastrous,

with many people recorded as dying before they were actually born.

•Random value imputation: Another approach is to select a random value

from the column to replace the missing value. This would seem to set us up

for potentially lousy guesses, but that is actually the point. Repeatedly

selecting random values permits statistical evaluation of the impact of

imputation. If we run the model ten times with ten diﬀerent imputed


values and get widely varying results, then we probably shouldn’t have

much conﬁdence in the model. This accuracy check is particularly valuable

when there is a substantial fraction of values missing from the data set.

•Imputation by nearest neighbor: What if we identify the complete record

which matches most closely on all ﬁelds present, and use this nearest

neighbor to infer the values of what is missing? Such predictions should

be more accurate than the mean, when there are systematic reasons to

explain variance among records.

This approach requires a distance function to identify the most similar

records. Nearest neighbor methods are an important technique in data

science, and will be presented in greater detail in Section 10.2.

•Imputation by interpolation: More generally, we can use a method like

linear regression (see Section 9.1) to predict the values of the target col-

umn, given the other ﬁelds in the record. Such models can be trained over

full records and then applied to those with missing values.

Using linear regression to predict missing values works best when there is

only one ﬁeld missing per record. The potential danger here is creating

signiﬁcant outliers through lousy predictions. Regression models can easily

turn an incomplete record into an outlier, by ﬁlling the missing ﬁelds in

with unusually high or low values. This would lead downstream analysis

to focus more attention on the records with missing values, exactly the

opposite of what we want to do.
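The first two strategies above can be sketched in a few lines, with None marking missing entries (the salary figures are made up):

```python
import random
import statistics

salaries = [48000, 52000, None, 61000, None, 55000]
known = [x for x in salaries if x is not None]

# Mean-value imputation: filling with the mean leaves the mean unchanged.
mean = statistics.mean(known)
mean_filled = [mean if x is None else x for x in salaries]
print(statistics.mean(mean_filled) == mean)  # True

# Random-value imputation: rerun the analysis with several seeds and
# compare results to gauge how sensitive the model is to the guesses.
random.seed(0)
random_filled = [random.choice(known) if x is None else x for x in salaries]
```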

Such concerns emphasize the importance of outlier detection, the ﬁnal step

in the cleaning process that will be considered here.

3.3.4 Outlier Detection

Mistakes in data collection can easily produce outliers that can interfere with

proper analysis. An interesting example concerns the largest dinosaur vertebra

ever discovered. Measured at 1500 millimeters, it implies an individual that was

188 feet long. This is amazing, particularly because the second largest specimen

ever discovered comes in at only 122 feet.

The most likely explanation here (see [Gol16]) is that this giant fossil never

actually existed: it has been missing from the American Museum of Natural

History for over a hundred years. Perhaps the original measurement was taken

on a conventionally-sized bone and the center two digits accidentally transposed,

reducing the vertebra down to 1050 millimeters.

Outlier elements are often created by data entry mistakes, as apparently was

the case here. They can also result from errors in scraping, say an irregularity

in formatting causing a footnote number to be interpreted as a numerical value.

Just because something is written down doesn’t make it correct. As with the

dinosaur example, a single outlier element can lead to major misinterpretations.


General sanity checking requires looking at the largest and smallest values

in each variable/column to see whether they are too far out of line. This can

best be done by plotting the frequency histogram and looking at the location of

the extreme elements. Visual inspection can also conﬁrm that the distribution

looks the way it should, typically bell-shaped.

In normally distributed data, the probability that a value is k standard

deviations from the mean decreases exponentially with k. This explains why there

are no 10-foot basketball players, and provides a sound threshold to identify

outliers. Outliers are harder to detect in power law distributions: there really

is a Bill Gates worth over 10,000 times as much as the average individual.
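A sketch of the standard-deviation test on a toy lifespan column follows. One subtlety the sketch exposes: a huge outlier inflates the standard deviation itself, so with small samples the threshold k must be modest, and robust (median-based) statistics are often preferable:

```python
import statistics

def flag_outliers(values, k=2.0):
    """Flag values more than k standard deviations from the mean.
    Note that a huge outlier inflates the standard deviation itself,
    so a large k can mask exactly the point we want to catch."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) > k * sigma]

lifespans = [72, 65, 81, 59, 77, 70, 68, 969]   # Methuselah sneaks in
print(flag_outliers(lifespans))  # [969]
```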

It is too simple to just delete the rows containing outlier ﬁelds and move

on. Outliers often point to more systematic problems that one must deal with.

Consider a data set of historical ﬁgures by lifespan. It is easy to ﬁnger the

biblical Methuselah (at 969 years) as an outlier, and remove him.

But it is better to ﬁgure out whether he is indicative of other ﬁgures that we

should consider removing. Observe that Methuselah had no ﬁrmly established

birth and death dates. Perhaps the published ages of anybody without dates

should be considered suspicious enough to prune. By contrast, the person with

the shortest lifespan in Wikipedia (John I, King of France) lived only ﬁve days.

But his birth (November 15) and death (November 20) dates in 1316 convince

me that his lifespan was accurate.

3.4 War Story: Beating the Market

Every time we met, my graduate student Wenbin told me we were making

money. But he sounded less and less conﬁdent every time I asked.

Our Lydia sentiment analysis system took in massive text feeds of news and

social media, reducing them to daily time series of frequency and sentiment

for the millions of diﬀerent people, places, and organizations mentioned within.

When somebody wins a sports championship, many articles get written describ-

ing how great an athlete they are. But when this player then gets busted on drug

charges, the tone of the articles about them immediately changes. By keeping

count of the relative frequency of association with positive words (“victorious”)

vs. negative words (“arrested”) in the text stream, we can construct sentiment

signals for any news-worthy entity.
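The core of such a signal can be sketched in a few lines; the word lists below are tiny stand-ins for the large sentiment lexicons a system like Lydia actually uses:

```python
# Illustrative word lists -- real lexicons contain thousands of entries.
POSITIVE = {"victorious", "great", "champion"}
NEGATIVE = {"arrested", "busted", "scandal"}

def sentiment(text: str) -> float:
    """Net sentiment: (pos - neg) / (pos + neg) over word counts."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / max(pos + neg, 1)

print(sentiment("victorious champion has great season"))  # 1.0
print(sentiment("star arrested in doping scandal"))       # -1.0
```

Aggregating these scores over each day's articles about an entity yields the daily sentiment time series described above.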

Wenbin studied how sentiment signals could be used to predict future events

like the gross for a given movie, in response to the quality of published reviews

or buzz. But he particularly wanted to use this data to play the stock market.

Stocks move up and down according to news. A missed earnings report is bad

news for a company, so the price goes down. Food and Drug Administration

(FDA) approval of a new drug is great news for the company which owns it, so

the price goes up. If Wenbin could use our sentiment signal to predict future

stock prices, well, let’s just say I wouldn’t have to pay him as a research assistant

anymore.

So he simulated a strategy of buying the stocks that showed the highest


sentiment in that day’s news, and then shorting those with the lowest sentiment.

He got great results. “See,” he said. “We are making money.”

The numbers looked great, but I had one quibble. Using today’s news re-

sults to predict current price movements wasn’t really fair, because the event

described in the article may have already moved the price before we had any

chance to read about it. Stock prices should react very quickly to important

news.

So Wenbin simulated the strategy of buying stocks based on sentiment from

the previous day’s news, to create a gap between the observed news and price

changes. The return rate went down substantially, but was still positive. “See,”

he said. “We are still making money.”

But I remained a little uncomfortable with this. Many economists believe

that the ﬁnancial markets are eﬃcient, meaning that all public news is instantly

reﬂected in changing prices. Prices certainly changed in response to news, but

you would not be able to get in fast enough to exploit the information. We had

to remain skeptical enough to make sure there were no data/timing problems

that could explain our results.

So I pressed Wenbin about exactly how he had performed his simulation. His

strategy bought and sold at the closing price every day. But that left sixteen

hours until the next day’s open, plenty of time for the world to react to events

that happened while I slept. He switched his simulated purchase to the opening

price. Again, the return rate went down substantially, but was still positive.

“See,” he said. “We are still making some money.”

But might there still be other artifacts in how we timed our data, giving

us essentially tomorrow’s newspaper today? In good faith, we chased down all

other possibilities we could think of, such as whether the published article dates

reﬂected when they appeared instead of when they were written. After doing

our best to be skeptical, his strategies still seemed to show positive returns from

news sentiment.

Our paper on this analysis [ZS10] has been well received, and Wenbin has

gone on to be a successful quant, using sentiment among other signals to trade

in the ﬁnancial markets. But I remain slightly queasy about this result. Clean-

ing our data to precisely time-stamp each news article was very diﬃcult to do

correctly. Our system was originally designed to produce daily time series in a

batch mode, so it is hard to be sure that we did everything right in the millions

of articles downloaded over several years to now perform ﬁner-scale analysis.

The take-home lesson is that cleanliness is important when there is money

on the line. Further, it is better to design a clean environment at the beginning

of analysis instead of furiously washing up at the end.

3.5 Crowdsourcing

No single person has all the answers. Not even me. Much of what passes for

wisdom is how we aggregate expertise, assembling opinions from the knowledge

and experience of others.