
BIG DATA AND
SOCIAL SCIENCE
A Practical Guide to Methods and Tools
Statistics in the Social and Behavioral Sciences Series
Aims and scope
Large and complex datasets are becoming prevalent in the social and behavioral
sciences and statistical methods are crucial for the analysis and interpretation of such
data. This series aims to capture new developments in statistical methodology with
particular relevance to applications in the social and behavioral sciences. It seeks to
promote appropriate use of statistical, econometric and psychometric methods in
these applied sciences by publishing a broad range of reference works, textbooks and
handbooks.
The scope of the series is wide, including applications of statistical methodology in
sociology, psychology, economics, education, marketing research, political science,
criminology, public policy, demography, survey methodology and official statistics. The
titles included in the series are designed to appeal to applied statisticians, as well as
students, researchers and practitioners from the above disciplines. The inclusion of real
examples and case studies is therefore essential.
Chapman & Hall/CRC
Series Editors
Jeff Gill
Washington University, USA
Steven Heeringa
University of Michigan, USA
Wim J. van der Linden
Pacific Metrics, USA
J. Scott Long
Indiana University, USA
Tom Snijders
Oxford University, UK
University of Groningen, NL
Published Titles
Analyzing Spatial Models of Choice and Judgment with R
David A. Armstrong II, Ryan Bakker, Royce Carroll, Christopher Hare, Keith T. Poole, and Howard Rosenthal
Analysis of Multivariate Social Science Data, Second Edition
David J. Bartholomew, Fiona Steele, Irini Moustaki, and Jane I. Galbraith
Latent Markov Models for Longitudinal Data
Francesco Bartolucci, Alessio Farcomeni, and Fulvia Pennoni
Statistical Test Theory for the Behavioral Sciences
Dato N. M. de Gruijter and Leo J. Th. van der Kamp
Multivariable Modeling and Multivariate Analysis for the Behavioral Sciences
Brian S. Everitt
Multilevel Modeling Using R
W. Holmes Finch, Jocelyn E. Bolin, and Ken Kelley
Big Data and Social Science: A Practical Guide to Methods and Tools
Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter, and Julia Lane
Ordered Regression Models: Parallel, Partial, and Non-Parallel Alternatives
Andrew S. Fullerton and Jun Xu
Bayesian Methods: A Social and Behavioral Sciences Approach, Third Edition
Jeff Gill
Multiple Correspondence Analysis and Related Methods
Michael Greenacre and Jörg Blasius
Applied Survey Data Analysis
Steven G. Heeringa, Brady T. West, and Patricia A. Berglund
Informative Hypotheses: Theory and Practice for Behavioral and Social Scientists
Herbert Hoijtink
Generalized Structured Component Analysis: A Component-Based Approach to Structural Equation Modeling
Heungsun Hwang and Yoshio Takane
Bayesian Psychometric Modeling
Roy Levy and Robert J. Mislevy
Statistical Studies of Income, Poverty and Inequality in Europe: Computing and Graphics in R Using EU-SILC
Nicholas T. Longford
Foundations of Factor Analysis, Second Edition
Stanley A. Mulaik
Linear Causal Modeling with Structural Equations
Stanley A. Mulaik
Age–Period–Cohort Models: Approaches and Analyses with Aggregate Data
Robert M. O’Brien
Handbook of International Large-Scale Assessment: Background, Technical Issues, and Methods of Data
Analysis
Leslie Rutkowski, Matthias von Davier, and David Rutkowski
Generalized Linear Models for Categorical and Continuous Limited Dependent Variables
Michael Smithson and Edgar C. Merkle
Incomplete Categorical Data Design: Non-Randomized Response Techniques for Sensitive Questions in
Surveys
Guo-Liang Tian and Man-Lai Tang
Handbook of Item Response Theory, Volume 1: Models
Wim J. van der Linden
Handbook of Item Response Theory, Volume 2: Statistical Tools
Wim J. van der Linden
Handbook of Item Response Theory, Volume 3: Applications
Wim J. van der Linden
Computerized Multistage Testing: Theory and Applications
Duanli Yan, Alina A. von Davier, and Charles Lewis
Statistics in the Social and Behavioral Sciences Series
Chapman & Hall/CRC
BIG DATA AND
SOCIAL SCIENCE
A Practical Guide to Methods and Tools
Edited by
Ian Foster
University of Chicago
Argonne National Laboratory
Rayid Ghani
University of Chicago
Ron S. Jarmin
U.S. Census Bureau
Frauke Kreuter
University of Maryland
University of Mannheim
Institute for Employment Research
Julia Lane
New York University
American Institutes for Research
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
Version Date: 20160414
International Standard Book Number-13: 978-1-4987-5140-7 (Hardback)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but
the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to
trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained.
If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical,
or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without
written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright
Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a
variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to
infringe.
Library of Congress Cataloging‑in‑Publication Data
Names: Foster, Ian, 1959- editor.
Title: Big data and social science : a practical guide to methods and tools /
edited by Ian Foster, University of Chicago, Illinois, USA, Rayid Ghani,
University of Chicago, Illinois, USA, Ron S. Jarmin, U.S. Census Bureau,
USA, Frauke Kreuter, University of Maryland, USA, Julia Lane, New York
University, USA.
Description: Boca Raton, FL : CRC Press, [2017] | Series: Chapman & Hall/CRC
statistics in the social and behavioral sciences series | Includes
bibliographical references and index.
Identifiers: LCCN 2016010317 | ISBN 9781498751407 (alk. paper)
Subjects: LCSH: Social sciences--Data processing. | Social
sciences--Statistical methods. | Data mining. | Big data.
Classification: LCC H61.3 .B55 2017 | DDC 300.285/6312--dc23
LC record available at https://lccn.loc.gov/2016010317
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents

Preface xiii
Editors xv
Contributors xix

1 Introduction 1
   1.1 Why this book? 1
   1.2 Defining big data and its value 3
   1.3 Social science, inference, and big data 4
   1.4 Social science, data quality, and big data 7
   1.5 New tools for new data 9
   1.6 The book’s “use case” 10
   1.7 The structure of the book 13
      1.7.1 Part I: Capture and curation 13
      1.7.2 Part II: Modeling and analysis 15
      1.7.3 Part III: Inference and ethics 16
   1.8 Resources 17

I Capture and Curation 21

2 Working with Web Data and APIs 23
Cameron Neylon
   2.1 Introduction 23
   2.2 Scraping information from the web 24
      2.2.1 Obtaining data from the HHMI website 24
      2.2.2 Limits of scraping 30
   2.3 New data in the research enterprise 31
   2.4 A functional view 37
      2.4.1 Relevant APIs and resources 38
      2.4.2 RESTful APIs, returned data, and Python wrappers 38
   2.5 Programming against an API 41
   2.6 Using the ORCID API via a wrapper 42
   2.7 Quality, scope, and management 44
   2.8 Integrating data from multiple sources 46
      2.8.1 The Lagotto API 46
      2.8.2 Working with a corpus 52
   2.9 Working with the graph of relationships 58
      2.9.1 Citation links between articles 58
      2.9.2 Categories, sources, and connections 60
      2.9.3 Data availability and completeness 61
      2.9.4 The value of sparse dynamic data 62
   2.10 Bringing it together: Tracking pathways to impact 65
      2.10.1 Network analysis approaches 66
      2.10.2 Future prospects and new data sources 66
   2.11 Summary 67
   2.12 Resources 69
   2.13 Acknowledgements and copyright 70

3 Record Linkage 71
Joshua Tokle and Stefan Bender
   3.1 Motivation 71
   3.2 Introduction to record linkage 72
   3.3 Preprocessing data for record linkage 76
   3.4 Indexing and blocking 78
   3.5 Matching 80
      3.5.1 Rule-based approaches 82
      3.5.2 Probabilistic record linkage 83
      3.5.3 Machine learning approaches to linking 85
      3.5.4 Disambiguating networks 88
   3.6 Classification 88
      3.6.1 Thresholds 89
      3.6.2 One-to-one links 90
   3.7 Record linkage and data protection 91
   3.8 Summary 92
   3.9 Resources 92

4 Databases 93
Ian Foster and Pascal Heus
   4.1 Introduction 93
   4.2 DBMS: When and why 94
   4.3 Relational DBMSs 100
      4.3.1 Structured Query Language (SQL) 102
      4.3.2 Manipulating and querying data 102
      4.3.3 Schema design and definition 105
      4.3.4 Loading data 107
      4.3.5 Transactions and crash recovery 108
      4.3.6 Database optimizations 109
      4.3.7 Caveats and challenges 112
   4.4 Linking DBMSs and other tools 113
   4.5 NoSQL databases 116
      4.5.1 Challenges of scale: The CAP theorem 116
      4.5.2 NoSQL and key–value stores 117
      4.5.3 Other NoSQL databases 119
   4.6 Spatial databases 120
   4.7 Which database to use? 122
      4.7.1 Relational DBMSs 122
      4.7.2 NoSQL DBMSs 123
   4.8 Summary 123
   4.9 Resources 124

5 Programming with Big Data 125
Huy Vo and Claudio Silva
   5.1 Introduction 125
   5.2 The MapReduce programming model 127
   5.3 Apache Hadoop MapReduce 129
      5.3.1 The Hadoop Distributed File System 130
      5.3.2 Hadoop: Bringing compute to the data 131
      5.3.3 Hardware provisioning 134
      5.3.4 Programming language support 136
      5.3.5 Fault tolerance 137
      5.3.6 Limitations of Hadoop 137
   5.4 Apache Spark 138
   5.5 Summary 141
   5.6 Resources 143

II Modeling and Analysis 145

6 Machine Learning 147
Rayid Ghani and Malte Schierholz
   6.1 Introduction 147
   6.2 What is machine learning? 148
   6.3 The machine learning process 150
   6.4 Problem formulation: Mapping a problem to machine learning methods 151
   6.5 Methods 153
      6.5.1 Unsupervised learning methods 153
      6.5.2 Supervised learning 161
   6.6 Evaluation 173
      6.6.1 Methodology 173
      6.6.2 Metrics 176
   6.7 Practical tips 180
      6.7.1 Features 180
      6.7.2 Machine learning pipeline 181
      6.7.3 Multiclass problems 181
      6.7.4 Skewed or imbalanced classification problems 182
   6.8 How can social scientists benefit from machine learning? 183
   6.9 Advanced topics 185
   6.10 Summary 185
   6.11 Resources 186

7 Text Analysis 187
Evgeny Klochikhin and Jordan Boyd-Graber
   7.1 Understanding what people write 187
   7.2 How to analyze text 189
      7.2.1 Processing text data 190
      7.2.2 How much is a word worth? 192
   7.3 Approaches and applications 193
      7.3.1 Topic modeling 193
         7.3.1.1 Inferring topics from raw text 194
         7.3.1.2 Applications of topic models 197
      7.3.2 Information retrieval and clustering 198
      7.3.3 Other approaches 205
   7.4 Evaluation 208
   7.5 Text analysis tools 210
   7.6 Summary 212
   7.7 Resources 213

8 Networks: The Basics 215
Jason Owen-Smith
   8.1 Introduction 215
   8.2 Network data 218
      8.2.1 Forms of network data 218
      8.2.2 Inducing one-mode networks from two-mode data 220
   8.3 Network measures 224
      8.3.1 Reachability 224
      8.3.2 Whole-network measures 225
   8.4 Comparing collaboration networks 234
   8.5 Summary 238
   8.6 Resources 239

III Inference and Ethics 241

9 Information Visualization 243
M. Adil Yalçın and Catherine Plaisant
   9.1 Introduction 243
   9.2 Developing effective visualizations 244
   9.3 A data-by-tasks taxonomy 249
      9.3.1 Multivariate data 249
      9.3.2 Spatial data 251
      9.3.3 Temporal data 252
      9.3.4 Hierarchical data 255
      9.3.5 Network data 257
      9.3.6 Text data 259
   9.4 Challenges 259
      9.4.1 Scalability 260
      9.4.2 Evaluation 261
      9.4.3 Visual impairment 261
      9.4.4 Visual literacy 262
   9.5 Summary 262
   9.6 Resources 263

10 Errors and Inference 265
Paul P. Biemer
   10.1 Introduction 265
   10.2 The total error paradigm 266
      10.2.1 The traditional model 266
      10.2.2 Extending the framework to big data 273
   10.3 Illustrations of errors in big data 275
   10.4 Errors in big data analytics 277
      10.4.1 Errors resulting from volume, velocity, and variety, assuming perfect veracity 277
      10.4.2 Errors resulting from lack of veracity 279
         10.4.2.1 Variable and correlated error 280
         10.4.2.2 Models for categorical data 282
         10.4.2.3 Misclassification and rare classes 283
         10.4.2.4 Correlation analysis 284
         10.4.2.5 Regression analysis 288
   10.5 Some methods for mitigating, detecting, and compensating for errors 290
   10.6 Summary 295
   10.7 Resources 296

11 Privacy and Confidentiality 299
Stefan Bender, Ron Jarmin, Frauke Kreuter, and Julia Lane
   11.1 Introduction 299
   11.2 Why is access important? 303
   11.3 Providing access 305
   11.4 The new challenges 306
   11.5 Legal and ethical framework 308
   11.6 Summary 310
   11.7 Resources 311

12 Workbooks 313
Jonathan Scott Morgan, Christina Jones, and Ahmad Emad
   12.1 Introduction 313
   12.2 Environment 314
      12.2.1 Running workbooks locally 314
      12.2.2 Central workbook server 315
   12.3 Workbook details 315
      12.3.1 Social Media and APIs 315
      12.3.2 Database basics 316
      12.3.3 Data Linkage 316
      12.3.4 Machine Learning 317
      12.3.5 Text Analysis 317
      12.3.6 Networks 318
      12.3.7 Visualization 318
   12.4 Resources 319

Bibliography 321
Index 349
Preface
The class on which this book is based was created in response to a
very real challenge: how to introduce new ideas and methodologies
about economic and social measurement into a workplace focused
on producing high-quality statistics. We are deeply grateful for the
inspiration and support of Census Bureau Director John Thompson
and Deputy Director Nancy Potok in designing and implementing
the class content and structure.
As with any book, there are many people to be thanked. We
are grateful to Christina Jones, Ahmad Emad, Josh Tokle from
the American Institutes for Research, and Jonathan Morgan from
Michigan State University, who, together with Alan Marco and Julie
Caruso from the US Patent and Trademark Office, Theresa Leslie
from the Census Bureau, Brigitte Raumann from the University of
Chicago, and Lisa Jaso from Summit Consulting, actually made the
class happen.
We are also grateful to the students of three “Big Data for Fed-
eral Statistics” classes in which we piloted this material, and to the
instructors and speakers beyond those who contributed as authors
to this edited volume—Dan Black, Nick Collier, Ophir Frieder, Lee
Giles, Bob Goerge, Laurel Haak, Madian Khabsa, Jonathan Ozik,
Ben Shneiderman, and Abe Usher. The book would not exist with-
out them.
We thank Trent Buskirk, Davon Clarke, Chase Coleman, Ste-
phanie Eckman, Matt Gee, Laurel Haak, Jen Helsby, Madian Khabsa,
Ulrich Kohler, Charlotte Oslund, Rod Little, Arnaud Sahuguet, Tim
Savage, Severin Thaler, and Joe Walsh for their helpful comments
on drafts of this material.
We also owe a great debt to the copyeditor, Richard Leigh; the
project editor, Charlotte Byrnes; and the publisher, Rob Calver, for
their hard work and dedication.
Editors
Ian Foster is a Professor of Computer Science at the University of
Chicago and a Senior Scientist and Distinguished Fellow at Argonne
National Laboratory.
Ian has a long record of research contributions in high-perfor-
mance computing, distributed systems, and data-driven discovery.
He has also led US and international projects that have produced
widely used software systems and scientific computing infrastruc-
tures. He has published hundreds of scientific papers and six books
on these and other topics. Ian is an elected fellow of the Amer-
ican Association for the Advancement of Science, the Association
for Computing Machinery, and the British Computer Society. His
awards include the British Computer Society’s Lovelace Medal and
the IEEE Tsutomu Kanai award.
Rayid Ghani is the Director of the Center for Data Science and
Public Policy and a Senior Fellow at the Harris School of Public
Policy and the Computation Institute at the University of Chicago.
Rayid is a reformed computer scientist and wannabe social scientist,
but mostly just wants to increase the use of data-driven approaches
in solving large public policy and social challenges. He is also pas-
sionate about teaching practical data science and started the Eric
and Wendy Schmidt Data Science for Social Good Fellowship at the
University of Chicago that trains computer scientists, statisticians,
and social scientists from around the world to work on data science
problems with social impact.
Before joining the University of Chicago, Rayid was the Chief
Scientist of the Obama 2012 Election Campaign, where he focused
on data, analytics, and technology to target and influence voters,
donors, and volunteers. Previously, he was a Research Scientist
and led the Machine Learning group at Accenture Labs. Rayid did
his graduate work in machine learning at Carnegie Mellon Univer-
sity and is actively involved in organizing data science related con-
ferences and workshops. In his ample free time, Rayid works with
non-profits and government agencies to help them with their data,
analytics, and digital efforts and strategy.
Ron S. Jarmin is the Assistant Director for Research and Method-
ology at the US Census Bureau. He formerly was the Bureau’s Chief
Economist and Chief of the Center for Economic Studies and a Re-
search Economist. He holds a PhD in economics from the University
of Oregon and has published papers in the areas of industrial or-
ganization, business dynamics, entrepreneurship, technology and
firm performance, urban economics, data access, and statistical
disclosure avoidance. He oversees a broad research program in
statistics, survey methodology, and economics to improve economic
and social measurement within the federal statistical system.
Frauke Kreuter is a Professor in the Joint Program in Survey Meth-
odology at the University of Maryland, Professor of Methods and
Statistics at the University of Mannheim, and head of the statistical
methods group at the German Institute for Employment Research
in Nuremberg. Previously she held positions in the Department of
Statistics at the University of California Los Angeles (UCLA), and
the Department of Statistics at the Ludwig-Maximilian’s University
of Munich. Frauke serves on several advisory boards for National
Statistical Institutes around the world and within the Federal Sta-
tistical System in the United States. She recently served as the
co-chair of the Big Data Task Force of the American Association
for Public Opinion Research. She is a Gertrude Cox Award win-
ner, recognizing statisticians in early- to mid-career who have made
significant breakthroughs in statistical practice, and an elected fel-
low of the American Statistical Association. Her textbooks on Data
Analysis Using Stata and Practical Tools for Designing and Weighting
Survey Samples are used at universities worldwide, including Har-
vard University, Johns Hopkins University, Massachusetts Insti-
tute of Technology, Princeton University, and the University College
London. Her Massive Open Online Course in Questionnaire De-
sign attracted over 70,000 learners within the first year. Recently
Frauke launched the international long-distance professional edu-
cation program sponsored by the German Federal Ministry of Edu-
cation and Research in Survey and Data Science.
Julia Lane is a Professor at the New York University Wagner Grad-
uate School of Public Service and at the NYU Center for Urban Sci-
ence and Progress, and she is a NYU Provostial Fellow for Innovation
Analytics.
Julia has led many initiatives, including co-founding the UMET-
RICS and STAR METRICS programs at the National Science Foun-
dation. She conceptualized and established a data enclave at
NORC/University of Chicago. She also co-founded the creation and
permanent establishment of the Longitudinal Employer-Household
Dynamics Program at the US Census Bureau and the Linked Em-
ployer Employee Database at Statistics New Zealand. Julia has
published over 70 articles in leading journals, including Nature and
Science, and authored or edited ten books. She is an elected fellow
of the American Association for the Advancement of Science and a
fellow of the American Statistical Association.
Contributors
Stefan Bender
Deutsche Bundesbank
Frankfurt, Germany
Paul P. Biemer
RTI International
Raleigh, NC, USA
University of North Carolina
Chapel Hill, NC, USA
Jordan Boyd-Graber
University of Colorado
Boulder, CO, USA
Ahmad Emad
American Institutes for Research
Washington, DC, USA
Pascal Heus
Metadata Technology North America
Knoxville, TN, USA
Christina Jones
American Institutes for Research
Washington, DC, USA
Evgeny Klochikhin
American Institutes for Research
Washington, DC, USA
Jonathan Scott Morgan
Michigan State University
East Lansing, MI, USA
Cameron Neylon
Curtin University
Perth, Australia
Jason Owen-Smith
University of Michigan
Ann Arbor, MI, USA
Catherine Plaisant
University of Maryland
College Park, MD, USA
Malte Schierholz
University of Mannheim
Mannheim, Germany
Claudio Silva
New York University
New York, NY, USA
Joshua Tokle
Amazon
Seattle, WA, USA
Huy Vo
City University of New York
New York, NY, USA
M. Adil Yalçın
University of Maryland
College Park, MD, USA
Chapter 1
Introduction
This section provides a brief overview of the goals and structure of
the book.
1.1 Why this book?
The world has changed for empirical social scientists. The new types
of “big data” have generated an entire new research field—that of
data science. That world is dominated by computer scientists who
have generated new ways of creating and collecting data, developed
new analytical and statistical techniques, and provided new ways of
visualizing and presenting information. These new sources of data
and techniques have the potential to transform the way applied
social science is done.
Research has certainly changed. Researchers draw on data that
are “found” rather than “made” by federal agencies; those publish-
ing in leading academic journals are much less likely today to draw
on preprocessed survey data (Figure 1.1).
The way in which data are used has also changed for both gov-
ernment agencies and businesses. Chief data officers are becoming
as common in federal and state governments as chief economists
were decades ago, and in cities like New York and Chicago, mayoral
offices of data analytics have the ability to provide rapid answers
to important policy questions [233]. But since federal, state, and
local agencies lack the capacity to do such analysis themselves [8],
they must make these data available either to consultants or to the
research community. Businesses are also learning that making ef-
fective use of their data assets can have an impact on their bottom
line [56].
And the jobs have changed. The new job title of “data scien-
tist” is highlighted in job advertisements on CareerBuilder.com and
Burning-glass.com—in the same category as statisticians, economists,
and other quantitative social scientists if starting salaries are useful
indicators.
Figure 1.1. Use of pre-existing survey data in publications in leading journals
(AER, ECMA, JPE, QJE), 1980–2010 [74]. The vertical axis shows “Micro-data Base
Articles using Survey Data (%)”; the horizontal axis shows the year.
Note: “Pre-existing survey” data sets refer to micro surveys such as the CPS or
SIPP and do not include surveys designed by researchers for their study.
Sample excludes studies whose primary data source is from developing countries.
The goal of this book is to provide social scientists with an un-
derstanding of the key elements of this new science, its value, and
the opportunities for doing better work. The goal is also to identify
the many ways in which the analytical toolkits possessed by social
scientists can be brought to bear to enhance the generalizability of
the work done by computer scientists.
We take a pragmatic approach, drawing on our experience of
working with data. Most social scientists set out to solve a real-
world social or economic problem: they frame the problem, identify
the data, do the analysis, and then draw inferences. At all points,
of course, the social scientist needs to consider the ethical ramifi-
cations of their work, particularly respecting privacy and confiden-
tiality. The book follows the same structure. We chose a particular
problem—the link between research investments and innovation—
because that is a major social science policy issue, and one that
social scientists have been addressing using big data techniques.
While the example is specific and intended to show how abstract
concepts apply in practice, the approach is completely generaliz-
able. The web scraping, linkage, classification, and text analysis
methods on display here are canonical in nature. The inference
and privacy and confidentiality issues are no different than in any
other study involving human subjects, and the communication of
results through visualization is similarly generalizable.
1.2 Defining big data and its value
There are almost as many definitions of big data as there are new
types of data. One approach is to define big data as anything too big
to fit onto your computer (a topic discussed in more detail in Chapter 5).
Another approach is to define it as data
with high volume, high velocity, and great variety. We choose the
description adopted by the American Association of Public Opinion
Research: “The term ‘Big Data’ is an imprecise description of a
rich and complicated set of characteristics, practices, techniques,
ethical issues, and outcomes all associated with data” [188].
The value of the new types of data for social science is quite
substantial. Personal data has been hailed as the “new oil” of the
twenty-first century, and the benefits to policy, society, and public
opinion research are undeniable [139]. Policymakers have found
that detailed data on human beings can be used to reduce crime,
improve health delivery, and manage cities better [205]. The scope
is broad indeed: one of this book’s editors has used such data to
not only help win political campaigns but also show its potential
for public policy. Society can gain as well—recent work shows data-
driven businesses were 5% more productive and 6% more profitable
than their competitors [56]. In short, the vision is that social sci-
ence researchers can potentially, by using data with high velocity,
variety, and volume, increase the scope of their data collection ef-
forts while at the same time reducing costs and respondent burden,
increasing timeliness, and increasing precision [265].
Example: New data enable new analyses
ShotSpotter data, which have fairly detailed information for each gunfire incident,
such as the precise timestamp and the nearest address, as well as the type of
shot, can be used to improve crime data [63]; Twitter data can be used to improve
predictions around job loss, job gain, and job postings [17]; and eBay postings can
be used to estimate demand elasticities [104].
But most interestingly, the new data can change the way we
think about measuring and making inferences about behavior. For
example, it enables the capture of information on the subject’s en-
tire environment—thus, for example, the effect of fast food caloric
labeling in health interventions [105]; the productivity of a cashier
if he is within eyesight of a highly productive cashier but not oth-
erwise [252]. So it offers the potential to understand the effects of
complex environmental inputs on human behavior. In addition, big
data, by its very nature, enables us to study the tails of a distribu-
tion in a way that is not possible with small data. Much of interest
in human behavior is driven by the tails of the distribution—health
care costs by small numbers of ill people [356], economic activity
and employment by a small number of firms [93,109]—and is impos-
sible to study with the small sample sizes available to researchers.
Instead we are still faced with the same challenges and respon-
sibilities as we were before in the survey and small data collection
environment. Indeed, social scientists have a great deal to offer to a
(data) world that is currently looking to computer scientists to pro-
vide answers. Two major areas to which social scientists can con-
tribute, based on decades of experience and work with end users,
are inference and attention to data quality.
1.3 Social science, inference, and big data
The goal of empirical social science is to make inferences about a
population from available data. That requirement exists regardless
of the data source—and is a guiding principle for this book. For
probability-based survey data, methodology has been developed to
overcome problems in the data generating process. A guiding prin-
ciple for survey methodologists is the total survey error framework,
and statistical methods for weighting, calibration, and other forms
of adjustment are commonly used to mitigate errors in the survey
process. Likewise for “broken” experimental data, techniques like
propensity score adjustment and principal stratification are widely
used to fix flaws in the data generating process. Two books provide
frameworks for survey quality [35, 143].
(This topic is discussed in more detail in Chapter 10.)
Across the social sciences, including economics, public policy,
sociology, management, (parts of) psychology and the like, we can
identify three categories of analysis with three different inferential
goals: description, causation, and prediction.
Description The job of many social scientists is to provide descrip-
tive statements about the population of interest. These could be
univariate, bivariate, or even multivariate statements. Chapter 6
on machine learning will cover methods that go beyond simple de-
scriptive statistics, known as unsupervised learning methods.
Descriptive statistics are usually created based on census data
or sample surveys to generate some summary statistics like a mean,
median, or a graphical distribution to describe the population of in-
terest. In the case of a census, the work ends right there. With
sample surveys the point estimates come with measures of uncer-
tainties (standard errors). The estimation of standard errors has
been worked out for most descriptive statistics and most common
survey designs, even complex ones that include multiple layers of
sampling and disproportional selection probabilities [154, 385].
Example: Descriptive statistics
The US Bureau of Labor Statistics surveys about 60,000 households a month and
from that survey is able to describe national employment and unemployment levels.
For example, in November 2015, total nonfarm payroll employment increased by
211,000, and the unemployment rate was unchanged at 5.0%. Job
gains occurred in construction, professional and technical services, and health
care. Mining and information lost jobs [57].
Proper inference, even for purely descriptive purposes, from a
sample to the population rests usually on knowing that everyone
from the target population had the chance to be included in the
survey, and knowing the selection probability for each element in
the population. The latter does not necessarily need to be known
prior to sampling, but eventually a probability is assigned for each
case. Getting the selection probabilities right is particularly impor-
tant when reporting totals [243]. Unfortunately in practice, samples
that start out as probability samples can suffer from a high rate of
nonresponse. Because the survey designer cannot completely con-
trol which units respond, the set of units that ultimately respond
cannot be considered to be a probability sample [257]. Nevertheless,
starting with a probability sample provides some degree of comfort
that a sample will have limited coverage errors (nonzero probability
of being in the sample), and there are methods for dealing with a
variety of missing data problems [240].
Causation In many cases, social scientists wish to test hypotheses,
often originating in theory, about relationships between phenomena
of interest. Ideally such tests stem from data that allow causal infer-
ence: typically randomized experiments or strong nonexperimental
study designs. When examining the effect of X on Y, knowing how
cases were selected into the sample or data set is much less impor-
tant in the estimation of causal effects than for descriptive studies,
for example, population means. What is important is that all ele-
ments of the inferential population have a chance of being selected
for the treatment [179]. In the debate about probability and non-
probability surveys, this distinction is often overlooked. Medical
researchers have operated with unknown study selection mecha-
nisms for years: for example, randomized trials that enroll only
selected samples.
Example: New data and causal inference
One of the major risks with using big data without thinking about the data source
is the misallocation of resources. Overreliance on, say, Twitter data in targeting re-
sources after hurricanes can lead to the misallocation of resources towards young,
Internet-savvy people with cell phones, and away from elderly or impoverished
neighborhoods [340]. Of course, all data collection approaches have had similar
risks. Bad survey methodology led the Literary Digest to incorrectly call the 1936
election [353]. Inadequate understanding of coverage, incentive and quality issues,
together with the lack of a comparison group, has hampered the use of adminis-
trative records—famously in the case of using administrative records on crime to
make inference about the role of death penalty policy in crime reduction [95].
Of course, in practice it is difficult to ensure that results are
generalizable, and there is always a concern that the treatment
effect on the treated is different than the treatment effect in the
full population of interest [365]. Having unknown study selection
probabilities makes it even more difficult to estimate population
causal effects, but substantial progress is being made [99,261]. As
long as we are able to model the selection process, there is no reason
not to do causal inference from so-called nonprobability data.
Prediction Forecasting or prediction tasks are a little less common
among applied social science researchers as a whole, but are cer-
tainly an important element for users of official statistics—in partic-
ular, in the context of social and economic indicators—as they are generally
for decision-makers in government and business. Here, similar to
the causal inference setting, it is of utmost importance that we do
know the process that generated the data, and we can rule out any
unknown or unobserved systematic selection mechanism.
Example: Learning from the flu
“Five years ago [in 2009], a team of researchers from Google announced a remark-
able achievement in one of the world’s top scientific journals, Nature. Without
needing the results of a single medical check-up, they were nevertheless able to
track the spread of influenza across the US. What’s more, they could do it more
quickly than the Centers for Disease Control and Prevention (CDC). Google’s track-
ing had only a day’s delay, compared with the week or more it took for the CDC
to assemble a picture based on reports from doctors’ surgeries. Google was faster
because it was tracking the outbreak by finding a correlation between what people
searched for online and whether they had flu symptoms. . . .
“Four years after the original Nature paper was published, Nature News had
sad tidings to convey: the latest flu outbreak had claimed an unexpected victim:
Google Flu Trends. After reliably providing a swift and accurate account of flu
outbreaks for several winters, the theory-free, data-rich model had lost its nose for
where flu was going. Google’s model pointed to a severe outbreak but when the
slow-and-steady data from the CDC arrived, they showed that Google’s estimates
of the spread of flu-like illnesses were overstated by almost a factor of two.
“The problem was that Google did not know—could not begin to know—what
linked the search terms with the spread of flu. Google’s engineers weren’t trying to
figure out what caused what. They were merely finding statistical patterns in the
data. They cared about correlation rather than causation” [155].
1.4 Social science, data quality, and big data
Most data in the real world are noisy, inconsistent, and suffer from
missing values, regardless of their source. Even if data collection
is cheap, the costs of creating high-quality data from the source—
cleaning, curating, standardizing, and integrating—are substantial.
(This topic is discussed in more detail in Chapter 3.)
Data quality can be characterized in multiple ways [76]:
•  Accuracy: How accurate are the attribute values in the data?
•  Completeness: Is the data complete?
•  Consistency: How consistent are the values in and between the database(s)?
•  Timeliness: How timely is the data?
•  Accessibility: Are all variables available for analysis?
Social scientists have decades of experience in transforming
messy, noisy, and unstructured data into a well-defined, clearly
structured, and quality-tested data set. Preprocessing is a complex
and time-consuming process because it is “hands-on”—it requires
judgment and cannot be effectively automated. A typical workflow
comprises multiple steps from data definition to parsing and ends
with filtering. It is difficult to overstate the value of preprocessing
for any data analysis, but this is particularly true in big data. Data
need to be parsed, standardized, deduplicated, and normalized.
Parsing is a fundamental step taken regardless of the data source,
and refers to the decomposition of a complex variable into compo-
nents. For example, a freeform address field like “1234 E 56th St”
might be broken down into a street number “1234” and a street
name “E 56th St.” The street name could be broken down further
to extract the cardinal direction “E” and the designation “St.” An-
other example would be a combined full name field that takes the
form of a comma-separated last name, first name, and middle initial
as in “Miller, David A.” Splitting these identifiers into components
permits the creation of more refined variables that can be used in
the matching step.
In the simplest case, the distinct parts of a character field are
delimited. In the name field example, it would be easy to create the
separate fields “Miller” and “David A” by splitting the original field
at the comma. In more complex cases, special code will have to
be written to parse the field. Typical steps in a parsing procedure
include:
1. Splitting fields into tokens (words) on the basis of delimiters,
2. Standardizing tokens by lookup tables and substitution by a
standard form,
3. Categorizing tokens,
4. Identifying a pattern of anchors, tokens, and delimiters,
5. Calling subroutines according to the identified pattern, thereby
mapping tokens to the predefined components.
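To make these steps concrete, here is a minimal Python sketch that applies them to the name and address examples above. The field values, token rules, and function names are illustrative assumptions rather than code from the book; a production parser would need far more robust pattern handling.

```python
# Minimal parsing sketch: decompose a combined name field and a simple
# street address into components. Illustrative only.

def parse_name(raw):
    """Split a 'Last, First Middle' field into its parts."""
    last, _, rest = raw.partition(",")
    tokens = rest.split()                      # tokenize on whitespace
    return {"last": last.strip(),
            "first": tokens[0] if tokens else "",
            "middle": " ".join(tokens[1:])}    # middle name or initial, if any

def parse_address(raw):
    """Split '1234 E 56th St' into number, direction, street, designation."""
    tokens = raw.split()
    number, designation, middle = tokens[0], tokens[-1], tokens[1:-1]
    direction = middle[0] if middle and middle[0] in {"N", "S", "E", "W"} else ""
    street = " ".join(middle[1:] if direction else middle)
    return {"number": number, "direction": direction,
            "street": street, "designation": designation}

print(parse_name("Miller, David A."))   # {'last': 'Miller', 'first': 'David', 'middle': 'A.'}
print(parse_address("1234 E 56th St"))  # {'number': '1234', 'direction': 'E', 'street': '56th', 'designation': 'St'}
```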
Standardization refers to the process of simplifying data by re-
placing variant representations of the same underlying observation
by a default value in order to improve the accuracy of field com-
parisons. For example, “First Street” and “1st St” are two ways of
writing the same street name, but a simple string comparison of
these values will return a poor result. By standardizing fields—and
using the same standardization rules across files!—the number of
true matches that are wrongly classified as nonmatches (i.e., the
number of false nonmatches) can be reduced.
Some common examples of standardization are:
•  Standardization of different spellings of frequently occurring
words: for example, replacing common abbreviations in street
names (Ave, St, etc.) or titles (Ms, Dr, etc.) with a common
form. These kinds of rules are highly country- and language-
specific.
•  General standardization, including converting character fields
to all uppercase and removing punctuation and digits.
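A minimal Python sketch of such standardization rules follows. The lookup table is a small illustrative assumption (real tables are far larger and, as noted, country- and language-specific), and only punctuation is stripped here so that the “1st” example still works.

```python
import string

# Illustrative lookup table mapping variant tokens to a standard form.
STREET_LOOKUP = {"ST": "STREET", "AVE": "AVENUE", "RD": "ROAD", "1ST": "FIRST"}

def standardize_street(value):
    """Uppercase, strip punctuation, and replace known variants with a standard form."""
    value = value.upper().translate(str.maketrans("", "", string.punctuation))
    tokens = [STREET_LOOKUP.get(token, token) for token in value.split()]
    return " ".join(tokens)

print(standardize_street("First Street"))  # FIRST STREET
print(standardize_street("1st St"))        # FIRST STREET -- now an exact match
```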
Deduplication consists of removing redundant records from a
single list, that is, multiple records from the same list that refer to
the same underlying entity. After deduplication, each record in the
first list will have at most one true match in the second list and vice
versa. This simplifies the record linkage process and is necessary if
the goal of record linkage is to find the best set of one-to-one links
(as opposed to a list of all possible links). One can deduplicate a list
by applying the record linkage techniques described in Chapter 3 to
link a file to itself.
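As a toy illustration (the records and comparison key below are invented, and real deduplication would reuse the full record linkage machinery of Chapter 3), a single pass over a list can keep one record per standardized key:

```python
import string

def comparison_key(value):
    """Crude standardized key: uppercase with punctuation removed."""
    return value.upper().translate(str.maketrans("", "", string.punctuation))

records = [
    {"id": 1, "name": "Miller, David", "street": "First Street"},
    {"id": 2, "name": "MILLER DAVID",  "street": "First Street"},   # duplicate of id 1
    {"id": 3, "name": "Smith, Jane",   "street": "Second Avenue"},
]

seen = {}
for record in records:
    key = (comparison_key(record["name"]), comparison_key(record["street"]))
    seen.setdefault(key, record)        # keep the first record seen for each key

deduplicated = list(seen.values())      # records 1 and 3 remain
print(deduplicated)
```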
Normalization is the process of ensuring that the fields that are
being compared across files are as similar as possible in the sense
that they could have been generated by the same process. At min-
imum, the same standardization rules should be applied to both
files. For additional examples, consider a salary field in a survey.
There are a number of different ways that salary could be recorded: it
might be truncated as a privacy-preserving measure or rounded to
the nearest thousand, and missing values could be imputed with
the mean or with zero. During normalization we take note of exactly
how fields are recorded.
1.5 New tools for new data
The new data sources that we have discussed frequently require
working at scales for which the social scientist’s familiar tools are
not designed. Fortunately, the wider research and data analytics
community has developed a wide variety of often more scalable and
flexible tools—tools that we will introduce within this book.
Relational database management systems (DBMSs), a topic discussed
in more detail in Chapter 4, are used throughout business as well as
the sciences to organize, process,
and search large collections of structured data. NoSQL DBMSs are
used for data that is extremely large and/or unstructured, such as
collections of web pages, social media data (e.g., Twitter messages),
and clinical notes. Extensions to these systems and also special-
ized single-purpose DBMSs provide support for data types that are
not easily handled in statistical packages such as geospatial data,
networks, and graphs.
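As a small, hedged illustration of the relational approach, the sketch below uses Python’s built-in sqlite3 module rather than any particular system covered in Chapter 4; the grants table and its rows are invented for the example.

```python
import sqlite3

# Toy relational example: store structured grant records and run a declarative query.
conn = sqlite3.connect(":memory:")            # throwaway in-memory database
conn.execute("""CREATE TABLE grants (
                    grant_id TEXT PRIMARY KEY,
                    agency   TEXT,
                    amount   REAL)""")
conn.executemany("INSERT INTO grants VALUES (?, ?, ?)",
                 [("G-001", "NSF", 250000.0),
                  ("G-002", "NIH", 480000.0),
                  ("G-003", "NSF", 125000.0)])

# Total funding by agency, computed by the database engine.
for agency, total in conn.execute(
        "SELECT agency, SUM(amount) FROM grants GROUP BY agency ORDER BY agency"):
    print(agency, total)
conn.close()
```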
Open source programming systems such as Python (used ex-
tensively throughout this book) and R provide high-quality imple-
mentations of numerous data analysis and visualization methods,
from regression to statistics, text analysis, network analysis, and
much more. Finally, parallel computing systems such as Hadoop
and Spark can be used to harness parallel computer clusters for
extremely large data sets and computationally intensive analyses.
These various components may not always work together as
smoothly as do integrated packages such as SAS, SPSS, and Stata,
but they allow researchers to take on problems of great scale and
complexity. Furthermore, they are developing at a tremendous rate
as the result of work by thousands of people worldwide. For these
reasons, the modern social scientist needs to be familiar with their
characteristics and capabilities.
1.6 The book’s “use case”
This book is about the uses of big data in social science. Our focus
is on working through the use of data as a social scientist normally
approaches research. That involves thinking through how to use
such data to address a question from beginning to end, and thereby
learning about the associated tools—rather than simply engaging in
coding exercises and then thinking about how to apply them to a
potpourri of social science examples.
There are many examples of the use of big data in social science
research, but relatively few that feature all the different aspects
that are covered in this book. As a result, the chapters in the book
draw heavily on a use case based on one of the first large-scale big
data social science data infrastructures. This infrastructure, based
on UMETRICS data (UMETRICS: Universities Measuring the Impact of
Research on Innovation and Science [228]; iris.isr.umich.edu) housed
at the University of Michigan’s Institute for Research on Innovation
and Science (IRIS) and enhanced with data from the US Census Bureau,
provides a new quantitative
analysis and understanding of science policy based on large-scale
computational analysis of new types of data.
The infrastructure was developed in response to a call from the
President’s Science Advisor (Jack Marburger) for a science of science
policy [250]. He wanted a scientific response to the questions that
he was asked about the impact of investments in science.
Example: The Science of Science Policy
Marburger wrote [250]: “How much should a nation spend on science? What
kind of science? How much from private versus public sectors? Does demand for
funding by potential science performers imply a shortage of funding or a surfeit
of performers? These and related science policy questions tend to be asked and
answered today in a highly visible advocacy context that makes assumptions that
are deserving of closer scrutiny. A new ‘science of science policy’ is emerging, and
it may offer more compelling guidance for policy decisions and for more credible
advocacy. . . .
“Relating R&D to innovation in any but a general way is a tall order, but not
a hopeless one. We need econometric models that encompass enough variables
in a sufficient number of countries to produce reasonable simulations of the effect
of specific policy choices. This need won’t be satisfied by a few grants or work-
shops, but demands the attention of a specialist scholarly community. As more
economists and social scientists turn to these issues, the effectiveness of science
policy will grow, and of science advocacy too.”
Responding to this policy imperative is a tall order, because it in-
volves using all the social science and computer science tools avail-
able to researchers. The new digital technologies can be used to
capture the links between the inputs into research, the way in which
those inputs are organized, and the subsequent outputs [396,415].
The social science questions that are addressable with this data in-
frastructure include the effect of research training on the placement
and earnings of doctoral recipients, how university trained scien-
tists and engineers affect the productivity of the firms they work for,
and the return on investments in research. Figure 1.2 provides an
abstract representation of the empirical approach that is needed:
data about grants, the people who are funded on grants, and the
subsequent scientific and economic activities.
First, data must be captured on what is funded, and since the
data are in text format, computational linguistics tools must be
applied (Chapter 7). Second, data must be captured on who is
funded, and how they interact in teams, so network tools and ana-
lysis must be used (Chapter 8). Third, information about the type of
results must be gleaned from the web and other sources (Chapter 2).
Figure 1.2. A visualization of the complex links between what and who is funded, and the results; tracing the direct
link between funding and results is misleading and wrong
Finally, the disparate complex data sets need to be stored in data-
bases (Chapter 4), integrated (Chapter 3), analyzed (Chapter 6), and
used to make inferences (Chapter 10).
The use case serves as the thread that ties many of the ideas
together. Rather than asking the reader to learn how to code “hello
world,” we build on data that have been put together to answer a
real-world question, and provide explicit examples based on that
data. We then provide examples that show how the approach gen-
eralizes.
For example, the text analysis chapter (Chapter 7) shows how
to use natural language processing to describe what research is
being done, using proposal and award text to identify the research
topics in a portfolio [110, 368]. But then it also shows how the
approach can be used to address a problem that is not just limited
to science policy—the conversion of massive amounts of knowledge
that is stored in text to usable information.
Similarly, the network analysis chapter (Chapter 8) gives specific
examples using the UMETRICS data and shows how such data can
be used to create new units of analysis—the networks of researchers
who do science, and the networks of vendors who supply research
inputs. It also shows how networks can be used to study a wide
variety of other social science questions.
In another example, we use APIs (Application Programming Interfaces) provided by publishers to describe the results generated by research funding in terms of publications and other measures of scientific impact, but also provide code that can be repurposed for many similar APIs.
And, of course, since all these new types of data are provided in a variety of different formats, some of which are quite large (volume) and arrive with a variety of different timestamps (velocity), we discuss how to store the data in different types of data stores and formats.
1.7 The structure of the book
We organize the book in three parts, based around the way social
scientists approach doing research. The first set of chapters ad-
dresses the new ways to capture, curate, and store data. The sec-
ond set of chapters describes what tools are available to process and
classify data. The last set deals with analysis and the appropriate
handling of data on individuals and organizations.
1.7.1 Part I: Capture and curation
The four chapters in Part I (see Figure 1.3) tell you how to capture
and manage data.
Chapter 2 describes how to extract information from social me-
dia about the transmission of knowledge. The particular applica-
tion will be to develop links to authors’ articles on Twitter using
PLOS articles and to pull information about authors and articles
from web sources by using an API. You will learn how to retrieve
link data from bookmarking services, citations from Crossref, links
from Facebook, and information from news coverage. In keep-
ing with the social science grounding that is a core feature of the
book, the chapter discusses what data can be captured from online
sources, what is potentially reliable, and how to manage data quality
issues.
Big data differs from survey data in that we must typically com-
bine data from multiple sources to get a complete picture of the
activities of interest. Although computer scientists may sometimes
Figure 1.3. The four chapters of Part I focus on data capture and curation: API and web scraping (Chapter 2: different ways of collecting data), record linkage (Chapter 3: combining different data sets), storing data (Chapter 4: ingest, query, and export data), and processing large data sets (Chapter 5: output, creating innovation measures).
simply “mash” data sets together, social scientists are rightfully
concerned about issues of missing links, duplicative links, and
erroneous links. Chapter 3 provides an overview of traditional
rule-based and probabilistic approaches to data linkage, as well
as the important contributions of machine learning to the linkage
problem.
Once data have been collected and linked into different files, it is necessary to store and organize them. Social scientists are used to
working with one analytical file, often in statistical software tools
such as SAS or Stata. Chapter 4, which may be the most impor-
tant chapter in the book, describes different approaches to stor-
ing data in ways that permit rapid and reliable exploration and
analysis.
Big data is sometimes defined as data that are too big to fit onto
the analyst’s computer. Chapter 5 provides an overview of clever
programming techniques that facilitate the use of data (often using
parallel computing). While the focus is on one of the most widely
used big data programming paradigms and its most popular imple-
mentation, Apache Hadoop, the goal of the chapter is to provide a
conceptual framework to the key challenges that the approach is
designed to address.
Figure 1.4. The three chapters in Part II focus on data modeling and analysis: machine learning (Chapter 6: classifying data in new ways), text analysis (Chapter 7: creating new data from text), and networks (Chapter 8: creating new measures of social and economic activity).
1.7.2 Part II: Modeling and analysis
The three chapters in Part II (see Figure 1.4) introduce three of the
most important tools that can be used by social scientists to do new
and exciting research: machine learning, text analysis, and social
network analysis.
Chapter 6 introduces machine learning methods. It shows the
power of machine learning in a variety of different contexts, par-
ticularly focusing on clustering and classification. You will get an
overview of basic approaches and how those approaches are applied.
The chapter builds from a conceptual framework and then shows
you how the different concepts are translated into code. There is
a particular focus on random forests and support vector machine
(SVM) approaches.
Chapter 7 describes how social scientists can make use of one of
the most exciting advances in big data—text analysis. Vast amounts
of data that are stored in documents can now be analyzed and
searched so that different types of information can be retrieved.
Documents (and the underlying activities of the entities that gener-
ated the documents) can be categorized into topics or fields as well
as summarized. In addition, machine translation can be used to
compare documents in different languages.
Social scientists are typically interested in describing the activi-
ties of individuals and organizations (such as households and firms)
in a variety of economic and social contexts. The frames within
which data are collected have typically been generated from tax or
other programmatic sources. The new types of data permit new
units of analysis—particularly network analysis—largely enabled by
advances in mathematical graph theory. Thus, Chapter 8 describes
how social scientists can use network theory to generate measurable
representations of patterns of relationships connecting entities. As
the author points out, the value of the new framework is not only in
constructing different right-hand-side variables but also in study-
ing an entirely new unit of analysis that lies somewhere between
the largely atomistic actors that occupy the markets of neo-classical
theory and the tightly managed hierarchies that are the traditional
object of inquiry of sociologists and organizational theorists.
1.7.3 Part III: Inference and ethics
The four chapters in Part III (see Figure 1.5) cover three advanced
topics relating to data inference and ethics—information visualiza-
tion, errors and inference, and privacy and confidentiality—and in-
troduce the workbooks that provide access to the practical exercises
associated with the text.
Figure 1.5. The four chapters in Part III focus on inference and ethics: visualization (Chapter 9: making sense of the data), inference (Chapter 10: drawing statistically valid conclusions), privacy and confidentiality (Chapter 11: handling data appropriately), and workbooks (Chapter 12: applying new models and tools).
Chapter 9 introduces information visualization methods and de-
scribes how you can use those methods to explore data and com-
municate results so that data can be turned into interpretable, ac-
tionable information. There are many ways of presenting statis-
tical information that convey content in a rigorous manner. The
goal of this chapter is to explore different approaches and exam-
ine the information content and analytical validity of the different
approaches. It provides an overview of effective visualizations.
Chapter 10 deals with inference and the errors associated with
big data. Social scientists know only too well the cost associated
with bad data—we highlighted the classic Literary Digest example
in the introduction to this chapter, as well as the more recent Google
Flu Trends. Although the consequences are well understood, the
new types of data are so large and complex that their properties
often cannot be studied in traditional ways. In addition, the data
generating function is such that the data are often selective, in-
complete, and erroneous. Without proper data hygiene, errors can
quickly compound. This chapter provides a systematic way to think
about the error framework in a big data setting.
Chapter 11 addresses the issue that sits at the core of any study
of human beings—privacy and confidentiality. In a new field, like the
one covered in this book, it is critical that many researchers have
access to the data so that work can be replicated and built on—that
there be a scientific basis to data science. Yet the rules that social
scientists have traditionally used for survey data, namely anonymity
and informed consent, no longer apply when the data are collected
in the wild. This concluding chapter identifies the issues that must
be addressed for responsible and ethical research to take place.
Finally, Chapter 12 provides an overview of the practical work
that accompanies each chapter—the workbooks that are designed,
using Jupyter notebooks (see jupyter.org), to enable students and interested practitioners to apply the new techniques and approaches in selected
chapters. We hope you have a lot of fun with them.
1.8 Resources
For more information on the science of science policy, see Husbands
et al.’s book for a full discussion of many issues [175] and the online
resources at the eponymous website [352].
This book is above all a practical introduction to the methods and
tools that the social scientist can use to make sense of big data,
and thus programming resources are also important. We make
extensive use of the Python programming language and the MySQL
database management system in both the book and its supporting
workbooks. We recommend that any social scientist who aspires
to work with large data sets become proficient in the use of these
two systems, and also one more, GitHub. All three, fortunately, are
quite accessible and are supported by excellent online resources.
Time spent mastering them will be repaid many times over in more
productive research.
For Python, Alex Bell’s Python for Economists (available online [31]; see http://bit.ly/1VgytVV) provides a wonderful 30-page introduction to the use of Python in the social sciences, complete with XKCD cartoons. Economists
Tom Sargent and John Stachurski provide a very useful set of lec-
tures and examples at http://quant-econ.net/. For more detail, we
recommend Charles Severance’s Python for Informatics: Exploring
Information [338], which not only covers basic Python but also pro-
vides material relevant to web data (the subject of Chapter 2) and
MySQL (the subject of Chapter 4). This book is also freely available
online and is supported by excellent online lectures and exercises.
For MySQL, Chapter 4 provides introductory material and point-
ers to additional resources, so we will not say more here.
We also recommend that you master GitHub. A version control
system is a tool for keeping track of changes that have been made
to a document over time. GitHub is a hosting service for projects
that use the Git version control system. As Strasser explains [363],
Git/GitHub makes it straightforward for researchers to create digi-
tal lab notebooks that record the data files, programs, papers, and
other resources associated with a project, with automatic tracking
of the changes that are made to those resources over time. GitHub
also makes it easy for collaborators to work together on a project,
whether a program or a paper: changes made by each contribu-
tor are recorded and can easily be reconciled. For example, we
used GitHub to create this book, with authors and editors check-
ing in changes and comments at different times and from many time
zones. We also use GitHub to provide access to the supporting work-
books. Ram [314] provides a nice description of how Git/GitHub can
be used to promote reproducibility and transparency in research.
One more resource that is outside the scope of this book but that
you may well want to master is the cloud [21,236]. It used to be that
when your data and computations became too large to analyze on
your laptop, you were out of luck unless your employer (or a friend)
had a larger computer. With the emergence of cloud storage and
computing services from the likes of Amazon Web Services, Google,
and Microsoft, powerful computers are available to anyone with a
credit card. We and many others have had positive experiences using such systems for urban [64], environmental [107], and genomic [32] data analysis and modeling, for example.
Such systems may well represent the future of research computing.
Part I
Capture and Curation
Chapter 2
Working with Web Data and APIs
Cameron Neylon
This chapter will show you how to extract information from social
media about the transmission of knowledge. The particular appli-
cation will be to develop links to authors’ articles on Twitter using
PLOS articles and to pull information using an API. You will get link
data from bookmarking services, citations from Crossref, links from
Facebook, and information from news coverage. The examples that
will be used are from Twitter. In keeping with the social science
grounding that is a core feature of the book, it will discuss what can
be captured, what is potentially reliable, and how to manage data
quality issues.
2.1 Introduction
A tremendous lure of the Internet is the availability of vast amounts
of data on businesses, people, and their activity on social media.
But how can we capture the information and make use of it as we
might make use of more traditional data sources? In this chapter,
we begin by describing how web data can be collected, using the
use case of UMETRICS and research output as a readily available
example, and then discuss how to think about the scope, coverage,
and integration issues associated with its collection.
Often a big data exploration starts with information on people or
on a group of people. The web can be a rich source of additional in-
formation. It can also act as a pointer to new sources of information,
allowing a pivot from one perspective to another, from one kind of
query to another. Often this is exploratory. You have an existing
core set of data and are looking to augment it. But equally this
exploration can open up whole new avenues. Sometimes the data
are completely unstructured, existing as web pages spread across a
site, and sometimes they are provided in a machine-readable form.
The challenge is in having a sufficiently diverse toolkit to bring all
of this information together.
Using the example of data on researchers and research outputs,
we will explore obtaining information directly from web pages (web
scraping) as well as explore the uses of APIs—web services that allow
an interaction with, and retrieval of, structured data. You will see
how the crucial pieces of integration often lie in making connections
between disparate data sets and how in turn making those connec-
tions requires careful quality control. The emphasis throughout
this chapter is on the importance of focusing on the purpose for
which the data will be used as a guide for data collection. While
much of this is specific to data about research and researchers, the
ideas are generalizable to wider issues of data and public policy.
2.2 Scraping information from the web
With the range of information available on the web, our first ques-
tion is how to access it. The simplest approach is often to manually
go directly to the web and look for data files or other information.
For instance, on the NSF website [268] it is possible to obtain data
dumps of all grant information. Sometimes data are available only
on web pages or we only want a subset of this information. In this
case web scraping is often a viable approach.
Web scraping involves using a program to download and process
web pages directly. This can be highly effective, particularly where
tables of information are made available online. It is also useful in
cases where it is desirable to make a series of very similar queries.
In each case we need to look at the website, identify how to get the
information we want, and then process it. Many websites deliber-
ately make this difficult to prevent easy access to their underlying
data.
2.2.1 Obtaining data from the HHMI website
Let us suppose we are interested in obtaining information on those
investigators that are funded by the Howard Hughes Medical Insti-
tute (HHMI). HHMI has a website that includes a search function
for funded researchers, including the ability to filter by field, state,
and role. But there does not appear to be a downloadable data set
of this information. However, we can automate the process with
code to create a data set that you might compare with other data.
This process involves first understanding how to construct a URL
that will do the search we want. This is most easily done by playing
with search functionality and investigating the URL structures that
are returned. Note that in many cases websites are not helpful
here. However, with HHMI if we do a general search and play with
the structure of the URL, we can see some of the elements of the URL
that we can think of as a query. As we want to see all investigators,
we do not need to limit the search, and so with some fiddling we
come up with a URL like the following. (We have broken the one-line
URL into three lines for ease of presentation.)
http://www.hhmi.org/scientists/browse?
kw=&sort_by=field_scientist_last_name&
sort_order=ASC&items_per_page=20&page=0
The requests module, a third-party Python library included in most scientific Python distributions, is a useful set of tools for handling interactions with websites. It lets us construct the request that we just presented in
terms of a base URL and query terms, as follows:
>> BASE_URL = "http://www.hhmi.org/scientists/browse"
>> query = {
"kw" :"",
"sort_by" :"field_scientist_last_name",
"sort_order" :"ASC",
"items_per_page" : 20,
"page" : None
}
With our request constructed we can then make the call to the
web page to get a response.
>> import requests
>> response = requests.get(BASE_URL, params=query)
The first thing to do when building a script that hits a web page
is to make sure that your call was successful. This can be checked
by looking at the response code that the web server sent—and, obvi-
ously, by checking the actual HTML that was returned. A 200 code
means success and that everything should be OK. Other codes may
mean that the URL was constructed wrongly or that there was a
server error.
>> response.status_code
200
With the page successfully returned, we now need to process
the text it contains into the data we want. This is not a trivial
exercise. It is possible to search through and find things, but there
Figure 2.1. Source HTML from the portion of an HHMI results page containing information on HHMI investigators;
note that the scraped HTML is badly formatted and difficult to read.
are a range of tools that can help with processing HTML and XML
data. Among these, one of the most popular is a module called BeautifulSoup [319], which provides a number of useful functions for this kind of processing. (Python features many useful libraries; BeautifulSoup is particularly helpful for web scraping.) The module documentation provides
more details.
We need to check the details of the page source to find where the
information we are looking for is kept (see, for example, Figure 2.1).
Here, all the details on HHMI investigators can be found in a <div>
element with the class attribute view-content. This structure is not
something that can be determined in advance. It requires knowl-
edge of the structure of the page itself. Nested inside this <div>
element are another series of divs, each of which corresponds to
one investigator. These have the class attribute views-row. Again, there is nothing obvious about finding these; it requires a close examination of the page HTML itself for any specific case you happen to be looking at.
We first process the page using the BeautifulSoup module (into
the variable soup) and then find the div element that holds the
information on investigators (investigator_list). As this element
is unique on the page (I checked using my web browser), we can
use the find method. We then process that div (using find_all) to
create an iterator object that contains each of the page segments
detailing a single investigator (investigators).
>> from bs4 import BeautifulSoup
>> soup = BeautifulSoup(response.text, "html5lib")
>> investigator_list = soup.find("div", class_ = "view-content")
>> investigators = investigator_list.find_all("div", class_ = "views-row")
As we specified in our query parameters that we wanted 20 res-
ults per page, we should check whether our list of page sections has
the right length.
>> len(investigators)
20
# Given a request response object, parse for HHMI investigators
def scrape(page_response):
    # Obtain response HTML and the correct <div> from the page
    soup = BeautifulSoup(page_response.text, "html5lib")
    inv_list = soup.find('div', class_ = "view-content")
    # Create a list of all the investigators on the page
    investigators = inv_list.find_all("div", class_ = "views-row")

    data = [] # Make the data object to store scraping results

    # Scrape needed elements from investigator list
    for investigator in investigators:
        inv = {} # Create a dictionary to store results

        # Name and role are in same HTML element; this code
        # separates them into two data elements
        name_role_tag = investigator.find("div",
            class_ = "views-field-field-scientist-classification")
        strings = name_role_tag.stripped_strings
        for string, a in zip(strings, ["name", "role"]):
            inv[a] = string

        # Extract other elements from text of specific divs or from
        # class attributes of tags in the page (e.g., URLs)
        research_tag = investigator.find("div",
            class_ = "views-field-field-scientist-research-abs-nod")
        inv["research"] = research_tag.text.lstrip()
        inv["research_url"] = ("http://hhmi.org"
            + research_tag.find("a").get("href"))

        institution_tag = investigator.find("div",
            class_ = "views-field-field-scientist-academic-institu")
        inv["institute"] = institution_tag.text.lstrip()

        town_state_tag = investigator.find("div",
            class_ = "views-field-field-scientist-institutionstate")
        inv["town"], inv["state"] = town_state_tag.text.split(",")
        inv["town"] = inv.get("town").lstrip()
        inv["state"] = inv.get("state").lstrip()

        thumbnail_tag = investigator.find("div",
            class_ = "views-field-field-scientist-image-thumbnail")
        inv["thumbnail_url"] = thumbnail_tag.find("img")["src"]
        inv["url"] = ("http://hhmi.org"
            + thumbnail_tag.find("a").get("href"))

        # Add the new data to the list
        data.append(inv)
    return data

Listing 2.1. Python code to parse for HHMI investigators
Finally, we need to process each of these segments to obtain the
data we are looking for. This is the actual “scraping” of the page
to get the information we want. Again, this involves looking closely
at the HTML itself, identifying where the information is held, what
tags can be used to find it, and often doing some postprocessing to
clean it up (removing spaces, splitting different elements up).
Listing 2.1 provides a function to handle all of this. The function
accepts the response object from the requests module as its input,
processes the page text to soup, and then finds the investigator_list
as above and processes it into an actual list of the investigators. For
each investigator it then processes the HTML to find and clean up
the information required, converting it to a dictionary and adding it
to our growing list of data.
Let us check what the first two elements of our data set now look
like. You can see two dictionaries, one relating to Laurence Abbott,
who is a senior fellow at the HHMI Janelia Research Campus, and one
for Susan Ackerman, an HHMI investigator based at the Jackson
Laboratory in Bar Harbor, Maine. Note that we have also obtained
URLs that give more details on the researcher and their research
program (research_url and url keys in the dictionary) that could
provide a useful input to textual analysis or topic modeling (see
Chapter 7).
>> data = scrape(response)
>> data[0:2]
[{'institute': u'Janelia Research Campus ',
  'name': u'Laurence Abbott, PhD',
  'research': u'Computational and Mathematical Modeling of Neurons and Neural... ',
  'research_url': u'http://hhmi.org/research/computational-and-mathematical-modeling-neurons-and-neural-networks',
  'role': u'Janelia Senior Fellow',
  'state': u'VA ',
  'thumbnail_url': u'http://www.hhmi.org/sites/default/files/Our%20Scientists/Janelia/Abbott-112x112.jpg',
  'town': u'Ashburn',
  'url': u'http://hhmi.org/scientists/laurence-f-abbott'},
 {'institute': u'The Jackson Laboratory ',
  'name': u'Susan Ackerman, PhD',
  'research': u'Identification of the Molecular Mechanisms Underlying... ',
  'research_url': u'http://hhmi.org/research/identification-molecular-mechanisms-underlying-neurodegeneration',
  'role': u'Investigator',
  'state': u'ME ',
  'thumbnail_url': u'http://www.hhmi.org/sites/default/files/Our%20Scientists/Investigators/Ackerman-112x112.jpg',
  'town': u'Bar Harbor',
  'url': u'http://hhmi.org/scientists/susan-l-ackerman'}]
So now we know we can process a page from a website to generate
usefully structured data. However, this was only the first page of
results. We need to do this for each page of results if we want
to capture all the HHMI investigators. We could just look at the
number of pages that our search returned manually, but to make
this more general we can actually scrape the page to find that piece
of information and use that to calculate how many pages we need
to work through.
The number of results is found in a div with the class “view-header” as a piece of free text (“Showing 1–20 of 493 results”). We
need to grab the text, split it up (I do so based on spaces), find the
right number (the one that is before the word “results”) and convert
that to an integer. Then we can divide by the number of items we
requested per page (20 in our case) to find how many pages we need
to work through. A quick mental calculation confirms that if page
0 had results 1–20, page 24 would give results 481–493.
>> # Check total number of investigators returned
>> view_header = soup.find("div", class_ = "view-header")
>> words = view_header.text.split(" ")
>> count_index = words.index("results.") - 1
>> count = int(words[count_index])
>> # Calculate number of pages, given count & items_per_page
>> num_pages = count/query.get("items_per_page")
>> num_pages
24
Then it is a simple matter of putting the function we constructed
earlier into a loop to work through the correct number of pages. As
we start to hit the website repeatedly, we need to consider whether
we are being polite. Most websites have a file in the root directory
called robots.txt that contains guidance on using programs to inter-
act with the website. In the case of http://hhmi.org the file states
first that we are allowed (or, more properly, not forbidden) to query
http://www.hhmi.org/scientists/ programmatically. Thus, you can
pull down all of the more detailed biographical or research informa-
tion, if you so desire. The file also states that there is a requested
“Crawl-delay” of 10. This means that if you are making repeated
queries (as we will be in getting the 24 pages), you should wait for
10 seconds between each query. This request is easily accommo-
dated by adding a timed delay between each page request.
>> import time
>> for page_num in range(num_pages):
>>     # We already have page zero and we need to go to 24:
>>     # range(24) is [0,1,...,23]
>>     query["page"] = page_num + 1
>>     page = requests.get(BASE_URL, params=query)
>>     # We use extend to add the list for each page to the existing list
>>     data.extend(scrape(page))
>>     print "Retrieved and scraped page number:", query.get("page")
>>     # robots.txt at hhmi.org specifies a crawl delay of 10 seconds
>>     time.sleep(10)
Retrieved and scraped page number: 1
Retrieved and scraped page number: 2
...
Retrieved and scraped page number: 24
Finally we can check that we have the right number of results
after our scraping. This should correspond to the 493 records that
the website reports.
>> len(data)
493
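As an aside, the robots.txt check described above can also be scripted rather than read by hand. The following minimal sketch uses Python's standard robotparser module (called urllib.robotparser in Python 3) to test whether a path may be fetched; note that the Python 2 version of the module does not expose the Crawl-delay directive, so that value still has to be taken from the file itself.
>> import robotparser  # urllib.robotparser in Python 3
>> rp = robotparser.RobotFileParser()
>> rp.set_url("http://www.hhmi.org/robots.txt")
>> rp.read()
>> # True if generic crawlers are not forbidden from this path
>> rp.can_fetch("*", "http://www.hhmi.org/scientists/")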
2.2.2 Limits of scraping
While scraping websites is often necessary, it can be a fragile and
messy way of working. It is problematic for a number of reasons: for
example, many websites are designed in ways that make scraping
difficult or impossible, and other sites explicitly prohibit this kind
of scripted analysis. (Both reasons apply in the case of the NSF and
Grants.gov websites, which is why we use the HHMI website in our
example.)
In many cases a better choice is to process a data dump from
an organization. For example, the NSF and Wellcome Trust both
provide data sets for each year that include structured data on all
their awarded grants. In practice, integrating data is a continual
challenge of figuring out what is the easiest way to proceed, what is
allowed, and what is practical and useful. The selection of data will
often be driven by pragmatic rather than theoretical concerns.
Increasingly, however, good practice is emerging in which orga-
nizations provide APIs to enable scripted and programmatic access
to the data they hold. These tools are much easier and generally
more effective to work with. They are the focus of much of the rest
of this chapter.
2.3 New data in the research enterprise
The new forms of data we are discussing in this chapter are largely
available because so many human activities—in this case, discus-
sion, reading, and bookmarking—are happening online. All sorts of
data are generated as a side effect of these activities. Some of that
data is public (social media conversations), some private (IP ad-
dresses requesting specific pages), and some intrinsic to the service
(the identity of a user who bookmarks an article). What exactly are
these new forms of data? There are broadly two new directions that
data availability is moving in. The first is information on new forms
of research output, data sets, software, and in some cases physical
resources. There is an interest across the research community in
expanding the set of research outputs that are made available and,
to drive this, significant efforts are being made to ensure that these
nontraditional outputs are seen as legitimate outputs. In particular
there has been a substantial policy emphasis on data sharing and,
coupled with this, efforts to standardize practice around data cita-
tion. This is applying a well-established measure (citation) to a new
form of research output.
The second new direction, which is more developed, takes the
alternate route, providing new forms of information on existing types
of output, specifically research articles. The move online of research
activities, including discovery, reading, writing, and bookmarking,
means that many of these activities leave a digital trace. Often
these traces are public or semi-public and can be collected and
tracked. This certainly raises privacy issues that have not been
comprehensively addressed but also provides a rich source of data
on who is doing what with research articles.
There are a wide range of potential data sources, so it is useful
to categorize them. Figure 2.2 shows one possible categorization,
in which data sources are grouped based on the level of engage-
ment and the stage of use. It starts from the left with “views,”
measures of online views and article downloads, followed by “saves”
where readers actively collect articles into a library of their own,
through online discussion forums such as blogs, social media and
new commentary, formal scholarly recommendations, and, finally,
formal citations.
These categories are a useful way to understand the classes of
information available and to start digging into the sources they can
be obtained from. For each category we will look at the kind of
usage that the indicator is a proxy for, which users are captured by
Figure 2.2. Classes of online activity related to research journal articles, arranged by increasing engagement: viewed (PLOS HTML, PLOS PDF, PLOS XML, PMC HTML, PMC PDF), saved (CiteULike, Mendeley), discussed (NatureBlogs, ScienceSeeker, ResearchBlogging, Twitter, Facebook, Wikipedia, PLOS Comments), recommended (F1000 Prime), and cited (Crossref, PMC, Web of Science, Scopus). Reproduced from Lin and Fenner [237], under a Creative Commons Attribution v 3.0 license
the indicator, the limitations that the indicator has as a measure,
and the sources of data. We start with the familiar case of formal
literature citations to provide context.
Example: Citations
Most quantitative analyses of research have focused on citations from research ar-
ticles to other research articles. Many familiar measures—such as Impact Factors,
Scimago Journal Rank, or Eigenfactor—are actually measures of journal rather
than article performance. However, information on citations at the article level is
increasingly the basis for much bibliometric analysis.
Kind of usage
Citing a scholarly work is a signal from a researcher that a specific
work has relevance to, or has influenced, the work they are describing.
It implies significant engagement and is a measure that carries some
weight.
Users
Researchers, which means usage by a specific group for a fairly small
range of purposes.
With high-quality data, there are some geographical, career, and dis-
ciplinary demographic details.
Limitations
The citations are slow to accumulate, as they must pass through a
peer-review process.
It is seldom clear from raw data why a paper is being cited.
It provides a limited view of usage, as it only reflects reuse in research,
not application in the community.
Sources
Public sources of citation data include PubMed Central and Europe
PubMed Central, which mine publicly available full text to find cita-
tions.
Proprietary sources of citation data include Thomson Reuters’ Web of
Knowledge and Elsevier’s Scopus.
Some publishers make citation data collected by Crossref available.
Example: Page views and downloads
A major new source of data online is the number of times articles are viewed. Page
views and downloads can be defined in different ways and can be reached via a
range of paths. Page views are an immediate measure of usage. Viewing a paper
may involve less engagement than citation or bookmarking, but it can capture
interactions with a much wider range of users.
The possibility of drawing demographic information from downloads has sig-
nificant potential for the future in providing detailed information on who is reading
an article, which may be valuable for determining, for example, whether research
is reaching a target audience.
Kind of usage
It counts the number of people who have clicked on an article page or
downloaded an article.
Users
Page views and downloads report on use by those who have access
to articles. For publicly accessible articles this could be anyone; for
subscription articles it is likely to be researchers.
Limitations
Page views are calculated in different ways and are not directly com-
parable across publishers. Standards are being developed but are not
yet widely applied.
Counts of page views cannot easily distinguish between short-term
visitors and those who engage more deeply with an article.
There are complications if an article appears in multiple places, for
example at the journal website and a repository.
Sources
Some publishers and many data repositories make page view data
available in some form. Publishers with public data include PLOS,
Nature Publishing Group, Ubiquity Press, Co-Action Press, and Fron-
tiers.
Data repositories, including Figshare and Dryad, provide page view
and download information.
PubMed Central makes page views of articles hosted on that site avail-
able to depositing publishers. PLOS and a few other publishers make
this available.
Example: Analyzing bookmarks
Tools for collecting and curating personal collections of literature, or web content,
are now available online. They make it easy to make copies and build up indexes of
articles. Bookmarking services can choose to provide information on the number
of people who have bookmarked a paper.
Two important services targeted at researchers are Mendeley and CiteULike.
Mendeley has the larger user base and provides richer statistics. Data include
the number of users who have bookmarked a paper, groups that have collected a
paper, and in some cases demographics of users, which can include discipline,
career stage, and geography.
Bookmarks accumulate rapidly after publication and provide evidence of schol-
arly interest. They correlate quite well with the eventual number of citations. There
are also public bookmarking services that provide a view onto wider interest in re-
search articles.
Kind of usage
Bookmarking is a purposeful act. It may reveal more interest than a
page view, but less than a citation.
Its uses are different from those captured by citations.
The bookmarks may include a variety of documents, such as papers for
background reading, introductory material, position or policy papers,
or statements of community positions.
Users
Academic-focused services provide information on use by researchers.
Each service has a different user profile in, for instance, sciences or
social sciences.
All services have a geographical bias towards North America and Eu-
rope.
There is some demographic information, for instance, on countries
where users are bookmarking the most.
Limitations
There is bias in coverage of services; for instance, Mendeley has good
coverage of biomedical literature.
It can only report on activities of signed-up users.
It is not usually possible to determine why a bookmark has been cre-
ated.
Sources
Mendeley and CiteULike both have public APIs that provide data that
are freely available for reuse.
Most consumer bookmarking services provide some form of API, but
this often has restrictions or limitations.
Example: Discussions on social media
Social media are one of the most valuable new services producing information
about research usage. A growing number of researchers, policymakers, and tech-
nologists are on these services discussing research.
There are three major features of social media as a tool. First, among a large
set of conversations, it is possible to discover a discussion about a specific paper.
Second, Twitter makes it possible to identify groups discussing research and to
learn whether they were potential targets of the research. Third, it is possible to
reconstruct discussions to understand what paths research takes to users.
In the future it will be possible to identify target audiences and to ask whether
they are being reached and how modified distribution might maximize that reach.
This could be a powerful tool, particularly for research with social relevance.
Twitter provides the most useful data because discussions and the identity of
those involved are public. Connections between users and the things they say
are often available, making it possible to identify communities discussing work.
However, the 140-character limit on Twitter messages (“tweets”) does not support
extended critiques. Facebook has much less publicly available information—but
being more private, it can be a site for frank discussion of research.
Kind of usage
Those discussing research are showing interest potentially greater than
page views.
Often users are simply passing on a link or recommending an article.
It is possible to navigate to tweets and determine the level and nature
of interest.
Conversations range from highly technical to trivial, so numbers should
be treated with caution.
Highly tweeted or Facebooked papers also tend to have significant
bookmarking and citation.
Professional discussions can be swamped when a piece of research
captures public interest.
Users
The user bases and data sources for Twitter and Facebook are global
and public.
There are strong geographical biases.
A rising proportion of researchers use Twitter and Facebook for profes-
sional activities.
Many journalists, policymakers, public servants, civil society groups,
and others use social media.
Limitations
Frequent lack of explicit links to papers is a serious limitation.
Use of links is biased towards researchers and against groups not
directly engaged in research.
There are demographic issues and reinforcement effects—retweeting
leads to more retweeting in preference to other research—so analysis
of numbers of tweets or likes is not always useful.
Example: Recommendations
A somewhat separate form of usage is direct expert recommendations. The best-
known case of this is the F1000 service on which experts offer recommendations
with reviews of specific research articles. Other services such as collections or
personal recommendation services may be relevant here as well.
Kind of usage
Recommendations from specific experts show that particular outputs
are worth looking at in detail, are important, or have some other value.
Presumably, recommendations are a result of in-depth reading and a
high level of engagement.
Users
Recommendations are from a selected population of experts depending
on the service in question.
In some cases this might be an algorithmic recommendation service.
Limitations
Recommendations are limited to the interests of the selected popula-
tion of experts.
The recommendation system may be biased in terms of the interests of
recommenders (e.g., towards—or away from—new theories vs. develop-
ing methodology) as well as their disciplines.
Recommendations are slow to build up.
2.4 A functional view
The descriptive view of data types and sources is a good place to
start, but it is subject to change. Sources of data come and go,
and even the classes of data types may expand and contract in the
medium to long term. We also need a more functional perspective
to help us understand how these sources of data relate to activities
in the broader research enterprise.
Consider Figure 1.2 in Chapter 1. The research enterprise has
been framed as being made up of people who are generating out-
puts. The data that we consider in this chapter relate to connections
between outputs, such as citations between research articles and
tweets referring to articles. These connections are themselves cre-
ated by people, as shown in Figure 2.3. The people in turn may be
classed as belonging to certain categories or communities. What is
interesting, and expands on the simplified picture of Figure 1.2, is
that many of these people are not professional researchers. Indeed,
in some cases they may not be people at all but automated systems
of some kind. This means we need to expand the set of actors we are
considering. As described above, we are also expanding the range
of outputs (or objects) that we are considering as well.
In the simple model of Figure 2.3, there are three categories of
things (nodes on the graph): objects, people, and the communities
they belong to. Then there are the relationships between these el-
ements (connections between nodes). Any given data source may
provide information on different parts of this graph, and the in-
formation available is rarely complete or comprehensive. Data from
Figure 2.3. A simplified model of online interactions between research outputs and the objects that refer to them: objects refer to outputs, and both are created by people
different sources can also be difficult to integrate. As with any data integration (see Section 3.2), combining sources relies on being able to confi-
dently identify those nodes that are common between data sources.
Therefore identifying unique objects and people is critical to making
progress.
These data are not necessarily public but many services choose
to make some data available. An important characteristic of these
data sources is that they are completely in the gift of the service
provider. Data availability, its presentation, and upstream analysis
can change without notice. Data are sometimes provided as a dump
but are also frequently provided through an API.
An API is simply a tool that allows a program to interface with a
service. APIs can take many different forms and be of varying quality
and usefulness. In this section we will focus on one common type
of API and examples of important publicly available APIs relevant
to research communications. We will also cover combining APIs
and the benefits and challenges of bringing multiple data sources
together.
2.4.1 Relevant APIs and resources
There is a wide range of other sources of information that can be
used in combination with the APIs featured above to develop an
overview of research outputs and of where and how they are being
used. There are also other tools that can allow deeper analysis of the
outputs themselves. Table 2.1 gives a partial list of key data sources
and APIs that are relevant to the analysis of research outputs.
2.4.2 RESTful APIs, returned data, and Python wrappers
The APIs we will focus on here are all examples of RESTful services.
REST stands for Representational State Transfer [121, 402], but for
Table 2.1. Popular sources of data relevant to the analysis of research outputs (for each source: description; whether an API is provided; whether access is free)

Bibliographic Data
PubMed: An online index that combines bibliographic data from Medline and PubMed Central. PubMed Central and Europe PubMed Central also provide information. (API: Y; Free: Y)
Web of Science: The bibliographic database provided by Thomson Reuters. The ISI Citation Index is also available. (API: Y; Free: N)
Scopus: The bibliographic database provided by Elsevier. It also provides citation information. (API: Y; Free: N)
Crossref: Provides a range of bibliographic metadata and information obtained from members registering DOIs. (API: Y; Free: Y)
Google Scholar: Provides a search index for scholarly objects and aggregates citation information. (API: N; Free: Y)
Microsoft Academic Search: Provides a search index for scholarly objects and aggregates citation information. Not as complete as Google Scholar, but has an API. (API: Y; Free: Y)

Social Media
Altmetric.com: A provider of aggregated data on social media and mainstream media attention of research outputs. Most comprehensive source of information across different social media and mainstream media conversations. (API: Y; Free: N)
Twitter: Provides an API that allows a user to search for recent tweets and obtain some information on specific accounts. (API: Y; Free: Y)
Facebook: The Facebook API gives information on the number of pages, likes, and posts associated with specific web pages. (API: Y; Free: Y)

Author Profiles
ORCID: Unique identifiers for research authors. Profiles include information on publication lists, grants, and affiliations. (API: Y; Free: Y)
LinkedIn: CV-based profiles, projects, and publications. (API: Y; Free: *)

Funder Information
Gateway to Research: A database of funding decisions and related outputs from Research Councils UK. (API: Y; Free: Y)
NIH Reporter: Online search for information on National Institutes of Health grants. Does not provide an API but a downloadable data set is available. (API: N; Free: Y)
NSF Award Search: Online search for information on NSF grants. Does not provide an API but downloadable data sets by year are available. (API: N; Free: Y)

* The data are restricted: sometimes fee based, other times not.
our purposes it is most easily understood as a means of transfer-
ring data using web protocols. Other forms of API require addi-
tional tools or systems to work with, but RESTful APIs work directly
over the web. This has the advantage that a human user can also
with relative ease play with the API to understand how it works.
Indeed, some websites work simply by formatting the results of
API calls.
As an example let us look at the Crossref API. This provides a
range of information associated with Digital Object Identifiers (DOIs)
registered with Crossref. DOIs uniquely identify an object, and
Crossref DOIs refer to research objects, primarily (but not entirely)
research articles. If you use a web browser to navigate to http://api.
crossref.org/works/10.1093/nar/gni170, you should receive back
a webpage that looks something like the following. (We have laid it
out nicely to make it more readable.)
{"status" :"ok",
 "message-type" :"work",
 "message-version" :"1.0.0",
 "message" :
 {"subtitle":[],
  "subject" :["Genetics"],
  "issued" :{"date-parts" :[[2005,10,24]] },
  "score" :1.0,
  "prefix" :"http://id.crossref.org/prefix/10.1093",
  "author" :[{"affiliation" :[],
              "family" :"Whiteford",
              "given" :"N."}],
  "container-title" :["Nucleic Acids Research"],
  "reference-count" :0,
  "page" :"e171-e171",
  "deposited" :{"date-parts" :[[2013,8,8]],
                "timestamp" :1375920000000},
  "issue" :"19",
  "title" :["An analysis of the feasibility of short read sequencing"],
  "type" :"journal-article",
  "DOI" :"10.1093/nar/gni170",
  "ISSN" :["0305-1048","1362-4962"],
  "URL" :"http://dx.doi.org/10.1093/nar/gni170",
  "source" :"Crossref",
  "publisher" :"Oxford University Press (OUP)",
  "indexed" :{"date-parts" :[[2015,6,8]],
              "timestamp" :1433777291246},
  "volume" :"33",
  "member" :"http://id.crossref.org/member/286"
 }
}
This is a package of JavaScript Object Notation (JSON) data (JSON is an open standard way of storing and exchanging data) returned in response to a query. The query is contained entirely
in the URL, which can be broken up into pieces: the root URL
(http://api.crossref.org) and a data “query,” in this case made up of
a “field” (works) and an identifier (the DOI 10.1093/nar/gni170). The
Crossref API provides information about the article identified with
this specific DOI.
2.5 Programming against an API
Programming against an API involves constructing HTTP requests
and parsing the data that are returned. Here we use the Cross-
ref API to illustrate how this is done. Crossref is the provider of
DOIs used by many publishers to uniquely identify scholarly works.
Crossref is not the only organization to provide DOIs; in the scholarly communication space, DataCite is another important provider. The
documentation is available at the Crossref website [394].
Once again the requests Python library provides a series of con-
venience functions that make it easier to make HTTP calls and to
process returned JSON. Our first step is to import the module and
set a base URL variable.
>> import requests
>> BASE_URL = "http://api.crossref.org/"
A simple example is to obtain metadata for an article associated
with a specific DOI. This is a straightforward call to the Crossref
API, similar to what we saw earlier.
>> doi = "10.1093/nar/gni170"
>> query = "works/"
>> url = BASE_URL + query + doi
>> response = requests.get(url)
>> url
http://api.crossref.org/works/10.1093/nar/gni170
>> response.status_code
200
The response object that the requests library has created has a
range of useful information, including the URL called and the re-
sponse code from the web server (in this case 200, which means
everything is OK). We need the JSON body from the response ob-
ject (which is currently text from the perspective of our script) con-
verted to a Python dictionary. The requests module provides a
convenient function for performing this conversion, as the following
code shows. (All strings in the output are in Unicode, hence the u
notation.)
>> response_dict = response.json()
>> response_dict
{u'message' :
  {u'DOI' :u'10.1093/nar/gni170',
   u'ISSN' :[u'0305-1048', u'1362-4962'],
   u'URL' :u'http://dx.doi.org/10.1093/nar/gni170',
   u'author' :[ {u'affiliation' :[],
                 u'family' :u'Whiteford',
                 u'given' :u'N.'} ],
   u'container-title' :[u'Nucleic Acids Research'],
   u'deposited' :{u'date-parts' :[[2013,8,8]],
                  u'timestamp' :1375920000000},
   u'indexed' :{u'date-parts' :[[2015,6,8]],
                u'timestamp' :1433777291246},
   u'issue' :u'19',
   u'issued' :{u'date-parts' :[[2005,10,24]]},
   u'member' :u'http://id.crossref.org/member/286',
   u'page' :u'e171-e171',
   u'prefix' :u'http://id.crossref.org/prefix/10.1093',
   u'publisher' :u'Oxford University Press (OUP)',
   u'reference-count' :0,
   u'score' :1.0,
   u'source' :u'Crossref',
   u'subject' :[u'Genetics'],
   u'subtitle' :[],
   u'title' :[u'An analysis of the feasibility of short read sequencing'],
   u'type' :u'journal-article',
   u'volume' :u'33'
  },
 u'message-type' :u'work',
 u'message-version' :u'1.0.0',
 u'status' :u'ok'
}
This data object can now be processed in whatever way the user
wishes, using standard manipulation techniques.
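For example, a few commonly used fields can be pulled out of response_dict with ordinary dictionary and list indexing (a minimal sketch based on the record shown above):
>> metadata = response_dict["message"]
>> # title and container-title are lists; take the first entry of each
>> print metadata["title"][0]
An analysis of the feasibility of short read sequencing
>> print metadata["container-title"][0]
Nucleic Acids Research
>> first_author = metadata["author"][0]
>> print first_author["family"], first_author["given"]
Whiteford N.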
The Crossref API can, of course, do much more than simply look
up article metadata. It is also valuable as a search resource and
for cross-referencing information by journal, funder, publisher, and
other criteria. More details can be found at the Crossref website.
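The metadata lookup above can be adapted into a keyword search using the same requests pattern. The sketch below is illustrative: the query and rows parameters and the message/items structure follow the Crossref REST API documentation at the time of writing and should be checked against the current documentation.
>> search_query = {"query": "short read sequencing", "rows": 5}
>> search_response = requests.get(BASE_URL + "works", params=search_query)
>> # Each item in the returned list is a metadata record like the one above
>> for item in search_response.json()["message"]["items"]:
>>     print item.get("DOI"), item.get("title")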
2.6 Using the ORCID API via a wrapper
ORCID, which stands for “Open Researcher and Contributor ID” (see orcid.org; see also [145]), is a service that provides unique
identifiers for researchers. Researchers can claim an ORCID profile
and populate it with references to their research works, funding and
affiliations. ORCID provides an API for interacting with this infor-
mation. For many APIs there is a convenient Python wrapper that
can be used. The ORCID–Python wrapper works with the ORCID
v1.2 API to make various API calls straightforward. This wrapper
only works with the public ORCID API and can therefore only access
publicly available data.
Using the API and wrapper together provides a convenient means
of getting this information. For instance, given an ORCID, it is
straightforward to get profile information. Here we get a list of pub-
lications associated with my ORCID and look at the first item
on the list.
>> import orcid
>> cn = orcid.get("0000-0002-0068-716X")
>> cn
<Author Cameron Neylon, ORCID 0000-0002-0068-716X>
>> cn.publications[0]
<Publication "Principles for Open Scholarly Infrastructures-v1">
The wrapper has created Python objects that make it easier to
work with and manipulate the data. It is common to take the return
from an API and create objects that behave as would be expected
in Python. For instance, the publications object is a list popu-
lated with publications (which are also Python-like objects). Each
publication in the list has its own attributes, which can then be
examined individually. In this case the external IDs attribute is a
list of further objects that include a DOI for the article and the ISSN
of the journal the article was published in.
>> len(cn.publications)
70
>> cn.publications[12].external_ids
[<ExternalID DOI:10.1371/journal.pbio.1001677>, <ExternalID ISSN:1545-7885>]
As a simple example of data processing, we can iterate over the
list of publications to identify those for which a DOI has been pro-
vided. In this case we can see that of the 70 publications listed in
this ORCID profile (at the time of testing), 66 have DOIs.
>> exids = []
>> for pub in cn.publications:
if pub.external_ids:
exids = exids + pub.external_ids
>> DOIs = [exid.id for exid in exids if exid.type == "DOI"]
>> len(DOIs)
66
Wrappers generally make operating with an API simpler and
cleaner by abstracting away the details of making HTTP requests.
Achieving the same by directly interacting with the ORCID API
would require constructing the appropriate URLs and parsing the
returned data into a usable form. Where a wrapper is available it
is generally much easier to use. However, wrappers may not be
actively developed and may lag the development of the API. Where
possible, use a wrapper that is directly supported or recommended
by the API provider.
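For comparison, a minimal sketch of the direct route is shown below. The endpoint URL and Accept header are assumptions based on the public v1.2 ORCID API that the wrapper targets, and should be checked against the current ORCID documentation; the returned JSON then has to be navigated by hand rather than through convenient Python objects.
>> orcid_id = "0000-0002-0068-716X"
>> # Assumed v1.2 public API route; check the ORCID docs for the current form
>> works_url = "http://pub.orcid.org/v1.2/" + orcid_id + "/orcid-works"
>> headers = {"Accept": "application/orcid+json"}
>> works_response = requests.get(works_url, headers=headers)
>> works_data = works_response.json()  # nested dictionaries rather than Python objects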
2.7 Quality, scope, and management
The examples in the previous section are just a small dip into the
surface of the data available, but we already can see a number of
issues that are starting to surface. A great deal of care needs to
be taken when using these data, and a researcher will need to ap-
ply subject matter knowledge as well as broader data management
expertise (see also Chapter 10). Some of the core issues are as follows:
Integration In the examples given above with Crossref and ORCID,
we used a known identifier (a DOI or an ORCID). Integrating data
from Crossref to supplement the information from an ORCID profile
is possible, but it depends on the linking of identifiers. Note that
for the profile data we obtained, only 66 of the 70 items had DOIs.
Data integration across multiple data sources that reference DOIs
is straightforward for those objects that have DOIs, and messy or
impossible for those that do not. In general, integration is possible,
but it depends on a means of cross-referencing between data sets.
Unique identifiers that are common to both are extremely powerful
but only exist in certain cases (see also Chapter 3).
Coverage Without a population frame, it is difficult to know
whether the information that can be captured is comprehensive.
For example, “the research literature” is at best a vague concept. A
variety of indexes, some openly available (PubMed, Crossref), some
proprietary (Scopus, Web of Knowledge, many others), cover differ-
ent partially overlapping segments of this corpus of work. Each in-
dex has differing criteria for inclusion and differing commitments to
completeness. Sampling of “the literature” is therefore impossible,
and the choice of index used for any study can make a substantial
difference to the conclusions.
Completeness Alongside the question of coverage (how broad is
a data source?), with web data and opt-in services we also need to
probe the completeness of a data set. In the example above, 66 of 70
objects have a DOI registered. This does not mean that those four
other objects do not have a DOI, just that there are none included in
the ORCID record. Similarly, ORCID profiles only exist for a subset
of researchers at this stage. Completeness feeds into integration
challenges. While many researchers have a Twitter profile and many
have an ORCID profile, only a small subset of ORCID profiles provide
a link to a Twitter profile. See below for a worked example.
Scope In survey data sets, the scope is defined by the question
being asked. This is not the case with many of these new forms of data.
For example, the challenges listed above for research articles, tra-
ditionally considered the bedrock of research outputs, at least in
the natural sciences, are much greater for other forms of research
outputs. Increasingly, the data generated from research projects,
software, materials, and tools, as well as reports and presentations,
are being shared by researchers in a variety of settings. Some of
these are formal mechanisms for publication, such as large disci-
plinary databases, books, and software repositories, and some are
highly informal. Any study of (a subset of) these outputs has as its
first challenge the question of how to limit the corpus to be studied.
Source and validity The challenges described above relate to the
identification and counting of outputs. As we start to address ques-
tions of how these outputs are being used, the issues are com-
pounded. To illustrate some of the difficulties that can arise, we
examine the number of citations that have been reported for a sin-
gle sample article on a biochemical methodology [68]. This article
has been available for eight years and has accumulated a reason-
able number of citations for such an article over that time.
However, the exact number of citations identified varies radi-
cally, depending on the data source. Scopus finds 40, while Web
of Science finds only 38. A Google Scholar search performed on
the same date identified 59. These differences relate to the size of
the corpus from which inward citations are being counted. Web of
Science has the smallest database, with Scopus being larger and
Google Scholar substantially larger again. Thus the size of the in-
dex not only affects output counting, it can also have a substantial
effect on any analysis that uses that corpus. Alongside the size
of the corpus, the means of analysis can also have an effect. For
the same article, PubMed Central reports 10 citations but Europe
PubMed Central reports 18, despite using a similar corpus. The
distinction lies in differences in the methodology used to mine the
corpus for citations.
Identifying the underlying latent variable These issues multiply as
we move into newer forms of data. These sparse and incomplete
sources of data require different treatment than more traditional
structured and comprehensive forms of data. They are more useful
as a way of identifying activities than of quantifying or comparing
them. Nevertheless, they can provide new insight into the pro-
cesses of knowledge dissemination and community building that
are occurring online.
2.8 Integrating data from multiple sources
We often must work across multiple data sources to gather the in-
formation needed to answer a research question. A common pattern
is to search in one location to create a list of identifiers and then use
those identifiers to query another API. In the ORCID example above,
we created a list of DOIs from a single ORCID profile. We could use
those DOIs to obtain further information from the Crossref API and
other sources. This models a common path for analysis of research
outputs: identifying a corpus and then seeking information on its
performance.
In this example, we will build on the ORCID and Crossref ex-
amples to collect a set of work identifiers from an ORCID profile
and use a range of APIs to identify additional metadata as well as
information on the performance of those articles. In addition to the
ORCID API, we will use the PLOS Lagotto API. Lagotto is the soft-
ware that was built to support the Article Level Metrics program at
PLOS, the open access publisher, and its API provides information
on various metrics of PLOS articles. A range of other publishers
and service providers, including Crossref, also provide an instance
of this API, meaning the same tools can be used to collect informa-
tion on articles from a range of sources.
2.8.1 The Lagotto API
The module pyalm is a wrapper for the Lagotto API, which is served
from a range of hosts. We will work with two instances in particular:
one run by PLOS, and the Crossref DOI Event Tracker (DET, recently
renamed Crossref Event Data) pilot service. We first need to provide
the details of the URLs for these instances to our wrapper. Then
we can obtain some information for a single DOI to see what the
returned data look like.
>> import pyalm
>> pyalm.config.APIS = {'plos' : {'url' :
>>     'http://alm.plos.org/api/v5/articles'},
>>     'det' : {'url' :
>>     'http://det.labs.crossref.org/api/v5/articles'}
>>     }
>> det_alm_test = pyalm.get_alm('10.1371/journal.pbio.1001677',
>>     info='detail', instance='det')
>> det_alm_test
{'articles' : [<ArticleALM Expert Failure: Re-evaluating Research
    Assessment, DOI 10.1371/journal.pbio.1001677>],
 'meta' : {u'error' : None, u'page' : 1,
    u'total' : 1, u'total_pages' : 1}
}
The library returns a Python dictionary containing two elements.
The articles key contains the actual data and the meta key includes
general information on the results of the interaction with the API.
In this case the library has returned one page of results containing
one object (because we only asked about one DOI). If we want to
collect a lot of data, this information helps in the process of paging
through results. It is common for APIs to impose some limit on
the number of results returned, so as to ensure performance. By
default the Lagotto API has a limit of 50 results.
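As a sketch of how paging might be handled, the loop below keeps
requesting pages until the page counter passes the total_pages value
reported in the meta element. The page keyword argument to
pyalm.get_alm is an assumption, not a documented signature; check the
pyalm and Lagotto documentation for the actual parameter name.
# A sketch of paging through Lagotto results using the 'meta' element.
# The 'page' keyword argument is an assumption; verify it against the
# pyalm and Lagotto documentation before relying on it.
def get_all_articles(ids, instance='det'):
    articles = []
    page = 1
    total_pages = 1
    while page <= total_pages:
        resp = pyalm.get_alm(ids, info='detail', instance=instance,
                             page=page)
        articles.extend(resp.get('articles'))
        total_pages = resp.get('meta')['total_pages']
        page += 1
    return articles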
The articles key holds a list of ArticleALM objects as its value.
Each ArticleALM object has a set of internal attributes that contain
information on each of the metrics that the Lagotto instance col-
lects. These are derived from various data providers and are called
sources. Each can be accessed by name from a dictionary called
“sources.” The iterkeys() function provides an iterator that lets us
loop over the set of keys in a dictionary. Within the source object
there is a range of information that we will dig into.
>> article = det_alm_test.get('articles')[0]
>> article.title
u'Expert Failure: Re-evaluating Research Assessment'
>> for source in article.sources.iterkeys():
>>     print source, article.sources[source].metrics.total
reddit 0
datacite 0
pmceuropedata 0
wikipedia 1
pmceurope 0
citeulike 0
pubmed 0
facebook 0
wordpress 0
pmc 0
mendeley 0
crossref 0
The DET service only has a record of citations to this article
from Wikipedia. As we will see below, the PLOS service returns
more results. This is because some of the sources are not yet being
queried by DET.
Because this is a PLOS paper we can also query the PLOS Lagotto
instance for the same article.
>> plos_alm_test = pyalm.get_alm('10.1371/journal.pbio.1001677',
>>     info='detail', instance='plos')
>> article_plos = plos_alm_test.get('articles')[0]
>> article_plos.title
u'Expert Failure: Re-evaluating Research Assessment'
>> for source in article_plos.sources.iterkeys():
>>     print source, article_plos.sources[source].metrics.total
datacite 0
twitter 130
pmc 610
articlecoveragecurated 0
pmceurope 1
pmceuropedata 0
researchblogging 0
scienceseeker 0
copernicus 0
f1000 0
wikipedia 1
citeulike 0
wordpress 2
openedition 0
reddit 0
nature 0
relativemetric 125479
figshare 0
facebook 1
mendeley 14
crossref 3
plos_comments 2
articlecoverage 0
counter 12551
scopus 2
pubmed 1
orcid 3
The PLOS instance provides a greater range of information but
also seems to be giving larger numbers than the DET instance in
many cases. For those sources that are provided by both API in-
stances, we can compare the results returned.
>> for source in article.sources.iterkeys():
>>     print source, article.sources[source].metrics.total, \
>>         article_plos.sources[source].metrics.total
reddit 0 0
datacite 0 0
pmceuropedata 0 0
wikipedia 1 1
pmceurope 0 1
citeulike 0 0
pubmed 0 1
facebook 0 1
wordpress 0 2
pmc 0 610
mendeley 0 14
crossref 0 3
The PLOS Lagotto instance is collecting more information and
has a wider range of information sources. Comparing the results
from the PLOS and DET instances illustrates the issues of coverage
and completeness discussed previously. The data may be sparse for
a variety of reasons, and it is important to have a clear idea of the
strengths and weaknesses of a particular data source or aggregator.
In this case the DET instance is returning information for some
sources for which it does not yet have data.
We can dig deeper into the events themselves that the met-
rics.total count aggregates. The API wrapper collects these into
an event object within the source object. These contain the JSON
returned from the API in most cases. For instance, the Crossref
source is a list of JSON objects containing information on an article
that cites our article of interest. The first citation event in the list is
a citation from the Journal of the Association for Information Science
and Technology by Du et al.
>> article_plos.sources['crossref'].events[0]
{u'event' :
    {u'article_title' : u'The effects of research level and article
        type on the differences between citation metrics and F1000
        recommendations',
     u'contributors' :
        {u'contributor' :
            [ { u'contributor_role' : u'author',
                u'first_author' : u'true',
                u'given_name' : u'Jian',
                u'sequence' : u'first',
                u'surname' : u'Du' },
              { u'contributor_role' : u'author',
                u'first_author' : u'false',
                u'given_name' : u'Xiaoli',
                u'sequence' : u'additional',
                u'surname' : u'Tang'},
              { u'contributor_role' : u'author',
                u'first_author' : u'false',
                u'given_name' : u'Yishan',
                u'sequence' : u'additional',
                u'surname' : u'Wu'} ]
        },
     u'doi' : u'10.1002/asi.23548',
     u'first_page' : u'n/a',
     u'fl_count' : u'0',
     u'issn' : u'23301635',
     u'journal_abbreviation' : u'J Assn Inf Sci Tec',
     u'journal_title' : u'Journal of the Association for
        Information Science and Technology',
     u'publication_type' : u'full_text',
     u'year' : u'2015'
    },
 u'event_csl' : {
     u'author' :
        [ { u'family' : u'Du', u'given' : u'Jian'},
          { u'family' : u'Tang', u'given' : u'Xiaoli'},
          { u'family' : u'Wu', u'given' : u'Yishan'} ],
     u'container-title' : u'Journal of the Association for
        Information Science and Technology',
     u'issued' : {u'date-parts' : [[2015]]},
     u'title' : u'The Effects Of Research Level And Article Type
        On The Differences Between Citation Metrics And F1000
        Recommendations',
     u'type' : u'article-journal',
     u'url' : u'http://doi.org/10.1002/asi.23548'
    },
 u'event_url' : u'http://doi.org/10.1002/asi.23548'
}
Another source in the PLOS data is Twitter. In the case of the
Twitter events (individual tweets), this provides the text of the tweet,
user IDs, user names, URL of the tweet, and the date. We can see
from the length of the events list that there are at least 130 tweets
that link to this article.
>> len(article_plos.sources['twitter'].events)
130
Again, noting the issues of coverage, scope, and completeness,
it is important to consider the limitations of these data. This is a
lower bound as it represents search results returned by searching
the Twitter API for the DOI or URL of the article. Other tweets that
discuss the article may not include a link, and the Twitter search
API also has limitations that can lead to incomplete results. The
number must therefore be seen as both incomplete and a lower
bound.
We can look more closely at data on the first tweet on the list.
Bear in mind that the order of the list is not necessarily special.
This is not the first tweet about this article chronologically.
>> article_plos.sources['twitter'].events[0]
{ u'event' : {u'created_at' : u'2013-10-08T21:12:28Z',
    u'id' : u'387686960585641984',
    u'text' : u'We have identified the Higgs boson; it is surely not
        beyond our reach to make research assessment useful
        http://t.co/Odcm8dVRSU #PLOSBiology',
    u'user' : u'catmacOA',
    u'user_name' : u'Catriona MacCallum',
    u'user_profile_image' :
        u'http://a0.twimg.com/profile_images/1779875975/CM_photo_reduced_normal.jpg'},
  u'event_time' : u'2013-10-08T21:12:28Z',
  u'event_url' : u'http://twitter.com/catmacOA/status/387686960585641984'
}
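If chronological order matters, the events can be sorted on their
created_at timestamps (ISO 8601 strings sort correctly as plain
text). A minimal sketch:
# Sort the tweet events chronologically by their created_at timestamp.
tweets = article_plos.sources['twitter'].events
tweets_by_date = sorted(tweets, key=lambda t: t['event']['created_at'])
print(tweets_by_date[0]['event']['created_at'])  # the earliest captured tweet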
We could use the Twitter API to understand more about this
person. For instance, we could look at their Twitter followers and
whom they follow, or analyze the text of their tweets for topic mod-
eling. Much work on social media interactions is done with this
kind of data, using forms of network and text analysis described
elsewhere in this book.
See Chapters 7 and 8.
A different approach is to integrate these data with informa-
tion from another source. We might be interested, for instance,
in whether the author of this tweet is a researcher, or whether they
have authored research papers. One thing we could do is search
the ORCID API to see if there are any ORCID profiles that link to
this Twitter handle.
>> twitter_search = orcid.search("catmacOA")
>> for result in twitter_search:
>>     print unicode(result)
>>     print result.researcher_urls
<Author Catriona MacCallum, ORCID 0000-0001-9623-2225>
[<Website twitter [http://twitter.com/catmacOA]>]
So the person with this Twitter handle seems to have an ORCID
profile. That means we can also use ORCID to gather more infor-
mation on their outputs. Perhaps they have authored work which
is relevant to our article?
>> cm = orcid.get("0000-0001-9623-2225")
>> for pub in cm.publications[0:5]:
>>     print pub.title
The future is open: opportunities for publishers and institutions
Open Science and Reporting Animal Studies: Who’s Accountable?
Expert Failure: Re-evaluating Research Assessment
Why ONE Is More Than 5
Reporting Animal Studies: Good Science and a Duty of Care
From this analysis we can show that this tweet is actually from
one of my co-authors of the article.
To make this process easier we write the convenience function
shown in Listing 2.2, which takes a Twitter handle and attempts to
find a matching ORCID.
# Take a Twitter handle or user name and return an ORCID
def twitter2orcid(twitter_handle,
                  resp='orcid', search_depth=10):
    search = orcid.search(twitter_handle)
    s = [r for r in search]
    orc = None
    i = 0
    while i < search_depth and orc is None and i < len(s):
        arr = [('twitter.com' in website.url)
               for website in s[i].researcher_urls]
        if True in arr:
            index = arr.index(True)
            url = s[i].researcher_urls[index].url
            if url.lower().endswith(twitter_handle.lower()):
                orc = s[i].orcid
                return orc
        i += 1
    return None
Listing 2.2. Python code to find ORCID for Twitter handle
Let us do a quick test of the function.
>> twitter2orcid('catmacOA')
u'0000-0001-9623-2225'
2.8.2 Working with a corpus
In this case we will continue as previously to collect a set of works
from a single ORCID profile. This collection could just as easily be
a date range, or subject search at a range of other APIs. The target
is to obtain a set of identifiers (in this case DOIs) that can be used
to precisely query other data sources. This is a general pattern
that reflects the issues of scope and source discussed above. The
choice of how to construct a corpus to analyze will strongly affect
the results and the conclusions that can be drawn.
>> # As previously, collect DOIs available from an ORCID profile
>> cn = orcid.get("0000-0002-0068-716X")
>> exids = []
>> for pub in cn.publications:
>>     if pub.external_ids:
>>         exids = exids + pub.external_ids
>> DOIs = [exid.id for exid in exids if exid.type == "DOI"]
>> len(DOIs)
66
We have recovered 66 DOIs from the ORCID profile. Note that we
have not obtained an identifier for every work, as not all have DOIs.
This result illustrates an important point about data integration. In
practice it is generally not worth the effort of attempting to integrate
data on objects unless they have a unique identifier or key that
can be used in multiple data sources, hence the focus on DOIs and
ORCIDs in these examples. Even in our search of the ORCID API
for profiles that are associated with a Twitter account, we used the
Twitter handle as a unique ID to search on.
While it is possible to work with author names or the titles of
works directly, disambiguating such names and titles is substan-
tially more difficult than working with unique identifiers. Other
chapters (in particular, Chapter 3) deal with issues of data cleaning
and disambiguation. Much work has been done on this basis, but
increasingly you will see that the first step in any analysis is simply
to discard objects without a unique ID that can be used across data
sources.
We can obtain data for these from the DET API. As is common
with many APIs, there is a limit on how many identifiers can be
queried in a single request, in this case 50, so we divide our query
into batches.
>> batches = [DOIs[0:50], DOIs[50:]]
>> det_alms = []
>> for batch in batches:
>>     alms_response = pyalm.get_alm(batch, info="detail",
>>         instance="det")
>>     det_alms.extend(alms_response.get('articles'))
>> len(det_alms)
24
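Manual slicing like this is easy to get wrong for lists of arbitrary
length. A small, general-purpose helper (a sketch, not part of
pyalm) avoids dropping identifiers:
# Split a list of identifiers into batches of at most `size` items so that
# no DOIs are silently dropped by hand-written slices.
def batched(items, size=50):
    return [items[i:i + size] for i in range(0, len(items), size)]

# For example: batches = batched(DOIs, 50)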
The DET API only provides information on a subset of Cross-
ref DOIs. The process that Crossref has followed to populate its
database has focused on more recently published articles, so only
24 responses are received in this case for the 66 DOIs we queried on.
A good exercise would be to look at which of the DOIs are found and
which are not. Let us see how much interesting data is available in
the subset of DOIs for which we have data.
>> for r in [d for d in det_alms
>>           if d.sources['wikipedia'].metrics.total != 0]:
>>     print r.title
>>     print '   ', r.sources['pmceurope'].metrics.total, 'pmceurope citations'
>>     print '   ', r.sources['wikipedia'].metrics.total, 'wikipedia citations'
Architecting the Future of Research Communication: Building the
Models and Analytics for an Open Access Future
1 pmceurope citations
1 wikipedia citations
Expert Failure: Re-evaluating Research Assessment
0 pmceurope citations
1 wikipedia citations
LabTrove: A Lightweight, Web Based, Laboratory "Blog" as a Route
towards a Marked Up Record of Work in a Bioscience Research
Laboratory
0 pmceurope citations
1 wikipedia citations
The lipidome and proteome of oil bodies from Helianthus annuus (
common sunflower)
2 pmceurope citations
1 wikipedia citations
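As suggested above, a quick way to see which of the queried DOIs
were not found is to compare sets. The sketch below assumes that
each returned ArticleALM object exposes a doi attribute; verify this
against the pyalm documentation.
# Compare the DOIs we queried with those the DET instance returned.
# The .doi attribute on ArticleALM objects is an assumption; check pyalm.
returned_dois = set(article.doi for article in det_alms)
missing_dois = set(DOIs) - returned_dois
print('{} DOIs had no DET record'.format(len(missing_dois)))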
As discussed above, this shows that the DET instance, while it
provides information on a greater number of DOIs, has less com-
plete data on each DOI at this stage. Only four of the 24 responses
have Wikipedia references. You can change the code to look at the
full set of 24, which shows only sparse data. The PLOS Lagotto
instance provides more data but only on PLOS articles. However, it
does provide data on all PLOS articles, going back earlier than the
set returned by the DET instance. We can collect the set of articles
from the profile published by PLOS.
>> plos_dois = []
>> for doi in DOIs:
>>     # Quick and dirty, should check Crossref API for publisher
>>     if doi.startswith('10.1371'):
>>         plos_dois.append(doi)
>> len(plos_dois)
7
>> plos_alms = pyalm.get_alm(plos_dois, info='detail',
>>     instance='plos').get('articles')
>> for article in plos_alms:
>>     print article.title
>>     print '   ', article.sources['crossref'].metrics.total, 'Crossref citations'
>>     print '   ', article.sources['twitter'].metrics.total, 'tweets'
Architecting the Future of Research Communication: Building the
Models and Analytics for an Open Access Future
2 Crossref citations
48 tweets
Expert Failure: Re-evaluating Research Assessment
3 Crossref citations
130 tweets
LabTrove: A Lightweight, Web Based, Laboratory "Blog" as a Route
towards a Marked Up Record of Work in a Bioscience Research
Laboratory
6 Crossref citations
1 tweets
More Than Just Access: Delivering on a Network-Enabled Literature
4 Crossref citations
95 tweets
Article-Level Metrics and the Evolution of Scientific Impact
24 Crossref citations
5 tweets
Optimal Probe Length Varies for Targets with High Sequence
Variation: Implications for Probe
Library Design for Resequencing Highly Variable Genes
2 Crossref citations
1 tweets
Covalent Attachment of Proteins to Solid Supports and Surfaces via
Sortase-Mediated Ligation
40 Crossref citations
0 tweets
From the previous examples we know that we can obtain in-
formation on citing articles and tweets associated with these 66
articles. From that initial corpus we now have a collection of up to
86 related articles (cited and citing), a few hundred tweets that refer
to (some of) those articles, and perhaps 500 people if we include
authors of both articles and tweets. Note how for each of these
links our query is limited, so we have a subset of all the related
objects and agents. At this stage we probably have duplicate articles
(one article might cite multiple articles in our set of seven) and duplicate
people (authors in common between articles and authors who are
also tweeting).
These data could be used for network analysis, to build up a new
corpus of articles (by following the citation links), or to analyze the
links between authors and those tweeting about the articles. We do
not pursue an in-depth analysis here, but will gather the relevant
objects, deduplicate them as far as possible, and count how many
we have in preparation for future analysis.
>> # Collect all citing DOIs & author names from citing articles
>> citing_dois = []
>> citing_authors = []
>> for article in plos_alms:
>>     for cite in article.sources['crossref'].events:
>>         citing_dois.append(cite['event']['doi'])
>>         # Use 'extend' because the element is a list
>>         citing_authors.extend(cite['event_csl']['author'])
>> print '\nBefore deduplication:'
>> print '  ', len(citing_dois), 'DOIs'
>> print '  ', len(citing_authors), 'citing authors'
>>
>> # Easiest way to deduplicate is to convert to a Python set
>> citing_dois = set(citing_dois)
>> citing_authors = set([author['given'] + author['family']
>>                       for author in citing_authors])
>> print '\nAfter deduplication:'
>> print '  ', len(citing_dois), 'DOIs'
>> print '  ', len(citing_authors), 'citing authors'
Before deduplication:
   81 DOIs
   346 citing authors
After deduplication:
   78 DOIs
   278 citing authors
>> # Collect all tweets, usernames; check for ORCIDs
>> tweet_urls = set()
>> twitter_handles = set()
>> for article in plos_alms:
>>     for tweet in article.sources['twitter'].events:
>>         tweet_urls.add(tweet['event_url'])
>>         twitter_handles.add(tweet['event']['user'])
>> # No need to explicitly deduplicate as we created sets directly
>> print len(tweet_urls), 'tweets'
>> print len(twitter_handles), 'Twitter users'
280 tweets
210 Twitter users
It could be interesting to look at which Twitter users interact
most with the articles associated with this ORCID profile. To do
that we would need to create not a set but a list, and then count the
number of duplicates in the list. The code could be easily modified
to do this. Another useful exercise would be to search ORCID for
profiles corresponding to citing authors. The best way to do this
would be to obtain ORCIDs associated with each of the citing ar-
ticles. However, because ORCID data are sparse and incomplete,
there are two limitations here. First, the author may not have an
ORCID. Second, the article may not be explicitly linked to another
article. Try searching ORCID for the DOIs associated with each of
the citing articles.
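For the first of these exercises (counting which Twitter users appear
most often), a minimal sketch using collections.Counter on the
plos_alms data gathered above might look like this:
# Count how often each Twitter handle appears across all tweet events,
# rather than collapsing the handles into a set as we did above.
from collections import Counter

handle_counts = Counter()
for article in plos_alms:
    for tweet in article.sources['twitter'].events:
        handle_counts[tweet['event']['user']] += 1
# The ten handles that tweeted most often about this set of articles
print(handle_counts.most_common(10))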
In this case we will look to see how many of the Twitter handles
discussing these articles are associated with an ORCID profile we
can discover. This in turn could lead to more profiles and more
cycles of analysis to build up a network of researchers interacting
through citation and on Twitter. Note that we have inserted a delay
between calls. This is because we are making a larger number of API
calls (one for each Twitter handle). It is considered polite to keep
the pace at which calls are made to an API to a reasonable level.
The ORCID API does not post suggested limits at the moment, but
delaying for a second between calls is reasonable.
>> import time
>> tweet_orcids = []
>> for handle in twitter_handles:
>>     orc = twitter2orcid(handle)
>>     if orc:
>>         tweet_orcids.append(orc)
>>     time.sleep(1)  # wait one second between calls to the ORCID API
>> print len(tweet_orcids)
12
In this case we have identified 12 ORCID profiles that we can
link positively to tweets about this set of articles. This is a substan-
tial underestimate of the likely number of ORCIDs associated with
these tweets. However, relatively few ORCIDs have Twitter accounts
registered as part of the profile. To gain a broader picture a search
and matching strategy would need to be applied. Nevertheless, for
these 12 we can look more closely into the profiles.
The first step is to obtain the actual profile information for each
of the 12 ORCIDs that we have found. Note that at the moment
what we have is the ORCIDs themselves, not the retrieved profiles.
>> orcs = []
>> for id in tweet_orcids:
>>     orcs.append(orcid.get(id))
With the profiles retrieved we can then take a look at who they
are, and check that we do in fact have sensible Twitter handles
associated with them. We could use this to build up the network of
related authors and Twitter users for further analysis.
>> for orc in orcs:
>>     i = [('twitter.com' in website.url)
>>          for website in orc.researcher_urls].index(True)
>>     twitter_url = orc.researcher_urls[i].url
>>     print orc.given_name, orc.family_name, orc.orcid, twitter_url
Catriona MacCallum 0000-0001-9623-2225 http://twitter.com/catmacOA
John Dupuis 0000-0002-6066-690X https://twitter.com/dupuisj
Johannes Velterop 0000-0002-4836-6568 https://twitter.com/Villavelius
Stuart Lawson 0000-0002-1972-8953 https://twitter.com/Lawsonstu
Nelson Piedra 0000-0003-1067-8707 http://www.twitter.com/nopiedra
Iryna Kuchma 0000-0002-2064-3439 https://twitter.com/irynakuchma
Frank Huysmans 0000-0002-3468-9032 https://twitter.com/fhuysmans
Salvatore Salvi VICIDOMINI 0000-0001-5086-7401 https://twitter.com/SalViVicidomini
William Gunn 0000-0002-3555-2054 http://twitter.com/mrgunn
Stephen Curry 0000-0002-0552-8870 https://twitter.com/Stephen_Curry
Cameron Neylon 0000-0002-0068-716X http://twitter.com/cameronneylon
Graham Steel 0000-0003-4681-8011 https://twitter.com/McDawg
2.9 Working with the graph of relationships
In the above examples we started with the profile of an individual,
used this to create a corpus of works, which in turn led us to other
citing works (and their authors) and commentary about those works
on Twitter (and the people who wrote those comments). Along the
way we built up a graph of relationships between objects and people.
See Chapter 8.
In this section we will look at this model of the data and how it
reveals limitations and strengths of these forms of data and what
can be done with them.
2.9.1 Citation links between articles
A citation in a research article (or a policy document or working
paper) defines a relationship between that citing article and the
cited article. The exact form of the relationship is generally poorly
defined, at least at the level of large-scale data sets. A citation
might be referring to previous work, indicating the source of data,
or supporting (or refuting) an idea. While efforts have been made to
codify citation types, they have thus far gained little traction.
In our example we used a particular data source (Crossref) for
information about citations. As previously discussed, this will give
different results than other sources (such as Thomson Reuters, Sco-
pus, or Google Scholar) because other sources look at citations from
a different set of articles and collect them in a different way. The
completeness of the data will always be limited. We could use the
data to clearly connect the citing articles and their authors because
author information is generally available in bibliographic metadata.
However, we would have run into problems if we had only had
names. ORCIDs can provide a way to uniquely identify authors
and ensure that our graph of relationships is clean.
A citation is a reference from an object of one type to an object
of the same type. We also sought to link social media activity with
specific articles. Rather than a link between objects that are the
same (articles) we started to connect different kinds of objects to-
gether. We are also expanding the scope of the communities (i.e.,
people) that might be involved. While we focused on the question of
which Twitter handles were connected with researchers, we could
Figure 2.4. A functional view of proxies and relationships (agents such as funders, institutions, research groups, researchers, patient groups, and other agents/stakeholders; outputs such as articles and tweets; relationships such as "authored by," "tweeted by," and "links to")
just as easily have focused on trying to discover which comments
came from people who are not researchers.
We used the Lagotto API at PLOS to obtain this information.
The PLOS API in turn depends on the Twitter Search API. A tweet
that refers explicitly to a research article, perhaps via a Crossref
DOI, can be discovered, and a range of services do these kinds of
checks. These services generally rely either on Twitter Search or,
more generally, on a search of “the firehose,” a dump of all Twitter
data that are available for purchase. The distinction is important
because Twitter Search does not provide either a complete or a con-
sistent set of results. In addition, there will be many references to
research articles that do not contain a unique identifier, or even a
link. These are more challenging to discover. As with citations, the
completeness of any data set will always be limited.
However, the set of all tweets is a more defined set of objects than
the set of all “articles.” Twitter is a specific social media service with
a defined scope. “Articles” is a broad class of objects served by a
very wide range of services. Twitter is clearly a subset of all discus-
sions and is highly unlikely to be representative of “all discussions.”
Equally the set of all objects with a Crossref DOI, while defined, is
unlikely to be representative of all articles.
Expanding on Figure 2.3, we show in Figure 2.4 agents and ac-
tors (people) and outputs. We place both agents and outputs into
categories that may be more or less well defined. In practice our
analysis is limited to those objects that we discover by using some
“selector” (circles in this diagram), which may or may not have a
close correspondence with the “real” categories (shown with graded
shapes). Our aim is to identify, aggregate, and in some cases count
the relationships between and within categories of objects; for in-
stance, citations are relationships between formal research outputs.
A tweet may have a relationship (“links to”) with a specific formally
published research output. Both tweets and formal research out-
puts relate to specific agents (“authors”) of the content.
2.9.2 Categories, sources, and connections
We can see in this example a distinction between categories of ob-
jects of interest (articles, discussions, people) and sources of infor-
mation on subsets of those categories (Crossref, Twitter, ORCID).
Any analysis will depend on one or more data sources, and in turn
be limited by the coverage of those data sources. The selectors
used to generate data sets from these sources will have their own
limitations.
Similar to a query on a structured data set, the selector itself may
introduce bias. The crucial difference between filtering on a com-
prehensive (or at least representative) data set and the data sources
we are discussing here is that these data sources are by their very
nature incomplete. Survey data may include biases introduced in
the way that the survey itself is structured or the sampling is de-
signed, but the intent is to be comprehensive. Many of these new
forms of data make no attempt to be comprehensive or avowedly
avoid such an attempt.
Understanding this incompleteness is crucial to understanding
the forms of inference that can be made from these data. Sampling
is only possible within a given source or corpus, and this limits the
conclusions that can be drawn to the scope of that corpus. It is
frequently possible to advance a plausible argument to claim that
such findings are more broadly applicable, but it is crucial to avoid
assuming that this is the case. In particular, it is important to
be clear about what data sources a finding applies to and where
the boundary between the strongly evidenced finding and a claim
about its generalization lies. Much of the literature on scholarly
communications and research impact is poor on this point.
If this is an issue in identifying the objects of interest, it is even
more serious when seeking to identify the relationships between
them, which are, after all, generally the thing of interest. In some
cases there are reasonably good sources of data between objects of
the same class (at least those available from the same data sources)
such as citations between journal articles or links between tweets.
However, as illustrated in this chapter, detecting relationships be-
tween tweets and articles is much more challenging.
These issues can arise both due to the completeness of the
data source itself (e.g., ORCID currently covers only a subset of
researchers; therefore, the set of author–article relationships is lim-
ited) or due to the challenges of identification (e.g., in the Twitter
case above) or due to technical limitations at source (the difference
between the Twitter search API and the firehose). In addition, be-
cause the source data and the data services are both highly dynamic
and new, there is often a mismatch. Many services tracking Twitter
data only started collecting data relatively recently. There is a range
of primary and secondary data sources working to create more com-
plete data sets. However, once again it is important to treat all of
these data as sparse and limited as well as highly dynamic and
changeable.
2.9.3 Data availability and completeness
With these caveats in hand and the categorization discussed above,
we can develop a mapping of what data sources exist, what objects
those data sources inform us about, the completeness of those data
sources, and how well the relationships between the different data
sources are tracked. Broadly speaking, data sources concern them-
selves with either agents (mostly people) or objects (articles, books,
tweets, posts), while also providing data about
the relationships of the agents or objects that they describe with
other objects or agents.
The five broad types of data described above are often treated
as ways of categorizing the data source. They are more properly
thought of as relationships between objects, or between objects and
agents. Thus, for example, citations are relationships between ar-
ticles; the tweets that we are considering are actually relationships
between specific Twitter posts and articles; and “views” are an event
associating a reader (agent) with an article. The last case illustrates
that often we do not have detailed information on the relationship
but merely a count of them. Relationships between agents (such as
co-authorship or group membership) can also be important.
With this framing in hand, we can examine which types of re-
lationships we can obtain data on. We need to consider both the
quality of data available and the completeness of the data availabil-
ity. These metrics are necessarily subjective and any analysis will
be a personal view of a particular snapshot in time. Nevertheless,
some major trends are available.
We have growing and improving data on the relationships be-
tween a wide range of objects and agents and traditional scholarly
outputs. Although it is sparse and incomplete in many places,
nontraditional information on traditional outputs is becoming more
available and increasingly rich. By contrast, references from tradi-
tional outputs to nontraditional outputs are weaker and data that
allow us to understand the relationships between nontraditional
outputs are very sparse.
In the context of the current volume, a major weakness is our
inability to triangulate around people and communities. While it
may be possible to collect a set of co-authors from a bibliographic
data source and to identify a community of potential research users
on Twitter or Facebook, it is extremely challenging to connect these
different sets. If a community is discussing an article or book on
social media, it is almost impossible to ascertain whether the au-
thors (or, more generically, interested parties such as authors of
cited works or funders) are engaged in that conversation.
2.9.4 The value of sparse dynamic data
Two clear messages arise from our analysis. These new forms of
data are incomplete or sparse, both in quality and in coverage, and
they change. A data source that is poor today may be much im-
proved tomorrow. A query performed one minute may give different
results the next. This can be both a strength and a weakness: data
are up to the minute, giving a view of relationships as they form
(and break), but it makes ensuring consistency within analyses and
across analyses challenging. Compared to traditional surveys, these
data sources cannot be relied on as either representative samples
or to be stable.
A useful question to ask, therefore, is what kind of statements
these data can support. Questions like this will be necessarily differ-
ent from the questions that can be posed with high-quality survey
data. More often they provide an existence proof that something
has happened—but they cannot, conversely, show that it has not.
They enable some forms of comparison and determination of the
characteristics of activity in some cases.
Provide evidence that . . . Because much of the data that we have
is sparse, the absence of an indicator cannot reliably be taken to
mean an absence of activity. For example, a lack of Mendeley book-
marks may not mean that a paper is not being saved by researchers,
just that those who do save the article are not using Mendeley to
do it. Similarly, a lack of tweets about an article does not mean the
article is not being discussed. But we can use the data that do exist
to show that some activity is occurring. Here are some examples:
Provide evidence that relevant communities are aware of a spe-
cific paper. I identified the fact that a paper by Jewkes et
al. [191] was mentioned by crisis centers, sexual health orga-
nizations, and discrimination support groups in South Africa
when I was looking for University of Cape Town papers that
had South African Twitter activity using Altmetric.com.
Provide evidence that a relatively under-cited paper is having
a research impact. There is a certain kind of research article,
often a method description or a position paper, that is influ-
ential without being (apparently) heavily cited. For instance,
the PLoS One article by Shen et al. [341] has a respectable
14,000 views and 116 Mendeley bookmarks, but a relatively
(for the number of views) small number of WoS citations (19)
compared to, say, another article, by Leahy et al. [231] and
also in PLoS One, that is similar in age and number of views
but has many more citations.
Provide evidence of public interest in some topic. Many art-
icles at the top of lists ordered by views or social media men-
tions are of ephemeral (or prurient) interest—the usual trilogy
of sex, drugs, and rock and roll. However, if we dig a little
deeper, a wide range of articles surface, often not highly cited
but clearly of wider interest. For example, an article on Y-
chromosome distribution in Afghanistan [146] has high page
views and Facebook activity among papers with a Harvard af-
filiation but is not about sex, drugs, nor rock and roll. Unfor-
tunately, because this is Facebook data we cannot see who is
talking about it, which limits our ability to say which groups
are talking about it, which could be quite interesting.
Compare . . . Comparisons using social media or download stat-
istics need real care. As noted above, the data are sparse so it is
important that comparisons are fair. Also, comparisons need to be
on the basis of something that the data can actually tell you: for ex-
ample, “which article is discussed more by this online community,”
not “which article is discussed more.”
Compare the extent to which these articles are discussed by
this online patient group, or possibly specific online communi-
ties in general. Here the online communities might be a proxy
for a broader community, or there might be a specific interest
in knowing whether the dissemination strategy reaches this
community. It is clear that in the longer term social media will
be a substantial pathway for research to reach a wide range
of audiences, and understanding which communities are dis-
cussing what research will help us to optimize the communi-
cation.
Compare the readership of these articles in these countries.
One thing that most data sources are weak on at the moment
is demographics, but in principle the data are there. Are these
articles that deal with diseases of specific areas actually being
viewed by readers in those areas? If not, why not? Do they
have Internet access, could lay summaries improve dissemi-
nation, are they going to secondary online sources instead?
Compare the communities discussing these articles online. Is
most conversation driven by science communicators or by re-
searchers? Are policymakers, or those who influence them,
involved? What about practitioner communities? These com-
parisons require care, and simple counting rarely provides
useful information. But understanding which people within
which networks are driving conversations can give insight into
who is aware of the work and whether it is reaching target
audiences.
What flavor is it? Priem et al. [310] provide a thoughtful analysis
of the PLOS Article Level Metrics data set. They used principal
component analysis to define different “flavors of impact” based on
the way different combinations of signals seemed to point to different
kinds of interest. Many of the above use cases are variants on this
theme—what kind of article is this? Is it a policy piece, of public
interest? Is it of interest to a niche research community or does it
have wider public implications? Is it being used in education or in
health practice? And to what extent are these different kinds of use
independent from each other?
It is important to realize that these kinds of data are proxies of
things that we do not truly understand. They are signals of the flow
of information down paths that we have not mapped. To me this
is the most exciting possibility and one we are only just starting to
explore. What can these signals tell us about the underlying path-
ways down which information flows? How do different combinations
of signals tell us about who is using that information now, and how
they might be applying it in the future? Correlation analysis cannot
answer these questions, but more sophisticated approaches might.
And with that information in hand we could truly design schol-
arly communication systems to maximize their reach, value, and
efficiency.
2.10 Bringing it together: Tracking pathways
to impact
Collecting data on research outputs and their performance clearly
has significant promise. However, there are a series of substantial
challenges in how best to use these data. First, as we have seen,
they are sparse and patchy. Absence of evidence cannot be taken as
evidence of absence. But, perhaps more importantly, it is unclear
in many cases what these various proxies actually mean. Of course
this is also true of more familiar indicators like citations.
Finally, there is a challenge in how to effectively analyze these
data. The sparse nature of the data is a substantial problem in it-
self, but in addition there are a number of significantly confounding
effects. The biggest of these is time. The process of moving research
outputs and their use online is still proceeding, and the uptake and
penetration of online services and social media by researchers and
other relevant communities has increased rapidly over the past few
years and will continue to do so for some time.
These changes are occurring on a timescale of months, or even
weeks, so any analysis must take into account how those changes
may contribute to any observed signal. Much attention has focused
on how different quantitative proxies correlate with each other. In
essence this has continued the mistake that has already been made
with citations. Focusing on proxies themselves implicitly makes the
assumption that it is the proxy that matters, rather than the un-
derlying process that is actually of interest. Citations are irrelevant;
what matters is the influence that a piece of research has had. Cita-
tions are merely a proxy for a particular slice of influence, a (limited)
indicator of the underlying process in which a research output is
used by other researchers.
Of course, these are common challenges for many “big data”
situations. The challenge lies in using large, but disparate and
messy, data sets to provide insight while avoiding the false positives
that will arise from any attempt to mine data blindly for correlations.
Using the appropriate models and tools and careful validation of
findings against other sources of data are the way forward.
2.10.1 Network analysis approaches
One approach is to use these data to dissect and analyze the (visible)
network of relationships between agents and objects. This approach
can be useful in defining how networks of collaborators change over
time, who is in contact with whom, and how outputs are related to
each other. This kind of analysis has been productive with citation
graphs (see Eigenfactor for an example) as well as with small-scale
analysis of grant programs (see, for instance, the Lattes analysis of
the network grant program).
Network analysis techniques and visualization are covered in
Chapter 8 (on networks) and clustering and categorization in Chap-
ter 6 (on machine learning). Networks may be built up from any
combination of outputs, actors/agents, and their relationships to
each other. Analyses that may be particularly useful are those
searching for highly connected (proxy for influential) actors or out-
puts, clustering to define categories that emerge from the data itself
(as opposed to external categorization) and comparisons between
networks, both between those built from specific nodes (people,
outputs) and between networks that are built from data relating
to different time frames.
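As a minimal sketch of how such a network could be assembled from
the objects gathered in Section 2.8, the code below uses the networkx
library to link each article (identified here by its title) to the DOIs
that cite it and the Twitter users who tweeted about it. It assumes
the plos_alms list from the earlier examples is available.
# Build a simple graph from the earlier pyalm results, storing the type of
# each relationship as an edge attribute for later analysis.
import networkx as nx

G = nx.Graph()
for article in plos_alms:
    G.add_node(article.title, kind='article')
    for cite in article.sources['crossref'].events:
        G.add_edge(cite['event']['doi'], article.title, relation='cites')
    for tweet in article.sources['twitter'].events:
        G.add_edge(tweet['event']['user'], article.title, relation='tweeted')

# Highly connected nodes are one rough proxy for influence
degrees = sorted(dict(G.degree()).items(), key=lambda pair: pair[1],
                 reverse=True)
print(degrees[:10])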
Care is needed with such analyses to make sure that compar-
isons are valid. In particular, when doing analyses of different time
frames, it is important to compare any change in the network char-
acteristics that are due to general changes over time as opposed
to specific changes. As noted above, this is particularly important
with networks based on social media data, as any networks are
likely to have increased in size and diversity over the past few years
as more users interested in research have joined. It is important
to distinguish in these cases between changes relating to a spe-
cific intervention or treatment and those that are environmental.
As with any retrospective analysis, a good counterfactual sample is
required.
2.10.2 Future prospects and new data sources
As the broader process of research moves online we are likely to
have more and more information on what is being created, by whom,
and when. As access to these objects increases, both through pro-
vision of open access to published work and through increased data
sharing, it will become more and more feasible to mine the objects
themselves to enrich the metadata. And finally, as the use of unique
identifiers increases for both outputs and people, we will be able to
cross-reference across data sources much more strongly.
Much of the data currently being collected is of poor quality or
is inconsistently processed. Major efforts are underway to develop
standards and protocols for initial processing, particularly for page
view and usage data. Alongside efforts such as the Crossref DOI
Event Tracker Service to provide central clearing houses for data,
both consistency and completeness will continue to rise, making
new and more comprehensive forms of analysis feasible.
Perhaps the most interesting prospect is new data that arise as
more of the outputs and processes of research move online. As the
availability of data outputs, software products, and even potentially
the raw record of lab notebooks increases, we will have opportuni-
ties to query how (and how much) different reagents, techniques,
tools, and instruments are being used. As the process of policy de-
velopment and government becomes more transparent and better
connected, it will be possible to watch in real time as research has
its impact on the public sphere. And as health data moves online
there will be opportunities to see how both chemical and behavioral
interventions affect health outcomes in real time.
In the end all of this data will also be grist to the mill for further
research. For the first time we will have the opportunity to treat the
research enterprise as a system that is subject to optimization and
engineering. Once again the challenges of what it is we are seeking
to optimize for are questions that the data itself cannot answer, but
in turn the data can better help us to have the debate about what
matters.
2.11 Summary
The term research impact is difficult and politicized, and it is used
differently in different areas. At its root it can be described as the
change that a particular part of the research enterprise (e.g., re-
search project, researcher, funding decision, or institute) makes in
the world. In this sense, it maps well to standard approaches in
the social sciences that seek to identify how an intervention has led
to change.
The link between “impact” and the distribution of limited re-
search resources makes its definition highly political. In fact, there
are many forms of impact, and the different pathways to different
kinds of change, further research, economic growth, improvement
in health outcomes, greater engagement of citizenry, or environ-
mental change may have little in common beyond our interest in
how they can be traced to research outputs. Most public policy
on research investment has avoided the difficult question of which
impacts are most important.
In part this is due to the historical challenges of providing ev-
idence for these impacts. We have only had good data on for-
mal research outputs, primarily journal articles, and measures
have focused on naïve metrics such as productivity or citations, or
on qualitative peer review. Broader impacts have largely been evi-
denced through case studies, an expensive and nonscalable
approach.
The move of research processes online is providing much richer
and more diverse information on how research outputs are used
and disseminated. We have the prospect of collecting much more
information around the performance and usage of traditional re-
search outputs as well as greater data on the growing diversity of
nontraditional research outputs that are now being shared.
It is possible to gain quantitative information on the numbers of
people looking at research, different groups talking about research
(in different places), those citing research in different places, and
recommendations and opinions on the value of work. These data
are sparse and incomplete, and their use needs to acknowledge these
limitations, but it is nonetheless possible to gain new and valuable
insights from analysis.
Much of this data is available from web services in the form of ap-
plication programming interfaces. Well-designed APIs make it easy
to search for, gather, and integrate data from multiple sources. A
key aspect of successfully integrating data is the effective use and
application of unique identifiers across data sets that allow straight-
forward cross-referencing. Key among the identifiers currently be-
ing used are ORCIDs to uniquely identify researchers and DOIs,
from both Crossref and increasingly DataCite, to identify research
outputs. With good cross-referencing it is possible to obtain rich
data sets that can be used as inputs to many of the techniques
described elsewhere in the book.
The analysis of this new data is a nascent field and the quality of
work done so far has been limited. In my view there is a substantial
opportunity to use these rich and diverse data sets to treat the
underlying question of how research outputs flow from the academy
to their sites of use. What are the underlying processes that lead
to various impacts? This means treating these data sets as time
domain signals that can be used to map and identify the underlying
processes. This approach is appealing because it offers the promise
of probing the actual process of knowledge diffusion while making
fewer assumptions about what we think is happening.
2.12 Resources
We talked a great deal here about how to access publications and
other resources via their DOIs. Paskin [297] provides a nice sum-
mary of the problems that DOIs solve and how they work.
ORCIDs are another key piece of this puzzle, as we have seen
throughout this chapter. You might find some of the early articles
describing the need for unique author IDs useful, such as Bourne
et al. [46], as well as more recent descriptions [145]. More recent
initiatives on expanding the scope of identifiers to materials and
software have also been developed [24].
More general discussions of the challenges and opportunities
of using metrics in research assessment may be found in recent
reports such as the HEFCE Expert Group Report [405], and I have
covered some of the broader issues elsewhere [274].
There are many good introductions to web scraping using Beau-
tifulSoup and other libraries as well as API usage in general. Given
the pace at which APIs and Python libraries change, the best and
most up to date source of information is likely to be a web search.
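As a reminder of the basic pattern, a minimal sketch using requests
and BeautifulSoup might look like the following; the URL is purely
illustrative.
# A minimal web-scraping sketch: fetch a page and list its link targets.
# The URL is purely illustrative.
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.example.org/')
soup = BeautifulSoup(response.text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))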
In other settings, you may be concerned with assigning DOIs to
data that you generate yourself, so that you and others can easily
and reliably refer to and access that data in their own work. Here we
face an embarrassment of riches, with many systems available that
each meet different needs. Big data research communities such
as climate science [404], high-energy physics [304], and astron-
omy [367] operate their own specialized infrastructures that you
are unlikely to require. For small data sets, Figshare [122] and
DataCite [89] are often used. The Globus publication service [71]
permits an institution or community to build their own publication
system.
2.13 Acknowledgements and copyright
Section 2.3 is adapted in part from Neylon et al. [275], copyright In-
ternational Development Research Center, Canada, used here un-
der a Creative Commons Attribution v 4.0 License.
Section 2.9.4 is adapted in part from Neylon [273], copyright
PLOS, used here under a Creative Commons Attribution v 4.0
License.
Chapter 3
Record Linkage
Joshua Tokle and Stefan Bender
Big data differs from survey data in that it is typically necessary
to combine data from multiple sources to get a complete picture
of the activities of interest. Although computer scientists tend to
simply “mash” data sets together, social scientists are rightfully
concerned about issues of missing links, duplicative links, and
erroneous links. This chapter provides an overview of traditional
rule-based and probabilistic approaches, as well as the important
contribution of machine learning to record linkage.
3.1 Motivation
Big data offers social scientists great opportunities to bring together
many different types of data, from many different sources. Merg-
ing different data sets provides new ways of creating population
frames that are generated from the digital traces of human activity
rather than, say, tax records. These opportunities, however, cre-
ate different kinds of challenges from those posed by survey data.
Combining information from different sources about an individual,
business, or geographic entity means that the social scientist must
determine whether or not two entities on two different files are the
same. This determination is not easy. In the UMETRICS data, if
data are to be used to measure the impact of research grants, is
David A. Miller from Stanford, CA, the same as David Andrew Miller
from Fairhaven, NJ, in a list of inventors? Is Google the same as
Alphabet if the productivity and growth of R&D-intensive firms is
to be studied? Or, more generally, is individual A the same person
as the one who appears on a list of terrorists that has been com-
piled? Does the product that a customer is searching for match the
products that business B has for sale?
The consequences of poor record linkage decisions can be sub-
stantial. In the business arena, Christen reports that as much as
12% of business revenues are lost due to bad linkages [76]. In the
security arena, failure to match travelers to a “known terrorist” list
may result in those individuals entering the country, while over-
zealous matching could lead to large numbers of innocent citizens being
detained. In finance, incorrectly detecting a legitimate purchase as
a fraudulent one annoys the customer, but failing to identify a thief
will lead to credit card losses. Less dramatically, in the scientific
arena when studying patenting behavior, if it is decided that two in-
ventors are the same person, when in fact they are not, then records
will be incorrectly grouped together and one researcher’s productiv-
ity will be overstated. Conversely, if the records for one inventor are
believed to correspond to multiple individuals, then that inventor’s
productivity will be understated.
This chapter discusses current approaches to joining multiple
data sets together—commonly called record linkage. Other names
associated with record linkage are entity disambiguation, entity res-
olution, co-reference resolution, statistical matching, and data fu-
sion, meaning that records which are linked or co-referent can be
thought of as corresponding to the same underlying entity. The
number of names is reflective of a vast literature in social science,
statistics, computer science, and information sciences. We draw
heavily here on work by Winkler, Scheuren, and Christen, in par-
ticular [76, 77, 165]. To ground ideas, we use examples from a re-
cent paper examining the effects of different algorithms on studies
of patent productivity [387].
3.2 Introduction to record linkage
There are many reasons to link data sets. Linking to existing data
sources to solve a measurement need instead of implementing a new
survey results in cost savings (and almost certainly time savings as
well) and reduced burden on potential survey respondents. For
some research questions (e.g., a survey of the reasons for death of a
longitudinal cohort of individuals) a new survey may not be possible.
In the case of administrative data or other automatically generated
data, the sample size is much greater than would be possible from
a survey.
Record linkage can be used to compensate for data quality is-
sues. If a large number of observations for a particular field are
missing, it may be possible to link to another data source to fill
in the missing values. For example, survey respondents might not
want to share a sensitive datum like income. If the researcher has
access to an official administrative list with income data, then those
values can be used to supplement the survey [5].
Record linkage is often used to create new longitudinal data sets
by linking the same entities over time [190]. More generally, linking
separate data sources makes it possible to create a combined data
set that is richer in coverage and measurement than any of the
individual data sources [4].
Example: The Administrative Data Research Network
The UK’s Administrative Data Research Network (ADRN) is a major investment by
the United Kingdom to “improve our knowledge and understanding of the society
we live in . . . [and] provide a sound base for policymakers to decide how to tackle a
range of complex social, economic and environmental issues” by linking adminis-
trative data from a variety of sources, such as health agencies, court records, and
tax records in a confidential environment for approved researchers. The linkages
are done by trusted third-party providers. [103] (“Administrative data” typically
refers to data generated by the administration of a government program, as distinct
from deliberate survey collection.)
Linking is straightforward if each entity has a corresponding
unique identifier that appears in the data sets to be linked. For
example, two lists of US employees may both contain Social Secu-
rity numbers. When a unique identifier exists in the data or can be
created, no special techniques are necessary to join the data sets.
If there is no unique identifier available, then the task of identi-
fying unique entities is challenging. One instead relies on fields that
only partially identify the entity, like names, addresses, or dates of
birth. The problem is further complicated by poor data quality and
duplicate records, issues well attested in the record linkage litera-
ture [77] and sure to become more important in the context of big
data. Data quality issues include input errors (typos, misspellings,
truncation, extraneous letters, abbreviations, and missing values)
as well as differences in the way variables are coded between the
two data sets (age versus date of birth, for example). In addition
to record linkage algorithms, we will discuss different data prepro-
cessing steps that are necessary first steps for the best results in
record linkage.
To find all possible links between two data sets it would be neces-
sary to compare each record of the first data set with each record of
the second data set. The computational complexity of this approach
grows quadratically with the size of the data—an important consid-
eration, especially in the big data context. To compensate for this
complexity, the standard second step in record linkage, after pre-
processing, is indexing or blocking, which creates subsets of similar
records and reduces the total number of comparisons.
The outcome of the matching step is a set of predicted links—
record pairs that are likely to correspond to the same entity. After
these are produced, the final stage of the record linkage process is
to evaluate the result and estimate the resulting error rates. Unlike
other areas of application for predictive algorithms, ground truth
or gold standard data sets are rarely available. The only way to
create a reliable truth data set sometimes is through an expensive
clerical review process that may not be viable for a given application.
Instead, error rates must be estimated.
An input data set may contribute to the linked data in a variety of
ways, such as increasing coverage, expanding understanding of the
measurement or mismeasurement of underlying latent variables, or
adding new variables to the combined data set. It is therefore im-
portant to develop a well-specified reason for linking the data sets,
and to specify a loss function to proxy the cost of false negative
matches versus false positive matches that can be used to guide
match decisions. It is also important to understand the coverage of
the different data sets being linked because differences in coverage
may result in bias in the linked data. For example, consider the
problem of linking Twitter data to a sample-based survey—elderly
adults and very young children are unlikely to use Twitter and so
the set of records in the linked data set will have a youth bias, even
if the original sample was representative of the population. It is also
essential to engage in critical thinking about what latent variables
are being captured by the measures in the different data sets—an
“occupational classification” in a survey data set may be very dif-
ferent from a “job title” in an administrative record or a “current
position” in LinkedIn data. (This topic is discussed in
more detail in Chapter 10.)
Example: Employment and earnings outcomes of
doctoral recipients
A recent paper in Science matched UMETRICS data on doctoral recipients to Cen-
sus data on earnings and employment outcomes. The authors note that some
20% of the doctoral recipients are not matched for several reasons: (i) the recip-
ient does not have a job in the US, either for family reasons or because he/she
goes back to his/her home country; (ii) he/she starts up a business rather than
choosing employment; or (iii) it is not possible to uniquely match him/her to a
Census Bureau record. They correctly note that there may be biases introduced in
case (iii), because Asian names are more likely duplicated and harder to uniquely
match [415]. Improving the linkage algorithm would yield more accurate
estimates of the effects of investments in research.
Comparing the kinds of heterogeneous records associated with
big data is a new challenge for social scientists, who have tradition-
ally used a technique first developed in the 1960s to apply comput-
ers to the problem of medical record linkage. There is a reason why
this approach has survived: it has been highly successful in linking
survey data to administrative data, and efficient implementations
of this algorithm can be applied at the big data scale. However,
the approach is most effective when the two files being linked have
a number of fields in common. In the new landscape of big data,
there is a greater need to link files that have few fields in common
but whose noncommon fields provide additional predictive power to
determine which records should be linked. In some cases, when
sufficient training data can be produced, more modern machine
learning techniques may be applied.
The canonical record linkage workflow process is shown in Fig-
ure 3.1 for two data files, A and B.
Figure 3.1. The preprocessing pipeline: data files A and B are compared and the comparison outcomes are classified into links and nonlinks.
The goal is to identify all pairs of
records in the two data sets that correspond to the same underlying
individual. One approach is to compare all data units from file A
with all units in file B and classify all of the comparison outcomes
to decide whether or not the records match. In a perfect statistical
world the comparison would end with a clear determination of links
and nonlinks.
Alas, a perfect world does not exist, and there is likely to be
noise in the variables that are common to both data sets and that
will be the main identifiers for the record linkage. Although the
original files A and B are the starting point, the identifiers must be
preprocessed before they can be compared. Determining identifiers
for the linkage and deciding on the associated cleaning steps are
extremely important, as they result in a necessary reduction of the
possible search space.
In the next section we begin our overview of the record linkage
process with a discussion of the main steps in data preprocess-
ing. This is followed by a section on approaches to record linkage
that includes rule-based, probabilistic, and machine learning algo-
rithms. Next we cover classification and evaluation of links, and we
conclude with a discussion of data privacy in record linkage.
3.3 Preprocessing data for record linkage
As noted in the introductory chapter, all data work involves prepro-
cessing, and data that need to be linked is no exception. Preprocess-
ing refers to a workflow that transforms messy, noisy, and unstruc-
tured data into a well-defined, clearly structured, and quality-tested
data set. Elsewhere in this book, we discuss general strategies for
data preprocessing (the quality of data and preprocessing issues
are discussed in more detail in Section 1.4). In this section, we focus specifically on pre-
processing steps relating to the choice of input fields for the record
linkage algorithm. Preprocessing for any kind of a new data set is
a complex and time-consuming process because it is “hands-on”: it
requires judgment and cannot be effectively automated. It may be
tempting to minimize this demanding work under the assumption
that the record linkage algorithm will account for issues in the data,
but it is difficult to overstate the value of preprocessing for record
linkage quality. As Winkler notes: “In situations of reasonably
high-quality data, preprocessing can yield a greater improvement
in matching efficiency than string comparators and ‘optimized’ pa-
rameters. In some situations, 90% of the improvement in matching
efficiency may be due to preprocessing” [406].
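As a small, hedged illustration of the kind of preprocessing Winkler has in mind, the Python sketch below standardizes organization names before comparison; the abbreviation list is purely illustrative and would need to be built from inspection of the actual data.
import re

# Illustrative substitutions only; a real application needs a
# domain-specific list derived from the data at hand.
ABBREVIATIONS = {"CORPORATION": "CORP", "INCORPORATED": "INC", "LIMITED": "LTD"}

def standardize_name(name):
    """Uppercase, strip punctuation, collapse whitespace, and
    normalize common company-name abbreviations."""
    name = name.upper()
    name = re.sub(r"[^\w\s]", " ", name)      # drop punctuation
    name = re.sub(r"\s+", " ", name).strip()  # collapse whitespace
    words = [ABBREVIATIONS.get(w, w) for w in name.split()]
    return " ".join(words)

print(standardize_name("Sony Corporation"))  # SONY CORP
print(standardize_name("Sony Corp."))        # SONY CORP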
The first step in record linkage is to develop link keys, which are
the record fields that will be used to predict link status. These can
include common identifiers like first and last name. Survey and ad-
ministrative data sets may include a number of clearly identifying
variables like address, birth date, and sex. Other data sets, like
transaction records or social media data, often will not include ad-
dress or birth date but may still include other identifying fields like
occupation, a list of interests, or connections on a social network.
Consider this chapter’s illustrative example of the US Patent and
Trademark Office (USPTO) data [387]:
USPTO maintains an online database of all patents is-
sued in the United States. In addition to identifying in-
formation about the patent, the database contains each
patent’s list of inventors and assignees, the companies,
organizations, individuals, or government agencies to
which the patent is assigned. . . . However, inventors and
assignees in the USPTO database are not given unique
identification numbers, making it difficult to track inven-
tors and assignees across their patents or link their in-
formation to other data sources.
There are some basic precepts that are useful when considering
identifying fields. The more different values a field can take, the less
likely it is that two randomly chosen individuals in the population
will agree on those values. Therefore, fields that exhibit a wider
range of values are more powerful as link keys: names are much
better link keys than sex or year of birth.
Example: Link keys in practice
“A Harvard professor has re-identified the names of more than 40 percent of a
sample of anonymous participants in a high-profile DNA study, highlighting the
dangers that ever greater amounts of personal data available in the Internet era
could unravel personal secrets. . . . Of the 1,130 volunteers Sweeney and her
team reviewed, about 579 provided zip code, date of birth and gender, the three
key pieces of information she needs to identify anonymous people combined with
information from voter rolls or other public records. Of these, Sweeney succeeded
in naming 241, or 42 percent of the total. The Personal Genome Project confirmed
that 97 percent of the names matched those in its database if nicknames and first
name variations were included” [369].
Complex link keys like addresses can be broken down into com-
ponents so that the components can be compared independently of
one another. This way, errors due to data quality can be further
isolated. For example, assigning a single comparison value to the
complex fields “1600 Pennsylvania” and “160 Pennsylvania Ave” is
less informative than assigning separate comparison values to the
street number and street name portions of those fields. A record
linkage algorithm that uses the decomposed field can make more
nuanced distinctions by assigning different weights to errors in each
component.
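The sketch below is a simplified illustration of this idea rather than a production address parser: it splits a US-style street address into a street number and a street name so the two components can be compared separately.
import re

def split_address(address):
    """Split a simple US-style street address into
    (street_number, street_name); a missing part is returned as None."""
    match = re.match(r"\s*(\d+)?\s*(.*)", address)
    number = match.group(1)
    name = match.group(2).strip().upper() or None
    return number, name

print(split_address("1600 Pennsylvania"))     # ('1600', 'PENNSYLVANIA')
print(split_address("160 Pennsylvania Ave"))  # ('160', 'PENNSYLVANIA AVE')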
Sometimes a data set can include different variants of a field,
like legal first name and nickname. In these cases match rates
can be improved by including all variants of the field in the record
comparison. For example, if only the first list includes both vari-
ants, and the second list has a single “first name” field that could
be either a legal first name or a nickname, then match rates can be
improved by comparing both variants and then keeping the better
of the two comparison outcomes. It is important to remember, how-
ever, that some record linkage algorithms expect field comparisons
to be somewhat independent. In our example, using the outcome
from both comparisons as separate inputs into the probabilistic
model we describe below may result in a higher rate of false nega-
tives. If a record has the same value in the legal name and nickname
fields, and if that value happens to agree with the first name field
in the second file, then the agreement is being double-counted. By
the same token, if a person in the first list has a nickname that
differs significantly from their legal first name, then a comparison
of that record to the corresponding record will unfairly penalize the
outcome because at least one of those name comparisons will show
a low level of agreement.
Preprocessing serves two purposes in record linkage. First, it
corrects for the issues in data quality that we described above. Second,
it accounts for the different ways that the input files were generated,
which may result in the same underlying data being recorded on
different scales or according to different conventions.
Once preprocessing is finished, it is possible to start linking the
records in the different data sets. In the next section we describe a
technique to improve the efficiency of the matching step.
3.4 Indexing and blocking
There is a practical challenge to consider when comparing the records
in two files. If both files are roughly the same size, say 100 records
in the first and 100 records in the second file, then there are 10,000
possible comparisons, because the number of pairs is the product
of the number of records in each file. More generally, if the number
of records in each file is approximately n, then the total number of
possible record comparisons is approximately n². Assuming that
there are no duplicate records in the input files, the proportion of
record comparisons that correspond to a link is only 1/n. If we
naively proceed with all n² possible comparisons, the linkage algo-
rithm will spend the bulk of its time comparing records that are not
matches. Thus it is possible to speed up record linkage significantly
by skipping comparisons between record pairs that are not likely to
be linked.
Indexing refers to techniques that determine which of the poss-
ible comparisons will be made in a record linkage application. The
most used technique for indexing is blocking. In this approach you
construct a “blocking key” for each record by concatenating fields
or parts of fields. Two records with identical blocking keys are
said to be in the same block, and only records in the same block
are compared. This technique is effective because performing an
exact comparison of two blocking keys is a relatively quick operation
compared to a full record comparison, which may involve multiple
applications of a fuzzy string comparator.
Example: Blocking in practice
Given two lists of individuals, one might construct the blocking key by concate-
nating the first letter of the last name and the postal code, and then “block” on
that key. This reduces the total number of
comparisons by only comparing those individuals in the two files who live in the
same locality and whose last names begin with the same letter.
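A minimal sketch of this blocking strategy is shown below; the field names (last_name, postal_code) are illustrative, and records are assumed to be Python dictionaries.
from collections import defaultdict
from itertools import product

def blocking_key(record):
    # First letter of the last name plus postal code, as in the example above.
    return record["last_name"][:1].upper() + record["postal_code"]

def candidate_pairs(file_a, file_b):
    """Return only those record pairs from the two files that share
    a blocking key, instead of the full cross product."""
    blocks = defaultdict(lambda: ([], []))
    for rec in file_a:
        blocks[blocking_key(rec)][0].append(rec)
    for rec in file_b:
        blocks[blocking_key(rec)][1].append(rec)
    pairs = []
    for left, right in blocks.values():
        pairs.extend(product(left, right))
    return pairs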
There are important considerations when choosing the blocking
key. First, the choice of blocking key creates a potential bias in
the linked data because true matches that do not share the same
blocking key will not be found. In the example, the blocking strategy
could fail to match records for individuals whose last name changed
or who moved. Second, because blocking keys are compared ex-
actly, there is an implicit assumption that the included fields will
not have typos or other data entry errors. In practice, however, the
blocking fields will exhibit typos. If those typos are not uniformly
distributed over the population, then there is again the possibility
of bias in the linked data set. (This topic is discussed in
more detail in Chapter 10.) One simple strategy for dealing with
imperfect blocking keys is to implement multiple rounds of block-
ing and matching. After the first set of matches is produced, a new
blocking strategy is deployed to search for additional matches in the
remaining record pairs.
Blocking based on exact field agreements is common in practice,
but there are other approaches to indexing that attempt to be more
error tolerant. For example, one may use clustering algorithms to
identify sets of similar records. In this approach an index key, which
is analogous to the blocking key above, is generated for both data
sets and then the keys are combined into a single list. A distance
function must be chosen and pairwise distances computed for all
keys. The clustering algorithm is then applied to the combined list,
and only record pairs that are assigned to the same cluster are
compared. This is a theoretically appealing approach but it has the
drawback that the similarity metric has to be computed for all pairs
of records. Even so, computing the similarity measure for a pair of
blocking keys is likely to be cheaper than computing the full record
comparison, so there is still a gain in efficiency. Whang et al. [397]
provide a nice review of indexing approaches.
In addition to reducing the computational burden of record link-
age, indexing plays an important secondary role. Once implemented,
the fraction of comparisons made that correspond to true links will
be significantly higher. For some record linkage approaches that
use an algorithm to find optimal parameters—like the probabilis-
tic approach—having a larger ratio of matches to nonmatches will
produce a better result.
3.5 Matching
The purpose of a record linkage algorithm is to examine pairs of
records and make a prediction as to whether they correspond to the
same underlying entity. (There are some sophisticated algorithms
that examine sets of more than two records at a time [359], but
pairwise comparison remains the standard approach.) At the core
of every record linkage algorithm is a function that compares two
records and outputs a “score” that quantifies the similarity between
those records. Mathematically, the match score is a function of the
output from individual field comparisons: agreement in the first
name field, agreement in the last name field, etc. Field comparisons
may be binary—indicating agreement or disagreement—or they may
output a range of values indicating different levels of agreement.
There are a variety of methods in the statistical and computer sci-
ence literature that can be used to generate a match score, includ-
ing nearest-neighbor matching, regression-based matching, and
propensity score matching. The probabilistic approach to record
linkage defines the match score in terms of a likelihood ratio [118].
Example: Matching in practice
Long strings, such as assignee and inventor names, are susceptible to typograph-
ical errors and name variations. For example, none of Sony Corporation, Sony
Corporatoin and Sony Corp. will match using simple exact matching. Similarly,
David vs. Dave would not match [387].
Comparing fields whose values are continuous is straightforward:
often one can simply take the absolute difference as the comparison
value. Comparing character fields in a rigorous way is more com-
plicated. For this purpose, different mathematical definitions of
the distance between two character fields have been defined. Edit
distance, for example, is defined as the minimum number of edit
operations—chosen from a set of allowed operations—needed to con-
vert one string to another. When the set of allowed edit operations is
single-character insertions, deletions, and substitutions, the corre-
sponding edit distance is also known as the Levenshtein distance.
When transposition of adjacent characters is allowed in addition
to those operations, the corresponding edit distance is called the
Levenshtein–Damerau distance.
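For concreteness, the following short function computes the Levenshtein distance by dynamic programming; in practice one would usually rely on an existing string-comparison library rather than writing this by hand.
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s into t."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("Sony Corporation", "Sony Corporatoin"))  # 2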
Edit distance is appealing because of its intuitive definition, but
it is not the most efficient string distance to compute. Another
standard string distance known as Jaro–Winkler distance was de-
veloped with record linkage applications in mind and is faster to
compute. This is an important consideration because in a typical
record linkage application most of the algorithm run time will be
spent performing field comparisons. The definition of Jaro–Winkler
distance is less intuitive than edit distance, but it works as ex-
pected: words with more characters in common will have a higher
Jaro–Winkler value than those with fewer characters in common.
The output value is normalized to fall between 0 and 1. Because
of its history in record linkage applications, there are some stan-
dard variants of Jaro–Winkler distance that may be implemented in
record linkage software. Some variants boost the weight given to
agreement in the first few characters of the strings being compared.
Others decrease the score penalty for letter substitutions that arise
from common typos.
Once the field comparisons are computed, they must be com-
bined to produce a final prediction of match status. In the following
sections we describe three types of record linkage algorithms: rule-
based, probabilistic, and machine learning.
3.5.1 Rule-based approaches
A natural starting place is for a data expert to create a set of ad hoc
rules that determine which pairs of records should be linked. In the
classical record linkage setting where the two files have a number
of identifying fields in common, this is not the optimal approach.
However, if there are few fields in common but each file contains
auxiliary fields that may inform a linkage decision, then an ad hoc
approach may be appropriate.
Example: Linking in practice
Consider the problem of linking two lists of individuals where both lists contain
first name, last name, and year of birth. Here is one possible linkage rule: link all
pairs of records such that
• the Jaro–Winkler comparison of first names is greater than 0.9,
• the Jaro–Winkler comparison of last names is greater than 0.9, and
• the first three digits of the year of birth are the same.
The result will depend on the rate of data errors in the year of birth field and typos
in the name fields.
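A sketch of this rule in code is given below; it assumes that a string-similarity function jaro_winkler(a, b) returning a value between 0 and 1 is available (for example, from a string-comparison library), and that records are dictionaries with hypothetical first_name, last_name, and birth_year fields.
def rule_based_link(rec_a, rec_b, jaro_winkler):
    """Apply the ad hoc rule from the example: link two records when
    both name comparisons exceed 0.9 and the birth decades agree."""
    return (jaro_winkler(rec_a["first_name"], rec_b["first_name"]) > 0.9
            and jaro_winkler(rec_a["last_name"], rec_b["last_name"]) > 0.9
            and str(rec_a["birth_year"])[:3] == str(rec_b["birth_year"])[:3])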
By auxiliary field we mean data fields that do not appear on both
data sets, but which may nonetheless provide information about
whether records should be linked. Consider a situation in which the
first list includes an occupation field and the second list includes
educational history. In that case one might create additional rules
to eliminate matches where the education was deemed to be an
unlikely fit for the occupation.
This method may be attractive if it produces a reasonable-looking
set of links from intuitive rules, but there are several pitfalls. As
the number of rules grows it becomes harder to understand the
ways that the different rules interact to produce the final set of
links. There is no notion of a threshold that can be increased or
decreased depending on the tolerance for false positive and false
negative errors. The rules themselves are not chosen to satisfy any
kind of optimality, unlike the probabilistic and machine learning
methods. Instead, they reflect the practitioner’s domain knowledge
about the data sets.
3.5.2 Probabilistic record linkage
In this section we describe the probabilistic approach to record
linkage, also known as the Fellegi–Sunter algorithm [118]. This
approach dominates in traditional record linkage applications and
remains an effective and efficient way to solve the record linkage
problem today.
In this section we give a somewhat formal definition of the statis-
tical model underlying the algorithm. By understanding this model,
one is better equipped to define link keys and record comparisons
in an optimal way.
Example: Usefulness of probabilistic record linkage
In practice, it is typically the case that a researcher will want to combine two or
more data sets containing records for the same individuals or units that possi-
bly come from different sources. Unless the sources all contain the same unique
identifiers, linkage will likely require matching on standardized text strings. Even
standardized data are likely to contain small differences that preclude exact match-
ing as in the matching example above. The Census Bureau’s Longitudinal Busi-
ness Database (LBD) links establishment records from administrative and survey
sources. Exact numeric identifiers do most of the heavy lifting, but mergers, ac-
quisitions, and other actions can break these linkages. Probabilistic record linkage
on company names and/or addresses is used to fix these broken linkages that bias
statistics on business dynamics [190].
Let A and B be two lists of individuals whom we wish to link.
The product set A × B contains all possible pairs of records where
the first element of the pair comes from A and the second element
of the pair comes from B. A fraction of these pairs will be matches,
meaning that both records in the pair represent the same underlying
individual, but the vast majority of them will be nonmatches. In
other words, A × B is the disjoint union of the set of matches M
and the set of nonmatches U, a fact that we denote formally by
A × B = M ∪ U.
Let γ be a vector-valued function on A × B such that, for a ∈ A and
b ∈ B, γ(a, b) represents the outcome of a set of field comparisons
between a and b. For example, if both A and B contain data on
individuals’ first names, last names, and cities of residence, then γ
could be a vector of three binary values representing agreement in
first name, last name, and city. In that case γ(a, b) = (1, 1, 0) would
mean that the records a and b agree on first name and last name,
but disagree on city of residence.
For this model, the comparison outcomes in γ(a, b) are not re-
quired to be binary, but they do have to be categorical: each com-
ponent of γ(a, b) should take only finitely many values. This means
that a continuous comparison outcome—such as output from the
Jaro–Winkler string comparator—has to be converted to an ordinal
value representing levels of agreement. For example, one might cre-
ate a three-level comparison, using one level for exact agreement,
one level for approximate agreement defined as a Jaro–Winkler score
greater than 0.85, and one level for nonagreement corresponding to
a Jaro–Winkler score less than 0.85.
If a variable being used in the comparison has a significant num-
ber of missing values, it can help to create a comparison outcome
level to indicate missingness. Consider two data sets that both have
middle initial fields, and suppose that in one of the data sets the
middle initial is filled in only about half of the time. When compar-
ing records, the case where both middle initials are filled in but are
not the same should be treated differently from the case where one
of the middle initials is blank, because the first case provides more
evidence that the records do not correspond to the same person.
We handle this in the model by defining a three-level comparison
for the middle initial, with levels to indicate “equal,” “not equal,”
and “missing.”
Probabilistic record linkage works by weighing the probability of
seeing the result γ(a, b) if (a, b) belongs to the set of matches M
against the probability of seeing the result if (a, b) belongs to the
set of nonmatches U. Conditional on M or U, the distributions of the
individual comparisons defined by γ are assumed to be mutually
independent. The parameters that define the marginal distributions
of γ|M are called m-weights, and similarly the marginal distributions
of γ|U are called u-weights.
In order to apply the Fellegi–Sunter method, it is necessary to
choose values for these parameters, m-weights and u-weights. With
labeled data—a pair of lists for which the match status is known—
it is straightforward to solve for optimal values. Training data are
not usually available, however, and the typical approach is to use
expectation maximization to find optimal values.
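The sketch below illustrates how, once m- and u-probabilities are in hand, the Fellegi–Sunter match score for a record pair can be computed as a sum of log likelihood ratios over binary field comparisons; the numerical values are made up purely for illustration.
import math

# Illustrative m- and u-probabilities for three binary comparisons:
# probability of agreement given a match (m) and given a nonmatch (u).
m_probs = {"first_name": 0.95, "last_name": 0.97, "city": 0.90}
u_probs = {"first_name": 0.01, "last_name": 0.005, "city": 0.10}

def match_score(gamma):
    """gamma maps each field to 1 (agree) or 0 (disagree).
    Returns the sum of log2 likelihood ratios (the match weight)."""
    score = 0.0
    for field, agree in gamma.items():
        m, u = m_probs[field], u_probs[field]
        if agree:
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score

print(match_score({"first_name": 1, "last_name": 1, "city": 0}))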
We have noted that the primary motivation for record linkage is to
create a linked data set for analysis that will have a richer set of fields
than either of the input data sets alone. A natural application is to
perform a linear regression using a combination of variables from
both files as predictors. With all record linkage approaches it is a
challenge to understand how errors from the linkage process will
manifest in the regression. Probabilistic record linkage has an ad-
vantage over rule-based and machine learning approaches in that
there are theoretical results concerning coefficient bias and errors
[221, 329]. More recently, Chipperfield and Chambers have devel-
oped an approach based on the bootstrap to account for record link-
age errors when making inferences for cross-tabulated variables [75].
3.5.3 Machine learning approaches to linking
Computer scientists have contributed extensively in parallel litera-
ture focused on linking large data sets [76]. Their focus is on iden-
tifying potential links using approaches that are fast and scalable,
and approaches are developed based on work in network algorithms
and machine learning.
While simple blocking as described in Section 3.4 is standard in
Fellegi–Sunter applications, computer scientists are likely to use the
more sophisticated clustering approach to indexing. Indexing may
also use network information to include, for example, records for in-
dividuals that have a similar place in a social graph. When linking
lists of researchers, one might specify that comparisons should be
made between records that share the same address, have patents
in the same patent class, or have overlapping sets of coinventors.
These approaches are known as semantic blocking, and the com-
putational requirements are similar to standard blocking [76].
In recent years machine learning approaches have been applied
to record linkage following their success in other areas of prediction
and classification. (This topic is discussed in more detail in Chapter 6.)
Computer scientists couch the analytical prob-
lem as one of entity resolution, even though the conceptual problem
is identical. As Wick et al. [400] note:
Entity resolution, the task of automatically determining
which mentions refer to the same real-world entity, is a
crucial aspect of knowledge base construction and man-
agement. However, performing entity resolution at large
scales is challenging because (1) the inference algorithms
must cope with unavoidable system scalability issues and
(2) the search space grows exponentially in the number
of mentions. Current conventional wisdom declares that
performing coreference at these scales requires decom-
posing the problem by first solving the simpler task of
entity-linking (matching a set of mentions to a known set
of KB entities), and then performing entity discovery as a
post-processing step (to identify new entities not present
in the KB). However, we argue that this traditional ap-
proach is harmful to both entity-linking and overall coref-
erence accuracy. Therefore, we embrace the challenge of
jointly modeling entity-linking and entity discovery as a
single entity resolution problem.
Figure 3.2 provides a useful comparison between classical record
linkage and learning-based approaches. In machine learning there
is a predictive model and an algorithm for “learning” the optimal
set of parameters to use in the predictive algorithm. The learning
algorithm relies on a training data set. In record linkage, this would
be a curated data set with true and false matches labeled as such.
See [387] for an example and a discussion of how a training data
set was created for the problem of disambiguating inventors in the
USPTO database. Once optimal parameters are computed from the
training data, the predictive model can be applied to unlabeled data
to find new links. The quality of the training data set is critical; the
model is only as good as the data it is trained on.
An example of a machine learning model that is popular for
record linkage is the random forest model [50] (see Chapter 6). This is a classi-
fication model that fits a large number of classification trees to a
labeled training data set. Each individual tree is trained on a boot-
strap sample of all labeled cases using a random subset of predictor
variables. After creating the classification trees, new cases are la-
beled by giving each tree a vote and keeping the label that receives
the most votes. This highly randomized approach corrects for a
problem with simple classification trees, which is that they may
overfit to training data.
As shown in Figure 3.2, a major difference between probabilistic
and machine learning approaches is the need for labeled training
data to implement the latter approach. Usually training data are
created through a painstaking process of clerical review. After an
initial round of record linkage, a sample of record pairs that are
not clearly matches or nonmatches is given to a research assistant
who makes the final determination. In some cases it is possible to
create training data by automated means. For example, when there
is a subset of the complete data that contains strongly identifying
fields. Suppose that both of the candidate lists contain name and
date of birth fields and that in the first list the date of birth data are
complete, but in the second list only about 10% of records contain
date of birth.
Figure 3.2. Probabilistic (left) vs. machine learning (right) approaches to linking: blocking, similarity computation, and a threshold-based match decision on the left; blocking, training data selection, model generation, and model application on the right. Source: Köpcke et al. [213]
For reasonably sized lists, name and date of birth
together will be a nearly unique identifier. It is then possible to
perform probabilistic record linkage on the subset of records with
date of birth and be confident that the error rates would be small.
If the subset of records with date of birth is representative of the
complete data set, then the output from the probabilistic record
linkage can be used as “truth” data.
Given a quality training data set, machine learning approaches
may have advantages over probabilistic record linkage. Consider
the random forest model. Random forests are more robust to corre-
lated predictor variables, because only a random subset of predic-
tors is included in any individual classification tree. The conditional
independence assumption, to which we alluded in our discussion
of the probabilistic model, can be dropped. An estimate of the gen-
eralization error can be computed in the form of “out-of-bag error.”
A measure of variable importance is computed that gives an idea of
how powerful a particular field comparison is in terms of correctly
predicting link status. Finally, unlike the Fellegi–Sunter model,
predictor variables can be continuous.
The combination of being robust to correlated variables and pro-
viding a variable importance measure makes random forests a use-
ful diagnostic tool for record linkage models. It is possible to refine
the record linkage model iteratively, by first including many pre-
dictor variables, including variants of the same comparison, and
then using the variable importance measure to narrow down the
predictors to a parsimonious set.
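A hedged sketch of this workflow using scikit-learn is shown below; X is assumed to be a matrix of field-comparison outcomes for labeled record pairs and y the corresponding true match status, and the data here are random placeholders rather than real comparisons.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder training data: each row holds the comparison outcomes
# (e.g., Jaro-Winkler scores, agreement indicators) for one record pair,
# and y gives the clerically reviewed match status (1 = match).
rng = np.random.default_rng(0)
X = rng.random((1000, 4))
y = (X[:, 0] > 0.9).astype(int)  # toy labels for illustration only

model = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
model.fit(X, y)

print("Out-of-bag accuracy:", model.oob_score_)
print("Variable importances:", model.feature_importances_)

# Predicted match probabilities for new, unlabeled record pairs.
new_pairs = rng.random((5, 4))
print(model.predict_proba(new_pairs)[:, 1])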
There are many published studies on the effectiveness of random
forests and other machine learning algorithms for record linkage.
Christen and Ahmed et al. provide some pointers [77, 108].
3.5.4 Disambiguating networks
The problem of disambiguating entities in a network is closely re-
lated to record linkage: in both cases the goal is to consolidate mul-
tiple records corresponding to the same entity. Rather than finding
the same entity in two data sets, however, the goal in network dis-
ambiguation is to consolidate duplicate records in a network data
set. By network we mean that the data set contains not only typical
record fields like names and addresses but also information about
how entities relate to one another: entities may be coauthors, coin-
ventors, or simply friends in a social network.
The record linkage techniques that we have described in this
chapter can be applied to disambiguate a network. To do so, one
must convert the network to a form that can be used as input into
a record linkage algorithm. For example, when disambiguating a
social network one might define a field comparison whose output
gives the fraction of friends in common between two records. Ven-
tura et al. demonstrated the relative effectiveness of the probabilis-
tic method and machine learning approaches to disambiguating a
database of inventors in the USPTO database [387]. Another ap-
proach is to apply clustering algorithms from the computer science
literature to identify groups of records that are likely to refer to the
same entity. Huang et al. [172] have developed a successful method
based on an efficient computation of distance between individuals
in the network. These distances are then fed into the DBSCAN
clustering algorithm to identify unique entities.
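As a rough sketch of that last step (not Huang et al.'s specific method), scikit-learn's DBSCAN implementation can be run directly on a precomputed distance matrix; the matrix D below is a placeholder for whatever network-based distance has been computed between records, and eps and min_samples would need tuning to the application.
import numpy as np
from sklearn.cluster import DBSCAN

# D is a placeholder: a symmetric matrix of pairwise distances between
# records, e.g., derived from coauthorship or coinventor networks.
D = np.array([[0.0, 0.1, 0.9, 0.8],
              [0.1, 0.0, 0.85, 0.9],
              [0.9, 0.85, 0.0, 0.2],
              [0.8, 0.9, 0.2, 0.0]])

clustering = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit(D)
# Records assigned the same cluster label are treated as one entity.
print(clustering.labels_)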
3.6 Classification
Once the match score for a pair of records has been computed us-
ing the probabilistic or random forest method, a decision has to be
made whether the pair should be linked. This requires classifying
the pair as either a “true” or a “false” match. In most cases, a third
classification is required—sending for manual review and classifica-
tion.
3.6.1 Thresholds
In the probabilistic and random forest approaches, both of which
output a “match score” value, a classification is made by establish-
ing a threshold T such that all records with a match score greater
than T are declared to be links. Because of the way these algorithms
are defined, the match scores are not meaningful by themselves and
the threshold used for one linkage application may not be appro-
priate for another application. Instead, the classification threshold
must be established by reviewing the model output.
Typically one creates an output file that includes pairs of records
that were compared along with the match score. The file is sorted
by match score and the reviewer begins to scan the file from the
highest match scores to the lowest. For the highest match scores the
record pairs will agree on all fields and there is usually no question
about the records being linked. However, as the scores decrease the
reviewer will see more record pairs whose match status is unclear
(or that are clearly nonmatches) mixed in with the clear matches.
There are a number of ways to proceed, depending on the resources
available and the goal of the project.
Rather than set a single threshold, the reviewer may set two
thresholds T1 > T2. Record pairs with a match score greater than
T1 are marked as matches and removed from further consideration.
The set of record pairs with a match score between T1 and T2 is be-
lieved to contain significant numbers of matches and nonmatches.
These are sent to clerical review, meaning that research assistants
will make a final determination of match status. The final set of
links will include clear matches with a score greater than T1 as well
as the record pairs that pass clerical review. If the resources are
available for this approach and the initial threshold T1 is set suf-
ficiently high, then the resulting data set will contain a minimal
number of false positive links. The collection of record pairs with
match scores between T1 and T2 is sometimes referred to as the
clerical review region.
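A minimal sketch of this two-threshold classification is shown below; scored_pairs is assumed to be a list of (record_a, record_b, match_score) tuples, and the threshold values are arbitrary placeholders.
def classify(scored_pairs, t1=0.9, t2=0.6):
    """Partition scored record pairs into links, a clerical review
    region, and nonlinks using two thresholds t1 > t2."""
    links, review, nonlinks = [], [], []
    for a, b, score in scored_pairs:
        if score > t1:
            links.append((a, b))
        elif score > t2:
            review.append((a, b))
        else:
            nonlinks.append((a, b))
    return links, review, nonlinks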
The clerical review region generally contains many more pairs
than the set of clear matches, and it can be expensive and time-
consuming to review each pair. Therefore, a second approach is to
establish a tentative threshold T and send only a sample of record
pairs with scores in a neighborhood of T to clerical review. This
results in data on the relative numbers of true matches and true
nonmatches at different score levels, as well as the characteristics
of record pairs that appear at a given level. Based on the review
and the relative tolerance for false positive errors and false negative
errors, a final threshold T is set such that pairs with a score greater
than T are considered to be matches.
After viewing the results of the clerical review, it may be deter-
mined that the parameters to the record linkage algorithm could be
improved to create a clearer delineation between matches and non-
matches. For example, a research assistant may determine that
many potential false positives appear near the tentative threshold
because the current set of record linkage parameters is giving too
much weight to agreement in first name. In this case the reviewer
may decide to update the record linkage model to produce an im-
proved set of match scores. The update may consist in an ad hoc
adjustment of parameters, or the result of the clerical review may
be used as training data and the parameter-fitting algorithm may
be run again. An iterative approach like this is common when first
linking two data sets because the clerical review process can im-
prove one’s understanding of the data sets involved.
Setting the threshold value higher will reduce the number of false
positives (record pairs for which a link is incorrectly predicted) while
increasing the number of false negatives (record pairs that should
be linked but for which a link is not predicted). The proper tradeoff
between false positive and false negative error rates will depend
on the particular application and the associated loss function, but
there are some general concerns to keep in mind. Both types of
errors create bias, which can impact the generalizability of analyses
conducted on the linked data set. Consider a simple regression
on the linked data that includes fields from both data sets. If the
threshold is too high, then the linked data will be biased toward
records with no data entry errors or missing values, and whose
fields did not change over time. This set of records may not be
representative of the population as a whole. If a low threshold is
used, then the set of linked records will contain more pairs that
are not true links and the variables measured in those records are
independent of each other. Including these records in a regression
amounts to adding statistical noise to the data.
3.6.2 One-to-one links
In the probabilistic and machine learning approaches to record link-
age that we have described, each record pair is compared and a link
is predicted independently of all other record pairs. Because of the
independence of comparisons, one record in the first file may be
predicted to link to multiple records in the second file. Under the
assumption that each input file has been deduplicated, at most one
of these predictions can correspond to a true link. For many ap-
plications it is preferable to extract a set of “best” links with the
property that each record in one file links to at most one record
in the second file. A set of links with this property is said to be
one-to-one.
One possible definition of “best” is a set of one-to-one links such
that the sum of the match scores of all included links is maximal.
This is an example of what is known as the assignment problem in
combinatorial optimization. In the linear case above, where we care
about the sum of match scores, the problem can be solved exactly
using the Hungarian algorithm [216].
(This topic is discussed in more detail in Chapter 6.)
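A sketch using SciPy's implementation of the assignment problem is shown below; the score matrix is a placeholder, and because linear_sum_assignment minimizes total cost, the match scores are negated to obtain the one-to-one assignment with maximal total score. In practice one would also drop assigned pairs whose scores fall below the classification threshold.
import numpy as np
from scipy.optimize import linear_sum_assignment

# Placeholder match scores: rows index records in file A,
# columns index records in file B.
scores = np.array([[0.95, 0.10, 0.20],
                   [0.15, 0.80, 0.30],
                   [0.05, 0.40, 0.90]])

rows, cols = linear_sum_assignment(-scores)  # negate to maximize total score
for i, j in zip(rows, cols):
    print("link A[%d] -- B[%d] with score %.2f" % (i, j, scores[i, j]))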
3.7 Record linkage and data protection
In many social science applications, data sets do not need to
include identifying fields like names and addresses. These
fields may be left out intentionally out of concern for privacy (see
Chapter 11), or they may simply be irrelevant to the research question. For record
linkage, however, names and addresses are among the best possible
identifiers. We describe two approaches to the problem of balancing
needs for both effective record linkage and privacy.
The first approach is to establish a trusted third party or safe
center. The concept of trusted third parties (TTPs) is well known
in cryptography. In the case of record linkage, a third party takes
a place between the data owners and the data users, and it is this
third party that actually performs the linkage work. Both the data
owners and data users trust the third party in the sense that it
assumes responsibility for data protection (data owners) and data
competence (data users) at the same time. No party other than the
TTP learns about the private data of the other parties. After record
linkage only the linked records are revealed, with no identifiers at-
tached. The TTP ensures that the released linked data set cannot
be relinked to any of the source data sets. Possible third parties
are safe centers, which are operated by lawyers, or official trusted
institutions like the US Census Bureau. Some countries like the
UK and Germany are establishing new institutions specifically to
act as TTPs for record linkage work.
The second approach is known as privacy-preserving record link-
age. The goal of this approach is to find the same individual in sep-
arate data files without revealing the identity of the individual [80].
In privacy-preserving record linkage, cryptographic procedures are
used to encrypt or hash identifiers before they are shared for record
linkage. Many of these procedures require exact matching of the
identifiers, however, and do not tolerate any errors in the original
identifiers. This leads to information loss because it is not possible
to account for typos or other small variations in hashed fields. To
account for this, Schnell has developed a method to calculate string
similarity of encrypted fields using Bloom filters [330, 332].
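As a simple illustration of the exact-matching flavor of this approach (not Schnell's Bloom filter method), the sketch below applies a keyed hash to standardized identifying fields before they are shared; the shared secret key is a placeholder, and as the text notes, any typo in the inputs produces a completely different hash.
import hashlib
import hmac

SECRET_KEY = b"shared-secret-agreed-by-data-owners"  # placeholder

def hashed_identifier(name, date_of_birth):
    """Keyed hash of standardized identifying fields, so that linkage
    can be performed on hashes rather than on the names themselves."""
    standardized = (name.strip().upper() + "|" + date_of_birth).encode("utf-8")
    return hmac.new(SECRET_KEY, standardized, hashlib.sha256).hexdigest()

print(hashed_identifier("Jane Doe", "1980-01-01"))
print(hashed_identifier("Jane  Doe", "1980-01-01"))  # extra space: different hash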
In many countries these approaches are combined. For exam-
ple, when the UK established the ADRN, the network adopted the
trusted third party model, and the third party is provided with
data in which identifying fields have been hashed. This solves the
challenge of trust between the different parties. Some authors ar-
gue that transparency of data use and informed consent will help
to build trust; in the context of big data this is more challenging.
(This topic is discussed in more detail in Chapter 11.)
3.8 Summary
Accurate record linkage is critical to creating high-quality data sets
for analysis. However, outside of a few small centers for record
linkage research, linking data sets historically relied on artisan ap-
proaches, particularly for parsing and cleaning data sets. As the
creation and use of big data increases, so does the need for system-
atic record linkage. The history of record linkage is long by computer
science standards, but new data challenges encourage the develop-
ment of new approaches like machine learning methods, clustering
algorithms, and privacy-preserving record linkage.
Record linkage stands on the boundary between statistics, in-
formation technology, and privacy. We are confident that there will
continue to be exciting developments in this field in the years to
come.
3.9 Resources
Out of many excellent resources on the subject, we note the follow-
ing:
• We strongly recommend Christen’s book [76].
• There is a wealth of information available on the ADRN website [103].
• Winkler has a series of high-quality survey articles [407].
• The German Record Linkage Center is a resource for research, software, and ongoing conference activities [331].
Chapter 4
Databases
Ian Foster and Pascal Heus
Once the data have been collected and linked into different files, it
is necessary to store and organize them. Social scientists are used
to working with one analytical file, often in SAS, Stata, SPSS, or
R. This chapter, which may be the most important chapter in the
book, describes different approaches to storing data in ways that
permit rapid and reliable exploration and analysis.
4.1 Introduction
We turn now to the question of how to store, organize, and manage
the data used in data-intensive social science. As the data with
which you work grow in volume and diversity, effective data man-
agement becomes increasingly important if you are to avoid issues
of scale and complexity from overwhelming your research processes.
In particular, when you deal with data that get frequently updated,
with changes made by different people, you will frequently want to
use database management systems (DBMSs) instead of maintaining
data in single files or within siloed statistical packages such as SAS,
SPSS, Stata, and R. Indeed, we go so far as to say: if you take away
just one thing from this book, it should be this: Use a database!
As we explain in this chapter, DBMSs provide an environment
that greatly simplifies data management and manipulation. They
require a little bit of effort to set up, but are worth it. They permit
large amounts of data to be organized in multiple ways that allow
for efficient and rapid exploration via powerful declarative query
languages; durable and reliable storage, via transactional features
that maintain data consistency; scaling to large data sizes; and in-
tuitive analysis, both within the DBMS itself and via bridges to other
data analysis packages and tools when specialized analyses are re-
quired. DBMSs have become a critical component of a great variety
of applications, from handling transactions in financial systems to
delivering data as a service to power websites, dashboards, and ap-
plications. If you are using a production-level enterprise system,
chances are there is a database in the back end. They are multi-
purpose and well suited for organizing social science data and for
supporting analytics for data exploration.
DBMSs make many easy things trivial, and many hard things
easy. They are easy to use but can appear daunting to those un-
familiar with their concepts and workings. A basic understanding
of databases and of when and how to use DBMSs is an important
element of the social data scientist’s knowledge base. We therefore
provide in this chapter an introduction to databases and how to
use them. We describe different types of databases and their var-
ious features, and how different types can be applied in different
contexts. We describe basic features like how to get started, set
up a database schema, ingest data, query data within a database,
and get results out. We also discuss how to link from databases
to other tools, such as Python, R, and Stata (if you really have
to). Chapter 5 describes how to apply parallel computing methods
when needed.
4.2 DBMS: When and why
Consider the following three data sets:
1. 10,000 records describing research grants, each specifying the
principal investigator, institution, research area, proposal ti-
tle, award date, and funding amount in comma-separated-
value (CSV) format.
2. 10 million records in a variety of formats from funding agen-
cies, web APIs, and institutional sources describing people,
grants, funding agencies, and patents.
3. 10 billion Twitter messages and associated metadata—around
10 terabytes (10¹³ bytes) in total, and increasing at a terabyte
a month.
Which tools should you use to manage and analyze these data sets?
The answer depends on the specifics of the data, the analyses that
you want to perform, and the life cycle within which data and ana-
lyses are embedded. Table 4.1 summarizes relevant factors, which
we now discuss.
Table 4.1. When to use different data management and analysis technologies
Text files, spreadsheets, and scripting language:
  Your data are small
  Your analysis is simple
  You do not expect to repeat analyses over time
Statistical packages:
  Your data are modest in size
  Your analysis maps well to your chosen statistical package
Relational database:
  Your data are structured
  Your data are large
  You will be analyzing changed versions of your data over time
  You want to share your data and analyses with others
NoSQL database:
  Your data are unstructured
  Your data are extremely large
In the case of data set 1 (10,000 records describing research
grants), it may be feasible to leave the data in their original file,
use spreadsheets, pivot tables, or write programs in scripting languages such as Python or R to ask questions of those files. (A scripting language is a programming language used to automate tasks that could otherwise be performed one by one by the user.) For example, someone familiar with such languages can quickly create a script to extract from data set 1 all grants awarded to one investigator, compute average grant size, and count grants made each year in different areas.
However, this approach also has disadvantages. Scripts do not
provide inherent control over the file structure. This means that if
you obtain new data in a different format, your scripts need to be
updated. You cannot just run them over the newly acquired file.
Scripts can also easily become unreasonably slow as data volumes
grow. A Python or R script will not take long to search a list of
1,000 grants to find those that pertain to a particular institution.
But what if you have information about 1 million grants, and for
each grant you want to search a list of 100,000 investigators, and
for each investigator, you want to search a list of 10 million papers
to see whether that investigator is listed as an author of each paper?
You now have 1,000,000 × 100,000 × 10,000,000 = 10¹⁸ comparisons
to perform. Your simple script may now run for hours or even days.
You can speed up the search process by constructing indices, so
that, for example, when given a grant, you can find the associated
investigators in constant time rather than in time proportional to
the number of investigators. However, the construction of such
indices is itself a time-consuming and error-prone process.
For these reasons, the use of scripting languages alone for data
analysis is rarely to be recommended. This is not to say that
all analysis computations can be performed in database systems.
A programming language will also often be needed. But many
data access and manipulation computations are best handled in
a database.
Researchers in the social sciences frequently use statistical packages such as R, SAS, SPSS, and Stata for data analysis. (A statistical package is a specialized computer program for analysis in statistics and economics.) Because these systems integrate some crude data management, statistical
analysis, and graphics capabilities in a single package, a researcher
can often carry out a data analysis project of modest size within the
same environment. However, each of these systems has limitations
that hinder its use for modern social science research, especially as
data grow in size and complexity.
Take Stata, for example. Stata always loads the entire data set
into the computer’s working memory, and thus you would have
no problems loading data set 1. However, depending on your com-
puter’s memory, it could have problems dealing with data set 2
and certainly would not be able to handle data set 3. In addition,
you would need to perform this data loading step each time you
start working on the project, and your analyses would be limited
to what Stata can do. SAS can deal with larger data sets, but is
renowned for being hard to learn and use. Of course there are
workarounds in statistical packages. For example, in Stata you can
deal with larger file sizes by choosing to only load the variables or
cases that you need for the analysis [211]. Likewise, you can deal
with more complex data by creating a system of files that each can
be linked as needed for a particular analysis through a common
identifier variable.
For example, the Panel
Study of Income Dynamics
[181] has a series of files
that are related and can be
combined through common
identifier variables [182].
Those solutions essentially mimic core functions of a DBMS, and
you would be well advised to set up such a system, especially if you
find yourself in a situation where the data set is constantly updated
through different users, if groups of users have different rights to
use your data or should only have access to subsets of the data,
and if the analysis takes place on a server that sends results to
a client (browser). Statistics packages also have difficulty working
with more than one data source at a time—something that DBMSs
are designed to do well.
These considerations bring us to the topic of this chapter, namely
database management systems. A DBMS (a system that interacts with users, other applications, and the database itself to capture and analyze data) handles all of the issues listed above, and more. As we will see below when we look
at concrete examples, a DBMS allows the programmer to define a
logical design that fits the structure of their data. The DBMS then
implements a data model (more on this below) that allows these data
to be stored, queried, and updated efficiently and reliably on disk,
thus providing independence from underlying physical storage. It
supports efficient access to data through query languages and auto-
matic optimization of those queries to permit fast analysis. Impor-
tantly, it also supports concurrent access by multiple users, which is
not an option for file-based data storage. It supports transactions,
meaning that any update to a database is performed in its entirety
or not at all, even in the face of computer failures or multiple con-
current updates. And it reduces the time spent both on analysis, by making it easy to express complex analytical queries concisely, and on data administration, by providing simple and uniform data administration interfaces.
A database is a structured collection of data about entities and
their relationships. It models real-world objects—both entities (e.g.,
grants, investigators, universities) and relationships (e.g., “Steven
Weinberg” works at “University of Texas at Austin”)—and captures
structure in ways that allow these entities and relationships to be
queried for analysis. A database management system is a software
suite designed to safely store and efficiently manage databases, and
to assist with the maintenance and discovery of the relationships
that the database represents. In general, a DBMS encompasses three
key components, as shown in Table 4.2: its data model (which
defines how data are represented: see Box 4.1), its query language
(which defines how the user interacts with the data), and support
for transactions and crash recovery (to ensure reliable execution
despite system failures). (Some key DBMS features are often lacking in standard statistical packages: a standard query language, with commands that allow analyses or data manipulation on a subgroup of cases defined during the analysis, for example "group by ..." and "order by ..."; keys, for speed improvement; and an explicit model of a relational data structure.)
Box 4.1: Data model
A data model specifies the data elements associated with a
problem domain, the properties of those data elements, and
how those data elements relate to one another. In developing
a data model, we commonly first identify the entities that are
to be modeled and then define their properties and relation-
ships. For example, when working on the science of science
policy (see Figure 1.2), the entities include people, products,
institutions, and funding, each of which has various properties
(e.g., for a person, their name, address, employer); relationships
include “is employed by” and “is funded by.” This conceptual
data model can then be translated into relational tables or some
other database representation, as we describe next.
Table 4.2. Key components of a DBMS
             Data model                     Query language                       Transactions, crash recovery
User-facing  For example: relational,       For example: SQL (for relational),   Transactions
             semi-structured                XPath (for semi-structured)
Internal     Mapping data to storage        Query optimization and evaluation;   Locking, concurrency control,
             systems; creating and          consistency                          recovery
             maintaining indices
Literally hundreds of different open source, commercial, and cloud-hosted DBMSs are available. However, you only
need to understand a relatively small number of concepts and ma-
jor database types to make sense of this diversity. Table 4.3 defines
the major classes of DBMSs that we will consider in this book. We
consider only a few of these in any detail.
Relational DBMSs are the most widely used and mature sys-
tems, and will be the optimal solution for many social science data
analysis purposes. We describe relational DBMSs in detail below,
but in brief, they allow for the efficient storage, organization, and
analysis of large quantities of tabular data: data organized as tables, in which rows represent entities (e.g., research grants) and columns represent attributes of those entities (e.g., principal investigator, institution, funding level). (Sometimes, as discussed in Chapter 3, the links are one to one and sometimes one to many.) The associated Structured Query
Language (SQL) can then be used to perform a wide range of anal-
yses, which are executed with high efficiency due to sophisticated
indexing and query planning techniques.
While relational DBMSs have dominated the database world for
decades, other database technologies have become popular for var-
ious classes of applications in recent years. As we will see, these
alternative NoSQL DBMSs have typically been motivated by a de-
sire to scale the quantities of data and/or number of users that can
be supported and/or to deal with unstructured data that are not
easily represented in tabular form. For example, a key–value store
can organize large numbers of records, each of which associates an
arbitrary key with an arbitrary value. These stores, and in partic-
ular variants called document stores that permit text search on the
stored values, are widely used to organize and process the billions
of records that can be obtained from web crawlers. We review be-
low some of these alternatives and the factors that may motivate
their use.
Table 4.3. Types of databases: relational (first row) and various types of NoSQL (other rows)
Relational database. Examples: MySQL, PostgreSQL, Oracle, SQL Server, Teradata. Advantages: consistency (ACID). Disadvantages: fixed schema; typically harder to scale. Uses: transactional systems (order processing, retail, hospitals, etc.).
Key–value store. Examples: Dynamo, Redis. Advantages: dynamic schema; easy scaling; high throughput. Disadvantages: not immediately consistent; no higher-level queries. Uses: web applications.
Column store. Examples: Cassandra, HBase. Advantages: same as key–value; distributed; better compression at the column level. Disadvantages: not immediately consistent; using all columns is inefficient. Uses: large-scale analysis.
Document store. Examples: CouchDB, MongoDB. Advantages: indexes entire documents (JSON). Disadvantages: not immediately consistent; no higher-level queries. Uses: web applications.
Graph database. Examples: Neo4j, InfiniteGraph. Advantages: graph queries are fast. Disadvantages: difficult to do non-graph analysis. Uses: recommendation systems, networks, routing.
Relational and NoSQL databases (and indeed other solutions,
such as statistical packages) can also be used together. Consider,
for example, Figure 4.1, which depicts data flows commonly en-
countered in large research projects. Diverse data are being col-
lected from different sources: JSON documents from web APIs, web
pages from web scraping, tabular data from various administrative
databases, Twitter data, and newspaper articles. There may be hun-
dreds or even thousands of data sets in total, some of which may be
extremely large. We initially have no idea of what schema to use for the different data sets (a schema defines the structure of a database in a formal language defined by the DBMS; see Section 4.3.3), and indeed it may not be feasible to define a unified set of schemas, so diverse are the data and so rapidly are
new data sets being acquired. Furthermore, the way we organize
the data may vary according to our intended purpose. Are we in-
terested in geographic, temporal, or thematic relationships among
different entities? Each type of analysis may require a different
organization.
For these reasons, a common storage solution is to first load
all data into a large NoSQL database. This approach makes all
data available via a common (albeit limited) query interface. Re-
searchers can then extract from this database the specific elements
that are of interest for their work, loading those elements into a relational DBMS, another specialized DBMS (e.g., a graph database), or a statistical package for more detailed analysis. As part of the process of loading data from the NoSQL database into a relational database, the researcher will necessarily define schemas, relationships between entities, and so forth. Analysis results can be stored in a relational database or back into the NoSQL store.

[Figure 4.1 shows diverse sources (PubMed abstracts, researcher web pages, patent abstracts, Twitter messages) feeding a big NoSQL database, from which extracted subsets, such as a research area or a collaboration network, are loaded into a domain-specific relational database or a graph database.]
Figure 4.1. A research project may use a NoSQL database to accumulate large amounts of data from many different sources, and then extract selected subsets to a relational or other database for more structured processing
4.3 Relational DBMSs
We now provide a more detailed description of relational DBMSs.
Relational DBMSs implement the relational data model, in which
data are represented as sets of records organized in tables. This
model is particularly well suited for the structured, regular data
with which we frequently deal in the social sciences; we discuss in
Section 4.5 alternative data models, such as those used in NoSQL
databases.
We use the data shown in Figure 4.2 to introduce key con-
cepts. These two CSV format files describe grants made by the
US National Science Foundation (NSF). One file contains informa-
tion about grants, the other information about investigators. How
should you proceed to manipulate and analyze these data?
The main concept underlying the relational data model is a ta-
ble (also referred to as a relation): a set of rows (also referred to as
tuples, records, or observations), each with the same columns (also
referred to as fields, attributes or variables). A database consists of
multiple tables. For example, we show in Figure 4.3 how the data contained in the two CSV files of Figure 4.2 may be represented as two tables.

The file grants.csv
# Identifier,Person,Funding,Program
1316033,Steven Weinberg,666000,Elem. Particle Physics/Theory
1336199,Howard Weinberg,323194,ENVIRONMENTAL ENGINEERING
1500194,Irving Weinberg,200000,Accelerating Innovation Rsrch
1211853,Irving Weinberg,261437,GALACTIC ASTRONOMY PROGRAM

The file investigators.csv
# Name,Institution,Email
Steven Weinberg,University of Texas at Austin,weinberg@utexas.edu
Howard Weinberg,University of North Carolina Chapel Hill,
Irving Weinberg,University of Maryland College Park,irving@ucmc.edu

Figure 4.2. CSV files representing grants and investigators. Each line in the first table specifies a grant number, investigator name, total funding amount, and NSF program name; each line in the second gives an investigator name, institution name, and investigator email address

The Grants table contains one tuple for each row in
grants.csv, with columns Number, Person, Funding, and Program. The Investigators table contains one tuple for each row in investigators.csv, with columns ID, Name, Institution, and Email. The
CSV files and tables contain essentially the same information, al-
beit with important differences (the addition of an ID field in the Investigators table, and the substitution of investigator identifiers for names in the Person column of the Grants table) that we will explain below.
The use of the relational data model provides for physical inde-
pendence: a given table can be stored in many different ways. SQL
queries are written in terms of the logical representation of tables
(i.e., their schema definition). Consequently, even if the physical
organization of the data changes (e.g., a different layout is used
to store the data on disk, or a new index is created to speed up
access for some queries), the queries need not change. Another
advantage of the relational data model is that, since a table is a
set, in a mathematical sense, simple and intuitive set operations
(e.g., union, intersection) can be used to manipulate the data, as
we discuss below. We can easily, for example, determine the inter-
section of two relations (e.g., grants that are awarded to a specific
institution), as we describe in the following. The database further
ensures that the data comply with the model (e.g., data types, key
uniqueness, entity relationships), essentially providing core quality
assurance.
Number Person Funding Program
1316033 1 666,000 Elem. Particle Physics/Theory
1336199 2 323,194 ENVIRONMENTAL ENGINEERING
1500194 3 200,000 Accelerating Innovation Rsrch
1211853 3 261,437 GALACTIC ASTRONOMY PROGRAM
ID Name Institution Email
1 Steven Weinberg University of Texas at Austin weinberg@utexas.edu
2 Howard Weinberg University of North Carolina Chapel Hill
3 Irving Weinberg University of Maryland College Park irving@ucmc.edu
Figure 4.3. Relational tables Grants and Investigators corresponding to the grants.csv and investigators.csv
data in Figure 4.2, respectively. The only differences are the representation in a tabular form, the introduction of a
unique numerical investigator identifier (ID) in the Investigators table, and the substitution of that identifier for
the investigator name in the Grants table
4.3.1 Structured Query Language (SQL)
We use query languages to manipulate data in a database (e.g.,
to add, update, or delete data elements) and to retrieve (raw and
aggregated) data from a database (e.g., data elements that have certain
properties). Relational DBMSs support SQL, a simple, powerful
query language with a strong formal foundation based on logic, a
foundation that allows relational DBMSs to perform a wide variety of
sophisticated optimizations. SQL is used for three main purposes:
Data definition: e.g., creation of new tables,
Data manipulation: queries and updates,
Control: creation of assertions to protect data integrity.
We introduce each of these features in the following, although not
in that order, and certainly not completely. Our goal here is to give
enough information to provide the reader with insights into how
relational databases work and what they do well; an in-depth SQL
tutorial is beyond the scope of this book but is something we highly
recommend readers seek elsewhere.
4.3.2 Manipulating and querying data
SQL and other query languages used in DBMSs support the concise,
declarative specification of complex queries. Because we are eager
to show you something immediately useful, we cover these features
first, before talking about how to define data models.
Example: Identifying grants of at most $200,000
Here is an SQL query to identify all grants with total funding of at most $200,000:
select * from Grants
where Funding <= 200000;
(Here and elsewhere in this chapter, we show SQL key words in blue.)
Notice SQL’s declarative nature: this query can be read almost as the English
language statement, “select all rows from the Grants table for which the Funding
column has value less than or equal to 200,000.” This query is evaluated as follows:
1. The input table specified by the from clause, Grants, is selected.
2. The condition in the where clause, Funding <= 200000, is checked
against all rows in the input table to identify those rows that match.
3. The select clause specifies which columns to keep from the matching rows,
that is, which columns make the schema of the output table. (The “*” indi-
cates that all columns should be kept.)
The answer, given the data in Figure 4.3, is the following single-row table. (The
fact that an SQL query returns a table is important when it comes to creating more
complex queries: the result of a query can be stored into the database as a new
table, or passed to another query as input.)
Number Person Funding Program
1500194 3 200,000 Accelerating Innovation Rsrch
DBMSs automatically optimize declarative queries such as the
example that we just presented, translating them into a set of low-
level data manipulations (an imperative query plan) that can be eval-
uated efficiently. This feature allows users to write queries without
having to worry too much about performance issues—the database
does the worrying for you. For example, a DBMS need not consider
every row in the Grants table in order to identify those with funding
less than $200,000, a strategy that would be slow if the Grants ta-
ble were large: it can instead use an index to retrieve the relevant
records much more quickly. We discuss indices in more detail in
Section 4.3.6.
The querying component of SQL supports a wide variety of ma-
nipulations on tables, whether referred to explicitly by a table name
(as in the example just shown) or constructed by another query.
We just saw how to use the select operator to both pick certain
rows (what is termed selection) and certain columns (what is called
projection) from a table.
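For instance, a sketch of a query that performs both at once (the funding threshold shown is purely illustrative) keeps only two columns and only sufficiently large grants:

select Program, Funding
from Grants
where Funding > 300000;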
Example: Finding grants awarded to an investigator
We want to find all grants awarded to the investigator with name “Irving Weinberg.”
The information required to answer this question is distributed over two tables,
Grants and Investigators, and so we join the two tables to combine tuples from both (in statistical packages, the term merge or append is often used when data sets are combined):
select Number, Name, Funding, Program
from Grants, Investigators
where Grants.Person = Investigators.ID
and Name = "Irving Weinberg";
This query combines tuples from the Grants and Investigators tables
for which the Person and ID fields match. It is evaluated in a similar fashion
to the query presented above, except for the from clause: when multiple tables
are listed, as here, the conditions in the where clause are checked for all different
combinations of tuples from the tables defined in the from clause (i.e., the cartesian
product of these tables)—in this case, a total of 3 × 4 = 12 combinations. We
thus determine that Irving Weinberg has two grants. The query further selects
the Number, Name, Funding, and Program fields from the result, giving the
following:
Number Name Funding Program
1500194 Irving Weinberg 200,000 Accelerating Innovation Rsrch
1211853 Irving Weinberg 261,437 GALACTIC ASTRONOMY PROGRAM
This ability to join two tables in a query is one example of how SQL permits
concise specifications of complex computations. This joining of tables via a carte-
sian product operation is formally called a cross join. Other types of join are also
supported. We describe one such, the inner join, in Section 4.6.
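For reference, the same result can be obtained with SQL's explicit inner join syntax; a sketch follows, in which the on clause carries the join condition that the where clause carried above:

select Number, Name, Funding, Program
from Grants inner join Investigators
    on Grants.Person = Investigators.ID
where Name = "Irving Weinberg";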
SQL aggregate functions allow for the computation of aggregate
statistics over tables. For example, we can use the following query
to determine the total number of grants and their total and average
funding levels:
select count(*) as 'Number', sum(Funding) as 'Total',
    avg(Funding) as 'Average'
from Grants;
This yields the following:
Number Total Average
4 1450631 362658
The group by operator can be used in conjunction with the ag-
gregate functions to group the result set by one or more columns.
For example, we can use the following query to create a table with
three columns: investigator name, the number of grants associated
with the investigator, and the average funding per grant:
select Name, count(*) as 'Number',
avg(Funding) as 'Average funding'
from Grants, Investigators
where Grants.Person = Investigators.ID
group by Name;
We obtain the following:
Name Number Average Funding
Steven Weinberg 1 666000
Howard Weinberg 1 323194
Irving Weinberg 2 230719
4.3.3 Schema design and definition
We have seen that a relational database comprises a set of tables.
The task of specifying the structure of the data to be stored in a
database is called logical design. This task may be performed by
a database administrator, in the case of a database to be shared
by many people, or directly by users, if they are creating databases
themselves. More specifically, the logical design process involves
defining a schema. A schema comprises a set of tables (including,
for each table, its columns and their types), their relationships, and
integrity constraints.
The first step in the logical design process is to identify the en-
tities that need to be modeled. In our example, we identified two
important classes of entity: “grants” and “investigators.” We thus
define a table for each; each row in these two tables will correspond
to a unique grant or investigator, respectively. (In a more complete
and realistic design, we would likely also identify other entities, such
as institutions and research products.) During this step, we will of-
ten find ourselves breaking information up into multiple tables, so
as to avoid duplicating information.
For example, imagine that we were provided grant information
in the form of one CSV file rather than two, with each line pro-
viding a grant number, investigator, funding, program, institution,
and email. In this file, the name, institution, and email address
for Irving Weinberg would then appear twice, as he has two grants,
which can lead to errors when updating values and make it difficult
to represent certain information. (For example, if we want to add an
investigator who does not yet have a grant, we will need to create
a tuple (row) with empty slots for all columns (variables) associated
with grants.) Thus we would want to break up the single big table
into the two tables that we defined here. This breaking up of infor-
mation across different tables to avoid repetition of information is
referred to as normalization. (Normalization involves organizing the columns and tables of a relational database to minimize data redundancy.)
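To make the duplication concrete, the combined file sketched above would contain lines such as these (values taken from Figure 4.2), with Irving Weinberg's institution and email repeated for each of his two grants:

1500194,Irving Weinberg,200000,Accelerating Innovation Rsrch,University of Maryland College Park,irving@ucmc.edu
1211853,Irving Weinberg,261437,GALACTIC ASTRONOMY PROGRAM,University of Maryland College Park,irving@ucmc.edu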
The second step in the design process is to define the columns
that are to be associated with each entity. For each table, we define
a set of columns. For example, given the data in Figure 4.2, those
columns will likely include, for a grant, an award identifier, title,
investigator, and award amount; for an investigator, a name, uni-
versity, and email address. In general, we will want to ensure that
each row in our table has a key: a set of columns that uniquely iden-
tifies that row. In our example tables, grants are uniquely identified by Number and investigators by ID. (Normalization can be done in statistical packages as well. For example, as noted above, PSID splits its data into different files linked through ID variables. The difference here is that the DBMS makes creating, navigating, and querying the resulting data particularly easy.)
The third step in the design process is to capture relationships
between entities. In our example, we are concerned with just one
relationship, namely that between grants and investigators: each
grant has an investigator. We represent this relationship between
tables by introducing a Person column in the Grants table, as shown
in Figure 4.3. Note that we do not simply duplicate the investigator
names in the two tables, as was the case in the two CSV files shown
in Figure 4.2: these names might not be unique, and the duplication
of data across tables can lead to later inconsistencies if a name is
updated in one table but not the other.
The final step in the design process is to represent integrity con-
straints (or rules) that must hold for the data. In our example, we
may want to specify that each grant must be awarded to an investi-
gator; that each value of the grant identifier column must be unique
(i.e., there cannot be two grants with the same number); and total
funding can never be negative. Such restrictions can be achieved by
specifying appropriate constraints at the time of schema creation,
as we show in Listing 4.1, which contains the code used to create
the two tables that make up our schema.
Listing 4.1 contains four SQL statements. The first two state-
ments, lines 1 and 2, simply set up our new database. The create
table statement in lines 4–10 creates our first table. It specifies
the table name (Investigators) and, for each of the four columns,
the column name and its type. (These storage types will be familiar to many of you from statistical software packages.) Relational DBMSs offer a rich set of types to choose from when designing a schema: for example, int or integer (synonyms); real or float (synonyms); char(n), a fixed-length string of n characters; and varchar(n), a variable-length string of up to n characters. Types are important for several
reasons. First, they allow for more efficient encoding of data.

 1  create database grantdata;
 2  use grantdata;
 3
 4  create table Investigators (
 5      ID int auto_increment,
 6      Name varchar(100) not null,
 7      Institution varchar(256) not null,
 8      Email varchar(100),
 9      primary key(ID)
10  );
11
12  create table Grants (
13      Number int not null,
14      Person int not null,
15      Funding float unsigned not null,
16      Program varchar(100),
17      primary key(Number)
18  );

Listing 4.1. Code to create the grantdata database and its Investigators and Grants tables

For example, the Funding field in the grants.csv file of Figure 4.2 could
be represented as a string in the Grants table, char(15), say, to
allow for large grants. By representing it as a floating point number
instead (line 15 in Listing 4.1), we reduce the space requirement
per grant to just four bytes. Second, types allow for integrity checks
on data as they are added to the database: for example, that same
type declaration for Funding ensures that only valid numbers will be
entered into the database. Third, types allow for type-specific op-
erations on data, such as arithmetic operations on numbers (e.g.,
min, max, sum).
Other SQL features allow for the specification of additional con-
straints on the values that can be placed in the corresponding
column. For example, the not null constraints for Name and
Institution (lines 6, 7) indicate that each investigator must have
a name and an institution, respectively. (The lack of such a con-
straint on the Email column shows that an investigator need not
have an email address.)
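Relationship rules can also be made explicit. As a sketch (the constraint name is illustrative, and enforcement assumes a storage engine, such as InnoDB, that checks foreign keys), the requirement that every grant refer to an existing investigator could be declared as follows:

alter table Grants
    add constraint fk_grants_person
    foreign key (Person) references Investigators(ID);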
4.3.4 Loading data
So far we have created a database and two tables. To complete
our simple SQL program, we show in Listing 4.2 the two state-
ments that load the data of Figure 4.2 into our two tables. (Here
and elsewhere in this chapter, we use the MySQL DBMS. The SQL syntax used by different DBMSs differs in various, mostly minor ways.)

 1  load data local infile "investigators.csv"
 2      into table Investigators
 3      fields terminated by ","
 4      ignore 1 lines
 5      (Name, Institution, Email);
 6
 7  load data local infile "grants.csv" into table Grants
 8      fields terminated by ","
 9      ignore 1 lines
10      (Number, @var, Funding, Program)
11      set Person = (select ID from Investigators
12          where Investigators.Name=@var);

Listing 4.2. Code to load data into the Investigators and Grants tables

Each statement specifies the name of the file from which
data is to be read and the table into which it is to be loaded.
The fields terminated by "," statement tells SQL that values are separated by commas, and ignore 1 lines tells SQL to skip the
header. The list of column names is used to specify how values
from the file are to be assigned to columns in the table.
For the Investigators table, the three values in each row of
the investigators.csv file are assigned to the Name,Institution, and
Email columns of the corresponding database row. Importantly, the
auto_increment declaration on the ID column (line 5 in Listing 4.1)
causes values for this column to be assigned automatically by the
DBMS, as rows are created, starting at 1. This feature allows us to
assign a unique integer identifier to each investigator as its data are
loaded.
For the Grants table, the load data call (lines 7–12) is somewhat
more complex. Rather than loading the investigator name (the sec-
ond column of each line in our data file, represented here by the
variable @var) directly into the database, we use an SQL query (the
select statement in lines 11–12) to retrieve from the Investigators
table the ID corresponding to that name. By thus replacing the
investigator name with the unique investigator identifier, we avoid
replicating the name across the two tables.
4.3.5 Transactions and crash recovery
A DBMS protects the data that it stores from computer crashes: if
your computer stops running suddenly (e.g., your operating system
crashes or you unplug the power), the contents of your database are
not corrupted. It does so by supporting transactions. A transaction
is an atomic sequence of database actions. In general, every SQL
statement is executed as a transaction. You can also specify sets
of statements to be combined into a single transaction, but we do
not cover that capability here. The DBMS ensures that each trans-
action is executed completely even in the case of failure or error: if
the transaction succeeds, the results of all operations are recorded
permanently (“persisted”) in the database, and if it fails, all opera-
tions are “rolled back” and no changes are committed. For example,
suppose we ran the following SQL statement to convert the funding
amounts in the Grants table from dollars to euros, by scaling each
number by 0.9. The update statement specifies the table to be up-
dated and the operation to be performed, which in this case is to
update the Funding column of each row. The DBMS will ensure that
either no rows are altered or all are altered.
update Grants set Grants.Funding = Grants.Funding*0.9;
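Although a full treatment of multi-statement transactions is beyond this chapter, a minimal sketch in MySQL syntax looks like the following (the investigator and grant inserted here are invented purely for illustration): either both inserts take effect or, if rollback is issued or a failure intervenes, neither does.

start transaction;
-- the investigator and grant below are hypothetical examples
insert into Investigators (Name, Institution)
    values ("Jane Doe", "University of Chicago");
insert into Grants (Number, Person, Funding, Program)
    values (1999999, last_insert_id(), 150000, "Example Program");
commit;    -- use rollback; instead to undo both inserts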
Transactions are also key to supporting multi-user access. The
concurrency control mechanisms in a DBMS allow multiple users to
operate on a database concurrently, as if they were the only users
of the system: transactions from multiple users can be interleaved
to ensure fast response times, while the DBMS ensures that the
database remains consistent. While entire books could be (and have
been) written on concurrency in databases, the key point is that
read operations can proceed concurrently, while update operations
are typically serialized.
4.3.6 Database optimizations
A relational DBMS applies query planning and optimization meth-
ods with the goal of evaluating queries as efficiently as possible.
For example, if a query asks for rows that fit two conditions, one
cheap to evaluate and one expensive, a relational DBMS may filter
first on the basis of the first condition, and then apply the second
condition only to the rows identified by that first filter. These sorts
of optimization are what distinguish SQL from other programming
languages, as they allow the user to write queries declaratively and
rely on the DBMS to come up with an efficient execution strategy.
Nevertheless, the user can help the DBMS to improve perform-
ance. The single most powerful performance improvement tool is
the index, an internal data structure that the DBMS maintains to
speed up queries. While various types of indices can be created,
with different characteristics, the basic idea is simple. Consider the
column ID in our Investigators table. Assume that there are N
rows in the table. In the absence of an index, a query that refers
to a column value (e.g., where ID=3) would require a linear scan of
the table, taking on average N/2 comparisons and in the worst case N comparisons. A binary tree index allows the desired value to be found with just log₂ N comparisons.
Example: Using indices to improve database performance
Consider the following query:
select ID, Name, sum(Funding) as TotalFunding
from Grants, Investigators
where Investigators.ID=Grants.Person
group by ID;
This query joins our two tables to link investigators with the grants that they
hold, groups grants by investigator (using group by), and finally sums the funding
associated with the grants held by each investigator. The result is the following:
ID Name TotalFunding
1 Steven Weinberg 666000
2 Howard Weinberg 323194
3 Irving Weinberg 461437
In the absence of indices, the DBMS must compare each row in Investi-
gators with each row in Grants, checking for each pair whether Investiga-
tors.ID = Grants.Person holds. As the two tables in our sample database
have only three and four rows, respectively, the total number of comparisons is
only 3 × 4 = 12. But if we had, say, 1 million investigators and 1 million grants,
then the DBMS would have to perform 1 trillion comparisons, which would take
a long time. (More importantly in many cases, it would have to perform a large
number of disk I/O operations if the tables did not fit in memory.) An index on
the ID column of the Investigators table reduces the number of operations
dramatically, as the DBMS can then take each of the 1 million rows in the Grants
table and, for each row, identify the matching row(s) in Investigators via an
index lookup rather than a linear scan.
In our example table, the ID column has been specified to be a primary key,
and thus an index is created for it automatically. If it were not, we could easily
create the desired index as follows:
alter table Investigators add index(ID);
It can be difficult for the user to determine when an index is required. A good
rule of thumb is to create an index for any column that is queried often, that is, appears frequently in where clauses. However, the presence
of indices makes updates more expensive, as every change to a column value
requires that the index be rebuilt to reflect the change. Thus, if your data are
highly dynamic, you should carefully select which indices to create. (For bulk load
operations, a common practice is to drop indices prior to the data import, and
re-create them once the load is completed.) Also, indices take disk space, so you
need to consider the tradeoff between query efficiency and resources.
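A sketch of that bulk-load pattern, assuming an index on Person exists under its default name (which is what alter table Grants add index(Person) would create):

alter table Grants drop index Person;   -- drop the index before the bulk import
-- ... load data local infile ... into table Grants ...
alter table Grants add index(Person);   -- re-create it once the load completes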
The explain command can be useful for determining when indices are re-
quired. For example, we show in the following some of the output produced when
we apply explain to our query. (For this example, we have expanded the two ta-
bles to 1,000 rows each, as our original tables are too small for MySQL to consider
the use of indices.) The output provides useful information such as the key(s) that
could be used, if indices exist (Person in the Grants table, and the primary key,
ID, for the Investigators table); the key(s) that are actually used (the primary
key, ID, in the Investigators table); the column(s) that are compared to the
index (Investigators.ID is compared with Grants.Person); and the number
of rows that must be considered (each of the 1,000 rows in Grants is compared
with one row in Investigators, for a total of 1,000 comparisons).
mysql> explain select ID, Name, sum(Funding) as TotalFunding
from Grants, Investigators
where Investigators.ID=Grants.Person group by ID;
+---------------+---------------+---------+---------------+------+
|table | possible_keys | key | ref | rows |
+---------------+---------------+---------+---------------+------+
| Grants | Person | NULL |NULL | 1000 |
| Investigators | PRIMARY |PRIMARY | Grants.Person | 1 |
+---------------+---------------+---------+---------------+------+
Contrast this output with the output obtained for equivalent tables in which
ID is not a primary key. In this case, no keys are used and thus 1,000 × 1,000 =
1,000,000 comparisons and the associated disk reads must be performed.
+---------------+---------------+------+------+------+
|table | possible_keys | key | ref | rows |
+---------------+---------------+------+------+------+
| Grants | Person | NULL |NULL | 1000 |
| Investigators | ID | NULL |NULL | 1000 |
+---------------+---------------+------+------+------+
A second way in which the user can contribute to performance
improvement is by using appropriate table definitions and data
types. Most DBMSs store data on disk. Data must be read from
disk into memory before it can be manipulated. Memory accesses
are fast, but loading data into memory is expensive: accesses to
main memory can be a million times faster than accesses to disk.
Therefore, to ensure queries are efficient, it is important to minimize
the number of disk accesses. A relational DBMS automatically op-
timizes queries: based on how the data are stored, it transforms
a SQL query into a query plan that can be executed efficiently,
and chooses an execution strategy that minimizes disk accesses.
But users can contribute to making queries efficient. As discussed
above, the choice of types made when defining schemas can make
a big difference. As a rule of thumb, only use as much space as
needed for your data: the smaller your records, the more records
can be transferred to main memory using a single disk access. The
design of relational tables is also important. If you put all columns
in a single table (do not normalize), more data will come into memory
than is required.
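As an illustration only (the variant table below is hypothetical), the Grants schema of Listing 4.1 could be made more compact by choosing the smallest types that still cover the expected range of values, so that more rows fit into each disk read:

create table GrantsCompact (
    Number  int unsigned not null,
    Person  smallint unsigned not null,   -- sufficient if there are fewer than 65,536 investigators
    Funding float unsigned not null,
    Program varchar(100),
    primary key(Number)
);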
4.3.7 Caveats and challenges
It is important to keep the following caveats and challenges in mind
when using SQL technology with social science data.
Data cleaning Data created outside an SQL database, such as data
in files, are not always subject to strict constraints: data types
may not be correct or consistent (e.g., numeric data stored as text)
and consistency or integrity may not be enforced (e.g., absence of
primary keys, missing foreign keys). Indeed, as the reader probably
knows well from experience, data are rarely perfect. As a result,
the data may fail to comply with strict SQL schema requirements
and fail to load, in which case either data must be cleaned before
or during loading, or the SQL schema must be relaxed.
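One common workaround, sketched here with hypothetical table names, is to load the raw text into a permissive staging table whose columns are all strings, and then to move only the rows that pass basic checks into the strictly typed target table:

create table GrantsStaging (
    Number  varchar(20),
    Funding varchar(20),
    Program varchar(200)
);

-- after "load data ... into table GrantsStaging":
insert into GrantsClean (Number, Funding, Program)   -- GrantsClean is a hypothetical strict table
select cast(Number as unsigned),
       cast(Funding as decimal(12,2)),
       Program
from GrantsStaging
where Number  regexp '^[0-9]+$'
  and Funding regexp '^[0-9.]+$';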
Missing values Care must be taken when loading data in which
some values may be missing or blank. SQL engines represent and
refer to a missing or blank value as the built-in constant null.
Counterintuitively, when loading data from text files (e.g., CSV),
many SQL engines require that missing values be represented ex-
plicitly by the term null; if a data value is simply omitted, it may
fail to load or be incorrectly represented, for example as zero or the empty string ("") instead of null. Thus, for example, the second
row in the investigators.csv file of Figure 4.2:
Howard Weinberg,University of North Carolina Chapel Hill,
may need to be rewritten as:
Howard Weinberg,University of North Carolina Chapel Hill,null
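With MySQL, an alternative to editing the file is to capture the raw field in a user variable at load time and convert empty strings to null; a sketch, adapting the load statement of Listing 4.2:

load data local infile "investigators.csv"
    into table Investigators
    fields terminated by ","
    ignore 1 lines
    (Name, Institution, @e)
    set Email = nullif(@e, '');

-- rows with a missing email can then be found with:
select Name from Investigators where Email is null;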
Metadata for categorical variables SQL engines are metadata poor:
they do not allow extra information to be stored about a variable
(field) beyond its base name and type (int, char, etc., as introduced
in Section 4.3.3). They cannot, for example, record directly the fact
that the column class can only take one of three values, animal,
vegetable, or mineral, or what these values mean. Common prac-
tice is thus to store information about possible values in another
table (commonly referred to as a dimension table) that can be used
as a lookup and constraint, as in the following:
Table class_values
Value Description
animal Is alive
vegetable Grows
mineral Isn’t alive and doesn’t grow
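A sketch of that practice (the specimens table and its columns are invented for illustration, and enforcement of the lookup assumes a storage engine that checks foreign keys): the dimension table is created once, and any table that uses the class column references it, so that only the three listed values can ever appear:

create table class_values (
    Value       varchar(20) primary key,
    Description varchar(100)
);

create table specimens (
    SpecimenID int auto_increment primary key,
    class      varchar(20),
    foreign key (class) references class_values(Value)
);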
A related concept is that a column or list of columns may be
declared primary key or unique. Either says that no two tuples of
the table may agree in all the column(s) on the list. There can be
only one primary key for a table, but several unique columns. No column of a primary key can ever be null in any tuple. But columns declared unique may have nulls, and there may be several tuples in which those columns are null.
4.4 Linking DBMSs and other tools
Query languages such as SQL are not general-purpose program-
ming languages; they support easy, efficient access to large data
sets, but are not intended to be used for complex calculations.
When complex computations are required, one can embed query
language statements into a programming language or statistical
package. For example, we might want to calculate the interquartile
range of funding for all grants. While this calculation can be ac-
complished in SQL, the resulting SQL code will be complicated.
Languages like Python make such statistical calculations straight-
forward, so it is natural to write a Python (or R, SAS, Stata, etc.)
program that connects to the DBMS that contains our data, fetches
the required data from the DBMS, and then calculates the inter-
quartile range of those data. The program can then, if desired,
store the result of this calculation back into the database.
Many relational DBMSs also have built-in analytical functions
or often now embed the R engine, providing significant in-database statistical and analytical capabilities and alleviating the need for external processing.

 1  from mysql.connector import MySQLConnection, Error
 2  from python_mysql_dbconfig import read_db_config
 3
 4  def retrieve_and_analyze_data():
 5      try:
 6          # Open connection to the MySQL database
 7          dbconfig = read_db_config()
 8          conn = MySQLConnection(**dbconfig)
 9          cursor = conn.cursor()
10          # Transmit the SQL query to the database
11          cursor.execute('select Funding from Grants;')
12          # Fetch all rows of the query response
13          rows = [row for row in cursor.fetchall()]
14          calculate_inter_quartile_range(rows)
15      except Error as e:
16          print(e)
17      finally:
18          cursor.close()
19          conn.close()
20
21  if __name__ == '__main__':
22      retrieve_and_analyze_data()

Listing 4.3. Embedding SQL in Python
Example: Embedding database queries in Python
The Python script in Listing 4.3 shows how this embedding of database queries in
Python is done. This script establishes a connection to the database (lines 7–9),
transmits the desired SQL query to the database (line 11), retrieves the query
results into a Python array (line 13), and calls a Python procedure (not given) to
perform the desired computation (line 14). A similar program could be used to load
the results of a Python (or R, SAS, Stata, etc.) computation into a database.
Example: Loading other structured data
We saw in Listing 4.2 how to load data from CSV files into SQL tables. Data in other
formats, such as the commonly used JSON, can also be loaded into a relational
DBMS. Consider, for example, the following JSON format data, a simplified version
of data shown in Chapter 2.
[
  {
    "institute": "Janelia Campus",
    "name": "Laurence Abbott",
    "role": "Senior Fellow",
    "state": "VA",
    "town": "Ashburn"
  },
  {
    "institute": "Jackson Lab",
    "name": "Susan Ackerman",
    "role": "Investigator",
    "state": "ME",
    "town": "Bar Harbor"
  }
]
While some relational DBMSs provide built-in support for JSON objects, we
assume here that we want to convert these data into normal SQL tables. Using
one of the many utilities for converting JSON into CSV, we can construct the
following CSV file, which we can load into an SQL table using the method shown
earlier.
institute,name,role,state,town
Janelia Campus,Laurence Abbott,Senior Fellow,VA,Ashburn
Jackson Lab,Susan Ackerman,Investigator,ME,Bar Harbor
But into what table? The two records each combine information about a person
with information about an institute. Following the schema design rules given in
Section 4.3.3, we should normalize the data by reorganizing them into two tables,
one describing people and one describing institutes. Similar problems arise when
JSON documents contain nested structures. For example, consider the following
alternative JSON representation of the data above. Here, the need for normalization
is yet more apparent.
[
  {
    "name": "Laurence Abbott",
    "role": "Senior Fellow",
    "employer": { "institute": "Janelia Campus",
                  "state": "VA",
                  "town": "Ashburn" }
  },
  {
    "name": "Susan Ackerman",
    "role": "Investigator",
    "employer": { "institute": "Jackson Lab",
                  "state": "ME",
                  "town": "Bar Harbor" }
  }
]
Thus, the loading of JSON data into a relational database usually requires both
work on schema design (Section 4.3.3) and data preparation.
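For instance, one possible normalized target schema for the JSON records above might look like the following sketch (table and column names are illustrative):

create table Institutes (
    InstituteID int auto_increment primary key,
    Name        varchar(100),
    State       char(2),
    Town        varchar(100)
);

create table People (
    PersonID    int auto_increment primary key,
    Name        varchar(100),
    Role        varchar(100),
    InstituteID int,
    foreign key (InstituteID) references Institutes(InstituteID)
);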
4.5 NoSQL databases
While relational DBMSs have dominated the database world for sev-
eral decades, other database technologies exist and indeed have
become popular for various classes of applications in recent years.
As we will see, these alternative technologies have typically been
motivated by a desire to scale the quantities of data and/or num-
ber of users that can be supported, and/or to support specialized
data types (e.g., unstructured data, graphs). Here we review some
of these alternatives and the factors that may motivate their use.
4.5.1 Challenges of scale: The CAP theorem
For many years, the big relational database vendors (Oracle, IBM,
Sybase, and to a lesser extent Microsoft) have been the mainstay of
how data were stored. During the Internet boom, startups looking
for low-cost alternatives to commercial relational DBMSs turned to
MySQL and PostgreSQL. However, these systems proved inadequate
for big sites as they could not cope well with large traffic spikes, for
example when many customers all suddenly wanted to order the
same item. That is, they did not scale.
An obvious solution to scaling databases is to partition and/or
replicate data across multiple computers, for example by distribut-
ing different tables, or different rows from the same table, over multi-
ple computers. However, partitioning and replication also introduce
challenges, as we now explain. Let us first define some terms. In a
system that comprises multiple computers:
Consistency indicates that all computers see the same data at
the same time.
Availability indicates that every request receives a response
about whether it succeeded or failed.
Partition tolerance indicates that the system continues to op-
erate even if a network failure prevents computers from com-
municating.
An important result in distributed systems (the so-called “CAP
theorem” [51]) observes that it is not possible to create a distributed
system with all three properties. This situation creates a challenge
with large transactional data sets. Partitioning is needed in or-
der to achieve high performance, but as the number of comput-
ers grows, so too does the likelihood of network disruption among
pair(s) of computers. As strict consistency cannot be achieved at
the same time as availability and partition tolerance, the DBMS de-
signer must choose between high consistency and high availability
for a particular system.
The right combination of availability and consistency will de-
pend on the needs of the service. For example, in an e-commerce
setting, it makes sense to choose high availability for a checkout
process, in order to ensure that requests to add items to a shop-
ping cart (a revenue-producing process) can be honored. Errors
can be hidden from the customer and sorted out later. However, for
order submission—when a customer submits an order—it makes
sense to favor consistency because several services (credit card pro-
cessing, shipping and handling, reporting) need to access the data
simultaneously. However, in almost all cases, availability is chosen
over consistency.
4.5.2 NoSQL and key–value stores
Relational DBMSs were traditionally motivated by the need for trans-
action processing and analysis, which led them to put a premium
on consistency and availability. This led the designers of these
systems to provide a set of properties summarized by the acronym
ACID [137,347]:
Atomic: All work in a transaction completes (i.e., is committed
to stable storage) or none of it completes.
Consistent: A transaction transforms the database from one
consistent state to another consistent state.
Isolated: The results of any changes made during a transaction
are not visible until the transaction has committed.
Durable: The results of a committed transaction survive fail-
ures.
The need to support extremely large quantities of data and num-
bers of concurrent clients has led to the development of a range of
alternative database technologies that relax consistency and thus
these ACID properties in order to increase scalability and/or avail-
ability. These systems are commonly referred to as NoSQL (for “not
SQL”—or, more recently, “not only SQL,” to communicate that they
may support SQL-like query languages) because they usually do
not require a fixed table schema nor support joins and other SQL
features. Such systems are sometimes referred to as BASE [127]:
Basically Available (the system seems to work all the time), Soft
state (it does not have to be consistent all the time), and Eventu-
ally consistent (it becomes consistent at some later time). The data
systems used in essentially all large Internet companies (Google,
Yahoo!, Facebook, Amazon, eBay) are BASE.
Dozens of different NoSQL DBMSs exist, with widely varying
characteristics as summarized in Table 4.3. The simplest are key–
value stores such as Redis, Amazon Dynamo, Apache Cassandra,
and Project Voldemort. We can think of a key–value store as a rela-
tional database with a single table that has just two columns, key
and value, and that supports just two operations: store (or update)
a key–value pair, and retrieve the value for a given key.
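To make the analogy concrete, here is a sketch in MySQL syntax (the table name kvstore and the key shown are illustrative) of such a two-column table and its two operations:

create table kvstore (
    k varchar(255) primary key,
    v text
);

-- Put: store or update a key–value pair
insert into kvstore (k, v)
    values ('Investigator_StevenWeinberg_Email', 'weinberg@utexas.edu')
    on duplicate key update v = 'weinberg@utexas.edu';

-- Get: retrieve the value for a given key
select v from kvstore where k = 'Investigator_StevenWeinberg_Email';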
Example: Representing investigator data in a NoSQL
database
We might represent the contents of the investigators.csv file of Figure 4.2 (in a
NoSQL database) as follows.
Key Value
Investigator_StevenWeinberg_Institution University of Texas at Austin
Investigator_StevenWeinberg_Email weinberg@utexas.edu
Investigator_HowardWeinberg_Institution University of North Carolina Chapel Hill
Investigator_IrvingWeinberg_Institution University of Maryland College Park
Investigator_IrvingWeinberg_Email irving@ucmc.edu
A client can then read and write the value associated with a given key by using
operations such as the following:
Get(key) returns the value associated with key.
Put(key,value) associates the supplied value with key.
Delete(key) removes the entry for key from the data store.
Key–value stores are thus particularly easy to use. Furthermore, because there
is no schema, there are no constraints on what values can be associated with a
key. This lack of constraints can be useful if we want to store arbitrary data. For
example, it is trivial to add the following records to a key–value store; adding this
information to a relational table would require schema modifications.
Key Value
Investigator_StevenWeinberg_FavoriteColor Blue
Investigator_StevenWeinberg_Awards Nobel
Another advantage is that if a given key would have no value (e.g., Investiga-
tor_HowardWeinberg_Email), we need not create a record. Thus, a key–value store
can achieve a more compact representation of sparse data, which would have many
empty fields if expressed in relational form.
A third advantage of the key–value approach is that a key–value store is easily
partitioned and thus can scale to extremely large sizes. A key–value DBMS can
partition the space of keys (e.g., via a hash on the key) across different computers
for scalability. It can also replicate key–value pairs across multiple computers
for availability. Adding, updating, or querying a key–value pair requires simply
sending an appropriate message to the computer(s) that hold that pair.
The key–value approach also has disadvantages. As we can see from the exam-
ple, users must be careful in their choice of keys if they are to avoid name collisions.
The lack of schema and constraints can also make it hard to detect erroneous keys
and values. Key–value stores typically do not support join operations (e.g., “which
investigators have the Nobel and live in Texas?”). Many key–value stores also relax
consistency constraints and do not provide transactional semantics.
4.5.3 Other NoSQL databases
The simple structure of key–value stores allows for extremely fast
and scalable implementations. However, as we have seen, many
interesting data sets cannot easily be modeled as key–value pairs. Such
concerns have motivated the development of a variety of other NoSQL
systems that offer, for example, richer data models: document-
based (CouchDB and MongoDB), graph-based (Neo4J), and column-
based (Cassandra, HBase) databases.
In document-based databases, the value associated with a key
can be a structured document: for example, a JSON document,
permitting the following representation of our investigators.csv file
plus the additional information that we just introduced.
Key Value
Investigator_StevenWeinberg { institution : University of Texas at Austin,
email : weinberg@utexas.edu,
favcolor : Blue,
award : Nobel }
Investigator_HowardWeinberg { institution : University of North Carolina
Chapel Hill }
Investigator_IrvingWeinberg { institution : University of Maryland College
Park, email : irving@ucmc.edu }
Associated query languages may permit queries within the docu-
ment, such as regular expression searches, and retrieval of selected
fields, providing a form of a relational DBMS’s selection and projec-
tion capabilities (Section 4.3.2). For example, MongoDB allows us
to ask for documents in a collection called investigators that have
“University of Texas at Austin” as their institution and the Nobel as
an award.
db.investigators.find(
    { institution: 'University of Texas at Austin',
      award: 'Nobel' }
)
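The same query can be issued from Python. The following is a minimal sketch using the pymongo client, assuming a MongoDB server running locally; the database name (grants) is an assumption made for this sketch, and the inserted document mirrors the table above.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client.grants  # hypothetical database name

# Store one investigator document; no schema needs to be declared
db.investigators.insert_one({
    "_id": "Investigator_StevenWeinberg",
    "institution": "University of Texas at Austin",
    "email": "weinberg@utexas.edu",
    "favcolor": "Blue",
    "award": "Nobel",
})

# Retrieve documents matching an institution and an award
for doc in db.investigators.find(
        {"institution": "University of Texas at Austin", "award": "Nobel"}):
    print(doc["_id"], doc["email"])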
A column-oriented DBMS stores data tables by columns rather
than by rows, as is common practice in relational DBMSs. This
approach has advantages in settings where aggregates must fre-
quently be computed over many similar data items: for example, in
clinical data analysis. Google Cloud Bigtable and Amazon Redshift
are two cloud-hosted column-oriented NoSQL databases. HBase
and Cassandra are two open source systems with similar charac-
teristics. (Confusingly, the term column oriented is also often used
to refer to SQL database engines that store data in columns instead
of rows: for example, Google BigQuery, HP Vertica, Teradata, and
the open source MonetDB. Such systems are not to be confused
with column-based NoSQL databases.)
Graph databases store information about graph structures in
terms of nodes, edges that connect nodes, and attributes of nodes
and edges. Proponents argue that they permit particularly straight-
forward navigation of such graphs, as when answering queries such
as “find all the friends of the friends of my friends”—a task that
would require multiple joins in a relational database.
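As an illustration, such a query might look as follows in Neo4j, accessed here from Python; the connection details, the Person label, the FRIEND relationship type, and the name property are all assumptions made for this sketch.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # hypothetical credentials

# Friends of the friends of my friends (three hops), excluding myself
query = """
MATCH (me:Person {name: $name})-[:FRIEND]-()-[:FRIEND]-()-[:FRIEND]-(fofof)
WHERE fofof <> me
RETURN DISTINCT fofof.name AS name
"""

with driver.session() as session:
    for record in session.run(query, name="Steven Weinberg"):
        print(record["name"])

driver.close()

The equivalent relational query would require a three-way self-join on a friendship table.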
4.6 Spatial databases
Social science research commonly involves spatial data. Socio-
economic data may be associated with census tracts, data about
the distribution of research funding and associated jobs with cities
and states, and crime reports with specific geographic locations.
Furthermore, the quantity and diversity of such spatially resolved
data are growing rapidly, as are the scale and sophistication of the
systems that provide access to these data. For example, just one
urban data store, Plenario, contains many hundreds of data sets
about the city of Chicago [64].
Researchers who work with spatial data need methods for repre-
senting those data and then for performing various queries against
them. Does crime correlate with weather? Does federal spending
on research spur innovation within the locales where research oc-
curs? These and many other questions require the ability to quickly
determine such things as which points exist within which regions,
the areas of regions, and the distance between two points. Spatial
databases address these and many other related requirements.
Example: Spatial extensions to relational databases
Spatial extensions have been developed for many relational databases: for exam-
ple, Oracle Spatial, DB2 Spatial, and SQL Server Spatial. We use the PostGIS
extensions to the PostgreSQL relational database here. These extensions im-
plement support for spatial data types such as point, line, and polygon, and
operations such as st_within (returns true if one object is contained within
another), st_dwithin (returns true if two objects are within a specified dis-
tance of each other), and st_distance (returns the distance between two objects).
Thus, for example, given two tables with rows for schools and hospitals in Illinois
(illinois_schools and illinois_hospitals, respectively; in each case,
the column the_geom is a polygon for the object in question) and a third table
with a single row representing the city of Chicago (chicago_citylimits), we
can easily find the names of all schools within the Chicago city limits:
select illinois_schools.name
from illinois_schools, chicago_citylimits
where st_within(illinois_schools.the_geom,
chicago_citylimits.the_geom);
We join the two tables illinois_schools and chicago_citylimits,
with the st_within constraint constraining the selected rows to those represent-
ing schools within the city limits. Here we use the inner join introduced in Sec-
tion 4.3.2. This query could also be written as:
select illinois_schools.name
from illinois_schools left join chicago_citylimits
on st_within(illinois_schools.the_geom,
chicago_citylimits.the_geom);
We can also determine the names of all schools that do not have a hospital
within 3,000 meters:
select s.name as "School Name"
from illinois_schools as s
left join illinois_hospitals as h
on st_dwithin(s.the_geom, h.the_geom, 3000)
where h.gid is null;
Here, we use an alternative form of the join operator, the left join—or, more
precisely, the left excluding join. The expression
table1 left join table2 on constraint
returns all rows from the left table (table1), together with the matching
rows from the right table (table2); the right-hand side is null when there
is no match. This selection is illustrated by the left join entry of Table 4.4.
The addition of where h.gid is null then selects only those rows in the
left table with no right-hand match, as illustrated by the left excluding
join entry of Table 4.4. Note also the use of the as operator to rename
the tables illinois_schools and illinois_hospitals (to s and h); in this
case, we rename them simply to make our query more compact.

Table 4.4. Three types of join illustrated: the inner join, as used in Section 4.3.2,
the left join, and the left excluding join

Inner join:
select columns
from Table_A A
inner join Table_B B
on A.Key = B.Key

Left join:
select columns
from Table_A A
left join Table_B B
on A.Key = B.Key

Left excluding join:
select columns
from Table_A A
left join Table_B B
on A.Key = B.Key
where B.Key is null
4.7 Which database to use?
The question of which DBMS to use for a social sciences data man-
agement and analysis project depends on many factors. We in-
troduced some relevant rules in Table 4.1. We expand on those
considerations here.
4.7.1 Relational DBMSs
If your data are structured, then a relational DBMS is almost cer-
tainly the right technology to use. Many open source, commercial,
and cloud-hosted relational DBMSs exist. Among the open source
DBMSs, MySQL and PostgreSQL (often simply Postgres) are partic-
ularly widely used. MySQL is the most popular. It is particularly
easy to install and use, but does not support all features of the SQL
standard. PostgreSQL is fully standard compliant and supports
useful features such as full text search and the PostGIS extensions
mentioned in the previous section, but can be more complex to work
with.
Popular commercial relational DBMSs include IBM DB2, Mi-
crosoft SQL Server, and Oracle RDBMS. These systems are heav-
ily used in commercial settings. There are free community edi-
tions, and some large science projects use enterprise features via
academic licensing: for example, the Sloan Digital Sky Survey uses
Microsoft SQL Server [367] and the CERN high-energy physics lab
uses Oracle [132].
We also see increasing use being made of cloud-hosted relational
DBMSs such as Amazon Relational Database Service (RDS; this
supports MySQL, PostgreSQL, and various commercial DBMSs),
Microsoft Azure, and Google Cloud SQL. These systems obviate the
need to install local software, administer a DBMS, or acquire hard-
ware to run and scale your database. Particularly if your database
is bigger than can fit on your workstation, a cloud-hosted solution
can be a good choice.
4.7.2 NoSQL DBMSs
Few social science problems have the scale that might motivate the
use of a NoSQL DBMS. Furthermore, while defining and enforc-
ing a schema can involve some effort, the benefits of so doing are
considerable. Thus, the use of a relational DBMS is usually to be
recommended.
Nevertheless, as noted in Section 4.2, there are occasions when a
NoSQL DBMS can be highly effective, such as when working with
large quantities of unstructured data. For example, researchers
analyzing large collections of Twitter messages frequently store the
messages in a NoSQL document-oriented database such as Mon-
goDB. NoSQL databases are also often used to organize large num-
bers of records from many different sources, as illustrated in Fig-
ure 4.1.
4.8 Summary
A key message of this book is that you should, whenever possible,
use a database. Database management systems are one of the great
achievements of information technology, permitting large amounts
of data to be stored and organized so as to allow rapid and reliable
exploration and analysis. They have become a central component
of a great variety of applications, from handling transactions in fi-
nancial systems to serving data published in websites. They are
particularly well suited for organizing social science data and for
supporting analytics for data exploration.
DBMSs provide an environment that greatly simplifies data man-
agement and manipulation. They make many easy things trivial,
and many hard things easy. They automate many other error-
prone, manual tasks associated with query optimization. While
they can be daunting to those unfamiliar with their concepts and
workings, they are in fact easy to use. A basic understanding of
databases and of when and how to use DBMSs is an important
element of the social data scientist’s knowledge base.
4.9 Resources
The enormous popularity of DBMSs means that there are many good
books to be found. Classic textbooks such as those by Silberschatz
et al. [347] and Ramakrishnan and Gherke [316] provide a great deal
of technical detail. The DB Engines website collects information on
DBMSs [351]. There are also many useful online tutorials,
and of course StackExchange and other online forums often have
answers to your technical questions.
Turning to specific technologies, the SQL Cookbook [260] pro-
vides a wonderful introduction to SQL. We also recommend the
SQL Cheatsheet [22] and a useful visual depiction of different SQL
join operators [259]. Two good books on the PostGIS geospatial ex-
tensions to the PostgreSQL database are the PostGIS Cookbook [85]
and PostGIS in Action [285]. The online documentation is also excel-
lent [306]. The monograph NoSQL Databases [364] provides much
useful technical detail.
We did not consider in this chapter native Extensible Markup
Language (XML) databases and Resource Description Framework (RDF)
triple stores, as these are not typically used for data management. How-
ever, they do play a fundamental role in metadata and knowledge
management. See, for example, Sesame [53, 336].
If you are interested in the state of database and data manage-
ment research, the recent Beckman Report [2] provides a useful
perspective.
Programming with Big Data
Chapter 5
Huy Vo and Claudio Silva
Big data is sometimes defined as data that are too big to fit onto
the analyst’s computer. This chapter provides an overview of clever
programming techniques that facilitate the use of data (often using
parallel computing). While the focus is on one of the most widely
used big data programming paradigms and its most popular imple-
mentation, Apache Hadoop, the goal of the chapter is to provide a
conceptual framework to the key challenges that the approach is
designed to address.
5.1 Introduction
There are many definitions of big data, but perhaps the most popu-
lar one is from Laney’s report [229] in 2001 that defines big data as
data with the three Vs: volume (large data sets), velocity (real-time
streaming data), and variety (various data forms). Other authors
have since proposed additional Vs for various purposes: veracity,
value, variability, and visualization. It is common nowadays to see
big data being associated with five or even seven Vs—the big data
definition itself is also getting “big” and difficult to keep track of.
A simpler but also broader definition of big data is data sets that
are “so large or complex that traditional data processing applica-
tions are inadequate,” as described by Wikipedia. We note that in
this case, the definition adapts to the task at hand, and it is also
tool dependent.
For example, we could consider a problem to involve big data if
a spreadsheet gets so large that Excel can no longer load the entire
data set into memory for analysis—or if there are so many features
in a data set that a machine learning classifier would take an unreasonable
amount of time (say, days) to finish instead of a few seconds
or minutes (see Chapter 6, in particular Section 6.5.2). In such
cases, the analyst has to develop customized
solutions to interrogate the data, often by taking advantage of paral-
lel computing. In particular, one may have to get a larger computer,
with more memory and/or stronger processors, to cope with expen-
sive computations; or get a cluster of machines to speed up the
computing time by distributing the workload among them. While
the former solution would not scale well due to the limited number
of processors a single computer can have, the latter solution needs
to deal with nontraditional programming infrastructures. Big data
technologies, in a nutshell, are the technologies that make these
infrastructures more usable by users without a computer science
background.
Parallel computing and big data are hardly new ideas for deal-
ing with computational challenges. Scientists have routinely been
working on data sets much larger than a single machine can han-
dle for several decades, especially at the DOE National Laborato-
ries [87,337] where high-performance computing has been a major
technology trend. This is also demonstrated by the history of re-
search in distributed computing and data management going back
to the 1980s [401].
There are major technological differences between this type of
big data technology and what is covered in this chapter. True par-
allel computing involves designing clever parallel algorithms, often
from scratch, taking into account machine-dependent constraints
to achieve maximum performance from the particular architecture.
They are often implemented using message passing libraries, such
as implementations of the Message Passing Interface (MPI) stan-
dard [142] (available since the mid-1990s). Often the expectation is
that the code developed will be used for substantial computations.
In that scenario it makes sense to optimize a code for maximum
performance knowing the same code will be used repeatedly. Con-
trast that with big data for exploration, where we might need to
keep changing the analysis code very often. In this case, what we
are trying to optimize is the analyst’s time, not computing time.
With storage and networking getting significantly cheaper and
faster, big data sets could easily become available to data enthu-
siasts with just a few mouse clicks, e.g., the Amazon Web Service
Public Data Sets [11]. These enthusiasts may be policymakers, gov-
ernment employees, or managers who would like to draw insights
and (business) value from big data. Thus, it is crucial for big data
to be made available to nonexpert users in such a way that they
can process the data without the need for a supercomputing ex-
pert. One such approach is to build big data frameworks within
which commands can be implemented just as they would in a small
data framework. Also, such a framework should be as simple as
possible, even if not as efficient as custom-designed parallel solu-
tions. Users should expect that if their code works within these
frameworks for small data, it will also work for big data.
In order to achieve scalability for big data, many frameworks only
implement a small subset of data operations and fully automate the
parallelism of these operations. Users are expected to develop their
codes using only the subset if they expect their code to scale to
large data sets. MapReduce, one of the most widely used big data
programming paradigms, is no exception to this rule. As its name
suggests, the framework only supports two operations: map and
reduce. The next sections will provide an overview of MapReduce
and its most popular implementation, Apache Hadoop.
5.2 The MapReduce programming model
The MapReduce framework was proposed by Jeffrey Dean and San-
jay Ghemawat at Google in 2004 [91]. Its origins date back to
conceptually similar approaches first described in the early 1980s.
MapReduce was indeed inspired by the map and reduce functions
of functional programming, though its reduce function is more of a
group-by-key function, producing a list of values, instead of the tra-
ditional reduce, which outputs only a single value. MapReduce is a
record-oriented model, where each record is treated as a key–value
pair; thus, both map and reduce functions operate on key–value pair
data.
A typical MapReduce job is composed of three phases—map,
shuffle, and reduce—taking a list of key–value pairs
$[(k_1, v_1), (k_2, v_2), \ldots, (k_n, v_n)]$ as input. In the map phase,
each input key–value pair is run through the map function, and zero
or more new key–value pairs are output. In the shuffle phase, the
framework sorts the outputs of the map phase, grouping pairs by
keys before sending each of them to the reduce function. In the
reduce phase, each grouping of values is processed by the reduce
function, and the result is a list of new values that are collected for
the job output.
In brief, a MapReduce job just takes a list of key–value pairs as
input and produces a list of values as output. Users only need to
implement interfaces of the map and reduce functions (the shuffle
phase is not very customizable) and can leave it up to the system
that implements MapReduce to handle all data communications
and parallel computing. We can summarize the MapReduce logic
as follows:
map: $(k_i, v_i) \rightarrow [f(k'_{i1}, v'_{i1}), f(k'_{i2}, v'_{i2}), \ldots]$ for a user-defined function $f$;

reduce: $(k''_i, [v''_{i1}, v''_{i2}, \ldots]) \rightarrow [v'''_1, v'''_2, \ldots]$, where $\{k'_i\} \equiv \{k''_j\}$ and $v''' = g(v'')$ for a user-defined function $g$.
Example: Counting NSF awards
To gain a better understanding of these MapReduce operators, imagine that we
have a list of NSF principal investigators, along with their email information and
award IDs as below. Our task is to count the number of awards for each institu-
tion. For example, given the four records below, we will discover that the Berkeley
Geochronology Center has two awards, while New York University and the Univer-
sity of Utah each have one.
AwardId,FirstName,LastName,EmailAddress
0958723,Roland,Mundil,rmundil@bgc.org
0958915,Randall,Irmis,irmis@umnh.utah.edu
1301647,Zaher,Hani,zh8@nyu.edu
1316375,David,Shuster,dshuster@bgc.org
We observe that institutions can be distinguished by their email address do-
main name. Thus, we adopt a strategy of first grouping all award IDs by domain
name, and then counting the number of distinct awards within each group. In
order to do this, we first set the map function to scan input lines and extract insti-
tution information and award IDs. Then, in the reduce function, we simply count
unique IDs on the data, since everything is already grouped by institution. Python
pseudo-code is provided in Listing 5.1.
In the map phase, the input will be transformed into tuples of institutions and
award IDs:
"0958723,Roland,Mundil,rmundil@bgc.org"
    → ("bgc.org", 0958723)
"0958915,Randall,Irmis,irmis@umnh.utah.edu"
    → ("utah.edu", 0958915)
"1301647,Zaher,Hani,zh8@nyu.edu"
    → ("nyu.edu", 1301647)
"1316375,David,Shuster,dshuster@bgc.org"
    → ("bgc.org", 1316375)
Then the tuples will be grouped by institution and counted by the reduce
function:
("bgc.org", [0958723,1316375])
    → ("bgc.org", 2)
("utah.edu", [0958915])
    → ("utah.edu", 1)
("nyu.edu", [1301647])
    → ("nyu.edu", 1)
# Input : a list of text lines
# Output : a list of domain names and award ids
def MAP(lines):
    for line in lines:
        fields = line.strip('\n').split(',')
        awardId = fields[0]
        # Keep the last two components of the email domain, e.g., "bgc.org"
        domainName = '.'.join(fields[3].split('@')[-1].split('.')[-2:])
        yield (domainName, awardId)

# Input : a list of domain names and award ids
# Output : a list of domain names and award counts
def REDUCE(pairs):
    for (domainName, awardIds) in pairs:
        count = len(set(awardIds))
        yield (domainName, count)
Listing 5.1. Python pseudo-code for the map and reduce functions to count the
number of awards per institution
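Because the pseudo-code in Listing 5.1 is written as ordinary Python generators, its logic can be checked locally by simulating the shuffle phase with a dictionary. The small driver below is a sketch that assumes the MAP and REDUCE functions of Listing 5.1 are defined in the same file.

from collections import defaultdict

lines = [
    "0958723,Roland,Mundil,rmundil@bgc.org",
    "0958915,Randall,Irmis,irmis@umnh.utah.edu",
    "1301647,Zaher,Hani,zh8@nyu.edu",
    "1316375,David,Shuster,dshuster@bgc.org",
]

# Simulated shuffle phase: group the mapper output by key
groups = defaultdict(list)
for key, value in MAP(lines):
    groups[key].append(value)

# Reduce phase: count unique award IDs per domain
for domainName, count in REDUCE(groups.items()):
    print('%s\t%s' % (domainName, count))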
As we have seen so far, the MapReduce programming model is
quite simple and straightforward, yet it supports a powerful paral-
lelization model. In fact, it has been said to be too simple and criti-
cized as “a major step backwards” [94] for large-scale, data-intensive
applications. It is hard to argue that MapReduce is offering some-
thing truly innovative when MPI has been offering similar scatter
and reduce operations since 1995, and Python has had higher-order
functions (map, reduce, filter, and lambda) since its 1.0 release in
1994. However, the biggest strength of MapReduce is its simplic-
ity. Its simple programming model has brought many nonexpert
users to big data analysis. Its simple architecture has also inspired
many developers to develop advanced capabilities, such as support
for distributed computing, data partitioning, and streaming pro-
cessing. (A downside of this diversity of interest is that available
features and capabilities can vary considerably, depending on the
specific implementation of MapReduce that is being used.)
We next describe two specific implementations of the MapReduce
model: Hadoop and Spark.
5.3 Apache Hadoop MapReduce
The names MapReduce and Apache Hadoop (or Hadoop) are often
used interchangeably, but they are conceptually different. (The term
Hadoop refers to the creator’s son’s toy elephant.) MapRe-
duce is simply a programming paradigm with a layer of abstraction
that allows a set of data processing pipelines to be expressed without
specifying exactly how they will be executed. The
MapReduce model tells us which class of data structures (key–value
pairs) and data transformations (map and reduce) it supports;
however, it does not specifically state how the framework should
be implemented; for example, it does not specify how data should
be stored or how the computation should be executed (in particular,
parallelized). Hadoop [399], on the other hand, is a specific imple-
mentation of MapReduce with exact specifications of how data and
computation are handled inside the system.
Hadoop was originally designed for batch data processing at
scale, with the target of being able to run in environments with
thousands of machines. Supporting such a large computing envi-
ronment puts several constraints on the system; for instance, with
so many machines, the system had to assume computing nodes
would fail. Hadoop is an enhanced MapReduce implementation
with the support for fault tolerance, distributed storage, and data
parallelism through two added key design features: (1) a distributed
file system called the Hadoop Distributed File System (HDFS); and
(2) a data distribution strategy that allows computation to be moved
to the data during execution.
5.3.1 The Hadoop Distributed File System
The Hadoop Distributed File System [345] is arguably the most im-
portant component that Hadoop added to the MapReduce frame-
work. In a nutshell, it is a distributed file system that stripes data
across all the nodes of a Hadoop cluster. HDFS splits large data
files into smaller blocks which are managed by different nodes in
the cluster. Each block is also replicated across several nodes as an
attempt to ensure that a full copy of the data is still available even
in the case of computing node failures. The block size as well as
the number of replications per block are fully customizable by users
when they create files on HDFS. By default, the block size is set
to 64 MB with a replication factor of 3, meaning that the system
can tolerate up to two concurrent node failures without losing
any data. HDFS also actively monitors failures and re-replicates
blocks from failed nodes to make sure that the number of replications
for each block always stays at the user-defined setting. Thus, if a
node fails, and only two copies of some data exist, the system will
quickly copy those data to a working node, thus raising the num-
ber of copies to three again. This dynamic replication is the primary
mechanism for fault tolerance in Hadoop.
Note that data blocks are replicated and distributed across sev-
eral machines. This could create a problem for users, because
if they had to manage the data manually, they might, for exam-
ple, have to access more than one machine to fetch a large data
file. Fortunately, Hadoop provides infrastructure for managing this
complexity, including command line programs as well as an API
that users can employ to interact with HDFS as if it were a local file
system.
This is one example that reinforces our discussion on big data
technology being all about making things work seamlessly regard-
less of the computational environment; it should be possible for the
user to use the system as though they are using their local worksta-
tion. Hadoop and HDFS are great examples of this approach. For
example, one can run ls and mkdir to list and create a directory on
HDFS, or even use tail to inspect file contents as one would expect
in a Linux file system. The following code shows some examples of
interacting with HDFS.
# Creating a temporary folder
hadoop dfs -mkdir /tmp/mytmp
# Upload a CSV file from our local machine to HDFS
hadoop dfs -put myfile.csv /tmp/mytmp
# Listing all files under mytmp folder
hadoop dfs -ls /tmp/mytmp
# Upload another file with five replications and 128MB per block
hadoop -D dfs.replication=5 -D dfs.block.size=128M \
dfs -put mylargefile.csv /tmp/mytmp
# Download a file to our local machine
hadoop dfs -get /tmp/mytmp/myfile.csv .
5.3.2 Hadoop: Bringing compute to the data
The configuration of parallel computing infrastructure is a fairly
complex task. At the risk of oversimplifying, we consider the com-
puting environment as comprising a compute cluster with substan-
tial computing power (e.g., thousands of computing cores), and a
storage cluster with petabytes of disk space, capable of storing and
serving data quickly to the compute cluster. These two clusters
have quite different hardware specifications: the first is optimized
for CPU performance and the second for storage capacity. The two
systems are typically configured as separate physical hardware.
Running compute jobs on such hardware often goes like this.
When a user requests to run an intensive task on a particular data
Figure 5.1. (a) The traditional parallel computing model where data are brought to
the computing nodes. (b) Hadoop’s parallel computing model: bringing compute
to the data [242]
set, the system will first reserve a set of computing nodes. Then
the data are partitioned and copied from the storage server into
these computing nodes before the task is executed. This process is
illustrated in Figure 5.1(a). This computing model will be referred
to as bringing data to computation. In this model, if a data set
is being analyzed in multiple iterations, it is very likely that the
data will be copied multiple times from the storage cluster to the
compute nodes without reusability. This is because the compute
node scheduler normally does not have or keep knowledge of where
data have previously been held. The need to copy data multiple
times tends to make such a computation model inefficient, and I/O
becomes the bottleneck when all tasks constantly pull data from
the storage cluster. This in turn leads to poor
scalability; adding more nodes to the computing cluster would not
increase its performance.
To solve this problem, Hadoop implements a bring compute to the
data strategy that combines both computing and storage at each
node of the cluster. In this setup, each node offers both computing
power and storage capacity. As shown in Figure 5.1(b), when users
submit a task to be run on a data set, the scheduler will first look
for nodes that contain the data, and if the nodes are available, it
will schedule the task to run directly on those nodes. If a node is
busy with another task, data will still be copied to available nodes,
but the scheduler will maintain records of the copy for subsequent
use of the data. In addition, data copying can be minimized by
increasing the data duplication in the cluster, which also increases
the potential for parallelism, since the scheduler has more choices
to allocate computing without copying. Since both the compute and
data storage are closely coupled for this model, it is best suited for
data-intensive applications.
Given that Hadoop was designed for batch data processing at
scale, this model fits the system nicely, especially with the sup-
port of HDFS. However, in an environment where tasks are more
compute intensive, a traditional high-performance computing envi-
ronment is probably best since it tends to spend more resources on
CPU cores. It should be clear now that the Hadoop model has hard-
ware implications, and computer architects have optimized systems
for data-intensive computing.
Now that we are equipped with the knowledge that Hadoop is
a MapReduce implementation that runs on HDFS and a bring-
compute-to-the-data model, we can go over the design of a Hadoop
MapReduce job. A MapReduce job is still composed of three phases:
map, shuffle, and reduce. However, Hadoop divides the map and
reduce phases into smaller tasks.
Each map phase in Hadoop is divided into five tasks: input
format, record reader, mapper, combiner, and partitioner. An input
format task is in charge of talking to the input data presumably
sitting on HDFS, and splitting it into partitions (e.g., by breaking
lines at line breaks). Then a record reader task is responsible for
translating the split data into the key–value pair records so that
they can be processed by the mapper. By default, Hadoop parses
files into key–value pairs of line numbers and line contents. How-
ever, both input formats and record readers are fully customizable
and can be programmed to read custom data including binary files.
It is important to note that input formats and record readers only
provide data partitioning; they do not move data around computing
nodes.
After the records are generated, mappers are spawned—typically
on nodes containing the blocks—to run through these records and
output zero or more new key–value pairs. A mapper in Hadoop
is equivalent to the map function of the MapReduce model that we
discussed earlier. The selection of the key to be output from the
mapper will heavily depend on the data processing pipeline and
could greatly affect the performance of the framework. Mappers are
executed concurrently in Hadoop as long as resources permit.
A combiner task in Hadoop is similar to a reduce function in the
MapReduce framework, but it only works locally at each node: it
takes output from mappers executed on the same node and pro-
duces aggregated values. Combiners are optional but can be used
to greatly reduce the amount of data exchange in the shuffle phase;
thus, users are encouraged to implement this whenever possible.
A common practice, when a reduce function is both commutative
and associative and has the same input and output format, is to
use the reduce function itself as the combiner. Nevertheless,
combiners are not guaranteed to be executed by Hadoop, so this
should only be treated as a hint. Its execution must not affect the
correctness of the program.
A partitioner task is the last process taking place in the map
phase on each mapper node, where it hashes the key of each key–
value pair output from the mappers or the combiners into bins.
By default, the partitioner uses object hash codes and modulus
operations to direct a designated reducer to pull data from a map
node. Though it is possible to customize the partitioner, it is only
advisable to do so when one fully understands the intermediate data
distribution as well as the specifications of the cluster. In general,
it is better to leave this job to Hadoop.
Each reduce phase in Hadoop is divided into three tasks: reducer,
output format, and record writer. The reducer task is equiv-
alent to the reduce function of the MapReduce model. It basically
groups the data produced by the mappers by keys and runs a reduce
function on each list of grouping values. It outputs zero or more
key–value pairs for the output format task, which then translates
them into a writable format for the record writer task to serialize
on HDFS. By default, Hadoop will separate the key and value with
a tab and write separate records on separate lines. However, this
behavior is fully customizable. As in the map phase, reducers
are executed concurrently in Hadoop.
5.3.3 Hardware provisioning
Hadoop requires a distributed cluster of machines to operate ef-
ficiently. (It can be set up to run entirely on a single computer,
but this should only be done for technology demonstration pur-
poses.) This is mostly because the MapReduce performance heavily
depends on the total I/O throughput (i.e., disk read and write) of
the entire system. Having a distributed cluster, where each ma-
chine has its own set of hard drives, is one of the most efficient
ways to maximize this throughput.
Figure 5.2. Data transfer and communication of a MapReduce job in Hadoop.
Data blocks are assigned to several maps, which emit key–value pairs that are
shuffled and sorted in parallel. The reduce step emits one or more pairs, with
results stored on the HDFS
A typical Hadoop cluster consists of two types of machine: mas-
ters and workers. Master machines are those exclusively reserved
for running services that are critical to the framework operations.
Some examples are the NameNode and the JobTracker services,
which are tasked to manage how data and tasks are distributed
among the machines, respectively. The worker machines are re-
served for data storage and for running actual computation tasks
(i.e., map and reduce). It is normal to have worker machines that
can be included or removed from an operational cluster on demand.
This ability to vary the number of worker nodes makes the overall
system more tolerant of failure. However, master machines are
usually required to be running uninterrupted.
Provisioning and configuring the hardware for Hadoop, as for any
other parallel computing system, are among the most important and
complex tasks in setting up a cluster, and they often require a lot
of experience and careful consideration. Major big data vendors
provide guidelines and tools to facilitate the process [18, 23, 81].
#!/usr/bin/env python
import sys

def parseInput():
    for line in sys.stdin:
        yield line

if __name__=='__main__':
    for line in parseInput():
        fields = line.strip('\n').split(',')
        awardId = fields[0]
        # Keep the last two components of the email domain, e.g., "bgc.org"
        domainName = '.'.join(fields[3].split('@')[-1].split('.')[-2:])
        print '%s\t%s' % (domainName, awardId)
Listing 5.2. A Hadoop streaming mapper in Python
Nevertheless, most decisions will be based on the types of analysis
to be run on the cluster, for which only you, as the user, can provide
the best input.
5.3.4 Programming language support
Hadoop is written entirely in Java, and thus it best supports appli-
cations written in Java. However, Hadoop also provides a streaming
API that allows arbitrary code to be run inside the Hadoop MapRe-
duce framework through the use of UNIX pipes. This means that we
can supply a mapper program written in Python or C++ to Hadoop
as long as that program reads from the standard input and writes to
the standard output. The same mechanism also applies for the com-
biner and reducer. For example, we can develop the Python
pseudo-code in Listing 5.1 into a complete Hadoop streaming mapper
(Listing 5.2) and reducer (Listing 5.3).
#!/usr/bin/env python
import sys

def parseInput():
    for line in sys.stdin:
        yield line

if __name__=='__main__':
    # Streaming delivers one sorted (domain, award) pair per line,
    # so consecutive lines with the same key are grouped here.
    currentDomain, awardIds = None, set()
    for line in parseInput():
        (domainName, awardId) = line.strip('\n').split('\t')
        if domainName != currentDomain and currentDomain is not None:
            print '%s\t%s' % (currentDomain, len(awardIds))
            awardIds = set()
        currentDomain = domainName
        awardIds.add(awardId)
    if currentDomain is not None:
        print '%s\t%s' % (currentDomain, len(awardIds))
Listing 5.3. A Hadoop streaming reducer in Python
It should be noted that in Hadoop streaming, intermediate key–
value pairs (the data flowing between mappers and reducers) must
be in tab-delimited format, thus we replace the original yield command
with a print formatted with tabs; the reducer also receives one
sorted key–value pair per line, so it must group consecutive lines
with the same key itself. Though the input format
and record reader are still customizable in Hadoop streaming, they
must be supplied as Java classes. This is one of the biggest limita-
tions of Hadoop for Python developers. They not only have to split
their code into separate mapper and reducer programs, but also
need to learn Java if they want to work with nontextual data.
5.3.5 Fault tolerance
By default, HDFS uses checksums to enforce data integrity on its
file system and uses data replication to recover from potential data losses.
Taking advantage of this, Hadoop also maintains fault tolerance of
MapReduce jobs by storing data at every step of a MapReduce job
to HDFS, including intermediate data from the combiner. Then the
system checks whether a task fails by either looking at its heart-
beats (data activities) or whether it has been taking too long. If a
task is deemed to have failed, Hadoop will kill it and run it again on
a different node. The time limit for the heartbeats and task running
duration may also be customized for each job. Though the mecha-
nism is simple, it works well on thousands of machines. It is indeed
highly robust because of the simplicity of the model.
5.3.6 Limitations of Hadoop
Hadoop is a great system, and probably the most widely used MapRe-
duce implementation. Nevertheless, it has important limitations, as
we now describe.
Performance: Hadoop has proven to be a scalable implemen-
tation that can run on thousands of cores. However, it is
also known for having a relatively high job setup overhead
and suboptimal running time. An empty task in Hadoop (i.e.,
with no mapper or reducer) can take roughly 30 seconds to
complete even on a modern cluster. This overhead makes it
unsuitable for real-time data or interactive jobs. The problem
comes mostly from the fact that Hadoop monitoring processes
only live within a job, so these processes need to be started and
stopped each time a job is submitted, which in turn results
in this major overhead. Moreover, the brute force approach of
maintaining fault tolerance by storing everything on HDFS is
expensive, especially for large data sets.
Hadoop streaming support for non-Java applications: As men-
tioned previously, non-Java applications may only be inte-
grated with Hadoop through the Hadoop streaming API. How-
ever, this API is far from optimal. First, input formats and
record readers can only be written in Java, making it impos-
sible to write advanced MapReduce jobs entirely in a differ-
ent language. Second, Hadoop streaming only communicates
with Hadoop through Unix pipes, and there is no support for
data passing within the application using native data struc-
tures (e.g., it is necessary to convert Python tuples into strings
in the mappers and convert them back into tuples again in
reducers).
Real-time applications: With the current setup, Hadoop only
supports batch data processing jobs. This is by design, so it is
not exactly a limitation of Hadoop. However, given that more
and more applications are dealing with real-time massive data
sets, the community using MapReduce for real-time processing
is constantly growing. Not having support for streaming or
real-time data is clearly a disadvantage of Hadoop over other
implementations.
Limited data transformation operations: This is more of a lim-
itation of MapReduce than Hadoop per se. MapReduce only
supports two operations, map and reduce, and while these op-
erations are sufficient to describe a variety of data processing
pipelines, there are classes of applications that MapReduce is
not suitable for. Beyond that, developers often find themselves
rewriting simple data operations such as data set joins, find-
ing a min or max, and so on. Sometimes these tasks require
more than one map-and-reduce operation, resulting in multi-
ple MapReduce jobs. This is both cumbersome and inefficient.
There are tools to automate this process for Hadoop; however,
they are only a layer above Hadoop, and they are not easy to integrate with
existing customized Hadoop applications.
5.4 Apache Spark
In addition to Apache Hadoop, other notable MapReduce implemen-
tations include MongoDB, GreenplumDB, Disco, Riak, and Spark.
MongoDB, Riak, and Greenplum DB are all database systems (see
Chapter 4), and thus their MapReduce implementations focus more on the inter-
operability of MapReduce and the core components such as Mon-
goDB’s aggregation framework, and leave it up to users to customize
the MapReduce functionalities for broader tasks. Some of these
systems, such as Riak, only parallelize the map phase, and run the
reduce phase on the local machine that requests the task. The main
advantage of the three implementations is the ease with which they
connect to specific data stores. However, their support for general
data processing pipelines is not as extensive as that of Hadoop.
Disco, similar to Hadoop, is designed to support MapReduce in
a distributed computing environment, but it is written in Erlang with
a Python interface. Thus, for Python developers, Disco might be a
better fit. However, it has significantly fewer supporting applica-
tions, such as access control and workflow integration, as well as a
smaller developer community. This is why the top three big data
platforms, Cloudera, Hortonworks, and MapR, still build primarily
on Hadoop.
Apache Spark is another implementation, one that aims to go
beyond MapReduce. The framework is centered around the con-
cept of resilient distributed data sets and data transformations that
can operate on these objects. An innovation in Spark is that the
fault tolerance of resilient distributed data sets can be maintained
without flushing data onto disks, thus significantly improving the
system performance (with a claim of being 100 times faster than
Hadoop). Instead, the fault-recovery process is done by replaying
a log of data transformations on check-point data. Though this
process could take longer than reading data straight from HDFS,
it does not occur often and is a fair tradeoff between processing
performance and recovery performance.
Beyond map and reduce, Spark also supports various other
transformations [147], including filter, data join, and aggregation.
Streaming computation can also be done in Spark by asking Spark
to reserve resources on a cluster to constantly stream data to/from
the cluster. However, this streaming method might be resource in-
tensive (it consumes resources even when no data are arriving).
Additionally, Spark plays well with the Hadoop ecosystem, particu-
larly with the distributed file system (HDFS) and resource manager
(YARN), making it possible to be built on top of current Hadoop
applications.
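To make the filter, join, and aggregation transformations mentioned above concrete, here is a minimal sketch using PySpark (Spark's Python interface, discussed further below); the data and variable names are invented purely for illustration.

from pyspark import SparkContext

sc = SparkContext(appName="Transformation sketch")

# Hypothetical (awardId, domain) and (domain, state) pairs
awards = sc.parallelize([("0958723", "bgc.org"), ("1316375", "bgc.org"),
                         ("1301647", "nyu.edu")])
states = sc.parallelize([("bgc.org", "CA"), ("nyu.edu", "NY")])

# filter: keep only awards from one institution
bgcAwards = awards.filter(lambda pair: pair[1] == "bgc.org")

# join: attach a state to each award via the domain key
byDomain = awards.map(lambda pair: (pair[1], pair[0]))  # (domain, awardId)
joined = byDomain.join(states)                          # (domain, (awardId, state))

# aggregation: count awards per domain
counts = byDomain.mapValues(lambda _: 1).reduceByKey(lambda a, b: a + b)

print(bgcAwards.count())   # 2
print(joined.collect())    # e.g., [('bgc.org', ('0958723', 'CA')), ...]
print(counts.collect())    # e.g., [('bgc.org', 2), ('nyu.edu', 1)]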
Another advantage of Spark is that it supports Python natively;
thus, developers can write and run Spark programs in a fraction of the time required for
Hadoop. Listing 5.4 provides the full code for the previous example
import sys
from pyspark import SparkContext

def mapper(lines):
    for line in lines:
        fields = line.strip('\n').split(',')
        awardId = fields[0]
        # Keep the last two components of the email domain, e.g., "bgc.org"
        domainName = '.'.join(fields[3].split('@')[-1].split('.')[-2:])
        yield (domainName, awardId)

def reducer(pairs):
    for (domainName, awardIds) in pairs:
        count = len(set(awardIds))
        yield (domainName, count)

if __name__=='__main__':
    hdfsInputPath = sys.argv[1]
    hdfsOutputFile = sys.argv[2]
    sc = SparkContext(appName="Counting Awards")
    output = sc.textFile(hdfsInputPath) \
               .mapPartitions(mapper) \
               .groupByKey() \
               .mapPartitions(reducer)
    output.saveAsTextFile(hdfsOutputFile)
Listing 5.4. Python code for a Spark program that counts the number of awards
per institution using MapReduce
written entirely in Spark. It should be noted that Spark’s reduceByKey
operator is not the same as Hadoop’s reduce, as it is designed to merge
all the values associated with a key into a single value. The closest
simulation of Hadoop’s MapReduce pattern is a combination of
mapPartitions, groupByKey, and mapPartitions, as
shown in the next example.
Example: Analyzing home mortgage disclosure
application data
We use a financial services analysis problem to illustrate the use of Apache Spark.
Mortgage origination data provided by the Consumer Financial Protection Bureau
provide insightful details of the financial health of the real estate market.
The data [84], which are a product of the Home Mortgage Disclosure Act (HMDA),
highlight key attributes that function as strong indicators of health and lending
patterns.
Lending institutions, as defined by section 1813 in Title 12 of the HMDA, decide
on whether to originate or deny mortgage applications based on credit risk. In
order to determine this credit risk, lenders must evaluate certain features relative
Table 5.1. Home Mortgage Disclosure Act data size
Year Records File Size (Gigabytes)
2007 26,605,696 18
2008 17,391,571 12
2009 19,493,492 13
2010 16,348,558 11
2011 14,873,416 9.4
2012 18,691,552 12
2013 17,016,160 11
Total 130,420,445 86.4
Table 5.2. Home Mortgage Disclosure Act data fields
Index Attribute Type
0 Year Integer
1 State String
2 County String
3 Census Tract String
4 Loan Amount Float
5 Applicant Income Float
6 Loan Originated Boolean
··· ··· ···
to the applicant, the underlying property, and the location. We want to determine
whether census tract clusters could be created based on mortgage application
data and whether lending institutions’ perception of risk is held constant across
the entire USA.
For the first step of this process, we study the debt–income ratio for loans
originating in different census tracts. This could be achieved simply by computing
the debt–income ratio for each loan application and aggregating them for each
year by census tract number. A challenge, however, is that the data set provided
by HMDA is quite extensive. In total, HMDA data contain approximately 130
million loan applications between 2007 and 2013. As each record contains 47
attributes, varying in types from continuous variables such as loan amounts and
applicant income to categorical variables such as applicant gender, race, loan type,
and owner occupancy, the entire data set results in about 86 GB of information.
Parsing the data alone could take hours on a single machine if using a naïve
approach that scans through the data sequentially. Tables 5.1 and 5.2 highlight
the breakdown in size per year and data fields of interest.
Observing the transactional nature of the data, where the aggregation process
could be distributed and merged across multiple partitions of the data, we could
complete this task in much less time by using Spark. Using a cluster consisting
of 1,200 cores, the Spark program in Listing 5.5 took under a minute to complete.
The substantial performance gain comes not so much from the large number of
processors available, but mostly from the large I/O bandwidth available on the
cluster thanks to the 200 distributed hard disks and fast network interconnects.
5.5 Summary
Big data means that it is necessary to both store very large collec-
tions of data and perform aggregate computations on those data.
This chapter spelled out an important data storage approach (the
Hadoop Distributed File System) and a way of processing large-scale
import ast
import sys
from pyspark import SparkContext

def mapper(lines):
    for line in lines:
        fields = ast.literal_eval('(%s)' % line)
        (year, state, county, tract) = fields[:4]
        (amount, income, originated) = fields[4:]
        key = (year, state, county, tract)
        value = (amount, income)
        # Only count originated loans
        if originated:
            yield (key, value)

def sumDebtIncome(debtIncome1, debtIncome2):
    return (debtIncome1[0] + debtIncome2[0], debtIncome1[1] + debtIncome2[1])

if __name__=='__main__':
    hdfsInputPath = sys.argv[1]
    hdfsOutputFile = sys.argv[2]
    sc = SparkContext(appName="Counting Awards")
    sumValues = sc.textFile(hdfsInputPath) \
                  .mapPartitions(mapper) \
                  .reduceByKey(sumDebtIncome)
    # Actually compute the aggregated debt-income ratio
    output = sumValues.mapValues(lambda debtIncome: debtIncome[0]/debtIncome[1])
    output.saveAsTextFile(hdfsOutputFile)
Listing 5.5. Python code for a Spark program to aggregate the debt–income ratio
for loans originated in different census tracts
data sets (the MapReduce model, as implemented in both Hadoop
and Spark). This in-database processing model enables not only
high-performance analytics but also more flexibility for analysts
to work with data. Instead of going through a
database administrator for every data gathering, data ingestion, or
data transformation task, as in a traditional data warehouse
approach, the analysts instead “own” the data in the big data en-
vironment. This increases the analytic throughput and shortens the
time to insight, speeding up the decision-making process and thus
increasing the business impact, which is one of the main drivers for
big data analytics.
5.6 Resources
There are a wealth of online resources describing both Hadoop and
Spark. See, for example, the tutorials on the Apache Hadoop [19]
and Spark [20] websites. Albanese describes how to use Hadoop for
social science [9], and Lin and Dyer discuss the use of MapReduce
for text analysis [238].
We have not discussed here how to deal with data that are lo-
cated remotely from your computer. If such data are large, then
moving them can be an arduous task. The Globus transfer ser-
vice [72] is commonly used for the transfer and sharing of such
data. It is available on many research systems and is easy
to install on your own computer.
Part II
Modeling and Analysis
Machine Learning
Chapter 6
Rayid Ghani and Malte Schierholz
This chapter introduces you to the value of machine learning in the
social sciences, particularly focusing on the overall machine learn-
ing process as well as clustering and classification methods. You
will get an overview of the machine learning pipeline and methods
and how those methods are applied to solve social science problems.
The goal is to give an intuitive explanation for the methods and to
provide practical tips on how to use them in practice.
6.1 Introduction
You have probably heard of “machine learning” but are not sure
exactly what it is, how it differs from traditional statistics, and what
you can do with it. In this chapter, we will demystify machine
learning, draw connections to what you already know from statis-
tics and data analysis, and go deeper into some of the unique con-
cepts and methods that have been developed in this field. Although
the field originates from computer science (specifically, artificial in-
telligence), it has been influenced quite heavily by statistics in the
past 15 years. As you will see, many of the concepts you will learn
are not entirely new, but are simply called something else. For
example, you already are familiar with logistic regression (a classi-
fication method that falls under the supervised learning framework
in machine learning) and cluster analysis (a form of unsupervised
learning). You will also learn about new methods that are more
exclusively used in machine learning, such as random forests and
support vector machines. We will keep formalisms to a minimum
and focus on getting the intuition across, as well as providing practi-
cal tips. Our hope is this chapter will make you comfortable and fa-
miliar with machine learning vocabulary, concepts, and processes,
and allow you to further explore and use these methods and tools
in your own research and practice.
6.2 What is machine learning?
When humans improve their skills with experience, they are said to
learn. Is it also possible to program computers to do the same?
Arthur Samuel, who coined the term machine learning in 1959
[323], was a pioneer in this area, programming a computer to play
checkers. The computer played against itself and human oppo-
nents, improving its performance with every game. Eventually,
after sufficient training (and experience), the computer became a
better player than the human programmer. Today, machine learn-
ing has grown significantly beyond learning to play checkers. Ma-
chine learning systems have learned to drive (and park) autonomous
cars, are embedded inside robots, can recommend books, prod-
ucts, and movies we are (sometimes) interested in, identify drugs,
proteins, and genes that should be investigated further to cure dis-
eases, detect cancer and other diseases in medical imaging, help
us understand how the human brain learns language, help identify
which voters are persuadable in elections, detect which students are
likely to need extra support to graduate high school on time, and
help solve many more problems. Over the past 20 years, machine
learning has become an interdisciplinary field spanning computer
science, artificial intelligence, databases, and statistics. At its core,
machine learning seeks to design computer systems that improve
over time with more experience. In one of the earlier books on ma-
chine learning, Tom Mitchell gives a more operational definition,
stating that: “A computer program is said to learn from experience
E with respect to some class of tasks T and performance measure
P, if its performance at tasks in T, as measured by P, improves with
experience E” [258].
Machine learning grew from the need to build systems that were
adaptive, scalable, and cost-effective to build and maintain. A lot
of tasks now being done using machine learning used to be done
by rule-based systems, where experts would spend considerable
time and effort developing and maintaining the rules. The problem
with those systems was that they were rigid, not adaptive, hard to
scale, and expensive to maintain. Machine learning systems started
becoming popular because they could improve the system along all
of these dimensions. Box 6.1 mentions several examples where
machine learning is being used in commercial applications today
(see Chapter 3).
Social scientists are uniquely placed today to take advantage of the
same advances in machine learning by having better methods to
solve several key problems they are tackling. We will give concrete
examples later in this chapter.
Box 6.1: Commercial machine learning examples
Speech recognition: Speech recognition software uses ma-
chine learning algorithms that are built on large amounts
of initial training data. Machine learning allows these sys-
tems to be tuned and adapt to individual variations in
speaking as well as across different domains.
Autonomous cars: The ongoing development of self-
driving cars applies techniques from machine learning.
An onboard computer continuously analyzes the incom-
ing video and sensor streams in order to monitor the sur-
roundings. Incoming data are matched with annotated
images to recognize objects like pedestrians, traffic lights,
and potholes. In order to assess the different objects, huge
training data sets are required where similar objects al-
ready have been identified. This allows the autonomous
car to decide on which actions to take next.
Fraud detection: Many public and private organizations
face the problem of fraud and abuse. Machine learning
systems are widely used to take historical cases of fraud
and flag fraudulent transactions as they take place. These
systems have the benefit of being adaptive, and improving
with more data over time.
Personalized ads: Many online stores have personalized
recommendations promoting possible products of interest.
Based on individual shopping history and what other sim-
ilar users bought in the past, the website predicts prod-
ucts a user may like and tailors recommendations. Netflix
and Amazon are two examples of companies whose recom-
mendation software predicts how a customer would rate
a certain movie or product and then suggests items with
the highest predicted ratings. Of course there are some
caveats here, since they then adjust the recommendations
to maximize profits.
Face recognition: Surveillance systems, social network-
ing platforms, and imaging software all use face detec-
tion and face recognition to first detect faces in images
(or video) and then tag them with individuals for various
tasks. These systems are trained by giving examples of
faces to a machine learning system which then learns to
detect new faces, and tag known individuals.
This chapter is not an exhaustive introduction to machine learn-
ing. There are many books that have done an excellent job of
that [124, 159, 258]. Instead, we present a short and understand-
able introduction to machine learning for social scientists, give an
overview of the overall machine learning process, provide an intu-
itive introduction to machine learning methods, give some practical
tips that will be helpful in using these methods, and leave a lot of
the statistical theory to machine learning textbooks. As you read
more about machine learning in the research literature or the me-
dia, you will encounter names of other fields that are related (and
practically the same for most social science audiences), such as
statistical learning, data mining, and pattern recognition.
6.3 The machine learning process
When solving problems using machine learning methods, it is im-
portant to think of the larger data-driven problem-solving process of
which these methods are a small part (see Chapter 10). A typical machine learning
problem requires researchers and practitioners to take the following
steps:
1. Understand the problem and goal: This sounds obvious but is
often nontrivial. Problems typically start as vague descriptions
of a goal—improving health outcomes, increasing graduation
rates, understanding the effect of a variable X on an outcome
Y, etc. It is really important to work with people who under-
stand the domain being studied to dig deeper and define the
problem more concretely. What is the analytical formulation
of the metric that you are trying to optimize?
2. Formulate it as a machine learning problem: Is it a classifica-
tion problem or a regression problem? Is the goal to build a
model that generates a ranked list prioritized by risk, or is it to
detect anomalies as new data come in? Knowing what kinds
of tasks machine learning can solve will allow you to map the
problem you are working on to one or more machine learning
settings and give you access to a suite of methods.
3. Data exploration and preparation: Next, you need to carefully
explore the data you have. What additional data do you need
or have access to? What variable will you use to match records
for integrating different data sources? What variables exist in
the data set? Are they continuous or categorical? What about
missing values? Can you use the variables in their original
form or do you need to alter them in some way?
4. Feature engineering: In machine learning language, what you
might know as independent variables or predictors or factors
or covariates are called “features.” Creating good features is
probably the most important step in the machine learning pro-
cess. This involves doing transformations, creating interaction
terms, or aggregating over data points or over time and space
(a brief code sketch follows this list).
5. Method selection: Having formulated the problem and created
your features, you now have a suite of methods to choose from.
It would be great if there were a single method that always
worked best for a specific type of problem, but that would
make things too easy. Typically, in machine learning, you
take a collection of methods and try them out to empirically
validate which one works the best for your problem. We will
give an overview of leading methods that are being used today
in this chapter.
6. Evaluation: As you build a large number of possible models,
you need a way to select the model that is the best. This
part of the chapter will cover the validation methodology to
first validate the models on historical data as well as discuss a
variety of evaluation metrics. The next step is to validate using
a field trial or experiment.
7. Deployment: Once you have selected the best model and val-
idated it using historical data as well as a field trial, you are
ready to put the model into practice. You still have to keep in
mind that new data will be coming in, and the model might
change over time. We will not cover too much of those aspects
in this chapter, but they are important to keep in mind.
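Returning to step 4, feature engineering, the following is a minimal,
hypothetical sketch using pandas: the data frame, column names, and
aggregation choices are made up and simply illustrate transformations,
aggregation over time, and an interaction term.

import numpy as np
import pandas as pd

# Hypothetical raw data: one row per student per term.
raw = pd.DataFrame({
    "student_id": [1, 1, 2, 2],
    "term":       [1, 2, 1, 2],
    "gpa":        [3.1, 2.7, 3.8, 3.9],
    "absences":   [4, 9, 0, 1],
})

# Aggregate over time: one row per student with summary features.
features = raw.groupby("student_id").agg(
    mean_gpa=("gpa", "mean"),
    gpa_change=("gpa", lambda s: s.iloc[-1] - s.iloc[0]),
    total_absences=("absences", "sum"),
).reset_index()

# A simple transformation and an interaction term.
features["log_absences"] = np.log1p(features["total_absences"])
features["gpa_x_absences"] = features["mean_gpa"] * features["total_absences"]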
6.4 Problem formulation: Mapping a problem
to machine learning methods
When working on a new problem, one of the first things we need to
do is to map it to a class of machine learning methods. In general,
the problems we will tackle, including the examples above, can be
grouped into two major categories:
1. Supervised learning: These are problems where there exists a
target variable (continuous or discrete) that we want to predict
or classify data into. Classification, prediction, and regression
all fall into this category. More formally, supervised learning
methods predict a value Y given input(s) X by learning (or
estimating or fitting or training) a function F, where F(X) = Y.
Here, X is the set of variables (known as features in machine
learning, or in other fields as predictors) provided as input and
Y is the target/dependent variable or a label (as it is known in
machine learning).
The goal of supervised learning methods is to search for that
function F that best predicts Y. When the output Y is categor-
ical, this is known as classification. When Y is a continuous
value, this is called regression. Sound familiar?
One key distinction in machine learning is that the goal is not
just to find the best function F that can predict Y for observed
outcomes (known Ys) but to find one that best generalizes to
new, unseen data. This distinction makes methods more fo-
cused on generalization and less on just fitting the data we
have as best as we can. It is important to note that you do
that implicitly when performing regression by not adding more
and more higher-order terms to get better fit statistics. By
getting better fit statistics, we overfit to the data and the per-
formance on new (unseen) data often goes down. Methods like
the lasso [376] penalize the model for having too many terms
by performing what is known as regularization (in statistical
terms, regularization is an attempt to avoid overfitting the
model).
2. Unsupervised learning: These are problems where there does
not exist a target variable that we want to predict but we want
to understand “natural” groupings or patterns in the data.
Clustering is the most common example of this type of analysis
where you are given X and want to group similar Xs together.
Principal components analysis (PCA) and related methods also
fall into the unsupervised learning category.
In between the two extremes of supervised and unsupervised
learning, there is a spectrum of methods that have different levels
of supervision involved (Figure 6.1). Supervision in this case is the
presence of target variables (known in machine learning as labels).
In unsupervised learning, none of the data points have labels. In
supervised learning, all data points have labels. In between, either
the percentage of examples with labels can vary or the types of
labels can vary. We do not cover the weakly supervised and semi-
supervised methods much in this chapter, but this is an active area
of research in machine learning. Zhu [414] provides more details.
Figure 6.1. Spectrum of machine learning methods from unsupervised (cluster-
ing, PCA, MDS, association rules) through “weakly” supervised to fully supervised
(classification, prediction, regression) learning
6.5 Methods
We will start by describing unsupervised learning methods and then
go on to supervised learning methods. We focus here on the intu-
ition behind the methods and the algorithm, as well as practical
tips, rather than on the statistical theory that underlies the meth-
ods. We encourage readers to refer to machine learning books listed
in Section 6.11 for more details. Box 6.2 gives brief definitions of
several terms we will use in this section.
6.5.1 Unsupervised learning methods
As mentioned earlier, unsupervised learning methods are used when
we do not have a target variable to predict but want to understand
“natural” clusters or patterns in the data. These methods are often
used for initial data exploration, as in the following examples:
1. When faced with a large corpus of text data—for example, email
records, congressional bills, speeches, or open-ended free-text
survey responses—unsupervised learning methods are often
used to understand and get a handle on what the data contain.
2. Given a data set about students and their behavior over time
(academic performance, grades, test scores, attendance, etc.),
one might want to understand typical behaviors as well as tra-
jectories of these behaviors over time. Unsupervised learning
methods (clustering) can be applied to these data to get stu-
dent “segments” with similar behavior.
3. Given a data set about publications or patents in different
fields, we can use unsupervised learning methods (association
rules) to figure out which disciplines have the most collabo-
ration and which fields have researchers who tend to publish
across different fields.
Box 6.2: Machine learning vocabulary
Learning: In machine learning, you will notice the term
learning that will be used in the context of “learning” a
model. This is what you probably know as fitting or esti-
mating a function, or training or building a model. These
terms are all synonyms and are used interchangeably in
the machine learning literature.
Examples: These are data points and instances.
Features: These are independent variables, attributes,
predictor variables, and explanatory variables.
Labels: These include the response variable, dependent
variable, and target variable.
Underfitting: This happens when a model is too sim-
ple and does not capture the structure of the data well
enough.
Overfitting: This happens when a model is possibly too
complex and models the noise in the data, which can re-
sult in poor generalization performance. Using in-sample
measures to do model selection can lead to overfitting.
Regularization: This is a general method to avoid overfit-
ting by applying additional constraints to the model that
is learned. A common approach is to make sure the model
weights are, on average, small in magnitude. Two common
regularizations are L1 regularization (used by the lasso),
which has a penalty term that encourages the sum of the
absolute values of the parameters to be small; and L2 regu-
larization, which encourages the sum of the squares of
the parameters to be small.
Clustering Clustering is the most common unsupervised learning
technique and is used to group data points together that are similar
to each other. The goal of clustering methods is to produce clusters
with high intra-cluster (within) similarity and low inter-cluster (be-
tween) similarity.
Clustering algorithms typically require a distance (or similarity)
metric to generate clusters. (Distance metrics are mathematical
formulas to calculate the distance between two objects. For ex-
ample, Manhattan distance is the distance a car would drive from
one place to another in a grid-based street system, whereas Eu-
clidean distance, in two-dimensional space, is the “straight-line”
distance between two points.) They take a data set and a distance
metric (and sometimes additional parameters), and they generate
clusters based on that distance metric. The most common dis-
tance metric used is Euclidean distance, but other commonly used
metrics are Manhattan, Minkowski, Chebyshev, cosine, Hamming,
Pearson, and Mahalanobis. Often, domain-specific similarity met-
rics can be designed for use in specific problems. For example,
when performing the record linkage tasks discussed in Chapter 3,
you can design a similarity metric that compares two first names
and assigns them a high similarity (low distance) if they both map
to the same canonical name, so that, for example, Sammy and Sam
map to Samuel.
Most clustering algorithms also require the user to specify the
number of clusters (or some other parameter that indirectly deter-
mines the number of clusters) in advance as a parameter. This
is often difficult to do a priori and typically makes clustering an
iterative and interactive task. Another aspect of clustering that
makes it interactive is often the difficulty in automatically evaluat-
ing the quality of the clusters. While various analytical clustering
metrics have been developed, the best clustering is task-dependent
and thus must be evaluated by the user. There may be different
clusterings that can be generated with the same data. You can
imagine clustering similar news stories based on the topic content,
based on the writing style or based on sentiment. The right set of
clusters depends on the user and the task they have. Clustering is
therefore typically used for exploring the data, generating clusters,
exploring the clusters, and then rerunning the clustering method
with different parameters or modifying the clusters (by splitting or
merging the previous set of clusters). Interpreting a cluster can be
nontrivial: you can look at the centroid of a cluster, look at fre-
quency distributions of different features (and compare them to the
prior distribution of each feature), or you can build a decision tree
(a supervised learning method we will cover later in this chapter)
where the target variable is the cluster ID that can describe the
cluster using the features in your data. A good example of a tool
that allows interactive clustering from text data is Ontogen [125].
k-means clustering The most commonly used clustering algorithm
is called k-means, where k defines the number of clusters. The
algorithm works as follows:
1. Select k (the number of clusters you want to generate).
2. Initialize by selecting k points as centroids of the k clusters.
This is typically done by selecting k points uniformly at ran-
dom.
3. Assign each point a cluster according to the nearest centroid.
4. Recalculate cluster centroids based on the assignment in (3)
as the mean of all data points belonging to that cluster.
5. Repeat (3) and (4) until convergence.
The algorithm stops when the assignments do not change from
one iteration to the next (Figure 6.2). The final set of clusters, how-
ever, depends on the starting points. If they are initialized differently,
it is possible that different clusters are obtained. One common
practical trick is to run k-means several times, each with different
(random) starting points. The k-means algorithm is fast, simple,
and easy to use, and is often a good first clustering algorithm to try
and see if it fits your needs. When the data are of the form where
the mean of the data points cannot be computed, a related method
called K-medoids can be used [296].
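As a concrete illustration, here is a minimal sketch of the algorithm
using scikit-learn's KMeans on synthetic data; the value k = 3, the
random seed, and the data are arbitrary choices for illustration.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three "natural" groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_init reruns k-means from several random starting points and keeps
# the best solution, the practical trick mentioned above.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)  # final centroids
print(kmeans.labels_[:10])      # cluster assignments of the first 10 points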
Expectation-maximization (EM) clustering You may be familiar
with the EM algorithm in the context of imputing missing data.
EM is a general approach to maximum likelihood in the presence
of incomplete data. However, it is also used as a clustering method
where the missing data are the clusters a data point belongs to.
Unlike k-means, where each data point gets assigned to only one
cluster, EM does a soft assignment where each data point gets a
probabilistic assignment to various clusters. The EM algorithm iter-
ates until the estimates converge to some (locally) optimal solution.
The EM algorithm is fairly good at dealing with outliers as well
as high-dimensional data, compared to k-means. It also has a few
limitations. First, it does not work well with a large number of
clusters or when a cluster contains few examples. Also, when the
value of k is larger than the number of actual clusters in the data,
EM may not give reasonable results.
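In scikit-learn, this kind of EM-based soft clustering is available
through Gaussian mixture models; the sketch below (with an assumed
three components) shows the probabilistic assignments that distinguish
EM from k-means.

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# GaussianMixture is fit with the EM algorithm.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft assignment: each row gives the probability that a point belongs
# to each of the three clusters.
print(gmm.predict_proba(X[:5]))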
Mean shift clustering Mean shift clustering works by finding dense
regions in the data by defining a window around each data point and
computing the mean of the data points in the window. Then it shifts
the center of the window to the mean and repeats the algorithm till
it converges. After each iteration, we can consider that the window
shifts to a denser region of the data set.
Figure 6.2. Example of k-means clustering with k = 3. The upper left panel shows
the distribution of the data and the three starting points m1, m2, m3 placed at
random. On the upper right we see what happens in the first iteration: the cluster
means move to more central positions in their respective clusters. The lower left
panel shows the second iteration. After six iterations the cluster means have
converged to their final destinations, and the result is shown in the lower right
panel
The algorithm proceeds as follows:
1. Fix a window around each data point (based on the bandwidth
parameter that defines the size of the window).
2. Compute the mean of data within the window.
3. Shift the window to the mean and repeat till convergence.
Mean shift needs a bandwidth parameter h to be tuned, which
influences the convergence rate and the number of clusters. A large
h might result in merging distinct clusters. A small h might result
in too many clusters. Mean shift might not work well in higher
dimensions since the number of local maxima is pretty high and it
might converge to a local optimum quickly.
One of the most important differences between mean shift and k-
means is that k-means makes two broad assumptions: the number
of clusters is already known and the clusters are shaped spherically
(or elliptically). Mean shift does not assume anything about the
number of clusters (but the value of h indirectly determines that).
Also, it can handle arbitrarily shaped clusters.
The k-means algorithm is also sensitive to initializations, where-
as mean shift is fairly robust to initializations. Typically, mean shift
is run for each point, or sometimes points are selected uniformly
randomly. Similarly, k-means is sensitive to outliers, while mean
shift is less sensitive. On the other hand, the benefits of mean shift
come at a cost—speed. The k-means procedure is fast, whereas
classic mean shift is computationally slow but can be easily paral-
lelized.
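A minimal sketch with scikit-learn's MeanShift follows; the bandwidth
is estimated from the data here, but in practice you would tune it as
discussed above, and the synthetic data are purely illustrative.

from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# estimate_bandwidth gives a data-driven starting value for h.
h = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=h).fit(X)

# The number of clusters is not specified up front; it falls out of h.
print(len(set(ms.labels_)))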
Hierarchical clustering The clustering methods that we have seen
so far, often termed partitioning methods, produce a flat set of clus-
ters with no hierarchy. Sometimes, we want to generate a hierarchy
of clusters, and methods that can do that are of two types:
1. Agglomerative (bottom-up): Start with each point as its own
cluster and iteratively merge the closest clusters. The iter-
ations stop either when the clusters are too far apart to be
merged (based on a predefined distance criterion) or when
there is a sufficient number of clusters (based on a predefined
threshold).
2. Divisive (top-down): Start with one cluster and create splits
recursively.
Typically, agglomerative clustering is used more often than divi-
sive clustering. One reason is that it is significantly faster, although
both of them are typically slower than direct partition methods such
as k-means and EM. Another disadvantage of these methods is that
they are greedy, that is, a data point that is incorrectly assigned to
the “wrong” cluster in an earlier split or merge cannot be reassigned
again later on.
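As a sketch, agglomerative (bottom-up) clustering is available in
scikit-learn; the number of clusters and the Ward linkage used below
are illustrative choices.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Ward linkage merges, at each step, the pair of clusters that yields
# the smallest increase in within-cluster variance.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print(agg.labels_[:10])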
Spectral clustering Figure 6.3 shows the clusters that k-means
would generate on the data set in the figure. It is obvious that the
clusters produced are not the clusters you would want, and that is
one drawback of methods such as k-means.
Figure 6.3. The same data set can produce drastically different clusters: (a) k-
means; (b) spectral clustering
Two points that are far
away from each other will be put in different clusters even if there are
other data points that create a “path” between them. Spectral clus-
tering fixes that problem by clustering data that are connected but
not necessarily (what is called) compact or clustered within convex
boundaries. Spectral clustering methods work by representing data
as a graph (or network), where data points are nodes in the graph
and the edges (connections between nodes) represent the similarity
between the two data points.
The algorithm works as follows:
1. Compute a similarity matrix from the data. This involves deter-
mining a pairwise distance function (using one of the distance
functions we described earlier).
2. With this matrix, we can now perform graph partitioning, where
connected graph components are interpreted as clusters. The
graph must be partitioned such that edges connecting different
clusters have low weights and edges within the same cluster
have high weights.
3. We can now partition these data represented by the similarity
matrix in a variety of ways. One common way is to use the
normalized cuts method. Another way is to compute a graph
Laplacian from the similarity matrix.
4. Compute the eigenvectors and eigenvalues of the Laplacian.
5. The k eigenvectors are used as proxy data for the original data
set, and they are fed into k-means clustering to produce clus-
ter assignments for each original data point.
Spectral clustering is in general much better than k-means in
clustering performance but much slower to run in practice. For
large-scale problems, k-means is a preferred clustering algorithm
to run because of efficiency and speed.
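The sketch below runs scikit-learn's SpectralClustering on non-convex
synthetic data of the kind shown in Figure 6.3; the nearest-neighbors
affinity is one way to build the similarity graph described above.

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Two interleaved half-moons: connected but not compact clusters,
# which k-means handles poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        random_state=0).fit(X)
print(sc.labels_[:10])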
Principal components analysis Principal components analysis is
another unsupervised method used for finding patterns and struc-
ture in data. In contrast to clustering methods, the output is not a
set of clusters but a set of principal components that are linear com-
binations of the original variables. PCA is typically used when you
have a large number of variables and you want a reduced number
that you can analyze. This approach is often called dimensionality
reduction. It generates linearly uncorrelated dimensions that can be
used to understand the underlying structure of the data. In math-
ematical terms, given a set of data on n dimensions, PCA aims to
find a linear subspace of dimension d lower than n such that the
data points lie mainly on this linear subspace.
PCA is related to several other methods you may already know
about. Multidimensional scaling, factor analysis, and independent
component analysis differ from PCA in the assumptions they make,
but they are often used for similar purposes of dimensionality re-
duction and discovering the underlying structure in a data set.
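A minimal dimensionality-reduction sketch with scikit-learn follows;
the choice of two components and the example data set are assumptions
for illustration.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                       # four original variables
X_std = StandardScaler().fit_transform(X)  # PCA is sensitive to scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)       # two linearly uncorrelated components

print(pca.explained_variance_ratio_)       # share of variance each component keeps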
Association rules Association rules are a different type of analysis
method and originate from the data mining and database commu-
nity, primarily focused on finding frequent co-occurring associa-
tions among a collection of items. This method is sometimes re-
ferred to as “market basket analysis,” since that was the original
application area of association rules. The goal is to find associations
of items that occur together more often than you would randomly
expect. The classic example (probably a myth) is “men who go to the
store to buy diapers will also tend to buy beer at the same time.”
This type of analysis would be performed by applying association
rules to a set of supermarket purchase data.
Association rules take the form {X1, X2, X3} → Y with support S
and confidence C, implying that when a transaction contains items
{X1, X2, X3}, C% of the time it also contains item Y, and there are at
least S% of transactions where the antecedent is true. This is useful
in cases where we want to find patterns that are both frequent and
statistically significant, by specifying thresholds for support Sand
confidence C.
Support and confidence are useful metrics to generate rules but
are often not enough. Another important metric used to gener-
ate rules (or reduce the number of spurious patterns generated) is
lift. Lift is simply estimated by the ratio of the joint probability of
two items, x and y, to the product of their individual probabilities:
P(x, y)/[P(x)P(y)]. If the two items are statistically independent,
then P(x, y) = P(x)P(y), corresponding to a lift of 1. Note that anti-
correlation yields lift values less than 1, which is also an interesting
pattern, corresponding to mutually exclusive items that rarely occur
together.
Association rule algorithms work as follows: Given a set of trans-
actions (rows) and items for that transaction:
1. Find all combinations of items in a set of transactions that oc-
cur with a specified minimum frequency. These combinations
are called frequent itemsets.
2. Generate association rules that express co-occurrence of items
within frequent itemsets.
For our purposes, association rule methods are an efficient way
to take a basket of features (e.g., areas of publication of a re-
searcher, different organizations an individual has worked at in
their career, all the cities or neighborhoods someone may have lived
in) and find co-occurrence patterns. This may sound trivial, but as
data sets and number of features get larger, it becomes computa-
tionally expensive and association rule mining algorithms provide a
fast and efficient way of doing it.
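Scikit-learn does not include association rule mining, but other
Python libraries do; the sketch below assumes the mlxtend package is
installed and uses a tiny, made-up one-hot transaction table.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Made-up transactions (rows) by items (columns), one-hot encoded.
transactions = pd.DataFrame({
    "diapers": [1, 1, 0, 1],
    "beer":    [1, 1, 0, 0],
    "milk":    [0, 1, 1, 1],
}, dtype=bool)

# Step 1: frequent itemsets above a minimum support threshold.
itemsets = apriori(transactions, min_support=0.5, use_colnames=True)

# Step 2: rules filtered by lift (values above 1 indicate items that
# co-occur more often than expected by chance).
rules = association_rules(itemsets, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])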
6.5.2 Supervised learning
We now turn to the problem of supervised learning, which typically
involves methods for classification, prediction, and regression. We
will mostly focus on classification methods in this chapter since
many of the regression methods in machine learning are fairly simi-
lar to methods with which you are already familiar. Remember that
classification means predicting a discrete (or categorical) variable.
Some of the classification methods that we will cover can also be
used for regression, a fact that we will mention when describing
that method.
In general, supervised learning methods take as input pairs of
data points (X, Y) where X are the predictor variables (features) and
Y is the target variable (label). The supervised learning method
then uses these pairs as training data and learns a model F, where
F(X) ≈ Y. This model F is then used to predict Ys for new data
points X. As mentioned earlier, the goal is not to build a model that
best fits known data but a model that is useful for future predictions
and minimizes future generalization error. This is the key goal that
differentiates many of the methods that you know from the methods
that we will describe next. In order to minimize future error, we
want to build models that are not just overfitting on past data.
Another goal, often prioritized in the social sciences, that ma-
chine learning methods do not optimize for is getting a structural
form of the model. Machine learning models for classification can
take different structural forms (ranging from linear models, to sets
of rules, to more complex forms), and it may not always be possible
to write them down in a compact form as an equation. This does
not, however, make them incomprehensible or uninterpretable. An-
other focus of machine learning models for supervised learning is
prediction, and not causal inference. Some of these models can
be used to help with causal inference (a topic addressed in more
detail in Chapter 10), but they are typically opti-
mized for prediction tasks. We believe that there are many social
science and policy problems where better prediction methods can
be extremely beneficial.
In this chapter, we mostly deal with binary classification prob-
lems: that is, problems in which the data points are to be classified
into one of two categories. Several of the methods that we will
cover can also be used for multiclass classification (classifying a
data point into one of n categories) or for multi-label classification
(classifying a data point into m of n categories where m ≥ 1). There
are also approaches to take multiclass problems and turn them into
a set of binary problems that we will mention briefly at the end of
the chapter.
Before we describe supervised learning methods, we want to re-
cap a few principles as well as terms that we have used and will be
using in the rest of the chapter.
Training a model Once we have finished data exploration, filled in
missing values, created predictor variables (features), and decided
what our target variable (label) is, we now have pairs of X, Y to start
training (or building) the model.
Using the model to score new data We are building this model
so we can predict Y for a new set of Xs—using the model means,
getting new data, generating the same features to get the vector X,
and then applying the model to produce Y.
Figure 6.4. Example of k-nearest neighbor with k = 1, 3, 5 neighbors. We want to
predict the points A and B. The 1-nearest neighbor for both points is red (“Patent
not granted”), the 3-nearest neighbor predicts point A (B) to be red (green) with
probability 2/3, and the 5-nearest neighbor predicts again both points to be red
with probabilities 4/5 and 3/5, respectively.
One common technique for supervised learning is logistic re-
gression, a method you will already be familiar with. We will give an
overview of some of the other methods used in machine learning.
It is important to remember that as you use increasingly powerful
classification methods, you need more data to train the models.
k-nearest neighbor The method k-nearest neighbor (k-NN) is one
of the simpler classification methods in machine learning. It be-
longs to a family of models sometimes known as memory-based
models or instance-based models. An example is classified by find-
ing its k nearest neighbors and taking majority vote (or some other
aggregation function). We need two key things: a value for k and a
distance metric with which to find the k nearest neighbors. Typi-
cally, different values of k are used to empirically find the best one.
Small values of k lead to predictions having high variance but can
capture the local structure of the data. Larger values of k build
more global models that are lower in variance but may not capture
local structure in the data as well.
Figure 6.4 provides an example for k = 1, 3, 5 nearest neighbors.
The number of neighbors (k) is a parameter, and the prediction
depends heavily on how it is determined. In this example, point B
is classified differently if k = 3.
Training for k-NN just means storing the data, making this
method useful in applications where data are coming in extremely
quickly and a model needs to be updated frequently. All the work,
however, gets pushed to scoring time, since all the distance calcu-
lations happen when a new data point needs to be classified. There
are several optimized methods designed to make k-NN more efficient
that are worth looking into if that is a situation that is applicable to
your problem.
In addition to selecting kand an appropriate distance metric, we
also have to be careful about the scaling of the features. When dis-
tances between two data points are large for one feature and small
for a different feature, the method will rely almost exclusively on
the first feature to find the closest points. The smaller distances on
the second feature are nearly irrelevant to calculate the overall dis-
tance. A similar problem occurs when continuous and categorical
predictors are used together. To resolve the scaling issues, various
options for rescaling exist. For example, a common approach is to
center all features at mean 0 and scale them to variance 1.
There are several variations of k-NN. One of these is weighted
nearest neighbors, where different features are weighted differently
or different examples are weighted based on the distance from the
example being classified. The method k-NN also has issues when
the data are sparse and have high dimensionality, which means that
every point is far away from virtually every other point, and hence
pairwise distances tend to be uninformative. This can also happen
when a lot of features are irrelevant and drown out the relevant
features’ signal in the distance calculations.
Notice that the nearest-neighbor method can easily be applied to
regression problems with a real-valued target variable. In fact, the
method is completely oblivious to the type of target variable and can
potentially be used to predict text documents, images, and videos,
based on the aggregation function after the nearest neighbors are
found.
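A minimal k-NN sketch with scikit-learn follows; note the rescaling
step, which addresses the feature-scaling issue discussed above. The
synthetic data and k = 5 are placeholders.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Center features at mean 0 and variance 1 before computing distances.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))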
Support vector machines Support vector machines are one of the
most popular and best-performing classification methods in ma-
chine learning today. The mathematics behind SVMs has a lot of
prerequisites that are beyond the scope of this book, but we will
give you an intuition of how SVMs work, what they are good for,
and how to use them.
We are all familiar with linear models that separate two classes
by fitting a line in two dimensions (or a hyperplane in higher dimen-
sions) in the middle (see Figure 6.5). An important decision that lin-
ear models have to make is which linear separator we should prefer
when there are several we can build.
Figure 6.5. Support vector machines: the optimal hyperplane separates the two
classes (in the x1, x2 feature space) with the maximum margin
You can see in Figure 6.5 that multiple lines offer a solution
to the problem. Is any of them better than the others? We can
intuitively define a criterion to estimate the worth of the lines: A
line is bad if it passes too close to the points because it will be noise
sensitive and it will not generalize correctly. Therefore, our goal
should be to find the line passing as far as possible from all points.
The SVM algorithm is based on finding the hyperplane that max-
imizes the margin of the training data. The training examples that
are closest to the hyperplane are called support vectors since they
are supporting the margin (as the margin is only a function of the
support vectors).
An important concept to learn when working with SVMs is ker-
nels. SVMs are a specific instance of a class of methods called kernel
methods. So far, we have only talked about SVMs as linear mod-
els. Linear works well in high-dimensional data but sometimes you
need nonlinear models, often in cases of low-dimensional data or in
image or video data. Unfortunately, traditional ways of generating
nonlinear models get computationally expensive since you have to
explicitly generate all the features such as squares, cubes, and all
the interactions. Kernels are a way to keep the efficiency of the lin-
ear machinery but still build models that can capture nonlinearity
in the data without creating all the nonlinear features.
You can essentially think of kernels as similarity functions and
use them to create a linear separation of the data by (implicitly) map-
ping the data to a higher-dimensional space. Essentially, we take
an n-dimensional input vector X, map it into a high-dimensional
(possibly infinite-dimensional) feature space, and construct an op-
timal separating hyperplane in this space. We refer you to relevant
papers for more detail on SVMs and nonlinear kernels [334, 339].
SVMs are also related to logistic regression, but use a different
loss/penalty function [159].
When using SVMs, there are several parameters you have to
optimize, ranging from the regularization parameter C, which de-
termines the tradeoff between minimizing the training error and
minimizing model complexity, to more kernel-specific parameters.
It is often a good idea to do a grid search to find the optimal param-
eters. Another tip when using SVMs is to normalize the features;
one common approach to doing that is to normalize each data point
to be a vector of unit length.
Linear SVMs are effective in high-dimensional spaces, especially
when the space is sparse such as text classification where the num-
ber of data points (perhaps tens of thousands) is often much less
than the number of features (a hundred thousand to a million or
more). SVMs are also fairly robust when the number of irrelevant
features is large (unlike the k-NN approaches that we mentioned
earlier) as well as when the class distribution is skewed, that is,
when the class of interest is significantly less than 50% of the data.
One disadvantage of SVMs is that they do not directly provide
probability estimates. They assign a score based on the distance
from the margin. The farther a point is from the margin, the higher
the magnitude of the score. This score is good for ranking examples,
but getting accurate probability estimates takes more work and re-
quires more labeled data to be used to perform probability calibra-
tions.
In addition to classification, there are also variations of SVMs
that can be used for regression [348] and ranking [70].
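Here is a sketch of a linear SVM with a small grid search over the
regularization parameter C; the parameter grid and the synthetic data
are illustrative only.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grid search over C, the tradeoff between minimizing training error and
# minimizing model complexity; features are normalized first.
svm = make_pipeline(StandardScaler(), LinearSVC())
grid = GridSearchCV(svm, {"linearsvc__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))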
Decision trees Decision trees are yet another set of methods that
are helpful for prediction. Typical decision trees learn a set of rules
from training data represented as a tree. An exemplary decision
tree is shown in Figure 6.6. Each level of a tree splits the tree to
create a branch using a feature and a value (or range of values). In
the example tree, the first split is made on the feature “number of
visits in the past year” and the value 4. The second level of the tree
now has two splits: one using “average length of visit” with value 2
days and the other using the value 10 days.
Various algorithms exist to build decision trees. C4.5, CHAID,
and CART (Classification and Regression Trees) are the most popular.
Figure 6.6. An exemplary decision tree. The top figure is the standard repre-
sentation for trees. The bottom figure offers an alternative view of the same tree:
the feature space is partitioned into numerous rectangles, which is another way
to view a tree, representing its nonlinear character more explicitly
Each needs to determine the next best feature to split on. The goal
is to find feature splits that can best reduce class impurity in the
data, that is, a split that will ideally put all (or as many as possible)
positive class examples on one side and all (or as many as possi-
ble) negative examples on the other side. One common measure of
impurity that comes from information theory is entropy, and it is
calculated as
H(X) = −∑_x p(x) log p(x).
Entropy is maximum (1) when both classes have equal numbers
of examples in a node. It is minimum (0) when all examples are
from the same class. At each node in the tree, we can evaluate
all the possible features and select the one that most reduces the
entropy given the tree so far. This expected change in entropy is
known as information gain and is one of the most common criteria
used to create decision trees. Other measures that are used instead
of information gain are Gini and chi-squared.
If we keep constructing the tree in this manner, selecting the
next best feature to split on, the tree ends up fairly deep and tends
to overfit the data. To prevent overfitting, we can either have a
stopping criterion or prune the tree after it is fully grown. Common
stopping criteria include minimum number of data points to have
before doing another feature split, maximum depth, and maximum
purity. Typical pruning approaches use holdout data (or cross-
validation, which will be discussed later in this chapter) to cut off
parts of the tree.
Once the tree is built, a new data point is classified by running it
through the tree and, once it reaches a terminal node, using some
aggregation function to give a prediction (classification or regres-
sion). Typical approaches include performing maximum likelihood
(if the leaf node contains 10 examples, 8 positive and 2 negative,
any data point that gets into that node will get an 80% probability
of being positive). Trees used for regression often build the tree as
described above but then fit a linear regression model at each leaf
node.
Decision trees have several advantages. The interpretation of a
tree is straightforward as long as the tree is not too large. Trees
can be turned into a set of rules that experts in a particular domain
can possibly dig deeper into, validate, and modify. Trees also do not
require too much feature engineering. There is no need to create
interaction terms since trees can implicitly do that by splitting on
two features, one after another.
Unfortunately, along with these benefits come a set of disadvan-
tages. Decision trees, in general, do not perform well, compared to
SVMs, random forests, or logistic regression. They are also unsta-
ble: small changes in data can result in very different trees. The lack
of stability comes from the fact that small changes in the training
data may lead to different splitting points. As a consequence, the
whole tree may take a different structure. The suboptimal predic-
tive performance can be seen from the fact that trees partition the
predictor space into a few rectangular regions, each one predicting
only a single value (see the bottom part of Figure 6.6).
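A minimal decision tree sketch with scikit-learn follows; the entropy
criterion corresponds to information gain, and the depth limit is one
of the stopping criteria mentioned above (its value here is arbitrary).

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth is a stopping criterion that limits overfitting.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))
print(tree.predict_proba(X_test[:3]))  # leaf-based class probabilities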
Ensemble methods Combinations of models are generally known
as model ensembles. They are among the most powerful techniques
in machine learning, often outperforming other methods, although
at the cost of increased algorithmic and model complexity.
The intuition behind building ensembles of models is to build
several models, each somewhat different. This diversity can come
from various sources such as: training models on subsets of the
data; training models on subsets of the features; or a combination
of these two.
Ensemble methods in machine learning have two things in com-
mon. First, they construct multiple, diverse predictive models from
adapted versions of the training data (most often reweighted or re-
sampled). Second, they combine the predictions of these models in
some way, often by simple averaging or voting (possibly weighted).
Bagging Bagging stands for “bootstrap aggregation”: we first create
bootstrap samples from the original data and then aggregate the
predictions using models trained on each bootstrap sample. (The
bootstrap is a general statistical procedure that draws random
samples of the original data with replacement.) Given
a data set of size N, the method works as follows:
1. Create k bootstrap samples (with replacement), each of size N,
resulting in k data sets. Only about 63% of the original train-
ing examples will be represented in any given bootstrapped
set.
2. Train a model on each of the k data sets, resulting in k models.
3. For a new data point X, predict the output using each of the k
models.
4. Aggregate the k predictions (typically using average or voting)
to get the prediction for X.
A nice feature of this method is that any underlying model can
be used, but decision trees are often the most commonly used base
model. One reason for this is that decision trees are typically high
variance and unstable, that is, they can change drastically given
small changes in data, and bagging is effective at reducing the vari-
ance of the overall model. Another advantage of bagging is that
each model can be trained in parallel, making it efficient to scale to
large data sets.
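A sketch of bagging with decision trees as the base model; the number
of bootstrap samples (100 here) is arbitrary, and n_jobs=-1 trains the
trees in parallel as noted above.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 bootstrap samples, a tree trained on each, predictions aggregated
# by voting.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        n_jobs=-1, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))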
Boosting Boosting is another popular ensemble technique, and it
often results in improving the base classifier being used. In fact,
if your only goal is improving accuracy, you will most likely find
that boosting will achieve that. The basic idea is to keep training
classifiers iteratively, each iteration focusing on examples that the
previous one got wrong. At the end, you have a set of classifiers,
each trained on smaller and smaller subsets of the training data.
Given a new data point, all the classifiers predict the target, and a
weighted average of those predictions is used to get the final pre-
diction, where the weight is proportional to the accuracy of each
classifier. The algorithm works as follows:
1. Assign equal weights to every example.
2. For each iteration:
(a) Train classifier on the weighted examples.
(b) Predict on the training data.
(c) Calculate error of the classifier on the training data.
(d) Calculate the new weighting on the examples based on
the errors of the classifier.
(e) Reweight examples.
3. Generate a weighted classifier based on the accuracy of each
classifier.
One constraint on the classifier used within boosting is that it
should be able to handle weighted examples (either directly or by
replicating the examples that need to be overweighted). The most
common classifiers used in boosting are decision stumps (single-
level decision trees), but deeper trees can also work well.
Boosting is a common way to boost the performance of a classi-
fication method but comes with additional complexity, both in the
training time and in interpreting the predictions. A disadvantage of
boosting is that it is difficult to parallelize since the next iteration of
boosting relies on the results of the previous iteration.
A nice property of boosting is its ability to identify outliers: ex-
amples that are either mislabeled in the training data, or are inher-
ently ambiguous and hard to categorize. Because boosting focuses
its weight on the examples that are more difficult to classify, the
examples with the highest weight often turn out to be outliers. On
the other hand, if the number of outliers is large (lots of noise in
the data), these examples can hurt the performance of boosting by
focusing too much on them.
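As a sketch, scikit-learn's AdaBoost implementation uses decision
stumps as its default base estimator; the number of iterations below
is arbitrary.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each iteration reweights the training examples to focus on those the
# previous classifiers got wrong; the final prediction is a weighted vote.
boost = AdaBoostClassifier(n_estimators=200, random_state=0)
boost.fit(X_train, y_train)
print(boost.score(X_test, y_test))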
Random forests Given a data set of size N and containing M fea-
tures, the random forest training algorithm works as follows:
1. Create n bootstrap samples from the original data of size N.
Remember, this is similar to the first step in bagging. Typically
n ranges from 100 to a few thousand but is best determined
empirically.
2. For each bootstrap sample, train a decision tree using m fea-
tures (where m is typically much smaller than M) at each node
of the tree. The m features are selected uniformly at random
from the M features in the data set, and the decision tree will
select the best split among the m features. The value of m is
held constant during the forest growing.
3. A new test example/data point is classified by all the trees,
and the final classification is done by majority vote (or another
appropriate aggregation method).
Random forests are probably the most accurate classifiers being
used today in machine learning. They can be easily parallelized,
making them efficient to run on large data sets, and can handle a
large number of features, even with a lot of missing values. Random
forests can get complex, with hundreds or thousands of trees that
are fairly deep, so it is difficult to interpret the learned model. At
the same time, they provide a nice way to estimate feature impor-
tance, giving a sense of what features were important in building
the classifier.
Another nice aspect of random forests is the ability to compute a
proximity matrix that gives the similarity between every pair of data
points. This is calculated by computing the number of times two
examples land in the same terminal node. The more that happens,
the closer the two examples are. We can use this proximity matrix
for clustering, locating outliers, or explaining the predictions for a
specific example.
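In a scikit-learn sketch of a random forest, n_estimators plays the
role of n above and max_features the role of m; the specific values
and data are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_features="sqrt" selects roughly sqrt(M) features at random at each split.
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                n_jobs=-1, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))
print(forest.feature_importances_)  # which features mattered most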
Stacking Stacking is a technique that deals with the task of learn-
ing a meta-level classifier to combine the predictions of multiple
base-level classifiers. This meta-algorithm is trained to combine
the model predictions to form a final set of predictions. This can be
used for both regression and classification. The algorithm works as
follows:
1. Split the data set into n equal-sized sets: set1, set2, . . . , setn.
2. Train base models on all possible combinations of n − 1 sets
and, for each model, use it to predict on the set i that was left
out of the training set. This would give us a set of predictions on
every data point in the original data set.
3. Now train a second-stage stacker model on the predicted classes
or the predicted probability distribution over the classes from
the first-stage (base) model(s).
By using the first-stage predictions as features, a stacker model
gets more information on the problem space than if it were trained in
isolation. The technique is similar to cross-validation, an evaluation
methodology that we will cover later in this chapter.
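A stacking sketch in which out-of-fold predictions from two base models
become the features of a logistic regression stacker; the choice of
base models and of five folds is illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# cv=5 generates out-of-fold base-model predictions, which are then used
# as features to train the second-stage (meta) model.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("svm", LinearSVC())],
    final_estimator=LogisticRegression(),
    cv=5)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))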
Neural networks and deep learning Neural networks are a set of
multi-layer classifiers where the outputs of one layer feed into the
inputs of the next layer. The layers between the input and output
layers are called hidden layers, and the more hidden layers a neu-
ral network has, the more complex functions it can learn. Neural
networks were popular in the 1980s and early 1990s, but then fell
out of fashion because they were slow and expensive to train, even
with only one or two hidden layers. Since 2006, a set of techniques
has been developed that enable learning in deeper neural networks.
These techniques have enabled much deeper (and larger) networks
to be trained—people now routinely train networks with five to ten
hidden layers. And it turns out that these perform far better on
many problems than shallow neural networks (with just a single
hidden layer). The reason for the better performance is the ability
of deep nets to build up a complex hierarchy of concepts, learning
multiple levels of representation and abstraction that help to make
sense of data such as images, sound, and text.
Usually, with a supervised neural network you try to predict a
target vector, Y, from a matrix of inputs, X. But when you train
a deep neural network, it uses a combination of supervised and
unsupervised learning. In an unsupervised neural network, you
try to predict the matrix X using the same matrix X as the input.
In doing this, the network can learn something intrinsic about the
data without the help of a separate target or label. The learned
information is stored as the weights of the network.
Currently, deep neural networks are trendy and a lot of research
is being done on them. It is, however, important to keep in mind
that they are applicable for a narrow class of problems with which
social scientists would deal and that they often require a lot more
data than are available in most problems. Training deep neural
networks also requires a lot of computational power, but that is
less likely to be an issue for most people. Typical cases where deep
learning has been shown to be effective involve lots of images, video,
and text data. We are still in the early stages of development of this
class of methods, and the next few years will give us a much better
understanding of why they are effective and the problems for which
they are well suited.
6.6 Evaluation
The previous section introduced us to a variety of methods, all with
certain pros and cons, and no single method is guaranteed to out-
perform others for a given problem. This section focuses on evaluation
methods, with three primary goals:
1. Model selection: How do we select a method to use? What
parameters should we select for that method?
2. Performance estimation: How well will our model do once it is
deployed and applied to new data?
3. A deeper understanding of the model can point to inaccuracies
of existing methods and provide a better understanding of the
data and the problem we are tackling.
This section will cover evaluation methodologies as well as met-
rics that are commonly used. We will start by describing common
evaluation methodologies that use existing data and then move on
to field trials. The methodologies we describe below apply both to
regression and classification problems.
6.6.1 Methodology
In-sample evaluation As social scientists, you already evaluate
methods on how well they perform in-sample (on the set that the
model was trained on). As we mentioned earlier in the chapter, the
goal of machine learning methods is to generalize to new data, and
validating models in-sample does not allow us to do that. We fo-
cus here on evaluation methodologies that allow us to optimize (as
best as we can) for generalization performance. The methods are
illustrated in Figure 6.7.
Figure 6.7. Validation methodologies: holdout set (the original data set is split
into a training set and an out-of-sample holdout/test set) and 5-fold cross-
validation (a data set of size N is split into five folds N1, . . . , N5, each of size N/5)
Out-of-sample and holdout set The simplest way to focus on gen-
eralization is to pretend to generalize to new (unseen) data. One
way to do that is to take the original data and randomly split them
into two sets: a training set and a test set (sometimes also called
the holdout or validation set). We can decide how much to keep in
each set (typically the splits range from 50–50 to 80–20, depending
on the size of the data set). We then train our models on the train-
ing set and classify the data in the test set, allowing us to get an
estimate of the relative performance of the methods.
One drawback of this approach is that we may be extremely
lucky or unlucky with our random split. One way to get around the
problem that is to repeatedly create multiple training and test sets.
We can then train on TR1 and test on TE1, train on TR2 and test
on TE2, and so on. The performance measures on each test set can
then give us an estimate of the performance of different methods
and how much they vary across different random sets.
Cross-validation Cross-validation is a more sophisticated holdout
training and testing procedure that takes away some of the short-
comings of the holdout set approach. Cross-validation begins by
splitting a labeled data set into k partitions (called folds). Typically,
k is set to 5 or 10. Cross-validation then proceeds by iterating k
times. In each iteration, one of the k folds is held out as the test
set, while the other k − 1 folds are combined and used to train
the model. A nice property of cross-validation is that every exam-
ple is used in one test set for testing the model. Each iteration of
cross-validation gives us a performance estimate that can then be
aggregated (typically averaged) to generate the overall estimate.
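A minimal cross-validation sketch with scikit-learn; the model and the
choice of k = 5 folds are placeholders.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Five folds: each iteration holds one fold out as the test set, trains on
# the other four, and the scores are then aggregated.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())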
Figure 6.8. Temporal validation: the training window expands over 2010–2014
while each test window covers the following year
An extreme case of cross-validation is called leave-one-out cross-
validation, where given a data set of size N, we create N folds. That
means iterating over each data point, holding it out as the test set,
and training on the rest of the N − 1 examples. This illustrates the
benefit of cross-validation by giving us good generalization estimates
(by training on as much of the data set as possible) and making sure
the model is tested on each data point.
Temporal validation The cross-validation and holdout set ap-
proaches described above assume that the data have no time de-
pendencies and that the distribution is stationary over time. This
assumption is almost always violated in practice and affects perfor-
mance estimates for a model.
In most practical problems, we want to use a validation strategy
that emulates the way in which our models will be used and pro-
vides an accurate performance estimate. We will call this temporal
validation. For a given point in time t_i, we train our models only on
information available to us before t_i to avoid training on data from
the “future.” We then predict and evaluate on data from t_i to t_i + d
and iterate, expanding the training window while keeping the test
window size constant at d. Figure 6.8 shows this validation process
with t_i = 2010 and d = 1 year. The test set window d depends
on a few factors related to how the model will be deployed to best
emulate reality:
1. How far out in the future do predictions need to be made? For
example, if the set of students who need to be targeted for
interventions has to be finalized at the beginning of the school
year for the entire year, then d = 1 year.
2. How often will the model be updated? If the model is being
updated daily, then we can move the window by a day at a
time to reflect the deployment scenario.
3. How often will the system get new data? If we are getting new
data frequently, we can make predictions more frequently.
Temporal validation is similar to how time series models are eval-
uated and should be the validation approach used for most practical
problems.
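A minimal sketch of this expanding-window scheme is shown below. It assumes a hypothetical pandas DataFrame df with a year column, a label column, and a list of feature columns; the random forest and the accuracy metric are illustrative choices only.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def temporal_validation(df, feature_cols, start_year=2010, end_year=2014):
    scores = {}
    for test_year in range(start_year + 1, end_year + 1):
        train = df[df['year'] < test_year]    # only information before t_i
        test = df[df['year'] == test_year]    # predict and evaluate on t_i to t_i + d
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(train[feature_cols], train['label'])
        scores[test_year] = accuracy_score(test['label'],
                                           model.predict(test[feature_cols]))
    return scores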
6.6.2 Metrics
The previous subsection focused on validation methodologies as-
suming we have an evaluation metric in mind. This section will go
over commonly used evaluation metrics. You are probably familiar
with using R^2, analysis of the residuals, and mean squared error
(MSE) to evaluate the quality of regression models. For regression
problems, the MSE is the average squared difference be-
tween the predictions ŷ_i and the true values y_i; models with
smaller MSE are better. However, the MSE itself is hard
to interpret because it measures quadratic differences. Instead, the
root mean squared error (RMSE) is more intuitive, as it is a measure
of mean differences on the original scale of the response variable.
Yet another alternative is the mean absolute error (MAE), which
measures average absolute distances between predictions and true
values.
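As a short illustration, these three regression metrics can be computed directly with numpy; the arrays below are hypothetical true and predicted values.

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])    # hypothetical true values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])    # hypothetical predictions

mse = np.mean((y_true - y_pred) ** 2)       # penalizes large errors quadratically
rmse = np.sqrt(mse)                         # back on the original scale of y
mae = np.mean(np.abs(y_true - y_pred))      # average absolute error
print(mse, rmse, mae)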
We will now describe some additional evaluation metrics com-
monly used in machine learning for classification. Before we dive
into metrics, it is important to highlight that machine learning mod-
els for classification typically do not predict 0/1 values directly.
SVMs, random forests, and logistic regression all produce a score
(which is sometimes a probability) that is then turned into 0 or 1
based on a user-specified threshold. You might find that certain tools
(such as sklearn) use a default value for that threshold (often 0.5),
but it is important to know that it is an arbitrary threshold and
you should select the threshold based on the data, the model, and
the problem you are solving. We will cover that a little later in this
section.
Figure 6.9. A confusion matrix created from real-valued predictions

                          Predicted Class
                          1                    0                    total
True Class     1          True Positives       False Negatives      P
               0          False Positives      True Negatives       N
               total      P′                   N′

Once we have turned the real-valued predictions into 0/1 classifi-
cation, we can now create a confusion matrix from these predic-
tions, shown in Figure 6.9. Each data point belongs to either
the positive class or the negative class, and for each data point
the prediction of the classifier is either correct or incorrect. This is
what the four cells of the confusion matrix represent. We can use
the confusion matrix to describe several commonly used evaluation
metrics.
Accuracy is the ratio of correct predictions (both positive and
negative) to all predictions:
Accuracy = \frac{TP + TN}{TP + TN + FP + FN} = \frac{TP + TN}{P + N} = \frac{TP + TN}{P' + N'},
where TP denotes true positives, TN true negatives, FP false posi-
tives, FN false negatives, and other symbols denote row or column
totals as in Figure 6.9. Accuracy is the most commonly described
evaluation metric for classification but is surprisingly the least use-
ful in practical situations (at least by itself). One problem with
accuracy is that it does not give us an idea of lift compared to base-
line. For example, if we have a classification problem with 95% of
the data as positive and 5% as negative, a classifier with 85% accuracy is
performing worse than a dumb classifier that predicts positive all
the time (and will have 95% accuracy).
Two additional metrics that are often used are precision and
recall, which are defined as follows:
Precision = \frac{TP}{TP + FP} = \frac{TP}{P'},
\qquad
Recall = \frac{TP}{TP + FN} = \frac{TP}{P}
(see also Box 7.3). Precision measures the accuracy of the classifier
when it predicts an example to be positive. It is the ratio of correctly
predicted positive examples (TP) to all examples predicted as positive
(TP +FP). This measure is also called positive predictive value in
other fields. Recall measures the ability of the classifier to find
positive examples. It is the ratio of all the correctly predicted positive
examples (TP) to all the positive examples in the data (TP +FN). This
is also called sensitivity in other fields.
You might have encountered another metric called specificity in
other fields. This measure is the true negative rate: the proportion
of negatives that are correctly identified.
Another metric that is used is the F1score, which is the harmonic
mean of precision and recall:
F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}
(see also equation (7.1)). This is often used when you want to bal-
ance both precision and recall.
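As a minimal sketch, the metrics above can be computed from model scores and a chosen threshold; the labels, scores, and the 0.5 threshold below are hypothetical.

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # hypothetical labels
scores = np.array([0.9, 0.4, 0.65, 0.2, 0.55, 0.1, 0.8, 0.3])  # hypothetical model scores

threshold = 0.5                             # an arbitrary, user-selected threshold
y_pred = (scores >= threshold).astype(int)

print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))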
There is often a tradeoff between precision and recall. By se-
lecting different classification thresholds, we can vary and tune the
precision and recall of a given classifier. A highly conservative clas-
sifier that only predicts a 1 when it is absolutely sure (say, a thresh-
old of 0.9999) will most often be correct when it predicts a 1 (high
precision) but will miss most 1s (low recall). At the other extreme, a
classifier that says 1 to every data point (a threshold of 0.0001) will
have perfect recall but low precision. Figure 6.10 shows a precision–
recall curve that is often used to represent the performance of a
given classifier.
If we care about optimizing over the entire precision–recall space, a
useful metric is the area under the curve (AUC-PR), which is the area
under the precision–recall curve. AUC-PR must not be confused
with AUC-ROC, which is the area under the related receiver operat-
ing characteristic (ROC) curve. The ROC curve is created by plotting
recall versus (1 – specificity). Both AUCs can be helpful metrics to
compare the performance of different methods and the maximum
value the AUC can take is 1. If, however, we care about a specific
part of the precision–recall curve, we have to look at finer-grained
metrics.
Let us consider an example from public health. Most public
health agencies conduct inspections of various sorts to detect health
hazard violations (lead hazards, for example). The number of possi-
ble places (homes or businesses) to inspect far exceeds the inspec-
tion resources typically available. Let us assume further that they
Figure 6.10. Precision–recall curve: precision plotted against recall for two classifiers
(Classifier A and Classifier B)
Figure 6.11. Precision or recall at different thresholds: precision and recall plotted
against the percent of the population covered
can only inspect 5% of all possible places; they would clearly want
to prioritize the inspection of places that are most likely to contain
the hazard. In this case, the model will score and rank all the pos-
sible inspection places in order of hazard risk. We would then want
to know what percentage of the top 5% (the ones that will get in-
spected) are likely to be hazards, which translates to the precision
in the top 5% of the most confident predictions—precision at 5%,
as it is commonly called (see Figure 6.11). Precision at top k percent
is a common class of metrics widely used in information retrieval
and search engine literature, where you want to make sure that the
results retrieved at the top of the search results are accurate. More
generally, this metric is often used in problems in which the class
distribution is skewed and only a small percentage of the examples
will be examined manually (inspections, investigations for fraud,
etc.). The literature provides many case studies of such applica-
tions [219,222, 307].
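A minimal sketch of precision at top k percent is shown below; it assumes y_true and scores are hypothetical arrays of true labels and model scores, and k = 0.05 mirrors the 5% inspection budget in the example.

import numpy as np

def precision_at_top_k(y_true, scores, k=0.05):
    # Rank all examples by score and keep the top k percent
    n_top = max(1, int(round(k * len(scores))))
    top_idx = np.argsort(scores)[::-1][:n_top]
    # Precision among the highest-scoring examples that would be inspected
    return float(np.mean(np.asarray(y_true)[top_idx]))

# Hypothetical usage: precision among the top 5% of scored inspection sites
# print(precision_at_top_k(y_true, scores, k=0.05))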
One last metric we want to mention is a class of cost-sensitive
metrics where different costs (or benefits) can be associated with
the different cells in the confusion matrix. So far, we have implicitly
assumed that every correct prediction and every error, whether for
the positive class or the negative class, has equal costs and benefits.
In many practical problems, that is not the case. For example, we
may want to predict whether a patient in a hospital emergency room
is likely to go into cardiac arrest in the next six hours. The cost of a
false positive in this case is the cost of the intervention (which may
be a few extra minutes of a physician’s time) while the cost of a false
negative could be death. This type of analysis allows us to calculate
the expected value of the predictions of a classifier and select the
model that optimizes this cost-sensitive metric.
6.7 Practical tips
Here we highlight some practical tips that will be helpful when work-
ing with machine learning methods.
6.7.1 Features
So far in this chapter, we have focused a lot on methods and pro-
cess, and we have not discussed features in detail. In social science,
they are not called features but instead are known as variables or
predictors. Good features are what makes machine learning sys-
tems effective. Feature generation (or engineering, as it is often
called) is where the bulk of the time is spent in the machine learn-
ing process. As social science researchers or practitioners, you have
spent a lot of time constructing features, using transformations,
dummy variables, and interaction terms. All of that is still required
and critical in the machine learning framework. One difference you
will need to get comfortable with is that instead of carefully selecting
a few predictors, machine learning systems tend to encourage the
creation of lots of features and then empirically use holdout data to
perform regularization and model selection. It is common to have
models that are trained on thousands of features. Commonly used
approaches to create features include:
Transformations, such as log, square, and square root.
Dummy (binary) variables: This is often done by taking cate-
gorical variables (such as city) and creating a binary variable
for each value (one variable for each city in the data). These
are also called indicator variables.
Discretization: Several methods require features to be discrete
instead of continuous. Several approaches exist to convert
continuous variables into discrete ones, the most common of
which is equal-width binning.
Aggregation: Aggregate features often constitute the majority
of features for a given problem. These aggregations use differ-
ent aggregation functions (count, min, max, average, standard
deviation, etc.), often over varying windows of time and space.
For example, given urban data, we would want to calculate the
number (and min, max, mean, variance) of crimes within an
m-mile radius of an address in the past t months for varying
values of m and t, and then to use all of them as features in a
classification problem (a minimal sketch follows this list).
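The sketch below illustrates such aggregation features; it assumes a hypothetical pandas DataFrame crimes with address_id, date, and severity columns, with any distance filtering already done upstream.

import pandas as pd

def aggregate_crime_features(crimes, address_id, as_of_date,
                             months_list=(3, 6, 12)):
    # crimes is a hypothetical DataFrame with columns 'address_id', 'date'
    # (datetime), and 'severity'; the m-mile distance filtering is assumed
    # to have happened upstream when 'address_id' was attached
    as_of_date = pd.Timestamp(as_of_date)
    nearby = crimes[crimes['address_id'] == address_id]
    feats = {}
    for t in months_list:
        window = nearby[(nearby['date'] < as_of_date) &
                        (nearby['date'] >= as_of_date - pd.DateOffset(months=t))]
        feats['crime_count_%dm' % t] = len(window)
        feats['crime_severity_mean_%dm' % t] = window['severity'].mean()
        feats['crime_severity_max_%dm' % t] = window['severity'].max()
    return feats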
In general, it is a good idea to have the complexity in features
and use a simple model, rather than using more complex models
with simple features. Keeping the model simple makes it faster to
train and easier to understand.
6.7.2 Machine learning pipeline
When working on machine learning projects, it is a good idea to
structure your code as a modular pipeline so you can easily try dif-
ferent approaches and methods without major restructuring. The
Python workbooks supporting this book will give you an example of
a machine learning pipeline. A good pipeline will contain modules
for importing data, doing exploration, feature generation, classifica-
tion, and evaluation. You can then instantiate a specific workflow
by combining these modules.
An important component of the machine learning pipeline is
comparing different methods. With all the methods out there and
all the hyperparameters they come with, how do we know which
model to use and which hyperparameters to select? And what hap-
pens when we add new features to the model or when the data have
“temporal drift” and change over time? One simple approach is to
have a nested set of for loops that loop over all the methods you
have access to, then enumerate all the hyperparameters for that
method, create a cross-product, and loop over all of them, compar-
ing them across different evaluation metrics and selecting the best
one to use going forward. You can even add different feature subsets
and time slices to this for loop, as the example in the supporting
workbooks will show.
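The following minimal sketch shows such a nested loop; the classifiers, hyperparameter grids, and the precision metric are illustrative choices, not the exact pipeline in the supporting workbooks.

from itertools import product
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score

# Illustrative model grid: each method paired with a small hyperparameter grid
GRID = {
    LogisticRegression: {'C': [0.1, 1, 10]},
    RandomForestClassifier: {'n_estimators': [100, 500], 'max_depth': [5, None]},
}

def compare_models(X_train, y_train, X_test, y_test):
    results = []
    for method, params in GRID.items():
        # Enumerate the cross-product of hyperparameter values for this method
        keys, values = zip(*params.items())
        for combo in product(*values):
            kwargs = dict(zip(keys, combo))
            model = method(**kwargs).fit(X_train, y_train)
            score = precision_score(y_test, model.predict(X_test))
            results.append((method.__name__, kwargs, score))
    # Best configuration (by precision on the holdout set) comes first
    return sorted(results, key=lambda r: r[2], reverse=True)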
6.7.3 Multiclass problems
In the supervised learning section, we framed classification prob-
lems as binary classification problems with a 0 or 1 output. There
are many problems where we have multiple classes, such as clas-
sifying companies into their industry codes or predicting whether a
student will drop out, transfer, or graduate. Several solutions have
been designed to deal with the multiclass classification problem:
Direct multiclass: Use methods that can directly perform mul-
ticlass classification. Examples of such methods are K-nearest
neighbor, decision trees, and random forests. There are ex-
tensions of support vector machines that exist for multiclass
classification as well [86], but they can often be slow to train.
Convert to one versus all (OVA): This is a common approach
to solve multiclass classification problems using binary clas-
sifiers. Any problem with n classes can be turned into n bi-
nary classification problems, where each classifier is trained
to distinguish between one class versus all the other classes
(see the sketch after this list). A new
example can be classified by combining the predictions from
all the n classifiers and selecting the class with the highest
score. This is a simple and efficient approach, and one that is
commonly used, but it suffers from each classification problem
possibly having an imbalanced class distribution (due to the
negative class being a collection of multiple classes). Another
limitation of this approach is that it requires the scores of each
classifier to be calibrated so that they are comparable across
all of them.
Convert to pairwise: In this approach, we can create binary
classifiers to distinguish between each pair of classes, result-
ing in n(n − 1)/2 binary classifiers. This results in a large number of
classifiers, but each classifier usually has a balanced classifi-
cation problem. A new example is classified by taking the pre-
dictions of all the binary classifiers and using majority voting.
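A minimal one-versus-all sketch using scikit-learn's OneVsRestClassifier is shown below; the synthetic four-class data and the linear SVM are hypothetical choices.

from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Hypothetical data with four classes
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# Train one binary LinearSVC per class; a new example is assigned the class
# whose classifier produces the highest score
ova = OneVsRestClassifier(LinearSVC()).fit(X, y)
print(ova.predict(X[:5]))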
6.7.4 Skewed or imbalanced classification problems
A lot of problems you will deal with will not have uniform (balanced)
distributions for both classes. This is often the case with problems
in fraud detection, network security, and medical diagnosis where
the class of interest is not very common. The same is true in many
social science and public policy problems around behavior predic-
tion, such as predicting which students will not graduate on time,
which children may be at risk of getting lead poisoning, or which
homes are likely to be abandoned in a given city. You will notice
that applying standard machine learning methods may result in all
the predictions being for the most frequent category in such situa-
tions, making it problematic to detect the infrequent classes. There
has been a lot of work in machine learning research on dealing with
such problems [73, 217] that we will not cover in detail here. Com-
mon approaches to deal with class imbalance include oversampling
from the minority class and undersampling from the majority class.
It is important to keep in mind that the sampling approaches do not
need to result in a 1:1 ratio. Many supervised learning methods
described in this chapter (such as SVMs) can work well even with
a 10:1 imbalance. Also, it is critical to make sure that you only
resample the training set; keep the distribution of the test set the
same as that of the original data since you will not know the class
labels of new data in practice and will not be able to resample.
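A minimal sketch of oversampling the minority class in the training set only; the arrays, the 0/1 label coding, and the target ratio below are hypothetical.

import numpy as np

def oversample_minority(X_train, y_train, ratio=1.0, random_state=0):
    # Sample minority-class (label 1) examples with replacement until the
    # minority:majority ratio reaches the (hypothetical) target ratio
    rng = np.random.RandomState(random_state)
    minority = np.flatnonzero(y_train == 1)
    majority = np.flatnonzero(y_train == 0)
    n_needed = int(ratio * len(majority))
    resampled = rng.choice(minority, size=n_needed, replace=True)
    idx = np.concatenate([majority, resampled])
    rng.shuffle(idx)
    return X_train[idx], y_train[idx]

# Split first, then resample only the training portion; the test set
# keeps the original class distribution:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# X_bal, y_bal = oversample_minority(X_train, y_train, ratio=0.5)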
6.8 How can social scientists benefit from
machine learning?
In this chapter, we have introduced you to some new methods (both
unsupervised and supervised), validation methodologies, and eval-
uation metrics. All of these can benefit social scientists as they
tackle problems in research and practice. In this section, we will
give a few concrete examples where what you have learned so far
can be used to improve some social science tasks:
Use of better prediction methods and methodology: Traditional
statistics and social sciences have not focused much on meth-
ods for prediction. Machine learning researchers have spent
the past 30 years developing and adapting methods focusing
on that task. We believe that there is a lot of value for so-
cial science researchers and practitioners in learning more
about those methods, applying them, and even augmenting
them [210]. Two common tasks that can be improved us-
ing better prediction methods are generating counterfactuals
(essentially a prediction problem) and matching. In addition,
holdout sets and cross-validation can be used as a model se-
lection methodology with any existing regression and classi-
fication methods, resulting in improved model selection and
error estimates.
Model misspecification: Linear and logistic regressions are
common techniques for data analysis in the social sciences.
One fundamental assumption within both is that they are ad-
ditive over parameters. Machine learning provides tools when
this assumption is too limiting. Hainmueller and Hazlett [148],
for example, reanalyze data that were originally analyzed with
logistic regression and come to substantially different conclu-
sions. They argue that their analysis, which is more flexible
and based on supervised learning methodology, provides three
additional insights when compared to the original model. First,
predictive performance is similar or better, although they do
not need an extensive search to find the final model specifi-
cation as it was done in the original analysis. Second, their
model allows them to calculate average marginal effects that
are mostly similar to the original analysis. However, for one
covariate they find a substantially different result, which is
due to model misspecification in the original model. Finally,
the reanalysis also discovers interactions that were missed in
the original publication.
Better text analysis: Text is everywhere, but unfortunately hu-
mans are slow and expensive in analyzing text data. Thus,
computers are needed to analyze large collections of text. Ma-
chine learning methods can help make this process more ef-
ficient. Feldman and Sanger [117] provide an overview of
different automatic methods for text analysis. Grimmer and
Stewart [141] give examples that are more specific for social
scientists, and Chapter 7 provides more details on this topic.
Adaptive surveys: Some survey questions have a large num-
ber of possible answer categories. For example, international
job classifications describe more than 500 occupational cat-
egories, and it is prohibitive to ask all categories during the
survey. Instead, respondents answer an open-ended question
about their job and machine learning algorithms can use the
verbatim answers to suggest small sets of plausible answer
options. The respondents can then select which option is the
best description for their occupation, thus saving the costs for
coding after the interview.
Estimating heterogeneous treatment effects: A standard ap-
proach to causal inference is the assignment of different treat-
ments (e.g., medicines) to the units of interest (e.g., patients).
Researchers then usually calculate the average treatment
effect—the average difference in outcomes for both groups. It is
also of interest if treatment effects differ for various subgroups
(e.g., is a medicine more effective for younger people?). Tra-
ditional subgroup analysis has been criticized and challenged
by various machine learning techniques [138, 178].
Variable selection: Although there are many methods for vari-
able selection, regularized methods such as the lasso are highly
effective and efficient when faced with large amounts of data.
Varian [386] goes into more detail and gives other methods
from machine learning that can be useful for variable selec-
tion. We can also find interactions between pairs of variables
(to feed into other models) using random forests, by looking
at variables that co-occur in the same tree, and by calculating
the strength of the interaction as a function of how many trees
they co-occur in, how high they occur in the trees, and how
far apart they are in a given tree.
6.9 Advanced topics
This has been a short but intense introduction to machine learn-
ing, and we have left out several important topics that are useful
and interesting for you to know about and that are being actively
researched in the machine learning community. We mention them
here so you know what they are, but will not describe them in detail.
These include:
Semi-supervised learning,
Active learning,
Reinforcement learning,
Streaming data,
Anomaly detection,
Recommender systems.
6.10 Summary
Machine learning is an active research field, and in this chapter we
have given you an overview of how the work developed in this field
can be used by social scientists. We covered the overall machine
learning process, methods, evaluation approaches and metrics, and
some practical tips, as well as how all of this can benefit social sci-
entists. The material described in this chapter is a snapshot of
a fast-changing field, and as we are seeing increasing collabora-
tions between machine learning researchers and social scientists,
the hope and expectation is that the next few years will bring ad-
vances that will allow us to tackle social and policy problems much
more effectively using new types of data and improved methods.
6.11 Resources
Literature for further reading that also explains most topics from
this chapter in greater depth:
Hastie et al.’s The Elements of Statistical Learning [159] is a
classic and is available online for free.
James et al.’s An Introduction to Statistical Learning [187], from
the same authors, includes less mathematics and is more ap-
proachable. It is also available online.
Mitchell’s Machine Learning [258] is a good introduction to
some of the methods and gives a good motivation underlying
them.
Provost and Fawcett’s Data Science for Business [311] is a good
practical handbook for using machine learning to solve real-
world problems.
Wu et al.’s “Top 10 Algorithms in Data Mining” [409].
Software:
Python (with libraries like scikit-learn, pandas, and more).
R has many relevant packages [168].
Cloud-based: AzureML, Amazon ML.
Free: KNIME, Rapidminer, Weka (mostly for research use).
Commercial: IBM Modeler, SAS Enterprise Miner, Matlab.
Many excellent courses are available online [412], including Hastie
and Tibshirani’s Statistical Learning [158].
Major conferences in this area include the International Confer-
ence on Machine Learning [177], the Annual Conference on Neu-
ral Information Processing Systems (NIPS) [282], and the ACM In-
ternational Conference on Knowledge Discovery and Data Mining
(KDD) [204].
Chapter 7
Text Analysis
Evgeny Klochikhin and Jordan Boyd-Graber
This chapter provides an overview of how social scientists can make
use of one of the most exciting advances in big data—text analysis.
Vast amounts of data that are stored in documents can now be
analyzed and searched so that different types of information can be
retrieved. Documents (and the underlying activities of the entities
that generated the documents) can be categorized into topics or
fields as well as summarized. In addition, machine translation can
be used to compare documents in different languages.
7.1 Understanding what people write
You wake up and read the newspaper, a Facebook post, or an aca-
demic article a colleague sent you. You, like other humans, can
digest and understand rich information, but an increasingly cen-
tral challenge for humans is to cope with the deluge of information
we are supposed to read and understand. In our use case of sci-
ence, even Aristotle struggled with categorizing areas of science;
the vast increase in the scope of written research has only made the
challenge greater.
One approach is to use rule-based methods to tag documents
for categorization. Businesses used to employ human beings to
read the news and tag documents on topics of interest for senior
management. The rules on how to assign these topics and tags were
developed and communicated to these human beings beforehand.
Such a manual categorization process is still common in multiple
applications, e.g., systematic literature reviews [52].
However, as anyone who has used a search engine knows, newer
approaches exist to categorize text and help humans cope with over-
load: computer-aided text analysis. Text data can be used to enrich
“conventional” data sources, such as surveys and administrative
data, since the words spoken or written by individuals often provide
more nuanced and unanticipated insights. Chapter 3 discusses
how to link data to create larger, more diverse data sets. The link-
age data sets need not just be numeric, but can also include data
sets consisting of text data.
Example: Using text to categorize scientific fields
The National Center for Science and Engineering Statistics, the US statistical
agency charged with collecting statistics on science and engineering, uses a rule-
based system to manually create categories of science; these are then used to
categorize research as “physics” or “economics” [262, 288]. In a rule-based system
there is no ready response to the question “how much do we spend on climate
change, food safety, or biofuels?” because existing rules have not created such
categories. Text analysis techniques can be used to provide such detail without
manual collation. For example, data about research awards from public sources
and about people funded on research grants from UMETRICS can be linked with
data about their subsequent publications and related student dissertations from
ProQuest. Both award and dissertation data are text documents that can be used
to characterize what research has been done, provide information about which
projects are similar within or across institutions, and potentially identify new fields
of study [368].
Overall, text analysis can help with specific tasks that define
application-specific subfields including the following:
Searches and information retrieval: Text analysis tools can
help find relevant information in large databases. For exam-
ple, we used these techniques in systematic literature reviews
to facilitate the discovery and retrieval of relevant publica-
tions related to early grade reading in Latin America and the
Caribbean.
Clustering and text categorization: Tools like topic modeling
can provide a big picture of the contents of thousands of doc-
uments in a comprehensible format by discovering only the
most important words and phrases in those documents.
Text summarization: Similar to clustering, text summariza-
tion can provide value in processing large documents and text
corpora. For example, Wang et al. [393] use topic modeling to
produce category-sensitive text summaries and annotations
on large-scale document collections.
Machine translation: Machine translation is an example of a
text analysis method that provides quick insights into docu-
ments written in other languages.
7.2 How to analyze text
Human language is complex and nuanced, which makes analysis
difficult. We often make simplifying assumptions: we assume our
input is perfect text; we ignore humor [149] and deception [280,
292]; and we assume “standard” English [212]. (See Chapter 6 for a
discussion of speech recognition, which can turn spoken language
into text.)
Recognizing this complexity, the goal of text mining is to reduce
the complexity of text and extract important messages in a com-
prehensible and meaningful way. This objective is usually achieved
through text categorization or automatic classification. These tools Classification, a machine
learning method, is dis-
cussed in Chapter 6.
can be used in multiple applications to gain salient insights into the
relationships between words and documents. Examples include us-
ing machine learning to analyze the flow and topic segmentation of
political debates and behaviors [276, 279] and to assign automated
tags to documents [379].
Information retrieval has a similar objective of extracting the
most important messages from textual data that would answer a
particular query. The process analyzes the full text or metadata re-
lated to documents and allows only relevant knowledge to be discov-
ered and returned to the query maker. Typical information retrieval
tasks include knowledge discovery [263], word sense disambigu-
ation [269], and sentiment analysis [295].
The choice of appropriate tools to address specific tasks signifi-
cantly depends on the context and application. For example, doc-
ument classification techniques can be used to gain insights into
the general contents of a large corpus of documents [368], or to
discover a particular knowledge area, or to link corpora based on
implicit semantic relationships [54].
In practical terms, some of the questions can be: How much does
the US government invest in climate change research and nano-
technology? Or what are the main topics in the political debate
on guns in the United States? Or how can we build a salient and
dynamic taxonomy of all scientific research?
We begin with a review of established techniques to begin the
process of analyzing text. Section 7.3 provides an overview of topic
modeling, information retrieval and clustering, and other approaches
accompanied by practical examples and applications. Section 7.4
reviews key evaluation techniques used to assess the validity, ro-
bustness and utility of derived results.
7.2.1 Processing text data
The first important step in working with text data is cleaning and
processing (cleaning and processing are discussed extensively in
Chapter 3). Textual data are often messy and unstructured, which
makes many researchers and practitioners overlook their value. De-
pending on the source, cleaning and processing these data can re-
quire varying amounts of effort but typically involve a set of estab-
lished techniques.
Text corpora A set of multiple similar documents is called a corpus.
For example, the Brown University Standard Corpus of Present-Day
American English, or just the Brown Corpus [128], is a collection
of processed documents from works published in the United States
in 1961. The Brown Corpus was a historical milestone: it was a
machine-readable collection of a million words across 15 balanced
genres with each word tagged with its part of speech (e.g., noun,
verb, preposition). The British National Corpus [383] repeated the
same process for British English at a larger scale. The Penn Tree-
bank [251] provides additional information: in addition to part-of-
speech annotation, it provides syntactic annotation. For example,
what is the object of the sentence “The man bought the hat”? These
standard corpora serve as training data to train the classifiers and
machine learning techniques to automatically analyze text [149].
However, not every corpus is effective for every purpose: the
number and scope of documents determine the range of questions
that you can ask and the quality of the answers you will get back:
too few documents result in a lack of coverage, too many of the
wrong kind of documents invite confusing noise.
Tokenization The first step in processing text is deciding what
terms and phrases are meaningful. Tokenization separates sen-
tences and terms from each other. The Natural Language Toolkit
(NLTK) [39] provides simple reference implementations of standard
natural language processing algorithms such as tokenization—for
example, sentences are separated from each other using punctua-
tion such as period, question mark, or exclamation mark. However,
this does not cover all cases such as quotes, abbreviations, or infor-
mal communication on social media. While separating sentences in
a single language is hard enough, some documents “code-switch,”
combining multiple languages in a single document. These com-
plexities are best addressed through data-driven machine learning
frameworks [209].
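A minimal tokenization sketch with NLTK is shown below; it assumes the nltk package and its punkt tokenizer models are installed, and the example text is hypothetical.

import nltk
# nltk.download('punkt')  # one-time download of the sentence tokenizer models

text = ("Dr. Smith studies systems of innovation. "
        "Is this one sentence or two?")

sentences = nltk.sent_tokenize(text)                  # split text into sentences
tokens = [nltk.word_tokenize(s) for s in sentences]   # split sentences into words
print(sentences)
print(tokens)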
Stop words Once the tokens are clearly separated, it is possible to
perform further text processing at a more granular, token level. Stop
words are a category of words that have limited semantic meaning
regardless of the document contents. Such words can be preposi-
tions, articles, common nouns, etc. For example, the word “the”
accounts for about 7% of all words in the Brown Corpus, and “to”
and “of” are more than 3% each [247].
Hapax legomena are rarely occurring words that might have
only one instance in the entire corpus. These words—names, mis-
spellings, or rare technical terms—are also unlikely to bear signifi-
cant contextual meaning. Similar to stop words, these tokens are
often disregarded in further modeling either by the design of the
method or by manual removal from the corpus before the actual
analysis.
N-grams However, individual words are sometimes not the correct
unit of analysis. For example, blindly removing stop words can ob-
scure important phrases such as “systems of innovation,” “cease
and desist,” or “commander in chief.” Identifying these N-grams
requires looking for statistical patterns to discover phrases that of-
ten appear together in fixed patterns [102]. These combinations
of phrases are often called collocations, as their overall meaning is
more than the sum of their parts.
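A minimal sketch of finding such collocations with NLTK's bigram collocation finder; the tiny token list below is hypothetical, and in practice you would use a full tokenized corpus.

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# A tiny hypothetical token list; in practice, use a full tokenized corpus
tokens = ("systems of innovation are funded and systems of innovation are "
          "studied while the commander in chief speaks").split()

finder = BigramCollocationFinder.from_words(tokens)
measures = BigramAssocMeasures()
# Rank bigrams by pointwise mutual information: phrases that co-occur
# more often than chance rise to the top
print(finder.nbest(measures.pmi, 5))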
Stemming and lemmatization Text normalization is another im-
portant aspect of preprocessing textual data. Given the complexity
of natural language, words can take multiple forms dependent on
the syntactic structure with limited change of their original mean-
ing. For example, the word “system” morphologically has a plural
“systems” or an adjective “systematic.” All these words are seman-
tically similar and—for many tasks—should be treated the same.
For example, if a document has the word “system” occurring three
times, “systems” once, and “systematic” twice, one can assume that
the word “system” with similar meaning and morphological struc-
ture can cover all instances and that variance should be reduced to
“system” with six instances.
The process for text normalization is often implemented using
established lemmatization and stemming algorithms. A lemma is
the original dictionary form of a word. For example, “go,” “went,”
and “goes” will all have the lemma “go.” The stem is a central part
of a given word bearing its primary semantic meaning and uniting
a group of similar lexical units. For example, the words “order” and
“ordering” will have the same stem “ord.” Morphy (a lemmatizer
provided by the electronic dictionary WordNet), Lancaster Stemmer,
and Snowball Stemmer are common tools used to derive lemmas
and stems for tokens, and all have implementations in the NLTK
[39].
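A minimal sketch comparing a stemmer and the WordNet lemmatizer in NLTK; it assumes the WordNet data have been downloaded, and the word list is illustrative.

from nltk.stem import SnowballStemmer, WordNetLemmatizer
# import nltk; nltk.download('wordnet')  # one-time download of the WordNet data

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

# Stems collapse related surface forms to a common root
print([stemmer.stem(w) for w in ["system", "systems", "systematic", "ordering"]])
# Lemmas are dictionary forms; part of speech matters for the lemmatizer
print(lemmatizer.lemmatize("systems"))        # noun lemma: "system"
print(lemmatizer.lemmatize("went", pos="v"))  # verb lemma: "go"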
All text-processing steps are critical to successful analysis. Some
of them bear more importance than others, depending on the spe-
cific application, research questions, and properties of the corpus.
Having all these tools ready is imperative to producing a clean input
for subsequent modeling and analysis. Some simple rules should be
followed to prevent typical errors. For example, stop words should
not be removed before performing n-gram indexing, and a stemmer
should not be used where data are complex and require accounting
for all possible forms and meanings of words. Reviewing interim
results at every stage of the process can be helpful.
7.2.2 How much is a word worth?
Not all words are worth the same; in an article about electronics,
“capacitor” is more important than “aspect.” Appropriately weight-
ing and calibrating words is important for both human and machine
consumers of text data: humans do not want to see “the” as the
most frequent word of every document in summaries, and classifi-
cation algorithms benefit from knowing which features are actually
important to making a decision. (Term weighting is an example of
feature engineering, discussed in Chapter 6.)
Weighting words requires balancing how often a word appears
in a local context (such as a document) with how much it appears
overall in the document collection. Term frequency–inverse docu-
ment frequency (TFIDF) [322] is a weighting scheme to explicitly
balance these factors and prioritize the most meaningful words.
The TFIDF model takes into account both the term frequency of a
given token and its document frequency (Box 7.1) so that if a highly
frequent word also appears in almost all documents, its meaning
for the specific context of the corpus is negligible. Stop words
are a good example when highly frequent words also bear limited
meaning since they appear in virtually all documents of a given
corpus.
Box 7.1: TFIDF
For every token t and every document d in the corpus D, TFIDF
is calculated as

tfidf(t, d, D) = tf(t, d) \times idf(t, D),

where term frequency is either a simple count,

tf(t, d) = f(t, d),

or a more balanced quantity,

tf(t, d) = 0.5 + 0.5 \times \frac{f(t, d)}{\max\{f(t', d) : t' \in d\}},

and inverse document frequency is

idf(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}.
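As a minimal illustration of these weights, scikit-learn's TfidfVectorizer can be applied to a tiny hypothetical corpus; note that its exact weighting variant differs slightly from the formulas in Box 7.1.

from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny hypothetical corpus of three documents
corpus = ["the capacitor stores charge",
          "the aspect of the capacitor design",
          "the report discusses food safety"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
# Rows are documents, columns are tokens, cells are TFIDF weights;
# ubiquitous tokens such as "the" receive low weights
print(vectorizer.get_feature_names_out())   # scikit-learn >= 1.0
print(tfidf.toarray().round(2))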
7.3 Approaches and applications
In this section, we discuss several approaches that allow users to
perform an unsupervised analysis of large text corpora. That is,
approaches that do not require extensive investment of time from
experts or programmers to begin to understand large text corpora.
The ease of using these approaches provides additional opportu-
nities for social scientists and policymakers to gain insights into
policy and research questions through text analysis.
First, we discuss topic modeling, an approach that discovers top-
ics that constitute the high-level themes of a corpus. Topic modeling
is often described as an information discovery process: describing
what concepts are present in a corpus. Second, we discuss infor-
mation retrieval, which finds the closest documents to a particular
concept a user wants to discover. In contrast to topic modeling
(exposing the primary concepts of the corpus, heretofore unknown),
information retrieval finds documents that express already known
concepts. Other approaches can be used for document classifica-
tion, sentiment analysis, and part-of-speech tagging.
7.3.1 Topic modeling
As topic modeling is a broad subfield of natural language processing
and machine learning, we will restrict our focus to a single exemplar:
Figure 7.1. Topics are distributions over words. Here are three example topics
learned by latent Dirichlet allocation from a model with 50 topics discovered from
the New York Times [324]. Topic 1 (computer, technology, system, service, site,
phone, internet, machine) seems to be about technology; Topic 2 (sell, sale, store,
product, business, advertising, market, consumer) about business; and Topic 3
(play, film, movie, theater, production, star, director, stage) about the arts.
latent Dirichlet allocation (LDA) [43]. LDA is a fully Bayesian exten-
sion of probabilistic latent semantic indexing [167], itself a proba-
bilistic extension of latent semantic analysis [224]. Blei and Laf-
ferty [42] provide a more detailed discussion of the history of topic
models.
LDA, like all topic models, assumes that there are topics that
form the building blocks of a corpus. Topics are distributions over
words and are often shown as a ranked list of words, with the high-
est probability words at the top of the list (Figure 7.1). However, we
do not know what the topics are a priori; the challenge is to discover
what they are (more on this shortly).
In addition to assuming that there exist some number of topics
that explain a corpus, LDA also assumes that each document in a
corpus can be explained by a small number of topics. For example,
taking the example topics from Figure 7.1, a document titled “Red
Light, Green Light: A Two-Tone LED to Simplify Screens” would
be about Topic 1, which appears to be about technology. How-
ever, a document like “Forget the Bootleg, Just Download the Movie
Legally” would require all three of the topics. The set of topics that
are used by a document is called the document’s allocation (Fig-
ure 7.2). This terminology explains the name latent Dirichlet alloca-
tion: each document has an allocation over latent topics governed
by a Dirichlet distribution.
7.3.1.1 Inferring topics from raw text
Algorithmically, the problem can be viewed as a black box. Given a
corpus and an integer Kas input, provide the topics that best de-
scribe the document collection: a process called posterior inference.
The most common algorithm for solving this problem is a technique
called Gibbs sampling [131].
Figure 7.2. Allocations of documents to topics. Example New York Times headlines
(“Forget the Bootleg, Just Download the Movie Legally,” “Multiplex Heralded As
Linchpin To Growth,” “The Shape of Cinema, Transformed At the Click of a Mouse,”
“A Peaceful Crew Puts Muppets Where Its Mouth Is,” “Stock Trades: A Better Deal
For Investors Isn't Simple,” “Internet Portals Begin to Distinguish among Themselves
as Shopping Malls,” “Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens”)
are each allocated to one or more of Topic 1 (“technology”), Topic 2 (“business”),
and Topic 3 (“entertainment”).
Gibbs sampling works at the word level to discover the topics that
best describe a document collection. Each word is associated with a
single topic, explaining why that word appeared in a document. For
example, consider the sentence “Hollywood studios are preparing
to let people download and buy electronic copies of movies over the
Internet.” Each word in this sentence is associated with a topic:
“Hollywood” might be associated with an arts topic; “buy” with a
business topic; and “Internet” with a technology topic (Figure 7.3).
This is where we should eventually get. However, we do not know
this to start. So we can initially assign words to topics randomly.
This will result in poor topics, but we can make those topics better.
We improve these topics by taking each word, pretending that we
do not know the topic, and selecting a new topic for the word.
A topic model wants to do two things: it does not want to use
many topics in a document, and it does not want to use many words
in a topic. So the algorithm will keep track of how many times a
document d has used a topic k, N_{d,k}, and how many times a topic
k has used a word w, V_{k,w}. For notational convenience, it will also
be useful to keep track of marginal counts of how many words are
in a document,

N_{d,\cdot} \equiv \sum_k N_{d,k},
Figure 7.3. Each word is associated with a topic. Gibbs sampling inference iter-
atively resamples the topic assignments for each word to discover the most likely
topic assignments that explain the document collection. The figure shows the sen-
tence “Hollywood studios are preparing to let people download and buy electronic
copies of movies over the Internet, much as record labels now sell songs for 99
cents through Apple Computer's iTunes music store and other online services . . . ”
with each word linked to a technology, business, or entertainment topic.
and how many words are associated with a topic,
V_{k,\cdot} \equiv \sum_w V_{k,w}.
The algorithm removes the counts for a word from N_{d,k} and V_{k,w} and
then changes the topic of a word (hopefully to a better topic than
the one it had before). Through many thousands of iterations of this
process, the algorithm can find topics that are coherent, useful, and
characterize the data well.
The two goals of topic modeling—balancing document allocations
to topics and topics’ distribution over words—come together in an
equation that multiplies them together. A good topic will be both
common in a document and explain a word’s appearance well.
Example: Gibbs sampling for topic models
The topic assignment z_{d,n} of word n in document d is proportional to

p(z_{d,n} = k) \propto
\underbrace{\frac{N_{d,k} + \alpha}{N_{d,\cdot} + K\alpha}}_{\text{how much doc likes the topic}}
\times
\underbrace{\frac{V_{k,w_{d,n}} + \beta}{V_{k,\cdot} + V\beta}}_{\text{how much topic likes the word}},

where \alpha and \beta are smoothing factors that prevent a topic from having zero prob-
ability if a topic does not use a word or a document does not use a topic [390].
Recall that we do not include the token that we are sampling in the counts for N
or V.
For the sake of concreteness, assume that we have three documents with the
following topic assignments (each token is labeled with a letter and subscripted
with its current topic):

Document 1: A dog_3, B cat_2, C cat_3, D pig_1
Document 2: E hamburger_2, F dog_3, G hamburger_1
Document 3: H iron_1, I iron_3, J pig_2, K iron_2

If we want to sample token B (the first instance of “cat” in document 1), we
compute the conditional probability for each of the three topics (z = 1, 2, 3):

p(z_B = 1) \propto \frac{1 + 1.000}{3 + 3.000} \times \frac{0 + 1.000}{3 + 5.000} = 0.333 \times 0.125 = 0.042,
p(z_B = 2) \propto \frac{0 + 1.000}{3 + 3.000} \times \frac{0 + 1.000}{3 + 5.000} = 0.167 \times 0.125 = 0.021, and
p(z_B = 3) \propto \frac{2 + 1.000}{3 + 3.000} \times \frac{1 + 1.000}{4 + 5.000} = 0.500 \times 0.222 = 0.111.

To reiterate, we do not include token B in these counts: in computing these con-
ditional probabilities, we consider topic 2 as never appearing in the document and
“cat” as never appearing in topic 2. However, “cat” does appear in topic 3 (token C),
so it has a higher probability than the other topics. After renormalizing, our con-
ditional probabilities are (0.24, 0.12, 0.64). We then sample the new assignment
of token B to be topic 3 two times out of three. Griffiths and Steyvers [140] provide
more details on the derivation of this equation.
Example code Listing 7.1 provides a function to compute the conditional proba-
bility of a single word and return the (unnormalized) probability to sample from.
7.3.1.2 Applications of topic models
Topic modeling is most often used for topic exploration, allowing
users to understand the contents of large text corpora. Thus,
topic models have been used, for example, to understand what
the National Institutes of Health funds [368]; to compare and con-
trast what was discussed in the North and South in the Civil War
[270]; and to understand how individuals code in large program-
ming projects [253].
Topic models can also be used as features to more elaborate
algorithms such as machine translation [170], detecting objects in
images [392], or identifying political polarization [298].
def class_sample(docs, vocab, d, n, alpha,
                 beta, theta, phi, num_topics):
    # Get the vocabulary ID of the word we are sampling
    word_type = docs[d][n]
    # Dictionary to store the unnormalized probability of each topic
    result = {}
    # Consider each topic possibility
    for kk in range(num_topics):
        # theta stores the number of times the document d uses
        # each topic kk; alpha is a smoothing parameter
        doc_contrib = (theta[d][kk] + alpha) / \
            (sum(theta[d].values()) + num_topics * alpha)
        # phi stores the number of times topic kk uses
        # this word type; beta is a smoothing parameter
        topic_contrib = (phi[kk][word_type] + beta) / \
            (sum(phi[kk].values()) + len(vocab) * beta)
        result[kk] = doc_contrib * topic_contrib
    return result
Listing 7.1. Python code to compute conditional probability of a single word and
return the probability from which to sample
7.3.2 Information retrieval and clustering
Information retrieval is a large subdiscipline that encompasses a
variety of methods and approaches. Its main advantage is using
large-scale empirical data to make analytical inferences and class
assignments. Compared to topic modeling, discussed above, infor-
mation retrieval techniques can use external knowledge repositories
to categorize given corpora as well as discover smaller and emerging
areas within a large database.
A major concept of information retrieval is a search query that is
usually a short phrase presented by a human or machine to retrieve
a relevant answer to a question or discover relevant knowledge.
A good example of a large-scale information retrieval system is a
search engine, such as Google or Yahoo!, that provides the user with
an opportunity to search the entire Internet almost instantaneously.
Such fast searches are achieved by complex techniques that are
linguistic (set-theoretic), algebraic, probabilistic, or feature-based.
Set-theoretic operations and Boolean logic Set-theoretic opera-
tions proceed from the assumption that any query is a set of linked
components all of which need to be present in the returned result for
it to be relevant. Boolean logic serves as the basis for such queries;
it uses Boolean operators such as AND,OR, and NOT to combine query
components. For example, the query
induction AND (physics OR logic)
will retrieve all documents in which the word “induction” is used,
whether in a physical or logical sense. The extended Boolean model
and fuzzy retrieval are enhanced approaches to calculating the rel-
evance of retrieved documents based on such queries [232].
Search queries can be also enriched by wildcards and other con-
nectors. For example, the character “*” typically substitutes for
any possible character or characters depending on the settings of
the query engine. (In some instances, search queries can run in
nongreedy mode, in which case, for example, the phrase inform*
might retrieve only text up to the end of the sentence. On the other
hand, a greedy query might retrieve full text following the word or
part of word, denoted as inform*, up to the end of the document,
which would essentially mean the same as inform*$.) The wildcard
“?” expects either one or no character in its place, and the wild-
card “.” expects exactly one character. Search queries enhanced
with such symbols and Boolean operators are referred to as regular
expressions.
Various databases and search engines can interpret Boolean op-
erators and wildcards differently depending on their settings and
therefore are prone to return rather different results. This behavior
should be expected and controlled for while running searches on
different data sources.
Example: Discover food safety awards
Food safety is an interdisciplinary research area that spans multiple scientific
disciplines including biological sciences, agriculture, and food science. To retrieve
food safety-related awards, we have to construct a Boolean-based search string
that would look for terms and phrases in those documents and return only relevant
results.
An example of such a string would typically be subdivided by category or search
group connected to each other by the AND or OR operator:
1. General terms: (food safety OR food securit* OR food
insecurit*)
2. Food pathogens: (food*) AND (acanthamoeba OR actinobacteri*
OR (anaerobic organ*) OR DDT OR ...)
3. Biochemistry and toxicology: (food*) AND (toxicolog* OR
(activated carbon*) OR (acid-hydrol?zed vegetable
protein*) OR aflatoxin* OR ...)
4. Food processing and preservation: (food*) AND (process* OR
preserv* OR fortif* OR extrac* OR ...)
5. Food quality and quality control: (food*) AND (qualit* OR (danger
zon*) OR test* OR (risk analys*) OR ...)
6. Food-related diseases: (food* OR foodbo?rn* OR food-rela*)
AND (diseas* OR hygien* OR allerg* OR diarrh?ea* OR
nutrit* OR ...)
Different websites and databases use different search functions to return most rel-
evant results given a query (e.g., “food safety”). Ideally, a user has access to a full
database and can apply the same Python code based on regular expressions to
all textual data. However, this is not always possible (e.g., when using proprietary
databases, such as Web of Science). In those cases, it is important to follow the con-
ventions of the information retrieval system. For example, one source might need
phrases to be embedded in parentheses (i.e., (x-ray crystallograph.*))
while another database interface would require such phrases to be contained
within quotation marks (i.e., “x-ray crystallograph.*”). It is then criti-
cal to explore the search tips and rules on those databases to ensure that the most
complete information is gathered and further analyzed.
Example code Python’s built-in re package provides all the capability needed to
construct and search with complex regular expressions. Listing 7.2 provides an
example.
import csv
import re

def fs_regex(nsf_award_abstracts, outfilename):
    # Construct simple search strings divided by search group
    food = "food.*"
    general = "safety|secur.*|insecur.*"
    pathogens = ("toxicolog.*|acid-hydrolyzed vegetable protein.*|"
                 "activated carbon.*")
    process = "process.*|preserv.*|fortif.*"  # and so on
    # Open csv table with all NSF award abstracts in 2000-2014
    inpfile = open(nsf_award_abstracts, 'r')
    inpdata = csv.reader(inpfile)
    outfile = open(outfilename, 'w', newline='')
    output = csv.writer(outfile)
    for line in inpdata:
        award_id = line[0]
        title = line[1]
        abstract = line[2]
        # Keep the award if it mentions food and matches at least one group
        if re.search(food, abstract) and (re.search(general, abstract)
                or re.search(pathogens, abstract)
                or re.search(process, abstract)):
            output.writerow(line + ['food safety award'])
Listing 7.2. Python code to identify food safety related NSF awards with regular
expressions
Algebraic models Algebraic models turn text into numbers to run
mathematical operations and discover inherent interdependencies
between terms and phrases, also defining the most important and
meaningful among them. The vector space representation is a typ-
ical way of converting words into numbers, wherein every token is
assigned with a sequential ID and a respective weight, be it a simple
term frequency, TFIDF value, or any other assigned number.
Latent Dirichlet allocation, discussed in the preceding section,
is a good example of a probabilistic model, while machine learning
techniques such as random forests can be used for feature-based
modeling and information retrieval. (Random forests are discussed
for record linkage in Chapter 3 and described in more detail in
Chapter 10.)
Similarity measures and approaches Based on algebraic models,
the user can either compare documents between each other or train
a model that can be further inferred on a different corpus. Typ-
ical metrics involved in this process include cosine similarity and
Kullback–Leibler divergence [218].
Cosine similarity is a popular measure in document classifica-
tion. Given two documents d_a and d_b presented as term vectors
\vec{t}_a and \vec{t}_b, the cosine similarity is

SIM_C(\vec{t}_a, \vec{t}_b) = \frac{\vec{t}_a \cdot \vec{t}_b}{|\vec{t}_a| \times |\vec{t}_b|}.
Example: Measuring cosine similarity between documents
NSF awards are not labeled by scientific field—they are labeled by program. This
administrative classification is not always useful to assess the effects of certain
funding mechanisms on disciplines and scientific communities. One approach
is to understand how awards align with each other even if they were funded by
different programs. Cosine similarity allows us to do just that.
Example code The Python numpy module is a powerful library of tools for efficient
linear algebra computation. Among other things, it can be used to compute the
cosine similarity of two documents represented by numeric vectors, as described
above. The gensim module that is often used as a Python-based topic modeling
implementation can be used to produce vector space representations of textual
data. Listing 7.3 provides an example of measuring cosine similarity using these
modules.
import csv
import numpy as np
from gensim import corpora

# Define cosine similarity function
def coss(v1, v2):
    return np.dot(v1, v2) / (
        np.sqrt(np.sum(np.square(v1))) *
        np.sqrt(np.sum(np.square(v2))))

def coss_nsf(nsf_climate_change, nsf_earth_science, outfile):
    # Open the source documents and the documents they are compared to
    source = csv.reader(open(nsf_climate_change, 'r'))
    comparison = csv.reader(open(nsf_earth_science, 'r'))
    # Create an output file
    output = csv.writer(open(outfile, 'w', newline=''))
    # Read through the source and store values in a static data container
    data = {}
    for row in source:
        award_id = row[0]
        abstract = row[1]
        data[award_id] = abstract
    # Read through the comparison file and compute similarity
    for row in comparison:
        award_id = row[0]
        # Assuming that the abstract is cleaned, processed, tokenized, and
        # stored as a space-separated string of tokens
        abstract_tokens = row[1].split(" ")
        # Construct a dictionary of tokens and IDs
        dict_abstract = corpora.Dictionary([abstract_tokens])
        # Construct a {token ID: count} vector for the comparison abstract
        abstr_vector = dict(dict_abstract.doc2bow(abstract_tokens))
        # Iterate through all stored abstracts in the source corpus
        # and assign the same token IDs using the dictionary
        for source_id, value in data.items():
            # Get all tokens from the source abstract, assuming it is
            # tokenized and space-separated
            source_tokens = value.split(" ")
            source_vector = dict(dict_abstract.doc2bow(source_tokens))
            # Cosine similarity requires vectors of the same shape, so
            # impute zeros for tokens of the comparison abstract that are
            # missing from the source abstract
            add = {n: 0 for n in abstr_vector.keys()
                   if n not in source_vector.keys()}
            source_vector.update(add)
            # Align both vectors by token ID
            source_items = sorted(source_vector.items())
            abstr_items = sorted(abstr_vector.items())
            # Compute cosine similarity and write out the result
            similarity = coss(
                np.array([item[1] for item in abstr_items]),
                np.array([item[1] for item in source_items]))
            output.writerow([source_id, award_id, similarity])
Listing 7.3. Python code to measure cosine similarity between Climate Change
and all other Earth Science NSF awards
Kullback–Leibler (KL) divergence is an asymmetric measure that is often enhanced by averaged calculations to ensure unbiased results when comparing documents with each other or running a classification task. Given two term vectors $\vec{t}_a$ and $\vec{t}_b$, the KL divergence from vector $\vec{t}_a$ to $\vec{t}_b$ is
$$D_{KL}(\vec{t}_a \,\|\, \vec{t}_b) = \sum_{t=1}^{m} w_{t,a} \times \log\!\left(\frac{w_{t,a}}{w_{t,b}}\right),$$
where $w_{t,a}$ and $w_{t,b}$ are the term weights in the two vectors, respectively.
An averaged KL divergence metric is then defined as
$$D_{AvgKL}(\vec{t}_a \,\|\, \vec{t}_b) = \sum_{t=1}^{m} \left(\pi_1 \times D(w_{t,a} \,\|\, w_t) + \pi_2 \times D(w_{t,b} \,\|\, w_t)\right),$$
where $\pi_1 = \frac{w_{t,a}}{w_{t,a}+w_{t,b}}$, $\pi_2 = \frac{w_{t,b}}{w_{t,a}+w_{t,b}}$, and $w_t = \pi_1 \times w_{t,a} + \pi_2 \times w_{t,b}$ [171].
The Python-based scikit-learn library provides an implementation of these measures as well as other machine learning models and approaches.
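As an illustration, here is a minimal numpy sketch (not the book's code) of the averaged KL divergence defined above, applied to two toy term-weight vectors over the same vocabulary; the small smoothing constant is an assumption added to avoid division by zero.
import numpy as np

def avg_kl_divergence(ta, tb, eps=1e-12):
    # Term weights for the two documents, smoothed to avoid zeros
    wa = np.asarray(ta, dtype=float) + eps
    wb = np.asarray(tb, dtype=float) + eps
    pi1 = wa / (wa + wb)
    pi2 = wb / (wa + wb)
    wt = pi1 * wa + pi2 * wb
    # Term-by-term divergences D(w_a || w_t) and D(w_b || w_t)
    d_a = wa * np.log(wa / wt)
    d_b = wb * np.log(wb / wt)
    return np.sum(pi1 * d_a + pi2 * d_b)

# Two toy term-frequency vectors over the same vocabulary
print(avg_kl_divergence([3, 0, 1, 2], [1, 2, 0, 2]))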
Knowledge repositories Information retrieval can be significantly
enriched by the use of established knowledge repositories that can
provide enormous amounts of organized empirical data for modeling
and relevance calculations. Established corpora, such as the Brown
Corpus and Lancaster–Oslo–Bergen Corpus, are one type of such
preprocessed repositories.
Wikipedia and WordNet are examples of another type of lexi-
cal and semantic resources that are dynamic in nature and that
can provide a valuable basis for consistent and salient information
retrieval and clustering. These repositories have the innate hierar-
chy, or ontology, of words (and concepts) that are explicitly linked
to each other either by inter-document links (Wikipedia) or by the
inherent structure of the repository (WordNet). In Wikipedia, con-
cepts thus can be considered as titles of individual Wikipedia pages
and the contents of these pages can be considered as their extended
semantic representation.
Information retrieval techniques build on these advantages of
WordNet and Wikipedia. For example, Meij et al. [256] mapped
search queries to the DBpedia ontology (derived from Wikipedia top-
ics and their relationships), and found that this mapping enriches
the search queries with additional context and concept relation-
ships. One way of using these ontologies is to retrieve a predefined
list of Wikipedia pages that would match a specific taxonomy. For
example, scientific disciplines are an established way of tagging
documents—some are in physics, others in chemistry, engineer-
ing, or computer science. If a user retrieves four Wikipedia pages
on “Physics,” “Chemistry,” “Engineering,” and “Computer Science,”
they can be further mapped to a given set of scientific documents to
label and classify them, such as a corpus of award abstracts from
the US National Science Foundation.
Personalized PageRank is a similarity system that can help with
the task. This system uses WordNet to assess semantic relation-
ships and relevance between a search query (document d) and pos-
sible results (the most similar Wikipedia article or articles). This
system has been applied to text categorization [269] by compar-
ing documents to semantic model vectors of Wikipedia pages con-
structed using WordNet. These vectors account for the term fre-
quency and their relative importance given their place in the Word-
Net hierarchy, so that the overall wiki vector is defined as:
$$\mathrm{SMV}_{wiki}(s) = \sum_{w \in \mathrm{Synonyms}(s)} \frac{tf_{wiki}(w)}{|\mathrm{Synsets}(w)|},$$
where $w$ is a token within $wiki$, $s$ is a WordNet synset associated with every token $w$ in the WordNet hierarchy, $\mathrm{Synonyms}(s)$ is the set of words (i.e., synonyms) in the synset $s$, $tf_{wiki}(w)$ is the term frequency of the word $w$ in the Wikipedia article $wiki$, and $\mathrm{Synsets}(w)$ is the set of synsets for the word $w$.
The overall probability of a candidate document $d$ (e.g., an NSF award abstract or a PhD dissertation abstract) matching the target query, or in our case a Wikipedia article $wiki$, is
$$wiki_{BEST} = \sum_{w_t \in doc} \max_{s \in \mathrm{Synsets}(w_t)} \mathrm{SMV}_{wiki}(s),$$
where $\mathrm{Synsets}(w_t)$ is the set of synsets for the word $w_t$ in the target document (e.g., an NSF award abstract) and $\mathrm{SMV}_{wiki}(s)$ is the semantic model vector of a Wikipedia page, as defined above.
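The formulas above can be approximated with NLTK's WordNet interface. The sketch below (not the authors' implementation) builds a semantic model vector from a hypothetical list of Wikipedia-article tokens and scores a candidate document against it; the token lists and function names are assumptions for illustration.
from collections import Counter
from nltk.corpus import wordnet

def semantic_model_vector(wiki_tokens):
    # tf_wiki(w): term frequencies within the Wikipedia article
    tf = Counter(wiki_tokens)
    smv = Counter()
    for w, freq in tf.items():
        synsets = wordnet.synsets(w)          # Synsets(w)
        if not synsets:
            continue
        for s in synsets:
            # each synset containing w receives tf_wiki(w) / |Synsets(w)|
            smv[s] += freq / len(synsets)
    return smv

def match_score(doc_tokens, smv):
    # wiki_BEST: for each document token, take the best-matching synset weight
    total = 0.0
    for wt in set(doc_tokens):
        synsets = wordnet.synsets(wt)
        if synsets:
            total += max(smv.get(s, 0.0) for s in synsets)
    return total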
Applications Information retrieval can be used in a number of ap-
plications. Knowledge discovery, or information extraction, is per-
haps its primary mission; in contrast, for users, the purpose of
information retrieval applications is to retrieve the most relevant
response to a query.
Document classification is another popular task where informa-
tion retrieval methods can be helpful. Such systems, however, typi-
cally require a two-step process: The first phase defines all relevant
information needed to answer the query. The second phase clus-
ters the documents according to a set of rules or by allowing the
machine to actively learn the patterns and classes. For example,
one approach is to generate a taxonomy of concepts with associated
Wikipedia pages and then map other documents to these pages
through Personalized PageRank. In this case, disciplines, such as
physics, chemistry, and engineering, can be used as the original la-
bels, and NSF award abstracts can be mapped to these disciplinary
categories through the similarity metrics (i.e., whichever of these
disciplines scores the highest is the most likely to fit the disciplinary
profile of an award abstract).
Another approach is to use the Wikipedia structure as a clus-
tering mechanism in itself. For example, the article about “nano-
technology” links to a number of other Wikipedia pages as ref-
erenced in its content. “Quantum realm,” “nanometer” or “Na-
tional Nanotechnology Initiative” are among the meaningful con-
cepts used in the description of nanotechnology that also have their
own individual Wikipedia pages. Using these pages, we can assume
that if a scientific document, such as an NSF award abstract, has
enough similarity with any one of the articles associated with nano-
technology, it can be tagged as such in the classification exercise.
The process can also be turned around: if the user knows ex-
actly the clusters of documents in a given corpus, these can be
mapped to an external knowledge repository, such as Wikipedia, to
discover yet unknown and emerging relationships between concepts
that are not explicitly mentioned in the Wikipedia ontology at the
current moment. This situation is likely given the time lag between
the discovery of new phenomena, their introduction to the research
community, and their adoption by the wider user community re-
sponsible for writing Wikipedia pages.
Examples Some examples from our recent work can demonstrate
how Wikipedia-based labeling and labeled LDA [278,315] cope with
the task of document classification and labeling in the scientific
domain. See Table 7.1.
7.3.3 Other approaches
Our focus in this chapter is on approaches that are language in-
dependent and require little (human) effort to analyze text data.
In addition to topic modeling and information retrieval discussed
above, natural language processing and computational linguistics
Table 7.1. Wikipedia articles as potential labels generated by n-gram indexing of NSF awards

Abstract excerpt: Reconfigurable computing platform for small-scale resource-constrained robot. Specific applications often require robots of small size for reasons such as costs, access, and stealth. Small-scale robots impose constraints on resources such as power or space for modules . . .
ProQuest subject category: Engineering, Electronics and Electrical; Engineering, Robotics
Labeled LDA: Motor controller
Wikipedia-based labeling: Robotics, Robot, Field-programmable gate array

Abstract excerpt: Genetic mechanisms of thalamic nuclei specification and the influence of thalamocortical axons in regulating neocortical area formation. Sensory information from the periphery is essential for all animal species to learn, adapt, and survive in their environment. The thalamus, a critical structure in the diencephalon, receives sensory information . . .
ProQuest subject category: Biology, Neurobiology
Labeled LDA: HSD2 neurons
Wikipedia-based labeling: Sonic hedgehog, Induced stem cell, Nervous system

Abstract excerpt: Poetry ’n acts: The cultural politics of twentieth-century American poets’ theater. This study focuses on the disciplinary blind spot that obscures the productive overlap between poetry and dramatic theater and prevents us from seeing the cultural work that this combination can perform . . .
ProQuest subject category: Literature, American; Theater
Labeled LDA: Audience
Wikipedia-based labeling: Counterculture of the 1960s, Novel, Modernism
are rich, well-developed subdisciplines of computer science that can
help analyze text data. While covering these subfields is beyond this
chapter, we briefly discuss some of the most widely used approaches
to process and understand natural language texts.
In contrast to the unsupervised approaches discussed above, most techniques in natural language processing are supervised machine learning algorithms. (Chapter 6 reviews supervised machine learning approaches.) Supervised machine learning produces labels y given inputs x—the algorithm's job is to learn how to automatically produce correct labels given the inputs x.
However, the algorithm must have access to many examples of
x and y, often on the order of thousands of examples. This is expen-
sive, as the labels often require linguistic expertise [251]. While it
is possible to annotate data using crowdsourcing [350], this is not
a panacea, as it often forces compromises in the complexity of the
task or the quality of the labels.
In what follows, we discuss how different definitions of x and y, both in the scope and structure of the examples and labels, define unique analyses of linguistic data.
Document classification If the examples x are documents and y
are what these documents are about, the problem is called docu-
ment classification. In contrast to the techniques in Section 7.3.1,
document classification is used when you know the specific docu-
ment types for which you are looking and you have many examples
of those document types.
One simple but ubiquitous example of document classification
is spam detection: an email is either an unwanted advertisement
(spam) or it is not. Document classification techniques such as
naïve Bayes [235] touch essentially every email sent worldwide,
making email usable even though most emails are spam.
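A hedged sketch of this kind of classifier, using scikit-learn's multinomial naive Bayes on a tiny made-up set of labeled emails (the corpus and labels are purely illustrative), might look as follows.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "cheap pills online now",
          "meeting agenda for tomorrow", "draft of the grant report"]
labels = ["spam", "spam", "not spam", "not spam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)       # word-count features
classifier = MultinomialNB().fit(X, labels)
print(classifier.predict(vectorizer.transform(["free prize pills"])))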
Sentiment analysis Instead of being what a document is about, a
label y could also reveal the speaker. A recent subfield of natural
language processing is to use machine learning to reveal the internal
state of speakers based on what they say about a subject [295]. For
example, given an example of sentence x, can we determine whether
the speaker is a Liberal or a Conservative? Is the speaker happy or
sad?
Simple approaches use dictionaries and word counting meth-
ods [299], but more nuanced approaches make use of domain-
specific information to make better predictions. One uses different
approaches to praise a toaster than to praise an air conditioner [44];
liberals and conservatives each frame health care differently from
how they frame energy policy [277].
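A minimal sketch of the dictionary and word-counting approach (not from the book; the word lists are invented) simply counts positive and negative terms.
positive = {"good", "great", "excellent", "praise", "happy"}
negative = {"bad", "poor", "terrible", "broken", "sad"}

def sentiment_score(text):
    # Positive count minus negative count over whitespace tokens
    tokens = text.lower().split()
    return sum(t in positive for t in tokens) - sum(t in negative for t in tokens)

print(sentiment_score("the toaster is great but the manual is terrible"))  # 0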
Part-of-speech tagging When the examples x are individual words and the labels y represent the grammatical function of a word (e.g.,
whether a word is a noun, verb, or adjective), the task is called
part-of-speech tagging. This level of analysis can be useful for dis-
covering simple patterns in text: distinguishing between when “hit”
is used as a noun (a Hollywood hit) and when “hit” is used as a verb
(the car hit the guard rail).
Unlike document classification, the examples xare not indepen-
dent: knowing whether the previous word was an adjective makes
it far more likely that the next word will be a noun than a verb.
Thus, the classification algorithms need to incorporate structure
into the decisions. Two common algorithms for this problem are
hidden Markov models [313] and conditional random fields [220].
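As a quick illustration (not from the book), NLTK's off-the-shelf tagger can separate the two uses of "hit" mentioned above; it assumes the NLTK tokenizer and tagger models have been downloaded, and the tags shown in the comments are the typical, not guaranteed, output.
import nltk

print(nltk.pos_tag(nltk.word_tokenize("The movie was a Hollywood hit")))
# ... ('hit', 'NN')   -- "hit" typically tagged as a noun
print(nltk.pos_tag(nltk.word_tokenize("The car hit the guard rail")))
# ... ('hit', 'VBD')  -- "hit" typically tagged as a past-tense verb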
7.4 Evaluation
Evaluation techniques are common in economics, policy analysis,
and development. They allow researchers to justify their conclu-
sions using statistical means of validation and assessment. Text,
however, is less amenable to standard definitions of error: it is clear
that predicting that revenue will be $110 when it is really $100 is
far better than predicting $900; however, it is hard to say how far
“potato harvest” is from “journalism” if you are attempting to auto-
matically label documents. Documents are hard to transform into
numbers without losing semantic meanings and context.
Content analysis, discourse analysis, and bibliometrics are all
common tools used by social scientists in their text mining exer-
cises [134, 358]. However, they are rarely presented with robust
evaluation metrics, such as type I and type II error rates, when re-
trieving data for further analysis. (Chapter 10 discusses how to measure and diagnose errors in big data.) For example, bibliometricians often rely on search strings derived from expert interviews and
workshops. However, it is hard to certify that those search strings
are optimal. For instance, in nanotechnology research, Porter et
al. [305] developed a canonical search strategy for retrieving nano-
related papers from major scientific databases. Nevertheless, others
adopt their own search string modifications and claim similar va-
lidity [144, 374].
Evaluating these methods depends on reference corpora. We
discuss metrics that help you understand whether a collection of
documents for a query is a good one or not or whether a labeling of
a document collection is consistent with an existing set of labels.
Purity Suppose you are tasked with categorizing a collection of
documents based on what they are about. Reasonable people may
disagree: I might put “science and medicine” together, while an-
other person may create separate categories for “energy,” “scientific
research,” and “health care,” none of which is a strict subset of
my “science and medicine” category. Nevertheless, we still want to
know whether two categorizations are consistent.
Let us first consider the case where the labels differ but all cat-
egories match (i.e., even though you call one category “taxes” and
I call it “taxation,” it has exactly the same constituent documents).
This should be the best case; it should have the highest score poss-
ible. Let us say that this maximum score should be 1.
The opposite case is if we both simply assign labels randomly.
There will still be some overlap in our labeling: we will agree some-
times, purely by chance. On average, if we both assign one label,
selected from the same set of K labels, to each document, then we should expect to agree on about 1/K of the labels. This is a lower
bound on performance.
The formalization of this measure is called purity: how much
overlap there is between each of my labels and the “best” match
from your labels. Box 7.2 shows how to calculate it.
Box 7.2: Purity calculation
We compute purity by assigning each cluster to the class that is
most frequent in the cluster, and then measuring the accuracy
of this assignment by counting correctly assigned documents
and dividing by the number of all documents, N [248]. In formal terms,
$$\mathrm{Purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |w_k \cap c_j|,$$
where $\Omega = \{w_1, w_2, \ldots, w_k\}$ is the set of candidate clusters and $C = \{c_1, c_2, \ldots, c_j\}$ is the gold set of classes.
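A minimal sketch of this calculation (not from the book) takes two parallel lists of labels, one per document, for the candidate clustering and the gold classes; the example labels are invented.
from collections import Counter, defaultdict

def purity(clusters, classes):
    N = len(clusters)
    # Group the gold class labels by candidate cluster
    by_cluster = defaultdict(list)
    for w, c in zip(clusters, classes):
        by_cluster[w].append(c)
    # Count the documents of the most frequent class in each cluster
    correct = sum(max(Counter(members).values())
                  for members in by_cluster.values())
    return correct / N

# Six documents, two candidate clusters, two gold classes
print(purity(["a", "a", "a", "b", "b", "b"],
             ["x", "x", "y", "y", "y", "x"]))  # 0.67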
Precision and recall Chapter 6 already touched on the importance
of precision and recall for evaluating the results of information re-
trieval and machine learning models (Box 7.3 provides a reminder
of the formulae). Here we look at a particular example of how these
metrics can be computed when working with scientific documents.
We assume that a user has three sets of documents, Da = {da1, da2, . . . , dn}, Db = {db1, db2, . . . , dk}, and Dc = {dc1, dc2, . . . , di}. All three sets are clearly tagged with a disciplinary label: Da are computer science documents, Db are physics, and Dc are chemistry.
The user also has a different set of documents—Wikipedia pages on “Computer Science,” “Chemistry,” and “Physics.” Knowing that all documents in Da, Db, and Dc have clear disciplinary assign-
ments, let us map the given Wikipedia pages to all documents within
those three sets. For example, the Wikipedia-based query on “Com-
puter Science” should return all computer science documents and
none in physics or chemistry. So, if the query based on the “Com-
puter Science” Wikipedia page returns only 50% of all computer
science documents, then 50% of the relevant documents are lost:
the recall is 0.5.
On the other hand, if the same “Computer Science” query re-
turns 50% of all computer science documents but also 20% of the
Box 7.3: Precision and recall
These two metrics are commonly used in information retrieval
and computational linguistics [318]. Precision computes the
type I errors—false positives—in a similar manner to the purity
measure; it is formally defined as
$$\mathrm{Precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}.$$
Recall accounts for type II errors—false negatives—and is defined as
$$\mathrm{Recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}.$$
physics documents and 50% of the chemistry documents, then all
of the physics and chemistry documents returned are false posi-
tives. Assuming that all document sets are of equal size, so that
|Da| = 10, |Db| = 10, and |Dc| = 10, then the precision is 5/12 = 0.42.
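The worked example above can be reproduced with a few lines of Python (the document identifiers are hypothetical): ten documents per discipline, with a "Computer Science" query returning five computer science, two physics, and five chemistry documents.
relevant = {"cs%d" % i for i in range(10)}           # all 10 CS documents
retrieved = ({"cs%d" % i for i in range(5)} |        # 50% of CS
             {"phys%d" % i for i in range(2)} |      # 20% of physics
             {"chem%d" % i for i in range(5)})       # 50% of chemistry

precision = len(relevant & retrieved) / len(retrieved)   # 5/12 = 0.42
recall = len(relevant & retrieved) / len(relevant)       # 5/10 = 0.5
print(precision, recall)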
F score The F score takes the precision and recall measures a step further and considers the general accuracy of the model. In formal terms, the F score is a weighted average of the precision and recall:
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \quad (7.1)$$
In terms of type I and type II errors:
$$F_\beta = \frac{(1+\beta^2) \cdot \text{true positive}}{(1+\beta^2) \cdot \text{true positive} + \beta^2 \cdot \text{false negative} + \text{false positive}},$$
where $\beta$ is the balance between precision and recall. Thus, $F_2$ puts more emphasis on the recall measure and $F_{0.5}$ puts more emphasis on precision.
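A small sketch (not from the book) computes these measures directly from precision and recall; the values reused below come from the worked example above.
def f_beta(precision, recall, beta=1.0):
    # Weighted harmonic mean of precision and recall
    if precision == 0 and recall == 0:
        return 0.0
    return ((1 + beta**2) * precision * recall /
            (beta**2 * precision + recall))

print(f_beta(5/12, 0.5))             # F1 balances precision and recall
print(f_beta(5/12, 0.5, beta=2))     # F2 emphasizes recall
print(f_beta(5/12, 0.5, beta=0.5))   # F0.5 emphasizes precision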
7.5 Text analysis tools
We are fortunate to have access to a set of powerful open source
text analysis tools. We describe three here.
The Natural Language Toolkit The NLTK is a commonly used nat-
ural language toolkit that provides a large number of relevant so-
lutions for text analysis. It is Python-based and can be easily in-
tegrated into data processing and analytical scripts by a simple
import nltk (or similar for any one of its submodules).
The NLTK includes a set of tokenizers, stemmers, lemmatiz-
ers and other natural language processing tools typically applied
in text analysis and machine learning. For example, a user can
extract tokens from a document doc by running the command
tokens = nltk.word_tokenize(doc).
Useful text corpora are also present in the NLTK distribution.
For example, the stop words list can be retrieved by running the
command stops=nltk.corpus.stopwords.words(language). These
stop words are available for several languages within NLTK, includ-
ing English, French, and Spanish.
Similarly, the Brown Corpus or WordNet can be loaded by running from nltk.corpus import brown or from nltk.corpus import wordnet. After the corpora are loaded, their various properties can be explored and used in text analysis; for example, dogsyn = wordnet.synsets('dog') will return a list of WordNet synsets related to the word “dog.”
Term frequency distribution and n-gram indexing are other techniques implemented in NLTK. For example, a user can compute the frequency distribution of individual terms within a document doc by running fdist = nltk.FreqDist(tokens), where tokens is the list of tokens extracted above. This command returns a dictionary-like object mapping each token to its frequency within doc.
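The commands above can be combined into a short, hedged example (not from the book); it assumes the relevant NLTK corpora and tokenizer models have been downloaded, and the toy document is invented.
import nltk
from nltk.corpus import stopwords

doc = "Food safety and food security are central concerns of this award."
tokens = nltk.word_tokenize(doc.lower())
stops = set(stopwords.words("english"))
content_tokens = [t for t in tokens if t.isalpha() and t not in stops]
fdist = nltk.FreqDist(content_tokens)
print(fdist.most_common(3))   # e.g., [('food', 2), ('safety', 1), ...]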
N-gram indexing is implemented as a chain-linked collocations
algorithm that takes into account the probability of any given two,
three, or more words appearing together in the entire corpus. In
general, n-grams can be discovered as easily as running bigrams =
nltk.bigrams(text). However, a more sophisticated approach is
needed to discover statistically significant word collocations, as we
show in Listing 7.4.
Bird et al. [39] provide a detailed description of NLTK tools and
techniques. See also the official NLTK website [284].
Stanford CoreNLP While NLTK’s emphasis is on simple reference
implementations, Stanford’s CoreNLP [249, 354] is focused on fast
implementations of cutting-edge algorithms, particularly for syntac-
tic analysis (e.g., determining the subject of a sentence).
import collections
import re
import nltk

def bigram_finder(texts):
    # NLTK bigrams from a corpus of documents separated by new lines
    tokens_list = nltk.word_tokenize(re.sub("\n", " ", texts))
    bgm = nltk.collocations.BigramAssocMeasures()
    finder = nltk.collocations.BigramCollocationFinder.from_words(
        tokens_list)
    scored = finder.score_ngrams(bgm.likelihood_ratio)
    # Group bigrams by first word in bigram.
    prefix_keys = collections.defaultdict(list)
    for key, score in scored:
        prefix_keys[key[0]].append((key[1], score))
    # Sort keyed bigrams by strongest association.
    for key in prefix_keys:
        prefix_keys[key].sort(key=lambda x: -x[1])
    return prefix_keys
Listing 7.4. Python code to find bigrams using NLTK
MALLET For probabilistic models of text, MALLET, the MAchine
Learning for LanguagE Toolkit [255], often strikes the right balance
between usefulness and usability. It is written to be fast and effi-
cient but with enough documentation and easy enough interfaces to
be used by novices. It offers fast, popular implementations of condi-
tional random fields (for part-of-speech tagging), text classification,
and topic modeling.
7.6 Summary
Much “big data” of interest to social scientists is text: tweets, Face-
book posts, corporate emails, and the news of the day. However,
the meaning of these documents is buried beneath the ambigui-
ties and noisiness of the informal, inconsistent ways by which hu-
mans communicate with each other. Despite attempts to formalize
the meaning of text data through asking users to tag people, apply
metadata, or to create structured representations, these attempts
to manually curate meaning are often incomplete, inconsistent, or
both.
These aspects make text data difficult to work with, but also a
rewarding object of study. Unlocking the meaning of a piece of text
helps bring machines closer to human-level intelligence—as lan-
guage is one of the most quintessentially human activities—and
helps overloaded information professionals do their jobs more ef-
fectively: understand large corpora, find the right documents, or
automate repetitive tasks. And as an added bonus, the better com-
puters become at understanding natural language, the easier it is
for information professionals to communicate their needs: one day
using computers to grapple with big data may be as natural as
sitting down to a conversation over coffee with a knowledgeable,
trusted friend.
7.7 Resources
Text analysis is one of the more complex tasks in big data analysis.
Because it is unstructured, text (and natural language overall) re-
quires significant processing and cleaning before we can engage
in interesting analysis and learning. In this chapter we have refer-
enced several resources that can be helpful in mastering text mining
techniques:
The Natural Language Toolkit is one of the most popular Python-
based tools for natural language processing. It has a variety of
methods and examples that are easily accessible online [284].
The book by Bird et al. [39], available online, contains multiple
examples and tips on how to use NLTK.
The book Pattern Recognition and Machine Learning by Christo-
pher Bishop [40] is a useful introduction to computational
techniques, including probabilistic methods, text analysis, and
machine learning. It has a number of tips and examples that
are helpful to both learning and experienced researchers.
A paper by Anna Huang [171] provides a brief overview of
the key similarity measures for text document clustering dis-
cussed in this chapter, including their strengths and weak-
nesses in different contexts.
Materials at the MALLET website [255] may be too specialized for the unprepared reader but are helpful when looking for specific solutions with topic modeling and machine classification using this toolkit.
David Blei, one of the authors of the latent Dirichlet allocation
algorithm (topic modeling), maintains a helpful web page with
introductory resources for those interested in topic modeling
[41].
We provide an example of how to run topic modeling using
MALLET on textual data from the National Science Foundation
and Norwegian Research Council award abstracts [49].
Weka, developed at the University of Waikato in New Zealand,
is a useful resource for running both complex text analysis
and other machine learning tasks and evaluations [150, 384].
Chapter 8
Networks: The Basics
Jason Owen-Smith
Social scientists are typically interested in describing the activities
of individuals and organizations (such as households and firms) in
a variety of economic and social contexts. The frame within which
data has been collected will typically have been generated from tax
or other programmatic sources. The new types of data permit new
units of analysis—particularly network analysis—largely enabled by
advances in mathematical graph theory. This chapter provides an
overview of how social scientists can use network theory to generate
measurable representations of patterns of relationships connecting
entities. As the author points out, the value of the new framework is
not only in constructing different right-hand-side variables but also
in studying an entirely new unit of analysis that lies somewhere
between the largely atomistic actors that occupy the markets of
neo-classical theory and the tightly managed hierarchies that are
the traditional object of inquiry of sociologists and organizational
theorists.
8.1 Introduction
This chapter provides a basic introduction to the analysis of large
networks. The following introduces the basic logic of network analysis, then turns to a summary of data structures and essential measures, before presenting a primer on network visualization and a
more elaborated descriptive case comparison of the collaboration
networks of two research-intensive universities. Both those collab-
oration networks and a grant co-employment network for a large
public university also examined in this chapter are derived from
data produced by the multi-university Committee on Institutional
Cooperation (CIC)’s UMETRICS project [228]. The snippets of code
that are provided are from the igraph package for network analysis
as implemented in Python.
At their most basic, networks are measurable representations of
patterns of relationships connecting entities in an abstract or actual
space. What this means is that there are two fundamental ques-
tions to ask of any network presentation or measure, First, what are
the nodes? Second, what are the ties? While the network methods
sketched in this chapter are equally applicable to technical or bio-
logical networks (e.g., the hub-and-spoke structure of the worldwide
air travel system, or the neuronal network of a nematode worm), I
focus primarily on social networks; patterns of relationships among
people or organizations that are created by and come to influence
individual action. This chapter draws most of its examples from
the world of science, technology, and innovation. Thus, this chap-
ter focuses particularly on networks developed and maintained by
the collaborations of scientists and by the contractual relationships
organizations form in pursuit of innovation.
This substantive area is of great interest because a great deal of
research in sociology, management, and related fields demonstrates
that networks of just these sorts are essential to understanding the
process of innovation and outcomes at both the individual and the
organizational level.
In other words, networks offer not just another convenient set
of right-hand-side variables, but an entirely new unit of analysis
that lies somewhere between the largely atomistic actors that oc-
cupy the markets of neo-classical theory and the tightly managed
hierarchies that are the traditional object of inquiry of sociologists
and organizational theorists. As Walter W. Powell [308] puts it in a
description of buyer supplier networks of small Italian firms: “when
the entangling of obligation and reputation reaches a point where
the actions of the parties are interdependent, but there is no com-
mon ownership or legal framework . . . such a transaction is neither
a market exchange nor a hierarchical governance structure, but a
separate, different mode of exchange.”
Existing as they do between the uncoordinated actions of inde-
pendent individuals and coordinated work of organizations, net-
works offer a unique level of analysis for the study of scientific
and creative teams [410], collaborations [202], and clusters of en-
trepreneurial firms [294]. The following sections introduce you
to this approach to studying innovation and discovery, focusing
on examples drawn from high-technology industries and particu-
larly from the scientific collaborations among grant-employed re-
searchers at UMETRICS universities. I make particular use of a
network that connects individual researchers to grants that paid
their salaries in 2012 for a large public university. The grants net-
work for university A includes information on 9,206 individuals who
were employed on 3,389 research grants from federal science agen-
cies in a single year.
Before turning to those more specific substantive topics, the
chapter first introduces the most common structures for large net-
work data, briefly introduces three key social “mechanisms of action” by which social networks are thought to have their effects, and then presents a series of basic measures that can be used to quantify
characteristics of entire networks and the relative position individ-
ual people or organizations hold in the differentiated social structure
created by networks.
Taken together, these measures offer an excellent starting point
for examining how global network structures create opportunities
and challenges for the people in them, for comparing and explaining
the productivity of collaborations and teams, and for making sense
of the differences between organizations, industries, and markets
that hinge on the pattern of relationships connecting their partici-
pants.
But what is a network? At its simplest, a network is a pattern
of concrete, measurable relationships connecting entities engaged
in some common activity. While this chapter focuses my attention
on social networks, you could easily use the techniques described
here to examine the structure of networks such as the World Wide
Web, the national railway route map of the USA, the food web of
an ecosystem, or the neuronal network of a particular species of
animal. Networks can be found everywhere, but the primary exam-
ple used here is a network connecting individual scientists through
their shared work on particular federal grants. The web of partner-
ships that emerges from university scientists’ decentralized efforts
to build effective collaborations and teams generates a distinctive
social infrastructure for cutting-edge science.
Understanding the productivity and effects of university research
thus requires an effort to measure and characterize the networks on
which it depends. As suggested, those networks influence outcomes
in three ways: first, they distinguish among individuals; second,
they differentiate among teams; and third, they help to distinguish
among research-performing universities. Most research-intensive
institutions have departments and programs that cover similar ar-
rays of topics and areas of study. What distinguishes them from
one another is not the topics they cover but the ways in which their
distinctive collaboration networks lead them to have quite different
scientific capabilities.
8.2 Network data
Networks are comprised of nodes, which represent things that can
be connected to one another, and of ties that represent the relation-
ships connecting nodes. When ties are undirected they are called
edges. When they are directed (as when I lend money to you and
you do or do not reciprocate) they are called arcs. Nodes, edges
and arcs can, in principle, be anything: patents and citations, web
pages and hypertext links, scientists and collaborations, teenagers
and intimate relationships, nations and international trade agree-
ments. The very flexibility of network approaches means that the
first step toward doing a network analysis is to clearly define what
counts as a node and what counts as a tie.
While this seems like an easy move, it often requires deep thought.
For instance, an interest in innovation and discovery could take
several forms. We could be interested in how universities differ in
their capacity to respond to new requests for proposals (a macro
question that would require the comparison of full networks across
campuses). We could wonder what sorts of training arrangements
lead to the best outcomes for graduate students (a more micro-level
question that requires us to identify individual positions in larger
networks). Or we could ask what team structure is likely to lead
to more or less radical discoveries (a decidedly meso-level question
that requires we identify substructures and measure their features).
Each of these is a network question that relies on the ways in
which people are connected to one another. The first challenge of
measurement is to identify the nodes (what is being connected) and
the ties (the relationships that matter) in order to construct the
relevant networks. The next is to collect and structure the data in
a fashion that is sufficient for analysis. Finally, measurement and
visualization decisions must be made.
8.2.1 Forms of network data
Network ties can be directed (flowing from one node to another)
or undirected. In either case they can be binary (indicating the
presence or absence of a tie) or valued (allowing for relationships
of different types or strengths). Network data can be represented
in matrices or as lists of edges and arcs. All these types of rela-
tionships can connect one type of node (what is commonly called
one-mode network data) or multiple types of nodes (what is called
two-mode or affiliation data). Varied data structures correspond to
different classes of network data.
Figure 8.1. Undirected, binary, one-mode network data. Ties among a group of actors (for example, firms connected by the presence of any strategic alliance) can be represented as a symmetric square matrix or as an edge list.
The simplest form of network data
represents instances where the same kinds of nodes are connected
by undirected ties (edges) that are binary. An example of this type
of data is a network where nodes are firms and ties indicate the
presence of a strategic alliance connecting them [309]. This net-
work would be represented as a square symmetric matrix or a list
of edges connecting nodes. Figure 8.1 summarizes this simple data
structure, highlighting the idea that network data of this form can
be represented either as a matrix or as an edge list.
A much more complicated network would be one that is both
directed and valued. One example might be a network of nations
connected by flows of international trade. Goods and services flow
from one nation to another and the value of those goods and ser-
vices (or their volume) represents ties of different strengths. When
networks connecting one class of nodes (in this case nations) are
directed and valued, they can be represented as asymmetric valued
matrices or lists of arcs with associated values. (See Figure 8.2 for
an example.)
Figure 8.2. Directed, valued, one-mode network data. Ties among a group of actors (for example, nodes are faculty and ties represent the number of payments from one principal investigator’s grant to another’s) can be represented as an asymmetric square matrix or as an arc list.
While many studies of small- to medium-sized social networks rely on one-mode data, large-scale social network data of this type are relatively rare. One-mode data of this sort are fairly common, however, in relationships among other types of nodes, such as hyperlinks
connecting web pages or citations connecting patents or publica-
tions. Nevertheless, much “big” social network analysis is con-
ducted using two-mode data. The UMETRICS employee data set
is a two-mode network that connects people (research employees)
to the grants that pay their wages. These two types of nodes can
be represented as a rectangular matrix that is either valued or bi-
nary. It is relatively rare to analyze untransformed two-mode net-
work data. Instead, most analyses take advantage of the fact that
such networks are dual [398]. In other words, a two-mode network
connecting grants and people can be conceptualized (and analyzed)
as two one-mode networks, or projections.*
Key insight: A two-mode
network can be conceptual-
ized and analyzed as two
one-mode networks, or pro-
jections.
8.2.2 Inducing one-mode networks from two-mode data
The most important trick in large-scale social network analysis is
that of inducing one-mode, or unipartite, networks (e.g., employee
×employee relationships) from two-mode, or bipartite, data. But
the ubiquity and potential value of two-mode data can come at a
cost. Not all affiliations are equally likely to represent real, mean-
ingful relationships. While it seems plausible to assume that two
individuals paid by the same grant have interactions that reason-
ably pertain to the work funded by the grant, this need not be the
case.
For example, consider the two-mode grant ×person network for
university A. I used SQL to create a representation of this network
that is readable by a freeware network visualization program called
Pajek [30]. In this format, a network is represented as two lists:
a vertex list that lists the nodes in the graph and an edge list that
lists the connections between those nodes. In our grant ×person
network, we have two types of nodes, people and grants, and one
kind of edge, used to represent wage payments from grants to indi-
viduals.
I present a brief snippet of the resulting network file in what fol-
lows, showing first the initial 10 elements of the vertex list and then
the initial 10 elements of the edge list, presented in two columns for
compactness. (The complete file comprises information on 9,206
employees and 3,389 grants, for a total of 12,595 vertices and
15,255 edges. The employees come first in the vertex list, and so
the 10 rows shown below all represent employees.) Each vertex
is represented by a vertex number–label pair and each edge by a
pair of vertices plus an optional value. Thus, the first entry in the
edge list (1 10419) specifies that the vertex with identifier 1 (which
happens to be the first element of the vertex list, which has value
“00100679”) is connected to the vertex with identifier 10419 by an
edge with value 1, indicating that employee “00100679” is paid by
the grant described by vertex 10419.
*Grant-Person-Network
*Vertices 12595 9206 *Edges
1 "00100679" 1 10419
2 "00107462" 2 10422
3 "00109569" 3 9855
4 "00145355" 3 9873
5 "00153190" 4 9891
6 "00163131" 7 10432
7 "00170348" 7 12226
8 "00172339" 8 10419
9 "00176582" 9 11574
10 "00203529" 10 11196
The network excerpted above is two-mode because it represents
relationships between two different classes of nodes: grants and people. In order to use data of this form to address questions about
patterns of collaboration on UMETRICS campuses, we must first
transform it to represent collaborative relationships.
Figure 8.3. Two-mode affiliation data. Ties linking nodes of two different types (for example, researchers and grants) are represented by a rectangular matrix X of size g × h, with transpose X′ of size h × g. XX′ yields a g × g symmetrical matrix that represents co-employment ties among individuals, while X′X yields an h × h symmetrical matrix that represents grants connected by people.
A person-by-person projection of the original two-mode network
assumes that ties exist between people when they are paid by the
same grant. By the same token, a grant-by-grant projection of the
original two-mode network assumes that ties exist between grants
when they pay the same people. Transforming two-mode data into
one-mode projections is a fairly simple matter. If X is a rectangular matrix, p × g, then a one-mode projection, p × p, can be obtained by multiplying X by its transpose X′. Figure 8.3 summarizes this transformation.
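As a small numerical illustration of this transformation (not part of the chapter's code), the hypothetical incidence matrix X below records which of three people are paid by which of two grants; multiplying by the transpose yields the two projections.
import numpy as np

X = np.array([[1, 0],
              [1, 1],
              [0, 1]])      # rows: people, columns: grants
print(X @ X.T)               # 3 x 3 person-by-person co-employment counts
print(X.T @ X)               # 2 x 2 grant-by-grant shared-personnel counts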
In the following snippet of code, I use the igraph package in
Python to read in a Pajek file and then transform the original two-
mode network into two separate projections. Because my focus
in this discussion is on relationships among people, I then move
on to work exclusively with the employee-by-employee projection.
However, every technique that I describe below can also be used
with the grant-by-grant projection, which provides a different view
of how federally funded research is put together by collaborative
relationships on campus.
from igraph import *
# Read the graph
g = Graph.Read_Pajek("public_a_2m.net")
# Look at result
summary(g)
# IGRAPH U-WT 12595 15252 --
# + attr: color (v), id (v), shape (v), type (v), x (v), y (v), z
(v), weight (e)
# ...
# ...
# Transform to get 1M projection
pr_g_proj1, pr_g_proj2= g.bipartite_projection()
# Look at results
summary(pr_g_proj1)
# IGRAPH U-WT 9206 65040 --
# + attr: color (v), id (v), shape (v), type (v), x (v), y (v), z
(v), weight (e)
summary(pr_g_proj2)
# IGRAPH U-WT 3389 12510 --
# + attr: color (v), id (v), shape (v), type (v), x (v), y (v), z
(v), weight (e)
# pr_g_proj1 is the employeeXemployee projection, n=9,206 nodes
# Rename to emp for use in future calculations
emp=pr_g_proj1
We now can work with the graph emp, which represents the col-
laborative network of federally funded research on this campus.
Care must be taken when inducing one-mode network projections
from two-mode network data because not all affiliations provide
equally compelling evidence of actual social relationships. While
assuming that people who are paid by the same research grants
are collaborating on the same project seems plausible, it might be
less realistic to assume that all students who take the same uni-
versity classes have meaningful relationships. For the remainder of
this chapter, the examples I discuss are based on UMETRICS em-
ployee data rendered as a one-mode person-by-person projection of
the original two-mode person-by-grants data. In constructing these
networks I assume that a tie exists between two university research
employees when they are paid any wages from the same grant dur-
ing the same year. Other time frames or thresholds might be used
to define ties if appropriate for particular analyses.*
Key insight: Care must
be taken when inducing
one-mode network projec-
tions from two-mode net-
work data because not all
affiliations provide equally
compelling evidence of ac-
tual social relationships.
8.3 Network measures
The power of networks lies in their unique flexibility and ability to
address many phenomena at multiple levels of analysis. But har-
nessing that power requires the application of measures that take
into account the overall structure of relationships represented in a
given network. The key insight of structural analysis is that out-
comes for any individual or group are a function of the complete
pattern of connections among them. In other words, the explana-
tory power of networks is driven as much by the pathways that
indirectly connect nodes as by the particular relationships that di-
rectly link members of a given dyad. Indirect ties create reachability
in a network.*
Key insight: Structural
analysis of outcomes for
any individual or group are
a function of the com-
plete pattern of connections
among them.
8.3.1 Reachability
Two nodes are said to be reachable when they are connected by
an unbroken chain of relationships through other nodes. For in-
stance, two people who have never met may nonetheless be able
to reach each other through a common acquaintance who is posi-
tioned to broker an introduction [286] or the transfer of information
and resources [60]. It is the reachability that networks create that
makes them so important for understanding the work of science
and innovation.
Consider Figure 8.4, which presents three schematic networks.
In each, one focal node, ego, is colored orange. Each ego has four
alters, but the fact that each has connections to four other nodes
masks important differences in their structural positions. Those dif-
ferences have to do with the number of other nodes they can reach
through the network and the extent to which the other nodes in the
network are connected to each other. The orange node (ego) in each
network has four partners, but their positions are far from equiv-
alent. Centrality measures on full network data can tease out the
differences. The networks also vary in their gross characteristics.
Those differences, too, are measurable.*
Key insight: Much of
the power of networks (and
their systemic features) is
due to indirect ties that cre-
ate reachability. Two nodes
can reach each other if they
are connected by an unbro-
ken chain of relationships.
These are often called indi-
rect ties.
Networks in which more of the possible connections among nodes
are realized are denser and more cohesive than networks in which
fewer potential connections are realized. Consider the two smaller
networks in Figure 8.4, each of which is comprised of five nodes.
Just five ties connect those nodes in the network on the far right
of the figure. One smaller subset of that network, the triangle con-
necting ego and two alters at the center of the image, represents
Figure 8.4. Reachability and indirect ties
a more cohesively connected subset of the networks. In contrast,
eight of the nine ties that are possible connect the five nodes in
the middle figure; no subset of those nodes is clearly more inter-
connected than any other. While these kinds of differences may
seem trivial, they have implications for the orange nodes, and for
the functioning of the networks as a whole. Structural differences
between the positions of nodes, the presence and characteristics
of cohesive “communities” within larger networks [133], and many
important properties of entire structures can be quantified using
different classes of network measures. Newman [272] provides the
most recent and most comprehensive look at measures and algo-
rithms for network research.
The most essential thing to be able to understand about larger
scale networks is the pattern of indirect connections among nodes.
What is most important about the structure of networks is not nec-
essarily the ties that link particular pairs of nodes to one another.
Instead, it is the chains of indirect connections that make networks
function as a system and thus make them worthwhile as new levels
of analysis for understanding social and other dynamics.
8.3.2 Whole-network measures
The basic terms needed to characterize whole networks are fairly
simple. It is useful to know the size (in terms of nodes and ties) of
each network you study. This is true both for the purposes of being
able to generally gauge the size and connectivity of an entire network
and because many of the measures that one might calculate using
such networks should be standardized for analytic use. While the
list of possible network measures is long, a few commonly used
indices offer useful insights into the structure and implications of
entire network structures.
Components and reachability As we have seen, a key feature of
networks is reachability. The reachability of participants in a net-
work is determined by their membership in what network theorists
call components, subsets of larger networks where every member
of a group is indirectly connected to every other. If you imagine a
standard node and line drawing of a network, a component is a por-
tion of the network where you can trace paths between every pair
of nodes without ever having to lift your pen.
Most large networks have a single dominant component that typ-
ically includes anywhere from 50% to 90% of its participants as well
as many smaller components and isolated nodes that are discon-
nected from the larger portion of the network. Because the path
length centrality measures described below can only be computed
on connected subsets of networks, it is typical to analyze the largest
component of any given network. Thus any description of a network
or any effort to compare networks should report the number of com-
ponents and the percentage of nodes reachable through the largest
component. In the code snippet below, I identify the weakly con-
nected components of the employee network, emp.
# Add component membership
emp.vs["membership"] = emp.clusters(mode="weak").membership
# Add component size
emp.vs["csize"] = [emp.clusters(mode="weak").sizes()[i] for i in
    emp.clusters(mode="weak").membership]
# Identify the main component: get the size of the largest cluster
maxSize = max(emp.clusters(mode="weak").sizes())
# Flag the nodes that belong to the largest component
emp.vs["largestcomp"] = [1 if maxSize == x else 0 for x in
    emp.vs["csize"]]
The main component of a network is commonly analyzed and
visualized because the graph-theoretic distance among unconnected
nodes is infinite, which renders calculation of many common net-
work measures impossible without strong assumptions about just
how far apart unconnected nodes actually are. While some re-
searchers replace infinite path lengths with a value that is one plus
the longest path, called the network’s diameter, observed in a given
structure, it is also common to simply analyze the largest connected
component of the network.
Path length One of the most robust and reliable descriptive statis-
tics about an entire network is the average path length, lG, among
nodes. Networks with shorter average path lengths have structures
that may make it easier for information or resources to flow among
members in the network. Longer path lengths, by comparison, are
associated with greater difficulty in the diffusion and transmission
of information or resources. Let $g$ be the number of nodes or vertices in a network. Then
$$l_G = \frac{1}{g(g-1)} \sum_{i,j} d(n_i, n_j).$$
As with other measures based on reachability, it is most common to
report the average path length for the largest connected component
of the network because the graph-theoretic distance between two
unconnected nodes is infinite. In an electronic network such as the
World Wide Web, a shorter path length means that any two pages
can be reached through fewer hyperlink clicks.
The snippet of code below identifies the distribution of shortest
path lengths among all pairs of nodes in a network and the aver-
age path length. I also include a line of code that calculates the
network distance among all nodes and returns a matrix of those
distances. That matrix (saved as empdist) can be used to calculate
additional measures or to visualize the graph-theoretic proximities
among nodes.
# Calculate distances and construct distance table
dfreq=emp.path_length_hist(directed=False)
print(dfreq)
# N = 12506433, mean +- sd: 5.0302 +- 1.7830
# Each *represents 51657 items
# [ 1, 2): *(65040)
# [ 2, 3): ********* (487402)
# [ 3, 4): *********************************** (1831349)
# [ 4, 5): ******************************************************
**** (2996157)
# [ 5, 6): ****************************************************
(2733204)
# [ 6, 7): ************************************** (1984295)
# [ 7, 8): ************************ (1267465)
# [ 8, 9): ************ (649638)
# [ 9, 10): ***** (286475)
# [10, 11): ** (125695)
# [11, 12): *(52702)
# [12, 13): (18821)
# [13, 14): (5944)
# [14, 15): (1682)
# [15, 16): (403)
# [16, 17): (128)
# [17, 18): (28)
# [18, 19): (5)
print(dfreq.unconnected)
# 29864182
print(emp.average_path_length(directed=False))
#[1] 5.030207
empdist= emp.shortest_paths()
These measures provide a few key insights into the employee
network we have been considering. First, the average pair of nodes that are connected by indirect paths is slightly more than five steps from one another. Second, however, many node pairs in this network (unconnected = 29,864,182) are unconnected and thus unreachable to each other. Figure 8.5 presents a histogram of the distribution of path lengths in the network. It represents the numeric values returned by the path_length_hist command in the code snippet above. In this case the diameter of the network is 18 and five
pairs of nodes are reachable at this distance, but the largest group
of dyads is reachable (N=2,996,157 dyads) at distance 4. In short,
nearly 3 million pairs of nodes are collaborators of collaborators of
collaborators of collaborators.
Degree distribution Another powerful way to describe and com-
pare networks is to look at the distribution of centralities across
nodes. While any of the centrality measures described above could
be summarized in terms of their distribution, it is most common to
plot the degree distribution of large networks. Degree distributions
commonly have extremely long tails. The implication of this pattern
is that most nodes have a small number of ties (typically one or two)
and that a small percentage of nodes account for the lion’s share of
a network’s connectivity and reachability. Degree distributions are
typically so skewed that it is common practice to plot degree against
the percentage of nodes with that degree score on a log–log scale.
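A hedged sketch (not part of the chapter's code) of such a plot for the employee projection emp, using matplotlib, might look as follows; nodes with degree zero are dropped because they cannot be shown on a log scale.
import collections
import matplotlib.pyplot as plt

degrees = emp.degree()
counts = collections.Counter(degrees)
deg, freq = zip(*sorted((d, c) for d, c in counts.items() if d > 0))
share = [f / float(len(degrees)) for f in freq]

plt.loglog(deg, share, marker="o", linestyle="none")
plt.xlabel("Degree")
plt.ylabel("Share of nodes")
plt.show()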
High-degree nodes are often particularly important actors. In the
UMETRICS networks that are employee ×employee projections of
employee ×grant networks, for instance, the nodes with the high-
est degree seem likely to include high-profile faculty—the investigators
Figure 8.5. Histogram of path lengths for university A employee network (number of dyads at each path length).
on larger institutional grants such as National Institutes of
Health-funded Clinical and Translational Science Awards and Na-
tional Science Foundation-funded Science and Technology Centers,
and perhaps staff whose particular skills are in demand (and paid
for) by multiple research teams. For instance, the head technician
in a core microscopy facility or a laboratory manager who serves
multiple groups might appear highly central in the degree distribu-
tion of a UMETRICS network.
Most importantly, the degree distribution is commonly taken to
provide insight into the dynamics by which a network was created.
Highly skewed degree distributions often represent scale-free net-
works [25, 271, 309], which grow in part through a process called
preferential attachment, where new nodes entering the network are
more likely to attach to already prominent participants. In the kinds
of scientific collaboration networks that UMETRICS represents, a
scale-free degree distribution might come about as faculty new to an
institution attempt to enroll more established colleagues on grants
as coinvestigators. In the comparison exercise outlined below, I
plot degree distributions for the main components of two different
university networks.
Clustering coefficient The third commonly used whole-network mea-
sure captures the extent to which a network is cohesive, with many
nodes interconnected. In networks that are more cohesively clus-
tered, there are fewer opportunities for individuals to play the kinds
of brokering roles that we will discuss below in the context of be-
tweenness centrality. Less cohesive networks, with lower levels
of clustering, are potentially more conducive to brokerage and the
kinds of innovation that accompany it.
However, innovation and discovery involve both the moment of
invention, the “aha!” of a good new idea, and the often complicated,
uncertain, and collaborative work that is required to turn an initial
insight into a scientific finding. While less clustered, more open
networks are more likely to create opportunities for brokers to
develop fresh ideas, more cohesive and clustered networks support
the kinds of repeated interactions, trust, and integration that are
necessary to do uncertain and difficult collaborative work.
While it is possible to generate a global measure of cohesiveness
in networks, which is generically the number of closed triangles
(groups of three nodes all connected to one another) as a proportion
of the number of possible triads, it is more common to take a local
measure of connectivity and average it across all nodes in a net-
work. This local connectivity measure more closely approximates
the notion of cohesion around nodes that is at the heart of stud-
ies of networks as means to coordinate difficult, risky work. The
code snippet below calculates both the global clustering coefficient
and a vector of node-specific clustering coefficients whose average
represents the local measure for the employee × employee network
projection of the university A UMETRICS data.
# Calculate clustering coefficients
# Global clustering coefficient (transitivity)
print(emp.transitivity_undirected())
# 0.7241
# Local clustering coefficient for each node
local_clust = emp.transitivity_local_undirected(mode="zero")
# (mode="zero" sets the local clustering of nodes with fewer than two
# neighbors to zero rather than leaving it undefined)
import pandas as pd
print(pd.Series(local_clust).describe())
# count 9206.000000
# mean 0.625161
# std 0.429687
# min 0.000000
# 25% 0.000000
# 50% 0.857143
# 75% 1.000000
# max 1.000000
#--------------------------------------------------#
Together, these summary statistics—number of nodes, average
path length, distribution of path lengths, degree distribution, and
the clustering coefficient—offer a robust set of measures to examine
and compare whole networks. It is also possible to distinguish
among the positions nodes hold in a particular network. Some
of the most powerful centrality measures also rely on the idea of
indirect ties.
Centrality measures This class of measures is the most common
way to distinguish between the positions individual nodes hold in
networks. There are many different measures of centrality that cap-
ture different aspects of network positions, but they fall into three
general types. The most basic and intuitive measure of central-
ity, degree centrality, simply counts the number of ties that a node
has. In a binary undirected network, this measure resolves into
the number of unique alters each node is connected to. In mathe-
matical terms it is the row or column sum of the adjacency matrix
that characterizes a network. Degree centrality, $C_D(n_i)$, represents
a clear measure of the prominence or visibility of a node. Let
$$C_D(n_i) = \sum_j x_{ij}.$$
The degree of a node is limited by the size of the network in which it
is embedded. In a network of $g$ nodes the maximum degree of any
node is $g - 1$. The two orange nodes in the small networks presented
in Figure 8.4 have the maximum degree possible (4). In contrast,
the orange node in the larger, 13-node network in that figure has the
same number of alters but the possible number of partners is three
times as large (12). For this reason it is problematic to compare raw
degree centrality measures across networks of different sizes. Thus,
it is common to normalize degree by the maximum value defined by
$g - 1$:
$$C'_D(n_i) = \frac{\sum_j x_{ij}}{g - 1}.$$
While the normalized degree centrality of the two orange nodes
of the smaller networks in Figure 8.4 is 1.0, the normalized value
for the node in the large network of 13 nodes is 0.33. Despite the
fact that the highlighted nodes in the two smaller networks have the
same degree centrality, the pattern of indirect ties connecting their
alters means they occupy meaningfully different positions. There
are a number of degree-based centrality measures that take more
of the structural information from a complete network into account
by using a variety of methods to account not just for the number of
partners a particular ego might have but also for the prominence of
those partners. Two well-known examples are eigenvector centrality
and PageRank (see [272, Ch. 7.2 and 8.4]).
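Both of these prominence-weighted measures are available directly in igraph; a minimal sketch, using the emp graph from the earlier snippets:

# A sketch of computing prominence-weighted centralities with igraph,
# using the emp graph from the earlier snippets.
emp.vs["eigen"] = emp.eigenvector_centrality()
emp.vs["pagerank"] = emp.pagerank()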
Consider two additional measures that capture aspects of cen-
trality that have more to do with the indirect ties that increase
reachability. Both make explicit use of the idea that reachability
is the source of many of the important social and economic benefits
of salutary network positions, but they do so with different sub-
stantive emphases. Both of these approaches rely on the idea of
a network geodesic, the shortest path connecting any pair of
actors. (A path is a sequence of ties that repeats no nodes or ties;
most pairs of nodes are connected by several paths, and the geodesic
is the shortest of them, so the geodesic distance between two people
who are directly connected is one even if they are also indirectly
connected through a shared tie to a third person.) Because these
measures rely on reachability, they are only useful when applied to
connected components. Nodes with no ties (degree 0) are called
isolates; their geodesic distances to other nodes are infinite, and
thus path-based centrality measures cannot be calculated for them.
This is a shortcoming of these measures, which can only be used
on connected subsets of graphs where each node has at least one
tie to another and all nodes are at least indirectly connected.
Closeness centrality, $C_C$, is based on the idea that networks po-
sition some individuals closer to or farther away from other par-
ticipants. The primary idea is that shorter network paths between
actors increase the likelihood of communication and with it the abil-
ity to coordinate complicated activities. Let $d(n_i, n_j)$ represent the
number of network steps in the geodesic path connecting two nodes
$i$ and $j$. As $d$ increases, the network distance between a pair of nodes
grows. Thus a standard measure of closeness is the inverse of the
sum of distances between any given node and all the others that are
reachable in a network:
$$C_C(n_i) = \frac{1}{\sum_{j=1}^{g} d(n_i, n_j)}.$$
The maximum of closeness centrality occurs when a node is di-
rectly connected to every possible partner in the network. As with
degree centrality, closeness depends on the number of nodes in a
network. Thus, it is necessary to standardize the measure to allow
comparisons across multiple networks:
$$C'_C(n_i) = \frac{g - 1}{\sum_{j=1}^{g} d(n_i, n_j)}.$$
Like closeness centrality, betweenness centrality, $C_B$, relies on
the concept of geodesic paths to capture nuanced differences between
the positions of nodes in a connected network. Where closeness as-
sumes that communication and the flow of information increase
with proximity, betweenness captures the idea of brokerage that
was made famous by Burt [59]. Here too the idea is that flows of
information and resources between nodes that are not directly
connected must pass along indirect paths. The key to the idea of
brokerage is that such paths pass through nodes that can interdict
those flows, or otherwise profit from their position “in between”
unconnected alters. This
idea has been particularly important in network studies of innova-
tion [60,293], where flows of information through strategic alliances
among firms or social networks connecting individuals loom large
in explanations of why some organizations or individuals are better
able to develop creative ideas than others.
To calculate betweenness as originally specified, two strong as-
sumptions are required [129]. First, one must assume that when
people (or organizations) search for new information through their
networks, they are capable of identifying the shortest path to what
they seek. When multiple paths of equal length exist, we assume
that each path is equally likely to be used. Newman [271] de-
scribes an alternative betweenness measure based on random paths
through a network, rather than shortest paths, that relaxes these
assumptions. For now, let $g_{jk}$ equal the number of geodesic paths
linking any two actors $j$ and $k$. Then $1/g_{jk}$ is the probability that any
given path will be followed on a particular node's search for information or
resources in a network. In order to calculate the betweenness score
of a particular actor, $i$, it is then necessary to determine how many
of the geodesic paths connecting $j$ to $k$ include $i$. That quantity is
$g_{jk}(n_i)$. With these (unrealistic) assumptions in place, we calculate
$C_B(n_i)$ as
$$C_B(n_i) = \sum_{j<k} g_{jk}(n_i)/g_{jk}.$$
Here, too, the maximum value depends on the size of the network.
$C_B(n_i) = 1$ if $i$ sits on every geodesic path in the network. While
this is only likely to occur in small, star-shaped networks, it is
still common to standardize the measure. Instead of conceptual-
izing network size in terms of the number of nodes, however, this
measure requires that we consider the number of possible pairs
of actors (excluding ego) in a structure. When there are $g$ nodes,
that quantity is $(g - 1)(g - 2)/2$ and the standardized betweenness
measure is
$$C'_B(n_i) = \frac{C_B(n_i)}{(g - 1)(g - 2)/2}.$$
Centrality measures of various sorts are the most commonly
used means to examine network effects at the level of individual
participants in a network. In the context of UMETRICS, such in-
dices might be applied to examine the differential scientific or ca-
reer success of graduate students as a function of their positions
in the larger networks of their universities. In such an analysis,
care must be taken to use the standardized measures as university
collaboration networks can vary dramatically in size and structure.
Describing and accounting for such variations and the possibility of
analyses conducted at the level of entire networks or subsets of net-
works, such as teams and labs, requires a different set of measures.
The code snippet presented below calculates each of these measures
for the university A employee network we have been examining.
# Calculate centrality measures
emp.vs["degree"]=emp.degree()
emp.vs["close"]=emp.closeness(vertices=emp.vs)
emp.vs["btc"]=emp.betweenness(vertices=emp.vs, directed=False)
8.4 Comparing collaboration networks
Consider Figure 8.6, which presents visualizations of the main com-
ponent of two university networks. Both of these representations
are drawn from a single year (2012) of UMETRICS data. Nodes rep-
resent people, and ties reflect the fact that those individuals were
paid with the same federal grant in the same year. The images are
scaled so that the physical location of any node is a function of
its position in the overall pattern of relationships in the network.
The size and color of nodes represent their betweenness centrality.
Larger, darker nodes are better positioned to play the role of bro-
kers in the network. A complete review of the many approaches to
network visualization and their dangers in the absence of descrip-
tive statistics such as those presented above is beyond the scope of
this chapter, but consider the guidelines presented in Chapter 9 on
information visualization as well as useful discussions by Powell et
al. [309] and Healy and Moody [163].
Consider the two images. University A is a major public institu-
tion with a significant medical school. University B, likewise, is a
public institution but lacks a medical school. It is primarily known
for strong engineering. The two networks manifest some interest-
ing and suggestive differences. Note first that the network on the
left (university A) appears much more tightly connected. There is a
[Figure 8.6. The main component of two university networks (left: University A; right: University B).]
dense center and there are fewer very large nodes whose positions
bridge less well-connected clusters. Likewise, the network on the
right (university B) seems at a glance to be characterized by a num-
ber of densely interconnected groups that are pulled together by ties
through high-degree brokers. One part of this may have to do with
the size and structure of university A’s medical school, whose sig-
nificant NIH funding dominates the network. In contrast, university
B’s engineering-dominated research portfolio seems to be arranged
around clusters of researchers working on similar topic areas and
lacks the dominant central core apparent in university A's image.
The implications of these kinds of university-level differences are
just starting to be realized, and the UMETRICS data offer great pos-
sibilities for exactly this kind of study. These networks, in essence,
represent the local social capacity to respond to new problems and
to develop scientific findings. Two otherwise similar institutions
might have quite different capabilities based on the structure and
composition of their collaboration networks.
The intuitions suggested by Figure 8.6 can also be checked
against some of the measures we have described. Figure 8.7, for in-
stance, presents degree distributions for each of the two networks.
Figure 8.8 presents the histogram of path lengths for each network.
It is evident from Figure 8.7 that they are quite different in
character. University A’s network follows a more classic skewed
distribution of the sort that is often associated with the kinds of
power-law degree distributions common to scale-free networks. In
contrast, university B’s distribution has some interesting features.
[Figure 8.7. Degree distribution for two universities (University A and University B panels): degree value (x-axis) versus count of nodes (y-axis).]
First, the left-hand side of the distribution is more dispersed than
it is for university A, suggesting that there are many nodes with
moderate degree. These nodes may also have high betweenness
centrality if their ties allow them to span different subgroups within
the networks. Of course this might also reflect the fact that each
cluster also has members that are more locally prominent. Finally,
consider the few instances on the right-hand end of the distribution
[Figure 8.8. Distribution of path lengths for universities A and B: path length (x-axis, 1–18) versus number of dyads (y-axis).]
where there are relatively large numbers of people with surprisingly
high degree. I suspect these are the result of large training grants
or center grants that employ many people. A quirk of relying on
one-mode projections of two-mode data is that every person asso-
ciated with a particular grant is connected to every other. More
work needs to be done to bear out these hypotheses, but for now
it suffices to say that the degree distribution of the networks bears
out the intuition we drew from the images that they are significantly
different.
The path length histogram presented in Figure 8.8 suggests a
similar pattern. While the average distance among any pair of con-
nected nodes in both networks is fairly similar (see Table 8.1), uni-
versity B has a larger number of unconnected nodes and university
A has a greater concentration of more closely connected dyads. The
preponderance of shorter paths in this network could also be a re-
sult of a few larger grants that connect many pairs of nodes at unit
distance and thus shorten overall path lengths.
But how do the descriptive statistics shake out? Table 8.1
presents the basic descriptive statistics we have discussed for each
network.

Table 8.1. Descriptive statistics for the main components of two university networks

                              University A    University B
Nodes                         4,999           4,144
Edges (total)                 57,756          91,970
% nodes in main component     68.67%          67.34%
Diameter                      18              18
Average degree                11.554          44.387
Clustering coefficient        0.855           0.913
Density                       0.005           0.011
Average path length           5.034           5.463

University A's network includes 855 more nodes than
university B’s, a difference of about 20%. In contrast, there are far
fewer edges connecting university A’s research employees than con-
necting university B’s, a difference that appears particularly starkly
in the much higher density of university B’s network. Part of the
story can be found in the average degree of nodes in each network.
As the degree distributions presented in Figure 8.7 suggested, the
average researcher at university B is much more highly connected
to others than is the case at university A. The difference is stark and
quite likely has to do with the presence of larger grants that employ
many individuals.
Both schools have a low average path length (around 5), suggest-
ing that the typical member of the network is only about five
acquaintances away from any other reachable member. Likewise, the diameter of both networks is 18,
which means that on each campus the most distant pair of nodes is
separated by just 18 steps. University A’s slightly lower path length
may be accounted for by the centralizing effect of its large medical
school grant infrastructure. Finally, consider the clustering coeffi-
cient. This measure approaches 1 as it becomes more likely that
two partners to a third node will themselves be connected. The like-
lihood that collaborators of collaborators will collaborate is high on
both campuses, but substantially higher at university B.
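For readers who want to reproduce a Table 8.1-style summary for their own data, the following sketch (assuming an undirected igraph graph such as emp) computes the same descriptive statistics; exact values will depend on how the main component is defined.

# A sketch of computing Table 8.1-style descriptive statistics for an
# undirected igraph graph g (e.g., the emp graph used in this chapter).
def describe_network(g):
    main = g.components().giant()      # largest connected component
    return {
        "nodes": g.vcount(),
        "edges (total)": g.ecount(),
        "% nodes in main component": 100.0 * main.vcount() / g.vcount(),
        "diameter": main.diameter(),
        "average degree": 2.0 * g.ecount() / g.vcount(),
        "clustering coefficient": g.transitivity_undirected(),
        "density": g.density(),
        "average path length": main.average_path_length(),
    }

print(describe_network(emp))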
8.5 Summary
This chapter has provided a brief overview of the basics of networks,
using UMETRICS data as a source for examples. While network
measures can produce new and exciting ways to characterize social
dynamics, networks are also important levels of analysis in their own
right. There are essential concepts that need to be mastered, such
as reachability, cohesion, brokerage, and reciprocity. Numbers are
important, for a variety of reasons—they can be used to describe
networks in terms of their composition and community structure.
This chapter provides a classic example of how well social science
meets data science. Social science is needed to identify the nodes
(what is being connected) and the ties (the relationships that mat-
ter) in order to construct the relevant networks. Computer science
is necessary to collect and structure the data in a fashion that is
sufficient for analysis. The combination of data science and social
science is key to making the right measurement and visualization
decisions.
8.6 Resources
For more information about network analysis in general, the Inter-
national Network for Social Network Analysis (http://www.insna.
org/) is a large, interdisciplinary association dedicated to network
analysis. It publishes a traditional academic journal, Social Net-
works, an online journal, Journal of Social Structure, and a short-
format journal, Connections, all dedicated to social network analy-
sis. Its several listservs offer vibrant international forums for dis-
cussion of network issues and questions. Finally, its annual meet-
ings include numerous opportunities for intensive workshops and
training for both beginning and advanced analysts.
A new journal, Network Science (http://journals.cambridge.org/
action/displayJournal?jid=NWS), published by Cambridge Univer-
sity Press and edited by a team of interdisciplinary network schol-
ars, is a good venue to follow for cutting-edge articles on compu-
tational network methods and for substantive findings from a wide
range of social, natural, and information science applications.
There are some good software packages available. Pajek (http://
mrvar.fdv.uni-lj.si/pajek/) is a freeware package for network anal-
ysis and visualization. It is routinely updated and has a vibrant
user group. Pajek is exceptionally flexible for large networks and
has a number of utilities that allow import of its relatively simple
file types into other programs and packages for network analysis.
Gephi (https://gephi.org/) is another freeware package that sup-
ports large-scale network visualization. Though I find it less flexible
than Pajek, it offers strong support for compelling visualizations.
Network Workbench (http://nwb.cns.iu.edu/) is a freeware pack-
age that supports extensive analysis and visualization of networks.
This package also includes numerous shared data sets from many
different fields that can be used to test and hone your network analytic
skills.
iGraph (http://igraph.org/redirect.html) is my preferred package
for network analysis. Implementations are available in R, in Python,
and in C libraries. The examples in this chapter were coded in
iGraph for Python.
Nexus (http://nexus.igraph.org/api/dataset_info?format=html&
limit=10&offset=20&operator=or&order=date) is a growing reposi-
tory for network data sets that includes some classic data dating
back to the origins of social science network research as well as
more recent data from some of the best-known publications and
authors in network science.
Part III
Inference and Ethics
Information Visualization
Chapter 9
M. Adil Yalçın and Catherine Plaisant
This chapter will show you how to explore data and communicate
results so that data can be turned into interpretable, actionable
information. There are many ways of presenting statistical infor-
mation that convey content in a rigorous manner. The goal of this
chapter is to present an introductory overview of effective visualiza-
tion techniques for a range of data types and tasks, and to explore
the foundations and challenges of information visualization.
9.1 Introduction
One of the most famous discoveries in science—that disease was
transmitted through germs, rather than through pollution—resulted
from insights derived from a visualization of the location of London
cholera deaths near a water pump [349]. Information visualization
in the twenty-first century can be used to generate similar insights:
detecting financial fraud, understanding the spread of a contagious
illness, spotting terrorist activity, or evaluating the economic health
of a country. But the challenge is greater: many ($10^2$–$10^7$) items
may be manipulated and visualized, often extracted or aggregated
from yet larger data sets, or generated by algorithms for analytics.
Visualization tools can organize data in a meaningful way that
lowers the cognitive and analytical effort required to make sense
of the data and make data-driven decisions. Users can scan, rec-
ognize, understand, and recall visually structured representations
more rapidly than they can process nonstructured representations.
The science of visualization draws on multiple fields such as percep-
tual psychology, statistics, and graphic design to present informa-
tion, and on advances in rapid processing and dynamic displays
to design user interfaces that permit powerful interactive visual
analysis.
[Figure 9.1. Anscombe’s quartet [16]. Left: the raw x and y values of four small number-pair data sets (A, B, C, D); right: scatterplots of the same four data sets. All four share essentially the same averages (x = 9, y = 7.50), variances (x = 11, y = 4.12), and correlation (0.816).]
Figure 9.1, “Anscombe’s quartet” [16], provides a classic example
of the value of visualization compared to basic descriptive statisti-
cal analysis. The left-hand panel includes the raw data of four small
number-pair data sets (A, B, C, D), which have the same averages,
variances, and correlation between the number pairs. The right-hand
panel shows these data sets visualized with each point plotted on
perpendicular axes (scatterplots), revealing dramatic differences
between the data sets, their trends, and their outliers.
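A quick way to verify this pattern yourself is to compute the summary statistics and plot the quartet. A sketch, assuming the copy of Anscombe's data bundled with the seaborn library:

# A sketch reproducing the Anscombe comparison; assumes seaborn
# (which bundles a copy of the quartet) and matplotlib are installed.
import seaborn as sns

df = sns.load_dataset("anscombe")              # columns: dataset, x, y
print(df.groupby("dataset")[["x", "y"]].agg(["mean", "var"]))
print(df.groupby("dataset").apply(lambda d: d["x"].corr(d["y"])))

# Nearly identical summaries, very different shapes:
sns.lmplot(data=df, x="x", y="y", col="dataset", height=2.5)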
In broad terms, visualizations are used either to present results
or for analysis and open-ended exploration. This chapter provides
an overview of how modern information visualization, or visual data
mining, can be used in the context of big data.
9.2 Developing effective visualizations
The effectiveness of a visualization depends on both analysis needs
and design goals. Sometimes, questions about the data are known
in advance; in other cases, the goal may be to explore new data sets,
generate insights, and answer questions that are unknown before
starting the analysis. The design, development, and evaluation of a
visualization is guided by understanding the background and goals
of the target audience (see Box 9.1).
Box 9.1:
The development of an effective visualization is an iterative pro-
cess that generally includes the following steps:
- Specify user needs, tasks, accessibility requirements, and criteria for success.
- Prepare data (clean, transform).
- Design visual representations.
- Design interaction.
- Plan sharing of insights, provenance.
- Prototype/evaluate, including usability testing.
- Deploy (monitor usage, provide user support, manage revision process).
If the goal is to present results, there is a wide spectrum of users
and a wide range of options. (See Chapters 2, 3, 4, and 5 for an
overview of collecting, merging, storing, and processing data sets.)
If the audience is broad, then info-
graphics can be developed by graphic designers, as described in
classic texts (see [119,380,381] or the examples compiled by Harri-
son et al. [157,206]). If, on the other hand, the audience comprises
domain experts interested in monitoring the overview status of dy-
namic processes on a continuous basis, dashboards can be used.
Examples include the monitoring of sales, or the number of tweets
about people, or symptoms of the flu and how they compare to a
baseline [120]. Dashboards can increase situational awareness so
that problems can be noticed and solved early and better decisions
can be made with up-to-date information.
Another goal of visualization is to enable interactive exploratory
analysis for casual users as well as professional analysts. One
casual example is BabyNameVoyager [156], which lets users type in
a name and see a graph of its popularity over the past century. (As
the baby name is typed letter by letter, BabyNameVoyager visualizes
the popularity of all the names starting with the letters entered so
far, and it animates smoothly with each new letter input. For example,
typing “Jo” shows all the names starting with “Jo”. This view reveals
that Joyce and Joan were popular girl names in the 1930s, and the
use of “John” has declined since the 1960s.)
Data analysis tools can enable analysis of structured and generic
data sets. Figure 9.2 shows an exploratory browser of a selection of
awards (grants) from the US Department of Agriculture, created us-
ing the web-based data exploration tool Keshif (http://www.keshif.
me). The awards (records) are listed in the middle panel by latest
start date first. Attributes of the awards, such as funding source,
are shown on side panels, revealing the range of their values. The
awards with status new are focused per the filtering selection, and
Figure 9.2. A data analysis browser of a selection of grants from the US Department of Agriculture was created by
using the web-based tool Keshif
are visualized using darker gray colors. Lighter gray colors provide
visual cues to the distribution of all awards. The orange selection
highlights distributions of awards with formula funding, known as
Hatch funding. The analyst can explore trends in new or old awards,
funding resources, and other attributes in this rich award portfolio.
Commercial tools such as Spotfire and Tableau, among many
other tools (see Section 9.6), allow users to create visualizations by
offering chart types and visual design environments to analyze their
data, and to combine them in potent dashboards rapidly shared
with colleagues. For example, Figure 9.3 shows the charting in-
terface of Tableau on a transaction data set. The left-hand panel
shows the list of attributes associated with vendor transactions for a
given university. The visualization (center) is constructed by placing
the month of spending in chart columns, and the sum of payment
amount on the chart row, with data encoded using line mark type.
Agencies are broken down by color mapping. The agency list, to the
right, allows filtering the agencies, which can be used to simplify the
chart view. A peak in the line chart is annotated with an explanation
of the spike. On the rightmost side, the Show Me panel suggests
the applicable chart types potentially appropriate for the selected
attributes. This chart can be combined with other charts focus-
ing on other aspects in interactive dashboards. Figure 9.4 shows
Figure 9.3. Charting interface of Tableau

Figure 9.4. A treemap visualization of agency and sub-agency spending breakdown

[Figure 9.5. Visual elements described by MacKinlay [245], ordered from more accurate (position, length, angle, slope) through area and volume to less accurate (color, density).]
a treemap [192] for agency and sub-agency spending breakdown,
combined with a map showing average spending per state. Okla-
homa stands out with few but large expenditures. Mousing over
Oklahoma reveals details of these expenditures. An additional
histogram provides an overview of spending change across three
years.
Creating effective visualizations requires careful consideration of
many components. Data values may be encoded using one or more
visual elements, like position, length, color, angle, area, and texture
(Figure 9.5; see also [79, 380]). Each of these can be organized in a
multitude of ways, discussed in more detail by Munzner [264]. In
addition to visual data encoding, units for axes, labels, and legends
need to be provided as well as explanations of the mappings when
the design is unconventional. (The website “A World of Terror” pro-
vides some compelling examples [301].) Annotations or comments
can be used to guide viewer attention and to describe related in-
sights. Providing attribution and data source, where applicable, is
an ethical practice that also enables validating data, and promotes
reuse to explore new perspectives.
The following is a short list of guidelines: provide immediate
feedback upon interaction with the visualization; generate tightly
coupled views (i.e., so that selection in one view updates the oth-
ers); and use a high “data to ink ratio” [380]. Use color carefully
and ensure that the visualization is truthful (e.g., watch for per-
ceptual biases or distortion). Avoid use of three-dimensional rep-
resentations or embellishments, since comparing 3D volumes is
perceptually challenging and occlusion is a problem. Labels and
legends should be meaningful, novel layouts should be carefully ex-
plained, and online visualizations should adapt to different screen
sizes. For extended and in-depth discussions, see various text-
books [119,208, 264, 380,381,395].
We provide a summary of the basic tasks that users typically
perform during visual analysis of data in the next section.
9.3 A data-by-tasks taxonomy
We give an overview of visualization approaches for six common
data types: multivariate, spatial, temporal, hierarchical, network,
and text [344]. For each data type listed in this section, we dis-
cuss its distinctive properties, the common analytical questions,
and examples. Real-life data sets often include multiple data types
coming from multiple sources. Even a single data source can in-
clude a variety of data types. For example, a single data table of
countries (as rows) can have a list of attributes with varying types:
the growth rate in the last 10 years (one observation per year, time
series data), their current population (single numerical data), the
amount of trade with other countries (networked/linked data), and
the top 10 exported products (if grouped by industry, hierarchical
data). Furthermore, we provide an overview of common tasks for vi-
sual data analysis in Box 9.2, which can be applied across different
data types based on goals and types of visualizations.
Interactive visualization design is also closely coupled with the
targeted devices. Conventionally, visualizations have been designed
for mouse and keyboard interaction on desktop computers. How-
ever, a wider range of device forms, such as mobile devices with
small displays and touch interaction, is becoming common. Creat-
ing visualizations for new forms requires special care, though basic
design principles such as “less is more” still apply.
9.3.1 Multivariate data
In common tabular data, each record (row) has a list of attributes
(columns), whose value is mostly categorical or numerical. The
analysis of multivariate data with basic categorical and interval
types aims to understand patterns within and across data attributes.
Box 9.2: A task categorization for visual data analysis

Select/Query
- Filter to focus on a subset of the data
- Retrieve details of item
- Brush linked selections across multiple charts
- Compare across multiple selections

Navigate
- Scroll along a dimension (1D)
- Pan along two dimensions (2D)
- Zoom along the third dimension (3D)

Derive
- Aggregate item groups and generate characteristics
- Cluster item groups by algorithmic techniques
- Rank items to define ordering

Organize
- Select chart type and data encodings to organize data
- Lay out multiple components or panels in the interface

Understand
- Observe distributions
- Compare items and distributions
- Relate items and patterns

Communicate
- Annotate findings
- Share results
- Trace action histories
Given a larger number of attributes, one of the challenges in data
exploration and analytics is to select the attributes and relations to
focus on. Expertise in the data domain can be helpful for targeting
relevant attributes.
Multivariate data can be presented in multiple forms of charts
depending on the data and relations being explored. One-dimen-
sional (1D) charts present data on a single axis only. An example
is a box-plot, which shows quartile ranges for numerical data. So-
called 1.5D charts list the range of possible values on one axis, and
describe a measurement of data on the other. Bar charts are a ubiq-
uitous example, in order to show, for example, a numeric grade per
student, or grade average for aggregated student groups by gen-
der. Records can also be grouped over numerical ranges, and bars
can show the number of items in each grouping, which generates
ahistogram chart. Two-dimensional charts plot data along two at-
tributes, such as scatterplots. Matrix (grid) charts can also be used
to show relations between two attributes. Heatmaps visualize each
matrix cell using color to represent its value. Correlation matrices
show the relation between attribute pairs.
To show relations of more than two attributes (3D+), one option is
to use additional visual encodings in a single chart, for example, by
adding point size/shape as a data variable in scatterplots. Another
option is to use alternative visual designs that can encode multiple
relations within a single chart. For example, a parallel coordinate
plot [180] has multiple parallel axes, each one representing an at-
tribute; each record is shown as connected lines passing through
the record’s values on each attribute. Charts can also show part-of-
whole relations using appropriate mappings based on subdividing
the chart space, such as stacked charts or pie charts.
Finally, another approach to analyzing multidimensional data is
to use clustering algorithms to identify similar items. Clusters are
typically represented as a tree structure (see Section 9.3.4). For
example, k-means clustering starts by users specifying how many
clusters to create; the algorithm then places every item into the
most appropriate cluster. Surprising relationships and interesting
outliers may be identified by these mechanical analysis algorithms.
However, such results may require more effort to
interpret.
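As a minimal illustration of the clustering approach just described, here is a sketch using scikit-learn's k-means on a made-up numeric table; the data and parameter choices are placeholders for illustration only.

# A sketch of k-means clustering on a multivariate table; assumes
# scikit-learn is installed. The data here are made up for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))          # 200 records, 4 numeric attributes

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_                # cluster assignment for each record

# The labels can then drive a color encoding in a scatterplot, or group
# records for comparison across clusters.
print(np.bincount(labels))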
9.3.2 Spatial data
Spatial data convey a physical context, commonly in a 2D space,
such as geographical maps or floor plans. Several of the most
famous examples of information visualization include maps, from
the 1861 representation of Napoleon’s ill-fated Russian campaign
by Minard (popularized by Tufte [380] and Kraak [214]) to the in-
teractive HomeFinder application that introduced the concept of dy-
namic queries [7]. The tasks include finding adjacent items, regions
containing certain items or with specific characteristics, and paths
between items—and performing the basic tasks listed in Box 9.2.
The primary form of visualizing spatial data is maps. In choro-
pleth maps, color encoding is used to represent one data at-
tribute. Cartograms aim to encode the attribute value with the size
of regions by distorting the underlying physical space. Tile grid
maps reduce each spatial area to a uniform size and shape (e.g.,
a square) so that the color-coded data are easier to observe and
compare, and they arrange the tiles to approximate the neighbor
relations between physical locations [92,355]. Grid maps also make
selection of smaller areas (such as small cities or states) easier. Con-
tour (isopleth) maps connect areas with similar measurements and
color each one separately. Network maps aim to show network con-
nectivity between locations, such as flights to/from many regions
of the world. Spatial data can be also presented with a nonspatial
emphasis (e.g., as a hierarchy of continents, countries, and cities).
Maps are commonly combined with other visualizations. For ex-
ample, in Figure 9.6, the US Cancer Atlas combines a map showing
patterns across states on one attribute, with a sortable table provid-
ing additional statistical information and a scatterplot that allows
users to explore correlations between attributes.
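As a small illustration of a choropleth map, here is a sketch using the plotly library; the state-level values are invented purely for the example.

# A sketch of a choropleth map; assumes plotly is installed and uses
# invented state-level values purely for illustration.
import plotly.express as px

fig = px.choropleth(
    locations=["CA", "TX", "NY", "OK"],
    locationmode="USA-states",
    color=[1.2, 3.4, 2.1, 5.0],
    scope="usa",
    labels={"color": "Value"},
)
fig.show()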
9.3.3 Temporal data
Time is the unique dimension in our physical world that steadily
flows forward. While we cannot control time, we frequently record
it as a point or interval. Time has multiple levels of representation
(year, month, day, hour, minute, and so on) with irregularities (leap
year, different days per month, etc.). As we measure time based
on cyclic events in nature (day/night), our representations are also
commonly cyclic. For example, January follows December (first
month follows last). This cyclic nature can be captured by circular
visual encodings, such as the the conventional clock with hour,
minute, and second hands.
Time series data (Figures 9.7 and 9.8) describe values measured
at regular intervals, such as stock market or weather data. The
focus of analysis is to understand temporal trends and anomalies,
to query for specific patterns, or to make predictions. To show multiple time-
Figure 9.6. The US Cancer Atlas [66]. Interface based on [244]
[Figure 9.7. Horizon graphs used to display monthly time series (1976–1988) for several US states, including West Virginia, Michigan, Alaska, Mississippi, District of Columbia, California, and Oregon.]
Figure 9.8. EventFlow (www.cs.umd.edu/hcil/eventflow) is used to visualize sequences of innovation activities by
Illinois companies. Created with EventFlow; data sources include NIH, NSF, USPTO, SBIR. Image created by C.
Scott Dempwolf, used with permission
series trends across different data categories in a very compact chart
area, each trend can be shown with small height using a multi-
layered color approach, creating horizon graphs. While perceptually
effective after learning to read its encoding, this chart design may
not be appropriate for audiences who may lack such training or
familiarity.
Another form of temporal analysis is understanding sequences of
events. The study of human activity often includes analyzing event
sequences. For example, students’ records include events such as
attending orientation, getting a grade in a class, going on intern-
ship, and graduation. In the analysis of event sequences, finding
the most common patterns, spotting rare ones, searching for spe-
cific sequences, or understanding what leads to particular types of
events is important (e.g., what events lead to a student dropping
out, precede a medical error, or a company filing bankruptcy). Fig-
ure 9.8 shows EventFlow used to visualize sequences of innovation
activities by Illinois companies. Activity types include research, in-
vention, prototyping, and commercialization. The timeline (right
panel) shows the sequence of activities for each company. The
overview panel (center) summarizes all the records aligned by the
first prototyping activity of the company. In most of the sequences
shown here, the company’s first prototype is preceded by two or
more patents with a lag of about a year.
9.3.4 Hierarchical data
Data are often organized in a hierarchical fashion. Each item ap-
pears in one grouping (e.g., like a file in a folder), and groups can be
grouped to form larger groups (e.g., a folder within a folder), up to
the root (e.g., a hard disk). Items, and the relations between items
and their grouping, can have their own attributes. For example,
the National Science Foundation is organized into directorates and
divisions, each with a budget and a number of grant recipients.
Analysis may focus on the structure of the relations, by ques-
tions such as “how deep is the tree?”, “how many items does this
branch have?”, or “what are the characteristics of one branch com-
pared to another?” In such cases, the most appropriate repre-
sentation is usually a node-link diagram [62, 303]. In Figure 9.9,
Figure 9.9. SpaceTree (www.cs.umd.edu/hcil/spacetree/)

Figure 9.10. The Finviz treemap helps users monitor the stock market (www.finviz.com)
SpaceTree is used to browse a company organizational chart. Since
not all the nodes of the tree fit on the screen, we see an iconic repre-
sentation of the branches that cannot be displayed, indicating the
size of each branch. As the tree branches are opened or closed,
the layout is updated with smooth multiple-step animations to help
users remain oriented.
When the structure is less important but the attribute values
of the leaf nodes are of primary interest, treemaps, a space-filling
approach, are preferable as they can show arbitrary-sized trees in
a fixed rectangular space and map one attribute to the size of each
rectangle and another to color. For example, Figure 9.10 shows the
Finviz treemap that helps users monitor the stock market. Each
stock is shown as a rectangle. The size of the rectangle represents
market capitalization, and color indicates whether the stock is going
up or down. Treemaps are effective for situation awareness: we can
see that today is a fairly bad day as most stocks are red (i.e., down).
Stocks are organized in a hierarchy of industries, allowing users
to see that “healthcare technology” is not doing as poorly as most
other industries. Users can also zoom on healthcare to focus on
that industry.
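A minimal treemap along these lines can be produced with the plotly library; in this sketch the sectors, tickers, and values are invented for illustration.

# A sketch of a treemap in the spirit of the Finviz example; assumes
# plotly is installed, with invented sectors, tickers, and values.
import plotly.express as px

fig = px.treemap(
    names=["Tech", "AAPL", "MSFT", "Energy", "XOM"],
    parents=["", "Tech", "Tech", "", "Energy"],
    values=[0, 40, 35, 0, 25],
    color=[0.0, 1.5, -0.5, 0.0, 2.0],   # e.g., daily % change
    color_continuous_scale="RdYlGn",
)
fig.show()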
[Figure 9.11. NodeXL showing innovation networks of the Great Lakes manufacturing region, with clusters labeled by sector (e.g., Automotive, Electronics, Paper/Packaging, Industrial Biotech, Controls, Photonics, Automotive Parts, Chem & Materials, Ag Equipment) and by location (Illinois-based, Toronto-based). Created with NodeXL. Data source: USPTO. Image created by C. Scott Dempwolf, used with permission.]
9.3.5 Network data
Network data encode relationships between items (see Chapter 8): for example,
social connection patterns (friendships, follows and reposts, etc.),
travel patterns (such as trips between metro stations), and commu-
nication patterns (such as emails). The network overviews attempt
to reveal the structure of the network, show clusters of related items
(e.g., groups of tightly connected people), and allow the path be-
tween items to be traced. Analysis can also focus on attributes of
the items and the links in between, such as age of people in com-
munication or the average duration of communications.
Node-link diagrams are the most common representation of net-
work structures and overviews (Figures 9.11 and 9.12), and may
use linear (arc), circular, or force-directed layouts for positioning
the nodes (items). Matrices or grid layouts are also a valuable way
to represent networks [164]. Hybrid solutions have been proposed,
Figure 9.12. An example from “Maps of Science: Forecasting Large Trends in Science,” 2007, The Regents of the
University of California, all rights reserved [45]
with powerful ordering algorithms to reveal clusters [153]. A ma-
jor challenge in network data exploration is in dealing with larger
networks where nodes and edges inevitably overlap by virtue of the
underlying network structure, and where aggregation and filtering
may be needed before effective overviews can be presented to users.
Figure 9.11 shows the networks of inventors (white) and compa-
nies (orange) and their patenting connections (purple lines) in the
network visualization NodeXL. Each company and inventor is also
connected to a location node (blue = USA; yellow = Canada). Green
lines are weak ties based on patenting in the same class and sub-
class, and they represent potential economic development leads.
The largest of the technology clusters are shown using the group-
in-a-box layout option, which makes the clusters more visible. Note
the increasing level of structure moving from the cluster in the lower
right to the main cluster in the upper left. NodeXL is designed for
interactive network exploration; many controls (not shown in the
figure) allow users to zoom on areas of interest or change options.
Figure 9.12 shows an example of network visualization on science as
a topic used for data presentation in a book and a traveling exhibit.
Designed for print media, it includes a clear title and annotations
and shows a series of topic clusters at the bottom with a summary
of the insights gathered by analysts.
9.3.6 Text data
Text is usually preprocessed (for word/paragraph counts, sentiment
analysis, categorization, etc.; see Chapter 7 for text analysis
approaches) to generate metadata about text segments, which are
then visualized. Simple visualizations like tag
clouds display statistics about word usage in a text collection, or can
be used to compare two collections or text segments. While visually
appealing, they can easily be misinterpreted and are often replaced
by word indexes sorted by some count of interest. Specialized visual
text analysis tools combine multiple visualizations of data extracted
from the text collections, such as matrices to see relations, network
diagrams, or parallel coordinates to see entity relationships (e.g.,
between what, who, where, and when). Timelines can be mapped to
the linear dimension of text. Figure 9.13 shows an example using
Jigsaw [357] for the exploration of car reviews. Entities have been
extracted automatically (in this case make, model, features, etc.),
and a cluster analysis has been performed, visualized in the bottom
right. A separate view (rightmost) allows analysts to review links be-
tween entities. Another view allows traversing word sequences as a
tree. Reading original documents is critical, so all the visualization
elements are linked to the corresponding text.
9.4 Challenges
While information visualization is a powerful tool, there are many
obstacles to its effective use. We note here four areas of particu-
lar concern: scalability, evaluation, visual impairment, and visual
literacy.
Figure 9.13. Jigsaw used to explore a collection of car reviews
9.4.1 Scalability
Most visualizations handle relatively small data sets (between a
thousand and a hundred thousand, sometimes up to millions, de-
pending on the technique) but scaling visualizations from millions
to billions of records does require careful coordination of analytic
algorithms to filter data or perform rapid aggregation, effective vi-
sual summary designs, and rapid refreshing of displays [343]. The
visual information seeking mantra, “Overview first, zoom and filter,
then details on demand,” remains useful with data at scale. To
accommodate a billion records, aggregate markers (which may rep-
resent thousands of records) and density plots are useful [101]. In
some cases the large volume of data can be aggregated meaningfully
into a small number of pixels. One example is Google Maps and its
visualization of road conditions. A quick glance at the map allows
drivers to use a highly aggregated summary of the speed of a large
number of vehicles and only a few red pixels are enough to decide
when to get on the road.
While millions of graphic elements may be represented on large
screens [116], perception issues need to be taken into considera-
tion [411]. Extraction and filtering may be necessary before even
attempting to visualize individual records [408]. Preserving interac-
tive rates in querying big data sources is a challenge, with a variety
of methods proposed, such as approximations [123] and compact
caching of aggregated query results [239]. Progressive loading and
processing will help users review the results as they appear and
steer the lengthy data processing [115, 135]. Systems are starting
to emerge, and strategies to cope with volume and variety of pat-
terns are being described [344].
9.4.2 Evaluation
Human-centric evaluation of visualization techniques can generate
qualitative and quantitative assessments of their potential quality,
with early studies focusing on the effectiveness of basic visual vari-
ables [245]. To this day, user studies remain the workhorse of eval-
uation. In laboratory settings, experiments can demonstrate faster
task completion, reduced error rates, or increased user satisfaction.
These studies are helpful for comparing visual and interaction de-
signs. For example, studies are reporting on the effects of latency
on interaction and understanding [241], and often reveal that dif-
ferent visualizations perform better for different tasks [303, 321].
Evaluations may also aim to measure and study the amount and
value of the insights revealed by the use of exploratory visualiza-
tion tools [325]. Diagnostic usability evaluation remains a corner-
stone of user-centered design. Usability studies can be conducted
at various stages of the development process to verify that users are
able to complete benchmark tasks with adequate speed and accu-
racy. Comparisons with the technology previously used by target
users may also be possible to verify improvements. Metrics need
to address the learnability and utility of the system, in addition to
performance and user satisfaction [223]. Usage data logging, user
interviews, and surveys can also help identification of potential im-
provements in visualization and interaction design.
9.4.3 Visual impairment
Color impairment is a common condition that needs to be taken
into consideration [291]. For example, red and green are appealing
for their intuitive mapping to positive or negative outcomes (also
depending on cultural associations); however, users with red–green
color blindness, one of the most common forms, would not be able to
differentiate such scales clearly. To assess and assist visual design
under different color deficiencies, color simulation tools can be used
(see additional resources). The impact of color impairment can be
mitigated by careful selection of limited color schemes, using double
encoding when appropriate (i.e., using symbols that vary by both
shape and color), and allowing users to change or customize color
palettes. To accommodate users with low vision, adjustable size and
zoom settings can be useful. Users with severe visual impairments
may require alternative accessibility-first interface and interaction
designs.
9.4.4 Visual literacy
While the number of people using visualization continues to grow,
not everyone is able to accurately interpret graphs and charts.
When designing a visualization for a population of users who are ex-
pected to make sense of the data without training, it is important to
adequately estimate the level of visual literacy of those users. Even
simple scatterplots can be overwhelming for some users. Recent
work has proposed new methods for assessing visual literacy [47],
but user testing with representative users in the early stages of de-
sign and development will remain necessary to verify that adequate
designs are being used. Training is likely to be needed to help ana-
lysts get started when using more visual analytics tools. Recorded
video demonstrations and online support for question answering
are helpful to bring users from novice to expert levels.
9.5 Summary
The use of information visualization is spreading widely, with a
growing number of commercial products and additions to statis-
tical packages now available. Careful user testing should be con-
ducted to verify that visual data presentations go beyond the desire
for eye-candy in visualization, and to implement designs that have
demonstrated benefits for realistic tasks. Visualization is becom-
ing increasingly used by the general public and attention should be
given to the goal of universal usability so the widest range of users
can access and benefit from new approaches to data presentation
and interactive analysis.
9.6 Resources
We have referred to various textbooks throughout this chapter.
Tufte’s books remain the classics, as inspiring to read as they are
instructive [380, 381]. We also recommend Few’s books on infor-
mation visualization [119] and information dashboard design [120].
See also the book’s website for further readings.
Given the wide variety of goals, tasks, and use cases of visual-
ization, many different data visualization tools have been developed
that address different needs and appeal to different skill levels. In
this chapter we can only point to a few examples to get started.
To generate a wide range of visualizations and dashboards, and to
quickly share them online, Tableau and Tableau Public provide a
flexible visualization design platform. If a custom design is required
and programmers are available, d3 is the de facto low-level library
of choice for many web-based visualizations, with its native integra-
tion to web standards and flexible ways to convert and manipulate
data into visual objects as a JavaScript library. There exist other
JavaScript web libraries that offer chart templates (such as High-
charts), or web services that can be used to create a range of charts
from given (small) data sets, such as Raw or DataWrapper. To clean,
transform, merge, and restructure data sources so that they can be
visualized appropriately, tools like Trifacta and Alteryx can be used
to create pipelines for data wrangling. For statistical analysis and
batch-processing data, programming environments such as R or li-
braries for languages such as Python (for example, the Python Plotly
library) can be used.
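As a concrete starting point for the last option, the sketch below shows how a basic interactive chart might be produced with the Python Plotly library mentioned above, using the plotly.express interface and its bundled Gapminder sample data (the column names are those that ship with that sample data set; this is an illustrative sketch, not a prescribed workflow).

```python
# A minimal sketch of an interactive chart with Plotly (plotly.express).
# Assumes the plotly package is installed (pip install plotly); the Gapminder
# sample data and its column names (gdpPercap, lifeExp, pop, continent) are
# bundled with plotly.express.
import plotly.express as px

df = px.data.gapminder().query("year == 2007")  # one year of the sample data

fig = px.scatter(
    df,
    x="gdpPercap",          # GDP per capita on the x-axis (log scale below)
    y="lifeExp",            # life expectancy on the y-axis
    size="pop",             # bubble size encodes population
    color="continent",      # double encoding: color distinguishes continents
    hover_name="country",   # tooltip label for interactive exploration
    log_x=True,
    title="Life expectancy vs. GDP per capita, 2007",
)
fig.show()  # opens the interactive chart in a browser or notebook
```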
An extended list of tools and books is available at http://www.
keshif.me/demo/VisTools.
Chapter 10
Errors and Inference
Paul P. Biemer
This chapter deals with inference and the errors associated with big
data. Social scientists know only too well the cost associated with
bad data—we highlighted both the classic Literary Digest example
and the more recent Google Flu Trends problems in Chapter 1.
Although the consequences are well understood, the new types of
data are so large and complex that their properties often cannot
be studied in traditional ways. In addition, the data generating
function is such that the data are often selective, incomplete, and
erroneous. Without proper data hygiene, the errors can quickly
compound. This chapter provides, for the first time, a systematic
way to think about the error framework in a big data setting.
10.1 Introduction
The massive amounts of high-dimensional and unstructured data
in big data bring both new opportunities and new challenges to the
data analyst. Many of the problems with big data are well known
(see, for example, the AAPOR report by Japec et al. [188]). The
volume, variety, and velocity of big data can overwhelm traditional
methods of data analysis. Add to this complexity the fact that, as it
is generated, big data is often selective, incomplete, and erroneous.
As it is processed, new errors can be introduced in downstream
operations.
Big data is typically aggregated from disparate sources at various
points in time and integrated to form data sets. These processes
involve linking records together, transforming them to form new
attributes (or variables), documenting the actions taken (although
sometimes inadequately), and interpreting the newly created fea-
tures of the data. These activities may introduce new errors into
the data set: errors that may be either variable (i.e., errors that
create random noise resulting in poor reliability) or systematic (i.e.,
errors that tend to be directional, thus exacerbating biases). Using
big data in statistically valid ways is increasingly challenging in this
environment; however, it is important for data analysts to be aware
of the error risks and the potential effects of big data error on infer-
ences and decision-making. The massiveness, high dimensionality,
and accelerating pace of big data, combined with the risks of vari-
able and systematic data errors, require new, robust approaches
to data analysis.
The core issue confronting big data veracity is that such data
may not be generated from instruments and methods designed to
produce valid and reliable data for scientific analysis and discovery.
Rather, these are data being repurposed for uses not origi-
nally intended. Big data has been referred to as “found” data or
“data exhaust” because it is generated for purposes that often do
not align with those of the data analyst. In addition to inadvertent
errors, there are also errors from mischief in big data; for example,
automated systems have been written to generate bogus content on
social media that is indistinguishable from legitimate or authen-
tic data. Regardless of the source, many big data generators have
little or no regard for the quality of the data that are cast off from
their processes. Big data analysts must be keenly aware of these
limitations and should take the necessary steps to understand and
hopefully mitigate the effects of hidden errors on their results.
10.2 The total error paradigm
We now provide a framework for describing, mitigating, and inter-
preting the errors in essentially any data set, be it structured or
unstructured, massive or small, static or dynamic. This framework
has been referred to as the total error framework or paradigm. We
begin by reviewing the traditional paradigm, acknowledging its lim-
itations for truly big data sets, and we suggest how this framework
can be extended to encompass the new error structures often asso-
ciated with big data.
10.2.1 The traditional model
Dealing with the risks that errors introduce in big data analysis can
be facilitated through a better understanding of the sources and
nature of those errors. Such knowledge is gained through in-depth
understanding of the data generating mechanism, the data pro-
cessing/transformation infrastructure, and the approaches used to
create a specific data set or the estimates derived from it. For sur-
vey data, this knowledge is embodied in the well-known total survey
error (TSE) framework that identifies all the major sources of error
contributing to data validity and estimator accuracy [33, 35, 143].
The TSE framework attempts to describe the nature of the error
sources and what they may suggest about how the errors could af-
fect inference. The framework parses the total error into bias and
variance components that, in turn, may be further subdivided into
subcomponents that map the specific types of errors to unique com-
ponents of the total mean squared error. It should be noted that,
while our discussion on issues regarding inference has quantitative
analyses in mind, some of the issues discussed here are also of
interest to more qualitative uses of big data.
For surveys, the TSE framework provides useful insights regard-
ing how the many steps in the data generating, reformatting, and
file preparation processes affect estimation and inference, and may
also suggest methods for either reducing the errors at their source
or adjusting for their effects in the final data products to produce
inferences of higher quality. The AAPOR Task Force Report on big
data referenced above [188] describes the concept of data quality
for big data and provides a total error framework for big data that
we consider and further extend in this chapter. This approach was
closely modeled after the TSE framework since, as we shall see,
both share a number of common error sources. However, the big
data total error (BDTE) framework, as it is called in the AAPOR re-
port, necessarily includes additional error sources that are unique
to big data and can create substantial biases and uncertainties in
big data products. Like the TSE framework, the BDTE framework
aids our understanding of the limitations of the data, leading to
better-informed analyses and applications of the results. It may
also inform a research agenda for reducing the effects of error on
big data analytics.
In Figure 10.1, a canonical data file is represented as an array
consisting of rows (records) and columns (variables), with their size
denoted by N and p, respectively. Many data sets derived from big
data can be represented in this way, at least conceptually. Later,
we will consider data sets that do not conform to this rectangular
structure and may even be unstructured.
Many administrative data sets have a simple tabular structure,
as do survey sampling frames, population registers, and account-
ing spreadsheets. Let us assume that the data set is intended to
Figure 10.1. A typical rectangular data file format: N rows (records) numbered 1, 2, . . . , N and p columns (variables V1, V2, . . . , Vp)
represent some target population to which inference will be made in
the subsequent analysis. Thus, the rows are typically aligned with
units or elements of this target population, the columns represent
characteristics, variables (or features) of the row elements, and the
cells correspond to values of the column features for elements on
the rows.
The total error for this data set may be expressed by the following
heuristic formula:
Total error = Row error + Column error + Cell error.
Row error  For the situations considered in this chapter, the row errors may be of three types:

• Omissions: Some rows are missing, which implies that elements in the target population are not represented on the file.

• Duplications: Some population elements occupy more than one row.

• Erroneous inclusions: Some rows contain elements or entities that are not part of the target population.
For survey sample data sets, omissions include members of the
target population that are either inadvertently or deliberately ab-
sent from the frame, as well as nonsampled frame members. For
big data, the selectivity of the capture mechanism is a common
form of omissions. For example, a data set consisting of persons
who conducted a Google search in the past week necessarily ex-
cludes persons not satisfying that criterion who may have quite
different characteristics from those who do. Such exclusions can
therefore be viewed as a source of selectivity bias if inference is to
be made to the general population. For one, persons who do not
have access to the Internet are excluded from the data set. These
exclusions may be biasing in that persons with Internet access may
have quite different demographic characteristics from persons who
do not have Internet access. The selectivity of big data capture is
similar to frame noncoverage in survey sampling and can bias in-
ferences when researchers fail to consider it and compensate for it
in their analyses.
Example: Google searches
As an example, in the United States, the word “Jewish” is included in 3.2 times
more Google searches than “Mormon” [360]. This does not mean that the Jewish
population is 3.2 times larger than the Mormon population. Another possible
explanation is that Jewish people use the Internet in higher proportions or have
more questions that require using the word “Jewish.” Thus Google search data are
more useful for relative comparisons than for estimating absolute levels.
A well-known formula in the survey literature provides a use-
ful expression for the so-called coverage bias in the mean of some
variable, V. Denote the mean by $\bar{V}$, and let $\bar{V}_T$ denote the (possibly
hypothetical, because it may not be observable) mean of the target
population of $N_T$ elements, including the $N_T - N$ elements that are
missing from the observed data set. Then the bias due to this noncoverage is
$$B_{NC} = \bar{V} - \bar{V}_T = (1 - N/N_T)(\bar{V}_C - \bar{V}_{NC}),$$
where $\bar{V}_C$ is the mean of the covered elements (i.e., the elements in the observed
data set) and $\bar{V}_{NC}$ is the mean of the $N_T - N$ noncovered elements.
Thus we see that, to the extent that the difference between the covered
and noncovered elements is large, or the fraction of missing
elements $(1 - N/N_T)$ is large, the bias in the descriptive statistic will
also be large. As in survey research, often we can only speculate
about the sizes of these two components of bias. Nevertheless, spec-
ulation is useful for understanding and interpreting the results of
data analysis and cautioning ourselves regarding the risks of false
inference.
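To make the arithmetic concrete, the following minimal sketch computes the noncoverage bias B_NC for a toy example; all of the numbers are invented for illustration.

```python
# Illustrative calculation of the noncoverage bias
#   B_NC = (1 - N/N_T) * (mean_covered - mean_noncovered).
# All numbers below are hypothetical, chosen only to show the arithmetic.

N_T = 1_000_000          # size of the target population
N = 700_000              # elements actually captured in the data set
mean_covered = 52.0      # mean of V among covered elements
mean_noncovered = 40.0   # mean of V among the N_T - N missing elements

missing_fraction = 1 - N / N_T
bias_nc = missing_fraction * (mean_covered - mean_noncovered)

print(f"Fraction missing: {missing_fraction:.2f}")   # 0.30
print(f"Noncoverage bias: {bias_nc:.2f}")            # 0.30 * 12 = 3.60
```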
We can also expect that big data sets, such as a data set con-
taining Google searches during the previous week, could have the
same person represented many times. People who conducted many
searches during the data capture period would be disproportion-
ately represented relative to those who conducted fewer searches.
If the rows of the data set correspond to tweets in a Twitter feed,
duplication can arise when the same tweet is retweeted or when
some persons are quite active in tweeting while others lurk and
tweet much less frequently. Whether such duplications should be
regarded as “errors” depends upon the goals of the analysis.
For example, if inference is to be made to a population of per-
sons, persons who tweet multiple times on a topic would be over-
represented. If inference is to be made to the population of tweets,
including retweets, then such duplication does not bias inference.
When it is a problem, it still may not be possible to identify
duplications in the data. Failing to account for them could generate
duplication biases in the analysis. If these unwanted duplications
can be identified, they can be removed from the data file (i.e.,
deduplication), or, alternatively, if a certain number of rows, say d,
correspond to the same population unit, those row values can be
weighted by 1/d to correct the estimates for the duplications.
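Both remedies can be sketched in a few lines of pandas; the data frame and its person_id and y columns below are hypothetical stand-ins for a file in which some population units appear on multiple rows.

```python
# Two ways to handle duplicated population units, sketched with pandas.
# The data frame and its person_id / y columns are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 3, 3],   # person 1 appears 3 times, person 3 twice
    "y":         [10, 12, 11, 20, 30, 28],
})

# Option 1: deduplicate, keeping one row per population unit.
deduped = df.drop_duplicates(subset="person_id", keep="first")

# Option 2: keep all rows but weight each by 1/d, where d is the number of
# rows for that unit, so each unit contributes equally to estimates.
df["d"] = df.groupby("person_id")["person_id"].transform("size")
df["weight"] = 1 / df["d"]

weighted_mean = (df["y"] * df["weight"]).sum() / df["weight"].sum()
print(deduped)
print(f"Weighted mean of y: {weighted_mean:.2f}")
```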
Erroneous inclusions can also create biases. For example, Google
searches or tweets may not be generated by a person but rather by a
computer either maliciously or as part of an information-gathering
or publicity-generating routine. Likewise, some rows may not sat-
isfy the criteria for inclusion in an analysis—for example, an analy-
sis by age or gender includes some row elements not satisfying the
criteria. If the criteria can be applied accurately, the rows violating
the criteria can be excluded prior to analysis. However, with big
data, some out-of-scope elements may still be included as a result
of missing or erroneous information, and these inclusions will bias
inference.
Column error The most common type of column error in survey
data analysis is caused by inaccurate or erroneous labeling of the
column data—an example of metadata error. In the TSE framework,
this is referred to as a specification error. For example, a business
register may include a column labeled “number of employees,” de-
fined as the number of persons in the company who received a
payroll check in the preceding month. Instead, the column contains
the number of persons on the payroll whether or not they received
a check in the prior month, thus including, for example, persons on
leave without pay.
For big data analysis, such errors would seem to be quite com-
mon because of the complexities involved in producing a data set.
For example, data generated from a source, such as an individual
tweet, may undergo a number of transformations before it is in-
cluded in the analysis data set. This transformative process can
be quite complex, involving parsing phrases, identifying words, and
classifying them as to subject matter and then perhaps further clas-
sifying them as either positive or negative expressions about some
phenomenon like the economy or a political figure. There is con-
siderable risk of the resulting variables being either inaccurately
defined or misinterpreted by the data analyst.
Example: Specification error with Twitter data
As an example, consider a Twitter data set where the rows correspond to tweets
and one of the columns supposedly contains an indicator of whether the tweet
contained one of the following key words: marijuana, pot, cannabis, weed, hemp,
ganja, or THC. Instead, the indicator actually corresponds to whether the tweet
contained a shorter list of words; say, either marijuana or pot. The mislabeled
column is an example of specification error which could be a biasing factor in an
analysis. For example, estimates of marijuana use based upon the indicator could
be underestimates.
Cell errors Finally, cell errors can be of three types: content error,
specification error, or missing data. A content error occurs when
the value in a cell satisfies the column definition but still deviates
from the true value, whether or not the true value is known. For
example, the value satisfies the definition of “number of employees”
but is outdated because it does not agree with the current number of
employees. Errors in sensitive data such as drug use, prior arrests,
and sexual misconduct may be deliberate. Thus, content errors may
be the result of the measurement process, a transcription error, a
data processing error (e.g., keying, coding, editing), an imputation
error, or some other cause.
Specification error is just as described for column error but ap-
plied to a cell. For example, the column is correctly defined and
labeled; however, a few companies provided values that, although
otherwise highly accurate, were nevertheless inconsistent with the
required definition. Missing data, as the name implies, are just
empty cells that should be filled. As described in Kreuter and
Peng [215], data sets derived from big data are notoriously affected
by all three types of cell error, particularly missing or incomplete
data, perhaps because that is the most obvious deficiency.
The traditional TSE framework is quite general in that it can be
applied to essentially any data set that conforms to the format in Fig-
ure 10.1. However, in most practical situations it is quite limited
because it makes no attempt to describe how the processes that
generated the data may have contributed to what could be con-
strued as data errors. In some cases, these processes constitute
a “black box,” and the best approach is to attempt to evaluate the
quality of the end product. For survey data, the TSE framework pro-
vides a fairly complete description of the error-generating processes
for survey data and survey frames [33]. In addition, there has been
some effort to describe these processes for population registers and
administrative data [391]. But at this writing, little effort has been
devoted to enumerating the error sources and the error generating
processes for big data.
As previously noted, missing data can take two forms: missing
information in a cell of a data matrix (referred to as item missing-
ness) or missing rows (referred to as unit missingness), with the for-
mer being readily observable whereas the latter can be completely
hidden from the analyst. Much is known from the survey research
literature about how both types of missingness affect data analysis
(see, for example, Little and Rubin [240, 320]). Rubin [320] intro-
duced the term missing completely at random (MCAR) to describe
data where the data that are available (say, the rows of a data set)
can be considered as a simple random sample of the inferential
population (i.e., the population to which inferences from the data
analysis will be made). Since the data set represents the population,
MCAR data provide results that are generalizable to this population.
A second possibility also exists for the reasons why data are
missing. For example, students with high absenteeism may be
missing test scores because they were ill on the day of the test. They
may otherwise be average performers, so in this case the missingness
has little to do with how they would have scored. Thus, the values are missing
for reasons related to another variable, health, that may be avail-
able in the data set and completely observed. Students with poor
health tend to be missing test scores, regardless of those students’
performance on the test. Rubin [320] uses the term missing at ran-
dom (MAR) to describe data that are missing for reasons related
to completely observed variables in the data set. It is possible to
compensate for this type of missingness in statistical inferences by
modeling the missing data mechanism.
However, most often, missing data may be related to factors that
are not represented in the data set and, thus, the missing data
mechanism cannot be adequately modeled. For example, there may
be a tendency for test scores to be missing from school adminis-
trative data files for students who are poor academic performers.
Rubin calls this form of missingness nonignorable. With nonignor-
able missing data, the reasons for the missing observations depend
on the values that are missing. When we suspect a nonignorable
missing data mechanism, we need to use procedures much more
complex than will be described here. Little and Rubin [240] and
Schafer [326] discuss methods that can be used for nonignorable
missing data. Ruling out a nonignorable response mechanism can
simplify the analysis considerably.
In practice, it is quite difficult to obtain empirical evidence about
whether or not the data are MCAR or MAR. Understanding the data
generation process is invaluable for specifying models that appro-
priately represent the missing data mechanism and that will then
be successful in compensating for missing data in an analysis.
(Schafer and Graham [327] provide a more thorough discussion
of this issue.)
One strategy for ensuring that the missing data mechanism can
be successfully modeled is to have available on the data set many
variables that may be causally related to missing data. For example,
features such as personal income are subject to high item missing-
ness, and often the missingness is related to income. However, less
sensitive, surrogate variables such as years of education or type
of employment may be less subject to missingness. The statisti-
cal relationship between income and other income-related variables
increases the chance that information lost in missing variables is
supplemented by other completely observed variables. Model-based
methods use the multivariate relationship between variables to han-
dle the missing data. Thus, the more informative the data set, the
more measures we have on important constructs, the more suc-
cessfully we can compensate for missing data using model-based
approaches.
10.2.2 Extending the framework to big data
The processes involved in generating big data are as varied as big
data itself. This diversity substantially complicates the goal of pro-
viding a detailed error framework that is applicable to all big data.
Nevertheless, some progress can be made by considering the steps
that are typically involved in creating a data set from big data, which
often includes three stages: (a) data generation (see Chapter 2), (b)
extract, transform and load (ETL), and (c) analysis (see, for exam-
ple, Chapter 7). As noted there, this mapping of the process is
oversimplified for some applications. For example, data that flow
continuously from their sources may not be directed through an
ETL process at all. Rather, they may be gathered in real time and
processed on an ad hoc basis. Likewise, the ETL processes may
Figure 10.2. Big data process map. Data from multiple sources (Source 1, Source 2, Source 3) flow through three stages: generation; ETL (extract, transform/cleanse, load/store); and analysis (filter/reduction or sampling, computation/analysis, and visualization).
be recursive in that, as errors are identified, the processes may be
altered and refined. In addition, the transform stage may gener-
ate new data (e.g., proxy variables). The analysis stage may also
involve language transformations such as translation and charac-
ter set transformations. Finally, there may be downstream processes
that lead to additional iterations of the stages in the process map.
Thus, the model outlined here is but an initial attempt to capture the
tremendous complexities that may be involved in the big data life
cycle.
Figure 10.2 graphically depicts the flow of data along the major
steps in the process. One might imagine many more arrows among
the boxes within the ETL and analysis stages that represent the
recursive nature of those stages. The severity of the errors that
arise from these processes will depend on the specific data sources
and analytic goals involved. Nevertheless, we can still consider how
each stage might create errors in a more generic fashion.
For example, data generation error is somewhat analogous to
errors arising in survey data collection. Like surveys, the data-
generating process for big data is subject to erroneous, incomplete,
and missing data. In addition, the data-generating sources may be
selective in that the data collected may not represent a well-defined
population or one that is representative of a target population of
inference in an analysis. Thus, data generation errors include low
signal-to-noise ratio, lost signals, incomplete or missing values, sys-
tematic errors, selective elements, and metadata that are lacking,
absent, or erroneous.
ETL processes may be quite similar to various data processing
stages for surveys. These may include creating or enhancing meta-
data, record matching, variable coding, editing, data munging (or
scrubbing), and data integration (i.e., linking and merging records
and files across disparate systems). ETL errors include specifica-
tion errors (including errors in metadata), matching errors, coding
errors, editing errors, data munging errors, and data integration
errors.
Finally, the analysis of big data introduces risks for a number of
errors that will be described in more detail in the next section. These
risks may be due to noise accumulation, coincidental correlations,
and incidental endogeneity. As we shall see, these errors arise as a
consequence of big data volume and variety even when the big data
itself is infallible. Erroneous data compound these problems and
can lead to other issues that are described in some detail below.
For example, big data can be subject to sampling errors when the
data are filtered, sampled, or otherwise reduced to form more man-
ageable or representative data sets. So-called nonsampling errors
can arise when the analysis involves further transforming the data
and weighting the data elements; there can also be errors due to
modeling and estimation. The latter errors are similar to modeling
and estimation errors in surveys and may include inadequate or er-
roneous adjustments for representativeness, improper or erroneous
weighting, computation and algorithmic errors, model misspecifica-
tion, and so on.
10.3 Illustrations of errors in big data
A well-known example of the risks of big data error is provided
by the Google Flu Trends series (see the discussion in Section 1.3), which
uses Google searches on flu symptoms, remedies, and other related key words to provide near-
real-time estimates of flu activity in the USA and 24 other countries.
Compared to CDC data, the Google Flu Trends provided remarkably
accurate indicators of flu incidence in the USA between 2009 and
2011. However, for the 2012–2013 flu seasons, Google Flu Trends
predicted more than double the proportion of doctor visits for flu-
like symptoms compared to the CDC [61]. Lazer et al. [230] cite two
causes of this error: big data hubris and algorithm dynamics.
Hubris occurs when the big data researcher believes that the
volume of the data compensates for any of its deficiencies, thus
obviating the need for traditional, scientific analytic approaches.
As Lazer et al. [230] note, big data hubris fails to recognize that
“quantity of data does not mean that one can ignore foundational
issues of measurement and construct validity and reliability.”
Algorithm dynamics refers to properties of algorithms that al-
low them to adapt and “learn” as the processes generating the data
change over time. Although explanations vary, the fact remains
that Google Flu Trends was too high and by considerable margins
for 100 out of 108 weeks starting in July 2012. Lazer et al. [230] also
blame “blue team dynamics,” which arises when the data generating
engine is modified in such a way that the formerly highly predictive
search terms eventually failed to work. For example, when a Google
user searched on “fever” or “cough,” Google’s other programs started
recommending searches for flu symptoms and treatments—the very
search terms the algorithm used to predict flu. Thus, flu-related
searches artificially spiked as a result of these changes to the al-
gorithm and the impact these changes had on user behavior. In
survey research, this is similar to the measurement biases induced
by interviewers who suggest to respondents who are coughing that
they might have flu, then ask the same respondents if they think
they might have flu.
Algorithm dynamic issues are not limited to Google. Platforms
such as Twitter and Facebook are also frequently modified to im-
prove the user experience. A key lesson provided by Google Flu
Trends is that successful analyses using big data today may fail to
produce good results tomorrow. All these platforms change their
methodologies more or less frequently, with ambiguous results for
any kind of long-term study unless highly nuanced methods are
routinely used. Recommendation engines often exacerbate effects
in a certain direction, but these effects are hard to tease out. Fur-
thermore, other sources of error may affect Google Flu Trends to
an unknown extent. For example, selectivity may be an impor-
tant issue because the demographics of people with Internet access
are quite different from the demographic characteristics related to
flu incidence [375]. Thus, the “at risk” population for influenza
and the implied population based on Google searches do not cor-
respond. This illustrates just one type of representativeness issue
that often plagues big data analysis. A more general issue is that
the accuracy of algorithms is rarely assessed publicly, since the
algorithms are often proprietary. Google Flu Trends is special in that it
failed publicly. From what we have seen, most models fail privately,
often without anyone noticing at all.
In the next section, we consider the impact of errors on some
forms of analysis that are common in the big data literature. Due to
space limitations, our focus is limited to the effects of content errors
on data analysis. However, there are numerous resources available
for studying and mitigating the effects of missing data on analysis;
in particular, the books by Little and Rubin [240], Schafer [326],
and Allison [10] are quite relevant to big data analytics.
10.4 Errors in big data analytics
In this section, we consider some common types of errors in big
data analytics. The next section considers analytic errors caused
by the so-called “three Vs” of big data, namely, volume, velocity,
and variety, under the assumption of perfect “veracity.” If we relax
the assumption of perfect veracity, the errors that afflict big data as
a result of the three Vs can be exacerbated in largely unpredictable
ways. Section 10.4.2 considers three common types of analysis
when data lack veracity: classification, correlation, and regression.
While the scope of our review is necessarily limited, hopefully the
discussion will lay the foundation for further study regarding the
effects of errors on big data analytics.
10.4.1 Errors resulting from volume, velocity, and variety,
assuming perfect veracity
Data deficiencies represent only one set of challenges for the big data
analyst. Other challenges arise solely as a result of the massive size,
rapid generation, and vast dimensionality of the data. As a conse-
quence of these so-called three Vs of big data, Fan et al. [111] iden-
tify three issues—noise accumulation, spurious correlations, and
incidental endogeneity—which will be briefly discussed in this sec-
tion. These issues should concern big data analysts even if the data
could be regarded as infallible. Content errors, missing data, and
other data deficiencies will only exacerbate these problems.
Example: Noise accumulation
To illustrate noise accumulation, Fan et al. [111] consider the following scenario.
Suppose an analyst is interested in classifying individuals into two categories, C1
and C2, based upon the values of 1,000 variables in a big data set. Suppose further
that, unknown to the researcher, the mean value for persons in C1 is 0 on all 1,000
variables, while persons in C2 have a mean of 3 on the first 10 variables and 0 on all
other variables. Since we are assuming the data are error-free, a classification rule
based upon the first m ≤ 10 variables performs quite well, with little classification
error. However, as more and more variables are included in the rule, classification
error increases because the uninformative variables (i.e., the 990 variables having
no discriminating power) eventually overwhelm the informative signals (i.e., the
first 10 variables). In the Fan et al. [111] example, when m > 200, the accumulated
noise exceeds the signal embedded in the first 10 variables and the classification
rule becomes equivalent to a coin-flip classification rule.

Figure 10.3. An illustration of coincidental correlation between two variables: stork die-off linked to human birth decline [346]. (The plot shows pairs of breeding storks and millions of newborn babies by year, 1965–1980.)
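A rough simulation in the spirit of this classification scenario is sketched below. To make the effect visible with a simple nearest-centroid rule and a small training sample, the class-mean difference is set to 1 rather than 3; these settings are our own, so the exact error rates and crossover point will differ from the figures quoted above, but the qualitative pattern (error rising as uninformative variables are added) is the same.

```python
# Simulation in the spirit of the noise-accumulation example above (our own
# illustrative settings: class-mean difference of 1 and a small training set,
# so the effect shows up with a simple nearest-centroid rule).
import numpy as np

rng = np.random.default_rng(42)
p = 1000                       # total number of variables
n_train, n_test = 25, 2000     # observations per class

mu2 = np.zeros(p)
mu2[:10] = 1.0                 # only the first 10 variables are informative

def draw(n):
    """Return features and labels for n observations per class."""
    X1 = rng.normal(size=(n, p))          # class C1: mean 0 everywhere
    X2 = rng.normal(size=(n, p)) + mu2    # class C2: mean shift on 10 variables
    return np.vstack([X1, X2]), np.repeat([0, 1], n)

X_tr, y_tr = draw(n_train)
X_te, y_te = draw(n_test)

for m in [5, 10, 50, 200, 500, 1000]:
    c1 = X_tr[y_tr == 0, :m].mean(axis=0)   # estimated centroid of C1
    c2 = X_tr[y_tr == 1, :m].mean(axis=0)   # estimated centroid of C2
    d1 = ((X_te[:, :m] - c1) ** 2).sum(axis=1)
    d2 = ((X_te[:, :m] - c2) ** 2).sum(axis=1)
    error = ((d2 < d1).astype(int) != y_te).mean()
    # Error is lowest near m = 10 and rises as noise variables accumulate.
    print(f"m = {m:4d}   classification error = {error:.3f}")
```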
High dimensionality can also introduce coincidental (or spuri-
ous) correlations in that many unrelated variables may be highly
correlated simply by chance, resulting in false discoveries and er-
roneous inferences. The phenomenon depicted in Figure 10.3 is an
illustration of this. Many more examples can be found on a web-
site and in a book devoted to the topic [388, 389]. Fan et al. [111]
explain this phenomenon using simulated populations and rela-
tively small sample sizes. They illustrate how, with 800 indepen-
dent (i.e., uncorrelated) variables, the analyst has a 50% chance of
observing an absolute correlation that exceeds 0.4. Their results
suggest that there are considerable risks of false inference associ-
ated with a purely empirical approach to predictive analytics using
high-dimensional data.
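The phenomenon of coincidental correlation is easy to reproduce by simulation. The sketch below draws a modest sample on one target variable and 800 unrelated candidate variables (the sample size of 60 is our own illustrative choice, not necessarily the one used by Fan et al.) and reports the largest absolute sample correlation, which is typically far from zero even though every true correlation is exactly zero.

```python
# Spurious correlation among independent variables (illustrative settings:
# n = 60 observations on 800 candidate predictors plus one target, all drawn
# independently, so every true correlation is exactly zero).
import numpy as np

rng = np.random.default_rng(7)
n, p = 60, 800

y = rng.normal(size=n)           # target variable
X = rng.normal(size=(n, p))      # 800 unrelated candidate variables

# Pearson correlation of each column of X with y.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())

best = np.argmax(np.abs(r))
print(f"Largest absolute sample correlation: {abs(r[best]):.2f} (variable {best})")
print(f"Number of variables with |r| > 0.3: {(np.abs(r) > 0.3).sum()}")
```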
Finally, turning to incidental endogeneity, a key assumption in
regression analysis is that the model covariates are uncorrelated
with the residual error; endogeneity refers to a violation of this
assumption. For high-dimensional models, this can occur purely
by chance—a phenomenon Fan and Liao [113] call incidental endo-
geneity. Incidental endogeneity leads to the modeling of spurious
variation in the outcome variables resulting in errors in the model
selection process and biases in the model predictions. The risks
of incidental endogeneity increase as the number of variables in
the model selection process grows large. Thus it is a particularly
important concern for big data analytics.
Fan et al. [111] as well as a number of other authors [114, 361]
(see, for example, Hall and Miller [151] and Fan and Liao [112]) suggest
robust statistical methods aimed at mitigating the risks of noise
accumulation, spurious correlations, and incidental endogeneity.
However, as previously noted, these issues and others are further
compounded when data errors are present in a data set. Biemer and
Trewin [37] show that data errors will bias the results of traditional
data analysis and inflate the variance of estimates in ways that are
difficult to evaluate or mitigate in the analysis process.
10.4.2 Errors resulting from lack of veracity
The previous section examined some of the issues big data analysts
face as either N or p in Figure 10.1 becomes extremely large. (The
volume, velocity, and variety of big data make it challenging to analyze
because of noise accumulation, spurious correlations, and incidental
endogeneity. But the three Vs are also the scourge of veracity—i.e.,
errors in big data.) When row, column, and cell errors are added into
the mix, these volume-related problems can be further exacerbated. For
example, noise accumulation can be expected to accelerate when random
noise (i.e., content errors) afflicts the data.
incidental endogeneity and coincidental correlations can render cor-
relation analysis meaningless if the error levels in big data are not
mitigated. In this section, we consider some of the issues that arise
in classification, correlation, and regression analysis as a result of
content errors that may be either variable or systematic. The cur-
rent literature on big data error acknowledges that the data may be
noisy, that is, subject to variable errors. However, there appears to
be little recognition of the problems associated with systematic or
correlated errors, particularly those introduced when data are com-
bined from disparate sources that may be subject to source-specific
errors.
There are various important findings in this section. First, for
rare classes, even small levels of error can impart considerable bi-
ases in classification analysis. Second, variable errors will atten-
uate correlations and regression slope coefficients; however, these
effects can be mitigated by forming meaningful aggregates of the
data and substituting these aggregates for the individual units in
these analyses. Third, unlike random noise, systematic errors can
bias correlation and regression analysis in unpredictable ways, and
these biases cannot be effectively mitigated by aggregating the data.
Finally, multilevel modeling can be an important mitigation strat-
egy for dealing with systematic errors emanating from multiple data
sources. These issues will be examined in some detail in the remain-
der of this section.
10.4.2.1 Variable and correlated error
Error models are essential for understanding the effects of error on
data sets and the estimates that may be derived from them. They
allow us to concisely and precisely communicate the nature of the
errors that are being considered, the general conditions that give
rise to them, how they affect the data, how they may affect the
analysis of these data, and how their effects can be evaluated and
mitigated. In the remainder of this chapter, we focus primarily on
content errors and consider two types of error, variable errors and
correlated errors, the latter a subcategory of systematic errors.
Variable errors are sometimes referred to as random noise or
uncorrelated errors. For example, administrative databases often
contain errors from a myriad of random causes, including mistakes
in keying or other forms of data capture, errors on the part of the
persons providing the data due to confusion about the information
requested, difficulties in recalling information, the vagaries of the
terms used to request the inputs, and other system deficiencies.
Correlated errors, on the other hand, carry a systematic effect
that results in a nonzero covariance between the errors of two dis-
tinct units. For example, quite often, an analysis data set may
combine multiple data sets from different sources and each source
may impart errors that follow a somewhat different distribution. As
we shall see, these differences in error distributions can induce cor-
related errors in the merged data set. It is also possible that corre-
lated errors are induced from a single source as a result of different
operators (e.g., computer programmers, data collection personnel,
data editors, coders, data capture mechanisms) handling the data.
Differences in the way these operators perform their tasks have the
potential to alter the error distributions so that data elements han-
dled by the same operator have errors that are correlated [35].
These concepts may be best expressed by a simple error model.
Let $y_{rc}$ denote the cell value for variable $c$ on the $r$th unit in the data
set, and let $\varepsilon_{rc}$ denote the error associated with this value. Suppose
it can be assumed that there is a true value underlying $y_{rc}$, which
is denoted by $\mu_{rc}$. Then we can write
$$y_{rc} = \mu_{rc} + \varepsilon_{rc}. \qquad (10.1)$$
At this point, εrc is not stochastic in nature because a statistical
process for generating the data has not yet been assumed. There-
fore, it is not clear what correlated error really means. To remedy
this problem, we can consider the hypothetical situation where the
processes generating the data set can be repeated under the same
general conditions (i.e., at the same point in time with the same
external and internal factors operating). Each time the processes
are repeated, a different set of errors may be realized. Thus, it is
assumed that although the true values, µrc , are fixed, the errors,
εrc, can vary across the hypothetical, infinite repetitions of the data
set generating process. Let E(·)denote the expected value over all
these hypothetical repetitions, and define the variance, Var(·), and
covariance, Cov(·), analogously.
For the present, error correlations between variables are not considered,
and thus the subscript c is dropped to simplify the notation.
For the uncorrelated data model, we assume that $E(y_r \mid r) = \mu_r$,
$\mathrm{Var}(y_r \mid r) = \sigma^2_\varepsilon$, and $\mathrm{Cov}(y_r, y_s \mid r, s) = 0$ for $r \neq s$. For the correlated
data model, the latter assumption is relaxed. To add a bit
more structure to the model, suppose the data set is the product
of combining data from multiple sources (or operators) denoted by
$j = 1, 2, \ldots, J$, and let $b_j$ denote the systematic effect of the $j$th
source. Here we also assume that, with each hypothetical repetition
of the data set generating process, these systematic effects can
vary stochastically. (It is also possible to assume the systematic
effects are fixed. See, for example, Biemer and Stokes [36] for more
details on this model.) Thus, we assume that $E(b_j) = 0$, $\mathrm{Var}(b_j) = \sigma^2_b$,
and $\mathrm{Cov}(b_j, b_k) = 0$ for $j \neq k$.
Finally, for the $r$th unit within the $j$th source, let $\varepsilon_{rj} = b_j + e_{rj}$.
Then it follows that
$$\mathrm{Cov}(\varepsilon_{rj}, \varepsilon_{sk}) = \begin{cases} \sigma^2_b + \sigma^2_\varepsilon & \text{for } r = s,\ j = k, \\ \sigma^2_b & \text{for } r \neq s,\ j = k, \\ 0 & \text{for } j \neq k. \end{cases}$$
The case where $\sigma^2_b = 0$ corresponds to the uncorrelated error model
(i.e., $b_j = 0$), and thus $\varepsilon_{rj}$ is purely random noise.
Example: Speed sensor
Suppose that, due to calibration error, the jth speed sensor in a traffic pattern
study underestimates the speed of vehicle traffic on a highway by an average of 4
miles per hour. Thus, the model for this sensor is that the speed for the rth vehicle
recorded by this sensor ($y_{rj}$) is the vehicle’s true speed ($\mu_{rj}$) minus 4 mph ($b_j$) plus
a random departure from that offset for the rth vehicle ($e_{rj}$). Note that, to the extent that
$b_j$ varies across the sensors $j = 1, \ldots, J$ in the study, $\sigma^2_b$ will be large. Further, to the
extent that ambient noise in the readings for the jth sensor causes variation around
the values $\mu_{rj} + b_j$, $\sigma^2_\varepsilon$ will be large. Both sources of variation will reduce
the reliability of the measurements. However, as shown in Section 10.4.2.4, the
systematic error component is particularly problematic for many types of analysis.
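A brief simulation of this error model may help fix ideas. In the sketch below, every numeric setting is hypothetical; each sensor j has its own systematic offset b_j plus random reading noise, and the empirical variance components approximate σ_b², σ_ε², and the within-sensor error correlation σ_b²/(σ_b² + σ_ε²).

```python
# Simulating the error model eps_rj = b_j + e_rj for J speed sensors.
# All parameter values are hypothetical illustrations.
import numpy as np

rng = np.random.default_rng(1)
J, n_per_sensor = 50, 200          # sensors and vehicles per sensor
sigma_b, sigma_e = 4.0, 2.0        # sd of systematic offsets and of random noise

b = rng.normal(0.0, sigma_b, size=J)              # per-sensor calibration offsets
true_speed = rng.normal(65.0, 8.0, size=(J, n_per_sensor))
noise = rng.normal(0.0, sigma_e, size=(J, n_per_sensor))
recorded = true_speed + b[:, None] + noise        # y_rj = mu_rj + b_j + e_rj

errors = recorded - true_speed                    # eps_rj = b_j + e_rj

# Empirical variance components.
within_var = errors.var(axis=1, ddof=1).mean()    # estimates sigma_e^2
between_var = errors.mean(axis=1).var(ddof=1)     # roughly sigma_b^2 (+ sigma_e^2/n)
intra_source_corr = between_var / (between_var + within_var)

print(f"within-sensor error variance  ~ {within_var:.1f}  (sigma_e^2 = {sigma_e**2})")
print(f"between-sensor error variance ~ {between_var:.1f}  (sigma_b^2 = {sigma_b**2})")
print(f"intra-source correlation of errors ~ {intra_source_corr:.2f} "
      f"(theory: {sigma_b**2 / (sigma_b**2 + sigma_e**2):.2f})")
```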
10.4.2.2 Models for categorical data
For variables that are categorical, the model of the previous section
is not appropriate because the assumptions it makes about the
error structure do not hold. For example, consider the case of a
binary (0/1) variable. Since both $y_r$ and $\mu_r$ should be either 1 or 0,
the error in equation (10.1) must assume the values of −1, 0, or 1.
A more appropriate model is the misclassification model described
by Biemer [34], which we summarize here.
Let $\phi_r$ denote the probability of a false positive error (i.e., $\phi_r = \Pr(y_r = 1 \mid \mu_r = 0)$), and let $\theta_r$ denote the probability of a false negative
error (i.e., $\theta_r = \Pr(y_r = 0 \mid \mu_r = 1)$). Thus, the probability that the
value for row $r$ is correct is $1 - \theta_r$ if the true value is 1, and $1 - \phi_r$ if
the true value is 0.
As an example, suppose an analyst wishes to compute the proportion, $P = \sum_r y_r / N$, of the units in the file that are classified as
1, and let $\pi = \sum_r \mu_r / N$ denote the true proportion. Then, under the
assumption of uncorrelated error, Biemer [34] shows that
$$P = \pi(1 - \theta) + (1 - \pi)\phi,$$
where $\theta = \sum_r \theta_r / N$ and $\phi = \sum_r \phi_r / N$.
In the classification error literature, the sensitivity of a classifier is defined as $1 - \theta$, that is, the probability that a true positive
is correctly classified. Correspondingly, $1 - \phi$ is referred to as the
specificity of the classifier, that is, the probability that a true negative is correctly classified. Two other quantities that will be useful in
our study of misclassification error are the positive predictive value
(PPV) and negative predictive value (NPV), given by
$$\text{PPV} = \Pr(\mu_r = 1 \mid y_r = 1), \qquad \text{NPV} = \Pr(\mu_r = 0 \mid y_r = 0).$$
The PPV (NPV) is the probability that a positive (negative) classification is correct.
10.4.2.3 Misclassification and rare classes
Fan et al. [111] and many others have stated that one of the strengths
of big data is the ability to study rare population groups that sel-
dom show up in large enough numbers in designed studies such as
surveys and clinical trials. While this is true in theory, in practice
content errors can quickly overwhelm such an analysis, rendering
the data useless for this purpose. We illustrate this using the fol-
lowing contrived and somewhat amusing example. The results in
this section are particularly relevant to the approaches considered
in Chapter 6.
Example: Thinking about probabilities
Suppose, using big data and other resources, we construct a terrorist detector and
boast that the detector is 99.9% accurate. In other words, both the probability of a
false negative (i.e., classifying a terrorist as a nonterrorist, θ) and the probability of
a false positive (i.e., classifying a nonterrorist as a terrorist, φ) are 0.001. Assume
that about 1 person in a million in the population is a terrorist, that is, π=
0.000001 (hopefully, somewhat of an overestimate). Your friend, Terry, steps into
the machine and, to Terry’s chagrin (and your surprise) the detector declares that
he is a terrorist! What are the odds that the machine is right? The surprising
answer is only about 1 in 1000. That is, 999 times out of 1,000 times the machine
classifies a person as a terrorist, the machine will be wrong!
How could such an accurate machine be wrong so often in the
terrorism example? Let us do the math.
The relevant probability is the PPV of the machine: given that
the machine classifies an individual (Terry) as a terrorist, what is
the probability the individual is truly a terrorist? Using the notation
in Section 10.4.2.2 and Bayes’ rule, we can derive the PPV as
$$\Pr(\mu_r = 1 \mid y_r = 1) = \frac{\Pr(y_r = 1 \mid \mu_r = 1)\Pr(\mu_r = 1)}{\Pr(y_r = 1)} = \frac{(1 - \theta)\pi}{\pi(1 - \theta) + (1 - \pi)\phi} = \frac{0.999 \times 0.000001}{0.000001 \times 0.999 + 0.999999 \times 0.001} \approx 0.001.$$
This example calls into question whether security surveillance
using emails, phone calls, etc. can ever be successful in finding
rare threats such as terrorism, since achieving a reasonably high
PPV (say, 90%) would require a sensitivity and specificity of at least
$1 - 10^{-7}$, or less than 1 chance in 10 million of an error.

Table 10.1. Positive predictive value (%) for rare subgroups, high specificity, and perfect sensitivity

  π_k       Specificity = 99%    99.9%    99.99%
  0.1              91.70         99.10    99.90
  0.01             50.30         91.00    99.00
  0.001             9.10         50.00    90.90
  0.0001            1.00          9.10    50.00
To generalize this approach, note that any population can be
regarded as a mixture of subpopulations. Mathematically, this can
be written as
$$f(y \mid x; \theta) = \pi_1 f(y \mid x; \theta_1) + \pi_2 f(y \mid x; \theta_2) + \cdots + \pi_K f(y \mid x; \theta_K),$$
where $f(y \mid x; \theta)$ denotes the population distribution of $y$ given the
vector of explanatory variables $x$ and the parameter vector $\theta = (\theta_1, \theta_2, \ldots, \theta_K)$, $\pi_k$ is the proportion of the population in the $k$th
subgroup, and $f(y \mid x; \theta_k)$ is the distribution of $y$ in the $k$th subgroup.
A rare subgroup is one where $\pi_k$ is quite small (say, less than 0.01).
Table 10.1 shows the PPV for a range of rare subgroup sizes when
the sensitivity is perfect (i.e., no misclassification of true positives)
and specificity is not perfect but still high. This table reveals the
fallacy of identifying rare population subgroups using fallible clas-
sifiers unless the accuracy of the classifier is appropriately matched
to the rarity of the subgroup. As an example, for a 0.1% subgroup,
the specificity should be at least 99.99%, even with perfect sensi-
tivity, to attain a 90% PPV.
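The PPV values in Table 10.1, and the terrorist-detector result above, can be checked with a few lines of code; the function below is simply the Bayes-rule expression for PPV written out (results agree with the table up to rounding).

```python
# PPV = (1 - theta) * pi / ((1 - theta) * pi + (1 - pi) * phi),
# where theta = 1 - sensitivity and phi = 1 - specificity.

def ppv(prevalence, sensitivity, specificity):
    """Positive predictive value of a classifier via Bayes' rule."""
    phi = 1.0 - specificity
    true_pos = sensitivity * prevalence
    false_pos = phi * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Terrorist-detector example: 99.9% sensitivity and specificity, pi = 1e-6.
print(f"{ppv(1e-6, 0.999, 0.999):.4f}")      # ~0.001

# Reproducing Table 10.1 (perfect sensitivity, high specificity).
for pi_k in [0.1, 0.01, 0.001, 0.0001]:
    row = [100 * ppv(pi_k, 1.0, s) for s in (0.99, 0.999, 0.9999)]
    print(f"pi_k = {pi_k:<7} PPV(%) = " + "  ".join(f"{v:6.2f}" for v in row))
```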
10.4.2.4 Correlation analysis
In Section 10.4.1, we considered the problem of incidental corre-
lation that occurs when an analyst correlates pairs of variables
selected from big data stores containing thousands of variables.
In this section, we discuss how errors in the data can exacerbate
this problem or even lead to failure to recognize strong associations
among the variables. We confine the discussion to the continuous
variable model of Section 10.4.2.1 and begin with theoretical re-
sults that help explain what happens in correlation analysis when
the data are subject to variable and systematic errors.
For any two variables in the data set, $c$ and $d$, define the covariance between $y_{rc}$ and $y_{rd}$ as
$$\sigma_{y|cd} = \frac{\sum_r E(y_{rc} - \bar{y}_c)(y_{rd} - \bar{y}_d)}{N},$$
where the expectation is with respect to the error distributions and
the sum extends over all rows in the data set. Let
$$\sigma_{\mu|cd} = \frac{\sum_r (\mu_{rc} - \bar{\mu}_c)(\mu_{rd} - \bar{\mu}_d)}{N}$$
denote the population covariance. (The population is defined as the
set of all units corresponding to the rows of the data set.) For any
variable $c$, define the variance components
$$\sigma^2_{y|c} = \frac{\sum_r (y_{rc} - \bar{y}_c)^2}{N}, \qquad \sigma^2_{\mu|c} = \frac{\sum_r (\mu_{rc} - \bar{\mu}_c)^2}{N},$$
and let
$$R_c = \frac{\sigma^2_{\mu|c}}{\sigma^2_{\mu|c} + \sigma^2_{b|c} + \sigma^2_{\varepsilon|c}}, \qquad \rho_c = \frac{\sigma^2_{b|c}}{\sigma^2_{\mu|c} + \sigma^2_{b|c} + \sigma^2_{\varepsilon|c}},$$
with analogous definitions for $d$. The ratio $R_c$ is known as the reliability ratio, and $\rho_c$ will be referred to as the intra-source correlation.
Note that the reliability ratio is the proportion of total variance that
is due to the variation of true values in the data set. If there were
no errors, either variable or systematic, then this ratio would be 1.
To the extent that errors exist in the data, $R_c$ will be less than 1.
Likewise, $\rho_c$ is also a ratio of variance components, one that reflects
the proportion of total variance that is due to systematic errors
with biases that vary by data source. A value of $\rho_c$ that exceeds
0 indicates the presence of systematic error variation in the data.
As we shall see, even small values of $\rho_c$ can cause big problems in
correlation analysis.
Using the results in Biemer and Trewin [37], it can be shown that
the correlation between $y_{rc}$ and $y_{rd}$, defined as $\rho_{y|cd} = \sigma_{y|cd}/(\sigma_{y|c}\sigma_{y|d})$,
can be expressed as
$$\rho_{y|cd} = \sqrt{R_c R_d}\,\rho_{\mu|cd} + \sqrt{\rho_c \rho_d}. \qquad (10.2)$$
Note that if there are no errors (i.e., when $\sigma^2_{b|c} = \sigma^2_{\varepsilon|c} = 0$), then
$R_c = 1$, $\rho_c = 0$, and the correlation between $y_c$ and $y_d$ is just the
population correlation.
Let us consider the implications of these results first without
systematic errors (i.e., only variable errors) and then with the effects
of systematic errors.
Variable errors only  If the only errors are due to random noise,
then the additive term on the right in equation (10.2) is 0 and
$\rho_{y|cd} = \sqrt{R_c R_d}\,\rho_{\mu|cd}$, which says that the correlation is attenuated
by the product of the root reliability ratios. For example, suppose
$R_c = R_d = 0.8$, which is considered excellent reliability. Then the
observed correlation in the data will be about 80% of the true correlation;
that is, correlation is attenuated by random noise. Thus,
$\sqrt{R_c R_d}$ will be referred to as the attenuation factor for the correlation
between two variables.
Quite often in the analysis of big data, the correlations being
explored are for aggregate measures, as in Figure 10.3. Therefore,
suppose that, rather than being a single element, $y_{rc}$ and $y_{rd}$ are the
means of $n_{rc}$ and $n_{rd}$ independent elements, respectively. For example,
$y_{rc}$ and $y_{rd}$ may be the average rate of inflation and the average
price of oil, respectively, for the $r$th year, for $r = 1, \ldots, N$ years.
Aggregated data are less affected by variable errors because, as we
sum up the values in a data set, the positive and negative values
of the random noise components combine and cancel each other
under our assumption that $E(\varepsilon_{rc}) = 0$. In addition, the variance of
the mean of the errors is of order $O(n_{rc}^{-1})$.
To simplify the result for the purposes of our discussion, suppose
$n_{rc} = n_c$; that is, each aggregate is based upon the same sample size.
It can be shown that equation (10.2) still applies if we replace $R_c$ by
its aggregated-data counterpart, denoted by $R^A_c = \sigma^2_{\mu|c}/(\sigma^2_{\mu|c} + \sigma^2_{\varepsilon|c}/n_c)$.
Note that $R^A_c$ converges to 1 as $n_c$ increases, which means that $\rho_{y|cd}$
will converge to $\rho_{\mu|cd}$. Figure 10.4 illustrates the speed at which this
convergence occurs.
In this figure, we assume $n_c = n_d = n$ and vary $n$ from 0 to
60. We set the reliability ratios for both variables to 0.5 (which
is considered to be a “fair” reliability) and assume a population
correlation of $\rho_{\mu|cd} = 0.5$. For $n$ in the range [2, 10], the attenuation
is pronounced. However, above 10 the correlation is quite close to
the population value. Attenuation is negligible when $n > 30$. These
results suggest that variable error can be mitigated by aggregating
like elements that can be assumed to have independent errors.
Figure 10.4. Correlation as a function of sample size (n) for $\rho_{\mu|cd} = 0.5$ and $R_c = R_d = 0.5$

Both variable and systematic errors  If both systematic and variable
errors contaminate the data, the additive term on the right in
equation (10.2) is positive. For aggregate data, the reliability ratio
takes the form
$$R^A_c = \frac{\sigma^2_{\mu|c}}{\sigma^2_{\mu|c} + \sigma^2_{b|c} + n_c^{-1}\sigma^2_{\varepsilon|c}},$$
which converges not to 1, as in the case of variable error only, but to
$\sigma^2_{\mu|c}/(\sigma^2_{\mu|c} + \sigma^2_{b|c})$, which will be less than 1. Thus, some attenuation
is possible regardless of the number of elements in the aggregate.
In addition, the intra-source correlation takes the form
$$\rho^A_c = \frac{\sigma^2_{b|c}}{\sigma^2_{\mu|c} + \sigma^2_{b|c} + n_c^{-1}\sigma^2_{\varepsilon|c}},$$
which converges to $\sigma^2_{b|c}/(\sigma^2_{\mu|c} + \sigma^2_{b|c})$, that is, to $1 - R^A_c$ in the limit.
Thus, the systematic effects may still operate for correlation
analysis without regard to the number of elements comprising the
aggregates.
For example, consider the illustration in Figure 10.4 with $n_c = n_d = n$,
reliability ratios (excluding systematic effects) set at 0.5, and
population correlation $\rho_{\mu|cd} = 0.5$. In this scenario, let $\rho_c = \rho_d = 0.25$.
Figure 10.5 shows the correlation as a function of the sample
size with systematic errors (dotted line) compared to the correlation
without systematic errors (solid line). Correlation with systematic
errors is both inflated and attenuated. However, at the assumed
level of intra-source variation, the inflation factor overwhelms the
attenuation factors and the result is a much inflated value of the
correlation across all aggregate sizes.

Figure 10.5. Correlation as a function of sample size (n) for $\rho_{\mu|cd} = 0.5$, $R_c = R_d = 0.5$, and $\rho_c = \rho_d = 0.25$
To summarize these findings, correlation analysis is attenuated
by variable errors, which can lead to null findings when conduct-
ing a correlation analysis and the failure to identify associations
that exist in the data. Combined with systematic errors that may
arise when data are extracted and combined from multiple sources,
correlation analysis can be unpredictable because both attenuation
and inflation of correlations can occur. Aggregating data mitigates
the effects of variable error but may have little effect on systematic
errors.
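These effects can be verified by simulation. The sketch below uses invented variance components: true correlation 0.5, unit-level reliability 0.5, and a source-level offset that is shared by both variables (one simple way systematic source effects can induce the cross-variable term in equation (10.2)). Aggregation removes the attenuation caused by variable error but not the distortion caused by the systematic error.

```python
# Correlation attenuation (variable error) and inflation (systematic error),
# simulated with invented variance components: true correlation 0.5,
# reliability 0.5 (sigma_mu^2 = sigma_eps^2 = 1), and a shared source-level
# offset with variance 2/3 (so the intra-source correlation rho is 0.25).
import numpy as np

rng = np.random.default_rng(3)
N, J, n_agg = 20000, 40, 30            # rows, sources, elements per aggregate
sigma_eps, sigma_b = 1.0, np.sqrt(2 / 3)

cov = [[1.0, 0.5], [0.5, 1.0]]
mu = rng.multivariate_normal([0, 0], cov, size=N)   # true values, corr 0.5
source = rng.integers(0, J, size=N)                 # source of each row
b = rng.normal(0.0, sigma_b, size=J)                # shared source offsets

def corr(a, c):
    return np.corrcoef(a, c)[0, 1]

# 1. Unit-level data with variable error only: correlation is attenuated.
y1 = mu + rng.normal(0, sigma_eps, size=(N, 2))
print(f"variable error, no aggregation   : {corr(y1[:, 0], y1[:, 1]):.2f}")  # ~0.25

# 2. Aggregates of n elements, variable error only: attenuation disappears.
y2 = mu + rng.normal(0, sigma_eps / np.sqrt(n_agg), size=(N, 2))
print(f"variable error, aggregated       : {corr(y2[:, 0], y2[:, 1]):.2f}")  # ~0.48

# 3. Aggregates with an added source-level systematic error (same offset on
#    both variables): the correlation is now inflated, and averaging away the
#    random noise does nothing to remove the systematic component.
y3 = y2 + b[source][:, None]
print(f"systematic + variable, aggregated: {corr(y3[:, 0], y3[:, 1]):.2f}")  # ~0.69
```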
10.4.2.5 Regression analysis
The effects of variable errors on regression coefficients are well
known [37, 82, 130]. The effects of systematic errors on regression
have been less studied. We review some results for both types of
errors in this section.
Consider the simple situation where we are interested in com-
puting the population slope and intercept coefficients given by
$$b = \frac{\sum_r (y_r - \bar{y})(x_r - \bar{x})}{\sum_r (x_r - \bar{x})^2} \quad\text{and}\quad b_0 = \bar{y} - b\bar{x},$$
where, as before, the sum extends over all rows in the data set.
When $x$ is subject to variable errors, it can be shown that the observed
regression coefficient will be attenuated from its error-free
counterpart. Let $R_x$ denote the reliability ratio for $x$. Then
$$b = R_x B,$$
where $B = \sum_r (y_r - \bar{y})(\mu_{r|x} - \bar{\mu}_x) / \sum_r (\mu_{r|x} - \bar{\mu}_x)^2$ is the population slope
coefficient, with $x_r = \mu_{r|x} + \varepsilon_{r|x}$, where $\varepsilon_{r|x}$ is the variable error with
mean 0 and variance $\sigma^2_{\varepsilon|x}$. It can also be shown that $\mathrm{Bias}(b_0) \approx B(1 - R_x)\bar{\mu}_x$.

Figure 10.6. Regression of y on x with and without variable error. On the left is the population regression with no error in the x variable ($R_x$ = 1.0, intercept = −0.61, slope = 1.05). On the right, variable error was added to the x-values with a reliability ratio of 0.73 (intercept = −2.51, slope = 0.77). Note the attenuated slope, which is very near the theoretical value of 0.77.
As an illustration of these effects, consider the regressions dis-
played in Figure 10.6, which are based upon contrived data. The
regression on the left is the population (true) regression with a slope
of 1.05 and an intercept of −0.61. The regression on the right uses
the same y- and x-values. The only difference is that normal error
was added to the x-values, resulting in a reliability ratio of 0.73. As
the theory predicted, the slope was attenuated toward 0 in direct
proportion to the reliability, Rx. As random error is added to the
x-values, reliability is reduced and the fitted slope will approach 0.
When only the dependent variable, $y$, is subject to variable error,
the regression deteriorates, but the expected values of the slope
and intercept coefficients are still equal to their population
values. To see this, suppose $y_r = \mu_{r|y} + \varepsilon_{r|y}$, where $\mu_{r|y}$ denotes the
error-free value of $y_r$ and $\varepsilon_{r|y}$ is the associated variable error with
variance $\sigma^2_{\varepsilon|y}$. The regression of $y$ on $x$ can now be rewritten as
$$\mu_{r|y} = b_0 + b x_r + e_r - \varepsilon_{r|y}, \qquad (10.3)$$
where $e_r$ is the usual regression residual error with mean 0 and
variance $\sigma^2_e$, which is assumed to be uncorrelated with $\varepsilon_{r|y}$. Letting
$e'_r = e_r - \varepsilon_{r|y}$, it follows that the regression in equation (10.3) is
equivalent to the previously considered regression of $y$ on $x$ where
$y$ is not subject to error, but now the residual variance is increased
by the additive term, that is, $\sigma^2_{e'} = \sigma^2_{\varepsilon|y} + \sigma^2_e$.
Chai [67] considers the case of systematic errors in the regres-
sion variables that may induce correlations both within and between
variables in the regression. He shows that, in the presence of sys-
tematic errors in the independent variable, the bias in the slope
coefficient may either attenuate the slope or increase its magnitude
in ways that cannot be predicted without extensive knowledge of
the error properties. Thus, as with correlation analysis, systematic errors greatly increase the complexity of the bias, and their effects on inference can be quite severe.
One approach for dealing with systematic error at the source
level in regression analysis is to model it using, for example, random
effects [169]. In brief, a random effects model specifies y_{ijk} = β₀ + β x_{ijk} + ε′_{ijk}, where ε′_{ijk} = b_i + ε_{ijk} and Var(ε′_{ijk}) = σ²_b + σ²_{ε|j}. The next section
considers other mitigation strategies that attempt to eliminate the
error rather than model it.
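As a concrete illustration of this strategy, the sketch below fits a source-level random intercept model with the statsmodels package. The file name and the columns y, x, and source are hypothetical stand-ins for an analysis data set that records which source each row was extracted from.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical analysis file with an outcome y, a predictor x,
# and a 'source' column identifying the originating data source.
df = pd.read_csv("analysis_file.csv")

# The random intercept b_i for each source absorbs source-level
# systematic error; the fixed part estimates the overall intercept and slope.
model = smf.mixedlm("y ~ x", data=df, groups=df["source"])
result = model.fit()
print(result.summary())
```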
10.5 Some methods for mitigating, detecting,
and compensating for errors
For survey data and other designed data collections, error mitiga-
tion begins at the data generation stage by incorporating design
strategies that generate high-quality data that are at least adequate
for the purposes of the data users. (Data errors further complicate analysis and exacerbate the analytical problems. There are essentially three solutions: prevention, remediation, and the choice of analysis methodology.) For example, missing data can be mitigated by repeated follow-up of nonrespondents, question-
naires can be perfected through pretesting and experimentation,
interviewers can be trained in the effective methods for obtaining
highly accurate responses, and computer-assisted interviewing in-
struments can be programmed to correct errors in the data as they
are generated. For big data, the data generation process is often
outside the purview of the data collectors, as noted in Section 10.1,
and there is limited opportunity to address deficiencies in the data
generation process. This is because big data is often a by-product
of systems designed with little or no thought given to the potential
secondary uses of the data. Instead, error mitigation must neces-
sarily begin at the data processing stage, which is the focus of this
chapter, particularly with regard to data editing and cleaning.
In survey collections, data processing may involve data entry,
coding textual responses, creating new variables, editing the data,
imputing missing cells, weighting the observations, and preparing
the file, including the application of disclosure-limiting processes.
For big data, many of these same operations could be needed to
some extent, depending upon requirements of the data, its uses,
and the prevailing statutory requirements. (For example, Title 13 of the US Code is explicit in terms of statutory data protections and, as you will see in the next chapter, can have a substantial impact on the quality of subsequent inference.) Of all the operations
comprising data processing, the area that has the greatest potential
to both alter data quality and consume vast resources is data edit-
ing. For example, Biemer and Lyberg [35] note that national statis-
tical offices may spend 40% or more of their production
budgets on data editing for some surveys. Recent computer tech-
nologies have automated many formerly manual processes, which
has greatly reduced these expenditures. Nevertheless, editing re-
mains a key component of the quality improvement process for big
data. Therefore, the remainder of this chapter will discuss the issue
of editing big data.
Biemer and Lyberg [35] define data editing as a set of method-
ologies for identifying and correcting (or transforming) anomalies
in the data. It often involves verifying that various relationships
among related variables of the data set are plausible and, if they are
not, attempting to make them so. Editing is typically a rule-based
approach where rules can apply to a particular variable, a combi-
nation of variables, or an aggregate value that is the sum over all
the rows or a subset of the rows in a data set.
In small data sets, data editing usually starts as a bottom-up pro-
cess that applies various rules to the cell values—often referred to
as micro-editing. The rules specify that the values for some variable
or combinations of variables (e.g., a ratio or difference between two
values) should be within some specified range. Editing may reveal
impossible values (so-called fatal edits) as well as values that are
simply highly suspect (leading to query edits). For example, a preg-
nant male would generate a fatal edit, while a property that sold
for $500,000 in a low-income neighborhood might generate a query
edit. Once the anomalous values are identified, the process at-
tempts to deduce more accurate values based upon other variables
in the data set or, perhaps, from an external data set.
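A minimal sketch of what such rule-based micro-editing might look like in Python is shown below. The variable names and thresholds are hypothetical, chosen only to mirror the fatal and query edits described above.

```python
import pandas as pd

def apply_edit_rules(df: pd.DataFrame) -> pd.DataFrame:
    """Flag fatal and query edits on a (hypothetical) record file."""
    out = df.copy()
    # Fatal edit: an impossible combination of values.
    out["fatal_pregnant_male"] = (out["sex"] == "male") & out["pregnant"]
    # Query edit: suspicious but not impossible, e.g., a sale price far
    # above what the neighborhood's median income would suggest.
    out["query_price_vs_income"] = out["sale_price"] > 10 * out["median_income"]
    return out

# Contrived example records
records = pd.DataFrame({
    "sex": ["male", "female"],
    "pregnant": [True, False],
    "sale_price": [500_000, 150_000],
    "median_income": [25_000, 60_000],
})
print(apply_edit_rules(records)[["fatal_pregnant_male", "query_price_vs_income"]])
```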
The identification of anomalies can be handled quite efficiently
and automatically using editing software. Recently, data mining
and machine learning techniques have been applied to data edit-
ing with excellent results (see Chandola et al. [69] for a review).
Tree-based methods such as classification and regression trees and
random forests are particularly useful for creating editing rules for
anomaly identification and resolution [302]. However, some human
review may be necessary to resolve the most complex situations.
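One way to operationalize the tree-based approach, sketched below under the assumption that a manually reviewed subset with an edit_failed label is available, is to train a classification tree on that subset and apply the learned rules to the remaining records; the file and feature names are hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical manually reviewed subset: numeric features plus a 0/1
# label indicating whether review found an error in the record.
train = pd.read_csv("reviewed_subset.csv")
features = ["turnover", "employees", "personnel_costs"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(train[features], train["edit_failed"])

# The fitted tree doubles as a set of human-readable editing rules.
print(export_text(tree, feature_names=features))

# Apply the learned rules to the full, unreviewed data set.
full = pd.read_csv("full_data.csv")
full["predicted_edit_failure"] = tree.predict(full[features])
```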
For big data, the identification of data anomalies could result in
possibly billions of edit failures. Even if only a tiny proportion of
these required some form of manual review for resolution, the task
could still require the inspection of tens or hundreds of thousands of
query edits, which would be infeasible for most applications. Thus,
micro-editing must necessarily be a completely automated process
unless it can be confined to a relatively small subset of the data.
As an example, a representative (random) subset of the data set
could be edited manually for purposes of evaluating
the error levels for the larger data set, or possibly to be used as a
training data set, benchmark, or reference distribution for further
processing, including recursive learning.
To complement fully automated micro-editing, big data editing
usually involves top-down or macro-editing approaches. For such
approaches, analysts and systems inspect aggregated data for con-
formance to some benchmark values or data distributions that are
known from either training data or prior experience. When unex-
pected or suspicious aggregates are identified, the analyst can “drill
down” into the data to discover and, if possible, remove the discrepancy by either altering the value at the source (usually a micro-data element) or deleting the edit-failed value.
Note that an aggregate that is not flagged as suspicious in macro-
editing passes the edit and is deemed correct. However, because
serious offsetting errors can be masked by aggregation, there is no
guarantee that the elements comprising the aggregate are indeed
accurate. Recalling the discussion in Section 10.4.2.1 regarding
random noise, we know that the data may be quite noisy while
the aggregates may appear to be essentially error-free. However,
macro-editing can be an effective control for systematic errors and for some egregious random errors that create outliers and other data anomalies.
Given the volume, velocity, and variety of big data, even macro-
editing can be challenging when thousands of variables and billions
of records are involved. In such cases, the selective editing strategies
developed in the survey research literature can be helpful [90,136].
Using selective macro-editing, query edits are selected based upon
the importance of the variable for the analysis under study, the
severity of the error, and the cost or level of effort involved in in-
vestigating the suspicious aggregate. Thus, only extreme errors, or
less extreme errors on critical variables, would be investigated.
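One simple way to formalize selective editing, sketched here with made-up inputs rather than any method prescribed in [90,136], is to compute a priority score for each flagged aggregate from its analytic importance, the estimated severity of the suspected error, and the expected review cost, and to investigate only the highest-scoring cases:

```python
import pandas as pd

# Hypothetical flagged query edits: one row per suspicious aggregate.
edits = pd.DataFrame({
    "variable":   ["turnover", "employees", "misc_costs"],
    "importance": [1.0, 0.8, 0.2],   # analyst-assigned weight for the analysis
    "severity":   [0.9, 0.3, 0.95],  # e.g., scaled deviation from a benchmark
    "cost":       [1.0, 1.0, 4.0],   # relative review effort
})

edits["priority"] = edits["importance"] * edits["severity"] / edits["cost"]

# Review only the top-scoring edits, subject to a review budget.
print(edits.sort_values("priority", ascending=False).head(2))
```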
There are a variety of methods that may be effective in macro-
editing. Some of these are based upon data mining [267], ma-
chine learning [78], cluster analysis [98, 162], and various data vi-
sualization tools such as treemaps [192, 342, 372] and tableplots
[312, 371, 373]. The tableplot seems particularly well suited for the
three Vs of big data and will be discussed in some detail.
Like other visualization techniques examined in Chapter 9, the
tableplot has the ability to summarize a large multivariate data set
in a single plot [246]. In editing big data, it can be used to de-
tect outliers and unusual data patterns. Software for implementing
this technique has been written in R and is available without cost
from the Comprehensive R Archive Network (https://cran.r-project.
org/). Figure 10.7 shows an example. The key idea is that micro-
aggregates of two related variables should have similar data pat-
terns. Inconsistent data patterns may signal errors in one of the
aggregates that can be investigated and corrected in the editing
process to improve data quality. The tableplot uses bar charts cre-
ated for the micro-aggregates to identify these inconsistent data
patterns.
Each column in the tableplot represents some variable in the
data table, and each row is a “bin” containing a subset of the data.
A statistic such as the mean or total is computed for the values in
a bin and is displayed as a bar (for continuous variables) or as a
stacked bar for categorical variables.
The sequence of steps typically involved in producing a tableplot
is as follows:
1. Sort the records in the data set by the key variable.
2. Divide the sorted data set into B bins containing the same number of rows.
3. For each continuous variable V, compute the statistic to be compared across row bins, say T_b, for b = 1, . . . , B, ignoring missing values. The level of missingness for V may be represented by the color or brightness of the bar. For categorical variables with K categories, compute the proportion in the kth category, denoted by P_bk. Missing values are assigned to a new (K + 1)th category (“missing”).
4. For continuous variables, plot the B values T_b as a bar chart. For categorical variables, plot the B proportions P_bk as a stacked bar chart.
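The production implementation is the R package mentioned above; purely as an illustration of steps 1-3 for a continuous variable, a minimal Python sketch of the binning logic might look like this (the example data are contrived):

```python
import numpy as np
import pandas as pd

def tableplot_bins(df, key, var, n_bins=100, stat="mean"):
    """Steps 1-3 of the tableplot: sort by the key, bin, summarize var."""
    ordered = df.sort_values(key).reset_index(drop=True)           # step 1
    ordered["bin"] = pd.qcut(ordered.index, n_bins, labels=False)  # step 2
    grouped = ordered.groupby("bin")[var]                          # step 3
    return pd.DataFrame({
        stat: grouped.agg(stat),  # T_b per row bin, NaNs ignored
        "missing_share": grouped.apply(lambda s: s.isna().mean()),
    })

# Contrived example
rng = np.random.default_rng(1)
df = pd.DataFrame({"turnover": rng.lognormal(3, 1, 10_000),
                   "employees": rng.lognormal(1, 1, 10_000)})
print(tableplot_bins(df, key="turnover", var="employees").head())
```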
Figure 10.7. Comparison of tableplots for the Dutch Structural Business Statistics Survey for five variables before and after editing (unedited panel on the left, edited panel on the right; columns include log(Turnover), log(Employees), and log(Personnel costs)). Row bins with high missing and unknown numeric values are represented by lighter colored bars
Typically, T_b is the mean, but other statistics such as the median or range could be plotted if they aid in the outlier identification process. For highly skewed distributions, Tennekes and de Jonge [372] suggest transforming T_b by the log function to better capture the range of values in the data set. In that case, negative values can be plotted as log(−T_b) to the left of the origin and zero values can
be plotted on the origin line. For categorical variables, each bar in
the stack should be displayed using contrasting colors so that the
divisions between categories are apparent.
Tableplots appear to be well suited for studying the distributions
of variable values, the correlation between variables, and the occur-
rence and selectivity of missing values. Because they can help vi-
sualize massive, multivariate data sets, they seem particularly well
suited for big data. Currently, the R implementation of tableplot is
limited to 2 billion records.
The tableplot in Figure 10.7 is taken from Tennekes and de
Jonge [372] for the annual Dutch Structural Business Statistics
survey, which covers approximately 58,000 business units each year.
Topics covered in the questionnaire include turnover, number of
employed persons, total purchases, and financial results. Figure
10.7 was created by sorting on the first column, viz., log(turnover),
and dividing the 57,621 observed units into 100 bins, so that each
row bin contains approximately 576 records. To aid the compar-
isons between unedited and edited data, the two tableplots are dis-
played side by side, with the unedited graph on the left and the
edited graph on the right. All variables were transformed by the log
function.
The unedited tableplot reveals that all four of the variables in
the comparison with log(turnover) show some distortion by large
values for some row bins. In particular, log(employees) has some
fairly large nonconforming bins with considerable discrepancies. In
addition, that variable suffers from a large number of missing val-
ues, as indicated by the brightness of the bar color. All in all, there
are obvious data quality issues in the unprocessed data set for all
four of these variables that should be dealt with in the subsequent
processing steps.
The edited tableplot reveals the effect of the data checking and
editing strategy used in the editing process. Notice the much darker
color for the number of employees in the graph on the right compared to the same graph on the left. In addition, the lack of data in the
lowest part of the turnover column has been somewhat improved.
The distributions for the graph on the right appear smoother and
are less jagged.
10.6 Summary
This review of big data errors should dispel the misconception that
the volume of the data can somehow compensate for other data
deficiencies. To think otherwise is big data “hubris,” according to
Lazer et al. [230]. In fact, the errors in big data are at least as severe
and impactful for data analysis as are the errors in designed data
collections. For the latter, there is a vast literature that developed
during the last century on mitigating and adjusting for the error
effects on data analysis. But for the former, the corresponding
literature is still in its infancy. While the survey literature can form
a foundation for the treatment of errors in big data, its utility is
also quite limited because, unlike for smaller data sets, it is often
quite difficult to eliminate all but the most egregious errors from big
data due to its size. Indeed, volume, velocity, and variety may be
regarded as the three curses of big data veracity.
This chapter considers only a few types of data analysis that are
common in big data analytics: classification, correlation, and re-
gression analysis. We show that even when there is veracity, big
data analytics can result in poor inference as a result of the three
Vs. These issues include noise accumulation, spurious correla-
tions, and incidental endogeneity. However, relaxing the assump-
tion of veracity compounds these inferential issues and adds many
new ones. Analytics can suffer in unpredictable ways.
One solution we propose is to clean up the data before analysis.
Another option that was not discussed is the possibility of using
analytical techniques that attempt to model errors and compensate
for them in the analysis. Such techniques include the use of latent
class analysis for classification error [34], multilevel modeling of
systematic errors from multiple sources [169], and Bayesian statis-
tics for partitioning massive data sets across multiple machines and
then combining the results [176, 335].
The BDTE perspective can be helpful when we consider using
a big data source. In the end, it is the responsibility of big data
analysts to be aware of the many limitations of the data and to take
the necessary steps to limit the effects of big data error on analytical
results.
Finally, we note that the BDTE paradigm focuses on the accu-
racy dimension of Total Quality (writ large). Accuracy may also
impinge on other quality dimensions such as timeliness, compara-
bility, coherence, and relevance that we have not considered in this
chapter. However, those other dimensions are equally important
to some consumers of big data analytics. For example, timeliness
often competes with accuracy because achieving acceptable levels
of the latter often requires greater expenditures of resources and
time. In fact, some consumers prefer analytic results that are less
accurate for the sake of timeliness. Biemer and Lyberg [35] discuss
these and other issues in some detail.
10.7 Resources
The American Association of Public Opinion Research has a number
of resources on its website [1]. See, in particular, its report on big
data [188].
The Journal of Official Statistics [194] is a standard resource with
many relevant articles. There is also an annual international con-
ference on the total survey error framework, supported by major
survey organizations [378].
Chapter 11
Privacy and Confidentiality
Stefan Bender, Ron Jarmin, Frauke Kreuter, and Julia Lane
This chapter addresses the issue that sits at the core of any study
of human beings—privacy and confidentiality. In a new field, like
the one covered in this book, it is critical that many researchers
have access to the data so that work can be replicated and built
upon—that there be a scientific basis to data science. Yet the rules
that social scientists have traditionally used for survey data, namely
anonymity and informed consent, no longer apply when data are
collected in the wild. This concluding chapter identifies the issues
that must be addressed for responsible and ethical research to take
place.
11.1 Introduction
Most big data applications in the social sciences involve data on
units such as individuals, households, and different types of busi-
ness, educational, and government organizations. Indeed, the ex-
ample running throughout this book involves data on individuals
(such as faculty and students) and organizations (such as universi-
ties and firms). In circumstances such as these, researchers must
ensure that such data are used responsibly and ethically—that the
subjects under study suffer no harm from their data being accessed,
analyzed, and reported. A clear distinction must be made between
analysis done for the public good and that done for private gain. In
practical terms, this requires that the private interests of individ-
ual privacy and data confidentiality be balanced against the social
benefits of research access and use.
Privacy “encompasses not only the famous ‘right to be left alone,’
or keeping one’s personal matters and relationships secret, but also
the ability to share information selectively but not publicly” [287].
Figure 11.1. The privacy–utility tradeoff (axes: utility and privacy; curves: the initial utility/privacy frontier and the frontier after an increase in external data; labeled points U*, P1, P2)
Confidentiality is “preserving authorized restrictions on information
access and disclosure, including means for protecting personal pri-
vacy and proprietary information” [254]. Doing so is not easy—the
challenge to the research community is how to balance the risk of
providing access with the associated utility [100]. To give a sim-
ple example, if means and percentages are presented for a large
number of people, it will be difficult to infer an individual’s value
from such output, even if one knew that a certain individual or unit
contributed to the formation of that mean or percentage. However,
if those means and percentages are presented for subgroups or in
multivariate tables with small cell sizes, the risk for disclosure in-
creases [96]. As a result, the quality of data analysis is typically
degraded with the production of public use data [100].
In general, the greater the access to data and their original val-
ues, the greater the risk of reidentification for individual units. We
depict this tradeoff graphically in Figure 11.1. The concave curves
in this hypothetical example depict the technological relationship
between data utility and privacy for an organization such as a busi-
ness firm or a statistical agency. At one extreme, all information
is available to anybody about all units, and therefore high analytic
utility is associated with the data that are not at all protected. At
the other extreme, nobody has access to any data and no utility is
achieved. Initially, assume the organization is on the outer frontier.
Increased external data resources (those not used by the organiza-
tion) increase the risk of reidentification. This is represented by an
inward shift of the utility/privacy frontier in Figure 11.1. Before the
increase in external data, the organization could achieve a level of
data utility U* and privacy P1. The increase in externally available data now means that in order to maintain utility at U*, privacy is reduced to P2. This simple example represents the challenge to all organizations that release statistical or analytical products obtained
from underlying identifiable data in the era of big data. As more
data become available externally, the more difficult it is to maintain
privacy.
Before the big data era, national statistical agencies had the ca-
pacity and the mandate to make dissemination decisions: they as-
sessed the risk, they understood the data user community and the
associated utility from data releases. And they had the wherewithal
to address the legal, technical, and statistical issues associated with
protecting confidentiality [377].
But in a world of big data, many once-settled issues have new
complications, and wholly new issues arise that need to be ad-
dressed, albeit under the same rubrics. The new types of data have
much greater potential utility, often because it is possible to study
small cells or the tails of a distribution in ways not possible with
small data. In fact, in many social science applications, the tails of
the distribution are often the most interesting and hardest-to-reach
parts of the population being studied; consider health care costs
for a small number of ill people [356], or economic activity such as
rapid employment growth by a small number of firms [93].
Example: The importance of activity in the tails
Spending on health care services in the United States is highly concentrated among
a small proportion of people with extremely high use. For the overall civilian
population living in the community, the latest data indicate that more than 20% of
all personal health care spending in 2009 ($275 billion) was on behalf of just 1%
of the population [333].
It is important to understand where the risk of privacy breaches
comes from. Let us assume for a moment that we conducted a
traditional small-scale survey with 1,000 respondents. The survey
contains information on political attitudes, spending and saving in
a given year, and income, as well as background variables on in-
come and education. If name and address are saved together with
this data, and someone gets access to the data, obviously it is easy
to identify individuals and gain access to information that is other-
wise not public. If the personal identifiable information (name and
address) are removed from this data file, the risk is much reduced.
If someone has access to the survey data and sees all the individual
values, it might be difficult to assess with certainty who among the
320 million inhabitants in the USA is associated with an individual
data record. However, the risk is higher if one knows some of this
information (say, income) for a person, and knows that this person
is in the survey. With these two pieces of information, it is likely
possible to uniquely identify the person in the survey data.
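The mechanics of this risk can be made concrete with a short sketch that counts how many records share each combination of quasi-identifiers; the file and column names are hypothetical. Any combination shared by only one record is uniquely identifying for anyone who knows those attributes and knows that the person is in the data.

```python
import pandas as pd

# Hypothetical survey file with direct identifiers already removed.
survey = pd.read_csv("survey_responses.csv")
quasi_identifiers = ["zip_code", "year_of_birth", "income_bracket"]

# k-anonymity-style check: size of each quasi-identifier group.
group_sizes = survey.groupby(quasi_identifiers).size()
unique_records = int((group_sizes == 1).sum())

print(f"{unique_records} of {len(survey)} records are unique on "
      f"{quasi_identifiers} and are therefore at risk of reidentification.")
```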
Big data increases the risk precisely for this reason. Much data
is available for reidentification purposes [289]. Most obviously, the
risk of reidentification is much greater because the new types of
data have much richer detail and a much larger public community
has access to ways to reidentify individuals. There are many famous
examples of reidentification occurring even when obvious personal
information, such as name and social security number, has been
removed and the data provider thought that the data were conse-
quently deidentified. In the 1990s, the Massachusetts Group Insurance Commission released “deidentified” data on the hospital visits of state employees; researcher Latanya Sweeney quickly reidentified the hospital records of then-Governor William Weld using nothing more
than state voter records about residence and date of birth [366].
In 2006, the release of supposedly deidentified web search data
by AOL resulted in two New York Times reporters being able to reidentify a customer simply from her browsing habits [26]. And in 2012, statisticians at the department store Target used a young
teenager’s shopping patterns to determine that she was pregnant
before her father did [166].
But there are also less obvious problems. What is the legal
framework when the ownership of data is unclear? In the past,
when data were more likely to be collected and used within the
same entity—for example, within an agency that collects adminis-
trative data or within a university that collects data for research
purposes—organization-specific procedures were (usually) in place
and sufficient to regulate the usage of these data. In a world of big
data, legal ownership is less clear.
Who has the legal authority to make decisions about permission,
access, and dissemination and under what circumstances? The
answer is often not clear. The challenge in the case of big data is
that data sources are often combined, collected for one purpose, and
used for another. Data providers often have a poor understanding
of whether or how their data will be used.
Example: Knowledge is power
In a discussion of legal approaches to privacy in the context of big data, Strandburg
[362] says: “‘Big data’ has great potential to benefit society. At the same time,
its availability creates significant potential for mistaken, misguided or malevolent
uses of personal information. The conundrum for the law is to provide space for big
data to fulfill its potential for societal benefit, while protecting citizens adequately
from related individual and social harms. Current privacy law evolved to address
different concerns and must be adapted to confront big data’s challenges.”
It is critical to address privacy and confidentiality issues if the
full public value of big data is to be realized. This chapter highlights why the challenges need to be met (i.e., why access to data is crucial), reviews the pre-big-data past, points out challenges with this approach in the context of big data, briefly describes the current state of play from a legal, technical, and statistical perspective, and points to open questions that need to be addressed in the future.
11.2 Why is access important?
This book gives detailed examples of the potential of big data to pro-
vide insights into a variety of social science questions—particularly
the relationship between investments in R&D and innovation. But
that potential is only realized if researchers have access to the
data [225]: not only to perform primary analyses but also to validate
the data generation process (in particular, data linkage), replicate
analyses, and build a knowledge infrastructure around complex
data sets.
Validating the data generating process Research designs requir-
ing a combination of data sources and/or analysis of the tails of
populations challenge the traditional paradigm of conducting sta-
tistical analysis on deidentified or aggregated data. In order to com-
bine data sets, someone in the chain that transforms raw data into
research outputs needs access to link keys contained in the data
sets to be combined. High-quality link keys uniquely identify the
subjects under study and typically are derived from items such as
individual names, birth dates, social security numbers, and busi-
ness names, addresses, and tax ID numbers. From a privacy and
confidentiality perspective, link keys are among the most sensitive information in many data sets of interest to social scientists. This is
why many organizations replace link keys containing personal iden-
tifiable information (PII)* with privacy-protecting identifiers [332].
*PII is “any information about an individual maintained by an agency, including (1) any information that can be used to distinguish or trace an individual’s identity, such as name, social security number, date and place of birth, mother’s maiden name, or biometric records; and (2) any other information that is linked or linkable to an individual, such as medical, educational, financial, and employment information” [254].
Regardless, at some point in the process these identifiers must be generated from the original information, so access to the latter is important.
Replication John Ioannidis has claimed that most published re-
search findings are false [184]; for example, the replication rate of genome-wide association studies, at less than 1%, is staggeringly low [29]. Inadequate understanding of coverage, incentive, and
quality issues, together with the lack of a comparison group, can re-
sult in biased analysis—famously in the case of using administrative
records on crime to make inference about the role of death penalty
policy in crime reduction [95, 234]. Similarly, overreliance on, say,
Twitter data, in targeting resources after hurricanes can lead to
the misallocation of resources towards young, Internet-savvy people
with cell phones and away from elderly or impoverished neighbor-
hoods [340], just as bad survey methodology led the Literary Digest
to incorrectly call the 1936 election [353]. The first step to repli-
cation is data access; such access can enable other researchers to
ascertain whether the assumptions of a particular statistical model
are met, what relevant information is included or excluded, and
whether valid inferences can be drawn from the data [215].
Building knowledge infrastructure Creating a community of prac-
tice around a data infrastructure can result in tremendous new
insights, as the Sloan Digital Sky Survey and the Polymath project
have shown [281]. In the social science arena, the Census Bu-
reau has developed a productive ecosystem that is predicated on
access to approved external experts to build, conduct research us-
ing, and improve key data assets such as the Longitudinal Busi-
ness Database [190] and Longitudinal Employer Household Dy-
namics [4], which have yielded a host of new data products and
critical policy-relevant insights on business dynamics [152] and la-
bor market volatility [55], respectively. Without providing robust,
but secure, access to confidential data, researchers at the Census
Bureau would have been unable to undertake the innovations that
made these new products and insights possible.
11.3 Providing access
The approaches to providing access have evolved over time. Sta-
tistical agencies often employ a range of approaches depending on
the needs of heterogeneous data users [96, 126]. Dissemination
of data to the public usually occurs in three steps: an evaluation
of disclosure risks, followed by the application of an anonymiza-
tion technique, and finally an evaluation of disclosure risks and
analytical quality of the candidate data release(s). The two main
approaches have been statistical disclosure control techniques to
produce anonymized public use data sets, and controlled access
through a research data center.
Statistical disclosure control techniques Statistical agencies have
made data available in a number of ways: through tabular data,
public use files, licensing agreements and, more recently, through
synthetic data [317]. Hundepool et al. [174] define statistical dis-
closure control as follows:
concepts and methods that ensure the confidentiality of
micro and aggregated data that are to be published. It is
methodology used to design statistical outputs in a way
that someone with access to that output cannot relate a
known individual (or other responding unit) to an element
in the output.
Traditionally, confidentiality protection has been accomplished
by releasing only aggregated tabular data. This practice works well
in settings where the primary purpose is enumeration, such as cen-
sus taking. However, tabular data are poorly suited to describing
the underlying distributions and covariance across variables that
are often the focus of applied social science research [100].
To provide researchers access to data that permitted analysis
of the underlying variance–covariance structure of the data, some
agencies have constructed public use micro-data samples. To protect confidentiality in such public use files, a number of statisti-
cal disclosure control procedures are typically applied. These in-
clude stripping all identifying (e.g., PII) fields from the data, topcod-
ing highly skewed variables (e.g., income), and swapping records
[96,413]. However, the mosaic effect—where disparate pieces of in-
formation can be combined to reidentify individuals—dramatically
increases the risk of releasing public use files [88]. In addition,
there is more and more evidence that the statistical disclosure pro-
cedure applied to produce them decreases their utility across many
applications [58].
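As a simple illustration of two of these procedures, the sketch below strips direct identifiers and topcodes a skewed income variable at its 99th percentile. The file name, column names, and threshold are hypothetical, and a real disclosure-control workflow involves much more than this.

```python
import pandas as pd

microdata = pd.read_csv("confidential_microdata.csv")  # hypothetical file

# Strip direct identifiers (PII fields).
public_use = microdata.drop(columns=["name", "address", "ssn"])

# Topcode a highly skewed variable: values above the 99th percentile
# are replaced by the threshold itself.
top = public_use["income"].quantile(0.99)
public_use["income"] = public_use["income"].clip(upper=top)

public_use.to_csv("public_use_file.csv", index=False)
```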
Some agencies provide access to confidential micro-data through
licensing arrangements. A contract specifies the conditions of use
and what safeguards must be in place. In some cases, the agency
has the authority to conduct random inspections. However, this
approach has led to a number of operational challenges, including
version control, identifying and managing risky researcher behavior,
and management costs [96].
More recently, synthetic data have been created whereby key
features of the original data are preserved but the original data are
replaced by results from estimations (synthetic data) so that no in-
dividual or business entity can be found in the released data [97].
Two examples of synthetic data sets are the SIPP Synthetic-Beta [5]
of linked Survey of Income and Program Participation (SIPP) and So-
cial Security Administration earnings data, and the Synthetic Lon-
gitudinal Business Database (SynLBD) [207]. Jarmin et al. [189]
discuss how synthetic data sets lack utility in many research set-
tings but are useful for generating flexible data sets underlying data
tools and apps such as the Census Bureau’s OnTheMap.
Research data centers The second approach is establishing research data centers (RDCs). Here, qualified researchers gain access to micro-level data after they are sworn in to protect the confidentiality of the data they access. Strong input and output controls are in place to ensure that published findings comply with the privacy and confidentiality regulations [161]. Some RDCs allow access
through remote execution, where no direct access to the data is al-
lowed, but it is not necessary to travel; others allow remote direct
access.
11.4 The new challenges
While there are well-established policies and protocols surrounding
access to and use of survey and administrative data, a major new
challenge is the lack of clear guidelines governing the collection of
data about human activity in a world in which all public, and some
private, actions generate data that can be harvested [287,289,362].
The twin pillars on which so much of social science has rested, informed consent and anonymization, are virtually useless in a big
data setting where multiple data sets can be and are linked together
using individual identifiers by a variety of players beyond social sci-
entists with formal training and whose work is overseen by institu-
tional review boards. This rapid expansion in data and their use is
very much driven by the increased utility of the linked information
to businesses, policymakers, and ultimately the taxpayer. In addi-
tion, there are no obvious data stewards and custodians who can
be entrusted with preserving the privacy and confidentiality with
regard to both the source data collected from sensors, social media,
and many other sources, and the related analyses [226].
It is clear that informed consent as historically construed is no
longer feasible. As Nissenbaum [283] points out, notification is ei-
ther comprehensive or comprehensible, but not both. While ideally
human subjects are offered true freedom of choice based on a sound
and sufficient understanding of what the choice entails, in reality
the flow of data is so complex and the interest in the data usage
so diverse that simplicity and clarity in the consent statement un-
avoidably result in losses of fidelity, as anyone who has accepted a
Google Maps agreement is likely to understand [160]. In addition,
informed consent requires a greater understanding of the breadth
of type of privacy breaches, the nature of harm as diffused over
time, and an improved valuation of privacy in the big data con-
text. Consumers may value their own privacy in variously flawed
ways. They may, for example, have incomplete information, or an
overabundance of information rendering processing impossible, or
use heuristics that establish and routinize deviations from rational
decision-making [6].
It is also nearly impossible to anonymize data. Big data are
often structured in such a way that essentially everyone in the file
is unique, either because so many variables exist or because the data are so frequent or geographically detailed, making it easy to reidentify individual patterns [266]. It is also no longer possible
to rely on sampling or measurement error in external files as a
buffer for data protection, since most data are not in the hands of
statistical agencies.
There are no data stewards controlling access to individual data.
Data are often so interconnected (think social media network data)
that one person’s action can disclose information about another
person without that person even knowing that their data are be-
ing accessed. The group of students posting pictures about a beer
party is an obvious example, but, in a research context, if the prin-
cipal investigator grants access to the proposal, information could
be divulged about colleagues and students. In other words, volun-
teered information of a minority of individuals can unlock the same
information about many—a type of “tyranny of the minority” [28].
There are particular issues raised by the new potential to link
information based on a variety of attributes that do not include PII.
Barocas and Nissenbaum write as follows [27]:
Rather than attempt to deanonymize medical records, for
instance, an attacker (or commercial actor) might instead
infer a rule that relates a string of more easily observable
or accessible indicators to a specific medical condition,
rendering large populations vulnerable to such inferences
even in the absence of PII. Ironically, this is often the
very thing about big data that generate the most excite-
ment: the capacity to detect subtle correlations and draw
actionable inferences. But it is this same feature that
renders the traditional protections afforded by anonymity
(again, more accurately, pseudonymity) much less effec-
tive.
In light of these challenges, Barocas and Nissenbaum continue:
the value of anonymity inheres not in namelessness, and
not even in the extension of the previous value of name-
lessness to all uniquely identifying information, but in-
stead to something we called “reachability,” the possi-
bility of knocking on your door, hauling you out of bed,
calling your phone number, threatening you with sanc-
tion, holding you accountable—with or without access to
identifying information.
It is clear that the concepts used in the larger discussion of
privacy and big data require updating. How we understand and
assess harms from privacy violations needs updating. And we must
rethink established approaches to managing privacy in the big data
context. The next section discusses the framework for doing so.
11.5 Legal and ethical framework
The Fourth Amendment to the US Constitution, which constrains
the government’s power to “search” the citizenry’s “persons, houses,
papers, and effects” is usually cited as the legal framework for pri-
vacy and confidentiality issues. In the US a “sectoral” approach to
privacy regulation, for example, the Family Education Rights and
Privacy Act and the Health Insurance Portability and Accountability
Act, is also used in situations where different economic areas have
separate privacy laws [290]. In addition, current legal restrictions
and guidance on data collection in the industrial setting include the
Fair Information Practice Principles dating from 1973 and underly-
ing the Fair Credit Reporting Act from 1970 and the Privacy Act from
1974 [362]. Federal agencies often have statutory oversight, such
as Title 13 of the US Code for the Census Bureau, the Confiden-
tial Information Protection and Statistical Efficiency Act for federal
statistical agencies, and Title 26 of the US Code for the Internal
Revenue Service.
Yet the generation of big data often takes place in the open, or
through commercial transactions with a business, and hence is not
covered by these frameworks. There are major questions as to what
is reasonably private and what constitutes unwarranted intrusion
[362]. There is a lack of clarity on who owns the new types of data—
whether it is the person who is the subject of the information, the
person or organization who collects these data (the data custodian),
the person who compiles, analyzes, or otherwise adds value to the
information, the person who purchases an interest in the data, or
society at large. The lack of clarity is exacerbated because some
laws treat data as property and some treat it as information [65].
The ethics of the use of big data are also not clear, because
analysis may result in being discriminated against unfairly, being
limited in one’s life choices, being trapped inside stereotypes, being
unable to delineate personal boundaries, or being wrongly judged,
embarrassed, or harassed. There is an entire research agenda to be
pursued that examines the ways that big data may threaten inter-
ests and values, distinguishes the origins and nature of threats to
individual and social integrity, and identifies different solutions [48].
The approach should be to describe what norms and expectations
are likely to be violated if a person agrees to provide data, rather
than to describe what will be done during the research.
What is clear is that most data are no longer housed in statistical agencies, with well-defined rules of conduct, but in busi-
nesses or administrative agencies. In addition, since digital data
can be alive forever, ownership could be claimed by yet-to-be-born
relatives whose personal privacy could be threatened by release of
information about blood relations.
Traditional regulatory tools for managing privacy, notice, and
consent have failed to provide a viable market mechanism allowing
a form of self-regulation governing industry data collection. Going
forward, a more nuanced assessment of tradeoffs in the big data
context, moving away from individualized assessments of the costs
of privacy violations, is needed [362]. Ohm advocates for a new
conceptualization of legal policy regarding privacy in the big data
context that uses five guiding principles for reform: first, that rules
take into account the varying levels of inherent risk to individuals
across different data sets; second, that traditional definitions of PII
need to be rethought; third, that regulation has a role in creating
and policing walls between data sets; fourth, that those analyzing
big data must be reminded, with a frequency in proportion to the
sensitivity of the data, that they are dealing with people; and fi-
nally, that the ethics of big data research must be an open topic for
continual reassessment.
11.6 Summary
The excitement about how big data can change the social science re-
search paradigm should be tempered by a recognition that existing
ways of protecting confidentiality are no longer viable [203]. There
is a great deal of research that can be used to inform the develop-
ment of such a structure, but it has been siloed into disconnected
research areas, such as statistics, cybersecurity, and cryptography,
as well as a variety of different practical applications, including the
successful development of remote access secure data enclaves. We
must piece together the knowledge from these various fields to de-
velop ways in which vast new sets of data on human beings can be
collected, integrated, and analyzed while protecting them [227].
It is possible that the confidentiality risks of disseminating data
may be so high that traditional access models will no longer hold;
that the data access model of the future will be to take the analysis
to the data rather than the data to the analyst or the analyst to the
data. One potential approach is to create an integrated system in-
cluding (a) unrestricted access to highly redacted data, most likely
some version of synthetic data, followed by (b) means for approved
researchers to access the confidential data via remote access so-
lutions, combined with (c) verification servers that allow users to
assess the quality of their inferences with the redacted data so as
to be more efficient with their use (if necessary) of the remote data
access. Such verification servers might be a web-accessible system
based on a confidential database with an associated public micro-
data release, which helps to analyze the confidential database [203].
Such approaches are starting to be developed, both in the USA and
in Europe [106,193].
There is also some evidence that people do not require complete
protection, and will gladly share even private information provided
that certain social norms are met [300, 403]. There is a research
agenda around identifying those norms as well; characterizing the
interests and wishes of actors (the information senders and recipi-
ents or providers and users); the nature of the attributes (especially
types of information about the providers, including how these might
be transformed or linked); and identifying transmission principles
(the constraints underlying the information flows).
However, it is likely that it is no longer possible for a lone social
scientist to address these challenges. One-off access agreements to
individuals are conducive to neither the production of high-quality
science nor the high-quality protection of data [328]. The curation,
protection, and dissemination of data on human subjects cannot
be an artisan activity but should be seen as a major research in-
frastructure investment, like investments in the physical and life
sciences [3, 38, 173]. In practice, this means that linkages become
professionalized and replicable, research is fostered within research
data centers that protect privacy in a systematic manner, knowledge
is shared about the process of privacy protections disseminated in
a professional fashion, and there is ongoing documentation about
the value of evidence-based research. It is thus that the risk–utility
tradeoff depicted in Figure 11.1 can be shifted in a manner that
serves the public good.
11.7 Resources
The American Statistical Association’s Privacy and Confidentiality
website provides a useful source of information [12].
An overview of federal activities is provided by the Confidentiality
and Data Access Committee of the Federal Committee on Statistics
and Methodology [83].
The World Bank and International Household Survey Network
provide a good overview of data dissemination “best practices” [183].
There is a Journal of Privacy and Confidentiality based at Carnegie
Mellon University [195], and also a journal called Transactions in
Data Privacy [370].
The United Nations Economic Commission for Europe hosts work-
shops and conferences and produces occasional reports [382].
Chapter 12
Workbooks
Jonathan Scott Morgan, Christina Jones, and Ahmad Emad
This final chapter provides an overview of the Python workbooks
that accompany each chapter. These workbooks combine text ex-
planation and code you can run, implemented in Jupyter notebooks (see jupyter.org), to explain techniques and approaches selected from each
chapter and to provide thorough implementation details, enabling
students and interested practitioners to quickly get up to speed on
and start using the technologies covered in the book. We hope you
have a lot of fun with them.
12.1 Introduction
We provide accompanying Jupyter IPython workbooks for most
chapters in this book. These workbooks explain techniques and
approaches selected from each chapter and provide thorough im-
plementation details so that students and interested practitioners
can quickly start using the technologies covered within.
The workbooks and related files are stored in the Big-Data-Workbooks GitHub repository (github.com/BigDataSocialScience), and so are freely available to be
downloaded by anyone at any time and run on any appropriately
configured computer. These workbooks are a live set of documents
that could potentially change over time, so see the repository for the
most recent set of information.
These workbooks provide a thorough overview of the work needed
to implement the selected technologies. They combine explanation,
basic exercises, and substantial additional Python code to provide a
conceptual understanding of each technology, give insight into how
key parts of the process are implemented through exercises, and
then lay out an end-to-end pattern for implementing each in your
own work. The workbooks are implemented using IPython note-
books, interactive documents that mix formatted text and Python
code samples that can be edited and run in real time in a Jupyter
notebook server, allowing you to run and explore the code for each
technology as you read about it.
12.2 Environment
The Big-Data-Workbooks GitHub repository provides two different
types of workbooks, each needing a different Python setup to run.
The first type of workbooks is intended to be downloaded and
run locally by individual users. The second type is designed to
be hosted, assigned, worked on, and graded on a single server,
using jupyterhub (https://github.com/jupyter/jupyterhub) to host
and run the notebooks and nbgrader (https://github.com/jupyter/
nbgrader) to assign, collect, and grade.
The text, images, and Python code in the workbooks are the same
between the two versions, as are the files and programs needed to
complete each.
The differences in the workbooks themselves relate to the code
cells within each notebook where users implement and test exer-
cises. In the workbooks intended to be used locally, exercises are
implemented in simple interactive code cells. In the nbgrader ver-
sions, these cells have additional metadata and contain the solu-
tions for the exercises, making them a convenient answer key even
if you are working on them locally.
12.2.1 Running workbooks locally
To run workbooks locally, you will need to install Python on your
system, then install ipython, which includes a local Jupyter server
you can use to run the workbooks. You will also need to install
additional Python packages needed by the workbooks, and a few
additional programs.
The easiest way to get this all working is to install the free
Anaconda Python distribution provided by Continuum Analytics
(https://www.continuum.io/downloads). Anaconda includes a Jupyter
server and precompiled versions of many packages used in the
workbooks. It includes multiple tools for installing and updating
both Python and installed packages. It is separate from any OS-
level version of Python, and is easy to completely uninstall.
Anaconda works as well on Windows as it does on Mac and Linux.
Windows is a much different operating system from Apple’s OS X
and Unix/Linux, and Python has historically been much trickier
to install, configure, and use on Windows. Packages are harder to
compile and install, the environment can be more difficult to set
up, etc. Anaconda makes Python easier to work with on any OS,
and on Windows, in a single run of the Anaconda installer, it inte-
grates Python and common Python utilities like pip into Windows
well enough that it approximates the ease and experience of using
Python within OS X or Unix/Linux (no small feat).
You can also create your Python environment manually, in-
stalling Python, package managers, and Python packages sepa-
rately. Packages like numpy and pandas can be difficult to get work-
ing, however, particularly on Windows, and Anaconda simplifies
this setup considerably regardless of your OS.
12.2.2 Central workbook server
Setting up a server to host workbooks managed by nbgrader is more
involved. Some of the workbooks consume multiple gigabytes of
memory per user and substantial processing power. A hosted im-
plementation where all users work on a single server requires sub-
stantial hardware, relatively complex configuration, and ongoing
server maintenance. Detailed instructions are included in the Big-
Data-Workbooks GitHub repository. It is not rocket science, but it
is complicated, and you will likely need an IT professional to help
you set up, maintain, and troubleshoot. Since all student work
will be centralized in this one location, you will also want a robust,
multi-destination backup plan.
For more information on installing and running the workbooks
that accompany this book, see the Big-Data-Workbooks GitHub
repository.
12.3 Workbook details
Most chapters have an associated workbook, each in its own direc-
tory in the Big-Data-Workbooks GitHub repository. Below is a list
of the workbooks, along with a short summary of the topics that
each covers.
12.3.1 Social Media and APIs
The Social Media and APIs workbook introduces you to the use
of Internet-based web service APIs for retrieving data from online
data stores. Examples include retrieving information on articles
from Crossref (provider of Digital Object Identifiers used as unique
IDs for publications) and using the PLOS Search and ALM APIs to
retrieve information on how articles are shared and referenced in
social media, focusing on Twitter. In this workbook, you will learn
how to:
Set up user API keys,
Connect to Internet-based data stores using APIs,
Collect DOIs and Article-Level Metrics data from web APIs,
Conduct basic analysis of publication data.
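To give a flavor of the kind of API call covered in this workbook, the following minimal sketch (not taken from the workbook itself) queries the public Crossref REST API for works matching a search term; the query string is arbitrary.

```python
import requests

# Query the public Crossref REST API for works matching a search term.
response = requests.get(
    "https://api.crossref.org/works",
    params={"query": "big data social science", "rows": 5},
    timeout=30,
)
response.raise_for_status()

for item in response.json()["message"]["items"]:
    title = item.get("title", ["(no title)"])[0]
    print(item["DOI"], "-", title)
```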
12.3.2 Database basics
In the Database workbook you will learn the practical benefits that
stem from using a database management system. You will imple-
ment basic SQL commands to query grants, patents, and vendor
data, and thus learn how to interact with data stored in a rela-
tional database. You will also be introduced to using Python to
execute and interact with the results of SQL queries, so you can
write programs that interact with data stored in a database. In this
workbook, you will learn how to:
Connect to a database through Python,
Query the database by using SQL in Python,
Begin to understand the SQL query language,
Close database connections.
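A minimal sketch of this pattern, using Python's built-in sqlite3 module and a hypothetical grants table rather than the database used in the workbook, looks like this:

```python
import sqlite3

# Connect to a (hypothetical) local SQLite database file.
connection = sqlite3.connect("grants.db")
cursor = connection.cursor()

# Query the database using SQL from within Python.
cursor.execute(
    "SELECT department, COUNT(*) FROM grants GROUP BY department ORDER BY 2 DESC"
)
for department, n_grants in cursor.fetchall():
    print(department, n_grants)

# Close the database connection when finished.
connection.close()
```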
12.3.3 Data Linkage
In the Data Linkage workbook you will use Python to clean input
data, including using regular expressions, then learn and implement the basic concepts behind probabilistic record linkage: us-
ing different types of string comparators to compare multiple pieces
of information between two records to produce a score that indicates
how likely it is that the records are data about the same underlying
entity. In this workbook, you will learn how to:
Parse a name string into first, middle, and last names using
Python’s split method and regular expressions,
Use and evaluate the results of common computational
string comparison algorithms including Levenshtein distance,
Levenshtein–Damerau distance, and Jaro–Winkler distance,
Understand the Fellegi–Sunter probabilistic record linkage
method, with a step-by-step implementation guide.
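For example, a minimal sketch of the string-comparison step, using the jellyfish package (one common Python implementation of these comparators; the workbook may use a different library), might look like this:

```python
import jellyfish

a, b = "Jonathan Morgan", "Jonathon Morgen"

# Edit-distance and similarity scores between two name strings.
print("Levenshtein distance:", jellyfish.levenshtein_distance(a, b))
print("Damerau-Levenshtein distance:", jellyfish.damerau_levenshtein_distance(a, b))
print("Jaro-Winkler similarity:", jellyfish.jaro_winkler_similarity(a, b))
```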
12.3.4 Machine Learning
In the Machine Learning workbook you will train a machine learning model to predict missing information. You will work through the process of cleaning and preparing data, then train and test a model that imputes values for a missing categorical variable: the academic department of a given grant’s primary investigator, predicted from other traits of the grant. In this workbook, you will learn how to:
• Read, clean, filter, and store data with Python’s pandas data analysis package,
• Recognize the types of data cleaning and refining needed to make data more compatible with machine learning models,
• Clean and refine data,
• Manage memory when working with large data sets,
• Employ strategies for dividing data to properly train and test a machine learning model,
• Use the scikit-learn Python package to train, fit, and evaluate machine learning models.
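A compressed sketch of that workflow appears below, using pandas and scikit-learn. The file name grants.csv and the column names (department, agency, amount, duration_months) are hypothetical stand-ins for the workbook's grant data, and the random forest is just one reasonable model choice, not the workbook's prescribed one.

```python
# Train/test split and a classifier that imputes a missing categorical variable.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

grants = pd.read_csv("grants.csv")                # hypothetical input file
labeled = grants.dropna(subset=["department"])    # rows where the label is known

features = pd.get_dummies(labeled[["agency", "amount", "duration_months"]])
labels = labeled["department"]

# Hold out a test set so evaluation is not done on the training data.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```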
12.3.5 Text Analysis
In the Text Analysis workbook, you will derive a list of topics from text documents using MALLET, a Java-based tool that analyzes clusters of words across a set of documents to identify the common topics within them, each defined by a set of key words that are consistently used together. In this workbook, you will learn how to:
• Clean and prepare data for automated text analysis,
• Set up data for use in MALLET,
• Derive a set of topics from a collection of text documents,
• Create a model that detects these topics in documents, and use this model to categorize documents.
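Since MALLET is driven from the command line, one convenient pattern is to call it from Python, as in the hedged sketch below. The paths are placeholders, and the options shown are only a small subset of what MALLET accepts; consult the MALLET documentation for the full set.

```python
# Drive MALLET's command-line interface from Python to fit a topic model.
import subprocess

MALLET = "/path/to/mallet/bin/mallet"   # placeholder: location of the MALLET executable

# 1. Convert a directory of plain-text documents into MALLET's binary format.
subprocess.check_call([
    MALLET, "import-dir",
    "--input", "documents/",
    "--output", "documents.mallet",
    "--keep-sequence",            # required for topic modeling
    "--remove-stopwords",
])

# 2. Train an LDA topic model and write out the key words for each topic.
subprocess.check_call([
    MALLET, "train-topics",
    "--input", "documents.mallet",
    "--num-topics", "20",
    "--output-topic-keys", "topic_keys.txt",
    "--output-doc-topics", "doc_topics.txt",
])
```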
12.3.6 Networks
In the Networks workbook you will create network data in which the nodes are researchers who have been awarded grants and ties connect researchers named on the same grant. You will use Python to read the grant data and translate them into network data, then use the networkx Python library to calculate node- and graph-level network statistics and igraph to create and refine network visualizations. You will also be introduced to graph databases, an alternative way of storing and querying network data. In this workbook, you will learn how to:
• Develop strategies for detecting potential network data in relational data sets,
• Use Python to derive network data from a relational database,
• Store and query network data using a graph database like neo4j,
• Load network data into networkx, then use it to calculate node- and graph-level network statistics,
• Use networkx to export graph data into commonly shared formats (graphml, edge lists, different tabular formats, etc.),
• Load network data into the igraph Python package and then create graph visualizations.
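As a small illustration of the networkx steps listed above, the sketch below builds a co-award network from a handful of made-up (grant, researcher) pairs, computes node- and graph-level statistics, and exports the result to GraphML; the data and file names are invented for the example.

```python
# Build a co-award network from (grant, researcher) pairs with networkx.
from itertools import combinations
import networkx as nx

grant_researchers = {
    "G1": ["Lane", "Foster", "Ghani"],
    "G2": ["Foster", "Jarmin"],
    "G3": ["Ghani", "Kreuter", "Lane"],
}

G = nx.Graph()
for grant, researchers in grant_researchers.items():
    # Tie every pair of researchers named on the same grant.
    for a, b in combinations(researchers, 2):
        G.add_edge(a, b, grant=grant)

# Node-level statistic: degree centrality for each researcher.
print(nx.degree_centrality(G))

# Graph-level statistics: density and number of connected components.
print(nx.density(G), nx.number_connected_components(G))

# Export to a commonly shared format for use in other tools (e.g., igraph, Gephi).
nx.write_graphml(G, "grant_network.graphml")
```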
12.3.7 Visualization
The Visualization workbook introduces you to Tableau, a data analysis and visualization software package that is easy to learn and use. Tableau allows you to connect to and integrate multiple data sources into complex visualizations without writing code, and to shift dynamically between views of your data to build anything from a single visualization to an interactive dashboard containing multiple views. In this workbook, you will learn how to:
• Connect Tableau to a relational database,
• Interact with Tableau’s interface,
• Select, combine, and filter the tables and columns included in visualizations,
• Create bar charts, timeline graphs, and heat maps,
• Group and aggregate data,
• Create a dashboard that combines multiple views of your data.
12.4 Resources
We noted in Section 1.8 the importance of Python, MySQL, and Git/
GitHub for the social scientist who intends to work with large data.
See that section for pointers to useful online resources, and also this
book’s website, at https://github.com/BigDataSocialScience, where
we have collected many useful web links, including the following.
For more on getting started with Anaconda, see Continuum’s
Anaconda documentation [13], Anaconda FAQ [14], and Anaconda
quick start guide [15].
For more information on IPython and the Jupyter notebook server,
see the IPython site [186], IPython documentation [185], Jupyter
Project site [197], and Jupyter Project documentation [196].
For more information on using jupyterhub and nbgrader to host,
distribute, and grade workbooks using a central server, see the
jupyterhub GitHub repository [198], jupyterhub documentation [199],
nbgrader GitHub repository [201], and nbgrader documentation [200].
Bibliography
[1] AAPOR. American Association for Public Opinion Research website. http://www.aapor.org. Accessed February 1, 2016.
[2] Daniel Abadi, Rakesh Agrawal, Anastasia Ailamaki, Magdalena Balazinska,
Philip A Bernstein, Michael J Carey, Surajit Chaudhuri, Jeffrey Dean, An-
Hai Doan, Michael J Franklin, et al. The Beckman Report on Database
Research. ACM SIGMOD Record, 43(3):61–70, 2014. http://beckman.cs.
wisc.edu/beckman-report2013.pdf.
[3] Kevork N. Abazajian, Jennifer K. Adelman-McCarthy, Marcel A. Agüeros,
Sahar S. Allam, Carlos Allende Prieto, Deokkeun An, Kurt S. J. Anderson,
Scott F. Anderson, James Annis, Neta A. Bahcall, et al. The seventh data
release of the Sloan Digital Sky Survey. Astrophysical Journal Supplement
Series, 182(2):543, 2009.
[4] John M. Abowd, John Haltiwanger, and Julia Lane. Integrated longitudinal
employer-employee data for the United States. American Economic Review,
94(2):224–229, 2004.
[5] John M. Abowd, Martha Stinson, and Gary Benedetto. Final report to the
Social Security Administration on the SIPP/SSA/IRS Public Use File Project.
Technical report, Census Bureau, Longitudinal Employer-Household Dy-
namics Program, 2006.
[6] Alessandro Acquisti. The economics and behavioral economics of privacy.
In Julia Lane, Victoria Stodden, Stefan Bender, and Helen Nissenbaum,
editors, Privacy, Big Data, and the Public Good: Frameworks for Engagement,
pages 98–112. Cambridge University Press, 2014.
[7] Christopher Ahlberg, Christopher Williamson, and Ben Shneiderman. Dy-
namic queries for information exploration: An implementation and evalua-
tion. In Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems, pages 619–626. ACM, 1992.
[8] Suha Alawadhi, Armando Aldama-Nalda, Hafedh Chourabi, J. Ramon Gil-
Garcia, Sofia Leung, Sehl Mellouli, Taewoo Nam, Theresa A. Pardo, Hans J.
Scholl, and Shawn Walker. Building understanding of smart city initiatives.
In Electronic Government, pages 40–53. Springer, 2012.
[9] Ed Albanese. Scaling social science with Hadoop. http://blog.cloudera.com/
blog/2010/04/scaling-social-science-with-hadoop/. Accessed February 1,
2016.
[10] Paul D Allison. Missing Data. Sage Publications, 2001.
[11] Amazon. AWS public data sets. http://aws.amazon.com/datasets.
[12] American Statistical Association. ASA Privacy and Confidentiality Subcom-
mittee. http://community.amstat.org/cpc/home.
[13] Continuum Analytics. Anaconda. http://docs.continuum.io/anaconda. Ac-
cessed February 1, 2016.
[14] Continuum Analytics. Anaconda FAQ. http://docs.continuum.io/
anaconda/faq. Accessed February 1, 2016.
[15] Continuum Analytics. Anaconda quick start guide. https://
www.continuum.io/sites/default/files/Anaconda-Quickstart.pdf. Accessed
February 1, 2016.
[16] Francis J. Anscombe. Graphs in statistical analysis. American Statistician,
27(1):17–21, 1973.
[17] Dolan Antenucci, Michael Cafarella, Margaret Levenstein, Christopher Ré,
and Matthew D. Shapiro. Using social media to measure labor market flows.
Technical report, National Bureau of Economic Research, 2014.
[18] Apache Software Foundation. Apache Ambari. http://ambari.apache.org.
Accessed February 1, 2016.
[19] Apache Software Foundation. Apache Hadoop documentation site. https://
hadoop.apache.org/docs/current/. Accessed February 1, 2016.
[20] Apache Software Foundation. Apache Spark documentation site. https://
spark.apache.org/docs/current/. Accessed February 1, 2016.
[21] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy
Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica,
et al. A view of cloud computing. Communications of the ACM, 53(4):50–58,
2010.
[22] Art Branch Inc. SQL Cheatsheet. http://www.sql-tutorial.net/
SQL-Cheat-Sheet.pdf. Accessed December 1, 2015.
[23] Eric Baldeschwieler. Best practices for selecting Apache
Hadoop hardware. Hortonworks, http://hortonworks.com/blog/
best-practices-for-selecting-apache-hadoop-hardware/, September 1,
2011.
[24] Anita Bandrowski, Matthew Brush, Jeffery S. Grethe, Melissa A. Haendel,
David N. Kennedy, Sean Hill, Patrick R. Hof, Maryann E Martone, Maaike
Pols, Serena S. Tan, et al. The Resource Identification Initiative: A cultural
shift in publishing. Brain and Behavior, 2015.
[25] Albert-László Barabási and Réka Albert. Emergence of scaling in random
networks. Science, 286(5439):509–512, 1999.
[26] Michael Barbaro, Tom Zeller, and Saul Hansell. A face is exposed for AOL
searcher no. 4417749. New York Times, August 9, 2006.
[27] Solon Barocas and Helen Nissenbaum. Big data’s end run around procedural
privacy protections. Communications of the ACM, 57(11):31–33, 2014.
[28] Solon Barocas and Helen Nissenbaum. The limits of anonymity and consent
in the big data age. In Julia Lane, Victoria Stodden, Stefan Bender, and Helen
Nissenbaum, editors, Privacy, Big Data, and the Public Good: Frameworks
for Engagement. Cambridge University Press, 2014.
[29] Hilda Bastian. Bad research rising: The 7th Olympiad of re-
search on biomedical publication. Scientific American, http://blogs.
scientificamerican.com/absolutely-maybe/bad-research-rising-the-7th
-olympiad-of-research-on-biomedical-publication/, 2013.
[30] Vladimir Batagelj and Andrej Mrvar. Pajek—program for large network anal-
ysis. Connections, 21(2):47–57, 1998.
[31] Alex Bell. Python for economists. http://cs.brown.edu/~ambell/pyseminar/
pyseminar.html, 2012.
[32] Krithika Bhuvaneshwar, Dinanath Sulakhe, Robinder Gauba, Alex Ro-
driguez, Ravi Madduri, Utpal Dave, Lukasz Lacinski, Ian Foster, Yuriy Gu-
sev, and Subha Madhavan. A case study for cloud based high throughput
analysis of NGS data using the Globus Genomics system. Computational
and Structural Biotechnology Journal, 13:64–74, 2015.
[33] Paul P. Biemer. Total survey error: Design, implementation, and evaluation.
Public Opinion Quarterly, 74(5):817–848, 2010.
[34] Paul P. Biemer. Latent Class Analysis of Survey Error. John Wiley & Sons,
2011.
[35] Paul P. Biemer and Lars E. Lyberg. Introduction to Survey Quality. John
Wiley & Sons, 2003.
[36] Paul P. Biemer and S. L. Stokes. Approaches to modeling measurement error.
In Paul P. Biemer, R. Groves, L. Lyberg, N. Mathiowetz, and S. Sudman,
editors, Measurement Errors in Surveys, pages 54–68. John Wiley, 1991.
[37] Paul P. Biemer and Dennis Trewin. A review of measurement error effects
on the analysis of survey data. In L. Lyberg, P. Biemer, M. Collins, E. De
Leeuw, C. Dippo, N. Schwarz, and D. Trewin, editors, Survey Measurement
and Process Quality, pages 601–632. John Wiley & Sons, 1997.
[38] Ian Bird. Computing for the Large Hadron Collider. Annual Review of Nuclear
and Particle Science, 61:99–118, 2011.
[39] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing
with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly
Media, 2009. Available online at http://www.nltk.org/book/.
[40] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer,
2006.
[41] David M. Blei. Topic modeling. http://www.cs.columbia.edu/~blei/
topicmodeling.html. Accessed February 1, 2016.
[42] David M. Blei and John Lafferty. Topic models. In Ashok Srivastava and
Mehran Sahami, editors, Text Mining: Theory and Applications. Taylor &
Francis, 2009.
[43] David M. Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet allocation.
Journal of Machine Learning Research, 3:993–1022, 2003.
[44] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, Bollywood,
boom-boxes and blenders: Domain adaptation for sentiment classification.
In Proceedings of the Association for Computational Linguistics, 2007.
[45] Katy Börner. Atlas of Science: Visualizing What We Know. MIT Press, 2010.
[46] Philip E. Bourne and J. Lynn Fink. I am not a scientist, I am a number.
PLoS Computational Biology, 4(12):e1000247, 2008.
[47] Jeremy Boy, Ronald Rensink, Enrico Bertini, Jean-Daniel Fekete, et al. A
principled way of assessing visualization literacy. IEEE Transactions on Vi-
sualization and Computer Graphics, 20(12):1963–1972, 2014.
[48] Danah Boyd and Kate Crawford. Critical questions for big data: Provoca-
tions for a cultural, technological, and scholarly phenomenon. Information,
Communication & Society, 15(5):662–679, 2012.
[49] Jordan Boyd-Graber. http://www.umiacs.umd.edu/~jbg/lda_demo. Ac-
cessed February 1, 2016.
[50] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[51] Eric Brewer. CAP twelve years later: How the “rules” have changed. Com-
puter, 45(2):23–29, 2012.
[52] C. Brody, T. de Hoop, M. Vojtkova, R. Warnock, M. Dunbar, P. Murthy, and
S. L. Dworkin. Economic self-help group programs for women’s empower-
ment: A systematic review. Campbell Systematic Reviews, 11(19), 2015.
[53] Jeen Broekstra, Arjohn Kampman, and Frank Van Harmelen. Sesame: A
generic architecture for storing and querying RDF and RDF schema. In The
Semantic Web—ISWC 2002, pages 54–68. Springer, 2002.
[54] Marc Bron, Bouke Huurnink, and Maarten de Rijke. Linking archives using
document enrichment and term selection. In Proceedings of the 15th Inter-
national Conference on Theory and Practice of Digital Libraries: Research and
Advanced Technology for Digital Libraries, pages 360–371. Springer, 2011.
[55] Clair Brown, John Haltiwanger, and Julia Lane. Economic Turbulence: Is a
Volatile Economy Good for America? University of Chicago Press, 2008.
[56] Erik Brynjolfsson, Lorin M. Hitt, and Heekyung Hellen Kim. Strength in
numbers: How does data-driven decisionmaking affect firm performance?
Available at SSRN 1819486, 2011.
[57] Bureau of Labor Statistics. The employment situation—November 2015.
http://www.bls.gov/news.release/archives/empsit_12042015.pdf, Decem-
ber 4, 2015.
[58] Richard V. Burkhauser, Shuaizhang Feng, and Jeff Larrimore. Improving
imputations of top incomes in the public-use current population survey
by using both cell-means and variances. Economics Letters, 108(1):69–72,
2010.
[59] Ronald S. Burt. The social structure of competition. Explorations in Economic
Sociology, 65:103, 1993.
[60] Ronald S. Burt. Structural holes and good ideas. American Journal of Soci-
ology, 110(2):349–399, 2004.
[61] Declan Butler. When Google got flu wrong. Nature, 494(7436):155, 2013.
[62] Stuart K. Card and David Nation. Degree-of-interest trees: A component of
an attention-reactive user interface. In Proceedings of the Working Conference
on Advanced Visual Interfaces, pages 231–245. ACM, 2002.
[63] Jillian B. Carr and Jennifer L. Doleac. The geography, incidence, and un-
derreporting of gun violence: New evidence using ShotSpotter data. Techni-
cal report, http://jenniferdoleac.com/wp-content/uploads/2015/03/Carr_
Doleac_gunfire_underreporting.pdf, 2015.
[64] Charlie Catlett, Tanu Malik, Brett Goldstein, Jonathan Giuffrida, Yetong
Shao, Alessandro Panella, Derek Eder, Eric van Zanten, Robert Mitchum,
Severin Thaler, and Ian Foster. Plenario: An open data discovery and ex-
ploration platform for urban science. Bulletin of the IEEE Computer Society
Technical Committee on Data Engineering, pages 27–42, 2014.
[65] Joe Cecil and Donna Eden. The legal foundations of confidentiality. In Key
Issues in Confidentiality Research: Results of an NSF workshop. National
Science Foundation, 2003.
[66] Centers for Disease Control and Prevention. United States cancer statistics: An interactive cancer atlas. http://nccd.cdc.gov/DCPC_INCA. Accessed
February 1, 2016.
[67] John J. Chai. Correlated measurement errors and the least squares estima-
tor of the regression coefficient. Journal of the American Statistical Associa-
tion, 66(335):478–483, 1971.
[68] Lilyan Chan, Hannah F. Cross, Joseph K. She, Gabriel Cavalli, Hugo F. P.
Martins, and Cameron Neylon. Covalent attachment of proteins to solid
supports and surfaces via Sortase-mediated ligation. PLoS One, 2(11):e1164,
2007.
[69] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection:
A survey. ACM Computing Surveys, 41(3):15, 2009.
[70] O. Chapelle and S. S. Keerthi. Efficient algorithms for ranking with SVMs.
Information Retrieval, 13(3):201–215, 2010.
[71] Kyle Chard, Jim Pruyne, Ben Blaiszik, Rachana Ananthakrishnan, Steven
Tuecke, and Ian Foster. Globus data publication as a service: Lowering
barriers to reproducible science. In 11th IEEE International Conference on
eScience, 2015.
[72] Kyle Chard, Steven Tuecke, and Ian Foster. Efficient and secure transfer,
synchronization, and sharing of big data. Cloud Computing, IEEE, 1(3):46–
55, 2014. See also https://www.globus.org.
[73] Nitesh V. Chawla. Data mining for imbalanced datasets: An overview. In
Oded Maimon and Lior Rokach, editors, The Data Mining and Knowledge
Discovery Handbook, pages 853–867. Springer, 2005.
[74] Raj Chetty. The transformative potential of administrative data for microe-
conometric research. http://conference.nber.org/confer/2012/SI2012/LS/
ChettySlides.pdf. Accessed February 1, 2016, 2012.
[75] James O. Chipperfield and Raymond L. Chambers. Using the bootstrap to
account for linkage errors when analysing probabilistically linked categorical
data. Journal of Official Statistics, 31(3):397–414, 2015.
[76] Peter Christen. Data Matching: Concepts and Techniques for Record Linkage,
Entity Resolution, and Duplicate Detection. Springer, 2012.
[77] Peter Christen. A survey of indexing techniques for scalable record linkage
and deduplication. IEEE Transactions on Knowledge and Data Engineering,
24(9):1537–1555, 2012.
[78] Claire Clarke. Editing big data with machine learning methods. Paper pre-
sented at the Australian Bureau of Statistics Symposium, Canberra, 2014.
[79] William S. Cleveland and Robert McGill. Graphical perception: Theory, ex-
perimentation, and application to the development of graphical methods.
Journal of the American Statistical Association, 79(387):531–554, 1984.
[80] C. Clifton, M. Kantarcioglu, A. Doan, G. Schadow, J. Vaidya, A.K. Elma-
garmid, and D. Suciu. Privacy-preserving data integration and sharing. In
G. Das, B. Liu, and P. S. Yu, editors, 9th ACM SIGMOD Workshop on Re-
search Issues in Data Mining and Knowledge Discovery, pages 19–26. ACM,
June 2006.
[81] Cloudera. Cloudera Manager. https://www.cloudera.com/content/www/
en-us/products/cloudera-manager.html. Accessed April 16, 2016.
[82] William G. Cochran. Errors of measurement in statistics. Technometrics,
10(4):637–666, 1968.
[83] Confidentiality and Data Access Committee. Federal Committee on Statistics
and Methodology. http://fcsm.sites.usa.gov/committees/cdac/. Accessed
April 16, 2016.
[84] Consumer Financial Protection Bureau. Home mortgage disclosure act
data. http://www.consumerfinance.gov/hmda/learn-more. Accessed April
16, 2016.
[85] Paolo Corti, Thomas J. Kraft, Stephen Vincent Mather, and Bborie Park.
PostGIS Cookbook. Packt Publishing, 2014.
[86] Koby Crammer and Yoram Singer. On the algorithmic implementation of
multiclass kernel-based vector machines. Journal of Machine Learning Re-
search, 2:265–292, 2002.
[87] Patricia J. Crossno, Douglas D. Cline, and Jeffrey N Jortner. A heterogeneous
graphics procedure for visualization of massively parallel solutions. ASME
FED, 156:65–65, 1993.
[88] John Czajka, Craig Schneider, Amang Sukasih, and Kevin Collins. Mini-
mizing disclosure risk in HHS open data initiatives. Technical report, US
Department of Health & Human Services, 2014.
[89] DataCite. DataCite homepage. https://www.datacite.org. Accessed February
1, 2016.
[90] Ton De Waal, Jeroen Pannekoek, and Sander Scholtus. Handbook of Statis-
tical Data Editing and Imputation. John Wiley & Sons, 2011.
[91] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing
on large clusters. In Proceedings of the 6th Conference on Symposium on
Opearting Systems Design & Implementation—Volume 6, OSDI’04. USENIX
Association, 2004.
[92] Danny DeBelius. Let’s tesselate: Hexagons for tile grid maps. NPR Visuals
Team Blog, http://blog.apps.npr.org/2015/05/11/hex-tile-maps.html, May
11, 2015.
[93] Ryan A. Decker, John Haltiwanger, Ron S. Jarmin, and Javier Miranda.
Where has all the skewness gone? The decline in high-growth (young) firms
in the US. European Economic Review, to appear.
[94] David J. DeWitt and Michael Stonebraker. MapReduce: A major step back-
wards. http://www.dcs.bbk.ac.uk/~dell/teaching/cc/paper/dbc08/dewitt_
mr_db.pdf, January 17, 2008.
[95] John J. Donohue III and Justin Wolfers. Uses and abuses of empirical
evidence in the death penalty debate. Technical report, National Bureau of
Economic Research, 2006.
[96] Pat Doyle, Julia I. Lane, Jules J. M. Theeuwes, and Laura V. Zayatz. Confi-
dentiality, Disclosure, and Data Access: Theory and Practical Applications for
Statistical Agencies. Elsevier Science, 2001.
[97] Jörg Drechsler. Synthetic Datasets for Statistical Disclosure Control: Theory
and Implementation. Springer, 2011.
[98] Lian Duan, Lida Xu, Ying Liu, and Jun Lee. Cluster-based outlier detection.
Annals of Operations Research, 168(1):151–168, 2009.
[99] Eva H. DuGoff, Megan Schuler, and Elizabeth A. Stuart. Generalizing ob-
servational study results: Applying propensity score methods to complex
surveys. Health Services Research, 49(1):284–303, 2014.
[100] G. Duncan, M. Elliot, and J. J. Salazar-González. Statistical Confidentiality:
Principles and Practice. Springer, 2011.
[101] Cody Dunne and Ben Shneiderman. Motif simplification: Improving network
visualization readability with fan, connector, and clique glyphs. In Proceed-
ings of the SIGCHI Conference on Human Factors in Computing Systems,
pages 3247–3256. ACM, 2013.
[102] Ted Dunning. Accurate methods for the statistics of surprise and coinci-
dence. Computational Linguistics, 19(1):61–74, 1993.
[103] Economic and Social Research Council. Administrative Data Research Net-
work, 2016.
[104] Liran Einav and Jonathan D. Levin. The data revolution and economic anal-
ysis. Technical report, National Bureau of Economic Research, 2013.
[105] B. Elbel, J. Gyamfi, and R. Kersh. Child and adolescent fast-food choice and
the influence of calorie labeling: A natural experiment. International Journal
of Obesity, 35(4):493–500, 2011.
[106] Peter Elias. A European perspective on research and big data access. In Ju-
lia Lane, Victoria Stodden, Stefan Bender, and Helen Nissenbaum, editors,
Privacy, Big Data, and the Public Good: Frameworks for Engagement, pages
98–112. Cambridge University Press, 2014.
[107] Joshua Elliott, David Kelly, James Chryssanthacopoulos, Michael Glotter,
Kanika Jhunjhnuwala, Neil Best, Michael Wilde, and Ian Foster. The parallel
system for integrating impact models and sectors (pSIMS). Environmental
Modelling & Software, 62:509–516, 2014.
[108] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios.
Duplicate record detection: A survey. IEEE Transactions on Knowledge and
Data Engineering, 19(1):1–16, 2007.
[109] David S. Evans. Tests of alternative theories of firm growth. Journal of
Political Economy, 95:657–674, 1987.
[110] J. A. Evans and J. G. Foster. Metaknowledge. Science, 331(6018):721–725,
2011.
[111] Jianqing Fan, Fang Han, and Han Liu. Challenges of big data analysis.
National Science Review, 1(2):293–314, 2014.
[112] Jianqing Fan and Yuan Liao. Endogeneity in ultrahigh dimension. Technical
report, Princeton University, 2012.
[113] Jianqing Fan and Yuan Liao. Endogeneity in high dimensions. Annals of
Statistics, 42(3):872, 2014.
[114] Jianqing Fan, Richard Samworth, and Yichao Wu. Ultrahigh dimensional
feature selection: Beyond the linear model. Journal of Machine Learning
Research, 10:2013–2038, 2009.
[115] Jean-Daniel Fekete. ProgressiVis: A toolkit for steerable progressive analyt-
ics and visualization. Paper presented at 1st Workshop on Data Systems for
Interactive Analysis, Chicago, IL, October 26, 2015.
[116] Jean-Daniel Fekete and Catherine Plaisant. Interactive information visual-
ization of a million items. In IEEE Symposium on Information Visualization,
pages 117–124. IEEE, 2002.
[117] Ronen Feldman and James Sanger. Text Mining Handbook: Advanced Ap-
proaches in Analyzing Unstructured Data. Cambridge University Press, 2006.
[118] Ivan P. Fellegi and Alan B. Sunter. A theory for record linkage. Journal of
the American Statistical Association, 64(328):1183–1210, 1969.
[119] Stephen Few. Now You See It: Simple Visualization Techniques for Quantita-
tive Analysis. Analytics Press, 2009.
[120] Stephen Few. Information Dashboard Design: Displaying Data for At-a-Glance
Monitoring. Analytics Press, 2013.
[121] Roy T. Fielding and Richard N. Taylor. Principled design of the modern Web
architecture. ACM Transactions on Internet Technology, 2(2):115–150, 2002.
[122] Figshare. Figshare homepage. http://figshare.com. Accessed February 1,
2016.
[123] Danyel Fisher, Igor Popov, Steven Drucker, and m. c. schraefel. Trust me, I’m
partially right: Incremental visualization lets analysts explore large datasets
faster. In Proceedings of the SIGCHI Conference on Human Factors in Com-
puting Systems, pages 1673–1682. ACM, 2012.
[124] Peter Flach. Machine Learning: The Art and Science of Algorithms That Make
Sense of Data. Cambridge University Press, 2012.
[125] Blaz Fortuna, Marko Grobelnik, and Dunja Mladenic. OntoGen: Semi-
automatic ontology editor. In Proceedings of the 2007 Conference on Human
Interface: Part II, pages 309–318. Springer, 2007.
[126] Lucia Foster, Ron S. Jarmin, and T. Lynn Riggs. Resolving the tension
between access and confidentiality: Past experience and future plans at the
US Census Bureau. Technical Report 09-33, US Census Bureau Center for
Economic Studies, 2009.
[127] Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, and Paul
Gauthier. Cluster-based scalable network services. ACM SIGOPS Operating
Systems Review, 31(5), 1997.
[128] W. N. Francis and H. Kucera. Brown corpus manual. Technical report,
Department of Linguistics, Brown University, Providence, Rhode Island, US,
1979.
[129] Linton C. Freeman. Centrality in social networks conceptual clarification.
Social Networks, 1(3):215–239, 1979.
[130] Wayne A. Fuller. Regression estimation in the presence of measurement
error. In Paul P. Biemer, Robert M. Groves, Lars E. Lyberg, Nancy A.
Mathiowetz, and Seymour Sudman, editors, Measurement Errors in Surveys,
pages 617–635. John Wiley & Sons, 1991.
[131] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images. In Glenn Shafer and Judea Pearl, editors,
Readings in Uncertain Reasoning, pages 452–472. Morgan Kaufmann, 1990.
[132] Maria Girone. CERN database services for the LHC computing grid. In Jour-
nal of Physics: Conference Series, volume 119, page 052017. IOP Publishing,
2008.
[133] Michelle Girvan and Mark E. J. Newman. Community structure in social
and biological networks. Proceedings of the National Academy of Sciences,
99(12):7821–7826, 2002.
[134] Wolfgang Glänzel. Bibliometric methods for detecting and analysing emerg-
ing research topics. El Profesional de la Información, 21(1):194–201, 2012.
[135] Michael Glueck, Azam Khan, and Daniel J. Wigdor. Dive in! Enabling
progressive loading for real-time navigation of data visualizations. In Pro-
ceedings of the SIGCHI Conference on Human Factors in Computing Systems,
pages 561–570. ACM, 2014.
[136] Leopold Granquist and John G. Kovar. Editing of survey data: How much
is enough? In L. Lyberg, P. Biemer, M. Collins, E. De Leeuw, C. Dippo,
N. Schwarz, and D. Trewin, editors, Survey Measurement and Process Qual-
ity, pages 415–435. John Wiley & Sons, 1997.
[137] Jim Gray. The transaction concept: Virtues and limitations. In Proceedings
of the Seventh International Conference on Very Large Data Bases, volume 7,
pages 144–154, 1981.
[138] Donald P. Green and Holger L. Kern. Modeling heterogeneous treatment
effects in survey experiments with Bayesian additive regression trees. Public
Opinion Quarterly, 76:491–511, 2012.
[139] Daniel Greenwood, Arkadiusz Stopczynski, Brian Sweatt, Thomas Hardjono,
and Alex Pentland. The new deal on data: A framework for institutional
controls. In Julia Lane, Victoria Stodden, Stefan Bender, and Helen Nis-
senbaum, editors, Privacy, Big Data, and the Public Good: Frameworks for
Engagement, page 192. Cambridge University Press, 2014.
[140] Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. Proceedings
of the National Academy of Sciences, 101(Suppl. 1):5228–5235, 2004.
[141] Justin Grimmer and Brandon M. Stewart. Text as data: The promise and
pitfalls of automatic content analysis methods for political texts. Political
Analysis, 21(3):267–297, 2013.
[142] William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable
Parallel Programming with the Message-Passing Interface. MIT Press, 2014.
[143] Robert M. Groves. Survey Errors and Survey Costs. John Wiley & Sons,
2004.
[144] Jiancheng Guan and Nan Ma. China’s emerging presence in nanoscience
and nanotechnology: A comparative bibliometric study of several
nanoscience ‘giants’. Research Policy, 36(6):880–886, 2007.
[145] Laurel L. Haak, Martin Fenner, Laura Paglione, Ed Pentz, and Howard Rat-
ner. ORCID: A system to uniquely identify researchers. Learned Publishing,
25(4):259–264, 2012.
[146] Marc Haber, Daniel E. Platt, Maziar Ashrafian Bonab, Sonia C. Youhanna,
David F. Soria-Hernanz, Begoña Martínez-Cruz, Bouchra Douaihy, Michella
Ghassibe-Sabbagh, Hoshang Rafatpanah, Mohsen Ghanbari, et al.
Afghanistan’s ethnic groups share a Y-chromosomal heritage structured by
historical events. PLoS One, 7(3):e34288, 2012.
[147] Apache Hadoop. HDFS architecture. http://spark.apache.org/docs/latest/
programming-guide.html#transformations.
[148] Jens Hainmueller and Chad Hazlett. Kernel regularized least squares: Re-
ducing misspecification bias with a flexible and interpretable machine learn-
ing approach. Political Analysis, 22(2):143–168, 2014.
[149] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effec-
tiveness of data. IEEE Intelligent Systems, 24(2):8–12, 2009.
[150] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reute-
mann, and Ian H Witten. The Weka data mining software: An update. ACM
SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[151] P. Hall and H. Miller. Using generalized correlation to effect variable selection
in very high dimensional problems. Journal of Computational and Graphical
Statistics, 18:533–550, 2009.
[152] John Haltiwanger, Ron S. Jarmin, and Javier Miranda. Who creates jobs?
Small versus large versus young. Review of Economics and Statistics,
95(2):347–361, 2013.
[153] Derek Hansen, Ben Shneiderman, and Marc A. Smith. Analyzing Social
Media Networks with NodeXL: Insights from a Connected World. Morgan
Kaufmann, 2010.
[154] Morris H. Hansen, William N. Hurwitz, and William G. Madow. Sample
Survey Methods and Theory. John Wiley & Sons, 1993.
[155] Tim Harford. Big data: A big mistake? Significance, 11(5):14–19, 2014.
[156] Lane Harrison, Katharina Reinecke, and Remco Chang. Baby Name Voyager.
http://www.babynamewizard.com/voyager/. Accessed February 1, 2016.
[157] Lane Harrison, Katharina Reinecke, and Remco Chang. Infographic aesthet-
ics: Designing for the first impression. In Proceedings of the 33rd Annual
ACM Conference on Human Factors in Computing Systems, pages 1187–1190.
ACM, 2015.
[158] Trevor Hastie and Rob Tibshirani. Statistical learning course. https://
lagunita.stanford.edu/courses/HumanitiesandScience/StatLearning/
Winter2015/about. Accessed February 1, 2016.
[159] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of
Statistical Learning. Springer, 2001.
[160] Erica Check Hayden. Researchers wrestle with a privacy problem. Nature,
525(7570):440, 2015.
[161] Erika Check Hayden. A broken contract. Nature, 486(7403):312–314, 2012.
[162] Zengyou He, Xiaofei Xu, and Shengchun Deng. Discovering cluster-based
local outliers. Pattern Recognition Letters, 24(9):1641–1650, 2003.
[163] Kieran Healy and James Moody. Data visualization in sociology. Annual
Review of Sociology, 40:105–128, 2014.
[164] Nathalie Henry and Jean-Daniel Fekete. MatrixExplorer: A dual-
representation system to explore social networks. IEEE Transactions on
Visualization and Computer Graphics, 12(5):677–684, 2006.
[165] Thomas N. Herzog, Fritz J. Scheuren, and William E. Winkler. Data Quality
and Record Linkage Techniques. Springer, 2007.
[166] Kashmir Hill. How Target figured out a teen girl was pregnant before her
father did. Forbes, http://www.forbes.com/sites/kashmirhill/2012/02/16/
how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/#
7280148734c6, February 16, 2012.
[167] Thomas Hofmann. Probabilistic latent semantic analysis. In Proceedings of
Uncertainty in Artificial Intelligence, 1999.
[168] Torsten Hothorn. Cran task view: Machine learning & statistical learn-
ing. https://cran.r-project.org/web/views/MachineLearning.html. Accessed
February 1, 2016.
[169] Joop Hox. Multilevel Analysis: Techniques and Applications. Routledge,
2010.
[170] Yuening Hu, Ke Zhai, Vlad Eidelman, and Jordan Boyd-Graber. Polylingual
tree-based topic models for translation domain adaptation. In Proceedings
of the 52nd Annual Meeting of the Association for Computational Linguistics,
2014.
[171] Anna Huang. Similarity measures for text document clustering. Paper pre-
sented at New Zealand Computer Science Research Student Conference,
Christchurch, New Zealand, April 14–18, 2008.
[172] Jian Huang, Seyda Ertekin, and C. Lee Giles. Efficient name disambiguation
for large-scale databases. In Knowledge Discovery in Databases: PKDD 2006,
pages 536–544. Springer, 2006.
[173] Human Microbiome Jumpstart Reference Strains Consortium, K. E. Nelson,
G. M. Weinstock, et al. A catalog of reference genomes from the human
microbiome. Science, 328(5981):994–999, 2010.
[174] Anco Hundepool, Josep Domingo-Ferrer, Luisa Franconi, Sarah Giessing,
Rainer Lenz, Jane Longhurst, E. Schulte Nordholt, Giovanni Seri, and
P. Wolf. Handbook on statistical disclosure control. Technical report, Net-
work of Excellence in the European Statistical System in the Field of Statis-
tical Disclosure Control, 2010.
[175] Kaye Husband Fealing, Julia Ingrid Lane, Jack Marburger, and Stephanie
Shipp. Science of Science Policy: The Handbook. Stanford University Press,
2011.
[176] Joseph G. Ibrahim and Ming-Hui Chen. Power prior distributions for regres-
sion models. Statistical Science, 15(1):46–60, 2000.
[177] ICML. International conference on machine learning. http://icml.cc/. Ac-
cessed February 1, 2016.
[178] Kosuke Imai, Marc Ratkovic, et al. Estimating treatment effect heterogeneity
in randomized program evaluation. Annals of Applied Statistics, 7(1):443–
470, 2013.
[179] Guido W. Imbens and Donald B. Rubin. Causal Inference in Statistics, Social,
and Biomedical Sciences. Cambridge University Press, 2015.
[180] Alfred Inselberg. Parallel Coordinates. Springer, 2009.
[181] Institute for Social Research. Panel study of income dynamics. http://
psidonline.isr.umich.edu. Accessed February 1, 2016.
[182] Institute for Social Research. PSID file structure and merging PSID data files.
Technical report. http://psidonline.isr.umich.edu/Guide/FileStructure.pdf.
September 17, 2013.
[183] International Household Survey Network. Data Dissemination. http://www.
ihsn.org/home/projects/dissemination. Accessed April 16, 2016.
[184] J. P. A. Ioannidis. Why most published research findings are false. PLoS
Medicine, 2(8):e124, 2005.
[185] IPython development team. IPython documentation. http://ipython.
readthedocs.org/. Accessed February 1, 2016.
[186] IPython development team. IPython website. http://ipython.org/. Accessed
February 1, 2016.
[187] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An
Introduction to Statistical Learning. Springer, 2013.
[188] Lilli Japec, Frauke Kreuter, Marcus Berg, Paul Biemer, Paul Decker, Cliff
Lampe, Julia Lane, Cathy O’Neil, and Abe Usher. Big data in survey research:
AAPOR Task Force Report. Public Opinion Quarterly, 79(4):839–880, 2015.
[189] Ron S. Jarmin, Thomas A. Louis, and Javier Miranda. Expanding the role
of synthetic data at the US Census Bureau. Statistical Journal of the IAOS,
30(2):117–121, 2014.
[190] Ron S. Jarmin and Javier Miranda. The longitudinal business database.
Available at SSRN 2128793, 2002.
[191] Rachel Jewkes, Yandisa Sikweyiya, Robert Morrell, and Kristin Dunkle. The
relationship between intimate partner violence, rape and HIV amongst South
African men: A cross-sectional study. PLoS One, 6(9):e24256, 2011.
[192] Brian Johnson and Ben Shneiderman. Tree-maps: A space-filling approach
to the visualization of hierarchical information structures. In Proceedings of
the IEEE Conference on Visualization, pages 284–291. IEEE, 1991.
[193] Paul Jones and Peter Elias. Administrative data as a research resource: A se-
lected audit. Technical report, ESRC National Centre for Research Methods,
2006.
[194] JOS. Journal of Official Statistics website. http://www.jos.nu. Accessed
February 1, 2016.
[195] JPC. Journal of Privacy and Confidentiality. http://repository.cmu.edu/
jpc/. Accessed April 16, 2016.
[196] Jupyter. Jupyter project documentation. http://jupyter.readthedocs.org/.
Accessed February 1, 2016.
[197] Jupyter. Jupyter project website. http://jupyter.org/. Accessed February 1,
2016.
[198] Jupyter. jupyterhub GitHub repository. https://github.com/jupyter/
jupyterhub/. Accessed February 1, 2016.
[199] Jupyter. jupyterhub documentation. http://jupyterhub.readthedocs.org/.
Accessed February 1, 2016.
[200] Jupyter. nbgrader documentation. http://nbgrader.readthedocs.org/. Ac-
cessed February 1, 2016.
[201] Jupyter. nbgrader GitHub repository. https://github.com/jupyter/
nbgrader/. Accessed February 1, 2016.
[202] Felichism Kabo, Yongha Hwang, Margaret Levenstein, and Jason Owen-
Smith. Shared paths to the lab: A sociospatial network analysis of collabo-
ration. Environment and Behavior, 47(1):57–84, 2015.
[203] Alan Karr and Jerome P. Reiter. Analytical frameworks for data release: A
statistical view. In Julia Lane, Victoria Stodden, Stefan Bender, and Helen
Nissenbaum, editors, Privacy, Big Data, and the Public Good: Frameworks
for Engagement. Cambridge University Press, 2014.
[204] KDD. ACM international conference on knowledge discovery and data mining
(KDD). http://www.kdd.org. Accessed February 1, 2016.
[205] Sallie Ann Keller, Steven E. Koonin, and Stephanie Shipp. Big data and city
living: What can it do for us? Significance, 9(4):4–7, 2012.
[206] Keshif. Infographics aesthetics dataset browser. http://keshif.me/demo/
infographics_aesthetics. Accessed February 1, 2016.
[207] Satkartar K. Kinney, Jerome P. Reiter, Arnold P. Reznek, Javier Miranda,
Ron S. Jarmin, and John M. Abowd. Towards unrestricted public use busi-
ness microdata: The synthetic Longitudinal Business Database. Interna-
tional Statistical Review, 79(3):362–384, 2011.
[208] Andy Kirk. Data Visualization: A Successful Design Process. Packt Publish-
ing, 2012.
[209] Tibor Kiss and Jan Strunk. Unsupervised multilingual sentence boundary
detection. Computational Linguistics, 32(4):485–525, 2006.
[210] Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ziad Obermeyer.
Prediction policy problems. American Economic Review, 105(5):491–95,
2015.
[211] Ulrich Kohler and Frauke Kreuter. Data Analysis Using Stata, 3rd Edition.
Stata Press, 2012.
[212] Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia,
Chris Dyer, and Noah A. Smith. A dependency parser for tweets. In Pro-
ceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 1001–1012. Association for Computational Lin-
guistics, October 2014.
[213] Hanna Köpcke, Andreas Thor, and Erhard Rahm. Evaluation of entity res-
olution approaches on real-world match problems. Proceedings of the VLDB
Endowment, 3(1–2):484–493, 2010.
[214] Menno-Jan Kraak. Mapping Time: Illustrated by Minard’s Map of Napoleon’s
Russian Campaign of 1812. ESRI Press, 2014.
[215] Frauke Kreuter and Roger D. Peng. Extracting information from big data:
Issues of measurement, inference, and linkage. In Julia Lane, Victoria Stod-
den, Stefan Bender, and Helen Nissenbaum, editors, Privacy, Big Data, and
the Public Good: Frameworks for Engagement, pages 257–275. Cambridge
University Press, 2014.
[216] H. W. Kuhn. The Hungarian method for the assignment problem. Naval
Research Logistics, 52(1):7–21, 2005.
[217] Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer Science
& Business Media, 2013.
[218] Solomon Kullback and Richard A. Leibler. On information and sufficiency.
Annals of Mathematical Statistics, 22(1):79–86, 1951.
[219] Mohit Kumar, Rayid Ghani, and Zhu-Song Mei. Data mining to predict
and prevent errors in health insurance claims processing. In Proceedings of
the 16th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, KDD ’10, pages 65–74. ACM, 2010.
[220] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Condi-
tional random fields: Probabilistic models for segmenting and labeling se-
quence data. In Proceedings of the Eighteenth International Conference on
Machine Learning, pages 282–289. Morgan Kaufmann, 2001.
[221] Partha Lahiri and Michael D Larsen. Regression analysis with linked data.
Journal of the American Statistical Association, 100(469):222–230, 2005.
[222] Himabindu Lakkaraju, Everaldo Aguiar, Carl Shan, David Miller, Nasir
Bhanpuri, Rayid Ghani, and Kecia L. Addison. A machine learning frame-
work to identify students at risk of adverse academic outcomes. In Pro-
ceedings of the 21th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, KDD ’15, pages 1909–1918. ACM, 2015.
[223] Heidi Lam, Enrico Bertini, Petra Isenberg, Catherine Plaisant, and Sheelagh
Carpendale. Empirical studies in information visualization: Seven scenar-
ios. IEEE Transactions on Visualization and Computer Graphics, 18(9):1520–
1536, 2012.
[224] T. Landauer and S. Dumais. Solutions to Plato’s problem: The latent seman-
tic analysis theory of acquisition, induction and representation of knowledge.
Psychological Review, 104(2):211–240, 1997.
[225] Julia Lane. Optimizing access to micro data. Journal of Official Statistics,
23:299–317, 2007.
[226] Julia Lane and Victoria Stodden. What? Me worry? what to do about
privacy, big data, and statistical research. AMSTAT News, 438:14, 2013.
[227] Julia Lane, Victoria Stodden, Stefan Bender, and Helen Nissenbaum, edi-
tors. Privacy, Big Data, and the Public Good: Frameworks for Engagement.
Cambridge University Press, 2014.
[228] Julia I. Lane, Jason Owen-Smith, Rebecca F. Rosen, and Bruce A. Weinberg.
New linked data on research investments: Scientific workforce, productivity,
and public value. Research Policy, 44:1659–1671, 2015.
[229] Douglas Laney. 3D data management: Controlling data volume, velocity,
and variety. Technical report, META Group, February 2001.
[230] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. The
parable of Google Flu: Traps in big data analysis. Science, 343(14 March),
2014.
[231] Sinead C. Leahy, William J. Kelly, Eric Altermann, Ron S. Ronimus, Carl J
Yeoman, Diana M Pacheco, Dong Li, Zhanhao Kong, Sharla McTavish, Carrie
Sang, C. Lambie, Peter H. Janssen, Debjit Dey, and Graeme T. Attwood. The
genome sequence of the rumen methanogen Methanobrevibacter ruminan-
tium reveals new possibilities for controlling ruminant methane emissions.
PLoS One, 2010. DOI: 10.1371/journal.pone.0008926.
[232] Whay C. Lee and Edward A. Fox. Experimental comparison of schemes for
interpreting Boolean queries. Technical Report TR-88-27, Computer Science,
Virginia Polytechnic Institute and State University, 1988.
[233] Yang Lee, WooYoung Chung, Stuart Madnick, Richard Wang, and Hongyun
Zhang. On the rise of the Chief Data Officers in a world of big data. In
Pre-ICIS 2012 SIM Academic Workshop, Orlando, Florida, 2012.
[234] Steven D. Levitt and Thomas J. Miles. Economic contributions to the under-
standing of crime. Annual Review of Law Social Science, 2:147–164, 2006.
[235] David D. Lewis. Naive (Bayes) at forty: The independence assumption in
information retrieval. In Proceedings of European Conference of Machine
Learning, pages 4–15, 1998.
[236] D. Lifka, I. Foster, S. Mehringer, M. Parashar, P. Redfern, C. Stewart, and
S. Tuecke. XSEDE cloud survey report. Technical report, National Science
Foundation, USA, http://hdl.handle.net/2142/45766, 2013.
[237] Jennifer Lin and Martin Fenner. Altmetrics in evolution: Defining and re-
defining the ontology of article-level metrics. Information Standards Quar-
terly, 25(2):20, 2013.
[238] Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce.
Morgan & Claypool Publishers, 2010.
[239] Lauro Lins, James T Klosowski, and Carlos Scheidegger. Nanocubes for
real-time exploration of spatiotemporal datasets. IEEE Transactions on Visu-
alization and Computer Graphics, 19(12):2456–2465, 2013.
[240] Roderick J. A. Little and Donald B Rubin. Statistical Analysis with Missing
Data. John Wiley & Sons, 2014.
[241] Zhicheng Liu and Jeffrey Heer. The effects of interactive latency on ex-
ploratory visual analysis. IEEE Transactions on Visualization and Computer
Graphics, 20(12):2122–2131, 2014.
[242] Glenn K. Lockwood. Conceptual overview of map-reduce and hadoop. http://
www.glennklockwood.com/data-intensive/hadoop/overview.html, October
9, 2015.
[243] Sharon Lohr. Sampling: Design and Analysis. Cengage Learning, 2009.
[244] Alan M. MacEachren, Stephen Crawford, Mamata Akella, and Gene
Lengerich. Design and implementation of a model, web-based, GIS-enabled
cancer atlas. Cartographic Journal, 45(4):246–260, 2008.
[245] Jock Mackinlay. Automating the design of graphical presentations of rela-
tional information. ACM Transactions on Graphics, 5(2):110–141, 1986.
[246] Waqas Ahmed Malik, Antony Unwin, and Alexander Gribov. An interactive
graphical system for visualizing data quality–tableplot graphics. In Classifi-
cation as a Tool for Research, pages 331–339. Springer, 2010.
[247] K. Malmkjær. The Linguistics Encyclopedia. Routledge, 2002.
[248] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Intro-
duction to Information Retrieval. Cambridge University Press, 2008.
[249] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel,
Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural
language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, 2014.
[250] John H Marburger. Wanted: Better benchmarks. Science, 308(5725):1087,
2005.
[251] Mitchell P. Marcus, Beatrice Santorini, and Mary A. Marcinkiewicz. Building
a large annotated corpus of English: The Penn treebank. Computational
Linguistics, 19(2):313–330, 1993.
[252] Alexandre Mas and Enrico Moretti. Peers at work. American Economic Re-
view, 99(1):112–145, 2009.
[253] Girish Maskeri, Santonu Sarkar, and Kenneth Heafield. Mining business
topics in source code using latent Dirichlet allocation. In Proceedings of the
1st India Software Engineering Conference, pages 113–120. ACM, 2008.
[254] Erika McCallister, Timothy Grance, and Karen A Scarfone. SP 800-122.
Guide to Protecting the Confidentiality of Personally Identifiable Information
(PII). National Institute of Standards and Technology, 2010.
[255] Andrew Kachites McCallum. Mallet: A machine learning for language toolkit.
http://mallet.cs.umass.edu, 2002.
[256] Edgar Meij, Marc Bron, Laura Hollink, Bouke Huurnink, and Maarten Rijke.
Learning semantic query suggestions. In Proceedings of the 8th International
Semantic Web Conference, ISWC ’09, pages 424–440. Springer, 2009.
[257] Bruce D. Meyer, Wallace K. C. Mok, and James X. Sullivan. Household
surveys in crisis. Journal of Economic Perspectives, 29(4):199–226, 2015.
[258] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[259] C. L. Moffatt. Visual representation of SQL joins. http://www.codeproject.
com/Articles/33052/Visual-Representation-of-SQL-Joins, February 3,
1999.
[260] Anthony Molinaro. SQL Cookbook: Query Solutions and Techniques for
Database Developers. O’Reilly Media, 2005.
[261] Stephen L. Morgan and Christopher Winship. Counterfactuals and Causal
Inference. Cambridge University Press, 2014.
[262] Peter Stendahl Mortensen, Carter Walter Bloch, et al. Oslo Manual: Guide-
lines for Collecting and Interpreting Innovation Data. Organisation for Eco-
nomic Co-operation and Development, 2005.
[263] Sougata Mukherjea. Information retrieval and knowledge discovery utilising
a biomedical semantic web. Briefings in Bioinformatics, 6(3):252–262, 2005.
[264] Tamara Munzner. Visualization Analysis and Design. CRC Press, 2014.
[265] Joe Murphy, Michael W Link, Jennifer Hunter Childs, Casey Langer Tesfaye,
Elizabeth Dean, Michael Stern, Josh Pasek, Jon Cohen, Mario Callegaro,
and Paul Harwood. Social media in public opinion research: Report of the
AAPOR Task Force on emerging technologies in public opinion research.
Public Opinion Quarterly, 78(4):788–794, 2014.
[266] Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large
sparse datasets. In IEEE Symposium on Security and Privacy, pages 111–125.
IEEE, 2008.
[267] Kalaivany Natarajan, Jiuyong Li, and Andy Koronios. Data Mining Tech-
niques for Data Cleaning. Springer, 2010.
[268] National Science Foundation. Download awards by year. http://nsf.gov/
awardsearch/download.jsp. Accessed February 1, 2016.
[269] Roberto Navigli, Stefano Faralli, Aitor Soroa, Oier de Lacalle, and Eneko
Agirre. Two birds with one stone: Learning semantic models for text cate-
gorization and word sense disambiguation. In Proceedings of the 20th ACM
International Conference on Information and Knowledge Management. ACM,
2011.
[270] Robert K. Nelson. Mining the dispatch. http://dsl.richmond.edu/dispatch/,
2010.
[271] Mark Newman. A measure of betweenness centrality based on random
walks. Social Networks, 27(1):39–54, 2005.
[272] Mark Newman. Networks: An Introduction. Oxford University Press, 2010.
[273] Cameron Neylon. Altmetrics: What are they good for? http://blogs.plos.
org/opens/2014/10/03/altmetrics-what-are-they-good-for/, 2014.
[274] Cameron Neylon. The road less travelled. In Sarita Albagli, Maria Lucia
Maciel, and Alexandre Hannud Abdo, editors, Open Science, Open Issues.
IBICT, UNIRIO, 2015.
[275] Cameron Neylon, Michelle Willmers, and Thomas King. Impact beyond ci-
tation: An introduction to Altmetrics. http://hdl.handle.net/11427/2314,
2014.
[276] Viet-An Nguyen, Jordan Boyd-Graber, and Philip Resnik. SITS: A hierar-
chical nonparametric model using speaker identity for topic segmentation in
multiparty conversations. In Proceedings of the Association for Computational
Linguistics, 2012.
[277] Viet-An Nguyen, Jordan Boyd-Graber, and Philip Resnik. Lexical and hi-
erarchical topic regression. In Advances in Neural Information Processing
Systems, 2013.
[278] Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik, and Jonathan Chang.
Learning a concept hierarchy from multi-labeled documents. In Proceedings
of the Annual Conference on Neural Information Processing Systems. Morgan
Kaufmann, 2014.
[279] Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik, and Kristina Miler. Tea
Party in the House: A hierarchical ideal point topic model and its application
to Republican legislators in the 112th Congress. In Association for Compu-
tational Linguistics, 2015.
[280] Vlad Niculae, Srijan Kumar, Jordan Boyd-Graber, and Cristian Danescu-
Niculescu-Mizil. Linguistic harbingers of betrayal: A case study on an online
strategy game. In Association for Computational Linguistics, 2015.
[281] Michael Nielsen. Reinventing Discovery: The New Era of Networked Science.
Princeton University Press, 2012.
[282] NIPS. Annual conference on neural information processing systems (NIPS).
https://nips.cc/. Accessed February 1, 2016.
[283] Helen Nissenbaum. A contextual approach to privacy online. Daedalus,
140(4):32–48, 2011.
[284] NLTK Project. NLTK: The natural language toolkit. http://www.nltk.org.
Accessed February 1, 2016.
[285] Regina O. Obe and Leo S. Hsu. PostGIS in Action, 2nd Edition. Manning
Publications, 2015.
[286] David Obstfeld. Social networks, the tertius iungens orientation, and in-
volvement in innovation. Administrative Science Quarterly, 50(1):100–130,
2005.
[287] President’s Council of Advisors on Science and Technology. Big data and
privacy: A technological perspective. Technical report, Executive Office of
the President, 2014.
[288] Organisation of Economic Co-operation and Development. A summary of
the Frascati manual. Main definitions and conventions for the measurement
of research and experimental development, 84, 2004.
[289] Paul Ohm. Broken promises of privacy: Responding to the surprising failure
of anonymization. UCLA Law Review, 57:1701, 2010.
[290] Paul Ohm. The legal and regulatory framework: what do the rules say about
data analysis? In Julia Lane, Victoria Stodden, Helen Nissenbaum, and
Stefan Bender, editors, Privacy, Big Data, and the Public Good: Frameworks
for Engagement. Cambridge University Press, 2014.
[291] Judy M. Olson and Cynthia A. Brewer. An evaluation of color selections
to accommodate map users with color-vision impairments. Annals of the
Association of American Geographers, 87(1):103–134, 1997.
[292] Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T. Hancock. Finding decep-
tive opinion spam by any stretch of the imagination. In Proceedings of the
49th Annual Meeting of the Association for Computational Linguistics: Human
Language Technologies—Volume 1, HLT ’11, pages 309–319, Stroudsburg,
PA, 2011. Association for Computational Linguistics.
[293] Jason Owen-Smith and Walter W. Powell. The expanding role of university
patenting in the life sciences: Assessing the importance of experience and
connectivity. Research Policy, 32(9):1695–1711, 2003.
[294] Jason Owen-Smith and Walter W. Powell. Knowledge networks as channels
and conduits: The effects of spillovers in the Boston biotechnology commu-
nity. Organization Science, 15(1):5–21, 2004.
[295] Bo Pang and Lillian Lee. Opinion Mining and Sentiment Analysis. Now Pub-
lishers, 2008.
[296] Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for k-
medoids clustering. Expert Systems with Applications, 36(2):3336–3341,
2009.
[297] Norman Paskin. Digital object identifier (doi) system. Encyclopedia of Library
and Information Sciences, 3:1586–1592, 2008.
[298] Michael Paul and Roxana Girju. A two-dimensional topic-aspect model for
discovering multi-faceted topics. In Association for the Advancement of Arti-
ficial Intelligence, 2010.
[299] James W. Pennebaker and Martha E. Francis. Linguistic Inquiry and Word
Count. Lawrence Erlbaum, 1999.
[300] Alex Pentland, Daniel Greenwood, Brian Sweatt, Arek Stopczynski, and
Yves-Alexandre de Montjoye. Institutional controls: The new deal on data.
In Julia Lane, Victoria Stodden, Stefan Bender, and Helen Nissenbaum, ed-
itors, Privacy, Big Data, and the Public Good: Frameworks for Engagement,
pages 98–112. Cambridge University Press, 2014.
[301] PERISCOPIC. A world of terror. http://terror.periscopic.com/. Accessed
February 1, 2016.
[302] George Petrakos, Claudio Conversano, Gregory Farmakis, Francesco Mola,
Roberta Siciliano, and Photis Stavropoulos. New ways of specifying data
edits. Journal of the Royal Statistical Society, Series A, 167(2):249–274, 2004.
[303] Catherine Plaisant, Jesse Grosjean, and Benjamin B. Bederson. SpaceTree:
Supporting exploration in large node link tree, design evolution and em-
pirical evaluation. In IEEE Symposium on Information Visualization, pages
57–64. IEEE, 2002.
[304] Ruth Pordes, Don Petravick, Bill Kramer, Doug Olson, Miron Livny, Alain
Roy, Paul Avery, Kent Blackburn, Torre Wenaus, Frank Würthwein, et al.
The Open Science Grid. Journal of Physics: Conference Series, 78(1):012057,
2007.
[305] Alan L. Porter, Jan Youtie, Philip Shapira, and David J. Schoeneck. Re-
fining search terms for nanotechnology. Journal of Nanoparticle Research,
10(5):715–728, 2008.
[306] PostGIS Project Steering Committee. PostGIS documentation. http://postgis.
net/documentation/. Accessed December 1, 2015.
[307] Eric Potash, Joe Brew, Alexander Loewi, Subhabrata Majumdar, Andrew
Reece, Joe Walsh, Eric Rozier, Emile Jorgenson, Raed Mansour, and Rayid
Ghani. Predictive modeling for public health: Preventing childhood lead
poisoning. In Proceedings of the 21th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, KDD ’15, pages 2039–2047. ACM,
2015.
[308] W. Powell. Neither market nor hierarchy. Sociology of Organizations: Classic,
Contemporary, and Critical Readings, 315:104–117, 2003.
[309] Walter W. Powell, Douglas R. White, Kenneth W. Koput, and Jason Owen-
Smith. Network dynamics and field evolution: The growth of interorgani-
zational collaboration in the life sciences. American Journal of Sociology,
110(4):1132–1205, 2005.
[310] Jason Priem, Heather A. Piwowar, and Bradley M. Hemminger. Altmetrics
in the wild: Using social media to explore scholarly impact. Preprint, arXiv
1203.4745, 2012.
[311] Foster Provost and Tom Fawcett. Data Science for Business: What You Need
to Know About Data Mining and Data-analytic Thinking. O’Reilly Media, 2013.
[312] Marco Puts, Piet Daas, and Ton de Waal. Finding errors in Big Data. Signif-
icance, 12(3):26–29, 2015.
[313] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected
applications in speech recognition. Proceedings of the IEEE, 77(2):257–286,
1989.
[314] Karthik Ram. Git can facilitate greater reproducibility and increased trans-
parency in science. Source Code for Biology and Medicine, 8(1):7, 2013.
[315] Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher Manning.
Labeled LDA: A supervised topic model for credit attribution in multi-labeled
corpora. In Proceedings of Empirical Methods in Natural Language Processing,
2009.
[316] Raghu Ramakrishnan and Johannes Gehrke. Database Management Sys-
tems, 3rd Edition. McGraw-Hill, 2002.
[317] Jerome P. Reiter. Statistical approaches to protecting confidentiality for
microdata and their effects on the quality of statistical inferences. Public
Opinion Quarterly, 76(1):163–181, 2012.
[318] Philip Resnik and Jimmy Lin. Evaluation of NLP systems. In Alex Clark,
Chris Fox, and Shalom Lappin, editors, Handbook of Computational Linguis-
tics and Natural Language Processing. Wiley Blackwell, 2010.
[319] Leonard Richardson. Beautiful Soup. http://www.crummy.com/software/
BeautifulSoup/. Accessed February 1, 2016.
[320] Donald B. Rubin. Inference and missing data. Biometrika, 63:581–592,
1976.
[321] Bahador Saket, Paolo Simonetto, Stephen Kobourov, and Katy Börner. Node,
node-link, and node-link-group diagrams: An evaluation. IEEE Transactions
on Visualization and Computer Graphics, 20(12):2231–2240, 2014.
[322] Gerard Salton. Automatic Information Organization and Retrieval. McGraw-
Hill, 1968.
[323] Arthur L. Samuel. Some studies in machine learning using the game of
Checkers. IBM Journal of Research and Development, 3(3):210–229, 1959.
[324] Evan Sandhaus. The New York Times annotated corpus. Linguis-
tic Data Consortium, http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?
catalogId=LDC2008T19, 2008.
[325] Purvi Saraiya, Chris North, and Karen Duca. An insight-based methodology
for evaluating bioinformatics visualizations. IEEE Transactions on Visualiza-
tion and Computer Graphics, 11(4):443–456, 2005.
[326] Joseph L. Schafer. Analysis of Incomplete Multivariate Data. CRC Press,
1997.
[327] Joseph L. Schafer and John W. Graham. Missing data: Our view of the state
of the art. Psychological Methods, 7(2):147, 2002.
[328] Michael Schermann, Holmer Hemsen, Christoph Buchmüller, Till Bitter,
Helmut Krcmar, Volker Markl, and Thomas Hoeren. Big data. Business &
Information Systems Engineering, 6(5):261–266, 2014.
[329] Fritz Scheuren and William E. Winkler. Regression analysis of data files that
are computer matched. Survey Methodology, 19(1):39–58, 1993.
[330] Rainer Schnell. An efficient privacy-preserving record linkage technique for
administrative data and censuses. Statistical Journal of the IAOS, 30:263–
270, 2014.
[331] Rainer Schnell. German Record Linkage Center, 2016.
[332] Rainer Schnell, Tobias Bachteler, and Jörg Reiher. Privacy-preserving record
linkage using Bloom filters. BMC Medical Informatics and Decision Making,
9(1):41, 2009.
[333] Julie A. Schoenman. The concentration of health care spending. NIHCM
foundation data brief, National Institute for Health Care Management, 2012.
http://www.nihcm.org/pdf/DataBrief3%20Final.pdf.
[334] Bernhard Scholkopf and Alexander J. Smola. Learning with Kernels: Support
Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[335] Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, H. Chipman,
E. George, and R. McCulloch. Bayes and big data: The consensus Monte
Carlo algorithm. In EFaBBayes 250 conference, volume 16, 2013. http://
bit.ly/1wBqh4w, Accessed January 1, 2016.
[336] Sesame. Sesame RDF triple store. http://rdf4j.org. Accessed February 1,
2016.
[337] James A. Sethian, Jean-Philippe Brunet, Adam Greenberg, and Jill P.
Mesirov. Computing turbulent flow in complex geometries on a massively
parallel processor. In Proceedings of the 1991 ACM/IEEE Conference on Su-
percomputing, pages 230–241. ACM, 1991.
[338] Charles Severance. Python for informatics: Exploring information. http://
www.pythonlearn.com/book.php, 2013.
[339] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analy-
sis. Cambridge University Press, 2004.
[340] Taylor Shelton, Ate Poorthuis, Mark Graham, and Matthew Zook. Mapping
the data shadows of Hurricane Sandy: Uncovering the sociospatial dimen-
sions of ‘big data’. Geoforum, 52:167–179, 2014.
[341] Aimee Shen, Patrick J. Lupardus, Montse Morell, Elizabeth L. Ponder, A. Ma-
soud Sadaghiani, K. Christopher Garcia, Matthew Bogyo, et al. Simplified,
enhanced protein purification using an inducible, autoprocessing enzyme
tag. PLoS One, 4(12):e8119, 2009.
[342] Ben Shneiderman. Tree visualization with tree-maps: 2-D space-filling ap-
proach. ACM Transactions on Graphics, 11(1):92–99, 1992.
[343] Ben Shneiderman. Extreme visualization: Squeezing a billion records into
a million pixels. In Proceedings of the 2008 ACM SIGMOD International Con-
ference on Management of Data, pages 3–12. ACM, 2008.
[344] Ben Shneiderman and Catherine Plaisant. Sharpening analytic focus to
cope with big data volume and variety. Computer Graphics and Applications,
IEEE, 35(3):10–14, 2015. See also http://www.cs.umd.edu/hcil/eventflow/
Sharpening-Strategies-Help.pdf.
[345] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler.
The Hadoop distributed file system. In IEEE 26th Symposium on Mass Stor-
age Systems and Technologies (MSST), pages 1–10. IEEE, 2010.
[346] Helmut Sies. A new parameter for sex education. Nature, 332:495, 1988.
[347] Abraham Silberschatz, Henry F. Korth, and S. Sudarshan. Database System
Concepts, 6th Edition. McGraw-Hill, 2010.
[348] Alex J. Smola and Bernhard Schölkopf. A tutorial on support vector regres-
sion. Statistics and Computing, 14(3):199–222, August 2004.
[349] John Snow. On the Mode of Communication of Cholera. John Churchill,
1855.
[350] Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Ng. Cheap and
fast—but is it good? Evaluating non-expert annotations for natural language
tasks. In Proceedings of Empirical Methods in Natural Language Processing,
2008.
[351] Solid IT. DB Engines. http://db-engines.com/en/. Accessed February 1,
2016.
[352] SOSP. Science of science policy. http://www.scienceofsciencepolicy.net/.
Accessed February 1, 2016.
[353] Peverill Squire. Why the 1936 Literary Digest poll failed. Public Opinion
Quarterly, 52(1):125–133, 1988.
[354] Stanford. Stanford CoreNLP—a suite of core NLP tools. http://nlp.stanford.
edu/software/corenlp.shtml. Accessed February 1, 2016.
[355] Stanford Visualization Group. Dorling cartograms in ProtoVis. http://
mbostock.github.io/protovis/ex/cartogram.html. Accessed January 10,
2015.
[356] Mark W. Stanton and M. K. Rutherford. The High Concentration of US Health
Care Expenditures. Agency for Healthcare Research and Quality, 2006.
[357] John Stasko, Carsten Görg, and Zhicheng Liu. Jigsaw: Supporting inves-
tigative analysis through interactive visualization. Information Visualization,
7(2):118–132, 2008.
[358] Steve Stemler. An overview of content analysis. Practical Assessment, Re-
search & Evaluation, 7(17), 2001.
[359] Rebecca C. Steorts, Rob Hall, and Stephen E. Fienberg. SMERED: A Bayesian
approach to graphical record linkage and de-duplication. Preprint, arXiv
1403.0211, 2014.
[360] S. Stephens-Davidowitz and H. Varian. A hands-on guide to Google
data. http://people.ischool.berkeley.edu/~hal/Papers/2015/primer.pdf.
Accessed October 12, 2015.
[361] James H. Stock and Mark W. Watson. Forecasting using principal compo-
nents from a large number of predictors. Journal of the American Statistical
Association, 97(460):1167–1179, 2002.
[362] Katherine J. Strandburg. Monitoring, datafication and consent: Legal ap-
proaches to privacy in the big data context. In Julia Lane, Victoria Stod-
den, Stefan Bender, and Helen Nissenbaum, editors, Privacy, Big Data, and
the Public Good: Frameworks for Engagement. Cambridge University Press,
2014.
[363] Carly Strasser. Git/GitHub: A primer for researchers. http://datapub.cdlib.
org/2014/05/05/github-a-primer-for-researchers/, May 5, 2014.
[364] Christof Strauch. NoSQL databases. http://www.christof-strauch.de/
nosqldbs.pdf, 2009.
[365] Elizabeth A. Stuart. Matching methods for causal inference: A review and a
look forward. Statistical Science, 25(1):1, 2010.
[366] Latanya Sweeney. Computational disclosure control: A primer on data pri-
vacy protection. Technical report, MIT, 2001. http://groups.csail.mit.edu/
mac/classes/6.805/articles/privacy/sweeney-thesis-draft.pdf.
[367] Alexander S. Szalay, Jim Gray, Ani R. Thakar, Peter Z. Kunszt, Tanu Malik,
Jordan Raddick, Christopher Stoughton, and Jan vandenBerg. The SDSS
skyserver: Public access to the Sloan digital sky server data. In Proceedings
of the 2002 ACM SIGMOD International Conference on Management of Data,
pages 570–581. ACM, 2002.
[368] Edmund M. Talley, David Newman, David Mimno, Bruce W. Herr II,
Hanna M. Wallach, Gully A. P. C. Burns, A. G. Miriam Leenders, and An-
drew McCallum. Database of NIH grants using machine-learned categories
and graphical clustering. Nature Methods, 8(6):443–444, 2011.
[369] Adam Tanner. Harvard professor re-identifies anonymous volunteers in
DNA study. Forbes, http://www.forbes.com/sites/adamtanner/2013/04/
25/harvard-professor-re-identifies-anonymous-volunteers-in-dna-study/#
6cc7f6b43e39, April 25, 2013.
[370] TDP. Transactions on Data Privacy. http://www.tdp.cat/. Accessed April 16,
2016.
[371] M. Tennekes, E. de Jonge, and P. Daas. Innovative visual tools for data
editing. Presented at the United Nations Economic Commission for Europe
Work Session on Statistical Data. Available online at http://www.pietdaas.
nl/beta/pubs/pubs/30_Netherlands.pdf, 2012.
[372] Martijn Tennekes and Edwin de Jonge. Top-down data analysis with
treemaps. In Proceedings of the International Conference on Imaging Theory
and Applications and International Conference on Information Visualization
Theory and Applications, pages 236–241. SciTePress, 2011.
[373] Martijn Tennekes, Edwin de Jonge, and Piet J. H. Daas. Visualizing and
inspecting large datasets with tableplots. Journal of Data Science, 11(1):43–
58, 2013.
[374] Alexander I. Terekhov. Evaluating the performance of Russia in the research
in nanotechnology. Journal of Nanoparticle Research, 14(11), 2012.
[375] William W. Thompson, Lorraine Comanor, and David K. Shay. Epidemiology
of seasonal influenza: Use of surveillance data and statistical models to esti-
mate the burden of disease. Journal of Infectious Diseases, 194(Supplement
2):S82–S91, 2006.
[376] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal
of the Royal Statistical Society, Series B, pages 267–288, 1996.
[377] D. Trewin, A. Andersen, T. Beridze, L. Biggeri, I. Fellegi, and T. Toczyn-
ski. Managing statistical confidentiality and microdata access: Principles
and guidelines of good practice. Technical report, Conference of European
Statisticians, United Nations Economic Commission for Europe, 2007.
[378] TSE15. 2015 international total survey error conference website. https://
www.tse15.org. Accessed February 1, 2016.
[379] Suppawong Tuarob, Line C. Pouchard, and C. Lee Giles. Automatic tag
recommendation for metadata annotation using probabilistic topic modeling.
In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries,
JCDL ’13, pages 239–248. ACM, 2013.
[380] Edward Tufte. The Visual Display of Quantitative Information, 2nd Edition.
Graphics Press, 2001.
[381] Edward Tufte. Beautiful Evidence, 2nd Edition. Graphics Press, 2006.
[382] United Nations Economic Commission for Europe. Statistical confiden-
tiality and disclosure protection. http://www.unece.org/stats/mos/meth/
confidentiality.html. Accessed April 16, 2016.
[383] University of Oxford. British National Corpus. http://www.natcorp.ox.ac.
uk/, 2006.
[384] University of Waikato. Weka 3: Data mining software in Java. http://www.
cs.waikato.ac.nz/ml/weka/. Accessed February 1, 2016.
[385] Richard Valliant, Jill A. Dever, and Frauke Kreuter. Practical Tools for De-
signing and Weighting Survey Samples. Springer, 2013.
[386] Hal R. Varian. Big data: New tricks for econometrics. Journal of Economic
Perspectives, 28(2):3–28, 2014.
[387] Samuel L. Ventura, Rebecca Nugent, and Erica R. H. Fuchs. Seeing the
non-stars:(some) sources of bias in past disambiguation approaches and a
new public tool leveraging labeled records. Research Policy, 2015.
[388] Tyler Vigen. Spurious correlations. http://www.tylervigen.com/
spurious-correlations. Accessed February 1, 2016.
[389] Tyler Vigen. Spurious Correlations. Hachette Books, 2015.
[390] Hanna Wallach, David Mimno, and Andrew McCallum. Rethinking LDA:
Why priors matter. In Advances in Neural Information Processing Systems,
2009.
[391] Anders Wallgren and Britt Wallgren. Register-Based Statistics: Administra-
tive Data for Statistical Purposes. John Wiley & Sons, 2007.
[392] Chong Wang, David Blei, and Li Fei-Fei. Simultaneous image classification
and annotation. In Computer Vision and Pattern Recognition, 2009.
[393] Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang.
PLDA: parallel latent Dirichlet allocation for large-scale applications. In Inter-
national Conference on Algorithmic Aspects in Information and Management,
2009.
[394] Karl J. Ward. Crossref REST API. http://api.crossref.org. Accessed February
1, 2016.
[395] Matthew O. Ward, Georges Grinstein, and Daniel Keim. Interactive Data
Visualization: Foundations, Techniques, and Applications. CRC Press, 2010.
[396] Bruce A. Weinberg, Jason Owen-Smith, Rebecca F. Rosen, Lou Schwarz,
Barbara McFadden Allen, Roy E. Weiss, and Julia Lane. Science funding
and short-term economic activity. Science, 344(6179):41, 2014.
[397] Steven Euijong Whang, David Menestrina, Georgia Koutrika, Martin
Theobald, and Hector Garcia-Molina. Entity resolution with iterative block-
ing. In Proceedings of the 2009 ACM SIGMOD International Conference on
Management of Data, pages 219–232. ACM, 2009.
[398] Harrison C. White, Scott A. Boorman, and Ronald L. Breiger. Social structure
from multiple networks. I. Block models of roles and positions. American
Journal of Sociology, pages 730–780, 1976.
[399] Tom White. Hadoop: The Definitive Guide. O’Reilly, 2012.
[400] Michael Wick, Sameer Singh, Harshal Pandya, and Andrew McCallum. A
joint model for discovering and linking entities. In Proceedings of the 2013
Workshop on Automated Knowledge Base Construction, pages 67–72. ACM,
2013.
[401] Wikipedia. List of computer science conferences. http://en.wikipedia.org/
wiki/List_of_computer_science_conferences. Accessed April 16, 2016.
[402] Wikipedia. Representational state transfer. https://en.wikipedia.org/wiki/
Representational_state_transfer. Accessed January 10, 2016.
[403] John Wilbanks. Portable approaches to informed consent and open data.
In Julia Lane, Victoria Stodden, Stefan Bender, and Helen Nissenbaum,
editors, Privacy, Big Data, and the Public Good: Frameworks for Engagement,
pages 98–112. Cambridge University Press, 2014.
[404] Dean N. Williams, R. Drach, R. Ananthakrishnan, I. T. Foster, D. Fraser,
F. Siebenlist, D. E. Bernholdt, M. Chen, J. Schwidder, S. Bharathi, et al.
The earth system grid: Enabling access to multimodel climate simulation
data. Bulletin of the American Meteorological Society, 90(2):195–205, 2009.
[405] James Wilsdon, Liz Allen, Eleonora Belfiore, Philip Campbell, Stephen
Curry, Steven Hill, Richard Jones, Roger Kain, Simon Kerridge, Mike Thel-
wall, Jane Tinkler, Ian Viney, Paul Wouters, Jude Hill, and Ben Johnson.
The metric tide: Report of the independent review of the role of metrics
in research assessment and management. http://www.hefce.ac.uk/pubs/
rereports/Year/2015/metrictide/Title,104463,en.html, 2015.
[406] William E. Winkler. Record linkage. In D. Pfeffermann and C. R. Rao, ed-
itors, Handbook of Statistics 29A, Sample Surveys: Design, Methods and
Applications, pages 351–380. Elsevier, 2009.
[407] William E. Winkler. Matching and record linkage. Wiley Interdisciplinary
Reviews: Computational Statistics, 6(5):313–325, 2014.
[408] Krist Wongsuphasawat and Jimmy Lin. Using visualizations to monitor
changes and harvest insights from a global-scale logging infrastructure at
twitter. In Proceedings of the IEEE Conference on Visual Analytics Science
and Technology, pages 113–122. IEEE, 2014.
[409] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang,
Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, S. Yu Philip,
Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top
10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–
37, 2008.
[410] Stefan Wuchty, Benjamin F. Jones, and Brian Uzzi. The increasing domi-
nance of teams in production of knowledge. Science, 316(5827):1036–1039,
2007.
[411] Beth Yost, Yonca Haciahmetoglu, and Chris North. Beyond visual acuity:
The perceptual scalability of information visualizations for large displays.
In Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems, pages 101–110. ACM, 2007.
[412] Zygmunt Z. Machine learning courses online. http://fastml.com/
machine-learning-courses-online, January 7, 2013.
[413] Laura Zayatz. Disclosure avoidance practices and research at the US Census
Bureau: An update. Journal of Official Statistics, 23(2):253, 2007.
[414] Xiaojin Zhu. Semi-supervised learning literature survey. http://pages.cs.
wisc.edu/~jerryzhu/pub/ssl_survey.pdf, 2008.
[415] Nikolas Zolas, Nathan Goldschlag, Ron Jarmin, Paula Stephan, Jason
Owen-Smith, Rebecca F. Rosen, Barbara McFadden Allen, Bruce A. Wein-
berg, and Julia Lane. Wrapping it up in a person: Examining employment
and earnings outcomes for Ph.D. recipients. Science, 350(6266):1367–1371,
2015.
Index
Note: Page numbers ending in “f” refer to figures. Page numbers ending in “t” refer to tables.
A
Administrative Data Research
Network (ADRN), 73
Algebraic models, 201
Algorithm dynamic issues,
275–277
Analysis, 15–16, 15f
“Anscombe’s quartet,” 244, 244f
Apache Hadoop, 129–138. See
also Hadoop
Apache Spark, 138–143, 141t
APIs
Crossref API, 40
DOIs and, 40–47, 50–59,
67–69
Lagotto API, 46–52, 59
ORCID API, 42–43, 46,
52–58, 68
programming against, 41–42
relevant APIs, 38, 39t
resources, 38
RESTful APIs, 38–39
returned data and, 38–40
streaming API, 136–138
web data and, 23–70
wrappers and, 38–43
Article citations, 58–60
Article links, 58–60, 203–205,
206t
Association rules, 160–161
B
Bar charts, 251
Big data
analytics and, 277–290
challenges with, 65–66,
265–266
confidentiality and, 299–311
correlation analysis,
284–287, 288f
data generation, 29–31, 45,
273
data quality and, 7–8
defining, 3–4, 125
editing process, 291–295
errors and, 265–297
inference and, 4–5
macro-editing, 292–293
micro-editing, 291–292
positive predictive values,
283–284, 284t
privacy and, 299–311
process map, 274, 274f
programming and, 125–143
regression analysis,
288–290, 289f
selective editing, 292
social science and, 4–13
tableplots, 293–295, 294f
“three Vs,” 125, 277
tools for, 9–10
uses of, 10–11
value of, 3–4
Bookmark analysis, 34–35
Boolean logic, 198–199
Bootstrap aggregation, 169
Box plots, 251
C
CAP theorem, 116–117
Capture, 13–14, 14f
Cartograms, 252
Causal inference, 5–6, 162, 184
Cell errors, 271–273
Charts, 246–247, 247f, 251
Choropleth maps, 252
Citations, 32–33, 58–60
Cloud, 18–19
Clustering algorithms, 80, 88, 92,
154–160, 251
Clustering coefficient, 229–231
Collocations, 191, 211–212
Color blindness, 261–262
Column errors, 270–271
Column-oriented databases, 120
Comma-separated-value (CSV)
files, 100–101, 101f, 105–106,
114–115
Compute cluster, 10, 131
Confidentiality
big data and, 299–311
challenges with, 300–308
definition of, 300
knowledge infrastructures,
304–305
legal issues with, 301–303,
308–310
licensing arrangements, 306
link keys, 303–304
new challenges with,
306–308
privacy and, 299–311
public use files, 305–306
research data centers,
305–306
risks with, 300–302
statistical disclosure,
305–306
synthetic data, 305–306, 310
Confusion matrix, 176–177, 177f
Contour maps, 252
CoreNLP, 211–212
Corpus of works, 52–58, 190
Correlation matrices, 251
Cosine similarity, 201–203
Coverage bias, 269–270
Crash recovery, 97–98, 98t,
108–109
Curation, 13–14, 14f
D
Dashboards, 245, 247f
Data
access to, 303–306
APIs and, 23–70
availability of, 61–62
bookmark examples, 34–35
capturing, 23–70
categorizing, 31–32, 32f
citations, 32–33
comparisons, 63–64
completeness of, 44–45,
61–62
data-by-tasks taxonomy,
249–259
document stores, 98
download examples, 33–34
dynamic data, 62–65
expert recommendations,
36–37
“flavors of impact,” 64–65
functional view of, 37–40
from HHMI website, 24–30,
26f
hierarchical data, 255–256,
255f
link keys, 303–304
loading, 107–108, 114–115,
125–126
multivariate data, 249–251
network data, 218–224,
219f, 220f, 238–239,
257–259
new sources, 66–67
online interactions, 37–38,
38f
overfitting, 152, 154, 162,
168
ownership of data, 302
page view examples, 33–34
parsing, 8, 41–43, 92, 141,
270–271
in research enterprise,
31–37, 32f
returned data, 38–40
scope of, 44–45
scraping, 24–30, 26f
social media discussions,
35–36
source of, 23–31, 36–38,
44–58
spatial data, 120–121,
251–252
survey data, 1–4, 2f
synthetic data, 305–306, 310
tabular data, 98–99,
249–250, 305
temporal data, 252–255,
253f, 254f
text data, 153–155,
184–192, 259, 260f
“three Vs,” 125, 277
timescale, 65–66
validity of, 45
from web, 24–30
Data analysis tools, 245–246,
246f
Data cleaning, 112
Data integration
corpus of works, 52–58
Lagotto API, 46–52
ORCID API, 46, 52–58
Data management, 44–46
Data manipulation, 102–103. See
also Database management
systems
Data model, 96–98, 98t
Data protection, 91–92
Data quality, 7–8, 72–78, 86–92
Data storage, 111–112
Database management systems
(DBMSs)
availability, 116
challenges with, 112–113
components of, 97–98, 98t
concurrency control, 109
consistency, 116
crash recovery support,
97–98, 98t, 108–109
CSV files, 100–101, 101f,
105–106, 114–115
data cleaning, 112
data definition, 102
data integrity, 102
data manipulation, 102–103
data model and, 96–98, 98t
data storage, 111–112
description of, 9–10, 93–94,
97
dimension table, 113
embedding queries, 114
features of, 97
JSON documents, 114–116,
119
key–value stores, 117–119
linking, 113–116
loading data, 107–108,
114–115, 125–126
metadata, 113
missing values, 112
optimization methods,
109–112
partition tolerance, 116
query languages, 97–98, 98t,
102–103
query plan, 103, 109–112
relational DBMSs, 9–10,
98–117, 120–123
resources for, 124
SQL queries, 102–103
transactions support, 97–98,
98t, 108–109
types of, 98–100, 99t,
116–120
uses of, 94–96, 95t
Databases
cross join, 104
description of, 97
inner join, 104, 121–122,
122t
join techniques, 104,
120–122, 122t
loading data, 107–108
management systems, 9–10,
93–94
overview of, 93–94
projections, 103
resources for, 124
scaling, 116–117
selections, 103
tables, 100–102, 102f,
103–107
types of, 98–99, 99t,
116–123
uses of, 94–96
Data-by-tasks taxonomy,
249–259
Decision trees, 166–168, 167f
Deduplication, 9, 270
Descriptive statistics, 5, 227, 234,
237–238, 238t
Digital Object Identifiers (DOIs),
40–47, 50–59, 67–69
Dimensionality reduction, 160
Disco, 138–139
Distance metrics, 155
Document classification, 207
Document stores, 98
Document-based databases,
119–120
Download examples, 33–34
Dynamic data, 62–65
E
EM algorithm, 156
Errors
algorithm dynamics,
275–277
analytics and, 277–290
attenuation factor, 286
big data and, 265–297
categorical variables, 282
cell errors, 271–273
coincidental correlations,
278, 278f, 296
column errors, 270–271
compensating for, 290–296
correlated errors, 280–281
correlation analysis, 287f
coverage bias and, 269–270
detecting, 290–295
ETL processes, 273–275,
274f
framework for, 266–275,
268f
illustrations of, 275–277
incidental endogeneity, 275,
277–279, 296
inference and, 265–297
interpreting, 266–275
misclassification, 283–284
missing data, 272–273
mitigating, 290–295
models for, 266–275, 268f
noise accumulation,
275–279, 296
paradigm for, 266–275
regression analysis,
288–290, 289f
resources for, 296–297
searches and, 268–270,
275–276
spurious correlations,
277–278, 296
systematic errors, 286–290,
287f, 288f
total survey error, 267,
270–273
variable errors, 277–283,
286–290, 289f
veracity problems, 279
volumatic problems,
277–279
Ethics, 16–17, 16f
ETL processes, 273–275, 274f
Expectation-maximization (EM)
clustering, 156
Expert recommendations, 36–37
Exploratory analysis, 245
F
Facebook
algorithm dynamic issues,
276
dynamic data, 62–63
link data, 13
social media discussions,
35–36
text analysis, 187, 212
web data, 23
“Flavors of impact,” 64–65
G
Gibbs sampling, 194–196, 196f
GitHub resources, 18, 313, 319
Google Flu Trends, 7, 17, 265,
275–276
Google searches, 268–270,
275–276
Graph databases, 120
Graph of relationships, 58–63
GreenplumDB, 138–139
Grid maps, 252
H
Hadoop
fault tolerance, 137
hardware for, 134–135, 135f
limitations of, 137–138
MapReduce and, 129–138,
135f
parallel computing and,
131–132, 132f, 135
performance of, 137–138
programming support,
136–137
real-time applications, 138
resources for, 143
streaming API, 136–138
Hadoop Distributed File System
(HDFS), 130–138, 135f
Hatch funding, 246
Heatmaps, 251
Hierarchical clustering, 158
Hierarchical data, 255–256, 255f
Histogram charts, 251
I
Inference
causal inference, 5–6, 162,
184
errors and, 265–297
ethics and, 16–17, 16f
posterior inference, 194–195
Infographics, 245
Information discovery, 193–194
Information retrieval, 198
Information visualization
bar charts, 251
box plots, 251
cartograms, 252
challenges with, 259–262
charts, 246–247, 247f, 251
choropleth maps, 252
clustering algorithms, 251
contour maps, 252
correlation matrices, 251
dashboards, 245, 247f
data analysis tools, 245–246,
246f
data-by-tasks taxonomy,
249–259
development of, 244–249,
246f, 247f, 248f
effectiveness of, 243–249
evaluation of, 261
explanation of, 243–244
exploratory analysis, 245
grid maps, 252
group-in-a-box layout, 257f,
259
guidelines for, 248–249
Hatch funding, 246
heatmaps, 251
hierarchical data, 255–256,
255f
histogram charts, 251
infographics, 245
interactive analysis, 245
interactive design, 243–244,
249–251
isopleth maps, 252
maps, 252, 253f
matrix charts, 251
multivariate data, 249–251
network data, 257–259,
257f, 258f
network maps, 252, 257f,
258f
node-link diagrams,
257–258, 257f, 258f
overview of, 243–244,
249–250
parallel coordinate plot, 251
resources for, 263
scalability, 260–261
scatterplots, 251, 253f
spatial data, 251–252
tableplots, 293–295, 294f
temporal data, 252–255,
253f, 254f
text data, 259, 260f
tile grid maps, 252
tools for, 243–263, 293–295
treemaps, 247–248, 247f,
255–256, 255f, 256f, 293
visual impairment, 261–262
visual literacy, 262
Instance-based models, 163
Interactive analysis, 245
Interactive design, 243–244,
249–251. See also Information
visualization
Isopleth maps, 252
J
JavaScript Object Notation
(JSON), 40–41, 49, 99,
114–116, 119
K
Kernel methods, 165
Key–value stores, 117–119
k-means clustering, 155–160,
157f, 159f, 163, 163f, 251
Knowledge infrastructures,
304–305
Knowledge repositories, 203–204
Kullback–Leibler (KL) divergence,
201–203
L
Lagotto API, 46–52, 59
Latent Dirichlet allocation (LDA),
193–194, 194f
Legal issues, 301–303, 308–310
Lemmatization, 191–192
Licensing arrangements, 306
Link keys, 76–78, 83, 303–304.
See also Record linkage
Logical design, 105–107
M
Machine learning
accuracy, 177–178
advanced topics, 185
association rules, 160–161
bagging, 169
benefits from, 183–185
boosting, 169–170
classification, 152, 182–183
clustering algorithms, 80,
88, 92, 154–160
confusion matrix, 176–177,
177f
data exploration, 150–151
decision trees, 166–168,
167f
deployment method, 151
description of, 85–86,
147–150
dimensionality reduction,
160
direct multiclass, 182
distance metrics, 155
EM clustering, 156
evaluation method, 151,
173–180
features, 151–152, 154,
180–183
hierarchical clustering, 158
instance-based models, 163
kernel methods, 165
k-means clustering,
155–160, 157f, 159f, 163,
163f
labels, 152, 154
mean shift clustering,
156–157
memory-based models, 163
methods, 150–180, 153f
metrics, 176–180, 179f
model building, 162–163
model ensembles, 169–170
model misspecification,
183–184
multiclass problems,
181–182
neural networks, 172–173
one-versus-all conversion,
182
overfitting data, 152, 154,
162, 168
pairwise conversion, 182
partitioning methods, 158
pipeline, 181
precision, 177–178, 179f
prediction, 152, 177–178,
183–184
principal components, 160
problem formulation,
151–152
processes for, 147–186
random forests, 171
recall, 177–178, 179f
record linkage and, 85–88,
87f
regression, 152
regularization, 152, 154, 166
resources for, 186
spectral clustering, 158–159,
159f
stacking technique, 171–172
steps in, 150–151
supervised learning,
151–163, 153f, 172, 181,
206
support vector machines,
164–166, 165f
surveys, 184
text analysis, 184
tips, 180–183
treatment effects, 184
underfitting data, 154
unsupervised learning,
152–160, 153f, 172
validation methods,
174–176, 174f, 175f
variable selection, 185
vocabulary, 154
MALLET, 212, 213
MapReduce
Hadoop and, 129–138, 135f
limitations of, 138
performance of, 134–135
Spark and, 138–142, 141t
uses of, 127–129
Maps, 252, 253f
Matrix charts, 251
Mean shift clustering, 156–157
Mean squared error (MSE),
176–177
Memory-based models, 163
Missing at random (MAR), 272
Missing completely at random
(MCAR), 272
Missing data, 272–273. See also
Errors
Modeling, 15–16, 15f
MongoDB, 138–139
Multivariate data, 249–251
MySQL resources, 18, 319
N
Natural Language Toolkit (NLTK),
211, 213
Networks
analysis of, 66
basics of, 215–240
centrality measures,
231–234
clustering coefficient,
229–231
collaboration networks,
234–238
comparisons, 234–238, 235f,
236f
components of, 226–227,
234–238, 235f, 236f, 238t
data for, 218–224, 219f,
220f, 238–239, 257–259
degree distribution, 228–229
explanation of, 217–218
maps of, 252, 257f, 258f
measures for, 224–234,
238–239
neural networks, 172–173
node-link diagrams,
257–258, 257f, 258f
nodes, 218
one-mode networks,
220–223
path lengths, 227–228, 229f,
237–238, 237f
reachability, 224–226, 225f
resources for, 239–240
structures for, 217–218
two-mode networks,
220–223, 222f
visualization of, 215, 221,
234, 239, 258–259, 258f
whole-network measures,
225–226
Node-link diagrams, 257–258,
257f, 258f
Noise accumulation, 275–279,
296
Nonignorable missing data,
272–273
NoSQL databases
ACID properties, 117–118
availability, 116
CAP theorem, 116–117
consistency, 116
description of, 116–117
key–value stores, 117–119
management systems, 10,
98–100, 99t, 100f
operations, 118
partition tolerance, 116
scaling challenges, 116–117
types of, 116–120
uses of, 95t, 123
O
One versus all (OVA), 182
Online interactions, 37–38, 38f
ORCID API, 42–43, 46, 52–58, 68
Ownership of data, 302. See also
Privacy
P
Page view examples, 33–34
Parallel computing
big data and, 10, 14,
125–128
Hadoop and, 131–132, 132f,
135
Parallel coordinates, 251, 259
Parsing data, 8, 41–43, 92, 141,
270–271
Part-of-speech tagging, 207
Personal identifiable information
(PII), 304–305, 308, 310
Positive predictive value (PPV),
283–284, 284t
Posterior inference, 194–195
Prediction methods, 3–7, 152,
162–183
Principal components, 160
Privacy
big data and, 299–311
breaches in, 301–302, 307
challenges with, 300–308
confidentiality and, 299–311
data ownership, 302
definition of, 299
knowledge infrastructures,
304–305
legal issues with, 301–303,
308–310
link keys, 303–304
new challenges with,
306–308
personal identifiable
information, 304–305,
308, 310
privacy-protecting
identifiers, 304–305,
311
privacy–utility tradeoff, 300,
300f
public use files, 305–306
research data centers,
305–306
risks with, 300–302
statistical disclosure,
305–306
synthetic data, 305–306, 310
Privacy-protecting identifiers,
304–305, 311
Privacy–utility tradeoff, 300, 300f
Programming big data, 125–143.
See also Big data
Programming resources, 17–19,
313–319
Public use files, 305–306
Python
features of, 26–27
functions, 128–129
linking databases, 113–116
resources for, 18, 313, 319
Spark and, 139–140, 142
wrappers and, 38–40
Q
Query languages, 97–98, 98t
Query plan, 103, 109–112
R
Random forests, 171
Record linkage
blocking keys, 78–80
classification method, 86–91
classification thresholds,
89–90
data protection and, 91–92
data quality and, 72–78,
86–92
disambiguating networks,
88–89
examples of, 73–75, 75f
explanation of, 71–76
fields for, 72–92
indexing techniques, 78–80,
85
link keys, 76–78, 83
machine learning
approaches, 85–88, 87f
match scores, 80–91
matching technique, 80–92
one-to-one links, 90–91
overview of, 71–72
preprocessing data for,
76–78
probabilistic approaches,
83–85, 87f
random forest model, 86–89
record comparisons, 77–92
rule-based approaches,
82–83
trusted third parties, 91–92
uses of, 71–76
workflow process, 75–76, 75f
Relational database management
systems
challenges with, 112–113
concurrency control, 109
CSV files, 100–101, 101f,
105–106, 114–115
data cleaning, 112
data definition, 102
data integrity, 102
data manipulation, 102–103
data storage, 111–112
description of, 9–10, 98–101
dimension table, 113
join techniques, 104,
120–122, 122t
key concepts, 100–101, 101f
loading data, 107–108,
114–115, 125–126
logical design, 105–107
metadata, 113
missing values, 112
normalization, 106
optimization methods,
109–112
query languages, 102–103
query plan, 103, 109–112
schema design, 105–107
SQL queries, 102–103
tables, 100–101, 102f
transactions support, 97–98,
98t, 108–109
types of, 98–100, 99t
uses of, 95t, 122–123
Relationship connections, 58–63,
59f
Replications, 130–131, 304
Research data centers, 305–306
Research enterprise, 31–37, 32f
Research impact, 65–69
Resources, 17–19
Riak, 138–139
Root mean squared error (RMSE),
176–177
S
Scaling databases, 116–117
Scatterplots, 251
Schema design, 99, 105–107
Science policy, 11, 12f, 17
Scraping data, 24–30, 26f
Scraping limits, 30
Scripting languages, 95–96
Sentiment analysis, 207
Set-theoretic operations, 198–199
Social media discussions, 35–36
Source HTML, 26, 26f
Spark, 138–143, 141t
Spatial data, 120–121, 251–252
Spatial databases, 120–121
Spectral clustering, 158–159,
159f
Standardization, 8–9
Statistical disclosure, 305–306
Statistical packages, 96, 104
Storage cluster, 131–132
Structured Query Language (SQL)
database management and,
97–98, 98t
queries, 101–103, 108,
111–114
query plan, 103
uses of, 102–103
Supervised learning, 151–163,
153f, 171, 181, 206. See also
Machine learning
Support vector machines (SVMs),
164–166, 165f
Survey data, 1–4, 2f
Synthetic data, 305–306, 310
T
Tableplots, 293–295, 294f
Tabular data, 98–99, 249–250,
305
Temporal data, 252–255, 253f,
254f
Term frequency–inverse
document frequency (TFIDF),
192–193, 201
Term weighting, 192–193
Text analysis
algebraic models, 201
applications, 197–207
approaches, 193–207
Boolean logic, 198–199
collocations, 191, 211–212
corpus of works, 52–58, 190
cosine similarity, 201–203
document classification, 207
evaluation techniques,
208–210
examples of, 196–204, 206t
Gibbs sampling, 194–196,
196f
hapax legomena, 191
information discovery,
193–194
information retrieval, 198
KL divergence, 201–203
knowledge repositories,
203–204
latent Dirichlet allocation,
193–194, 194f
lemmatization, 191–192
machine learning and, 184
N-grams, 191
part-of-speech tagging, 207
posterior inference, 194–195
process of, 189–192
resources for, 213–214
sentiment analysis, 207
set-theoretic operations,
198–199
similarity systems, 201–204
stemming, 191–192
stop words, 191
term weighting, 192–193
text normalization, 191–192
tokenization, 190–191
toolkits for, 210–214
topic modeling, 193–198,
194f, 195f
understanding, 187–189
uses of, 187–189
Text data, 153–155, 184–192,
259, 260f
Text normalization, 191–192
“Three Vs,” 125, 277
Tile grid maps, 252
Tokenization, 190–191
Topic modeling, 193–198, 194f,
195f
Total survey error (TSE), 4–5,
267, 270–273, 297
Transactions, 97–98, 98t,
108–109
Treemaps, 247–248, 247f,
255–256, 255f, 256f, 293
Trusted third parties (TTPs),
91–92
Twitter
algorithm dynamic issues,
276
data collection, 99
data management, 45
data reliance, 304
data sets, 94
errors and, 269–271, 276
link data, 13
record linkage, 74
social media discussions,
35–36
text analysis, 123
web data, 3, 6, 10, 23, 50–63
U
Unsupervised learning, 152–160,
153f, 172. See also Machine
learning
V
Visual impairment, 261–262
Visual literacy, 262
Visualization tools, 243–263,
293–295. See also Information
visualization
W
Web data, 23–70
Web scraping, 24–30, 26f
Wikipedia articles, 203–205, 206t
Workbooks
APIs, 315–316
data linkage, 316
database basics, 316
details, 315–319
machine learning, 317
networks, 318
overview of, 313–319
running locally, 314–315
servers, 314–315
social media, 315–316
text analysis, 317–318
visualization, 318–319
