BIG DATA AND
SOCIAL SCIENCE
A Practical Guide to Methods and Tools

Chapman & Hall/CRC

Statistics in the Social and Behavioral Sciences Series
Series Editors
Jeff Gill
Washington University, USA

Steven Heeringa
University of Michigan, USA

Wim J. van der Linden
Pacific Metrics, USA

J. Scott Long
Indiana University, USA

Tom Snijders
Oxford University, UK
University of Groningen, NL

Aims and scope
Large and complex datasets are becoming prevalent in the social and behavioral
sciences and statistical methods are crucial for the analysis and interpretation of such
data. This series aims to capture new developments in statistical methodology with
particular relevance to applications in the social and behavioral sciences. It seeks to
promote appropriate use of statistical, econometric and psychometric methods in
these applied sciences by publishing a broad range of reference works, textbooks and
handbooks.
The scope of the series is wide, including applications of statistical methodology in
sociology, psychology, economics, education, marketing research, political science,
criminology, public policy, demography, survey methodology and official statistics. The
titles included in the series are designed to appeal to applied statisticians, as well as
students, researchers and practitioners from the above disciplines. The inclusion of real
examples and case studies is therefore essential.

Published Titles
Analyzing Spatial Models of Choice and Judgment with R
David A. Armstrong II, Ryan Bakker, Royce Carroll, Christopher Hare, Keith T. Poole, and Howard Rosenthal
Analysis of Multivariate Social Science Data, Second Edition
David J. Bartholomew, Fiona Steele, Irini Moustaki, and Jane I. Galbraith
Latent Markov Models for Longitudinal Data
Francesco Bartolucci, Alessio Farcomeni, and Fulvia Pennoni
Statistical Test Theory for the Behavioral Sciences
Dato N. M. de Gruijter and Leo J. Th. van der Kamp
Multivariable Modeling and Multivariate Analysis for the Behavioral Sciences
Brian S. Everitt
Multilevel Modeling Using R
W. Holmes Finch, Jocelyn E. Bolin, and Ken Kelley
Big Data and Social Science: A Practical Guide to Methods and Tools
Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter, and Julia Lane
Ordered Regression Models: Parallel, Partial, and Non-Parallel Alternatives
Andrew S. Fullerton and Jun Xu
Bayesian Methods: A Social and Behavioral Sciences Approach, Third Edition
Jeff Gill
Multiple Correspondence Analysis and Related Methods
Michael Greenacre and Jörg Blasius
Applied Survey Data Analysis
Steven G. Heeringa, Brady T. West, and Patricia A. Berglund
Informative Hypotheses: Theory and Practice for Behavioral and Social Scientists
Herbert Hoijtink
Generalized Structured Component Analysis: A Component-Based Approach to Structural Equation Modeling
Heungsun Hwang and Yoshio Takane
Bayesian Psychometric Modeling
Roy Levy and Robert J. Mislevy
Statistical Studies of Income, Poverty and Inequality in Europe: Computing and Graphics in R Using EU-SILC
Nicholas T. Longford
Foundations of Factor Analysis, Second Edition
Stanley A. Mulaik
Linear Causal Modeling with Structural Equations
Stanley A. Mulaik
Age–Period–Cohort Models: Approaches and Analyses with Aggregate Data
Robert M. O’Brien
Handbook of International Large-Scale Assessment: Background, Technical Issues, and Methods of Data
Analysis
Leslie Rutkowski, Matthias von Davier, and David Rutkowski
Generalized Linear Models for Categorical and Continuous Limited Dependent Variables
Michael Smithson and Edgar C. Merkle
Incomplete Categorical Data Design: Non-Randomized Response Techniques for Sensitive Questions in
Surveys
Guo-Liang Tian and Man-Lai Tang

Handbook of Item Response Theory, Volume 1: Models
Wim J. van der Linden
Handbook of Item Response Theory, Volume 2: Statistical Tools
Wim J. van der Linden
Handbook of Item Response Theory, Volume 3: Applications
Wim J. van der Linden
Computerized Multistage Testing: Theory and Applications
Duanli Yan, Alina A. von Davier, and Charles Lewis

Chapman & Hall/CRC
Statistics in the Social and Behavioral Sciences Series

BIG DATA AND
SOCIAL SCIENCE
A Practical Guide to Methods and Tools
Edited by

Ian Foster
University of Chicago
Argonne National Laboratory

Rayid Ghani
University of Chicago

Ron S. Jarmin
U.S. Census Bureau

Frauke Kreuter
University of Maryland
University of Mannheim
Institute for Employment Research

Julia Lane
New York University
American Institutes for Research

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
Version Date: 20160414
International Standard Book Number-13: 978-1-4987-5140-7 (Hardback)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but
the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to
trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained.
If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical,
or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without
written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright
Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a
variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to
infringe.
Library of Congress Cataloging‑in‑Publication Data
Names: Foster, Ian, 1959‑ editor.
Title: Big data and social science : a practical guide to methods and tools /
edited by Ian Foster, University of Chicago, Illinois, USA, Rayid Ghani,
University of Chicago, Illinois, USA, Ron S. Jarmin, U.S. Census Bureau,
USA, Frauke Kreuter, University of Maryland, USA, Julia Lane, New York
University, USA.
Description: Boca Raton, FL : CRC Press, [2017] | Series: Chapman & Hall/CRC
statistics in the social and behavioral sciences series | Includes
bibliographical references and index.
Identifiers: LCCN 2016010317 | ISBN 9781498751407 (alk. paper)
Subjects: LCSH: Social sciences‑‑Data processing. | Social
sciences‑‑Statistical methods. | Data mining. | Big data.
Classification: LCC H61.3 .B55 2017 | DDC 300.285/6312‑‑dc23
LC record available at https://lccn.loc.gov/2016010317

Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com

Contents

Preface
Editors
Contributors

1 Introduction
  1.1 Why this book?
  1.2 Defining big data and its value
  1.3 Social science, inference, and big data
  1.4 Social science, data quality, and big data
  1.5 New tools for new data
  1.6 The book's "use case"
  1.7 The structure of the book
    1.7.1 Part I: Capture and curation
    1.7.2 Part II: Modeling and analysis
    1.7.3 Part III: Inference and ethics
  1.8 Resources

I Capture and Curation

2 Working with Web Data and APIs
  Cameron Neylon
  2.1 Introduction
  2.2 Scraping information from the web
    2.2.1 Obtaining data from the HHMI website
    2.2.2 Limits of scraping
  2.3 New data in the research enterprise
  2.4 A functional view
    2.4.1 Relevant APIs and resources
    2.4.2 RESTful APIs, returned data, and Python wrappers
  2.5 Programming against an API
  2.6 Using the ORCID API via a wrapper
  2.7 Quality, scope, and management
  2.8 Integrating data from multiple sources
    2.8.1 The Lagotto API
    2.8.2 Working with a corpus
  2.9 Working with the graph of relationships
    2.9.1 Citation links between articles
    2.9.2 Categories, sources, and connections
    2.9.3 Data availability and completeness
    2.9.4 The value of sparse dynamic data
  2.10 Bringing it together: Tracking pathways to impact
    2.10.1 Network analysis approaches
    2.10.2 Future prospects and new data sources
  2.11 Summary
  2.12 Resources
  2.13 Acknowledgements and copyright

3 Record Linkage
  Joshua Tokle and Stefan Bender
  3.1 Motivation
  3.2 Introduction to record linkage
  3.3 Preprocessing data for record linkage
  3.4 Indexing and blocking
  3.5 Matching
    3.5.1 Rule-based approaches
    3.5.2 Probabilistic record linkage
    3.5.3 Machine learning approaches to linking
    3.5.4 Disambiguating networks
  3.6 Classification
    3.6.1 Thresholds
    3.6.2 One-to-one links
  3.7 Record linkage and data protection
  3.8 Summary
  3.9 Resources

4 Databases
  Ian Foster and Pascal Heus
  4.1 Introduction
  4.2 DBMS: When and why
  4.3 Relational DBMSs
    4.3.1 Structured Query Language (SQL)
    4.3.2 Manipulating and querying data
    4.3.3 Schema design and definition
    4.3.4 Loading data
    4.3.5 Transactions and crash recovery
    4.3.6 Database optimizations
    4.3.7 Caveats and challenges
  4.4 Linking DBMSs and other tools
  4.5 NoSQL databases
    4.5.1 Challenges of scale: The CAP theorem
    4.5.2 NoSQL and key–value stores
    4.5.3 Other NoSQL databases
  4.6 Spatial databases
  4.7 Which database to use?
    4.7.1 Relational DBMSs
    4.7.2 NoSQL DBMSs
  4.8 Summary
  4.9 Resources

5 Programming with Big Data
  Huy Vo and Claudio Silva
  5.1 Introduction
  5.2 The MapReduce programming model
  5.3 Apache Hadoop MapReduce
    5.3.1 The Hadoop Distributed File System
    5.3.2 Hadoop: Bringing compute to the data
    5.3.3 Hardware provisioning
    5.3.4 Programming language support
    5.3.5 Fault tolerance
    5.3.6 Limitations of Hadoop
  5.4 Apache Spark
  5.5 Summary
  5.6 Resources

II Modeling and Analysis

6 Machine Learning
  Rayid Ghani and Malte Schierholz
  6.1 Introduction
  6.2 What is machine learning?
  6.3 The machine learning process
  6.4 Problem formulation: Mapping a problem to machine learning methods
  6.5 Methods
    6.5.1 Unsupervised learning methods
    6.5.2 Supervised learning
  6.6 Evaluation
    6.6.1 Methodology
    6.6.2 Metrics
  6.7 Practical tips
    6.7.1 Features
    6.7.2 Machine learning pipeline
    6.7.3 Multiclass problems
    6.7.4 Skewed or imbalanced classification problems
  6.8 How can social scientists benefit from machine learning?
  6.9 Advanced topics
  6.10 Summary
  6.11 Resources

7 Text Analysis
  Evgeny Klochikhin and Jordan Boyd-Graber
  7.1 Understanding what people write
  7.2 How to analyze text
    7.2.1 Processing text data
    7.2.2 How much is a word worth?
  7.3 Approaches and applications
    7.3.1 Topic modeling
      7.3.1.1 Inferring topics from raw text
      7.3.1.2 Applications of topic models
    7.3.2 Information retrieval and clustering
    7.3.3 Other approaches
  7.4 Evaluation
  7.5 Text analysis tools
  7.6 Summary
  7.7 Resources

8 Networks: The Basics
  Jason Owen-Smith
  8.1 Introduction
  8.2 Network data
    8.2.1 Forms of network data
    8.2.2 Inducing one-mode networks from two-mode data
  8.3 Network measures
    8.3.1 Reachability
    8.3.2 Whole-network measures
  8.4 Comparing collaboration networks
  8.5 Summary
  8.6 Resources

III Inference and Ethics

9 Information Visualization
  M. Adil Yalçın and Catherine Plaisant
  9.1 Introduction
  9.2 Developing effective visualizations
  9.3 A data-by-tasks taxonomy
    9.3.1 Multivariate data
    9.3.2 Spatial data
    9.3.3 Temporal data
    9.3.4 Hierarchical data
    9.3.5 Network data
    9.3.6 Text data
  9.4 Challenges
    9.4.1 Scalability
    9.4.2 Evaluation
    9.4.3 Visual impairment
    9.4.4 Visual literacy
  9.5 Summary
  9.6 Resources

10 Errors and Inference
  Paul P. Biemer
  10.1 Introduction
  10.2 The total error paradigm
    10.2.1 The traditional model
    10.2.2 Extending the framework to big data
  10.3 Illustrations of errors in big data
  10.4 Errors in big data analytics
    10.4.1 Errors resulting from volume, velocity, and variety, assuming perfect veracity
    10.4.2 Errors resulting from lack of veracity
      10.4.2.1 Variable and correlated error
      10.4.2.2 Models for categorical data
      10.4.2.3 Misclassification and rare classes
      10.4.2.4 Correlation analysis
      10.4.2.5 Regression analysis
  10.5 Some methods for mitigating, detecting, and compensating for errors
  10.6 Summary
  10.7 Resources

11 Privacy and Confidentiality
  Stefan Bender, Ron Jarmin, Frauke Kreuter, and Julia Lane
  11.1 Introduction
  11.2 Why is access important?
  11.3 Providing access
  11.4 The new challenges
  11.5 Legal and ethical framework
  11.6 Summary
  11.7 Resources

12 Workbooks
  Jonathan Scott Morgan, Christina Jones, and Ahmad Emad
  12.1 Introduction
  12.2 Environment
    12.2.1 Running workbooks locally
    12.2.2 Central workbook server
  12.3 Workbook details
    12.3.1 Social Media and APIs
    12.3.2 Database basics
    12.3.3 Data Linkage
    12.3.4 Machine Learning
    12.3.5 Text Analysis
    12.3.6 Networks
    12.3.7 Visualization
  12.4 Resources

Bibliography

Index

Preface
The class on which this book is based was created in response to a
very real challenge: how to introduce new ideas and methodologies
about economic and social measurement into a workplace focused
on producing high-quality statistics. We are deeply grateful for the
inspiration and support of Census Bureau Director John Thompson
and Deputy Director Nancy Potok in designing and implementing
the class content and structure.
As with any book, there are many people to be thanked. We
are grateful to Christina Jones, Ahmad Emad, Josh Tokle from
the American Institutes for Research, and Jonathan Morgan from
Michigan State University, who, together with Alan Marco and Julie
Caruso from the US Patent and Trademark Office, Theresa Leslie
from the Census Bureau, Brigitte Raumann from the University of
Chicago, and Lisa Jaso from Summit Consulting, actually made the
class happen.
We are also grateful to the students of three “Big Data for Federal Statistics” classes in which we piloted this material, and to the
instructors and speakers beyond those who contributed as authors
to this edited volume—Dan Black, Nick Collier, Ophir Frieder, Lee
Giles, Bob Goerge, Laure Haak, Madian Khabsa, Jonathan Ozik,
Ben Shneiderman, and Abe Usher. The book would not exist without them.
We thank Trent Buskirk, Davon Clarke, Chase Coleman, Stephanie Eckman, Matt Gee, Laurel Haak, Jen Helsby, Madian Khabsa,
Ulrich Kohler, Charlotte Oslund, Rod Little, Arnaud Sahuguet, Tim
Savage, Severin Thaler, and Joe Walsh for their helpful comments
on drafts of this material.
We also owe a great debt to the copyeditor, Richard Leigh; the
project editor, Charlotte Byrnes; and the publisher, Rob Calver, for
their hard work and dedication.

Editors
Ian Foster is a Professor of Computer Science at the University of
Chicago and a Senior Scientist and Distinguished Fellow at Argonne
National Laboratory.
Ian has a long record of research contributions in high-performance computing, distributed systems, and data-driven discovery.
He has also led US and international projects that have produced
widely used software systems and scientific computing infrastructures. He has published hundreds of scientific papers and six books
on these and other topics. Ian is an elected fellow of the American Association for the Advancement of Science, the Association
for Computing Machinery, and the British Computer Society. His
awards include the British Computer Society’s Lovelace Medal and
the IEEE Tsutomu Kanai award.

Rayid Ghani is the Director of the Center for Data Science and
Public Policy and a Senior Fellow at the Harris School of Public
Policy and the Computation Institute at the University of Chicago.
Rayid is a reformed computer scientist and wannabe social scientist,
but mostly just wants to increase the use of data-driven approaches
in solving large public policy and social challenges. He is also passionate about teaching practical data science and started the Eric
and Wendy Schmidt Data Science for Social Good Fellowship at the
University of Chicago that trains computer scientists, statisticians,
and social scientists from around the world to work on data science
problems with social impact.
Before joining the University of Chicago, Rayid was the Chief
Scientist of the Obama 2012 Election Campaign, where he focused
on data, analytics, and technology to target and influence voters,
donors, and volunteers. Previously, he was a Research Scientist
and led the Machine Learning group at Accenture Labs. Rayid did
his graduate work in machine learning at Carnegie Mellon University
and is actively involved in organizing data science related conferences
and workshops. In his ample free time, Rayid works with
non-profits and government agencies to help them with their data,
analytics, and digital efforts and strategy.

Ron S. Jarmin is the Assistant Director for Research and Methodology at the US Census Bureau. He formerly was the Bureau’s Chief
Economist and Chief of the Center for Economic Studies and a Research Economist. He holds a PhD in economics from the University
of Oregon and has published papers in the areas of industrial organization, business dynamics, entrepreneurship, technology and
firm performance, urban economics, data access, and statistical
disclosure avoidance. He oversees a broad research program in
statistics, survey methodology, and economics to improve economic
and social measurement within the federal statistical system.

Frauke Kreuter is a Professor in the Joint Program in Survey Methodology at the University of Maryland, Professor of Methods and
Statistics at the University of Mannheim, and head of the statistical
methods group at the German Institute for Employment Research
in Nuremberg. Previously she held positions in the Department of
Statistics at the University of California Los Angeles (UCLA), and
the Department of Statistics at the Ludwig-Maximilians-University
of Munich. Frauke serves on several advisory boards for National
Statistical Institutes around the world and within the Federal Statistical System in the United States. She recently served as the
co-chair of the Big Data Task Force of the American Association
for Public Opinion Research. She is a Gertrude Cox Award winner, recognizing statisticians in early- to mid-career who have made
significant breakthroughs in statistical practice, and an elected fellow of the American Statistical Association. Her textbooks on Data
Analysis Using Stata and Practical Tools for Designing and Weighting
Survey Samples are used at universities worldwide, including Harvard University, Johns Hopkins University, Massachusetts Institute of Technology, Princeton University, and University College
London. Her Massive Open Online Course in Questionnaire Design attracted over 70,000 learners within the first year. Recently
Frauke launched the international long-distance professional education program sponsored by the German Federal Ministry of Education and Research in Survey and Data Science.

Julia Lane is a Professor at the New York University Wagner Graduate School of Public Service and at the NYU Center for Urban Science and Progress, and she is a NYU Provostial Fellow for Innovation
Analytics.
Julia has led many initiatives, including co-founding the UMETRICS and STAR METRICS programs at the National Science Foundation. She conceptualized and established a data enclave at
NORC/University of Chicago. She also co-founded the creation and
permanent establishment of the Longitudinal Employer-Household
Dynamics Program at the US Census Bureau and the Linked Employer Employee Database at Statistics New Zealand. Julia has
published over 70 articles in leading journals, including Nature and
Science, and authored or edited ten books. She is an elected fellow
of the American Association for the Advancement of Science and a
fellow of the American Statistical Association.

Contributors

Stefan Bender
Deutsche Bundesbank
Frankfurt, Germany

Paul P. Biemer
RTI International
Raleigh, NC, USA
University of North Carolina
Chapel Hill, NC, USA

Jordan Boyd-Graber
University of Colorado
Boulder, CO, USA

Ahmad Emad
American Institutes for Research
Washington, DC, USA

Pascal Heus
Metadata Technology North America
Knoxville, TN, USA

Christina Jones
American Institutes for Research
Washington, DC, USA

Evgeny Klochikhin
American Institutes for Research
Washington, DC, USA

Jonathan Scott Morgan
Michigan State University
East Lansing, MI, USA

Cameron Neylon
Curtin University
Perth, Australia

Jason Owen-Smith
University of Michigan
Ann Arbor, MI, USA

Catherine Plaisant
University of Maryland
College Park, MD, USA

Malte Schierholz
University of Mannheim
Mannheim, Germany

Claudio Silva
New York University
New York, NY, USA

Joshua Tokle
Amazon
Seattle, WA, USA

Huy Vo
City University of New York
New York, NY, USA

M. Adil Yalçın
University of Maryland
College Park, MD, USA

Chapter 1
Introduction
This section provides a brief overview of the goals and structure of
the book.

1.1 Why this book?

The world has changed for empirical social scientists. The new types
of “big data” have generated an entire new research field—that of
data science. That world is dominated by computer scientists who
have generated new ways of creating and collecting data, developed
new analytical and statistical techniques, and provided new ways of
visualizing and presenting information. These new sources of data
and techniques have the potential to transform the way applied
social science is done.
Research has certainly changed. Researchers draw on data that
are “found” rather than “made” by federal agencies; those publishing in leading academic journals are much less likely today to draw
on preprocessed survey data (Figure 1.1).
The way in which data are used has also changed for both government agencies and businesses. Chief data officers are becoming
as common in federal and state governments as chief economists
were decades ago, and in cities like New York and Chicago, mayoral
offices of data analytics have the ability to provide rapid answers
to important policy questions [233]. But since federal, state, and
local agencies lack the capacity to do such analysis themselves [8],
they must make these data available either to consultants or to the
research community. Businesses are also learning that making effective use of their data assets can have an impact on their bottom
line [56].
And the jobs have changed. The new job title of “data scientist” is highlighted in job advertisements on CareerBuilder.com and
Burning-glass.com—in the same category as statisticians, economists,
and other quantitative social scientists if starting salaries are useful
indicators.

[Figure 1.1. Use of pre-existing survey data in publications in leading journals, 1980–2010 [74]. The chart plots the percentage of micro-data based articles using survey data, by year, for the AER, JPE, QJE, and ECMA. Note: "Pre-existing survey" data sets refer to micro surveys such as the CPS or SIPP and do not include surveys designed by researchers for their study. Sample excludes studies whose primary data source is from developing countries.]

The goal of this book is to provide social scientists with an understanding of the key elements of this new science, its value, and
the opportunities for doing better work. The goal is also to identify
the many ways in which the analytical toolkits possessed by social
scientists can be brought to bear to enhance the generalizability of
the work done by computer scientists.
We take a pragmatic approach, drawing on our experience of
working with data. Most social scientists set out to solve a real-world social or economic problem: they frame the problem, identify
the data, do the analysis, and then draw inferences. At all points,
of course, the social scientist needs to consider the ethical ramifications of their work, particularly respecting privacy and confidentiality. The book follows the same structure. We chose a particular
problem—the link between research investments and innovation—
because that is a major social science policy issue, and one that
social scientists have been addressing using big data techniques.
While the example is specific and intended to show how abstract
concepts apply in practice, the approach is completely generalizable. The web scraping, linkage, classification, and text analysis
methods on display here are canonical in nature. The inference
and privacy and confidentiality issues are no different than in any
other study involving human subjects, and the communication of
results through visualization is similarly generalizable.

1.2 Defining big data and its value

There are almost as many definitions of big data as there are new
types of data. One approach is to define big data as anything too big
to fit onto your computer. Another approach is to define it as data
with high volume, high velocity, and great variety. We choose the
description adopted by the American Association of Public Opinion
Research: “The term ‘Big Data’ is an imprecise description of a
rich and complicated set of characteristics, practices, techniques,
ethical issues, and outcomes all associated with data” [188].
The value of the new types of data for social science is quite
substantial. Personal data has been hailed as the “new oil” of the
twenty-first century, and the benefits to policy, society, and public
opinion research are undeniable [139]. Policymakers have found
that detailed data on human beings can be used to reduce crime,
improve health delivery, and manage cities better [205]. The scope
is broad indeed: one of this book’s editors has used such data to
not only help win political campaigns but also show its potential
for public policy. Society can gain as well—recent work shows data-driven businesses were 5% more productive and 6% more profitable
than their competitors [56]. In short, the vision is that social science researchers can potentially, by using data with high velocity,
variety, and volume, increase the scope of their data collection efforts while at the same time reducing costs and respondent burden,
increasing timeliness, and increasing precision [265].

Example: New data enable new analyses
ShotSpotter data, which have fairly detailed information for each gunfire incident,
such as the precise timestamp and the nearest address, as well as the type of
shot, can be used to improve crime data [63]; Twitter data can be used to improve
predictions around job loss, job gain, and job postings [17]; and eBay postings can
be used to estimate demand elasticities [104].

◮ This topic is discussed in more detail in Chapter 5.

But most interestingly, the new data can change the way we
think about measuring and making inferences about behavior. For
example, it enables the capture of information on the subject’s entire environment—thus, for example, the effect of fast food caloric
labeling in health interventions [105]; the productivity of a cashier
if he is within eyesight of a highly productive cashier but not otherwise [252]. So it offers the potential to understand the effects of
complex environmental inputs on human behavior. In addition, big
data, by its very nature, enables us to study the tails of a distribution in a way that is not possible with small data. Much of interest
in human behavior is driven by the tails of the distribution—health
care costs by small numbers of ill people [356], economic activity
and employment by a small number of firms [93,109]—and is impossible to study with the small sample sizes available to researchers.
Instead we are still faced with the same challenges and responsibilities as we were before in the survey and small data collection
environment. Indeed, social scientists have a great deal to offer to a
(data) world that is currently looking to computer scientists to provide answers. Two major areas to which social scientists can contribute, based on decades of experience and work with end users,
are inference and attention to data quality.

1.3 Social science, inference, and big data

◮ This topic is discussed in more detail in Chapter 10.

The goal of empirical social science is to make inferences about a
population from available data. That requirement exists regardless
of the data source—and is a guiding principle for this book. For
probability-based survey data, methodology has been developed to
overcome problems in the data generating process. A guiding principle for survey methodologists is the total survey error framework,
and statistical methods for weighting, calibration, and other forms
of adjustment are commonly used to mitigate errors in the survey
process. Likewise for “broken” experimental data, techniques like
propensity score adjustment and principal stratification are widely
used to fix flaws in the data generating process. Two books provide
frameworks for survey quality [35, 143].
Across the social sciences, including economics, public policy,
sociology, management, (parts of) psychology and the like, we can
identify three categories of analysis with three different inferential
goals: description, causation, and prediction.

Description The job of many social scientists is to provide descriptive statements about the population of interest. These could be
univariate, bivariate, or even multivariate statements. Chapter 6
on machine learning will cover methods that go beyond simple descriptive statistics, known as unsupervised learning methods.
Descriptive statistics are usually created based on census data
or sample surveys to generate some summary statistics like a mean,
median, or a graphical distribution to describe the population of interest. In the case of a census, the work ends right there. With
sample surveys the point estimates come with measures of uncertainties (standard errors). The estimation of standard errors has
been worked out for most descriptive statistics and most common
survey designs, even complex ones that include multiple layers of
sampling and disproportional selection probabilities [154, 385].

Example: Descriptive statistics
The US Bureau of Labor Statistics surveys about 60,000 households a month and
from that survey is able to describe national employment and unemployment levels.
For example, in November 2015, total nonfarm payroll employment increased by
211,000, and the unemployment rate was unchanged at 5.0%. Job
gains occurred in construction, professional and technical services, and health
care. Mining and information lost jobs [57].

Proper inference, even for purely descriptive purposes, from a
sample to the population rests usually on knowing that everyone
from the target population had the chance to be included in the
survey, and knowing the selection probability for each element in
the population. The latter does not necessarily need to be known
prior to sampling, but eventually a probability is assigned for each
case. Getting the selection probabilities right is particularly important when reporting totals [243]. Unfortunately in practice, samples
that start out as probability samples can suffer from a high rate of
nonresponse. Because the survey designer cannot completely control which units respond, the set of units that ultimately respond
cannot be considered to be a probability sample [257]. Nevertheless,
starting with a probability sample provides some degree of comfort
that a sample will have limited coverage errors (nonzero probability
of being in the sample), and there are methods for dealing with a
variety of missing data problems [240].

Causation In many cases, social scientists wish to test hypotheses,
often originating in theory, about relationships between phenomena
of interest. Ideally such tests stem from data that allow causal
inference: typically randomized experiments or strong nonexperimental
study designs. When examining the effect of X on Y, knowing how
cases were selected into the sample or data set is much less important in the estimation of causal effects than for descriptive studies,
for example, population means. What is important is that all elements of the inferential population have a chance of being selected
for the treatment [179]. In the debate about probability and nonprobability surveys, this distinction is often overlooked. Medical
researchers have operated with unknown study selection mechanisms for years: for example, randomized trials that enroll only
selected samples.

Example: New data and causal inference
One of the major risks with using big data without thinking about the data source
is the misallocation of resources. Overreliance on, say, Twitter data in targeting resources after hurricanes can lead to the misallocation of resources towards young,
Internet-savvy people with cell phones, and away from elderly or impoverished
neighborhoods [340]. Of course, all data collection approaches have had similar
risks. Bad survey methodology led the Literary Digest to incorrectly call the 1936
election [353]. Inadequate understanding of coverage, incentive and quality issues,
together with the lack of a comparison group, has hampered the use of administrative records—famously in the case of using administrative records on crime to
make inference about the role of death penalty policy in crime reduction [95].

Of course, in practice it is difficult to ensure that results are
generalizable, and there is always a concern that the treatment
effect on the treated is different than the treatment effect in the
full population of interest [365]. Having unknown study selection
probabilities makes it even more difficult to estimate population
causal effects, but substantial progress is being made [99, 261]. As
long as we are able to model the selection process, there is no reason
not to do causal inference from so-called nonprobability data.

Prediction Forecasting or prediction tasks are a little less common
among applied social science researchers as a whole, but are certainly an important element for users of official statistics—in particular, in the context of social and economic indicators—and more generally
for decision-makers in government and business. Here, similar to
the causal inference setting, it is of utmost importance that we do
know the process that generated the data, and we can rule out any
unknown or unobserved systematic selection mechanism.


Example: Learning from the flu
“Five years ago [in 2009], a team of researchers from Google announced a remarkable achievement in one of the world’s top scientific journals, Nature. Without
needing the results of a single medical check-up, they were nevertheless able to
track the spread of influenza across the US. What’s more, they could do it more
quickly than the Centers for Disease Control and Prevention (CDC). Google’s tracking had only a day’s delay, compared with the week or more it took for the CDC
to assemble a picture based on reports from doctors’ surgeries. Google was faster
because it was tracking the outbreak by finding a correlation between what people
searched for online and whether they had flu symptoms. . . .
“Four years after the original Nature paper was published, Nature News had
sad tidings to convey: the latest flu outbreak had claimed an unexpected victim:
Google Flu Trends. After reliably providing a swift and accurate account of flu
outbreaks for several winters, the theory-free, data-rich model had lost its nose for
where flu was going. Google’s model pointed to a severe outbreak but when the
slow-and-steady data from the CDC arrived, they showed that Google’s estimates
of the spread of flu-like illnesses were overstated by almost a factor of two.
“The problem was that Google did not know—could not begin to know—what
linked the search terms with the spread of flu. Google’s engineers weren’t trying to
figure out what caused what. They were merely finding statistical patterns in the
data. They cared about correlation rather than causation” [155].

1.4 Social science, data quality, and big data

Most data in the real world are noisy, inconsistent, and suffer from
missing values, regardless of the source. Even if data collection
is cheap, the costs of creating high-quality data from the source—
cleaning, curating, standardizing, and integrating—are substantial.
Data quality can be characterized in multiple ways [76]:

• Accuracy: How accurate are the attribute values in the data?
• Completeness: Is the data complete?
• Consistency: How consistent are the values in and between
the database(s)?

• Timeliness: How timely is the data?
• Accessibility: Are all variables available for analysis?

◮ This topic is discussed in more detail in Chapter 3.

Social scientists have decades of experience in transforming
messy, noisy, and unstructured data into a well-defined, clearly
structured, and quality-tested data set. Preprocessing is a complex
and time-consuming process because it is “hands-on”—it requires
judgment and cannot be effectively automated. A typical workflow
comprises multiple steps from data definition to parsing and ends
with filtering. It is difficult to overstate the value of preprocessing
for any data analysis, but this is particularly true in big data. Data
need to be parsed, standardized, deduplicated, and normalized.
Parsing is a fundamental step taken regardless of the data source,
and refers to the decomposition of a complex variable into components. For example, a freeform address field like “1234 E 56th St”
might be broken down into a street number “1234” and a street
name “E 56th St.” The street name could be broken down further
to extract the cardinal direction “E” and the designation “St.” Another example would be a combined full name field that takes the
form of a comma-separated last name, first name, and middle initial
as in “Miller, David A.” Splitting these identifiers into components
permits the creation of more refined variables that can be used in
the matching step.
In the simplest case, the distinct parts of a character field are
delimited. In the name field example, it would be easy to create the
separate fields “Miller” and “David A” by splitting the original field
at the comma. In more complex cases, special code will have to
be written to parse the field. Typical steps in a parsing procedure
include:
1. Splitting fields into tokens (words) on the basis of delimiters,
2. Standardizing tokens by lookup tables and substitution by a
standard form,
3. Categorizing tokens,
4. Identifying a pattern of anchors, tokens, and delimiters,
5. Calling subroutines according to the identified pattern, thereby
mapping tokens to the predefined components (a minimal sketch of
these steps follows below).
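
To make these parsing steps concrete, here is a minimal sketch in Python (the language used throughout the book). The field layouts, the lookup table, and the helper names are hypothetical, chosen only to illustrate the general pattern.

```python
import re

# Hypothetical lookup table used to standardize tokens (step 2).
STREET_ABBREVIATIONS = {"STREET": "ST", "AVENUE": "AVE"}

def parse_name(raw):
    """Split a 'Last, First Middle' field at the comma (the simple delimited case)."""
    last, _, rest = raw.partition(",")
    first, _, middle = rest.strip().partition(" ")
    return {"last": last.strip(), "first": first.strip(), "middle": middle.strip()}

def parse_address(raw):
    """Tokenize a freeform address field and map tokens to predefined components."""
    tokens = raw.upper().split()                                # step 1: split into tokens
    tokens = [STREET_ABBREVIATIONS.get(t, t) for t in tokens]   # step 2: standardize tokens
    has_number = bool(re.fullmatch(r"\d+", tokens[0]))          # steps 3-4: categorize, find pattern
    return {                                                    # step 5: map to components
        "street_number": tokens[0] if has_number else None,
        "street_name": " ".join(tokens[1:] if has_number else tokens),
    }

print(parse_name("Miller, David A."))   # {'last': 'Miller', 'first': 'David', 'middle': 'A.'}
print(parse_address("1234 E 56th St"))  # {'street_number': '1234', 'street_name': 'E 56TH ST'}
```
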
Standardization refers to the process of simplifying data by replacing variant representations of the same underlying observation
by a default value in order to improve the accuracy of field comparisons. For example, “First Street” and “1st St” are two ways of
writing the same street name, but a simple string comparison of
these values will return a poor result. By standardizing fields—and
using the same standardization rules across files!—the number of
true matches that are wrongly classified as nonmatches (i.e., the
number of false nonmatches) can be reduced.
Some common examples of standardization are:

• Standardization of different spellings of frequently occurring
words: for example, replacing common abbreviations in street
names (Ave, St, etc.) or titles (Ms, Dr, etc.) with a common
form. These kinds of rules are highly country- and language-specific.

• General standardization, including converting character fields
to all uppercase and removing punctuation and digits.
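
As an illustration, the following minimal sketch applies both kinds of rules just listed to the street-name example; the abbreviation table is a made-up fragment, not a real rule set.

```python
import re

# Illustrative abbreviation rules; real rule sets are highly country- and language-specific.
ABBREVIATIONS = {"FIRST": "1ST", "STREET": "ST", "AVENUE": "AVE"}

def standardize(field):
    """Uppercase, drop punctuation, and replace known variants with a standard form."""
    field = re.sub(r"[^\w\s]", "", field.upper())   # general standardization
    return " ".join(ABBREVIATIONS.get(t, t) for t in field.split())

# Both variants now compare equal, so they no longer produce a false nonmatch.
print(standardize("First Street"))  # '1ST ST'
print(standardize("1st St."))       # '1ST ST'
```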

Deduplication consists of removing redundant records from a
single list, that is, multiple records from the same list that refer to
the same underlying entity. After deduplication, each record in the
first list will have at most one true match in the second list and vice
versa. This simplifies the record linkage process and is necessary if
the goal of record linkage is to find the best set of one-to-one links
(as opposed to a list of all possible links). One can deduplicate a list
by applying record linkage techniques described in Chapter 3 to
link a file to itself.
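
A minimal sketch of that idea, with exact agreement on a standardized name standing in for the fuller matching techniques of Chapter 3; the records and the key function are hypothetical.

```python
import re

records = [
    {"id": 1, "name": "Miller, David A."},
    {"id": 2, "name": "MILLER, DAVID A"},
    {"id": 3, "name": "Smith, Jane"},
]

def dedup_key(record):
    """Matching key: the name, uppercased with punctuation removed (cf. the standardization sketch above)."""
    return re.sub(r"[^\w\s]", "", record["name"].upper())

deduplicated = {}
for rec in records:
    # Keep the first record seen for each key; records 1 and 2 collapse into one entity.
    deduplicated.setdefault(dedup_key(rec), rec)

print(list(deduplicated.values()))
# [{'id': 1, 'name': 'Miller, David A.'}, {'id': 3, 'name': 'Smith, Jane'}]
```
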
Normalization is the process of ensuring that the fields that are
being compared across files are as similar as possible in the sense
that they could have been generated by the same process. At minimum, the same standardization rules should be applied to both
files. For additional examples, consider a salary field in a survey.
There are a number of different ways that salary could be recorded: it
might be truncated as a privacy-preserving measure or rounded to
the nearest thousand, and missing values could be imputed with
the mean or with zero. During normalization we take note of exactly
how fields are recorded.
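
A small sketch of normalization under an assumed scenario: one file records exact salaries while the other rounds to the nearest thousand, so both fields are brought to the coarser common form before they are compared.

```python
def normalize_salary(value, rounding=1000):
    """Bring a salary to the coarsest form used by either file (assumed here: nearest thousand);
    missing values are left missing rather than imputed differently in each file."""
    if value is None:
        return None
    return round(value / rounding) * rounding

file_a = [52480, 61250, None]  # exact salaries
file_b = [52000, 61000, None]  # already rounded to the nearest thousand

print([normalize_salary(v) for v in file_a])  # [52000, 61000, None]
print([normalize_salary(v) for v in file_b])  # [52000, 61000, None]
```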

1.5 New tools for new data

The new data sources that we have discussed frequently require
working at scales for which the social scientist’s familiar tools are
not designed. Fortunately, the wider research and data analytics
community has developed a wide variety of often more scalable and
flexible tools—tools that we will introduce within this book.
◮ This topic is discussed in more detail in Chapter 4.

Relational database management systems (DBMSs) are used
throughout business as well as the sciences to organize, process,
and search large collections of structured data. NoSQL DBMSs are
used for data that is extremely large and/or unstructured, such as
collections of web pages, social media data (e.g., Twitter messages),
and clinical notes. Extensions to these systems and also specialized single-purpose DBMSs provide support for data types that are
not easily handled in statistical packages such as geospatial data,
networks, and graphs.
Open source programming systems such as Python (used extensively throughout this book) and R provide high-quality implementations of numerous data analysis and visualization methods,
from regression to statistics, text analysis, network analysis, and
much more. Finally, parallel computing systems such as Hadoop
and Spark can be used to harness parallel computer clusters for
extremely large data sets and computationally intensive analyses.
These various components may not always work together as
smoothly as do integrated packages such as SAS, SPSS, and Stata,
but they allow researchers to take on problems of great scale and
complexity. Furthermore, they are developing at a tremendous rate
as the result of work by thousands of people worldwide. For these
reasons, the modern social scientist needs to be familiar with their
characteristics and capabilities.

1.6 The book's "use case"

⋆ UMETRICS: Universities Measuring the Impact of Research on Innovation and Science [228]
◮ iris.isr.umich.edu

This book is about the uses of big data in social science. Our focus
is on working through the use of data as a social scientist normally
approaches research. That involves thinking through how to use
such data to address a question from beginning to end, and thereby
learning about the associated tools—rather than simply engaging in
coding exercises and then thinking about how to apply them to a
potpourri of social science examples.
There are many examples of the use of big data in social science
research, but relatively few that feature all the different aspects
that are covered in this book. As a result, the chapters in the book
draw heavily on a use case based on one of the first large-scale big
data social science data infrastructures. This infrastructure, based
on UMETRICS* data housed at the University of Michigan’s Institute for Research on Innovation and Science (IRIS) and enhanced
with data from the US Census Bureau, provides a new quantitative
analysis and understanding of science policy based on large-scale
computational analysis of new types of data.

1.6. The book’s “use case”

The infrastructure was developed in response to a call from the
President’s Science Advisor (Jack Marburger) for a science of science
policy [250]. He wanted a scientific response to the questions that
he was asked about the impact of investments in science.

Example: The Science of Science Policy
Marburger wrote [250]: “How much should a nation spend on science? What
kind of science? How much from private versus public sectors? Does demand for
funding by potential science performers imply a shortage of funding or a surfeit
of performers? These and related science policy questions tend to be asked and
answered today in a highly visible advocacy context that makes assumptions that
are deserving of closer scrutiny. A new ‘science of science policy’ is emerging, and
it may offer more compelling guidance for policy decisions and for more credible
advocacy. . . .
“Relating R&D to innovation in any but a general way is a tall order, but not
a hopeless one. We need econometric models that encompass enough variables
in a sufficient number of countries to produce reasonable simulations of the effect
of specific policy choices. This need won’t be satisfied by a few grants or workshops, but demands the attention of a specialist scholarly community. As more
economists and social scientists turn to these issues, the effectiveness of science
policy will grow, and of science advocacy too.”

Responding to this policy imperative is a tall order, because it involves using all the social science and computer science tools available to researchers. The new digital technologies can be used to
capture the links between the inputs into research, the way in which
those inputs are organized, and the subsequent outputs [396, 415].
The social science questions that are addressable with this data infrastructure include the effect of research training on the placement
and earnings of doctoral recipients, how university trained scientists and engineers affect the productivity of the firms they work for,
and the return on investments in research. Figure 1.2 provides an
abstract representation of the empirical approach that is needed:
data about grants, the people who are funded on grants, and the
subsequent scientific and economic activities.
First, data must be captured on what is funded, and since the
data are in text format, computational linguistics tools must be
applied (Chapter 7). Second, data must be captured on who is
funded, and how they interact in teams, so network tools and analysis must be used (Chapter 8). Third, information about the type of
results must be gleaned from the web and other sources (Chapter 2).

[Figure 1.2 is a network diagram linking Funding, People, Institutions, and Products through relationships such as "is awarded to," "employs," "trains," "collaborates," "co-authors," "pays for," "supports," and "produces and uses."]

Figure 1.2. A visualization of the complex links between what and who is funded, and the results; tracing the direct link between funding and results is misleading and wrong

Finally, the disparate complex data sets need to be stored in databases (Chapter 4), integrated (Chapter 3), analyzed (Chapter 6), and
used to make inferences (Chapter 10).
The use case serves as the thread that ties many of the ideas
together. Rather than asking the reader to learn how to code “hello
world,” we build on data that have been put together to answer a
real-world question, and provide explicit examples based on that
data. We then provide examples that show how the approach generalizes.
For example, the text analysis chapter (Chapter 7) shows how
to use natural language processing to describe what research is
being done, using proposal and award text to identify the research
topics in a portfolio [110, 368]. But then it also shows how the
approach can be used to address a problem that is not just limited
to science policy—the conversion of massive amounts of knowledge
that is stored in text to usable information.

Similarly, the network analysis chapter (Chapter 8) gives specific
examples using the UMETRICS data and shows how such data can
be used to create new units of analysis—the networks of researchers
who do science, and the networks of vendors who supply research
inputs. It also shows how networks can be used to study a wide
variety of other social science questions.
In another example, we use APIs* provided by publishers to describe the results generated by research funding in terms of publications and other measures of scientific impact, but also provide code that can be repurposed for many similar APIs.

⋆ Application Programming Interfaces
And, of course, since all these new types of data are provided
in a variety of different formats, some of which are quite large (or
voluminous), and with a variety of different timestamps (or velocity),
we discuss how to store the data in different types of data formats.

1.7 The structure of the book

We organize the book in three parts, based around the way social
scientists approach doing research. The first set of chapters addresses the new ways to capture, curate, and store data. The second set of chapters describes what tools are available to process and
classify data. The last set deals with analysis and the appropriate
handling of data on individuals and organizations.

1.7.1 Part I: Capture and curation

The four chapters in Part I (see Figure 1.3) tell you how to capture
and manage data.
Chapter 2 describes how to extract information from social media about the transmission of knowledge. The particular application will be to develop links to authors’ articles on Twitter using
PLOS articles and to pull information about authors and articles
from web sources by using an API. You will learn how to retrieve
link data from bookmarking services, citations from Crossref, links
from Facebook, and information from news coverage. In keeping with the social science grounding that is a core feature of the
book, the chapter discusses what data can be captured from online
sources, what is potentially reliable, and how to manage data quality
issues.
Figure 1.3. The four chapters of Part I focus on data capture and curation: API and web scraping (Chapter 2, different ways of collecting data); record linkage (Chapter 3, combining different data sets); storing data (Chapter 4, ingest, query, and export data); processing large data sets (Chapter 5, creating innovation measures).

Big data differs from survey data in that we must typically combine data from multiple sources to get a complete picture of the activities of interest. Although computer scientists may sometimes
simply “mash” data sets together, social scientists are rightfully
concerned about issues of missing links, duplicative links, and
erroneous links. Chapter 3 provides an overview of traditional
rule-based and probabilistic approaches to data linkage, as well
as the important contributions of machine learning to the linkage
problem.
Once data have been collected and linked into different files, it is necessary to store and organize them. Social scientists are used to
working with one analytical file, often in statistical software tools
such as SAS or Stata. Chapter 4, which may be the most important chapter in the book, describes different approaches to storing data in ways that permit rapid and reliable exploration and
analysis.
Big data is sometimes defined as data that are too big to fit onto
the analyst’s computer. Chapter 5 provides an overview of clever
programming techniques that facilitate the use of data (often using
parallel computing). While the focus is on one of the most widely used big data programming paradigms, MapReduce, and its most popular implementation, Apache Hadoop, the goal of the chapter is to provide a conceptual framework for the key challenges that the approach is designed to address.

Figure 1.4. The three chapters in Part II focus on data modeling and analysis: machine learning (Chapter 6, classifying data in new ways); text analysis (Chapter 7, creating new data from text); networks (Chapter 8, creating new measures of social and economic activity).

1.7.2 Part II: Modeling and analysis

The three chapters in Part II (see Figure 1.4) introduce three of the
most important tools that can be used by social scientists to do new
and exciting research: machine learning, text analysis, and social
network analysis.
Chapter 6 introduces machine learning methods. It shows the
power of machine learning in a variety of different contexts, particularly focusing on clustering and classification. You will get an
overview of basic approaches and how those approaches are applied.
The chapter builds from a conceptual framework and then shows
you how the different concepts are translated into code. There is
a particular focus on random forests and support vector machine
(SVM) approaches.
Chapter 7 describes how social scientists can make use of one of
the most exciting advances in big data—text analysis. Vast amounts
of data that are stored in documents can now be analyzed and
searched so that different types of information can be retrieved.
Documents (and the underlying activities of the entities that generated the documents) can be categorized into topics or fields as well
as summarized. In addition, machine translation can be used to
compare documents in different languages.

16

1. Introduction

Social scientists are typically interested in describing the activities of individuals and organizations (such as households and firms)
in a variety of economic and social contexts. The frames within
which data are collected have typically been generated from tax or
other programmatic sources. The new types of data permit new
units of analysis—particularly network analysis—largely enabled by
advances in mathematical graph theory. Thus, Chapter 8 describes
how social scientists can use network theory to generate measurable
representations of patterns of relationships connecting entities. As
the author points out, the value of the new framework is not only in
constructing different right-hand-side variables but also in studying an entirely new unit of analysis that lies somewhere between
the largely atomistic actors that occupy the markets of neo-classical
theory and the tightly managed hierarchies that are the traditional
object of inquiry of sociologists and organizational theorists.

1.7.3 Part III: Inference and ethics

The four chapters in Part III (see Figure 1.5) cover three advanced
topics relating to data inference and ethics—information visualization, errors and inference, and privacy and confidentiality—and introduce the workbooks that provide access to the practical exercises
associated with the text.

Figure 1.5. The four chapters in Part III focus on inference and ethics: visualization (Chapter 9, making sense of the data); inference (Chapter 10, drawing statistically valid conclusions); privacy and confidentiality (Chapter 11, handling data appropriately); workbooks (Chapter 12, applying new models and tools).

Chapter 9 introduces information visualization methods and describes how you can use those methods to explore data and communicate results so that data can be turned into interpretable, actionable information. There are many ways of presenting statistical information that convey content in a rigorous manner. The
goal of this chapter is to explore different approaches and examine the information content and analytical validity of the different
approaches. It provides an overview of effective visualizations.
Chapter 10 deals with inference and the errors associated with
big data. Social scientists know only too well the cost associated
with bad data—we highlighted the classic Literary Digest example
in the introduction to this chapter, as well as the more recent Google
Flu Trends. Although the consequences are well understood, the
new types of data are so large and complex that their properties
often cannot be studied in traditional ways. In addition, the data
generating function is such that the data are often selective, incomplete, and erroneous. Without proper data hygiene, errors can
quickly compound. This chapter provides a systematic way to think
about the error framework in a big data setting.
Chapter 11 addresses the issue that sits at the core of any study
of human beings—privacy and confidentiality. In a new field, like the
one covered in this book, it is critical that many researchers have
access to the data so that work can be replicated and built on—that
there be a scientific basis to data science. Yet the rules that social
scientists have traditionally used for survey data, namely anonymity
and informed consent, no longer apply when the data are collected
in the wild. This concluding chapter identifies the issues that must
be addressed for responsible and ethical research to take place.
Finally, Chapter 12 provides an overview of the practical work
that accompanies each chapter—the workbooks that are designed,
using Jupyter notebooks, to enable students and interested practitioners to apply the new techniques and approaches in selected
chapters. We hope you have a lot of fun with them.

1.8 Resources

For more information on the science of science policy, see Husbands
et al.’s book for a full discussion of many issues [175] and the online
resources at the eponymous website [352].
◮ See jupyter.org.
◮ Read this! http://bit.ly/1VgytVV

This book is above all a practical introduction to the methods and tools that the social scientist can use to make sense of big data, and thus programming resources are also important. We make
extensive use of the Python programming language and the MySQL
database management system in both the book and its supporting
workbooks. We recommend that any social scientist who aspires
to work with large data sets become proficient in the use of these
two systems, and also one more, GitHub. All three, fortunately, are
quite accessible and are supported by excellent online resources.
Time spent mastering them will be repaid many times over in more
productive research.
For Python, Alex Bell’s Python for Economists (available online
[31]) provides a wonderful 30-page introduction to the use of Python
in the social sciences, complete with XKCD cartoons. Economists
Tom Sargent and John Stachurski provide a very useful set of lectures and examples at http://quant-econ.net/. For more detail, we
recommend Charles Severance’s Python for Informatics: Exploring
Information [338], which not only covers basic Python but also provides material relevant to web data (the subject of Chapter 2) and
MySQL (the subject of Chapter 4). This book is also freely available
online and is supported by excellent online lectures and exercises.
For MySQL, Chapter 4 provides introductory material and pointers to additional resources, so we will not say more here.
We also recommend that you master GitHub. A version control
system is a tool for keeping track of changes that have been made
to a document over time. GitHub is a hosting service for projects
that use the Git version control system. As Strasser explains [363],
Git/GitHub makes it straightforward for researchers to create digital lab notebooks that record the data files, programs, papers, and
other resources associated with a project, with automatic tracking
of the changes that are made to those resources over time. GitHub
also makes it easy for collaborators to work together on a project,
whether a program or a paper: changes made by each contributor are recorded and can easily be reconciled. For example, we
used GitHub to create this book, with authors and editors checking in changes and comments at different times and from many time
zones. We also use GitHub to provide access to the supporting workbooks. Ram [314] provides a nice description of how Git/GitHub can
be used to promote reproducibility and transparency in research.
One more resource that is outside the scope of this book but that
you may well want to master is the cloud [21,236]. It used to be that
when your data and computations became too large to analyze on
your laptop, you were out of luck unless your employer (or a friend)
had a larger computer. With the emergence of cloud storage and
computing services from the likes of Amazon Web Services, Google,
and Microsoft, powerful computers are available to anyone with a
credit card. We and many others have had positive experiences
using such systems for urban [64], environmental [107], and genomic [32] data analysis and modeling, for example.
Such systems may well represent the future of research computing.

Part I
Capture and Curation

Chapter 2
Working with Web Data and APIs
Cameron Neylon
This chapter will show you how to extract information from social
media about the transmission of knowledge. The particular application will be to develop links to authors’ articles on Twitter using
PLOS articles and to pull information using an API. You will get link
data from bookmarking services, citations from Crossref, links from
Facebook, and information from news coverage. The examples that
will be used are from Twitter. In keeping with the social science
grounding that is a core feature of the book, it will discuss what can
be captured, what is potentially reliable, and how to manage data
quality issues.

2.1 Introduction

A tremendous lure of the Internet is the availability of vast amounts
of data on businesses, people, and their activity on social media.
But how can we capture the information and make use of it as we
might make use of more traditional data sources? In this chapter,
we begin by describing how web data can be collected, using the
use case of UMETRICS and research output as a readily available
example, and then discuss how to think about the scope, coverage,
and integration issues associated with its collection.
Often a big data exploration starts with information on people or
on a group of people. The web can be a rich source of additional information. It can also act as pointers to new sources of information,
allowing a pivot from one perspective to another, from one kind of
query to another. Often this is exploratory. You have an existing
core set of data and are looking to augment it. But equally this
exploration can open up whole new avenues. Sometimes the data
are completely unstructured, existing as web pages spread across a
site, and sometimes they are provided in a machine-readable form.
The challenge is in having a sufficiently diverse toolkit to bring all
of this information together.
Using the example of data on researchers and research outputs,
we will explore obtaining information directly from web pages (web
scraping) as well as explore the uses of APIs—web services that allow
an interaction with, and retrieval of, structured data. You will see
how the crucial pieces of integration often lie in making connections
between disparate data sets and how in turn making those connections requires careful quality control. The emphasis throughout
this chapter is on the importance of focusing on the purpose for
which the data will be used as a guide for data collection. While
much of this is specific to data about research and researchers, the
ideas are generalizable to wider issues of data and public policy.

2.2 Scraping information from the web

With the range of information available on the web, our first question is how to access it. The simplest approach is often to manually
go directly to the web and look for data files or other information.
For instance, on the NSF website [268] it is possible to obtain data
dumps of all grant information. Sometimes data are available only
on web pages or we only want a subset of this information. In this
case web scraping is often a viable approach.
Web scraping involves using a program to download and process
web pages directly. This can be highly effective, particularly where
tables of information are made available online. It is also useful in
cases where it is desirable to make a series of very similar queries.
In each case we need to look at the website, identify how to get the
information we want, and then process it. Many websites deliberately make this difficult to prevent easy access to their underlying
data.

2.2.1 Obtaining data from the HHMI website

Let us suppose we are interested in obtaining information on those
investigators that are funded by the Howard Hughes Medical Institute (HHMI). HHMI has a website that includes a search function
for funded researchers, including the ability to filter by field, state,
and role. But there does not appear to be a downloadable data set
of this information. However, we can automate the process with
code to create a data set that you might compare with other data.

This process involves first understanding how to construct a URL
that will do the search we want. This is most easily done by playing
with search functionality and investigating the URL structures that
are returned. Note that in many cases websites are not helpful
here. However, with HHMI if we do a general search and play with
the structure of the URL, we can see some of the elements of the URL
that we can think of as a query. As we want to see all investigators,
we do not need to limit the search, and so with some fiddling we
come up with a URL like the following. (We have broken the one-line
URL into three lines for ease of presentation.)
http://www.hhmi.org/scientists/browse?
kw=&sort_by=field_scientist_last_name&
sort_order=ASC&items_per_page=20&page=0

The requests module, a widely used third-party Python library that comes preinstalled in many Jupyter environments, is a useful set of tools for handling interactions with websites. It lets us construct the request that we just presented in terms of a base URL and query terms, as follows:
>> BASE_URL = "http://www.hhmi.org/scientists/browse"
>> query = {
"kw" : "",
"sort_by" : "field_scientist_last_name",
"sort_order" : "ASC",
"items_per_page" : 20,
"page" : None
}
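If you want to see the exact URL that these parameters produce (for instance, to compare it with the hand-constructed URL above), requests can show you the prepared request. This is a small illustrative aside rather than part of the chapter's workflow; it reuses the BASE_URL and query variables just defined. Note that requests omits parameters whose value is None, so "page" does not appear in the query string.

import requests

prepared = requests.Request("GET", BASE_URL, params=query).prepare()
print(prepared.url)
# e.g. http://www.hhmi.org/scientists/browse?kw=&sort_by=field_scientist_last_name
#      &sort_order=ASC&items_per_page=20  (parameter order may vary)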

With our request constructed we can then make the call to the
web page to get a response.
>> import requests
>> response = requests.get(BASE_URL, params=query)

The first thing to do when building a script that hits a web page
is to make sure that your call was successful. This can be checked
by looking at the response code that the web server sent—and, obviously, by checking the actual HTML that was returned. A 200 code
means success and that everything should be OK. Other codes may
mean that the URL was constructed wrongly or that there was a
server error.
>> response.status_code
200
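If you would rather have a script fail loudly than continue with a bad page, requests can also convert error codes into exceptions. This is an optional defensive sketch, not part of the chapter's code; it reuses the BASE_URL and query variables defined above.

import requests

response = requests.get(BASE_URL, params=query)
try:
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
except requests.HTTPError as err:
    print("Request failed:", err)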

Figure 2.1. Source HTML from the portion of an HHMI results page containing information on HHMI investigators; note that the scraped HTML is badly formatted and difficult to read.

⋆ Python features many useful libraries; BeautifulSoup is particularly helpful for web scraping.

With the page successfully returned, we now need to process the text it contains into the data we want. This is not a trivial exercise. It is possible to search through and find things, but there
are a range of tools that can help with processing HTML and XML
data. Among these one of the most popular is a module called
BeautifulSoup* [319], which provides a number of useful functions
for this kind of processing. The module documentation provides
more details.
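As a minimal, self-contained sketch of the kind of processing BeautifulSoup supports (the HTML below is invented for illustration, although the class names mirror those discussed for the HHMI page; the chapter's own code uses the html5lib parser, whereas this sketch uses the built-in html.parser to avoid an extra dependency):

from bs4 import BeautifulSoup

html = """
<div class="view-content">
  <div class="views-row"><a href="/scientists/jane-doe">Jane Doe</a></div>
  <div class="views-row"><a href="/scientists/john-roe">John Roe</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", class_="view-content")    # first matching element
rows = content.find_all("div", class_="views-row")   # all matching elements
names = [row.find("a").text for row in rows]
links = [row.find("a").get("href") for row in rows]
print(names)   # ['Jane Doe', 'John Roe']
print(links)   # ['/scientists/jane-doe', '/scientists/john-roe']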
We need to check the details of the page source to find where the
information we are looking for is kept (see, for example, Figure 2.1).
Here, all the details on HHMI investigators can be found in a <div> element with the class attribute view-content. This structure is not something that can be determined in advance. It requires knowledge of the structure of the page itself. Nested inside this <div>
element are another series of divs, each of which corresponds to one investigator. These have the class attribute view-rows. Again, there is nothing obvious about finding these, it requires a close examination of the page HTML itself for any specific case you happen to be looking at. We first process the page using the BeautifulSoup module (into the variable soup) and then find the div element that holds the information on investigators (investigator_list). As this element is unique on the page (I checked using my web browser), we can use the find method. We then process that div (using find_all) to create an iterator object that contains each of the page segments detailing a single investigator (investigators). >> >> >> >> from bs4 import BeautifulSoup soup = BeautifulSoup(response.text, "html5lib") investigator_list = soup.find(’div’, class_ = "view-content") investigators = investigator_list.find_all("div", class_ = " views-row") As we specified in our query parameters that we wanted 20 results per page, we should check whether our list of page sections has the right length. >> len(investigators) 20 2.2. Scraping information from the web # Given a request response object, parse for HHMI investigators def scrape(page_response): # Obtain response HTML and the correct
from the page soup = BeautifulSoup(response.text, "html5lib") inv_list = soup.find('div', class_ = "view-content") # Create a list of all the investigators on the page investigators = inv_list.find_all("div", class_ = "views-row") data = [] # Make the data object to store scraping results # Scrape needed elements from investigator list for investigator in investigators: inv = {} # Create a dictionary to store results # Name and role are in same HTML element; this code # separates them into two data elements name_role_tag = investigator.find("div", class_ = "views-field-field-scientist-classification") strings = name_role_tag.stripped_strings for string,a in zip(strings, ["name", "role"]): inv[a] = string # Extract other elements from text of specific divs or from # class attributes of tags in the page (e.g., URLs) research_tag = investigator.find("div", class_ = "views-field-field-scientist-research-abs-nod") inv["research"] = research_tag.text.lstrip() inv["research_url"] = "http://hhmi.org" + research_tag.find("a").get("href") institution_tag = investigator.find("div", class_ = "views-field-field-scientist-academic-institu") inv["institute"] = institution_tag.text.lstrip() town_state_tag = investigator.find("div", class_ = "views-field-field-scientist-institutionstate" ) inv["town"], inv["state"] = town_state_tag.text.split(",") inv["town"] = inv.get("town").lstrip() inv["state"] = inv.get("state").lstrip() thumbnail_tag = investigator.find("div", class_ = "views-field-field-scientist-image-thumbnail") inv["thumbnail_url"] = thumbnail_tag.find("img")["src"] inv["url"] = "http://hhmi.org" + thumbnail_tag.find("a").get("href") # Add the new data to the list data.append(inv) return data Listing 2.1. Python code to parse for HHMI investigators 27 28 2. Working with Web Data and APIs Finally, we need to process each of these segments to obtain the data we are looking for. This is the actual “scraping” of the page to get the information we want. Again, this involves looking closely at the HTML itself, identifying where the information is held, what tags can be used to find it, and often doing some postprocessing to clean it up (removing spaces, splitting different elements up). Listing 2.1 provides a function to handle all of this. The function accepts the response object from the requests module as its input, processes the page text to soup, and then finds the investigator_list as above and processes it into an actual list of the investigators. For each investigator it then processes the HTML to find and clean up the information required, converting it to a dictionary and adding it to our growing list of data. Let us check what the first two elements of our data set now look like. You can see two dictionaries, one relating to Laurence Abbott, who is a senior fellow at the HHMI Janelia Farm Campus, and one for Susan Ackerman, an HHMI investigator based at the Jackson Laboratory in Bar Harbor, Maine. Note that we have also obtained URLs that give more details on the researcher and their research program (research_url and url keys in the dictionary) that could provide a useful input to textual analysis or topic modeling (see Chapter 7). >> data = scrape(response) >> data[0:2] [{’institute’: u’Janelia Research Campus ’, ’name’: u’Laurence Abbott, PhD’, ’research’: u’Computational and Mathematical Modeling of Neurons and Neural... 
’, ’research_url’: u’http://hhmi.org/research/computational-andmathematical-modeling-neurons-and-neural-networks’, ’role’: u’Janelia Senior Fellow’, ’state’: u’VA ’, ’thumbnail_url’: u’http://www.hhmi.org/sites/default/files/Our %20Scientists/Janelia/Abbott-112x112.jpg’, ’town’: u’Ashburn’, ’url’: u’http://hhmi.org/scientists/laurence-f-abbott’}, {’institute’: u’The Jackson Laboratory ’, ’name’: u’Susan Ackerman, PhD’, ’research’: u’Identification of the Molecular Mechanisms Underlying... ’, ’research_url’: u’http://hhmi.org/research/identificationmolecular-mechanisms-underlying-neurodegeneration’, ’role’: u’Investigator’, ’state’: u’ME ’, ’thumbnail_url’: u’http://www.hhmi.org/sites/default/files/Our%20Scientists/ Investigators/Ackerman-112x112.jpg’, 2.2. Scraping information from the web ’town’: u’Bar Harbor’, ’url’: u’http://hhmi.org/scientists/susan-l-ackerman’}] So now we know we can process a page from a website to generate usefully structured data. However, this was only the first page of results. We need to do this for each page of results if we want to capture all the HHMI investigators. We could just look at the number of pages that our search returned manually, but to make this more general we can actually scrape the page to find that piece of information and use that to calculate how many pages we need to work through. The number of results is found in a div with the class “viewheaders” as a piece of free text (“Showing 1–20 of 493 results”). We need to grab the text, split it up (I do so based on spaces), find the right number (the one that is before the word “results”) and convert that to an integer. Then we can divide by the number of items we requested per page (20 in our case) to find how many pages we need to work through. A quick mental calculation confirms that if page 0 had results 1–20, page 24 would give results 481–493. >> >> >> >> >> # Check total number of investigators returned view_header = soup.find("div", class_ = "view-header") words = view_header.text.split(" ") count_index = words.index("results.") - 1 count = int(words[count_index]) >> # Calculate number of pages, given count & items_per_page >> num_pages = count/query.get("items_per_page") >> num_pages 24 Then it is a simple matter of putting the function we constructed earlier into a loop to work through the correct number of pages. As we start to hit the website repeatedly, we need to consider whether we are being polite. Most websites have a file in the root directory called robots.txt that contains guidance on using programs to interact with the website. In the case of http://hhmi.org the file states first that we are allowed (or, more properly, not forbidden) to query http://www.hhmi.org/scientists/ programmatically. Thus, you can pull down all of the more detailed biographical or research information, if you so desire. The file also states that there is a requested “Crawl-delay” of 10. This means that if you are making repeated queries (as we will be in getting the 24 pages), you should wait for 10 seconds between each query. This request is easily accommodated by adding a timed delay between each page request. 29 30 2. 
Working with Web Data and APIs >> >> >> >> >> >> >> >> for page_num in range(num_pages): # We already have page zero and we need to go to 24: # range(24) is [0,1,...,23] query["items_per_page"] = page_num + 1 page = requests.get(BASE_URL, params=query) # We use extend to add list for each page to existing list data.extend(scrape(page)) print "Retrieved and scraped page number:", query.get(" items_per_page") >> time.sleep(10) # robots.txt at hhmi.org specifies a crawl delay of 10 seconds Retrieved and scraped page number: 1 Retrieved and scraped page number: 2 ... Retrieved and scraped page number: 24 Finally we can check that we have the right number of results after our scraping. This should correspond to the 493 records that the website reports. >> len(data) 493 2.2.2 Limits of scraping While scraping websites is often necessary, is can be a fragile and messy way of working. It is problematic for a number of reasons: for example, many websites are designed in ways that make scraping difficult or impossible, and other sites explicitly prohibit this kind of scripted analysis. (Both reasons apply in the case of the NSF and Grants.gov websites, which is why we use the HHMI website in our example.) In many cases a better choice is to process a data dump from an organization. For example, the NSF and Wellcome Trust both provide data sets for each year that include structured data on all their awarded grants. In practice, integrating data is a continual challenge of figuring out what is the easiest way to proceed, what is allowed, and what is practical and useful. The selection of data will often be driven by pragmatic rather than theoretical concerns. Increasingly, however, good practice is emerging in which organizations provide APIs to enable scripted and programmatic access to the data they hold. These tools are much easier and generally more effective to work with. They are the focus of much of the rest of this chapter. 2.3. New data in the research enterprise 2.3 New data in the research enterprise The new forms of data we are discussing in this chapter are largely available because so many human activities—in this case, discussion, reading, and bookmarking—are happening online. All sorts of data are generated as a side effect of these activities. Some of that data is public (social media conversations), some private (IP addresses requesting specific pages), and some intrinsic to the service (the identity of a user who bookmarks an article). What exactly are these new forms of data? There are broadly two new directions that data availability is moving in. The first is information on new forms of research output, data sets, software, and in some cases physical resources. There is an interest across the research community in expanding the set of research outputs that are made available and, to drive this, significant efforts are being made to ensure that these nontraditional outputs are seen as legitimate outputs. In particular there has been a substantial policy emphasis on data sharing and, coupled with this, efforts to standardize practice around data citation. This is applying a well-established measure (citation) to a new form of research output. The second new direction, which is more developed, takes the alternate route, providing new forms of information on existing types of output, specifically research articles. The move online of research activities, including discovery, reading, writing, and bookmarking, means that many of these activities leave a digital trace. 
Often these traces are public or semi-public and can be collected and tracked. This certainly raises privacy issues that have not been comprehensively addressed but also provides a rich source of data on who is doing what with research articles. There are a wide range of potential data sources, so it is useful to categorize them. Figure 2.2 shows one possible categorization, in which data sources are grouped based on the level of engagement and the stage of use. It starts from the left with “views,” measures of online views and article downloads, followed by “saves” where readers actively collect articles into a library of their own, through online discussion forums such as blogs, social media and new commentary, formal scholarly recommendations, and, finally, formal citations. These categories are a useful way to understand the classes of information available and to start digging into the sources they can be obtained from. For each category we will look at the kind of usage that the indicator is a proxy for, which users are captured by 31 32 2. Working with Web Data and APIs Research Article Viewed PLOS HTML PLOS PDF PLOS XML PMC HTML PMC PDF Discussed Saved CiteULike Recommended NatureBlogs ScienceSeeker ResearchBlogging PLOS Comments Wikipedia Twitter Facebook Mendeley F1000 Prime Cited Crossref PMC Web of Science Scopus Increasing Engagement Figure 2.2. Classes of online activity related to research journal articles. Reproduced from Lin and Fenner [237], under a Creative Commons Attribution v 3.0 license the indicator, the limitations that the indicator has as a measure, and the sources of data. We start with the familiar case of formal literature citations to provide context. Example: Citations Most quantitative analyses of research have focused on citations from research articles to other research articles. Many familiar measures—such as Impact Factors, Scimago Journal Rank, or Eigenfactor—are actually measures of journal rather than article performance. However, information on citations at the article level is increasingly the basis for much bibliometric analysis. • Kind of usage ◦ Citing a scholarly work is a signal from a researcher that a specific work has relevance to, or has influenced, the work they are describing. ◦ It implies significant engagement and is a measure that carries some weight. • Users ◦ Researchers, which means usage by a specific group for a fairly small range of purposes. ◦ With high-quality data, there are some geographical, career, and disciplinary demographic details. • Limitations ◦ The citations are slow to accumulate, as they must pass through a peer-review process. 2.3. New data in the research enterprise ◦ It is seldom clear from raw data why a paper is being cited. ◦ It provides a limited view of usage, as it only reflects reuse in research, not application in the community. • Sources ◦ Public sources of citation data include PubMed Central and Europe PubMed Central, which mine publicly available full text to find citations. ◦ Proprietary sources of citation data include Thomson Reuters’ Web of Knowledge and Elsevier’s Scopus. ◦ Some publishers make citation data collected by Crossref available. Example: Page views and downloads A major new source of data online is the number of times articles are viewed. Page views and downloads can be defined in different ways and can be reached via a range of paths. Page views are an immediate measure of usage. 
Viewing a paper may involve less engagement than citation or bookmarking, but it can capture interactions with a much wider range of users. The possibility of drawing demographic information from downloads has significant potential for the future in providing detailed information on who is reading an article, which may be valuable for determining, for example, whether research is reaching a target audience. • Kind of usage ◦ It counts the number of people who have clicked on an article page or downloaded an article. • Users ◦ Page views and downloads report on use by those who have access to articles. For publicly accessible articles this could be anyone; for subscription articles it is likely to be researchers. • Limitations ◦ Page views are calculated in different ways and are not directly com- parable across publishers. Standards are being developed but are not yet widely applied. ◦ Counts of page views cannot easily distinguish between short-term visitors and those who engage more deeply with an article. ◦ There are complications if an article appears in multiple places, for example at the journal website and a repository. 33 34 2. Working with Web Data and APIs • Sources ◦ Some publishers and many data repositories make page view data available in some form. Publishers with public data include PLOS, Nature Publishing Group, Ubiquity Press, Co-Action Press, and Frontiers. ◦ Data repositories, including Figshare and Dryad, provide page view and download information. ◦ PubMed Central makes page views of articles hosted on that site avail- able to depositing publishers. PLOS and a few other publishers make this available. Example: Analyzing bookmarks Tools for collecting and curating personal collections of literature, or web content, are now available online. They make it easy to make copies and build up indexes of articles. Bookmarking services can choose to provide information on the number of people who have bookmarked a paper. Two important services targeted at researchers are Mendeley and CiteULike. Mendeley has the larger user base and provides richer statistics. Data include the number of users that who bookmarked a paper, groups that have collected a paper, and in some cases demographics of users, which can include discipline, career stage, and geography. Bookmarks accumulate rapidly after publication and provide evidence of scholarly interest. They correlate quite well with the eventual number of citations. There are also public bookmarking services that provide a view onto wider interest in research articles. • Kind of usage ◦ Bookmarking is a purposeful act. It may reveal more interest than a page view, but less than a citation. ◦ Its uses are different from those captured by citations. ◦ The bookmarks may include a variety of documents, such as papers for background reading, introductory material, position or policy papers, or statements of community positions. • Users ◦ Academic-focused services provide information on use by researchers. ◦ Each service has a different user profile in, for instance, sciences or social sciences. 2.3. New data in the research enterprise ◦ All services have a geographical bias towards North America and Europe. ◦ There is some demographic information, for instance, on countries where users are bookmarking the most. • Limitations ◦ There is bias in coverage of services; for instance, Mendeley has good coverage of biomedical literature. ◦ It can only report on activities of signed-up users. 
◦ It is not usually possible to determine why a bookmark has been created. • Sources ◦ Mendeley and CiteULike both have public APIs that provide data that are freely available for reuse. ◦ Most consumer bookmarking services provide some form of API, but this often has restrictions or limitations. Example: Discussions on social media Social media are one of the most valuable new services producing information about research usage. A growing number of researchers, policymakers, and technologists are on these services discussing research. There are three major features of social media as a tool. First, among a large set of conversations, it is possible to discover a discussion about a specific paper. Second, Twitter makes it possible to identify groups discussing research and to learn whether they were potential targets of the research. Third, it is possible to reconstruct discussions to understand what paths research takes to users. In the future it will be possible to identify target audiences and to ask whether they are being reached and how modified distribution might maximize that reach. This could be a powerful tool, particularly for research with social relevance. Twitter provides the most useful data because discussions and the identity of those involved are public. Connections between users and the things they say are often available, making it possible to identify communities discussing work. However, the 140-character limit on Twitter messages (“tweets”) does not support extended critiques. Facebook has much less publicly available information—but being more private, it can be a site for frank discussion of research. • Kind of usage ◦ Those discussing research are showing interest potentially greater than page views. 35 36 2. Working with Web Data and APIs ◦ Often users are simply passing on a link or recommending an article. ◦ It is possible to navigate to tweets and determine the level and nature of interest. ◦ Conversations range from highly technical to trivial, so numbers should be treated with caution. ◦ Highly tweeted or Facebooked papers also tend to have significant bookmarking and citation. ◦ Professional discussions can be swamped when a piece of research captures public interest. • Users ◦ The user bases and data sources for Twitter and Facebook are global and public. ◦ There are strong geographical biases. ◦ A rising proportion of researchers use Twitter and Facebook for professional activities. ◦ Many journalists, policymakers, public servants, civil society groups, and others use social media. • Limitations ◦ Frequent lack of explicit links to papers is a serious limitation. ◦ Use of links is biased towards researchers and against groups not directly engaged in research. ◦ There are demographic issues and reinforcement effects—retweeting leads to more retweeting in preference to other research—so analysis of numbers of tweets or likes is not always useful. Example: Recommendations A somewhat separate form of usage is direct expert recommendations. The bestknown case of this is the F1000 service on which experts offer recommendations with reviews of specific research articles. Other services such as collections or personal recommendation services may be relevant here as well. • Kind of usage ◦ Recommendations from specific experts show that particular outputs are worth looking at in detail, are important, or have some other value. ◦ Presumably, recommendations are a result of in-depth reading and a high level of engagement. 2.4. 
A functional view • Users ◦ Recommendations are from a selected population of experts depending on the service in question. ◦ In some cases this might be an algorithmic recommendation service. • Limitations ◦ Recommendations are limited to the interests of the selected population of experts. ◦ The recommendation system may be biased in terms of the interests of recommenders (e.g., towards—or away from—new theories vs. developing methodology) as well as their disciplines. ◦ Recommendations are slow to build up. 2.4 A functional view The descriptive view of data types and sources is a good place to start, but it is subject to change. Sources of data come and go, and even the classes of data types may expand and contract in the medium to long term. We also need a more functional perspective to help us understand how these sources of data relate to activities in the broader research enterprise. Consider Figure 1.2 in Chapter 1. The research enterprise has been framed as being made up of people who are generating outputs. The data that we consider in this chapter relate to connections between outputs, such as citations between research articles and tweets referring to articles. These connections are themselves created by people, as shown in Figure 2.3. The people in turn may be classed as belonging to certain categories or communities. What is interesting, and expands on the simplified picture of Figure 1.2, is that many of these people are not professional researchers. Indeed, in some cases they may not be people at all but automated systems of some kind. This means we need to expand the set of actors we are considering. As described above, we are also expanding the range of outputs (or objects) that we are considering as well. In the simple model of Figure 2.3, there are three categories of things (nodes on the graph): objects, people, and the communities they belong to. Then there are the relationships between these elements (connections between nodes). Any given data source may provide information on different parts of this graph, and the information available is rarely complete or comprehensive. Data from 37 38 2. Working with Web Data and APIs Created by Created by Refers to Figure 2.3. A simplified model of online interactions between research outputs and the objects that refer to them ◮ See Section 3.2. different sources can also be difficult to integrate. As with any data integration, combining sources relies on being able to confidently identify those nodes that are common between data sources. Therefore identifying unique objects and people is critical to making progress. These data are not necessarily public but many services choose to make some data available. An important characteristic of these data sources is that they are completely in the gift of the service provider. Data availability, its presentation, and upstream analysis can change without notice. Data are sometimes provided as a dump but is also frequently provided through an API. An API is simply a tool that allows a program to interface with a service. APIs can take many different forms and be of varying quality and usefulness. In this section we will focus on one common type of API and examples of important publicly available APIs relevant to research communications. We will also cover combining APIs and the benefits and challenges of bringing multiple data sources together. 
2.4.1 Relevant APIs and resources There is a wide range of other sources of information that can be used in combination with the APIs featured above to develop an overview of research outputs and of where and how they are being used. There are also other tools that can allow deeper analysis of the outputs themselves. Table 2.1 gives a partial list of key data sources and APIs that are relevant to the analysis of research outputs. 2.4.2 RESTful APIs, returned data, and Python wrappers The APIs we will focus on here are all examples of RESTful services. REST stands for Representational State Transfer [121, 402], but for 2.4. A functional view 39 Table 2.1. Popular sources of data relevant to the analysis of research outputs Source PubMed Web of Science Scopus Crossref Google Scholar Microsoft Academic Search Altmetric.com Twitter Facebook ORCID LinkedIn Gateway to Research NIH Reporter NSF Award Search Description Bibliographic Data An online index that combines bibliographic data from Medline and PubMed Central. PubMed Central and Europe PubMed Central also provide information. The bibliographic database provided by Thomson Reuters. The ISI Citation Index is also available. The bibliographic database provided by Elsevier. It also provides citation information. Provides a range of bibliographic metadata and information obtained from members registering DOIs. Provides a search index for scholarly objects and aggregates citation information. Provides a search index for scholarly objects and aggregates citation information. Not as complete as Google Scholar, but has an API. Social Media A provider of aggregated data on social media and mainstream media attention of research outputs. Most comprehensive source of information across different social media and mainstream media conversations. Provides an API that allows a user to search for recent tweets and obtain some information on specific accounts. The Facebook API gives information on the number of pages, likes, and posts associated with specific web pages. Author Profiles Unique identifiers for research authors. Profiles include information on publication lists, grants, and affiliations. CV-based profiles, projects, and publications. Funder Information A database of funding decisions and related outputs from Research Councils UK. Online search for information on National Institutes of Health grants. Does not provide an API but a downloadable data set is available. Online search for information on NSF grants. Does not provide an API but downloadable data sets by year are available. * The data are restricted: sometimes fee based, other times not. our purposes it is most easily understood as a means of transferring data using web protocols. Other forms of API require additional tools or systems to work with, but RESTful APIs work directly over the web. This has the advantage that a human user can also with relative ease play with the API to understand how it works. Indeed, some websites work simply by formatting the results of API calls. API Free Y Y Y N Y N Y Y N Y Y Y Y N Y Y Y Y Y Y Y * Y Y N Y N Y 40 2. Working with Web Data and APIs As an example let us look at the Crossref API. This provides a range of information associated with Digital Object Identifiers (DOIs) registered with Crossref. DOIs uniquely identify an object, and Crossref DOIs refer to research objects, primarily (but not entirely) research articles. If you use a web browser to navigate to http://api. 
crossref.org/works/10.1093/nar/gni170, you should receive back a webpage that looks something like the following. (We have laid it out nicely to make it more readable.) { "status" : "ok", "message-type" : "work", "message-version" : "1.0.0", "message" : { "subtitle": [], "subject" : ["Genetics"], "issued" : { "date-parts" : [[2005,10,24]] }, "score" : 1.0, "prefix" : "http://id.crossref.org/prefix/10.1093", "author" : [ "affiliation" : [], "family" : "Whiteford", "given" : "N."}], "container-title" : ["Nucleic Acids Research"], "reference-count" : 0, "page" : "e171-e171", "deposited" : {"date-parts" : [[2013,8,8]], "timestamp" : 1375920000000}, "issue" : "19", "title" : ["An analysis of the feasibility of short read sequencing"] , "type" : "journal-article", "DOI" : "10.1093/nar/gni170", "ISSN" : ["0305-1048","1362-4962"], "URL" : "http://dx.doi.org/10.1093/nar/gni170", "source" : "Crossref", "publisher" : "Oxford University Press (OUP)", "indexed" : {"date-parts" : [[2015,6,8]], "timestamp" : 1433777291246}, "volume" : "33", "member" : "http://id.crossref.org/member/286" } } ⋆ JSON is an open standard way of storing and exchanging data. This is a package of JavaScript Object Notation (JSON)* data returned in response to a query. The query is contained entirely in the URL, which can be broken up into pieces: the root URL (http://api.crossref.org) and a data “query,” in this case made up of a “field” (works) and an identifier (the DOI 10.1093/nar/gni170). The Crossref API provides information about the article identified with this specific DOI. 2.5. Programming against an API 2.5 Programming against an API Programming against an API involves constructing HTTP requests and parsing the data that are returned. Here we use the Crossref API to illustrate how this is done. Crossref is the provider of DOIs used by many publishers to uniquely identify scholarly works. Crossref is not the only organization to provide DOIs. The scholarly communication space DataCite is another important provider. The documentation is available at the Crossref website [394]. Once again the requests Python library provides a series of convenience functions that make it easier to make HTTP calls and to process returned JSON. Our first step is to import the module and set a base URL variable. >> import requests >> BASE_URL = "http://api.crossref.org/" A simple example is to obtain metadata for an article associated with a specific DOI. This is a straightforward call to the Crossref API, similar to what we saw earlier. >> doi = "10.1093/nar/gni170" >> query = "works/" >> url = BASE_URL + query + doi >> response = requests.get(url) >> url http://api.crossref.org/works/10.1093/nar/gni170 >> response.status_code 200 The response object that the requests library has created has a range of useful information, including the URL called and the response code from the web server (in this case 200, which means everything is OK). We need the JSON body from the response object (which is currently text from the perspective of our script) converted to a Python dictionary. The requests module provides a convenient function for performing this conversion, as the following code shows. (All strings in the output are in Unicode, hence the u’ notation.) >> response_dict = response.json() >> response_dict { u’message’ : { u’DOI’ : u’10.1093/nar/gni170’, u’ISSN’ : [ u’0305-1048’, u’1362-4962’ ], u’URL’ : u’http://dx.doi.org/10.1093/nar/gni170’, u’author’ : [ {u’affiliation’ : [], u’family’ : u’Whiteford’, 41 42 2. 
Working with Web Data and APIs u’given’ : u’N.’} ], u’container-title’ : [ u’Nucleic Acids Research’ ], u’deposited’ : { u’date-parts’ : [[2013, 8, 8]], u’timestamp’ : 1375920000000 }, u’indexed’ : { u’date-parts’ : [[2015, 6, 8]], u’timestamp’ : 1433777291246 }, u’issue’ : u’19’, u’issued’ : { u’date-parts’ : [[2005, 10, 24]] }, u’member’ : u’http://id.crossref.org/member/286’, u’page’ : u’e171-e171’, u’prefix’ : u’http://id.crossref.org/prefix/10.1093’, u’publisher’ : u’Oxford University Press (OUP)’, u’reference-count’ : 0, u’score’ : 1.0, u’source’ : u’Crossref’, u’subject’ : [u’Genetics’], u’subtitle’ : [], u’title’ : [u’An analysis of the feasibility of short read sequencing’], u’type’ : u’journal-article’, u’volume’ : u’33’ }, u’message-type’ : u’work’, u’message-version’ : u’1.0.0’, u’status’ : u’ok’ } This data object can now be processed in whatever way the user wishes, using standard manipulation techniques. The Crossref API can, of course, do much more than simply look up article metadata. It is also valuable as a search resource and for cross-referencing information by journal, funder, publisher, and other criteria. More details can be found at the Crossref website. 2.6 Using the ORCID API via a wrapper ORCID, which stands for “Open Research and Contributor Identifier” (see orcid.org; see also [145]), is a service that provides unique identifiers for researchers. Researchers can claim an ORCID profile and populate it with references to their research works, funding and affiliations. ORCID provides an API for interacting with this information. For many APIs there is a convenient Python wrapper that can be used. The ORCID–Python wrapper works with the ORCID v1.2 API to make various API calls straightforward. This wrapper only works with the public ORCID API and can therefore only access publicly available data. 2.6. Using the ORCID API via a wrapper Using the API and wrapper together provides a convenient means of getting this information. For instance, given an ORCID, it is straightforward to get profile information. Here we get a list of publications associated with my ORCID and look at the the first item on the list. >> import orcid >> cn = orcid.get("0000-0002-0068-716X") >> cn >> cn.publications[0] The wrapper has created Python objects that make it easier to work with and manipulate the data. It is common to take the return from an API and create objects that behave as would be expected in Python. For instance, the publications object is a list populated with publications (which are also Python-like objects). Each publication in the list has its own attributes, which can then be examined individually. In this case the external IDs attribute is a list of further objects that include a DOI for the article and the ISSN of the journal the article was published in. >> len(cn.publications) 70 >> cn.publications[12].external_ids [, ] As a simple example of data processing, we can iterate over the list of publications to identify those for which a DOI has been provided. In this case we can see that of the 70 publications listed in this ORCID profile (at the time of testing), 66 have DOIs. >> exids = [] >> for pub in cn.publications: if pub.external_ids: exids = exids + pub.external_ids >> DOIs = [exid.id for exid in exids if exid.type == "DOI"] >> len(DOIs) 66 Wrappers generally make operating with an API simpler and cleaner by abstracting away the details of making HTTP requests. 
2.7 Quality, scope, and management

The examples in the previous sections only scratch the surface of the data available, but a number of issues are already apparent. A great deal of care needs to be taken when using these data (see also Chapter 10), and a researcher will need to apply subject matter knowledge as well as broader data management expertise. Some of the core issues are as follows:

Integration In the examples given above with Crossref and ORCID, we used a known identifier (a DOI or an ORCID). Integrating data from Crossref to supplement the information from an ORCID profile is possible, but it depends on the linking of identifiers. Note that for the profile data we obtained, only 66 of the 70 items had DOIs. Data integration across multiple data sources that reference DOIs is straightforward for those objects that have DOIs, and messy or impossible for those that do not. In general, integration is possible, but it depends on a means of cross-referencing between data sets. Unique identifiers that are common to both are extremely powerful but only exist in certain cases (see also Chapter 3).

Coverage Without a population frame, it is difficult to know whether the information that can be captured is comprehensive. For example, "the research literature" is at best a vague concept. A variety of indexes, some openly available (PubMed, Crossref), some proprietary (Scopus, Web of Knowledge, many others), cover different, partially overlapping segments of this corpus of work. Each index has differing criteria for inclusion and differing commitments to completeness. Sampling of "the literature" is therefore impossible, and the choice of index used for any study can make a substantial difference to the conclusions.

Completeness Alongside the question of coverage (how broad is a data source?), with web data and opt-in services we also need to probe the completeness of a data set. In the example above, 66 of 70 objects have a DOI registered. This does not mean that the other four objects do not have a DOI, just that none is included in the ORCID record. Similarly, ORCID profiles only exist for a subset of researchers at this stage. Completeness feeds into integration challenges. While many researchers have a Twitter profile and many have an ORCID profile, only a small subset of ORCID profiles provide a link to a Twitter profile. See below for a worked example.

Scope In survey data sets, the scope is defined by the question being asked. This is not the case with much of these new data. For example, the challenges listed above for research articles, traditionally considered the bedrock of research outputs, at least in the natural sciences, are much greater for other forms of research outputs. Increasingly, the data generated from research projects, software, materials, and tools, as well as reports and presentations, are being shared by researchers in a variety of settings.
Some of these are formal mechanisms for publication, such as large disciplinary databases, books, and software repositories, and some are highly informal. Any study of (a subset of) these outputs has as its first challenge the question of how to limit the corpus to be studied. Source and validity The challenges described above relate to the identification and counting of outputs. As we start to address questions of how these outputs are being used, the issues are compounded. To illustrate some of the difficulties that can arise, we examine the number of citations that have been reported for a single sample article on a biochemical methodology [68]. This article has been available for eight years and has accumulated a reasonable number of citations for such an article over that time. However, the exact number of citations identified varies radically, depending on the data source. Scopus finds 40, while Web of Science finds only 38. A Google Scholar search performed on the same date identified 59. These differences relate to the size of the corpus from which inward citations are being counted. Web of Science has the smallest database, with Scopus being larger and Google Scholar substantially larger again. Thus the size of the index not only affects output counting, it can also have a substantial effect on any analysis that uses that corpus. Alongside the size of the corpus, the means of analysis can also have an effect. For the same article, PubMed Central reports 10 citations but Europe PubMed Central reports 18, despite using a similar corpus. The distinction lies in differences in the methodology used to mine the corpus for citations. Identifying the underlying latent variable These issues multiply as we move into newer forms of data. These sparse and incomplete sources of data require different treatment than more traditional 45 46 2. Working with Web Data and APIs structured and comprehensive forms of data. They are more useful as a way of identifying activities than of quantifying or comparing them. Nevertheless, they can provide new insight into the processes of knowledge dissemination and community building that are occurring online. 2.8 Integrating data from multiple sources We often must work across multiple data sources to gather the information needed to answer a research question. A common pattern is to search in one location to create a list of identifiers and then use those identifiers to query another API. In the ORCID example above, we created a list of DOIs from a single ORCID profile. We could use those DOIs to obtain further information from the Crossref API and other sources. This models a common path for analysis of research outputs: identifying a corpus and then seeking information on its performance. In this example, we will build on the ORCID and Crossref examples to collect a set of work identifiers from an ORCID profile and use a range of APIs to identify additional metadata as well as information on the performance of those articles. In addition to the ORCID API, we will use the PLOS Lagotto API. Lagotto is the software that was built to support the Article Level Metrics program at PLOS, the open access publisher, and its API provides information on various metrics of PLOS articles. A range of other publishers and service providers, including Crossref, also provide an instance of this API, meaning the same tools can be used to collect information on articles from a range of sources. 
2.8.1 The Lagotto API The module pyalm is a wrapper for the Lagotto API, which is served from a range of hosts. We will work with two instances in particular: one run by PLOS, and the Crossref DOI Event Tracker (DET, recently renamed Crossref Event Data) pilot service. We first need to provide the details of the URLs for these instances to our wrapper. Then we can obtain some information for a single DOI to see what the returned data look like. >> import pyalm >> pyalm.config.APIS = {’plos’ : {’url’ : >> ’http://alm.plos.org/api/v5/articles’}, >> ’det’ : {’url’ : 2.8. Integrating data from multiple sources >> ’http://det.labs.crossref.org/api/v5/articles’} >> } >> det_alm_test = pyalm.get_alm(’10.1371/journal.pbio.1001677’, >> info=’detail’, instance=’det’) det_alm_test { ’articles’ : [], ’meta’ : {u’error’ : None, u’page’ : 1, u’total’ : 1, u’total_pages’ : 1} } The library returns a Python dictionary containing two elements. The articles key contains the actual data and the meta key includes general information on the results of the interaction with the API. In this case the library has returned one page of results containing one object (because we only asked about one DOI). If we want to collect a lot of data, this information helps in the process of paging through results. It is common for APIs to impose some limit on the number of results returned, so as to ensure performance. By default the Lagotto API has a limit of 50 results. The articles key holds a list of ArticleALM objects as its value. Each ArticleALM object has a set of internal attributes that contain information on each of the metrics that the Lagotto instance collects. These are derived from various data providers and are called sources. Each can be accessed by name from a dictionary called “sources.” The iterkeys() function provides an iterator that lets us loop over the set of keys in a dictionary. Within the source object there is a range of information that we will dig into. >> article = det_alm_test.get(’articles’)[0] >> article.title u’Expert Failure: Re-evaluating Research Assessment’ >> for source in article.sources.iterkeys(): >> print source, article.sources[source].metrics.total reddit 0 datacite 0 pmceuropedata 0 wikipedia 1 pmceurope 0 citeulike 0 pubmed 0 facebook 0 wordpress 0 pmc 0 mendeley 0 crossref 0 The DET service only has a record of citations to this article from Wikipedia. As we will see below, the PLOS service returns more results. This is because some of the sources are not yet being queried by DET. 47 48 2. Working with Web Data and APIs Because this is a PLOS paper we can also query the PLOS Lagotto instance for the same article. >> plos_alm_test = pyalm.get_alm(’10.1371/journal.pbio.1001677’, info=’detail’, instance=’plos’) >> article_plos = plos_alm_test.get(’articles’)[0] >> article_plos.title u’Expert Failure: Re-evaluating Research Assessment’ >> for source in article_plos.sources.iterkeys(): >> print source, article_plos.sources[source].metrics.total datacite 0 twitter 130 pmc 610 articlecoveragecurated 0 pmceurope 1 pmceuropedata 0 researchblogging 0 scienceseeker 0 copernicus 0 f1000 0 wikipedia 1 citeulike 0 wordpress 2 openedition 0 reddit 0 nature 0 relativemetric 125479 figshare 0 facebook 1 mendeley 14 crossref 3 plos_comments 2 articlecoverage 0 counter 12551 scopus 2 pubmed 1 orcid 3 The PLOS instance provides a greater range of information but also seems to be giving larger numbers than the DET instance in many cases. 
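One convenient intermediate step is to pull each instance's per-source totals into a plain dictionary. The sketch below assumes only the attributes demonstrated above (the sources dictionary and its metrics.total values); it makes it easy to ask, for example, which sources the PLOS instance reports activity for that the DET pilot does not.

>> def totals_by_source(article):
>>     # Build a {source_name: total_count} dictionary from an ArticleALM object
>>     return dict((name, article.sources[name].metrics.total)
>>                 for name in article.sources.iterkeys())
>> det_totals = totals_by_source(article)
>> plos_totals = totals_by_source(article_plos)
>> # Sources with activity in the PLOS instance but none recorded by DET
>> [name for name in sorted(plos_totals)
>>     if plos_totals[name] > 0 and det_totals.get(name, 0) == 0]

Keeping such snapshots as plain dictionaries also makes it easy to store them for later comparison, which matters given how quickly these sources change.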
For those sources that are provided by both API instances, we can compare the results returned. >> for source in article.sources.iterkeys(): >> print source, article.sources[source].metrics.total, >> article_plos.sources[source].metrics.total reddit 0 0 datacite 0 0 pmceuropedata 0 0 wikipedia 1 1 pmceurope 0 1 citeulike 0 0 pubmed 0 1 2.8. Integrating data from multiple sources facebook 0 1 wordpress 0 2 pmc 0 610 mendeley 0 14 crossref 0 3 The PLOS Lagotto instance is collecting more information and has a wider range of information sources. Comparing the results from the PLOS and DET instances illustrates the issues of coverage and completeness discussed previously. The data may be sparse for a variety of reasons, and it is important to have a clear idea of the strengths and weaknesses of a particular data source or aggregator. In this case the DET instance is returning information for some sources for which it is does not yet have data. We can dig deeper into the events themselves that the metrics.total count aggregates. The API wrapper collects these into an event object within the source object. These contain the JSON returned from the API in most cases. For instance, the Crossref source is a list of JSON objects containing information on an article that cites our article of interest. The first citation event in the list is a citation from the Journal of the Association for Information Science and Technology by Du et al. >> article_plos.sources[’crossref’].events[0] {u’event’ : {u’article_title’ : u’The effects of research level and article type on the differences between citation metrics and F1000 recommendations’, u’contributors’ : {u’contributor’ : [ { u’contributor_role’ : u’author’, u’first_author’ : u’true’, u’given_name’ : u’Jian’, u’sequence’ : u’first’, u’surname’ : u’Du’ }, { u’contributor_role’ : u’author’, u’first_author’ : u’false’, u’given_name’ : u’Xiaoli’, u’sequence’ : u’additional’, u’surname’ : u’Tang’}, { u’contributor_role’ : u’author’, u’first_author’ : u’false’, u’given_name’ : u’Yishan’, u’sequence’ : u’additional’, u’surname’ : u’Wu’} ] }, u’doi’ : u’10.1002/asi.23548’, u’first_page’ : u’n/a’, u’fl_count’ : u’0’, u’issn’ : u’23301635’, 49 50 2. Working with Web Data and APIs u’journal_abbreviation’ : u’J Assn Inf Sci Tec’, u’journal_title’ : u’Journal of the Association for Information Science and Technology’, u’publication_type’ : u’full_text’, u’year’ : u’2015’ }, u’event_csl’ : { u’author’ : [ { u’family’ : u’Du’, u’given’ : u’Jian’}, {u’family’ : u’Tang’, u’given’ : u’Xiaoli’}, {u’family’ : u’Wu’, u’given’ : u’Yishan’} ], u’container-title’ : u’Journal of the Association for Information Science and Technology’, u’issued’ : {u’date-parts’ : [[2015]]}, u’title’ : u’The Effects Of Research Level And Article Type On The Differences Between Citation Metrics And F1000 Recommendations’, u’type’ : u’article-journal’, u’url’ : u’http://doi.org/10.1002/asi.23548’ }, u’event_url’ : u’http://doi.org/10.1002/asi.23548’ } Another source in the PLOS data is Twitter. In the case of the Twitter events (individual tweets), this provides the text of the tweet, user IDs, user names, URL of the tweet, and the date. We can see from the length of the events list that there are at least 130 tweets that link to this article. >> len(article_plos.sources[’twitter’].events) 130 Again, noting the issues of coverage, scope, and completeness, it is important to consider the limitations of these data. 
This is a lower bound, as it represents the results returned by searching the Twitter API for the DOI or URL of the article. Other tweets that discuss the article may not include a link, and the Twitter search API also has limitations that can lead to incomplete results. The number must therefore be seen as both incomplete and a lower bound.

We can look more closely at data on the first tweet on the list. Bear in mind that the order of the list is not necessarily meaningful: this is not chronologically the first tweet about this article.

>> article_plos.sources['twitter'].events[0]
{ u'event' : { u'created_at' : u'2013-10-08T21:12:28Z',
               u'id' : u'387686960585641984',
               u'text' : u'We have identified the Higgs boson; it is surely not beyond our reach to make research assessment useful http://t.co/Odcm8dVRSU#PLOSBiology',
               u'user' : u'catmacOA',
               u'user_name' : u'Catriona MacCallum',
               u'user_profile_image' : u'http://a0.twimg.com/profile_images/1779875975/CM_photo_reduced_normal.jpg' },
  u'event_time' : u'2013-10-08T21:12:28Z',
  u'event_url' : u'http://twitter.com/catmacOA/status/387686960585641984' }

We could use the Twitter API to understand more about this person. For instance, we could look at their Twitter followers and whom they follow, or analyze the text of their tweets for topic modeling. Much work on social media interactions is done with this kind of data, using forms of network and text analysis described elsewhere in this book (see Chapters 7 and 8).

A different approach is to integrate these data with information from another source. We might be interested, for instance, in whether the author of this tweet is a researcher, or whether they have authored research papers. One thing we could do is search the ORCID API to see if there are any ORCID profiles that link to this Twitter handle.

>> twitter_search = orcid.search("catmacOA")
>> for result in twitter_search:
>>     print unicode(result)
>>     print result.researcher_urls

So the person with this Twitter handle seems to have an ORCID profile. That means we can also use ORCID to gather more information on their outputs. Perhaps they have authored work which is relevant to our article?

>> cm = orcid.get("0000-0001-9623-2225")
>> for pub in cm.publications[0:5]:
>>     print pub.title
The future is open: opportunities for publishers and institutions
Open Science and Reporting Animal Studies: Who's Accountable?
Expert Failure: Re-evaluating Research Assessment
Why ONE Is More Than 5
Reporting Animal Studies: Good Science and a Duty of Care

From this analysis we can show that this tweet is actually from one of my co-authors of the article. To make this process easier, we write the convenience function shown in Listing 2.2, which takes a Twitter handle and tries to find an ORCID for that person.

# Take a twitter handle or user name and return an ORCID
def twitter2orcid(twitter_handle, resp='orcid', search_depth=10):
    search = orcid.search(twitter_handle)
    s = [r for r in search]
    orc = None
    i = 0
    while i < search_depth and orc is None and i < len(s):
        arr = [('twitter.com' in website.url)
               for website in s[i].researcher_urls]
        if True in arr:
            index = arr.index(True)
            url = s[i].researcher_urls[index].url
            if url.lower().endswith(twitter_handle.lower()):
                orc = s[i].orcid
                return orc
        i += 1
    return None

Listing 2.2. Python code to find ORCID for Twitter handle

Let us do a quick test of the function.
>> twitter2orcid(’catmacOA’) u’0000-0001-9623-2225’ 2.8.2 Working with a corpus In this case we will continue as previously to collect a set of works from a single ORCID profile. This collection could just as easily be a date range, or subject search at a range of other APIs. The target is to obtain a set of identifiers (in this case DOIs) that can be used to precisely query other data sources. This is a general pattern that reflects the issues of scope and source discussed above. The choice of how to construct a corpus to analyze will strongly affect the results and the conclusions that can be drawn. >> >> >> >> >> >> >> >> 66 # As previously, collect DOIs available from an ORCID profile cn = orcid.get("0000-0002-0068-716X") exids = [] for pub in cn.publications: if pub.external_ids: exids = exids + pub.external_ids DOIs = [exid.id for exid in exids if exid.type == "DOI"] len(DOIs) 2.8. Integrating data from multiple sources We have recovered 66 DOIs from the ORCID profile. Note that we have not obtained an identifier for every work, as not all have DOIs. This result illustrates an important point about data integration. In practice it is generally not worth the effort of attempting to integrate data on objects unless they have a unique identifier or key that can be used in multiple data sources, hence the focus on DOIs and ORCIDs in these examples. Even in our search of the ORCID API for profiles that are associated with a Twitter account, we used the Twitter handle as a unique ID to search on. While it is possible to work with author names or the titles of works directly, disambiguating such names and titles is substantially more difficult than working with unique identifiers. Other chapters (in particular, Chapter 3) deal with issues of data cleaning and disambiguation. Much work has been done on this basis, but increasingly you will see that the first step in any analysis is simply to discard objects without a unique ID that can be used across data sources. We can obtain data for these from the DET API. As is common with many APIs, there is a limit to how many queries can be simultaneously run, in this case 50, so we divide our query into batches. >> batches = [DOIs[0:50], DOIs[51:-1]] >> det_alms = [] >> for batch in batches: >> alms_response = pyalm.get_alm(batch, info="detail", instance="det") >> det_alms.extend(alms_response.get(’articles’)) >> len(det_alms) 24 The DET API only provides information on a subset of Crossref DOIs. The process that Crossref has followed to populate its database has focused on more recently published articles, so only 24 responses are received in this case for the 66 DOIs we queried on. A good exercise would be to look at which of the DOIs are found and which are not. Let us see how much interesting data is available in the subset of DOIs for which we have data. >> for r in [d for d in det_alms if d.sources[’wikipedia’].metrics .total != 0]: >> print r.title >> print ’ ’, r.sources[’pmceurope’].metrics.total, ’ pmceurope citations’ >> print ’ ’, r.sources[’wikipedia’].metrics.total, ’ wikipedia citations’ Architecting the Future of Research Communication: Building the Models and Analytics for an Open Access Future 53 54 2. 
Working with Web Data and APIs 1 pmceurope citations 1 wikipedia citations Expert Failure: Re-evaluating Research Assessment 0 pmceurope citations 1 wikipedia citations LabTrove: A Lightweight, Web Based, Laboratory "Blog" as a Route towards a Marked Up Record of Work in a Bioscience Research Laboratory 0 pmceurope citations 1 wikipedia citations The lipidome and proteome of oil bodies from Helianthus annuus ( common sunflower) 2 pmceurope citations 1 wikipedia citations As discussed above, this shows that the DET instance, while it provides information on a greater number of DOIs, has less complete data on each DOI at this stage. Only four of the 24 responses have Wikipedia references. You can change the code to look at the full set of 24, which shows only sparse data. The PLOS Lagotto instance provides more data but only on PLOS articles. However, it does provide data on all PLOS articles, going back earlier than the set returned by the DET instance. We can collect the set of articles from the profile published by PLOS. >> plos_dois = [] >> for doi in DOIs: >> # Quick and dirty, should check Crossref API for publisher >> if doi.startswith(’10.1371’): >> plos_dois.append(doi) >> len(plos_dois) 7 >> plos_alms = pyalm.get_alm(plos_dois, info=’detail’, instance=’ plos’).get(’articles’) >> for article in plos_alms: >> print article.title >> print ’ ’, article.sources[’crossref’].metrics.total, ’ Crossref citations’ >> print ’ ’, article.sources[’twitter’].metrics.total, ’ tweets’ Architecting the Future of Research Communication: Building the Models and Analytics for an Open Access Future 2 Crossref citations 48 tweets Expert Failure: Re-evaluating Research Assessment 3 Crossref citations 130 tweets LabTrove: A Lightweight, Web Based, Laboratory "Blog" as a Route towards a Marked Up Record of Work in a Bioscience Research Laboratory 6 Crossref citations 2.8. Integrating data from multiple sources 1 tweets More Than Just Access: Delivering on a Network-Enabled Literature 4 Crossref citations 95 tweets Article-Level Metrics and the Evolution of Scientific Impact 24 Crossref citations 5 tweets Optimal Probe Length Varies for Targets with High Sequence Variation: Implications for Probe Library Design for Resequencing Highly Variable Genes 2 Crossref citations 1 tweets Covalent Attachment of Proteins to Solid Supports and Surfaces via Sortase-Mediated Ligation 40 Crossref citations 0 tweets From the previous examples we know that we can obtain information on citing articles and tweets associated with these 66 articles. From that initial corpus we now have a collection of up to 86 related articles (cited and citing), a few hundred tweets that refer to (some of) those articles, and perhaps 500 people if we include authors of both articles and tweets. Note how for each of these links our query is limited, so we have a subset of all the related objects and agents. At this stage we probably have duplicate articles (one article might cite multiple in our set of seven) and duplicate people (authors in common between articles and authors who are also tweeting). These data could be used for network analysis, to build up a new corpus of articles (by following the citation links), or to analyze the links between authors and those tweeting about the articles. We do not pursue an in-depth analysis here, but will gather the relevant objects, deduplicate them as far as possible, and count how many we have in preparation for future analysis. 
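Before gathering those objects, it is worth returning briefly to the quick and dirty publisher filter above. As the code comment notes, a more reliable approach is to ask Crossref who the publisher of each DOI is, using the works route introduced earlier in this chapter. A possible sketch follows; the exact publisher string that Crossref returns for PLOS journals is an assumption here and should be checked against the API before relying on it.

>> def crossref_publisher(doi):
>>     # Look up the DOI at Crossref and return the publisher field
>>     response = requests.get(BASE_URL + "works/" + doi)
>>     if response.status_code != 200:
>>         return ""
>>     return response.json()["message"].get("publisher", "")
>> plos_dois = [doi for doi in DOIs
>>              if "public library of science" in crossref_publisher(doi).lower()]

With the corpus settled, we return to gathering and deduplicating the related objects and agents.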
>> >> >> >> >> >> >> >> >> >> >> >> >> >> >> # Collect all citing DOIs & author names from citing articles citing_dois = [] citing_authors = [] for article in plos_alms: for cite in article.sources[’crossref’].events: citing_dois.append(cite[’event’][’doi’]) # Use ’extend’ because the element is a list citing_authors.extend(cite[’event_csl’][’author’]) print ’\nBefore de-deduplication:’ print ’ ’, len(citing_dois), ’DOIs’ print ’ ’, len(citing_authors), ’citing authors’ # Easiest way to deduplicate is to convert to a Python set citing_dois = set(citing_dois) citing_authors = set([author[’given’] + author[’family’] for 55 56 2. Working with Web Data and APIs author in citing_authors]) >> print ’\nAfter de-deduplication:’ >> print ’ ’, len(citing_dois), ’DOIs’ >> print ’ ’, len(citing_authors), ’citing authors’ Before de-deduplication: 81 DOIs 346 citing authors After de-deduplication: 78 DOIs 278 citing authors >> >> >> >> >> >> >> >> >> >> # Collect all tweets, usernames; check for ORCIDs tweet_urls = set() twitter_handles = set() for article in plos_alms: for tweet in article.sources[’twitter’].events: tweet_urls.add(tweet[’event_url’]) twitter_handles.add(tweet[’event’][’user’]) # No need to explicitly deduplicate as we created sets directly print len(tweet_urls), ’tweets’ print len(twitter_handles), ’Twitter users’ 280 tweets 210 Twitter users It could be interesting to look at which Twitter users interact most with the articles associated with this ORCID profile. To do that we would need to create not a set but a list, and then count the number of duplicates in the list. The code could be easily modified to do this. Another useful exercise would be to search ORCID for profiles corresponding to citing authors. The best way to do this would be to obtain ORCIDs associated with each of the citing articles. However, because ORCID data are sparse and incomplete, there are two limitations here. First, the author may not have an ORCID. Second, the article may not be explicitly linked to another article. Try searching ORCID for the DOIs associated with each of the citing articles. In this case we will look to see how many of the Twitter handles discussing these articles are associated with an ORCID profile we can discover. This in turn could lead to more profiles and more cycles of analysis to build up a network of researchers interacting through citation and on Twitter. Note that we have inserted a delay between calls. This is because we are making a larger number of API calls (one for each Twitter handle). It is considered polite to keep the pace at which calls are made to an API to a reasonable level. The ORCID API does not post suggested limits at the moment, but delaying for a second between calls is reasonable. 2.8. Integrating data from multiple sources >> tweet_orcids = [] >> for handle in twitter_handles: >> orc = twitter2orcid(handle) >> if orc: >> tweet_orcids.append(orc) >> time.sleep(1) # wait one second between each call to the ORCID API >> print len(tweet_orcids) 12 In this case we have identified 12 ORCID profiles that we can link positively to tweets about this set of articles. This is a substantial underestimate of the likely number of ORCIDs associated with these tweets. However, relatively few ORCIDs have Twitter accounts registered as part of the profile. To gain a broader picture a search and matching strategy would need to be applied. Nevertheless, for these 12 we can look more closely into the profiles. 
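A small practical note before retrieving the profiles: the loop above calls time.sleep, so the standard-library time module needs to be imported. A slightly more defensive version of the same loop, which also caches lookups so that each handle is queried at most once, might look like the following sketch (the helper name and cache are ours, not part of any library).

>> import time
>> orcid_cache = {}
>> def polite_twitter2orcid(handle, delay=1.0):
>>     # Query each handle at most once, pausing between calls to the ORCID API
>>     if handle not in orcid_cache:
>>         orcid_cache[handle] = twitter2orcid(handle)
>>         time.sleep(delay)
>>     return orcid_cache[handle]
>> tweet_orcids = [o for o in (polite_twitter2orcid(h) for h in twitter_handles) if o]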
The first step is to obtain the actual profile information for each of the 12 ORCIDs that we have found. Note that at the moment what we have is the ORCIDs themselves, not the retrieved profiles. >> orcs = [] >> for id in tweet_orcids: >> orcs.append(orcid.get(id)) With the profiles retrieved we can then take a look at who they are, and check that we do in fact have sensible Twitter handles associated with them. We could use this to build up the network of related authors and Twitter users for further analysis. >> for orc in orcs: >> i = [(’twitter.com’ in website.url) for website in orc. researcher_urls].index(True) >> twitter_url = orc.researcher_urls[i].url >> print orc.given_name, orc.family_name, orc.orcid, twitter_url Catriona MacCallum 0000-0001-9623-2225 http://twitter.com/catmacOA John Dupuis 0000-0002-6066-690X https://twitter.com/dupuisj Johannes Velterop 0000-0002-4836-6568 https://twitter.com/ Villavelius Stuart Lawson 0000-0002-1972-8953 https://twitter.com/Lawsonstu Nelson Piedra 0000-0003-1067-8707 http://www.twitter.com/nopiedra Iryna Kuchma 0000-0002-2064-3439 https://twitter.com/irynakuchma Frank Huysmans 0000-0002-3468-9032 https://twitter.com/fhuysmans Salvatore Salvi VICIDOMINI 0000-0001-5086-7401 https://twitter.com /SalViVicidomini William Gunn 0000-0002-3555-2054 http://twitter.com/mrgunn Stephen Curry 0000-0002-0552-8870 https://twitter.com/ Stephen_Curry Cameron Neylon 0000-0002-0068-716X http://twitter.com/ 57 58 2. Working with Web Data and APIs cameronneylon Graham Steel 0000-0003-4681-8011 https://twitter.com/McDawg 2.9 ◮ See Chapter 8. Working with the graph of relationships In the above examples we started with the profile of an individual, used this to create a corpus of works, which in turn led us to other citing works (and their authors) and commentary about those works on Twitter (and the people who wrote those comments). Along the way we built up a graph of relationships between objects and people. In this section we will look at this model of the data and how it reveals limitations and strengths of these forms of data and what can be done with them. 2.9.1 Citation links between articles A citation in a research article (or a policy document or working paper) defines a relationship between that citing article and the cited article. The exact form of the relationship is generally poorly defined, at least at the level of large-scale data sets. A citation might be referring to previous work, indicating the source of data, or supporting (or refuting) an idea. While efforts have been made to codify citation types, they have thus far gained little traction. In our example we used a particular data source (Crossref) for information about citations. As previously discussed, this will give different results than other sources (such as Thomson Reuters, Scopus, or Google Scholar) because other sources look at citations from a different set of articles and collect them in a different way. The completeness of the data will always be limited. We could use the data to clearly connect the citing articles and their authors because author information is generally available in bibliographic metadata. However, we would have run into problems if we had only had names. ORCIDs can provide a way to uniquely identify authors and ensure that our graph of relationships is clean. A citation is a reference from an object of one type to an object of the same type. We also sought to link social media activity with specific articles. 
Rather than a link between objects that are the same (articles) we started to connect different kinds of objects together. We are also expanding the scope of the communities (i.e., people) that might be involved. While we focused on the question of which Twitter handles were connected with researchers, we could 2.9. Working with the graph of relationships 59 Other Agents/Stakeholders Researchers Patient group Tweets Tweeted by Research group Authored by Outputs Links to Institution Articles Funder Figure 2.4. A functional view of proxies and relationships just as easily have focused on trying to discover which comments came from people who are not researchers. We used the Lagotto API at PLOS to obtain this information. The PLOS API in turn depends on the Twitter Search API. A tweet that refers explicitly to a research article, perhaps via a Crossref DOI, can be discovered, and a range of services do these kinds of checks. These services generally rely either on Twitter Search or, more generally, on a search of “the firehose,” a dump of all Twitter data that are available for purchase. The distinction is important because Twitter Search does not provide either a complete or a consistent set of results. In addition, there will be many references to research articles that do not contain a unique identifier, or even a link. These are more challenging to discover. As with citations, the completeness of any data set will always be limited. However, the set of all tweets is a more defined set of objects than the set of all “articles.” Twitter is a specific social media service with a defined scope. “Articles” is a broad class of objects served by a very wide range of services. Twitter is clearly a subset of all discussions and is highly unlikely to be representative of “all discussions.” Equally the set of all objects with a Crossref DOI, while defined, is unlikely to be representative of all articles. Expanding on Figure 2.3, we show in Figure 2.4 agents and actors (people) and outputs. We place both agents and outputs into categories that may be more or less well defined. In practice our analysis is limited to those objects that we discover by using some “selector” (circles in this diagram), which may or may not have a 60 2. Working with Web Data and APIs close correspondence with the “real” categories (shown with graded shapes). Our aim is to identify, aggregate, and in some cases count the relationships between and within categories of objects; for instance, citations are relationships between formal research outputs. A tweet may have a relationship (“links to”) with a specific formally published research output. Both tweets and formal research outputs relate to specific agents (“authors”) of the content. 2.9.2 Categories, sources, and connections We can see in this example a distinction between categories of objects of interest (articles, discussions, people) and sources of information on subsets of those categories (Crossref, Twitter, ORCID). Any analysis will depend on one or more data sources, and in turn be limited by the coverage of those data sources. The selectors used to generate data sets from these sources will have their own limitations. Similar to a query on a structured data set, the selector itself may introduce bias. The crucial difference between filtering on a comprehensive (or at least representative) data set and the data sources we are discussing here is that these data sources are by their very nature incomplete. 
Survey data may include biases introduced in the way that the survey itself is structured or the sampling is designed, but the intent is to be comprehensive. Many of these new forms of data make no attempt to be comprehensive or avowedly avoid such an attempt. Understanding this incompleteness is crucial to understanding the forms of inference that can be made from these data. Sampling is only possible within a given source or corpus, and this limits the conclusions that can be drawn to the scope of that corpus. It is frequently possible to advance a plausible argument to claim that such findings are more broadly applicable, but it is crucial to avoid assuming that this is the case. In particular, it is important to be clear about what data sources a finding applies to and where the boundary between the strongly evidenced finding and a claim about its generalization lies. Much of the literature on scholarly communications and research impact is poor on this point. If this is an issue in identifying the objects of interest, it is even more serious when seeking to identify the relationships between them, which are, after all generally the thing of interest. In some cases there are reasonably good sources of data between objects of the same class (at least those available from the same data sources) such as citations between journal articles or links between tweets. 2.9. Working with the graph of relationships However, as illustrated in this chapter, detecting relationships between tweets and articles is much more challenging. These issues can arise both due to the completeness of the data source itself (e.g., ORCID currently covers only a subset of researchers; therefore, the set of author–article relationships is limited) or due to the challenges of identification (e.g., in the Twitter case above) or due to technical limitations at source (the difference between the Twitter search API and the firehose). In addition, because the source data and the data services are both highly dynamic and new, there is often a mismatch. Many services tracking Twitter data only started collecting data relatively recently. There is a range of primary and secondary data sources working to create more complete data sets. However, once again it is important to treat all of these data as sparse and limited as well as highly dynamic and changeable. 2.9.3 Data availability and completeness With these caveats in hand and the categorization discussed above, we can develop a mapping of what data sources exist, what objects those data sources inform us about, the completeness of those data sources, and how well the relationships between the different data sources are tracked. Broadly speaking, data sources concern themselves with either agents (mostly people) or objects (articles, books, tweets, posts), while additionally providing additional data about the relationships of the agents or objects that they describe with other objects or agents. The five broad types of data described above are often treated as ways of categorizing the data source. They are more properly thought of as relationships between objects, or between objects and agents. Thus, for example, citations are relationships between articles; the tweets that we are considering are actually relationships between specific Twitter posts and articles; and “views” are an event associating a reader (agent) with an article. The last case illustrates that often we do not have detailed information on the relationship but merely a count of them. 
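To make this relationship-centric view concrete, here is a minimal illustrative sketch (the record structure is ours, not drawn from any of the APIs above) that represents a few of the relationships encountered earlier in this chapter as typed edges between identified objects and agents.

>> from collections import namedtuple
>> Relation = namedtuple("Relation",
>>     ["source_type", "source_id", "relation", "target_type", "target_id"])
>> edges = [
>>     Relation("article", "10.1002/asi.23548", "cites",
>>              "article", "10.1371/journal.pbio.1001677"),
>>     Relation("tweet", "387686960585641984", "links_to",
>>              "article", "10.1371/journal.pbio.1001677"),
>>     Relation("person", "0000-0001-9623-2225", "authored",
>>              "article", "10.1371/journal.pbio.1001677"),
>> ]
>> # Counting a given type of relationship is then a simple filter
>> len([e for e in edges if e.relation == "cites"])

Viewed this way, a citation count or a view count is simply an aggregate over edges of one type, which is the framing used in the remainder of this section.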
Relationships between agents (such as co-authorship or group membership) can also be important. With this framing in hand, we can examine which types of relationships we can obtain data on. We need to consider both the quality of the data available and the completeness of that availability. Such assessments are necessarily subjective, and any analysis will be a personal view of a particular snapshot in time. Nevertheless, some major trends are apparent.

We have growing and improving data on the relationships between a wide range of objects and agents and traditional scholarly outputs. Although it is sparse and incomplete in many places, nontraditional information on traditional outputs is becoming more available and increasingly rich. By contrast, references from traditional outputs to nontraditional outputs are weaker, and the data that allow us to understand the relationships between nontraditional outputs are very sparse.

In the context of the current volume, a major weakness is our inability to triangulate around people and communities. While it may be possible to collect a set of co-authors from a bibliographic data source and to identify a community of potential research users on Twitter or Facebook, it is extremely challenging to connect these different sets. If a community is discussing an article or book on social media, it is almost impossible to ascertain whether the authors (or, more generically, interested parties such as authors of cited works or funders) are engaged in that conversation.

2.9.4 The value of sparse dynamic data

Two clear messages arise from our analysis. These new forms of data are incomplete or sparse, both in quality and in coverage, and they change. A data source that is poor today may be much improved tomorrow. A query performed one minute may give different results the next. This can be both a strength and a weakness: data are up to the minute, giving a view of relationships as they form (and break), but it makes ensuring consistency within and across analyses challenging. Compared to traditional surveys, these data sources cannot be relied on to be either representative or stable.

A useful question to ask, therefore, is what kind of statements these data can support. Such questions will necessarily be different from those that can be posed with high-quality survey data. More often these data provide an existence proof that something has happened, but they cannot, conversely, show that it has not. They enable some forms of comparison and determination of the characteristics of activity in some cases. Because much of the data that we have is sparse, the absence of an indicator cannot reliably be taken to mean an absence of activity. For example, a lack of Mendeley bookmarks may not mean that a paper is not being saved by researchers, just that those who do save the article are not using Mendeley to do it. Similarly, a lack of tweets about an article does not mean the article is not being discussed. But we can use the data that do exist to show that some activity is occurring. Here are some examples:

• Provide evidence that relevant communities are aware of a specific paper. I identified the fact that a paper by Jewkes et al.
[191] was mentioned by crisis centers, sexual health organizations, and discrimination support groups in South Africa when I was looking for University of Cape Town papers that had South African Twitter activity using Altmetric.com.

• Provide evidence that a relatively under-cited paper is having a research impact. There is a certain kind of research article, often a method description or a position paper, that is influential without being (apparently) heavily cited. For instance, the PLoS One article by Shen et al. [341] has a respectable 14,000 views and 116 Mendeley bookmarks, but a relatively small number of WoS citations (19) for that number of views, compared to, say, another article, by Leahy et al. [231] and also in PLoS One, that is similar in age and number of views but has many more citations.

• Provide evidence of public interest in some topic. Many articles at the top of lists ordered by views or social media mentions are of ephemeral (or prurient) interest: the usual trilogy of sex, drugs, and rock and roll. However, if we dig a little deeper, a wide range of articles surface, often not highly cited but clearly of wider interest. For example, an article on Y-chromosome distribution in Afghanistan [146] has high page views and Facebook activity among papers with a Harvard affiliation but is not about sex, drugs, or rock and roll. Unfortunately, because these are Facebook data we cannot see who is talking about the article, which limits our ability to say which groups are discussing it; that information could be quite interesting.

Comparisons using social media or download statistics need real care. As noted above, the data are sparse, so it is important that comparisons are fair. Also, comparisons need to be made on the basis of something that the data can actually tell you: for example, "which article is discussed more by this online community," not "which article is discussed more."

• Compare the extent to which these articles are discussed by this online patient group, or possibly by specific online communities in general. Here the online communities might be a proxy for a broader community, or there might be a specific interest in knowing whether the dissemination strategy reaches this community. It is clear that in the longer term social media will be a substantial pathway for research to reach a wide range of audiences, and understanding which communities are discussing what research will help us to optimize the communication.

• Compare the readership of these articles in these countries. One thing that most data sources are weak on at the moment is demographics, but in principle the data are there. Are articles that deal with diseases of specific areas actually being viewed by readers in those areas? If not, why not? Do those readers have Internet access, could lay summaries improve dissemination, are they going to secondary online sources instead?

• Compare the communities discussing these articles online. Is most conversation driven by science communicators or by researchers? Are policymakers, or those who influence them, involved? What about practitioner communities? These comparisons require care, and simple counting rarely provides useful information. But understanding which people within which networks are driving conversations can give insight into who is aware of the work and whether it is reaching target audiences.

What flavor is it? Priem et al. [310] provide a thoughtful analysis of the PLOS Article Level Metrics data set.
They used principal component analysis to define different “flavors of impact” based on the way different combinations of signals seemed to point to different kinds of interest. Many of the above use cases are variants on this theme—what kind of article is this? Is it a policy piece, of public interest? Is it of interest to a niche research community or does it have wider public implications? Is it being used in education or in health practice? And to what extent are these different kinds of use independent from each other? It is important to realize that these kinds of data are proxies of things that we do not truly understand. They are signals of the flow of information down paths that we have not mapped. To me this is the most exciting possibility and one we are only just starting to explore. What can these signals tell us about the underlying path- 2.10. Bringing it together: Tracking pathways to impact ways down which information flows? How do different combinations of signals tell us about who is using that information now, and how they might be applying it in the future? Correlation analysis cannot answer these questions, but more sophisticated approaches might. And with that information in hand we could truly design scholarly communication systems to maximize their reach, value, and efficiency. 2.10 Bringing it together: Tracking pathways to impact Collecting data on research outputs and their performance clearly has significant promise. However, there are a series of substantial challenges in how best to use these data. First, as we have seen, it is sparse and patchy. Absence of evidence cannot be taken as evidence of absence. But, perhaps more importantly, it is unclear in many cases what these various proxies actually mean. Of course this is also true of more familiar indicators like citations. Finally, there is a challenge in how to effectively analyze these data. The sparse nature of the data is a substantial problem in itself, but in addition there are a number of significantly confounding effects. The biggest of these is time. The process of moving research outputs and their use online is still proceeding, and the uptake and penetration of online services and social media by researchers and other relevant communities has increased rapidly over the past few years and will continue to do so for some time. These changes are occurring on a timescale of months, or even weeks, so any analysis must take into account how those changes may contribute to any observed signal. Much attention has focused on how different quantitative proxies correlate with each other. In essence this has continued the mistake that has already been made with citations. Focusing on proxies themselves implicitly makes the assumption that it is the proxy that matters, rather than the underlying process that is actually of interest. Citations are irrelevant; what matters is the influence that a piece of research has had. Citations are merely a proxy for a particular slice of influence, a (limited) indicator of the underlying process in which a research output is used by other researchers. Of course, these are common challenges for many “big data” situations. The challenge lies in using large, but disparate and messy, data sets to provide insight while avoiding the false positives 65 66 2. Working with Web Data and APIs that will arise from any attempt to mine data blindly for correlations. 
Using the appropriate models and tools and careful validation of findings against other sources of data are the way forward. 2.10.1 Network analysis approaches One approach is to use these data to dissect and analyze the (visible) network of relationships between agents and objects. This approach can be useful in defining how networks of collaborators change over time, who is in contact with whom, and how outputs are related to each other. This kind of analysis has been productive with citation graphs (see Eigenfactor for an example) as well as with small-scale analysis of grant programs (see, for instance, the Lattes analysis of the network grant program). Network analysis techniques and visualization are covered in Chapter 8 (on networks) and clustering and categorization in Chapter 6 (on machine learning). Networks may be built up from any combination of outputs, actors/agents, and their relationships to each other. Analyses that may be particularly useful are those searching for highly connected (proxy for influential) actors or outputs, clustering to define categories that emerge from the data itself (as opposed to external categorization) and comparisons between networks, both between those built from specific nodes (people, outputs) and between networks that are built from data relating to different time frames. Care is needed with such analyses to make sure that comparisons are valid. In particular, when doing analyses of different time frames, it is important to compare any change in the network characteristics that are due to general changes over time as opposed to specific changes. As noted above, this is particularly important with networks based on social media data, as any networks are likely to have increased in size and diversity over the past few years as more users interested in research have joined. It is important to distinguish in these cases between changes relating to a specific intervention or treatment and those that are environmental. As with any retrospective analysis, a good counterfactual sample is required. 2.10.2 Future prospects and new data sources As the broader process of research moves online we are likely to have more and more information on what is being created, by whom, 2.11. Summary and when. As access to these objects increases, both through provision of open access to published work and through increased data sharing, it will become more and more feasible to mine the objects themselves to enrich the metadata. And finally, as the use of unique identifiers increases for both outputs and people, we will be able to cross-reference across data sources much more strongly. Much of the data currently being collected is of poor quality or is inconsistently processed. Major efforts are underway to develop standards and protocols for initial processing, particularly for page view and usage data. Alongside efforts such as the Crossref DOI Event Tracker Service to provide central clearing houses for data, both consistency and completeness will continue to rise, making new and more comprehensive forms of analysis feasible. Perhaps the most interesting prospect is new data that arise as more of the outputs and processes of research move online. As the availability of data outputs, software products, and even potentially the raw record of lab notebooks increases, we will have opportunities to query how (and how much) different reagents, techniques, tools, and instruments are being used. 
As the process of policy development and government becomes more transparent and better connected, it will be possible to watch in real time as research has its impact on the public sphere. And as health data moves online there will be opportunities to see how both chemical and behavioral interventions affect health outcomes in real time. In the end all of this data will also be grist to the mill for further research. For the first time we will have the opportunity to treat the research enterprise as a system that is subject to optimization and engineering. Once again the challenges of what it is we are seeking to optimize for are questions that the data itself cannot answer, but in turn the data can better help us to have the debate about what matters. 2.11 Summary The term research impact is difficult and politicized, and it is used differently in different areas. At its root it can be described as the change that a particular part of the research enterprise (e.g., research project, researcher, funding decision, or institute) makes in the world. In this sense, it maps well to standard approaches in the social sciences that seek to identify how an intervention has led to change. 67 68 2. Working with Web Data and APIs The link between “impact” and the distribution of limited research resources makes its definition highly political. In fact, there are many forms of impact, and the different pathways to different kinds of change, further research, economic growth, improvement in health outcomes, greater engagement of citizenry, or environmental change may have little in common beyond our interest in how they can be traced to research outputs. Most public policy on research investment has avoided the difficult question of which impacts are most important. In part this is due to the historical challenges of providing evidence for these impacts. We have only had good data on formal research outputs, primarily journal articles, and measures have focused on naïve metrics such as productivity or citations, or on qualitative peer review. Broader impacts have largely been evidenced through case studies, an expensive and nonscalable approach. The move of research processes online is providing much richer and more diverse information on how research outputs are used and disseminated. We have the prospect of collecting much more information around the performance and usage of traditional research outputs as well as greater data on the growing diversity of nontraditional research outputs that are now being shared. It is possible to gain quantitative information on the numbers of people looking at research, different groups talking about research (in different places), those citing research in different places, and recommendations and opinions on the value of work. These data are sparse and incomplete and its use needs to acknowledge these limitations, but it is nonetheless possible to gain new and valuable insights from analysis. Much of this data is available from web services in the form of application programming interfaces. Well-designed APIs make it easy to search for, gather, and integrate data from multiple sources. A key aspect of successfully integrating data is the effective use and application of unique identifiers across data sets that allow straightforward cross-referencing. Key among the identifiers currently being used are ORCIDs to uniquely identify researchers and DOIs, from both Crossref and increasingly DataCite, to identify research outputs. 
With good cross-referencing it is possible to obtain rich data sets that can be used as inputs to many of the techniques described elsewhere in the book. The analysis of this new data is a nascent field and the quality of work done so far has been limited. In my view there is a substantial opportunity to use these rich and diverse data sets to treat the 2.12. Resources underlying question of how research outputs flow from the academy to their sites of use. What are the underlying processes that lead to various impacts? This means treating these data sets as time domain signals that can be used to map and identify the underlying processes. This approach is appealing because it offers the promise of probing the actual process of knowledge diffusion while making fewer assumptions about what we think is happening. 2.12 Resources We talked a great deal here about how to access publications and other resources via their DOIs. Paskin [297] provides a nice summary of the problems that DOIs solve and how they work. ORCIDs are another key piece of this puzzle, as we have seen throughout this chapter. You might find some of the early articles describing the need for unique author IDs useful, such as Bourne et al. [46], as well as more recent descriptions [145]. More recent initiatives on expanding the scope of identifiers to materials and software have also been developed [24]. More general discussions of the challenges and opportunities of using metrics in research assessment may be found in recent reports such as the HEFCE Expert Group Report [405], and I have covered some of the broader issues elsewhere [274]. There are many good introductions to web scraping using BeautifulSoup and other libraries as well as API usage in general. Given the pace at which APIs and Python libraries change, the best and most up to date source of information is likely to be a web search. In other settings, you may be concerned with assigning DOIs to data that you generate yourself, so that you and others can easily and reliably refer to and access that data in their own work. Here we face an embarrassment of riches, with many systems available that each meet different needs. Big data research communities such as climate science [404], high-energy physics [304], and astronomy [367] operate their own specialized infrastructures that you are unlikely to require. For small data sets, Figshare [122] and DataCite [89] are often used. The Globus publication service [71] permits an institution or community to build their own publication system. 69 70 2. Working with Web Data and APIs 2.13 Acknowledgements and copyright Section 2.3 is adapted in part from Neylon et al. [275], copyright International Development Research Center, Canada, used here under a Creative Commons Attribution v 4.0 License. Section 2.9.4 is adapted in part from Neylon [273], copyright PLOS, used here under a Creative Commons Attribution v 4.0 License. Chapter 3 Record Linkage Joshua Tokle and Stefan Bender Big data differs from survey data in that it is typically necessary to combine data from multiple sources to get a complete picture of the activities of interest. Although computer scientists tend to simply “mash” data sets together, social scientists are rightfully concerned about issues of missing links, duplicative links, and erroneous links. This chapter provides an overview of traditional rule-based and probabilistic approaches, as well as the important contribution of machine learning to record linkage. 
3.1 Motivation Big data offers social scientists great opportunities to bring together many different types of data, from many different sources. Merging different data sets provides new ways of creating population frames that are generated from the digital traces of human activity rather than, say, tax records. These opportunities, however, create different kinds of challenges from those posed by survey data. Combining information from different sources about an individual, business, or geographic entity means that the social scientist must determine whether or not two entities on two different files are the same. This determination is not easy. In the UMETRICS data, if data are to be used to measure the impact of research grants, is David A. Miller from Stanford, CA, the same as David Andrew Miller from Fairhaven, NJ, in a list of inventors? Is Google the same as Alphabet if the productivity and growth of R&D-intensive firms is to be studied? Or, more generally, is individual A the same person as the one who appears on a list of terrorists that has been compiled? Does the product that a customer is searching for match the products that business B has for sale? 71 72 3. Record Linkage The consequences of poor record linkage decisions can be substantial. In the business arena, Christen reports that as much as 12% of business revenues are lost due to bad linkages [76]. In the security arena, failure to match travelers to a “known terrorist” list may result in those individuals entering the country, while overzealous matching could lead to numbers of innocent citizens being detained. In finance, incorrectly detecting a legitimate purchase as a fraudulent one annoys the customer, but failing to identify a thief will lead to credit card losses. Less dramatically, in the scientific arena when studying patenting behavior, if it is decided that two inventors are the same person, when in fact they are not, then records will be incorrectly grouped together and one researcher’s productivity will be overstated. Conversely, if the records for one inventor are believed to correspond to multiple individuals, then that inventor’s productivity will be understated. This chapter discusses current approaches to joining multiple data sets together—commonly called record linkage. Other names associated with record linkage are entity disambiguation, entity resolution, co-reference resolution, statistical matching, and data fusion, meaning that records which are linked or co-referent can be thought of as corresponding to the same underlying entity. The number of names is reflective of a vast literature in social science, statistics, computer science, and information sciences. We draw heavily here on work by Winkler, Scheuren, and Christen, in particular [76, 77, 165]. To ground ideas, we use examples from a recent paper examining the effects of different algorithms on studies of patent productivity [387]. 3.2 Introduction to record linkage There are many reasons to link data sets. Linking to existing data sources to solve a measurement need instead of implementing a new survey results in cost savings (and almost certainly time savings as well) and reduced burden on potential survey respondents. For some research questions (e.g., a survey of the reasons for death of a longitudinal cohort of individuals) a new survey may not be possible. In the case of administrative data or other automatically generated data, the sample size is much greater than would be possible from a survey. 
Record linkage can be used to compensate for data quality issues. If a large number of observations for a particular field are missing, it may be possible to link to another data source to fill in the missing values. For example, survey respondents might not want to share a sensitive datum like income. If the researcher has access to an official administrative list with income data, then those values can be used to supplement the survey [5]. Record linkage is often used to create new longitudinal data sets by linking the same entities over time [190]. More generally, linking separate data sources makes it possible to create a combined data set that is richer in coverage and measurement than any of the individual data sources [4].

Example: The Administrative Data Research Network
The UK's Administrative Data Research Network (ADRN) is a major investment by the United Kingdom to "improve our knowledge and understanding of the society we live in . . . [and] provide a sound base for policymakers to decide how to tackle a range of complex social, economic and environmental issues" by linking administrative data from a variety of sources, such as health agencies, court records, and tax records in a confidential environment for approved researchers. The linkages are done by trusted third-party providers [103]. ("Administrative data" typically refers to data generated by the administration of a government program, as distinct from deliberate survey collection.)

Linking is straightforward if each entity has a corresponding unique identifier that appears in the data sets to be linked. For example, two lists of US employees may both contain Social Security numbers. When a unique identifier exists in the data or can be created, no special techniques are necessary to join the data sets. If there is no unique identifier available, then the task of identifying unique entities is challenging. One instead relies on fields that only partially identify the entity, like names, addresses, or dates of birth. The problem is further complicated by poor data quality and duplicate records, issues well attested in the record linkage literature [77] and sure to become more important in the context of big data. Data quality issues include input errors (typos, misspellings, truncation, extraneous letters, abbreviations, and missing values) as well as differences in the way variables are coded between the two data sets (age versus date of birth, for example). In addition to record linkage algorithms, we will discuss different data preprocessing steps that are necessary first steps for the best results in record linkage.

To find all possible links between two data sets it would be necessary to compare each record of the first data set with each record of the second data set. The computational complexity of this approach grows quadratically with the size of the data—an important consideration, especially in the big data context. To compensate for this complexity, the standard second step in record linkage, after preprocessing, is indexing or blocking, which creates subsets of similar records and reduces the total number of comparisons. (This topic is discussed in more detail in Chapter 10.) The outcome of the matching step is a set of predicted links—record pairs that are likely to correspond to the same entity. After these are produced, the final stage of the record linkage process is to evaluate the result and estimate the resulting error rates.
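For reference, the straightforward case described above, in which both files carry a shared unique identifier such as a Social Security number, needs no special record linkage machinery at all. A minimal sketch using pandas (the file contents and column names are hypothetical):

```python
import pandas as pd

# Two hypothetical employee lists that share a unique identifier ("ssn").
employees_a = pd.DataFrame({
    "ssn": ["123-45-6789", "987-65-4321"],
    "employer": ["Acme Corp", "Initech"],
})
employees_b = pd.DataFrame({
    "ssn": ["123-45-6789", "555-11-2222"],
    "annual_wage": [52000, 61000],
})

# An exact join on the shared identifier is all that is required.
linked = employees_a.merge(employees_b, on="ssn", how="inner")
print(linked)
```

Everything that follows addresses the harder case in which no such identifier is available.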
Unlike other areas of application for predictive algorithms, ground truth or gold standard data sets are rarely available. The only way to create a reliable truth data set sometimes is through an expensive clerical review process that may not be viable for a given application. Instead, error rates must be estimated. An input data set may contribute to the linked data in a variety of ways, such as increasing coverage, expanding understanding of the measurement or mismeasurement of underlying latent variables, or adding new variables to the combined data set. It is therefore important to develop a well-specified reason for linking the data sets, and to specify a loss function to proxy the cost of false negative matches versus false positive matches that can be used to guide match decisions. It is also important to understand the coverage of the different data sets being linked because differences in coverage may result in bias in the linked data. For example, consider the problem of linking Twitter data to a sample-based survey—elderly adults and very young children are unlikely to use Twitter and so the set of records in the linked data set will have a youth bias, even if the original sample was representative of the population. It is also essential to engage in critical thinking about what latent variables are being captured by the measures in the different data sets—an “occupational classification” in a survey data set may be very different from a “job title” in an administrative record or a “current position” in LinkedIn data. Example: Employment and earnings outcomes of doctoral recipients A recent paper in Science matched UMETRICS data on doctoral recipients to Census data on earnings and employment outcomes. The authors note that some 20% of the doctoral recipients are not matched for several reasons: (i) the recipient does not have a job in the US, either for family reasons or because he/she goes back to his/her home country; (ii) he/she starts up a business rather than 3.2. Introduction to record linkage 75 choosing employment; or (iii) it is not possible to uniquely match him/her to a Census Bureau record. They correctly note that there may be biases introduced in case (iii), because Asian names are more likely duplicated and harder to uniquely match [415]. Improving the linkage algorithm would increase the estimate of the effects of investments in research and the result would be more accurate. Comparing the kinds of heterogeneous records associated with big data is a new challenge for social scientists, who have traditionally used a technique first developed in the 1960s to apply computers to the problem of medical record linkage. There is a reason why this approach has survived: it has been highly successful in linking survey data to administrative data, and efficient implementations of this algorithm can be applied at the big data scale. However, the approach is most effective when the two files being linked have a number of fields in common. In the new landscape of big data, there is a greater need to link files that have few fields in common but whose noncommon fields provide additional predictive power to determine which records should be linked. In some cases, when sufficient training data can be produced, more modern machine learning techniques may be applied. The canonical record linkage workflow process is shown in Figure 3.1 for two data files, A and B. The goal is to identify all pairs of records in the two data sets that correspond to the same underlying individual. 
One approach is to compare all data units from file A with all units in file B and classify all of the comparison outcomes to decide whether or not the records match.

[Figure 3.1. The preprocessing pipeline: records from data file A and data file B are compared, and the comparison outcomes are classified into links and nonlinks.]

In a perfect statistical world the comparison would end with a clear determination of links and nonlinks. Alas, a perfect world does not exist, and there is likely to be noise in the variables that are common to both data sets and that will be the main identifiers for the record linkage. Although the original files A and B are the starting point, the identifiers must be preprocessed before they can be compared. Determining identifiers for the linkage and deciding on the associated cleaning steps are extremely important, as they result in a necessary reduction of the possible search space. In the next section we begin our overview of the record linkage process with a discussion of the main steps in data preprocessing. This is followed by a section on approaches to record linkage that includes rule-based, probabilistic, and machine learning algorithms. Next we cover classification and evaluation of links, and we conclude with a discussion of data privacy in record linkage.

3.3 Preprocessing data for record linkage

As noted in the introductory chapter, all data work involves preprocessing, and data that need to be linked are no exception. (Quality of data and preprocessing issues are discussed in more detail in Section 1.4.) Preprocessing refers to a workflow that transforms messy, noisy, and unstructured data into a well-defined, clearly structured, and quality-tested data set. Elsewhere in this book, we discuss general strategies for data preprocessing. In this section, we focus specifically on preprocessing steps relating to the choice of input fields for the record linkage algorithm. Preprocessing for any kind of a new data set is a complex and time-consuming process because it is "hands-on": it requires judgment and cannot be effectively automated. It may be tempting to minimize this demanding work under the assumption that the record linkage algorithm will account for issues in the data, but it is difficult to overstate the value of preprocessing for record linkage quality. As Winkler notes: "In situations of reasonably high-quality data, preprocessing can yield a greater improvement in matching efficiency than string comparators and 'optimized' parameters. In some situations, 90% of the improvement in matching efficiency may be due to preprocessing" [406].

The first step in record linkage is to develop link keys, which are the record fields that will be used to predict link status. These can include common identifiers like first and last name. Survey and administrative data sets may include a number of clearly identifying variables like address, birth date, and sex. Other data sets, like transaction records or social media data, often will not include address or birth date but may still include other identifying fields like occupation, a list of interests, or connections on a social network. Consider this chapter's illustrative example of the US Patent and Trademark Office (USPTO) data [387]: USPTO maintains an online database of all patents issued in the United States.
In addition to identifying information about the patent, the database contains each patent’s list of inventors and assignees, the companies, organizations, individuals, or government agencies to which the patent is assigned. . . . However, inventors and assignees in the USPTO database are not given unique identification numbers, making it difficult to track inventors and assignees across their patents or link their information to other data sources. There are some basic precepts that are useful when considering identifying fields. The more different values a field can take, the less likely it is that two randomly chosen individuals in the population will agree on those values. Therefore, fields that exhibit a wider range of values are more powerful as link keys: names are much better link keys than sex or year of birth. Example: Link keys in practice “A Harvard professor has re-identified the names of more than 40 percent of a sample of anonymous participants in a high-profile DNA study, highlighting the dangers that ever greater amounts of personal data available in the Internet era could unravel personal secrets. . . . Of the 1,130 volunteers Sweeney and her team reviewed, about 579 provided zip code, date of birth and gender, the three key pieces of information she needs to identify anonymous people combined with information from voter rolls or other public records. Of these, Sweeney succeeded in naming 241, or 42 percent of the total. The Personal Genome Project confirmed that 97 percent of the names matched those in its database if nicknames and first name variations were included” [369]. Complex link keys like addresses can be broken down into components so that the components can be compared independently of one another. This way, errors due to data quality can be further 77 78 3. Record Linkage isolated. For example, assigning a single comparison value to the complex fields “1600 Pennsylvania” and “160 Pennsylvania Ave” is less informative than assigning separate comparison values to the street number and street name portions of those fields. A record linkage algorithm that uses the decomposed field can make more nuanced distinctions by assigning different weights to errors in each component. Sometimes a data set can include different variants of a field, like legal first name and nickname. In these cases match rates can be improved by including all variants of the field in the record comparison. For example, if only the first list includes both variants, and the second list has a single “first name” field that could be either a legal first name or a nickname, then match rates can be improved by comparing both variants and then keeping the better of the two comparison outcomes. It is important to remember, however, that some record linkage algorithms expect field comparisons to be somewhat independent. In our example, using the outcome from both comparisons as separate inputs into the probabilistic model we describe below may result in a higher rate of false negatives. If a record has the same value in the legal name and nickname fields, and if that value happens to agree with the first name field in the second file, then the agreement is being double-counted. By the same token, if a person in the first list has a nickname that differs significantly from their legal first name, then a comparison of that record to the corresponding record will unfairly penalize the outcome because at least one of those name comparisons will show a low level of agreement. 
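For instance, a minimal preprocessing sketch that decomposes a simple US-style street address into street-number and street-name components (the regular expression is purely illustrative; real address standardization usually relies on dedicated tools):

```python
import re

def decompose_address(address):
    """Split a simple street address into number and name components.

    Illustrative sketch only; production address parsing typically uses
    dedicated standardization software.
    """
    match = re.match(r"\s*(\d+)\s+(.*)", address)
    if match:
        return {"street_number": match.group(1),
                "street_name": match.group(2).strip().upper()}
    return {"street_number": None, "street_name": address.strip().upper()}

print(decompose_address("1600 Pennsylvania"))     # number and name compared separately
print(decompose_address("160 Pennsylvania Ave"))
```

Comparing the two components separately lets the linkage algorithm penalize a wrong street number differently from a wrong street name.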
Preprocessing serves two purposes in record linkage: first, to correct for issues in data quality that we described above; and second, to account for the different ways that the input files were generated, which may result in the same underlying data being recorded on different scales or according to different conventions. Once preprocessing is finished, it is possible to start linking the records in the different data sets. In the next section we describe a technique to improve the efficiency of the matching step.

3.4 Indexing and blocking

There is a practical challenge to consider when comparing the records in two files. If both files are roughly the same size, say 100 records in the first and 100 records in the second file, then there are 10,000 possible comparisons, because the number of pairs is the product of the number of records in each file. More generally, if the number of records in each file is approximately n, then the total number of possible record comparisons is approximately n². Assuming that there are no duplicate records in the input files, the proportion of record comparisons that correspond to a link is only 1/n. If we naively proceed with all n² possible comparisons, the linkage algorithm will spend the bulk of its time comparing records that are not matches. Thus it is possible to speed up record linkage significantly by skipping comparisons between record pairs that are not likely to be linked.

Indexing refers to techniques that determine which of the possible comparisons will be made in a record linkage application. The most commonly used technique for indexing is blocking. In this approach you construct a "blocking key" for each record by concatenating fields or parts of fields. Two records with identical blocking keys are said to be in the same block, and only records in the same block are compared. This technique is effective because performing an exact comparison of two blocking keys is a relatively quick operation compared to a full record comparison, which may involve multiple applications of a fuzzy string comparator.

Example: Blocking in practice
Given two lists of individuals, one might construct the blocking key by concatenating the first letter of the last name and the postal code and then "blocking" on first character of last name and postal code. This reduces the total number of comparisons by only comparing those individuals in the two files who live in the same locality and whose last names begin with the same letter.

There are important considerations when choosing the blocking key. First, the choice of blocking key creates a potential bias in the linked data because true matches that do not share the same blocking key will not be found. (This topic is discussed in more detail in Chapter 10.) In the example, the blocking strategy could fail to match records for individuals whose last name changed or who moved. Second, because blocking keys are compared exactly, there is an implicit assumption that the included fields will not have typos or other data entry errors. In practice, however, the blocking fields will exhibit typos. If those typos are not uniformly distributed over the population, then there is again the possibility of bias in the linked data set. One simple strategy for dealing with imperfect blocking keys is to implement multiple rounds of blocking and matching.
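A minimal pandas sketch of the blocking strategy from the example above (the record contents and column names are hypothetical):

```python
import pandas as pd

# Hypothetical input lists; in practice these come from the two files being linked.
file_a = pd.DataFrame({"id_a": [1, 2, 3],
                       "last_name": ["Miller", "Martinez", "Chen"],
                       "postal_code": ["94305", "10027", "94305"]})
file_b = pd.DataFrame({"id_b": [10, 11, 12],
                       "last_name": ["Millar", "Chen", "Chan"],
                       "postal_code": ["94305", "94305", "94305"]})

def add_blocking_key(df):
    # Blocking key: first letter of the last name + postal code.
    df = df.copy()
    df["block"] = df["last_name"].str[0].str.upper() + df["postal_code"]
    return df

file_a = add_blocking_key(file_a)
file_b = add_blocking_key(file_b)

# Candidate pairs are only those record pairs that share a blocking key.
candidate_pairs = file_a.merge(file_b, on="block", suffixes=("_a", "_b"))
print(candidate_pairs[["id_a", "id_b", "last_name_a", "last_name_b", "block"]])
```

Only the candidate pairs produced this way go on to the more expensive field-by-field comparison step.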
After the first set of matches is produced, a new blocking strategy is deployed to search for additional matches in the remaining record pairs. Blocking based on exact field agreements is common in practice, but there are other approaches to indexing that attempt to be more error tolerant. For example, one may use clustering algorithms to identify sets of similar records. In this approach an index key, which is analogous to the blocking key above, is generated for both data sets and then the keys are combined into a single list. A distance function must be chosen and pairwise distances computed for all keys. The clustering algorithm is then applied to the combined list, and only record pairs that are assigned to the same cluster are compared. This is a theoretically appealing approach but it has the drawback that the similarity metric has to be computed for all pairs of records. Even so, computing the similarity measure for a pair of blocking keys is likely to be cheaper than computing the full record comparison, so there is still a gain in efficiency. Whang et al. [397] provide a nice review of indexing approaches. In addition to reducing the computational burden of record linkage, indexing plays an important secondary role. Once implemented, the fraction of comparisons made that correspond to true links will be significantly higher. For some record linkage approaches that use an algorithm to find optimal parameters—like the probabilistic approach—having a larger ratio of matches to nonmatches will produce a better result. 3.5 Matching The purpose of a record linkage algorithm is to examine pairs of records and make a prediction as to whether they correspond to the same underlying entity. (There are some sophisticated algorithms that examine sets of more than two records at a time [359], but pairwise comparison remains the standard approach.) At the core of every record linkage algorithm is a function that compares two records and outputs a “score” that quantifies the similarity between those records. Mathematically, the match score is a function of the output from individual field comparisons: agreement in the first name field, agreement in the last name field, etc. Field comparisons may be binary—indicating agreement or disagreement—or they may output a range of values indicating different levels of agreement. There are a variety of methods in the statistical and computer science literature that can be used to generate a match score, includ- 3.5. Matching 81 ing nearest-neighbor matching, regression-based matching, and propensity score matching. The probabilistic approach to record linkage defines the match score in terms of a likelihood ratio [118]. Example: Matching in practice Long strings, such as assignee and inventor names, are susceptible to typographical errors and name variations. For example, none of Sony Corporation, Sony Corporatoin and Sony Corp. will match using simple exact matching. Similarly, David vs. Dave would not match [387]. Comparing fields whose values are continuous is straightforward: often one can simply take the absolute difference as the comparison value. Comparing character fields in a rigorous way is more complicated. For this purpose, different mathematical definitions of the distance between two character fields have been defined. Edit distance, for example, is defined as the minimum number of edit operations—chosen from a set of allowed operations—needed to convert one string to another. 
When the set of allowed edit operations is single-character insertions, deletions, and substitutions, the corresponding edit distance is also known as the Levenshtein distance. When transposition of adjacent characters is allowed in addition to those operations, the corresponding edit distance is called the Levenshtein–Damerau distance. Edit distance is appealing because of its intuitive definition, but it is not the most efficient string distance to compute. Another standard string distance known as Jaro–Winkler distance was developed with record linkage applications in mind and is faster to compute. This is an important consideration because in a typical record linkage application most of the algorithm run time will be spent performing field comparisons. The definition of Jaro–Winkler distance is less intuitive than edit distance, but it works as expected: words with more characters in common will have a higher Jaro–Winkler value than those with fewer characters in common. The output value is normalized to fall between 0 and 1. Because of its history in record linkage applications, there are some standard variants of Jaro–Winkler distance that may be implemented in record linkage software. Some variants boost the weight given to agreement in the first few characters of the strings being compared. Others decrease the score penalty for letter substitutions that arise from common typos. 82 3. Record Linkage Once the field comparisons are computed, they must be combined to produce a final prediction of match status. In the following sections we describe three types of record linkage algorithms: rulebased, probabilistic, and machine learning. 3.5.1 Rule-based approaches A natural starting place is for a data expert to create a set of ad hoc rules that determine which pairs of records should be linked. In the classical record linkage setting where the two files have a number of identifying fields in common, this is not the optimal approach. However, if there are few fields in common but each file contains auxiliary fields that may inform a linkage decision, then an ad hoc approach may be appropriate. Example: Linking in practice Consider the problem of linking two lists of individuals where both lists contain first name, last name, and year of birth. Here is one possible linkage rule: link all pairs of records such that • the Jaro–Winkler comparison of first names is greater than 0.9 • the Jaro–Winkler comparison of last names is greater than 0.9 • the first three digits of the year of birth are the same. The result will depend on the rate of data errors in the year of birth field and typos in the name fields. By auxiliary field we mean data fields that do not appear on both data sets, but which may nonetheless provide information about whether records should be linked. Consider a situation in which the first list includes an occupation field and the second list includes educational history. In that case one might create additional rules to eliminate matches where the education was deemed to be an unlikely fit for the occupation. This method may be attractive if it produces a reasonable-looking set of links from intuitive rules, but there are several pitfalls. As the number of rules grows it becomes harder to understand the ways that the different rules interact to produce the final set of links. There is no notion of a threshold that can be increased or decreased depending on the tolerance for false positive and false negative errors. 
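The linkage rule in the example above might be sketched as follows. This sketch assumes the third-party jellyfish library for the Jaro–Winkler comparison (older versions name the function jaro_winkler rather than jaro_winkler_similarity), and the field names are hypothetical:

```python
import jellyfish  # third-party string-comparison library, assumed installed

def is_link(rec_a, rec_b):
    """Ad hoc rule: high name similarity plus near-agreement on year of birth."""
    return (
        jellyfish.jaro_winkler_similarity(rec_a["first_name"], rec_b["first_name"]) > 0.9
        and jellyfish.jaro_winkler_similarity(rec_a["last_name"], rec_b["last_name"]) > 0.9
        and str(rec_a["year_of_birth"])[:3] == str(rec_b["year_of_birth"])[:3]
    )

a = {"first_name": "Dave", "last_name": "Miller", "year_of_birth": 1978}
b = {"first_name": "David", "last_name": "Millar", "year_of_birth": 1979}
print(is_link(a, b))
```

Changing the 0.9 cutoffs or the year-of-birth condition changes the rule wholesale, which is exactly the lack of a tunable threshold described above.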
The rules themselves are not chosen to satisfy any kind of optimality, unlike the probabilistic and machine learning 3.5. Matching methods. Instead, they reflect the practitioner’s domain knowledge about the data sets. 3.5.2 Probabilistic record linkage In this section we describe the probabilistic approach to record linkage, also known as the Fellegi–Sunter algorithm [118]. This approach dominates in traditional record linkage applications and remains an effective and efficient way to solve the record linkage problem today. In this section we give a somewhat formal definition of the statistical model underlying the algorithm. By understanding this model, one is better equipped to define link keys and record comparisons in an optimal way. Example: Usefulness of probabilistic record linkage In practice, it is typically the case that a researcher will want to combine two or more data sets containing records for the same individuals or units that possibly come from different sources. Unless the sources all contain the same unique identifiers, linkage will likely require matching on standardized text strings. Even standardized data are likely to contain small differences that preclude exact matching as in the matching example above. The Census Bureau’s Longitudinal Business Database (LBD) links establishment records from administrative and survey sources. Exact numeric identifiers do most of the heavy lifting, but mergers, acquisitions, and other actions can break these linkages. Probabilistic record linkage on company names and/or addresses is used to fix these broken linkages that bias statistics on business dynamics [190]. Let A and B be two lists of individuals whom we wish to link. The product set A × B contains all possible pairs of records where the first element of the pair comes from A and the second element of the pair comes from B. A fraction of these pairs will be matches, meaning that both records in the pair represent the same underlying individual, but the vast majority of them will be nonmatches. In other words, A × B is the disjoint union of the set of matches M and the set of nonmatches U , a fact that we denote formally by A × B = M ∪ U. Let γ be a vector-valued function on A × B such that, for a ∈ A and b ∈ B, γ (a, b) represents the outcome of a set of field comparisons between a and b. For example, if both A and B contain data on 83 84 3. Record Linkage individuals’ first names, last names, and cities of residence, then γ could be a vector of three binary values representing agreement in first name, last name, and city. In that case γ (a, b) = (1, 1, 0) would mean that the records a and b agree on first name and last name, but disagree on city of residence. For this model, the comparison outcomes in γ (a, b) are not required to be binary, but they do have to be categorical: each component of γ (a, b) should take only finitely many values. This means that a continuous comparison outcome—such as output from the Jaro–Winkler string comparator—has to be converted to an ordinal value representing levels of agreement. For example, one might create a three-level comparison, using one level for exact agreement, one level for approximate agreement defined as a Jaro–Winkler score greater than 0.85, and one level for nonagreement corresponding to a Jaro–Winkler score less than 0.85. If a variable being used in the comparison has a significant number of missing values, it can help to create a comparison outcome level to indicate missingness. 
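A comparison of this kind, mapping a continuous string similarity to a small set of ordinal levels with an explicit missing level, might be sketched as follows (the 0.85 cutoff mirrors the example above; any string comparator can be passed in):

```python
def categorical_comparison(value_a, value_b, similarity, cutoff=0.85):
    """Map a pair of field values to an ordinal comparison outcome.

    Returns one of: "missing", "exact", "approximate", "disagree".
    """
    if not value_a or not value_b:
        return "missing"          # explicit level for missing data
    if value_a == value_b:
        return "exact"
    if similarity(value_a, value_b) > cutoff:
        return "approximate"
    return "disagree"

# Example usage with a trivial standard-library comparator; a real application
# would pass in a Jaro-Winkler or edit-distance-based similarity instead.
import difflib
sim = lambda a, b: difflib.SequenceMatcher(None, a, b).ratio()
print(categorical_comparison("JONATHAN", "JONATHON", sim))
print(categorical_comparison("", "JONATHON", sim))
```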
Consider two data sets that both have middle initial fields, and suppose that in one of the data sets the middle initial is filled in only about half of the time. When comparing records, the case where both middle initials are filled in but are not the same should be treated differently from the case where one of the middle initials is blank, because the first case provides more evidence that the records do not correspond to the same person. We handle this in the model by defining a three-level comparison for the middle initial, with levels to indicate “equal,” “not equal,” and “missing.” Probabilistic record linkage works by weighing the probability of seeing the result γ (a, b) if (a, b) belongs to the set of matches M against the probability of seeing the result if (a, b) belongs to the set of nonmatches U . Conditional on M or U , the distribution of the individual comparisons defined by γ are assumed to be mutually independent. The parameters that define the marginal distributions of γ |M are called m-weights, and similarly the marginal distributions of γ |U are called u-weights. In order to apply the Fellegi–Sunter method, it is necessary to choose values for these parameters, m-weights and u-weights. With labeled data—a pair of lists for which the match status is known— it is straightforward to solve for optimal values. Training data are not usually available, however, and the typical approach is to use expectation maximization to find optimal values. We have noted that primary motivation for record linkage is to create a linked data set for analysis that will have a richer set of fields 3.5. Matching 85 than either of the input data sets alone. A natural application is to perform a linear regression using a combination of variables from both files as predictors. With all record linkage approaches it is a challenge to understand how errors from the linkage process will manifest in the regression. Probabilistic record linkage has an advantage over rule-based and machine learning approaches in that there are theoretical results concerning coefficient bias and errors [221, 329]. More recently, Chipperfield and Chambers have developed an approach based on the bootstrap to account for record linkage errors when making inferences for cross-tabulated variables [75]. 3.5.3 Machine learning approaches to linking Computer scientists have contributed extensively in parallel literature focused on linking large data sets [76]. Their focus is on identifying potential links using approaches that are fast and scalable, and approaches are developed based on work in network algorithms and machine learning. While simple blocking as described in Section 3.4 is standard in Fellegi–Sunter applications, computer scientists are likely to use the more sophisticated clustering approach to indexing. Indexing may also use network information to include, for example, records for individuals that have a similar place in a social graph. When linking lists of researchers, one might specify that comparisons should be made between records that share the same address, have patents in the same patent class, or have overlapping sets of coinventors. These approaches are known as semantic blocking, and the computational requirements are similar to standard blocking [76]. In recent years machine learning approaches have been applied to record linkage following their success in other areas of prediction and classification. 
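Before turning to the machine learning methods, it may help to see how the probabilistic pieces above combine into a single match score. The following is a minimal sketch of a Fellegi–Sunter-style log-likelihood-ratio weight; the m- and u-probabilities are assumed values chosen for illustration, not estimates obtained from EM:

```python
import math

# Assumed m- and u-probabilities for three binary field comparisons
# (first name, last name, city). In practice these would be estimated,
# for example with the EM algorithm, rather than fixed by hand.
m = {"first_name": 0.95, "last_name": 0.97, "city": 0.80}
u = {"first_name": 0.01, "last_name": 0.005, "city": 0.10}

def match_weight(gamma):
    """Sum of log-likelihood ratios for a comparison vector gamma.

    gamma maps each field to 1 (agree) or 0 (disagree), under the
    conditional independence assumption described in the text.
    """
    weight = 0.0
    for field, agree in gamma.items():
        if agree:
            weight += math.log2(m[field] / u[field])
        else:
            weight += math.log2((1 - m[field]) / (1 - u[field]))
    return weight

# gamma = (1, 1, 0): agree on first and last name, disagree on city.
print(match_weight({"first_name": 1, "last_name": 1, "city": 0}))
```

Pairs with high weights are far more likely under the match distribution than under the nonmatch distribution; classifying on this score is the subject of Section 3.6.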
Computer scientists couch the analytical problem as one of entity resolution, even though the conceptual problem is identical. As Wick et al. [400] note:

"Entity resolution, the task of automatically determining which mentions refer to the same real-world entity, is a crucial aspect of knowledge base construction and management. However, performing entity resolution at large scales is challenging because (1) the inference algorithms must cope with unavoidable system scalability issues and (2) the search space grows exponentially in the number of mentions. Current conventional wisdom declares that performing coreference at these scales requires decomposing the problem by first solving the simpler task of entity-linking (matching a set of mentions to a known set of KB entities), and then performing entity discovery as a post-processing step (to identify new entities not present in the KB). However, we argue that this traditional approach is harmful to both entity-linking and overall coreference accuracy. Therefore, we embrace the challenge of jointly modeling entity-linking and entity discovery as a single entity resolution problem."

Figure 3.2 provides a useful comparison between classical record linkage and learning-based approaches. (This topic is discussed in more detail in Chapter 6.) In machine learning there is a predictive model and an algorithm for "learning" the optimal set of parameters to use in the predictive algorithm. The learning algorithm relies on a training data set. In record linkage, this would be a curated data set with true and false matches labeled as such. See [387] for an example and a discussion of how a training data set was created for the problem of disambiguating inventors in the USPTO database. Once optimal parameters are computed from the training data, the predictive model can be applied to unlabeled data to find new links. The quality of the training data set is critical; the model is only as good as the data it is trained on.

[Figure 3.2. Probabilistic (left) versus machine learning (right) approaches to linking. The probabilistic workflow runs from source and target data through blocking and similarity computation (attribute selection, similarity function) to a threshold-based match decision; the machine learning workflow adds training data selection (number of examples, selection scheme, threshold), model generation (learning algorithm, matcher selection), and model application. Source: Köpcke et al. [213]]

An example of a machine learning model that is popular for record linkage is the random forest model [50]. This is a classification model that fits a large number of classification trees to a labeled training data set. Each individual tree is trained on a bootstrap sample of all labeled cases using a random subset of predictor variables. After creating the classification trees, new cases are labeled by giving each tree a vote and keeping the label that receives the most votes. This highly randomized approach corrects for a problem with simple classification trees, which is that they may overfit to training data.

As shown in Figure 3.2, a major difference between probabilistic and machine learning approaches is the need for labeled training data to implement the latter approach. Usually training data are created through a painstaking process of clerical review. After an initial round of record linkage, a sample of record pairs that are not clearly matches or nonmatches is given to a research assistant who makes the final determination. In some cases it is possible to create training data by automated means, for example when there is a subset of the complete data that contains strongly identifying fields. Suppose that both of the candidate lists contain name and date of birth fields and that in the first list the date of birth data are complete, but in the second list only about 10% of records contain date of birth. For reasonably sized lists, name and date of birth together will be a nearly unique identifier. It is then possible to perform probabilistic record linkage on the subset of records with date of birth and be confident that the error rates would be small. If the subset of records with date of birth is representative of the complete data set, then the output from the probabilistic record linkage can be used as "truth" data.

Given a quality training data set, machine learning approaches may have advantages over probabilistic record linkage. Consider the random forest model. Random forests are more robust to correlated predictor variables, because only a random subset of predictors is included in any individual classification tree. The conditional independence assumption, to which we alluded in our discussion of the probabilistic model, can be dropped. An estimate of the generalization error can be computed in the form of "out-of-bag error." A measure of variable importance is computed that gives an idea of how powerful a particular field comparison is in terms of correctly predicting link status. Finally, unlike the Fellegi–Sunter model, predictor variables can be continuous. The combination of being robust to correlated variables and providing a variable importance measure makes random forests a useful diagnostic tool for record linkage models. It is possible to refine the record linkage model iteratively, by first including many predictor variables, including variants of the same comparison, and then using the variable importance measure to narrow down the predictors to a parsimonious set. There are many published studies on the effectiveness of random forests and other machine learning algorithms for record linkage. Christen and Ahmed et al. provide some pointers [77, 108].

3.5.4 Disambiguating networks

The problem of disambiguating entities in a network is closely related to record linkage: in both cases the goal is to consolidate multiple records corresponding to the same entity. Rather than finding the same entity in two data sets, however, the goal in network disambiguation is to consolidate duplicate records in a network data set. By network we mean that the data set contains not only typical record fields like names and addresses but also information about how entities relate to one another: entities may be coauthors, coinventors, or simply friends in a social network. The record linkage techniques that we have described in this chapter can be applied to disambiguate a network. To do so, one must convert the network to a form that can be used as input into a record linkage algorithm. For example, when disambiguating a social network one might define a field comparison whose output gives the fraction of friends in common between two records. Ventura et al. demonstrated the relative effectiveness of the probabilistic method and machine learning approaches to disambiguating a database of inventors in the USPTO database [387].
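For instance, the friends-in-common comparison mentioned above can be sketched as a simple set-overlap (Jaccard-style) measure:

```python
def shared_friends_fraction(friends_a, friends_b):
    """Fraction of friends in common between two records (Jaccard coefficient)."""
    set_a, set_b = set(friends_a), set(friends_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

print(shared_friends_fraction({"ada", "grace", "alan"}, {"grace", "alan", "edsger"}))  # 0.5
```

The resulting value can be fed into a record linkage model alongside conventional field comparisons such as name similarity.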
Another approach is to apply clustering algorithms from the computer science literature to identify groups of records that are likely to refer to the same entity. Huang et al. [172] have developed a successful method based on an efficient computation of distance between individuals in the network. These distances are then fed into the DBSCAN clustering algorithm to identify unique entities. 3.6 Classification Once the match score for a pair of records has been computed using the probabilistic or random forest method, a decision has to be made whether the pair should be linked. This requires classifying the pair as either a “true” or a “false” match. In most cases, a third classification is required—sending for manual review and classification. 3.6. Classification 3.6.1 Thresholds In the probabilistic and random forest approaches, both of which output a “match score” value, a classification is made by establishing a threshold T such that all records with a match score greater than T are declared to be links. Because of the way these algorithms are defined, the match scores are not meaningful by themselves and the threshold used for one linkage application may not be appropriate for another application. Instead, the classification threshold must be established by reviewing the model output. Typically one creates an output file that includes pairs of records that were compared along with the match score. The file is sorted by match score and the reviewer begins to scan the file from the highest match scores to the lowest. For the highest match scores the record pairs will agree on all fields and there is usually no question about the records being linked. However, as the scores decrease the reviewer will see more record pairs whose match status is unclear (or that are clearly nonmatches) mixed in with the clear matches. There are a number of ways to proceed, depending on the resources available and the goal of the project. Rather than set a single threshold, the reviewer may set two thresholds T1 > T2 . Record pairs with a match score greater than T1 are marked as matches and removed from further consideration. The set of record pairs with a match score between T1 and T2 are believed to contain significant numbers of matches and nonmatches. These are sent to clerical review, meaning that research assistants will make a final determination of match status. The final set of links will include clear matches with a score greater than T1 as well as the record pairs that pass clerical review. If the resources are available for this approach and the initial threshold T1 is set sufficiently high, then the resulting data set will contain a minimal number of false positive links. The collection of record pairs with match scores between T1 and T2 is sometimes referred to as the clerical review region. The clerical review region generally contains many more pairs than the set of clear matches, and it can be expensive and timeconsuming to review each pair. Therefore, a second approach is to establish tentative threshold T and send only a sample of record pairs with scores in a neighborhood of T to clerical review. This results in data on the relative numbers of true matches and true nonmatches at different score levels, as well as the characteristics of record pairs that appear at a given level. Based on the review and the relative tolerance for false positive errors and false negative 89 90 3. 
Record Linkage errors, a final threshold T ′ is set such that pairs with a score greater than T ′ are considered to be matches. After viewing the results of the clerical review, it may be determined that the parameters to the record linkage algorithm could be improved to create a clearer delineation between matches and nonmatches. For example, a research assistant may determine that many potential false positives appear near the tentative threshold because the current set of record linkage parameters is giving too much weight to agreement in first name. In this case the reviewer may decide to update the record linkage model to produce an improved set of match scores. The update may consist in an ad hoc adjustment of parameters, or the result of the clerical review may be used as training data and the parameter-fitting algorithm may be run again. An iterative approach like this is common when first linking two data sets because the clerical review process can improve one’s understanding of the data sets involved. Setting the threshold value higher will reduce the number of false positives (record pairs for which a link is incorrectly predicted) while increasing the number of false negatives (record pairs that should be linked but for which a link is not predicted). The proper tradeoff between false positive and false negative error rates will depend on the particular application and the associated loss function, but there are some general concerns to keep in mind. Both types of errors create bias, which can impact the generalizability of analyses conducted on the linked data set. Consider a simple regression on the linked data that includes fields from both data sets. If the threshold is too high, then the linked data will be biased toward records with no data entry errors or missing values, and whose fields did not change over time. This set of records may not be representative of the population as a whole. If a low threshold is used, then the set of linked records will contain more pairs that are not true links and the variables measured in those records are independent of each other. Including these records in a regression amounts to adding statistical noise to the data. 3.6.2 One-to-one links In the probabilistic and machine learning approaches to record linkage that we have described, each record pair is compared and a link is predicted independently of all other record pairs. Because of the independence of comparisons, one record in the first file may be predicted to link to multiple records in the second file. Under the assumption that each input file has been deduplicated, at most one 3.7. Record linkage and data protection of these predictions can correspond to a true link. For many applications it is preferable to extract a set of “best” links with the property that each record in one file links to at most one record in the second file. A set of links with this property is said to be one-to-one. One possible definition of “best” is a set of one-to-one links such that the sum of the match scores of all included links is maximal. This is an example of what is known as the assignment problem in combinatorial optimization. In the linear case above, where we care about the sum of match scores, the problem can be solved exactly using the Hungarian algorithm [216]. 3.7 91 ◮ This topic is discussed in more detail in Chapter 6. Record linkage and data protection In many social science applications data sets there is no need for data to include identifying fields like names and addresses. 
These fields may be deliberately omitted out of concern for privacy, or they may simply be irrelevant to the research question. For record linkage, however, names and addresses are among the best possible identifiers. We describe two approaches to the problem of balancing needs for both effective record linkage and privacy.

The first approach is to establish a trusted third party or safe center. The concept of trusted third parties (TTPs) is well known in cryptography. In the case of record linkage, a third party takes a place between the data owners and the data users, and it is this third party that actually performs the linkage work. Both the data owners and data users trust the third party in the sense that it assumes responsibility for data protection (data owners) and data competence (data users) at the same time. No party other than the TTP learns about the private data of the other parties. After record linkage only the linked records are revealed, with no identifiers attached. The TTP ensures that the released linked data set cannot be relinked to any of the source data sets. Possible third parties are safe centers, which are operated by lawyers, or official trusted institutions like the US Census Bureau. Some countries like the UK and Germany are establishing new institutions specifically to act as TTPs for record linkage work.

The second approach is known as privacy-preserving record linkage. The goal of this approach is to find the same individual in separate data files without revealing the identity of the individual [80]. (This topic is discussed in more detail in Chapter 11.) In privacy-preserving record linkage, cryptographic procedures are used to encrypt or hash identifiers before they are shared for record linkage. Many of these procedures require exact matching of the identifiers, however, and do not tolerate any errors in the original identifiers. This leads to information loss because it is not possible to account for typos or other small variations in hashed fields. To account for this, Schnell has developed a method to calculate string similarity of encrypted fields using Bloom filters [330, 332].

In many countries these approaches are combined. For example, when the UK established the ADRN, it adopted the trusted third party model: the third party is provided with data in which identifying fields have been hashed, which solves the challenge of trust between the different parties. Some authors argue that transparency of data use and informed consent will help to build trust. In the context of big data this is more challenging.

3.8 Summary

Accurate record linkage is critical to creating high-quality data sets for analysis. However, outside of a few small centers for record linkage research, linking data sets historically relied on artisan approaches, particularly for parsing and cleaning data sets. As the creation and use of big data increases, so does the need for systematic record linkage. The history of record linkage is long by computer science standards, but new data challenges encourage the development of new approaches like machine learning methods, clustering algorithms, and privacy-preserving record linkage. Record linkage stands on the boundary between statistics, information technology, and privacy. We are confident that there will continue to be exciting developments in this field in the years to come.
3.9 Resources Out of many excellent resources on the subject, we note the following: • We strongly recommend Christen’s book [76]. • There is a wealth of information available on the ADRN website [103]. • Winkler has a series of high-quality survey articles [407]. • The German Record Linkage Center is a resource for research, software, and ongoing conference activities [331]. Chapter 4 Databases Ian Foster and Pascal Heus Once the data have been collected and linked into different files, it is necessary to store and organize them. Social scientists are used to working with one analytical file, often in SAS, Stata, SPSS, or R. This chapter, which may be the most important chapter in the book, describes different approaches to storing data in ways that permit rapid and reliable exploration and analysis. 4.1 Introduction We turn now to the question of how to store, organize, and manage the data used in data-intensive social science. As the data with which you work grow in volume and diversity, effective data management becomes increasingly important if you are to avoid issues of scale and complexity from overwhelming your research processes. In particular, when you deal with data that get frequently updated, with changes made by different people, you will frequently want to use database management systems (DBMSs) instead of maintaining data in single files or within siloed statistical packages such as SAS, SPSS, Stata, and R. Indeed, we go so far as to say: if you take away just one thing from this book, it should be this: Use a database! As we explain in this chapter, DBMSs provide an environment that greatly simplifies data management and manipulation. They require a little bit of effort to set up, but are worth it. They permit large amounts of data to be organized in multiple ways that allow for efficient and rapid exploration via powerful declarative query languages; durable and reliable storage, via transactional features that maintain data consistency; scaling to large data sizes; and intuitive analysis, both within the DBMS itself and via bridges to other data analysis packages and tools when specialized analyses are required. DBMSs have become a critical component of a great variety 93 94 4. Databases of applications, from handling transactions in financial systems to delivering data as a service to power websites, dashboards, and applications. If you are using a production-level enterprise system, chances are there is a database in the back end. They are multipurpose and well suited for organizing social science data and for supporting analytics for data exploration. DBMSs make many easy things trivial, and many hard things easy. They are easy to use but can appear daunting to those unfamiliar with their concepts and workings. A basic understanding of databases and of when and how to use DBMSs is an important element of the social data scientist’s knowledge base. We therefore provide in this chapter an introduction to databases and how to use them. We describe different types of databases and their various features, and how different types can be applied in different contexts. We describe basic features like how to get started, set up a database schema, ingest data, query data within a database, and get results out. We also discuss how to link from databases to other tools, such as Python, R, and Stata (if you really have to). Chapter 5 describes how to apply parallel computing methods when needed. 4.2 DBMS: When and why Consider the following three data sets: 1. 
10,000 records describing research grants, each specifying the principal investigator, institution, research area, proposal title, award date, and funding amount in comma-separated-value (CSV) format.
2. 10 million records in a variety of formats from funding agencies, web APIs, and institutional sources describing people, grants, funding agencies, and patents.
3. 10 billion Twitter messages and associated metadata—around 10 terabytes (10¹³ bytes) in total, and increasing at a terabyte a month.

Which tools should you use to manage and analyze these data sets? The answer depends on the specifics of the data, the analyses that you want to perform, and the life cycle within which data and analyses are embedded. Table 4.1 summarizes relevant factors, which we now discuss.

Table 4.1. When to use different data management and analysis technologies
• Text files, spreadsheets, and scripting language: your data are small; your analysis is simple; you do not expect to repeat analyses over time.
• Statistical packages: your data are modest in size; your analysis maps well to your chosen statistical package.
• Relational database: your data are structured; your data are large; you will be analyzing changed versions of your data over time; you want to share your data and analyses with others.
• NoSQL database: your data are unstructured; your data are extremely large.

In the case of data set 1 (10,000 records describing research grants), it may be feasible to leave the data in their original file and use spreadsheets, pivot tables, or programs written in scripting languages such as Python or R to ask questions of those files. (A scripting language is a programming language used to automate tasks that could otherwise be performed one by one by the user.) For example, someone familiar with such languages can quickly create a script to extract from data set 1 all grants awarded to one investigator, compute average grant size, and count grants made each year in different areas.

However, this approach also has disadvantages. Scripts do not provide inherent control over the file structure. This means that if you obtain new data in a different format, your scripts need to be updated; you cannot just run them over the newly acquired file. Scripts can also easily become unreasonably slow as data volumes grow. A Python or R script will not take long to search a list of 1,000 grants to find those that pertain to a particular institution. But what if you have information about 1 million grants, and for each grant you want to search a list of 100,000 investigators, and for each investigator, you want to search a list of 10 million papers to see whether that investigator is listed as an author of each paper? You now have 1,000,000 × 100,000 × 10,000,000 = 10¹⁸ comparisons to perform, and your simple script may run for hours or even days. You can speed up the search process by constructing indices, so that, for example, when given a grant, you can find the associated investigators in constant time rather than in time proportional to the number of investigators. However, the construction of such indices is itself a time-consuming and error-prone process.
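For concreteness, the kind of quick script described above for data set 1 might look like the following sketch (the file name, column layout, and date format are hypothetical):

```python
import csv
from collections import Counter
from statistics import mean

# Hypothetical CSV layout for data set 1:
# pi, institution, area, title, award_date, amount (award_date in ISO form, e.g. 2015-06-30)
with open("grants.csv", newline="") as f:
    grants = list(csv.DictReader(f))

one_pi = [g for g in grants if g["pi"] == "Steven Weinberg"]      # grants for one investigator
avg_award = mean(float(g["amount"]) for g in grants)              # average grant size
awards_per_year = Counter(g["award_date"][:4] for g in grants)    # grants made each year

print(len(one_pi), avg_award, awards_per_year.most_common(5))
```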
For these reasons, the use of scripting languages alone for data analysis is rarely to be recommended. This is not to say that all analysis computations can be performed in database systems. A programming language will also often be needed. But many data access and manipulation computations are best handled in a database.
Researchers in the social sciences frequently use statistical packages* such as R, SAS, SPSS, and Stata for data analysis.
⋆ A statistical package is a specialized computer program for analysis in statistics and economics.
Because these systems integrate some crude data management, statistical analysis, and graphics capabilities in a single package, a researcher can often carry out a data analysis project of modest size within the same environment. However, each of these systems has limitations that hinder its use for modern social science research, especially as data grow in size and complexity. Take Stata, for example. Stata always loads the entire data set into the computer's working memory, and thus you would have no problems loading data set 1. However, depending on your computer's memory, it could have problems dealing with data set 2 and certainly would not be able to handle data set 3. In addition, you would need to perform this data loading step each time you start working on the project, and your analyses would be limited to what Stata can do. SAS can deal with larger data sets, but is renowned for being hard to learn and use. Of course there are workarounds in statistical packages. For example, in Stata you can deal with larger file sizes by choosing to only load the variables or cases that you need for the analysis [211]. Likewise, you can deal with more complex data by creating a system of files that each can be linked as needed for a particular analysis through a common identifier variable.
◮ For example, the Panel Study of Income Dynamics [181] has a series of files that are related and can be combined through common identifier variables [182].
Those solutions essentially mimic core functions of a DBMS, and you would be well advised to set up such a system, especially if you find yourself in a situation where the data set is constantly updated through different users, if groups of users have different rights to use your data or should only have access to subsets of the data, and if the analysis takes place on a server that sends results to a client (browser). Statistical packages also have difficulty working with more than one data source at a time—something that DBMSs are designed to do well.
These considerations bring us to the topic of this chapter, namely database management systems. A DBMS* handles all of the issues listed above, and more.
⋆ A DBMS is a system that interacts with users, other applications, and the database itself to capture and analyze data.
As we will see below when we look at concrete examples, a DBMS allows the programmer to define a logical design that fits the structure of their data. The DBMS then implements a data model (more on this below) that allows these data to be stored, queried, and updated efficiently and reliably on disk, thus providing independence from underlying physical storage. It supports efficient access to data through query languages and automatic optimization of those queries to permit fast analysis. Importantly, it also supports concurrent access by multiple users, which is not an option for file-based data storage. It supports transactions, meaning that any update to a database is performed in its entirety or not at all, even in the face of computer failures or multiple concurrent updates.
And it reduces the time spent both by analysts, by making it easy to express complex analytical queries concisely, and on data administration, by providing simple and uniform data administration interfaces.
A database is a structured collection of data about entities and their relationships. It models real-world objects—both entities (e.g., grants, investigators, universities) and relationships (e.g., "Steven Weinberg" works at "University of Texas at Austin")—and captures structure in ways that allow these entities and relationships to be queried for analysis. A database management system is a software suite designed to safely store and efficiently manage databases, and to assist with the maintenance and discovery of the relationships that the database represents. In general, a DBMS encompasses three key components, as shown in Table 4.2: its data model (which defines how data are represented: see Box 4.1), its query language (which defines how the user interacts with the data), and support for transactions and crash recovery (to ensure reliable execution despite system failures).*
⋆ Some key DBMS features are often lacking in standard statistical packages: a standard query language (with commands that allow analyses or data manipulation on a subgroup of cases defined during the analysis, for example "group by ...," "order by ..."), keys (for speed improvement), and an explicit model of a relational data structure.

Box 4.1: Data model
A data model specifies the data elements associated with a problem domain, the properties of those data elements, and how those data elements relate to one another. In developing a data model, we commonly first identify the entities that are to be modeled and then define their properties and relationships. For example, when working on the science of science policy (see Figure 1.2), the entities include people, products, institutions, and funding, each of which has various properties (e.g., for a person, their name, address, employer); relationships include "is employed by" and "is funded by." This conceptual data model can then be translated into relational tables or some other database representation, as we describe next.

Table 4.2. Key components of a DBMS
• Data model. User-facing: for example, relational, semi-structured. Internal: mapping data to storage systems; creating and maintaining indices.
• Query language. User-facing: for example, SQL (for relational), XPath (for semi-structured). Internal: query optimization and evaluation; consistency.
• Transactions, crash recovery. User-facing: transactions. Internal: locking, concurrency control, recovery.
◮ Sometimes, as discussed in Chapter 3, the links are one to one and sometimes one to many.

Literally hundreds of different open source, commercial, and cloud-hosted DBMSs are available. However, you only need to understand a relatively small number of concepts and major database types to make sense of this diversity. Table 4.3 defines the major classes of DBMSs that we will consider in this book. We consider only a few of these in any detail.
Relational DBMSs are the most widely used and mature systems, and will be the optimal solution for many social science data analysis purposes. We describe relational DBMSs in detail below, but in brief, they allow for the efficient storage, organization, and analysis of large quantities of tabular data: data organized as tables, in which rows represent entities (e.g., research grants) and columns represent attributes of those entities (e.g., principal investigator, institution, funding level).
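As a preview of the schema definitions developed in Section 4.3.3, such a table of research grants might be declared as follows. This is an illustrative sketch only: it shows the investigator as a name, whereas the schema actually used in this chapter (Listing 4.1) replaces the name with a numeric identifier that links to a separate Investigators table.

-- Illustrative sketch only; see Listing 4.1 for the schema used in this chapter.
create table Grants (
    Number  int not null,      -- award identifier
    Person  varchar(100),      -- principal investigator (Listing 4.1 uses a numeric ID instead)
    Funding float,             -- award amount
    Program varchar(100),      -- program name
    primary key(Number)
);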
The associated Structured Query Language (SQL) can then be used to perform a wide range of analyses, which are executed with high efficiency due to sophisticated indexing and query planning techniques. While relational DBMSs have dominated the database world for decades, other database technologies have become popular for various classes of applications in recent years. As we will see, these alternative NoSQL DBMSs have typically been motivated by a desire to scale the quantities of data and/or number of users that can be supported and/or to deal with unstructured data that are not easily represented in tabular form. For example, a key–value store can organize large numbers of records, each of which associates an arbitrary key with an arbitrary value. These stores, and in particular variants called document stores that permit text search on the stored values, are widely used to organize and process the billions of records that can be obtained from web crawlers. We review below some of these alternatives and the factors that may motivate their use. 4.2. DBMS: When and why 99 Table 4.3. Types of databases: relational (first row) and various types of NoSQL (other rows) Type Relational database Key–value store Examples MySQL, PostgreSQL, Oracle, SQL Server, Teradata Dynamo, Redis Column store Cassandra, HBase Document store CouchDB, MongoDB Graph database Neo4j, InfiniteGraph Advantages Consistency (ACID) Disadvantages Fixed schema; typically harder to scale Uses Transactional systems: order processing, retail, hospitals, etc. Dynamic schema; easy scaling; high throughput Same as key–value; distributed; better compression at column level Index entire document (JSON) Not immediately consistent; no higher-level queries Not immediately consistent; using all columns is inefficient Not immediately consistent; no higher-level queries Difficult to do non-graph analysis Web applications Graph queries are fast Relational and NoSQL databases (and indeed other solutions, such as statistical packages) can also be used together. Consider, for example, Figure 4.1, which depicts data flows commonly encountered in large research projects. Diverse data are being collected from different sources: JSON documents from web APIs, web pages from web scraping, tabular data from various administrative databases, Twitter data, and newspaper articles. There may be hundreds or even thousands of data sets in total, some of which may be extremely large. We initially have no idea of what schema* to use for the different data sets, and indeed it may not be feasible to define a unified set of schema, so diverse are the data and so rapidly are new data sets being acquired. Furthermore, the way we organize the data may vary according to our intended purpose. Are we interested in geographic, temporal, or thematic relationships among different entities? Each type of analysis may require a different organization. For these reasons, a common storage solution is to first load all data into a large NoSQL database. This approach makes all data available via a common (albeit limited) query interface. Researchers can then extract from this database the specific elements that are of interest for their work, loading those elements into a re- Large-scale analysis Web applications Recommendation systems, networks, routing ⋆ A schema defines the structure of a database in a formal language defined by the DBMS. See Section 4.3.3. 100 4. 
Databases PubMed abstracts Extracted research area Researcher web pages Domain-specific relational database Patent abstracts Extracted collaboration network Twitter messages Big NoSQL database Graph database Figure 4.1. A research project may use a NoSQL database to accumulate large amounts of data from many different sources, and then extract selected subsets to a relational or other database for more structured processing lational DBMS, another specialized DBMS (e.g., a graph database), or a statistical package for more detailed analysis. As part of the process of loading data from the NoSQL database into a relational database, the researcher will necessarily define schemas, relationships between entities, and so forth. Analysis results can be stored in a relational database or back into the NoSQL store. 4.3 Relational DBMSs We now provide a more detailed description of relational DBMSs. Relational DBMSs implement the relational data model, in which data are represented as sets of records organized in tables. This model is particularly well suited for the structured, regular data with which we frequently deal in the social sciences; we discuss in Section 4.5 alternative data models, such as those used in NoSQL databases. We use the data shown in Figure 4.2 to introduce key concepts. These two CSV format files describe grants made by the US National Science Foundation (NSF). One file contains information about grants, the other information about investigators. How should you proceed to manipulate and analyze these data? The main concept underlying the relational data model is a table (also referred to as a relation): a set of rows (also referred to as tuples, records, or observations), each with the same columns (also referred to as fields, attributes or variables). A database consists of multiple tables. For example, we show in Figure 4.3 how the data 4.3. Relational DBMSs 101 The file grants.csv # Identifier,Person,Funding,Program 1316033,Steven Weinberg,666000,Elem. Particle Physics/Theory 1336199,Howard Weinberg,323194,ENVIRONMENTAL ENGINEERING 1500194,Irving Weinberg,200000,Accelerating Innovation Rsrch 1211853,Irving Weinberg,261437,GALACTIC ASTRONOMY PROGRAM The file investigators.csv # Name,Institution,Email Steven Weinberg,University of Texas at Austin,weinberg@utexas.edu Howard Weinberg,University of North Carolina Chapel Hill, Irving Weinberg,University of Maryland College Park,irving@ucmc.edu Figure 4.2. CSV files representing grants and investigators. Each line in the first table specifies a grant number, investigator name, total funding amount, and NSF program name; each line in the second gives an investigator name, institution name, and investigator email address contained in the two CSV files of Figure 4.2 may be represented as two tables. The Grants table contains one tuple for each row in grants.csv, with columns GrantID, Person, Funding, and Program. The Investigators table contains one tuple for each row in investigators.csv, with columns ID, Name, Institution, and Email. The CSV files and tables contain essentially the same information, albeit with important differences (the addition of an ID field in the Investigators table, the substitution of an ID column for the Person column in the Grants table) that we will explain below. The use of the relational data model provides for physical independence: a given table can be stored in many different ways. SQL queries are written in terms of the logical representation of tables (i.e., their schema definition). 
Consequently, even if the physical organization of the data changes (e.g., a different layout is used to store the data on disk, or a new index is created to speed up access for some queries), the queries need not change. Another advantage of the relational data model is that, since a table is a set, in a mathematical sense, simple and intuitive set operations (e.g., union, intersection) can be used to manipulate the data, as we discuss below. We can easily, for example, determine the intersection of two relations (e.g., grants that are awarded to a specific institution), as we describe in the following. The database further ensures that the data comply with the model (e.g., data types, key uniqueness, entity relationships), essentially providing core quality assurance.

The Grants table:
Number  | Person | Funding | Program
1316033 | 1      | 666,000 | Elem. Particle Physics/Theory
1336199 | 2      | 323,194 | ENVIRONMENTAL ENGINEERING
1500194 | 3      | 200,000 | Accelerating Innovation Rsrch
1211853 | 3      | 261,437 | GALACTIC ASTRONOMY PROGRAM

The Investigators table:
ID | Name            | Institution                              | Email
1  | Steven Weinberg | University of Texas at Austin            | weinberg@utexas.edu
2  | Howard Weinberg | University of North Carolina Chapel Hill |
3  | Irving Weinberg | University of Maryland College Park      | irving@ucmc.edu

Figure 4.3. Relational tables Grants and Investigators corresponding to the grants.csv and investigators.csv data in Figure 4.2, respectively. The only differences are the representation in a tabular form, the introduction of a unique numerical investigator identifier (ID) in the Investigators table, and the substitution of that identifier for the investigator name in the Grants table.

4.3.1 Structured Query Language (SQL)
We use query languages to manipulate data in a database (e.g., to add, update, or delete data elements) and to retrieve (raw and aggregated) data from a database (e.g., data elements that have certain properties). Relational DBMSs support SQL, a simple, powerful query language with a strong formal foundation based on logic, a foundation that allows relational DBMSs to perform a wide variety of sophisticated optimizations. SQL is used for three main purposes:
• Data definition: e.g., creation of new tables,
• Data manipulation: queries and updates,
• Control: creation of assertions to protect data integrity.
We introduce each of these features in the following, although not in that order, and certainly not completely. Our goal here is to give enough information to provide the reader with insights into how relational databases work and what they do well; an in-depth SQL tutorial is beyond the scope of this book but is something we highly recommend readers seek elsewhere.

4.3.2 Manipulating and querying data
SQL and other query languages used in DBMSs support the concise, declarative specification of complex queries. Because we are eager to show you something immediately useful, we cover these features first, before talking about how to define data models.

Example: Identifying grants of at most $200,000
Here is an SQL query to identify all grants with total funding of at most $200,000:

select * from Grants where Funding <= 200000;

(Here and elsewhere in this chapter, we show SQL keywords in blue.) Notice SQL's declarative nature: this query can be read almost as the English language statement, "select all rows from the Grants table for which the Funding column has value less than or equal to 200,000." This query is evaluated as follows:
1. The input table specified by the from clause, Grants, is selected.
2.
The condition in the where clause, Funding <= 200,000, is checked against all rows in the input table to identify those rows that match. 3. The select clause specifies which columns to keep from the matching rows, that is, which columns make the schema of the output table. (The “*” indicates that all columns should be kept.) The answer, given the data in Figure 4.3, is the following single-row table. (The fact that an SQL query returns a table is important when it comes to creating more complex queries: the result of a query can be stored into the database as a new table, or passed to another query as input.) Number 1500194 Person 3 Funding 200,000 Program Accelerating Innovation Rsrch DBMSs automatically optimize declarative queries such as the example that we just presented, translating them into a set of lowlevel data manipulations (an imperative query plan) that can be evaluated efficiently. This feature allows users to write queries without having to worry too much about performance issues—the database does the worrying for you. For example, a DBMS need not consider every row in the Grants table in order to identify those with funding less than $200,000, a strategy that would be slow if the Grants table were large: it can instead use an index to retrieve the relevant records much more quickly. We discuss indices in more detail in Section 4.3.6. The querying component of SQL supports a wide variety of manipulations on tables, whether referred to explicitly by a table name (as in the example just shown) or constructed by another query. We just saw how to use the select operator to both pick certain rows (what is termed selection) and certain columns (what is called projection) from a table. 104 4. Databases Example: Finding grants awarded to an investigator ⋆ In statistical packages, the term merge or append is often used when data sets are combined. We want to find all grants awarded to the investigator with name “Irving Weinberg.” The information required to answer this question is distributed over two tables, Grants and Investigators, and so we join* the two tables to combine tuples from both: select Number, Name, Funding, Program from Grants, Investigators where Grants.Person = Investigators.ID and Name = "Irving Weinberg"; This query combines tuples from the Grants and Investigators tables for which the Person and ID fields match. It is evaluated in a similar fashion to the query presented above, except for the from clause: when multiple tables are listed, as here, the conditions in the where clause are checked for all different combinations of tuples from the tables defined in the from clause (i.e., the cartesian product of these tables)—in this case, a total of 3 × 4 = 12 combinations. We thus determine that Irving Weinberg has two grants. The query further selects the Number, Name, Funding, and Program fields from the result, giving the following: Number 1500194 1211853 Name Irving Weinberg Irving Weinberg Funding 200,000 261,437 Program Accelerating Innovation Rsrch GALACTIC ASTRONOMY PROGRAM This ability to join two tables in a query is one example of how SQL permits concise specifications of complex computations. This joining of tables via a cartesian product operation is formally called a cross join. Other types of join are also supported. We describe one such, the inner join, in Section 4.6. SQL aggregate functions allow for the computation of aggregate statistics over tables. 
For example, we can use the following query to determine the total number of grants and their total and average funding levels:

select count(*) as 'Number', sum(Funding) as 'Total', avg(Funding) as 'Average' from Grants;

This yields the following:

Number | Total   | Average
4      | 1450631 | 362658

The group by operator can be used in conjunction with the aggregate functions to group the result set by one or more columns. For example, we can use the following query to create a table with three columns: investigator name, the number of grants associated with the investigator, and the aggregate funding:

select Name, count(*) as 'Number', avg(Funding) as 'Average funding'
from Grants, Investigators
where Grants.Person = Investigators.ID
group by Name;

We obtain the following:

Name            | Number | Average funding
Steven Weinberg | 1      | 666000
Howard Weinberg | 1      | 323194
Irving Weinberg | 2      | 230719

4.3.3 Schema design and definition
We have seen that a relational database comprises a set of tables. The task of specifying the structure of the data to be stored in a database is called logical design. This task may be performed by a database administrator, in the case of a database to be shared by many people, or directly by users, if they are creating databases themselves. More specifically, the logical design process involves defining a schema. A schema comprises a set of tables (including, for each table, its columns and their types), their relationships, and integrity constraints.
The first step in the logical design process is to identify the entities that need to be modeled. In our example, we identified two important classes of entity: "grants" and "investigators." We thus define a table for each; each row in these two tables will correspond to a unique grant or investigator, respectively. (In a more complete and realistic design, we would likely also identify other entities, such as institutions and research products.) During this step, we will often find ourselves breaking information up into multiple tables, so as to avoid duplicating information. For example, imagine that we were provided grant information in the form of one CSV file rather than two, with each line providing a grant number, investigator, funding, program, institution, and email. In this file, the name, institution, and email address for Irving Weinberg would then appear twice, as he has two grants, which can lead to errors when updating values and make it difficult to represent certain information. (For example, if we want to add an investigator who does not yet have a grant, we will need to create a tuple (row) with empty slots for all columns (variables) associated with grants.) Thus we would want to break up the single big table into the two tables that we defined here. This breaking up of information across different tables to avoid repetition of information is referred to as normalization.*
⋆ Normalization involves organizing columns and tables of a relational database to minimize data redundancy.
◮ Normalization can be done in statistical packages as well. For example, as noted above, PSID splits its data into different files linked through ID variables. The difference here is that the DBMS makes creating, navigating, and querying the resulting data particularly easy.
The second step in the design process is to define the columns that are to be associated with each entity. For each table, we define a set of columns.
For example, given the data in Figure 4.2, those columns will likely include, for a grant, an award identifier, title, investigator, and award amount; for an investigator, a name, university, and email address. In general, we will want to ensure that each row in our table has a key: a set of columns that uniquely identifies that row. In our example tables, grants are uniquely identified by Number and investigators by ID.
The third step in the design process is to capture relationships between entities. In our example, we are concerned with just one relationship, namely that between grants and investigators: each grant has an investigator. We represent this relationship between tables by introducing a Person column in the Grants table, as shown in Figure 4.3. Note that we do not simply duplicate the investigator names in the two tables, as was the case in the two CSV files shown in Figure 4.2: these names might not be unique, and the duplication of data across tables can lead to later inconsistencies if a name is updated in one table but not the other.
The final step in the design process is to represent integrity constraints (or rules) that must hold for the data. In our example, we may want to specify that each grant must be awarded to an investigator; that each value of the grant identifier column must be unique (i.e., there cannot be two grants with the same number); and that total funding can never be negative. Such restrictions can be achieved by specifying appropriate constraints at the time of schema creation, as we show in Listing 4.1, which contains the code used to create the two tables that make up our schema.

1  create database grantdata;
2  use grantdata;
3
4  create table Investigators (
5      ID int auto_increment,
6      Name varchar(100) not null,
7      Institution varchar(256) not null,
8      Email varchar(100),
9      primary key(ID)
10 );
11
12 create table Grants (
13     Number int not null,
14     Person int not null,
15     Funding float unsigned not null,
16     Program varchar(100),
17     primary key(Number)
18 );

Listing 4.1. Code to create the grantdata database and its Investigators and Grants tables

Listing 4.1 contains four SQL statements. The first two statements, lines 1 and 2, simply set up our new database. The create table statement in lines 4–10 creates our first table. It specifies the table name (Investigators) and, for each of the four columns, the column name and its type.*
⋆ These storage types will be familiar to many of you from statistical software packages.
Relational DBMSs offer a rich set of types to choose from when designing a schema: for example, int or integer (synonyms); real or float (synonyms); char(n), a fixed-length string of n characters; and varchar(n), a variable-length string of up to n characters. Types are important for several reasons. First, they allow for more efficient encoding of data. For example, the Funding field in the grants.csv file of Figure 4.2 could be represented as a string in the Grants table, char(15), say, to allow for large grants. By representing it as a floating point number instead (line 15 in Listing 4.1), we reduce the space requirement per grant to just four bytes. Second, types allow for integrity checks on data as they are added to the database: for example, that same type declaration for Funding ensures that only valid numbers will be entered into the database. Third, types allow for type-specific operations on data, such as arithmetic operations on numbers (e.g., min, max, sum). Other SQL features allow for the specification of additional constraints on the values that can be placed in the corresponding column.
For example, the not null constraints for Name and Institution (lines 6, 7) indicate that each investigator must have a name and an institution, respectively. (The lack of such a constraint on the Email column shows that an investigator need not have an email address.)

4.3.4 Loading data
So far we have created a database and two tables. To complete our simple SQL program, we show in Listing 4.2 the two statements that load the data of Figure 4.2 into our two tables. (Here and elsewhere in this chapter, we use the MySQL DBMS. The SQL syntax used by different DBMSs differs in various, mostly minor ways.)

1  load data local infile "investigators.csv"
2      into table Investigators
3      fields terminated by ","
4      ignore 1 lines
5      (Name, Institution, Email);
6
7  load data local infile "grants.csv"
8      into table Grants
9      fields terminated by "," ignore 1 lines
10     (Number, @var, Funding, Program)
11     set Person = (select ID from Investigators
12                   where Investigators.Name=@var);

Listing 4.2. Code to load data into the Investigators and Grants tables

Each statement specifies the name of the file from which data is to be read and the table into which it is to be loaded. The fields terminated by "," clause tells SQL that values are separated by commas, and ignore 1 lines tells SQL to skip the header. The list of column names is used to specify how values from the file are to be assigned to columns in the table. For the Investigators table, the three values in each row of the investigators.csv file are assigned to the Name, Institution, and Email columns of the corresponding database row. Importantly, the auto_increment declaration on the ID column (line 5 in Listing 4.1) causes values for this column to be assigned automatically by the DBMS, as rows are created, starting at 1. This feature allows us to assign a unique integer identifier to each investigator as its data are loaded.
For the Grants table, the load data call (lines 7–12) is somewhat more complex. Rather than loading the investigator name (the second column of each line in our data file, represented here by the variable @var) directly into the database, we use an SQL query (the select statement in lines 11–12) to retrieve from the Investigators table the ID corresponding to that name. By thus replacing the investigator name with the unique investigator identifier, we avoid replicating the name across the two tables.
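Rows can also be added individually with insert statements rather than bulk-loaded. The following sketch (the investigator names and addresses are invented for illustration) shows a successful insert and one that the DBMS rejects because it violates the not null constraint on Institution declared in Listing 4.1, assuming MySQL's default strict mode.

-- Succeeds; ID is assigned automatically by auto_increment.
insert into Investigators (Name, Institution, Email)
    values ('Jane Doe', 'University of Chicago', 'jdoe@uchicago.edu');

-- Rejected under strict mode: Institution is declared not null in Listing 4.1.
insert into Investigators (Name, Email)
    values ('John Roe', 'jroe@example.edu');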
The update statement specifies the table to be updated and the operation to be performed, which in this case is to update the Funding column of each row. The DBMS will ensure that either no rows are altered or all are altered. update Grants set Grants.Funding = Grants.Funding*0.9; Transactions are also key to supporting multi-user access. The concurrency control mechanisms in a DBMS allow multiple users to operate on a database concurrently, as if they were the only users of the system: transactions from multiple users can be interleaved to ensure fast response times, while the DBMS ensures that the database remains consistent. While entire books could be (and have been) written on concurrency in databases, the key point is that read operations can proceed concurrently, while update operations are typically serialized. 4.3.6 Database optimizations A relational DBMS applies query planning and optimization methods with the goal of evaluating queries as efficiently as possible. For example, if a query asks for rows that fit two conditions, one cheap to evaluate and one expensive, a relational DBMS may filter first on the basis of the first condition, and then apply the second conditions only to the rows identified by that first filter. These sorts of optimization are what distinguish SQL from other programming languages, as they allow the user to write queries declaratively and rely on the DBMS to come up with an efficient execution strategy. Nevertheless, the user can help the DBMS to improve performance. The single most powerful performance improvement tool is the index, an internal data structure that the DBMS maintains to speed up queries. While various types of indices can be created, with different characteristics, the basic idea is simple. Consider the 109 110 4. Databases column ID in our Investigators table. Assume that there are N rows in the table. In the absence of an index, a query that refers to a column value (e.g., where ID=3) would require a linear scan of the table, taking on average N/2 comparisons and in the worst case N comparisons. A binary tree index allows the desired value to be found with just log2 N comparisons. Example: Using indices to improve database performance Consider the following query: select ID, Name, sum(Funding) as TotalFunding from Grants, Investigators where Investigators.ID=Grants.Person group by ID; This query joins our two tables to link investigators with the grants that they hold, groups grants by investigator (using group by), and finally sums the funding associated with the grants held by each investigator. The result is the following: ID 1 2 3 Name Steven Weinberg Howard Weinberg Irving Weinberg TotalFunding 666000 323194 461437 In the absence of indices, the DBMS must compare each row in Investigators with each row in Grants, checking for each pair whether Investigators.ID = Grants.Person holds. As the two tables in our sample database have only three and four rows, respectively, the total number of comparisons is only 3 × 4 = 12. But if we had, say, 1 million investigators and 1 million grants, then the DBMS would have to perform 1 trillion comparisons, which would take a long time. (More importantly in many cases, it would have to perform a large number of disk I/O operations if the tables did not fit in memory.) 
An index on the ID column of the Investigators table reduces the number of operations dramatically, as the DBMS can then take each of the 1 million rows in the Grants table and, for each row, identify the matching row(s) in Investigators via an index lookup rather than a linear scan. In our example table, the ID column has been specified to be a primary key, and thus an index is created for it automatically. If it were not, we could easily create the desired index as follows: alter table Investigators add index(ID); It can be difficult for the user to determine when an index is required. A good rule of thumb is to create an index for any column that is queried often, that is, appears on the right-hand side of a where statement. However, the presence of indices makes updates more expensive, as every change to a column value requires that the index be rebuilt to reflect the change. Thus, if your data are 4.3. Relational DBMSs highly dynamic, you should carefully select which indices to create. (For bulk load operations, a common practice is to drop indices prior to the data import, and re-create them once the load is completed.) Also, indices take disk space, so you need to consider the tradeoff between query efficiency and resources. The explain command can be useful for determining when indices are required. For example, we show in the following some of the output produced when we apply explain to our query. (For this example, we have expanded the two tables to 1,000 rows each, as our original tables are too small for MySQL to consider the use of indices.) The output provides useful information such as the key(s) that could be used, if indices exist (Person in the Grants table, and the primary key, ID, for the Investigators table); the key(s) that are actually used (the primary key, ID, in the Investigators table); the column(s) that are compared to the index (Investigators.ID is compared with Grants.Person); and the number of rows that must be considered (each of the 1,000 rows in Grants is compared with one row in Investigators, for a total of 1,000 comparisons). mysql> explain select ID, Name, sum(Funding) as TotalFunding from Grants, Investigators where Investigators.ID=Grants.Person group by ID; +---------------+---------------+---------+---------------+------+ | table | possible_keys | key | ref | rows | +---------------+---------------+---------+---------------+------+ | Grants | Person | NULL | NULL | 1000 | | Investigators | PRIMARY | PRIMARY | Grants.Person | 1 | +---------------+---------------+---------+---------------+------+ Contrast this output with the output obtained for equivalent tables in which ID is not a primary key. In this case, no keys are used and thus 1,000 × 1,000 = 1,000,000 comparisons and the associated disk reads must be performed. +---------------+---------------+------+------+------+ | table | possible_keys | key | ref | rows | +---------------+---------------+------+------+------+ | Grants | Person | NULL | NULL | 1000 | | Investigators | ID | NULL | NULL | 1000 | +---------------+---------------+------+------+------+ A second way in which the user can contribute to performance improvement is by using appropriate table definitions and data types. Most DBMSs store data on disk. Data must be read from disk into memory before it can be manipulated. Memory accesses are fast, but loading data into memory is expensive: accesses to main memory can be a million times faster than accesses to disk. 
Therefore, to ensure queries are efficient, it is important to minimize the number of disk accesses. A relational DBMS automatically optimizes queries: based on how the data are stored, it transforms a SQL query into a query plan that can be executed efficiently, 111 112 4. Databases and chooses an execution strategy that minimizes disk accesses. But users can contribute to making queries efficient. As discussed above, the choice of types made when defining schemas can make a big difference. As a rule of thumb, only use as much space as needed for your data: the smaller your records, the more records can be transferred to main memory using a single disk access. The design of relational tables is also important. If you put all columns in a single table (do not normalize), more data will come into memory than is required. 4.3.7 Caveats and challenges It is important to keep the following caveats and challenges in mind when using SQL technology with social science data. Data cleaning Data created outside an SQL database, such as data in files, are not always subject to strict constraints: data types may not be correct or consistent (e.g., numeric data stored as text) and consistency or integrity may not be enforced (e.g., absence of primary keys, missing foreign keys). Indeed, as the reader probably knows well from experience, data are rarely perfect. As a result, the data may fail to comply with strict SQL schema requirements and fail to load, in which case either data must be cleaned before or during loading, or the SQL schema must be relaxed. Missing values Care must be taken when loading data in which some values may be missing or blank. SQL engines represent and refer to a missing or blank value as the built-in constant null. Counterintuitively, when loading data from text files (e.g., CSV), many SQL engines require that missing values be represented explicitly by the term null; if a data value is simply omitted, it may fail to load or be incorrectly represented, for example as zero or the empty string (" ") instead of null. Thus, for example, the second row in the investigators.csv file of Figure 4.2: Howard Weinberg,University of North Carolina Chapel Hill, may need to be rewritten as: Howard Weinberg,University of North Carolina Chapel Hill,null 4.4. Linking DBMSs and other tools Metadata for categorical variables SQL engines are metadata poor: they do not allow extra information to be stored about a variable (field) beyond its base name and type (int, char, etc., as introduced in Section 4.3.3). They cannot, for example, record directly the fact that the column class can only take one of three values, animal, vegetable, or mineral, or what these values mean. Common practice is thus to store information about possible values in another table (commonly referred to as a dimension table) that can be used as a lookup and constraint, as in the following: Table class_values Description animal Is alive vegetable Grows mineral Isn’t alive and doesn’t grow Value A related concept is that a column or list of columns may be declared primary key or unique. Either says that no two tuples of the table may agree in all the column(s) on the list. There can be only one primary key for a table, but several unique columns. No column of a primary key can ever be null in any tuple. But columns declared unique may have nulls, and there may be several tuples with null. 
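As a sketch of the dimension-table practice just described, the lookup table can be declared with Value as its primary key and referenced from the data table through a foreign key constraint, so that the DBMS itself rejects values outside the allowed set. The specimens table and its columns below are invented for illustration.

-- Lookup (dimension) table constraining the allowed values of "class".
create table class_values (
    Value       varchar(20) primary key,
    Description varchar(100)
);

insert into class_values values
    ('animal',    'Is alive'),
    ('vegetable', 'Grows'),
    ('mineral',   'Is not alive and does not grow');

-- A data table whose Class column must match an entry in class_values
-- (enforced, for example, by MySQL's InnoDB storage engine).
create table specimens (
    SpecimenID int primary key,
    Class      varchar(20),
    foreign key (Class) references class_values(Value)
);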
4.4 Linking DBMSs and other tools
Query languages such as SQL are not general-purpose programming languages; they support easy, efficient access to large data sets, but are not intended to be used for complex calculations. When complex computations are required, one can embed query language statements into a programming language or statistical package. For example, we might want to calculate the interquartile range of funding for all grants. While this calculation can be accomplished in SQL, the resulting SQL code will be complicated. Languages like Python make such statistical calculations straightforward, so it is natural to write a Python (or R, SAS, Stata, etc.) program that connects to the DBMS that contains our data, fetches the required data from the DBMS, and then calculates the interquartile range of those data. The program can then, if desired, store the result of this calculation back into the database.
Many relational DBMSs also have built-in analytical functions or often now embed the R engine, providing significant in-database statistical and analytical capabilities and alleviating the need for external processing.

Example: Embedding database queries in Python
The Python script in Listing 4.3 shows how this embedding of database queries in Python is done. This script establishes a connection to the database (lines 7–9), transmits the desired SQL query to the database (line 11), retrieves the query results into a Python array (line 13), and calls a Python procedure (not given) to perform the desired computation (line 14). A similar program could be used to load the results of a Python (or R, SAS, Stata, etc.) computation into a database.

1  from mysql.connector import MySQLConnection, Error
2  from python_mysql_dbconfig import read_db_config
3
4  def retrieve_and_analyze_data():
5      try:
6          # Open connection to the MySQL database
7          dbconfig = read_db_config()
8          conn = MySQLConnection(**dbconfig)
9          cursor = conn.cursor()
10         # Transmit the SQL query to the database
11         cursor.execute('select Funding from Grants;')
12         # Fetch all rows of the query response
13         rows = [row for row in cursor.fetchall()]
14         calculate_inter_quartile_range(rows)
15     except Error as e:
16         print(e)
17     finally:
18         cursor.close()
19         conn.close()
20
21 if __name__ == '__main__':
22     retrieve_and_analyze_data()

Listing 4.3. Embedding SQL in Python

Example: Loading other structured data
We saw in Listing 4.2 how to load data from CSV files into SQL tables. Data in other formats, such as the commonly used JSON, can also be loaded into a relational DBMS. Consider, for example, the following JSON format data, a simplified version of data shown in Chapter 2.

[
    {
        "institute": "Janelia Campus",
        "name": "Laurence Abbott",
        "role": "Senior Fellow",
        "state": "VA",
        "town": "Ashburn"
    },
    {
        "institute": "Jackson Lab",
        "name": "Susan Ackerman",
        "role": "Investigator",
        "state": "ME",
        "town": "Bar Harbor"
    }
]

While some relational DBMSs provide built-in support for JSON objects, we assume here that we want to convert these data into normal SQL tables. Using one of the many utilities for converting JSON into CSV, we can construct the following CSV file, which we can load into an SQL table using the method shown earlier.

institute,name,role,state,town
Janelia Campus,Laurence Abbott,Senior Fellow,VA,Ashburn
Jackson Lab,Susan Ackerman,Investigator,ME,Bar Harbor

But into what table? The two records each combine information about a person with information about an institute. Following the schema design rules given in Section 4.3.3, we should normalize the data by reorganizing them into two tables, one describing people and one describing institutes.
Similar problems arise when JSON documents contain nested structures. For example, consider the following alternative JSON representation of the data above. Here, the need for normalization is yet more apparent.

[
    {
        "name": "Laurence Abbott",
        "role": "Senior Fellow",
        "employer": {
            "institute": "Janelia Campus",
            "state": "VA",
            "town": "Ashburn"
        }
    },
    {
        "name": "Susan Ackerman",
        "role": "Investigator",
        "employer": {
            "institute": "Jackson Lab",
            "state": "ME",
            "town": "Bar Harbor"
        }
    }
]
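As a sketch of the data preparation involved (the file names here are illustrative), the nested records above can be flattened into separate people and institutes tables with a few lines of Python, using only the standard json and csv modules, before being loaded as in Listing 4.2.

# Sketch: split nested JSON records into two CSV files, one per entity type.
import csv, json

with open("people.json") as f:
    records = json.load(f)

institutes = {}  # institute name -> (state, town)
with open("people.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["name", "role", "institute"])
    for r in records:
        emp = r["employer"]
        institutes[emp["institute"]] = (emp["state"], emp["town"])
        w.writerow([r["name"], r["role"], emp["institute"]])

with open("institutes.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["institute", "state", "town"])
    for name, (state, town) in institutes.items():
        w.writerow([name, state, town])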
Following the schema design rules given in Section 4.3.3, we should normalize the data by reorganizing them into two tables, one describing people and one describing institutes. Similar problems arise when JSON documents contain nested structures. For example, consider the following alternative JSON representation of the data above. Here, the need for normalization is yet more apparent. 1 [ { 2 name : Laurence Abbott, role : Senior Fellow, employer : { institute : Janelia Campus, state : VA, town : Ashburn} 3 4 5 6 7 }, { 8 9 name : Susan Ackerman, role : Investigator, employer: { institute : Jackson Lab, state : ME, town : Bar Harbor} 10 11 12 13 14 } 15 16 ] 115 116 4. Databases Thus, the loading of JSON data into a relational database usually requires both work on schema design (Section 4.3.3) and data preparation. 4.5 NoSQL databases While relational DBMSs have dominated the database world for several decades, other database technologies exist and indeed have become popular for various classes of applications in recent years. As we will see, these alternative technologies have typically been motivated by a desire to scale the quantities of data and/or number of users that can be supported, and/or to support specialized data types (e.g., unstructured data, graphs). Here we review some of these alternatives and the factors that may motivate their use. 4.5.1 Challenges of scale: The CAP theorem For many years, the big relational database vendors (Oracle, IBM, Sybase, and to a lesser extent Microsoft) have been the mainstay of how data were stored. During the Internet boom, startups looking for low-cost alternatives to commercial relational DBMSs turned to MySQL and PostgreSQL. However, these systems proved inadequate for big sites as they could not cope well with large traffic spikes, for example when many customers all suddenly wanted to order the same item. That is, they did not scale. An obvious solution to scaling databases is to partition and/or replicate data across multiple computers, for example by distributing different tables, or different rows from the same table, over multiple computers. However, partitioning and replication also introduce challenges, as we now explain. Let us first define some terms. In a system that comprises multiple computers: • Consistency indicates that all computers see the same data at the same time. • Availability indicates that every request receives a response about whether it succeeded or failed. • Partition tolerance indicates that the system continues to operate even if a network failure prevents computers from communicating. 4.5. NoSQL databases An important result in distributed systems (the so-called “CAP theorem” [51]) observes that it is not possible to create a distributed system with all three properties. This situation creates a challenge with large transactional data sets. Partitioning is needed in order to achieve high performance, but as the number of computers grows, so too does the likelihood of network disruption among pair(s) of computers. As strict consistency cannot be achieved at the same time as availability and partition tolerance, the DBMS designer must choose between high consistency and high availability for a particular system. The right combination of availability and consistency will depend on the needs of the service. 
For example, in an e-commerce setting, it makes sense to choose high availability for a checkout process, in order to ensure that requests to add items to a shopping cart (a revenue-producing process) can be honored. Errors can be hidden from the customer and sorted out later. However, for order submission—when a customer submits an order—it makes sense to favor consistency because several services (credit card processing, shipping and handling, reporting) need to access the data simultaneously. However, in almost all cases, availability is chosen over consistency. 4.5.2 NoSQL and key–value stores Relational DBMSs were traditionally motivated by the need for transaction processing and analysis, which led them to put a premium on consistency and availability. This led the designers of these systems to provide a set of properties summarized by the acronym ACID [137, 347]: • Atomic: All work in a transaction completes (i.e., is committed to stable storage) or none of it completes. • Consistent: A transaction transforms the database from one consistent state to another consistent state. • Isolated: The results of any changes made during a transaction are not visible until the transaction has committed. • Durable: The results of a committed transaction survive failures. The need to support extremely large quantities of data and numbers of concurrent clients has led to the development of a range of alternative database technologies that relax consistency and thus 117 118 4. Databases these ACID properties in order to increase scalability and/or availability. These systems are commonly referred to as NoSQL (for “not SQL”—or, more recently, “not only SQL,” to communicate that they may support SQL-like query languages) because they usually do not require a fixed table schema nor support joins and other SQL features. Such systems are sometimes referred to as BASE [127]: Basically Available (the system seems to work all the time), Soft state (it does not have to be consistent all the time), and Eventually consistent (it becomes consistent at some later time). The data systems used in essentially all large Internet companies (Google, Yahoo!, Facebook, Amazon, eBay) are BASE. Dozens of different NoSQL DBMSs exist, with widely varying characteristics as summarized in Table 4.3. The simplest are key– value stores such as Redis, Amazon Dynamo, Apache Cassandra, and Project Voldemort. We can think of a key–value store as a relational database with a single table that has just two columns, key and value, and that supports just two operations: store (or update) a key–value pair, and retrieve the value for a given key. Example: Representing investigator data in a NoSQL database We might represent the contents of the investigators.csv file of Figure 4.2 (in a NoSQL database) as follows. Key Investigator_StevenWeinberg_Institution Investigator_StevenWeinberg_Email Investigator_HowardWeinberg_Institution Investigator_IrvingWeinberg_Institution Investigator_IrvingWeinberg_Email Value University of Texas at Austin weinberg@utexas.edu University of North Carolina Chapel Hill University of Maryland College Park irving@ucmc.edu A client can then read and write the value associated with a given key by using operations such as the following: • Get(key) returns the value associated with key. • Put(key, value) associates the supplied value with key. • Delete(key) removes the entry for key from the data store. Key–value stores are thus particularly easy to use. 
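As a concrete sketch, these operations map directly onto the commands of a key–value store such as Redis. The snippet below uses the redis-py client and assumes a Redis server is reachable on localhost; the key and value are taken from the example above.

# Sketch of Get/Put/Delete against a key-value store, using the redis-py client.
import redis

store = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Put: associate a value with a key.
store.set("Investigator_StevenWeinberg_Institution",
          "University of Texas at Austin")

# Get: retrieve the value associated with a key.
print(store.get("Investigator_StevenWeinberg_Institution"))

# Delete: remove the entry for a key.
store.delete("Investigator_StevenWeinberg_Institution")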
Furthermore, because there is no schema, there are no constraints on what values can be associated with a key. This lack of constraints can be useful if we want to store arbitrary data. For example, it is trivial to add the following records to a key–value store; adding this information to a relational table would require schema modifications. Key Investigator_StevenWeinberg_FavoriteColor Investigator_StevenWeinberg_Awards Value Blue Nobel 4.5. NoSQL databases 119 Another advantage is that if a given key would have no value (e.g., Investigator_HowardWeinberg_Email), we need not create a record. Thus, a key–value store can achieve a more compact representation of sparse data, which would have many empty fields if expressed in relational form. A third advantage of the key–value approach is that a key–value store is easily partitioned and thus can scale to extremely large sizes. A key–value DBMS can partition the space of keys (e.g., via a hash on the key) across different computers for scalability. It can also replicate key–value pairs across multiple computers for availability. Adding, updating, or querying a key–value pair requires simply sending an appropriate message to the computer(s) that hold that pair. The key–value approach also has disadvantages. As we can see from the example, users must be careful in their choice of keys if they are to avoid name collisions. The lack of schema and constraints can also make it hard to detect erroneous keys and values. Key–value stores typically do not support join operations (e.g., “which investigators have the Nobel and live in Texas?”). Many key–value stores also relax consistency constraints and do not provide transactional semantics. 4.5.3 Other NoSQL databases The simple structure of key–value stores allows for extremely fast and scalable implementations. However, as we have seen, many interesting data cannot be easily modeled as key–value pairs. Such concerns have motivated the development of a variety of other NoSQL systems that offer, for example, richer data models: documentbased (CouchDB and MongoDB), graph-based (Neo4J), and columnbased (Cassandra, HBase) databases. In document-based databases, the value associated with a key can be a structured document: for example, a JSON document, permitting the following representation of our investigators.csv file plus the additional information that we just introduced. Key Investigator_StevenWeinberg Investigator_HowardWeinberg Investigator_IrvingWeinberg Value { institution : University of Texas at Austin, email : weinberg@utexas.edu, favcolor : Blue, award : Nobel } { institution : University of North Carolina Chapel Hill } { institution : University of Maryland College Park, email : irving@ucmc.edu } Associated query languages may permit queries within the document, such as regular expression searches, and retrieval of selected 120 4. Databases fields, providing a form of a relational DBMS’s selection and projection capabilities (Section 4.3.2). For example, MongoDB allows us to ask for documents in a collection called investigators that have “University of Texas at Austin” as their institution and the Nobel as an award. db.investigators.find( { institution: ’University of Texas at Austin’, award: ’Nobel’ } ) A column-oriented DBMS stores data tables by columns rather than by rows, as is common practice in relational DBMSs. This approach has advantages in settings where aggregates must frequently be computed over many similar data items: for example, in clinical data analysis. 
Google Cloud BigTable and Amazon RedShift are two cloud-hosted column-oriented NoSQL databases. HBase and Cassandra are two open source systems with similar characteristics. (Confusingly, the term column oriented is also often used to refer to SQL database engines that store data in columns instead of rows: for example, Google BigQuery, HP Vertica, Terradata, and the open source MonetDB. Such systems are not to be confused with column-based NoSQL databases.) Graph databases store information about graph structures in terms of nodes, edges that connect nodes, and attributes of nodes and edges. Proponents argue that they permit particularly straightforward navigation of such graphs, as when answering queries such as “find all the friends of the friends of my friends”—a task that would require multiple joins in a relational database. 4.6 Spatial databases Social science research commonly involves spatial data. Socioeconomic data may be associated with census tracts, data about the distribution of research funding and associated jobs with cities and states, and crime reports with specific geographic locations. Furthermore, the quantity and diversity of such spatially resolved data are growing rapidly, as are the scale and sophistication of the systems that provide access to these data. For example, just one urban data store, Plenario, contains many hundreds of data sets about the city of Chicago [64]. Researchers who work with spatial data need methods for representing those data and then for performing various queries against them. Does crime correlate with weather? Does federal spending 4.6. Spatial databases on research spur innovation within the locales where research occurs? These and many other questions require the ability to quickly determine such things as which points exist within which regions, the areas of regions, and the distance between two points. Spatial databases address these and many other related requirements. Example: Spatial extensions to relational databases Spatial extensions have been developed for many relational databases: for example, Oracle Spatial, DB2 Spatial, and SQL Server Spatial. We use the PostGIS extensions to the PostgreSQL relational database here. These extensions implement support for spatial data types such as point, line, and polygon, and operations such as st_within (returns true if one object is contained within another), st_dwithin (returns true if two objects are within a specified distance of each other), and st_distance (returns the distance between two objects). Thus, for example, given two tables with rows for schools and hospitals in Illinois (illinois_schools and illinois_hospitals, respectively; in each case, the column the_geom is a polygon for the object in question) and a third table with a single row representing the city of Chicago (chicago_citylimits), we can easily find the names of all schools within the Chicago city limits: select illinois_schools.name from illinois_schools, chicago_citylimits where st_within(illinois_schools.the_geom, chicago_citylimits.the_geom); We join the two tables illinois_schools and chicago_citylimits, with the st_within constraint constraining the selected rows to those representing schools within the city limits. Here we use the inner join introduced in Section 4.3.2. 
This query could also be written as:

select illinois_schools.name
from illinois_schools
  left join chicago_citylimits
  on st_within(illinois_schools.the_geom, chicago_citylimits.the_geom);

We can also determine the names of all schools that do not have a hospital within 3,000 meters:

select s.name as "School Name"
from illinois_schools as s
  left join illinois_hospitals as h
  on st_dwithin(s.the_geom, h.the_geom, 3000)
where h.gid is null;

Here, we use an alternative form of the join operator, the left join—or, more precisely, the left excluding join. The expression

table1 left join table2 on constraint

returns all rows from the left table (table1) with the matching rows in the right table (table2), the result being null on the right side when there is no match. This selection is illustrated in the middle column of Table 4.4. The addition of where h.gid is null then selects only those rows in the left table with no right-hand match, as illustrated in the right-hand column of Table 4.4. Note also the use of the as operator to rename the tables illinois_schools and illinois_hospitals and the output column; in this case, we rename them simply to make our query more compact.

Table 4.4. Three types of join illustrated: the inner join (as used in Section 4.3.2), the left join, and the left excluding join

Inner join:
  select columns from Table_A A
  inner join Table_B B on A.Key = B.Key

Left join:
  select columns from Table_A A
  left join Table_B B on A.Key = B.Key

Left excluding join:
  select columns from Table_A A
  left join Table_B B on A.Key = B.Key
  where B.Key is null

4.7 Which database to use?

The question of which DBMS to use for a social science data management and analysis project depends on many factors. We introduced some relevant rules in Table 4.1. We expand on those considerations here.

4.7.1 Relational DBMSs

If your data are structured, then a relational DBMS is almost certainly the right technology to use. Many open source, commercial, and cloud-hosted relational DBMSs exist. Among the open source DBMSs, MySQL and PostgreSQL (often simply Postgres) are particularly widely used. MySQL is the most popular. It is particularly easy to install and use, but does not support all features of the SQL standard. PostgreSQL is fully standard compliant and supports useful features such as full text search and the PostGIS extensions mentioned in the previous section, but can be more complex to work with.

Popular commercial relational DBMSs include IBM DB2, Microsoft SQL Server, and Oracle RDBMS. These systems are heavily used in commercial settings. There are free community editions, and some large science projects use enterprise features via academic licensing: for example, the Sloan Digital Sky Survey uses Microsoft SQL Server [367] and the CERN high-energy physics lab uses Oracle [132].

We also see increasing use being made of cloud-hosted relational DBMSs such as Amazon Relational Database Service (RDS; this supports MySQL, PostgreSQL, and various commercial DBMSs), Microsoft Azure, and Google Cloud SQL. These systems obviate the need to install local software, administer a DBMS, or acquire hardware to run and scale your database. Particularly if your database is bigger than can fit on your workstation, a cloud-hosted solution can be a good choice.

4.7.2 NoSQL DBMSs

Few social science problems have the scale that might motivate the use of a NoSQL DBMS. Furthermore, while defining and enforcing a schema can involve some effort, the benefits of so doing are considerable.
Thus, the use of a relational DBMS is usually to be recommended. Nevertheless, as noted in Section 4.2, there are occasions when a NoSQL DBMS can be a highly effective, such as when working with large quantities of unstructured data. For example, researchers analyzing large collections of Twitter messages frequently store the messages in a NoSQL document-oriented database such as MongoDB. NoSQL databases are also often used to organize large numbers of records from many different sources, as illustrated in Figure 4.1. 4.8 Summary A key message of this book is that you should, whenever possible, use a database. Database management systems are one of the great achievements of information technology, permitting large amounts of data to be stored and organized so as to allow rapid and reliable exploration and analysis. They have become a central component 123 124 4. Databases of a great variety of applications, from handling transactions in financial systems to serving data published in websites. They are particularly well suited for organizing social science data and for supporting analytics for data exploration. DBMSs provide an environment that greatly simplifies data management and manipulation. They make many easy things trivial, and many hard things easy. They automate many other errorprone, manual tasks associated with query optimization. While they can by daunting to those unfamiliar with their concepts and workings, they are in fact easy to use. A basic understanding of databases and of when and how to use DBMSs is an important element of the social data scientist’s knowledge base. 4.9 Resources The enormous popularity of DBMSs means that there are many good books to be found. Classic textbooks such as those by Silberschatz et al. [347] and Ramakrishnan and Gherke [316] provide a great deal of technical detail. The DB Engines website collects information on DBMSs [351]. There are also many also useful online tutorials, and of course StackExchange and other online forums often have answers to your technical questions. Turning to specific technologies, the SQL Cookbook [260] provides a wonderful introduction to SQL. We also recommend the SQL Cheatsheet [22] and a useful visual depiction of different SQL join operators [259]. Two good books on the PostGIS geospatial extensions to the PostgreSQL database are the PostGIS Cookbook [85] and PostGIS in Action [285]. The online documentation is also excellent [306]. The monograph NoSQL Databases [364] provides much useful technical detail. We did not consider in this chapter the native extensible Markup Language (XML) and Resource Description Framework (RDF) triple stores, as these are not typically used for data management. However, they do play a fundamental role in metadata and knowledge management. See, for example, Sesame [53, 336] If you are interested in the state of database and data management research, the recent Beckman Report [2] provides a useful perspective. Chapter 5 Programming with Big Data Huy Vo and Claudio Silva Big data is sometimes defined as data that are too big to fit onto the analyst’s computer. This chapter provides an overview of clever programming techniques that facilitate the use of data (often using parallel computing). While the focus is on one of the most widely used big data programming paradigms and its most popular implementation, Apache Hadoop, the goal of the chapter is to provide a conceptual framework to the key challenges that the approach is designed to address. 
5.1 Introduction There are many definitions of big data, but perhaps the most popular one is from Laney’s report [229] in 2001 that defines big data as data with the three Vs: volume (large data sets), velocity (real-time streaming data), and variety (various data forms). Other authors have since proposed additional Vs for various purposes: veracity, value, variability, and visualization. It is common nowadays to see big data being associated with five or even seven Vs—the big data definition itself is also getting “big” and getting difficult to keep track. A simpler but also broader definition of big data is data sets that are “so large or complex that traditional data processing applications are inadequate,” as described by Wikipedia. We note that in this case, the definition adapts to the task at hand, and it is also tool dependent. For example, we could consider a problem to involve big data if a spreadsheet gets so large that Excel can no longer load the entire data set into memory for analysis—or if there are so many features in a data set that a machine learning classifier would take an unreasonable amount of time (say days) to finish instead of a few seconds 125 ◮ See Chapter 6, in particular Section 6.5.2. 126 5. Programming with Big Data or minutes. In such cases, the analyst has to develop customized solutions to interrogate the data, often by taking advantage of parallel computing. In particular, one may have to get a larger computer, with more memory and/or stronger processors, to cope with expensive computations; or get a cluster of machines to speed up the computing time by distributing the workload among them. While the former solution would not scale well due to the limited amount of processors a single computer can have, the latter solution needs to deal with nontraditional programming infrastructures. Big data technologies, in a nutshell, are the technologies that make these infrastructures more usable by users without a computer science background. Parallel computing and big data are hardly new ideas for dealing with computational challenges. Scientists have routinely been working on data sets much larger than a single machine can handle for several decades, especially at the DOE National Laboratories [87, 337] where high-performance computing has been a major technology trend. This is also demonstrated by the history of research in distributed computing and data management going back to the 1980s [401]. There are major technological differences between this type of big data technology and what is covered in this chapter. True parallel computing involves designing clever parallel algorithms, often from scratch, taking into account machine-dependent constraints to achieve maximum performance from the particular architecture. They are often implemented using message passing libraries, such as implementations of the Message Passing Interface (MPI) standard [142] (available since the mid-1990s). Often the expectation is that the code developed will be used for substantial computations. In that scenario it makes sense to optimize a code for maximum performance knowing the same code will be used repeatedly. Contrast that with big data for exploration, where we might need to keep changing the analysis code very often. In this case, what we are trying to optimize is the analyst’s time, not computing time. 
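To make the contrast concrete, here is a minimal sketch of the explicit message-passing style just described, written against the mpi4py bindings to the MPI standard (an assumption; any MPI implementation would look similar). Every detail of distributing the work and combining the pieces is the programmer's responsibility:

# Minimal MPI-style sketch (assumes mpi4py is installed and the script is
# launched with, e.g., `mpiexec -n 4 python partial_sums.py`).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's identifier, 0..size-1
size = comm.Get_size()   # total number of cooperating processes

# Each process computes an explicit share of the work ...
partial = sum(range(rank, 1_000_000, size))

# ... and the partial results are explicitly combined on process 0.
total = comm.reduce(partial, op=MPI.SUM, root=0)
if rank == 0:
    print("total:", total)

The frameworks described in the remainder of this chapter automate exactly these distribution and combination steps, trading some efficiency for ease of use.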
With storage and networking getting significantly cheaper and faster, big data sets could easily become available to data enthusiasts with just a few mouse clicks, e.g., the Amazon Web Services Public Data Sets [11]. These enthusiasts may be policymakers, government employees, or managers who would like to draw insights and (business) value from big data. Thus, it is crucial for big data to be made available to nonexpert users in such a way that they can process the data without the need for a supercomputing expert. One such approach is to build big data frameworks within which commands can be implemented just as they would be in a small data framework. Such a framework should also be as simple as possible, even if not as efficient as custom-designed parallel solutions. Users should expect that if their code works within these frameworks for small data, it will also work for big data.

In order to achieve scalability for big data, many frameworks only implement a small subset of data operations and fully automate the parallelism of these operations. Users are expected to develop their codes using only that subset if they expect their code to scale to large data sets. MapReduce, one of the most widely used big data programming paradigms, is no exception to this rule. As its name suggests, the framework supports only two operations: map and reduce. The next sections provide an overview of MapReduce and its most popular implementation, Apache Hadoop.

5.2 The MapReduce programming model

The MapReduce framework was proposed by Jeffrey Dean and Sanjay Ghemawat at Google in 2004 [91]. Its origins date back to conceptually similar approaches first described in the early 1980s. MapReduce was indeed inspired by the map and reduce functions of functional programming, though its reduce function is more of a group-by-key function, producing a list of values, instead of the traditional reduce, which outputs only a single value.

MapReduce is a record-oriented model, where each record is treated as a key–value pair; thus, both the map and reduce functions operate on key–value pair data. A typical MapReduce job is composed of three phases—map, shuffle, and reduce—taking a list of key–value pairs $[(k_1, v_1), (k_2, v_2), \ldots, (k_n, v_n)]$ as input. In the map phase, each input key–value pair is run through the map function, and zero or more new key–value pairs are output. In the shuffle phase, the framework sorts the outputs of the map phase, grouping pairs by key before sending each group to the reduce function. In the reduce phase, each grouping of values is processed by the reduce function, and the result is a list of new values that are collected for the job output. In brief, a MapReduce job just takes a list of key–value pairs as input and produces a list of values as output. Users only need to implement the map and reduce functions (the shuffle phase is not very customizable) and can leave it up to the system that implements MapReduce to handle all data communications and parallel computing. We can summarize the MapReduce logic as follows:

map: $(k_i, v_i) \rightarrow [(k'_{i1}, v'_{i1}), (k'_{i2}, v'_{i2}), \ldots]$, where each output pair is produced from the input pair by a user-defined function $f$;

reduce: $(k''_i, [v''_{i1}, v''_{i2}, \ldots]) \rightarrow [v'''_1, v'''_2, \ldots]$, where $\{k''_i\} \equiv \{k'_j\}$ (the distinct keys output by the map phase) and each $v''' = g(v'')$ for a user-defined function $g$.
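Before turning to a concrete example, the data flow just summarized can be simulated in a few lines of plain Python. This is only a sketch of the model, not of how Hadoop or Spark execute it:

# Illustrative simulation of the three MapReduce phases in plain Python.
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each input key-value pair yields zero or more new pairs.
    mapped = []
    for key, value in records:
        mapped.extend(map_fn(key, value))

    # Shuffle phase: group the mapped pairs by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce phase: each key's list of values is turned into output values.
    return [out for key, values in groups.items()
                for out in reduce_fn(key, values)]

# Example: count words across two documents.
docs = [(1, "big data"), (2, "big ideas")]
word_map = lambda k, text: [(w, 1) for w in text.split()]
word_reduce = lambda k, counts: [(k, sum(counts))]
print(run_mapreduce(docs, word_map, word_reduce))
# [('big', 2), ('data', 1), ('ideas', 1)]

Note that, unlike functools.reduce in Python, which collapses a whole sequence to a single value, the reduce step here is applied once per key to that key's list of values.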
Example: Counting NSF awards

To gain a better understanding of these MapReduce operators, imagine that we have a list of NSF principal investigators, along with their email information and award IDs, as below. Our task is to count the number of awards for each institution. For example, given the four records below, we will discover that the Berkeley Geochronology Center has two awards, while New York University and the University of Utah each have one.

AwardId,FirstName,LastName,EmailAddress
0958723,Roland,Mundil,rmundil@bgc.org
0958915,Randall,Irmis,irmis@umnh.utah.edu
1301647,Zaher,Hani,zh8@nyu.edu
1316375,David,Shuster,dshuster@bgc.org

We observe that institutions can be distinguished by their email address domain name. Thus, we adopt a strategy of first grouping all award IDs by domain name, and then counting the number of distinct awards within each group. In order to do this, we first set the map function to scan input lines and extract institution information and award IDs. Then, in the reduce function, we simply count unique IDs, since everything is already grouped by institution. Python pseudo-code is provided in Listing 5.1.

# Input : a list of text lines
# Output : a list of domain names and award ids
def MAP(lines):
    for line in lines:
        fields = line.strip('\n').split(',')
        awardId = fields[0]
        domainName = '.'.join(fields[3].split('@')[-1].split('.')[-2:])
        yield (domainName, awardId)

# Input : a list of domain names and grouped award ids
# Output : a list of domain names and award counts
def REDUCE(pairs):
    for (domainName, awardIds) in pairs:
        count = len(set(awardIds))
        yield (domainName, count)

Listing 5.1. Python pseudo-code for the map and reduce functions to count the number of awards per institution

In the map phase, the input will be transformed into tuples of institutions and award IDs:

"0958723,Roland,Mundil,rmundil@bgc.org"      →  ("bgc.org", 0958723)
"0958915,Randall,Irmis,irmis@umnh.utah.edu"  →  ("utah.edu", 0958915)
"1301647,Zaher,Hani,zh8@nyu.edu"             →  ("nyu.edu", 1301647)
"1316375,David,Shuster,dshuster@bgc.org"     →  ("bgc.org", 1316375)

Then the tuples will be grouped by institution and counted by the reduce function:

("bgc.org", [0958723, 1316375])  →  ("bgc.org", 2)
("utah.edu", [0958915])          →  ("utah.edu", 1)
("nyu.edu", [1301647])           →  ("nyu.edu", 1)

As we have seen so far, the MapReduce programming model is quite simple and straightforward, yet it maps naturally onto a parallel execution model. In fact, it has been said to be too simple and criticized as "a major step backwards" [94] for large-scale, data-intensive applications. It is hard to argue that MapReduce is offering something truly innovative when MPI has been offering similar scatter and reduce operations since 1995, and Python has had higher-order functions (map, reduce, filter, and lambda) since its 1.0 release in 1994. However, the biggest strength of MapReduce is its simplicity. Its simple programming model has brought many nonexpert users to big data analysis. Its simple architecture has also inspired many developers to build advanced capabilities, such as support for distributed computing, data partitioning, and streaming processing. (A downside of this diversity of interest is that available features and capabilities can vary considerably, depending on the specific implementation of MapReduce that is being used.) We next describe two specific implementations of the MapReduce model: Hadoop and Spark.
5.3 Apache Hadoop MapReduce The names MapReduce and Apache Hadoop (or Hadoop)* are often used interchangeably, but they are conceptually different. MapReduce is simply a programming paradigm with a layer of abstraction that allows a set of data processing pipelines to be expressed without too much tailoring for how it will be executed exactly. The ⋆ The term Hadoop refers to the creator’s son’s toy elephant. 130 5. Programming with Big Data MapReduce model tells us which class of data structure (key–value pairs) and data transformations (map and reduce) it is supporting; however, it does not specifically state how the framework should be implemented; for example, it does not specify how data should be stored or how the computation be executed (in particular, parallelized). Hadoop [399], on the other hand, is a specific implementation of MapReduce with exact specifications of how data and computation are handled inside the system. Hadoop was originally designed for batch data processing at scale, with the target of being able to run in environments with thousands of machines. Supporting such a large computing environment puts several constraints on the system; for instance, with so many machines, the system had to assume computing nodes would fail. Hadoop is an enhanced MapReduce implementation with the support for fault tolerance, distributed storage, and data parallelism through two added key design features: (1) a distributed file system called the Hadoop Distributed File System (HDFS); and (2) a data distribution strategy that allows computation to be moved to the data during execution. 5.3.1 The Hadoop Distributed File System The Hadoop Distributed File System [345] is arguably the most important component that Hadoop added to the MapReduce framework. In a nutshell, it is a distributed file system that stripes data across all the nodes of a Hadoop cluster. HDFS splits large data files into smaller blocks which are managed by different nodes in the cluster. Each block is also replicated across several nodes as an attempt to ensure that a full copy of the data is still available even in the case of computing node failures. The block size as well as the number of replications per block are fully customized by users when they create files on HDFS. By default, the block size is set to 64 MB with a replication factor of 3, meaning that the system may encounter at least two concurrent node failures without losing any data. HDFS also actively monitors failures and re-replicates blocks on failed nodes to make sure that the number of replications for each block always stays at the user-defined settings. Thus, if a node fails, and only two copies of some data exist, the system will quickly copy those data to a working node, thus raising the number of copies to three again. This dynamic replication the primary mechanism for fault tolerance in Hadoop. Note that data blocks are replicated and distributed across several machines. This could create a problem for users, because 5.3. Apache Hadoop MapReduce if they had to manage the data manually, they might, for example, have to access more than one machine to fetch a large data file. Fortunately, Hadoop provides infrastructure for managing this complexity, including command line programs as well as an API that users can employ to interact with HDFS as if it were a local file system. 
This is one example that reinforces our discussion of big data technology being all about making things work seamlessly regardless of the computational environment; it should be possible for the user to use the system as though they were using their local workstation. Hadoop and HDFS are great examples of this approach. For example, one can run ls and mkdir to list and create a directory on HDFS, or even use tail to inspect file contents as one would expect in a Linux file system. The following code shows some examples of interacting with HDFS.

# Creating a temporary folder
hadoop dfs -mkdir /tmp/mytmp

# Upload a CSV file from our local machine to HDFS
hadoop dfs -put myfile.csv /tmp/mytmp

# Listing all files under the mytmp folder
hadoop dfs -ls /tmp/mytmp

# Upload another file with five replications and 128 MB per block
hadoop -D dfs.replication=5 -D dfs.block.size=128M \
    dfs -put mylargefile.csv /tmp/mytmp

# Download a file to our local machine
hadoop dfs -get /tmp/mytmp/myfile.csv .

5.3.2 Hadoop: Bringing compute to the data

The configuration of parallel computing infrastructure is a fairly complex task. At the risk of oversimplifying, we consider the computing environment as comprising a compute cluster with substantial computing power (e.g., thousands of computing cores), and a storage cluster with petabytes of disk space, capable of storing and serving data quickly to the compute cluster. These two clusters have quite different hardware specifications: the first is optimized for CPU performance and the second for storage capacity. The two systems are typically configured as separate physical hardware.

Running compute jobs on such hardware often goes like this. When a user requests to run an intensive task on a particular data set, the system will first reserve a set of computing nodes. Then the data are partitioned and copied from the storage server into these computing nodes before the task is executed. This process is illustrated in Figure 5.1(a). This computing model will be referred to as bringing data to computation. In this model, if a data set is being analyzed in multiple iterations, it is very likely that the data will be copied multiple times from the storage cluster to the compute nodes without reuse, because the compute node scheduler normally does not have or keep knowledge of where data have previously been held. The need to copy data multiple times tends to make such a computation model inefficient, and I/O becomes the bottleneck when all tasks constantly pull data from the storage cluster. This in turn leads to poor scalability; adding more nodes to the computing cluster would not increase its performance.

Figure 5.1. (a) The traditional parallel computing model, where data are brought to the computing nodes. (b) Hadoop's parallel computing model: bringing compute to the data [242]

To solve this problem, Hadoop implements a bring-compute-to-the-data strategy that combines both computing and storage at each node of the cluster. In this setup, each node offers both computing power and storage capacity. As shown in Figure 5.1(b), when users submit a task to be run on a data set, the scheduler will first look for nodes that contain the data, and if the nodes are available, it will schedule the task to run directly on those nodes. If a node is
busy with another task, data will still be copied to available nodes, but the scheduler will maintain records of the copy for subsequent use of the data. In addition, data copying can be minimized by increasing the data duplication in the cluster, which also increases the potential for parallelism, since the scheduler has more choices for allocating computation without copying.

Since compute and data storage are closely coupled in this model, it is best suited for data-intensive applications. Given that Hadoop was designed for batch data processing at scale, this model fits the system nicely, especially with the support of HDFS. However, in an environment where tasks are more compute intensive, a traditional high-performance computing environment is probably best, since it tends to spend more resources on CPU cores. It should be clear now that the Hadoop model has hardware implications, and computer architects have optimized systems for data-intensive computing.

Now that we are equipped with the knowledge that Hadoop is a MapReduce implementation that runs on HDFS and a bring-compute-to-the-data model, we can go over the design of a Hadoop MapReduce job. A MapReduce job is still composed of three phases: map, shuffle, and reduce. However, Hadoop divides the map and reduce phases into smaller tasks.

Each map phase in Hadoop is divided into five tasks: input format, record reader, mapper, combiner, and partitioner. An input format task is in charge of talking to the input data, presumably sitting on HDFS, and splitting it into partitions (e.g., by breaking lines at line breaks). Then a record reader task is responsible for translating the split data into key–value pair records so that they can be processed by the mapper. By default, Hadoop parses files into key–value pairs of line numbers and line contents. However, both input formats and record readers are fully customizable and can be programmed to read custom data, including binary files. It is important to note that input formats and record readers only provide data partitioning; they do not move data around computing nodes.

After the records are generated, mappers are spawned—typically on nodes containing the blocks—to run through these records and output zero or more new key–value pairs. A mapper in Hadoop is equivalent to the map function of the MapReduce model that we discussed earlier. The selection of the key to be output from the mapper will heavily depend on the data processing pipeline and could greatly affect the performance of the framework. Mappers are executed concurrently in Hadoop as long as resources permit.

A combiner task in Hadoop is similar to a reduce function in the MapReduce framework, but it only works locally at each node: it takes output from mappers executed on the same node and produces aggregated values. Combiners are optional but can be used to greatly reduce the amount of data exchanged in the shuffle phase; thus, users are encouraged to implement them whenever possible. A common practice is that when a reduce function is both commutative and associative and has the same input and output format, it can simply be reused as the combiner. Nevertheless, combiners are not guaranteed to be executed by Hadoop, so they should only be treated as a hint, and their execution must not affect the correctness of the program.
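To see why commutativity, associativity, and matching input/output formats matter, consider this plain-Python sketch (illustrative only; it is not Hadoop's combiner interface):

# A sum-based reduce can double as a combiner: partial counts computed on
# each node merge into the same final result as reducing everything at once.
def reduce_counts(key, values):
    yield (key, sum(values))

node1 = list(reduce_counts("big", [1, 1, 1]))   # partial count on node 1: [("big", 3)]
node2 = list(reduce_counts("big", [1, 1]))      # partial count on node 2: [("big", 2)]
merged = list(reduce_counts("big", [node1[0][1], node2[0][1]]))
assert merged == list(reduce_counts("big", [1, 1, 1, 1, 1]))

By contrast, the award-counting reducer of Listing 5.1 takes a list of IDs but outputs a count, so its input and output formats differ and it cannot be reused as a combiner without modification.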
A partitioner task is the last process taking place in the map phase on each mapper node; it hashes the key of each key–value pair output from the mappers or the combiners into bins. By default, the partitioner uses object hash codes and modulus operations to direct a designated reducer to pull data from a map node. Though it is possible to customize the partitioner, it is only advisable to do so when one fully understands the intermediate data distribution as well as the specifications of the cluster. In general, it is better to leave this job to Hadoop.

Each reduce phase in Hadoop is divided into three tasks: reducer, output format, and record writer. The reducer task is equivalent to the reduce function of the MapReduce model. It groups the data produced by the mappers by key and runs a reduce function on each list of grouped values. It outputs zero or more key–value pairs for the output format task, which then translates them into a writable format for the record writer task to serialize on HDFS. By default, Hadoop will separate the key and value with a tab and write separate records on separate lines. However, this behavior is fully customizable. As in the map phase, the reducers are executed concurrently in Hadoop.

5.3.3 Hardware provisioning

Hadoop requires a distributed cluster of machines to operate efficiently. (It can be set up to run entirely on a single computer, but this should only be done for technology demonstration purposes.) This is mostly because MapReduce performance depends heavily on the total I/O throughput (i.e., disk read and write) of the entire system. Having a distributed cluster, where each machine has its own set of hard drives, is one of the most efficient ways to maximize this throughput.

Figure 5.2. Data transfer and communication of a MapReduce job in Hadoop. Data blocks are assigned to several maps, which emit key–value pairs that are shuffled and sorted in parallel. The reduce step emits one or more pairs, with results stored on HDFS.

A typical Hadoop cluster consists of two types of machine: masters and workers. Master machines are those exclusively reserved for running services that are critical to the framework's operation. Some examples are the NameNode and the JobTracker services, which are tasked with managing how data and tasks, respectively, are distributed among the machines. The worker machines are reserved for data storage and for running actual computation tasks (i.e., map and reduce). It is normal for worker machines to be included in or removed from an operational cluster on demand; this ability to vary the number of worker nodes makes the overall system more tolerant of failure. However, master machines are usually required to run uninterrupted.

Provisioning and configuring the hardware for Hadoop, as for any other parallel computing system, are among the most important and complex tasks in setting up a cluster, and they often require a lot of experience and careful consideration. Major big data vendors provide guidelines and tools to facilitate the process [18, 23, 81].
Nevertheless, most decisions will be based on the types of analysis to be run on the cluster, for which only you, as the user, can provide the best input.

5.3.4 Programming language support

Hadoop is written entirely in Java; thus, it best supports applications written in Java. However, Hadoop also provides a streaming API that allows arbitrary code to be run inside the Hadoop MapReduce framework through the use of UNIX pipes. This means that we can supply a mapper program written in Python or C++ to Hadoop as long as that program reads from the standard input and writes to the standard output. The same mechanism also applies to the combiner and reducer. For example, we can develop the Python pseudo-code in Listing 5.1 into a complete Hadoop streaming mapper (Listing 5.2) and reducer (Listing 5.3).

#!/usr/bin/env python
import sys

def parseInput():
    for line in sys.stdin:
        yield line

if __name__ == '__main__':
    for line in parseInput():
        fields = line.strip('\n').split(',')
        awardId = fields[0]
        domainName = '.'.join(fields[3].split('@')[-1].split('.')[-2:])
        print('%s\t%s' % (domainName, awardId))

Listing 5.2. A Hadoop streaming mapper in Python

#!/usr/bin/env python
import sys
from itertools import groupby

def parseInput():
    for line in sys.stdin:
        # Each line holds one tab-separated key-value pair, sorted by key.
        yield line.rstrip('\n').split('\t', 1)

if __name__ == '__main__':
    # Group consecutive pairs sharing a key and count distinct award IDs.
    for domainName, pairs in groupby(parseInput(), key=lambda kv: kv[0]):
        count = len(set(awardId for _, awardId in pairs))
        print('%s\t%s' % (domainName, count))

Listing 5.3. A Hadoop streaming reducer in Python

It should be noted that in Hadoop streaming, intermediate key–value pairs (the data flowing between mappers and reducers) must be in tab-delimited format; thus, we replace the original yield command with a print formatted with tabs. Also, because the streaming API delivers the sorted key–value pairs to the reducer one per line rather than already grouped into lists, the reducer in Listing 5.3 groups consecutive lines by key before counting. Though the input format and record reader are still customizable in Hadoop streaming, they must be supplied as Java classes. This is one of the biggest limitations of Hadoop for Python developers. They not only have to split their code into separate mapper and reducer programs, but also need to learn Java if they want to work with nontextual data.

5.3.5 Fault tolerance

By default, HDFS uses checksums to enforce data integrity on its file system and uses data replication for recovery from potential data losses. Taking advantage of this, Hadoop also maintains fault tolerance of MapReduce jobs by storing data at every step of a MapReduce job to HDFS, including intermediate data from the combiner. The system then checks whether a task has failed by looking either at its heartbeats (data activities) or at whether it has been taking too long. If a task is deemed to have failed, Hadoop will kill it and run it again on a different node. The time limits for the heartbeats and the task running duration may also be customized for each job. Though the mechanism is simple, it works well on thousands of machines. It is indeed highly robust because of the simplicity of the model.

5.3.6 Limitations of Hadoop

Hadoop is a great system, and probably the most widely used MapReduce implementation. Nevertheless, it has important limitations, as we now describe.

• Performance: Hadoop has proven to be a scalable implementation that can run on thousands of cores. However, it is also known for having relatively high job setup overheads and suboptimal running times. An empty task in Hadoop (i.e., with no mapper or reducer) can take roughly 30 seconds to complete even on a modern cluster. This overhead makes it unsuitable for real-time data or interactive jobs.
The problem comes mostly from the fact that the Hadoop monitoring processes live only within a job; Hadoop needs to start and stop these processes each time a job is submitted, which in turn results in this major overhead. Moreover, the brute force approach of maintaining fault tolerance by storing everything on HDFS is expensive, especially for large data sets.

• Hadoop streaming support for non-Java applications: As mentioned previously, non-Java applications may only be integrated with Hadoop through the Hadoop streaming API. However, this API is far from optimal. First, input formats and record readers can only be written in Java, making it impossible to write advanced MapReduce jobs entirely in a different language. Second, Hadoop streaming only communicates with Hadoop through Unix pipes, and there is no support for passing data within the application using native data structures (e.g., it is necessary to convert Python tuples into strings in the mappers and convert them back into tuples again in the reducers).

• Real-time applications: With the current setup, Hadoop only supports batch data processing jobs. This is by design, so it is not exactly a limitation of Hadoop. However, given that more and more applications are dealing with massive real-time data sets, the community using MapReduce for real-time processing is constantly growing. Not having support for streaming or real-time data is clearly a disadvantage of Hadoop relative to other implementations.

• Limited data transformation operations: This is more of a limitation of MapReduce than of Hadoop per se. MapReduce only supports two operations, map and reduce, and while these operations are sufficient to describe a variety of data processing pipelines, there are classes of applications that MapReduce is not suitable for. Beyond that, developers often find themselves rewriting simple data operations such as data set joins and finding a minimum or maximum. Sometimes these tasks require more than one map-and-reduce operation, resulting in multiple MapReduce jobs. This is both cumbersome and inefficient. There are tools that automate this process for Hadoop; however, they are only a layer above, and they are not easy to integrate with existing customized Hadoop applications.

5.4 Apache Spark

In addition to Apache Hadoop, other notable MapReduce implementations include MongoDB, GreenplumDB, Disco, Riak, and Spark. MongoDB, Riak, and Greenplum DB are all database systems, and thus their MapReduce implementations focus more on the interoperability of MapReduce with their core components, such as MongoDB's aggregation framework, and leave it up to users to customize the MapReduce functionality for broader tasks. Some of these systems, such as Riak, only parallelize the map phase and run the reduce phase on the local machine that requests the task. The main advantage of these three implementations is the ease with which they connect to specific data stores. However, their support for general data processing pipelines is not as extensive as that of Hadoop.

Disco, similar to Hadoop, is designed to support MapReduce in a distributed computing environment, but it is written in Erlang with a Python interface. Thus, for Python developers, Disco might be a better fit. However, it has significantly fewer supporting applications, such as access control and workflow integration, as well as a smaller developer community.
This is why the top three big data platforms, Cloudera, Hortonworks, and MapR, still build primarily on Hadoop.

Apache Spark is another implementation, one that aims to go beyond MapReduce. The framework is centered around the concept of resilient distributed data sets and data transformations that can operate on these objects. An innovation in Spark is that the fault tolerance of resilient distributed data sets can be maintained without flushing data onto disks, thus significantly improving the system performance (with a claim of being 100 times faster than Hadoop). Instead, the fault-recovery process is done by replaying a log of data transformations on checkpoint data. Though this process could take longer than reading data straight from HDFS, it does not occur often and is a fair tradeoff between processing performance and recovery performance.

Beyond map and reduce, Spark also supports various other transformations [147], including filter, data join, and aggregation. Streaming computation can also be done in Spark by asking Spark to reserve resources on a cluster to constantly stream data to/from the cluster. However, this streaming method might be resource intensive (still consuming resources when no data are coming in). Additionally, Spark plays well with the Hadoop ecosystem, particularly with the distributed file system (HDFS) and resource manager (YARN), making it possible to run Spark on top of existing Hadoop deployments.

Another advantage of Spark is that it supports Python natively; thus, developers can develop and run Spark programs in a fraction of the time required for Hadoop. Listing 5.4 provides the full code for the previous example written entirely in Spark.

import sys
from pyspark import SparkContext

def mapper(lines):
    for line in lines:
        fields = line.strip('\n').split(',')
        awardId = fields[0]
        domainName = '.'.join(fields[3].split('@')[-1].split('.')[-2:])
        yield (domainName, awardId)

def reducer(pairs):
    for (domainName, awardIds) in pairs:
        count = len(set(awardIds))
        yield (domainName, count)

if __name__ == '__main__':
    hdfsInputPath = sys.argv[1]
    hdfsOutputFile = sys.argv[2]
    sc = SparkContext(appName="Counting Awards")
    output = sc.textFile(hdfsInputPath) \
               .mapPartitions(mapper) \
               .groupByKey() \
               .mapPartitions(reducer)
    output.saveAsTextFile(hdfsOutputFile)

Listing 5.4. Python code for a Spark program that counts the number of awards per institution using MapReduce

It should be noted that Spark's reduceByKey operator is not the same as Hadoop's reduce, as it is designed to aggregate all of a key's values into a single element. The closest simulation of Hadoop's MapReduce pattern is a combination of mapPartitions, groupByKey, and mapPartitions, as shown in the next example.

Example: Analyzing home mortgage disclosure application data

We use a financial services analysis problem to illustrate the use of Apache Spark. Mortgage origination data provided by the Consumer Financial Protection Bureau provide insightful details on the financial health of the real estate market. The data [84], which are a product of the Home Mortgage Disclosure Act (HMDA), highlight key attributes that function as strong indicators of health and lending patterns. Lending institutions, as defined by section 1813 in Title 12 of the HMDA, decide whether to originate or deny mortgage applications based on credit risk. In order to determine this credit risk, lenders must evaluate certain features relative to the applicant, the underlying property, and the location.
We want to determine whether census tract clusters could be created based on mortgage application data and whether lending institutions' perception of risk is held constant across the entire USA. For the first step of this process, we study the debt–income ratio for loans originating in different census tracts. This could be achieved simply by computing the debt–income ratio for each loan application and aggregating the ratios for each year by census tract number.

A challenge, however, is that the data set provided by HMDA is quite extensive. In total, HMDA data contain approximately 130 million loan applications between 2007 and 2013. As each record contains 47 attributes, varying in type from continuous variables such as loan amount and applicant income to categorical variables such as applicant gender, race, loan type, and owner occupancy, the entire data set amounts to about 86 GB of information. Parsing the data alone could take hours on a single machine using a naïve approach that scans through the data sequentially. Tables 5.1 and 5.2 highlight the breakdown in size per year and the data fields of interest.

Table 5.1. Home Mortgage Disclosure Act data size
Year     Records        File Size (Gigabytes)
2007     26,605,696     18
2008     17,391,571     12
2009     19,493,492     13
2010     16,348,558     11
2011     14,873,416     9.4
2012     18,691,552     12
2013     17,016,160     11
Total    130,420,445    86.4

Table 5.2. Home Mortgage Disclosure Act data fields
Index    Attribute            Type
0        Year                 Integer
1        State                String
2        County               String
3        Census Tract         String
4        Loan Amount          Float
5        Applicant Income     Float
6        Loan Originated      Boolean
···      ···                  ···

Observing the transactional nature of the data, where the aggregation process could be distributed and merged across multiple partitions of the data, we could complete this task in much less time by using Spark. Using a cluster consisting of 1,200 cores, the Spark program in Listing 5.5 took under a minute to complete. The substantial performance gain comes not so much from the large number of processors available, but mostly from the large I/O bandwidth available on the cluster thanks to the 200 distributed hard disks and fast network interconnects.

import ast
import sys
from pyspark import SparkContext

def mapper(lines):
    for line in lines:
        fields = ast.literal_eval('(%s)' % line)
        (year, state, county, tract) = fields[:4]
        (amount, income, originated) = fields[4:]
        key = (year, state, county, tract)
        value = (amount, income)
        # Only count originated loans
        if originated:
            yield (key, value)

def sumDebtIncome(debtIncome1, debtIncome2):
    return (debtIncome1[0] + debtIncome2[0],
            debtIncome1[1] + debtIncome2[1])

if __name__ == '__main__':
    hdfsInputPath = sys.argv[1]
    hdfsOutputFile = sys.argv[2]
    sc = SparkContext(appName="Aggregating Debt-Income Ratios")
    sumValues = sc.textFile(hdfsInputPath) \
                  .mapPartitions(mapper) \
                  .reduceByKey(sumDebtIncome)
    # Actually compute the aggregated debt-income ratio
    output = sumValues.mapValues(
        lambda debtIncome: debtIncome[0] / debtIncome[1])
    output.saveAsTextFile(hdfsOutputFile)

Listing 5.5. Python code for a Spark program to aggregate the debt–income ratio for loans originated in different census tracts
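One detail of Listing 5.5 worth noting is the signature of sumDebtIncome: reduceByKey merges values pairwise and associatively rather than handing the function a complete list per key, which is what allows Spark to combine partial sums within each partition before shuffling. The following snippet (with made-up loan figures, not HMDA records) shows the same pairwise merging in plain Python:

# Pairwise, associative merging of (loan amount, applicant income) pairs,
# mirroring what reduceByKey does with sumDebtIncome for a single tract key.
from functools import reduce

def sumDebtIncome(debtIncome1, debtIncome2):
    return (debtIncome1[0] + debtIncome2[0],
            debtIncome1[1] + debtIncome2[1])

values = [(120.0, 60.0), (200.0, 80.0), (150.0, 50.0)]  # illustrative only
amount, income = reduce(sumDebtIncome, values)
print(amount / income)  # aggregated ratio for this tract: 470.0 / 190.0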
5.5 Summary

Big data means that it is necessary both to store very large collections of data and to perform aggregate computations on those data. This chapter spelled out an important data storage approach (the Hadoop Distributed File System) and a way of processing large-scale data sets (the MapReduce model, as implemented in both Hadoop and Spark). This in-database processing model enables not only high-performance analytics but also more flexibility for analysts to work with data. Instead of going through a database administrator for every data gathering, data ingestion, or data transformation task, as in a traditional data warehouse approach, the analysts rather "own" the data in the big data environment. This increases the analytic throughput and shortens the time to insight, speeding up the decision-making process and thus increasing the business impact, which is one of the main drivers for big data analytics.

5.6 Resources

There is a wealth of online resources describing both Hadoop and Spark. See, for example, the tutorials on the Apache Hadoop [19] and Spark [20] websites. Albanese describes how to use Hadoop for social science [9], and Lin and Dyer discuss the use of MapReduce for text analysis [238].

We have not discussed here how to deal with data that are located remotely from your computer. If such data are large, then moving them can be an arduous task. The Globus transfer service [72] is commonly used for the transfer and sharing of such data. It is available on many research systems and is easy to install on your own computer.

Part II Modeling and Analysis

Chapter 6 Machine Learning
Huy Vo and Claudio Silva's chapter is followed here by Rayid Ghani and Malte Schierholz

This chapter introduces you to the value of machine learning in the social sciences, particularly focusing on the overall machine learning process as well as clustering and classification methods. You will get an overview of the machine learning pipeline and methods and how those methods are applied to solve social science problems. The goal is to give an intuitive explanation of the methods and to provide practical tips on how to use them in practice.

6.1 Introduction

You have probably heard of "machine learning" but are not sure exactly what it is, how it differs from traditional statistics, and what you can do with it. In this chapter, we will demystify machine learning, draw connections to what you already know from statistics and data analysis, and go deeper into some of the unique concepts and methods that have been developed in this field. Although the field originates from computer science (specifically, artificial intelligence), it has been influenced quite heavily by statistics in the past 15 years. As you will see, many of the concepts you will learn are not entirely new, but are simply called something else. For example, you are already familiar with logistic regression (a classification method that falls under the supervised learning framework in machine learning) and cluster analysis (a form of unsupervised learning). You will also learn about new methods that are more exclusively used in machine learning, such as random forests and support vector machines. We will keep formalisms to a minimum and focus on getting the intuition across, as well as providing practical tips. Our hope is that this chapter will make you comfortable and familiar with machine learning vocabulary, concepts, and processes, and allow you to further explore and use these methods and tools in your own research and practice.

6.2 What is machine learning?

When humans improve their skills with experience, they are said to learn.
Is it also possible to program computers to do the same? Arthur Samuel, who coined the term machine learning in 1959 [323], was a pioneer in this area, programming a computer to play checkers. The computer played against itself and human opponents, improving its performance with every game. Eventually, after sufficient training (and experience), the computer became a better player than the human programmer. Today, machine learning has grown significantly beyond learning to play checkers. Machine learning systems have learned to drive (and park) autonomous cars, are embedded inside robots, can recommend books, products, and movies we are (sometimes) interested in, identify drugs, proteins, and genes that should be investigated further to cure diseases, detect cancer and other diseases in medical imaging, help us understand how the human brain learns language, help identify which voters are persuadable in elections, detect which students are likely to need extra support to graduate high school on time, and help solve many more problems.

Over the past 20 years, machine learning has become an interdisciplinary field spanning computer science, artificial intelligence, databases, and statistics. At its core, machine learning seeks to design computer systems that improve over time with more experience. In one of the earlier books on machine learning, Tom Mitchell gives a more operational definition, stating that: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" [258].

Machine learning grew from the need to build systems that were adaptive, scalable, and cost-effective to build and maintain. A lot of tasks now being done using machine learning used to be done by rule-based systems, where experts would spend considerable time and effort developing and maintaining the rules. The problem with those systems was that they were rigid, not adaptive, hard to scale, and expensive to maintain. Machine learning systems started becoming popular because they could improve the system along all of these dimensions. Box 6.1 mentions several examples where machine learning is being used in commercial applications today. Social scientists are uniquely placed today to take advantage of the same advances in machine learning, gaining better methods to solve several key problems they are tackling. We will give concrete examples later in this chapter.

Box 6.1: Commercial machine learning examples

• Speech recognition: Speech recognition software uses machine learning algorithms that are built on large amounts of initial training data. Machine learning allows these systems to be tuned and to adapt to individual variations in speaking as well as across different domains.

• Autonomous cars: The ongoing development of self-driving cars applies techniques from machine learning. An onboard computer continuously analyzes the incoming video and sensor streams in order to monitor the surroundings. Incoming data are matched with annotated images to recognize objects like pedestrians, traffic lights, and potholes. In order to assess the different objects, huge training data sets are required in which similar objects have already been identified. This allows the autonomous car to decide on which actions to take next.

• Fraud detection: Many public and private organizations face the problem of fraud and abuse.
Machine learning systems are widely used to take historical cases of fraud and flag fraudulent transactions as they take place. These systems have the benefit of being adaptive, and improving with more data over time. • Personalized ads: Many online stores have personalized recommendations promoting possible products of interest. Based on individual shopping history and what other similar users bought in the past, the website predicts products a user may like and tailors recommendations. Netflix and Amazon are two examples of companies whose recommendation software predicts how a customer would rate a certain movie or product and then suggests items with the highest predicted ratings. Of course there are some caveats here, since they then adjust the recommendations to maximize profits. • Face recognition: Surveillance systems, social network- ing platforms, and imaging software all use face detection and face recognition to first detect faces in images (or video) and then tag them with individuals for various tasks. These systems are trained by giving examples of faces to a machine learning system which then learns to detect new faces, and tag known individuals. 149 150 6. Machine Learning This chapter is not an exhaustive introduction to machine learning. There are many books that have done an excellent job of that [124, 159, 258]. Instead, we present a short and understandable introduction to machine learning for social scientists, give an overview of the overall machine learning process, provide an intuitive introduction to machine learning methods, give some practical tips that will be helpful in using these methods, and leave a lot of the statistical theory to machine learning textbooks. As you read more about machine learning in the research literature or the media, you will encounter names of other fields that are related (and practically the same for most social science audiences), such as statistical learning, data mining, and pattern recognition. 6.3 ◮ See Chapter 10. The machine learning process When solving problems using machine learning methods, it is important to think of the larger data-driven problem-solving process of which these methods are a small part. A typical machine learning problem requires researchers and practitioners to take the following steps: 1. Understand the problem and goal: This sounds obvious but is often nontrivial. Problems typically start as vague descriptions of a goal—improving health outcomes, increasing graduation rates, understanding the effect of a variable X on an outcome Y , etc. It is really important to work with people who understand the domain being studied to dig deeper and define the problem more concretely. What is the analytical formulation of the metric that you are trying to optimize? 2. Formulate it as a machine learning problem: Is it a classification problem or a regression problem? Is the goal to build a model that generates a ranked list prioritized by risk, or is it to detect anomalies as new data come in? Knowing what kinds of tasks machine learning can solve will allow you to map the problem you are working on to one or more machine learning settings and give you access to a suite of methods. 3. Data exploration and preparation: Next, you need to carefully explore the data you have. What additional data do you need or have access to? What variable will you use to match records for integrating different data sources? What variables exist in the data set? Are they continuous or categorical? What about 6.4. 
Problem formulation: Mapping a problem to machine learning methods missing values? Can you use the variables in their original form or do you need to alter them in some way? 4. Feature engineering: In machine learning language, what you might know as independent variables or predictors or factors or covariates are called “features.” Creating good features is probably the most important step in the machine learning process. This involves doing transformations, creating interaction terms, or aggregating over data points or over time and space. 5. Method selection: Having formulated the problem and created your features, you now have a suite of methods to choose from. It would be great if there were a single method that always worked best for a specific type of problem, but that would make things too easy. Typically, in machine learning, you take a collection of methods and try them out to empirically validate which one works the best for your problem. We will give an overview of leading methods that are being used today in this chapter. 6. Evaluation: As you build a large number of possible models, you need a way to select the model that is the best. This part of the chapter will cover the validation methodology to first validate the models on historical data as well as discuss a variety of evaluation metrics. The next step is to validate using a field trial or experiment. 7. Deployment: Once you have selected the best model and validated it using historical data as well as a field trial, you are ready to put the model into practice. You still have to keep in mind that new data will be coming in, and the model might change over time. We will not cover too much of those aspects in this chapter, but they are important to keep in mind. 6.4 Problem formulation: Mapping a problem to machine learning methods When working on a new problem, one of the first things we need to do is to map it to a class of machine learning methods. In general, the problems we will tackle, including the examples above, can be grouped into two major categories: 1. Supervised learning: These are problems where there exists a target variable (continuous or discrete) that we want to predict 151 152 6. Machine Learning or classify data into. Classification, prediction, and regression all fall into this category. More formally, supervised learning methods predict a value Y given input(s) X by learning (or estimating or fitting or training) a function F , where F (X ) = Y . Here, X is the set of variables (known as features in machine learning, or in other fields as predictors) provided as input and Y is the target/dependent variable or a label (as it is known in machine learning). The goal of supervised learning methods is to search for that function F that best predicts Y . When the output Y is categorical, this is known as classification. When Y is a continuous value, this is called regression. Sound familiar? ⋆ In statistical terms, regularization is an attempt to avoid overfitting the model. One key distinction in machine learning is that the goal is not just to find the best function F that can predict Y for observed outcomes (known Y s) but to find one that best generalizes to new, unseen data. This distinction makes methods more focused on generalization and less on just fitting the data we have as best as we can. It is important to note that you do that implicitly when performing regression by not adding more and more higher-order terms to get better fit statistics. 
By getting better fit statistics, we overfit to the data and the performance on new (unseen) data often goes down. Methods like the lasso [376] penalize the model for having too many terms by performing what is known as regularization.* 2. Unsupervised learning: These are problems where there does not exist a target variable that we want to predict but we want to understand “natural” groupings or patterns in the data. Clustering is the most common example of this type of analysis where you are given X and want to group similar X s together. Principal components analysis (PCA) and related methods also fall into the unsupervised learning category. In between the two extremes of supervised and unsupervised learning, there is a spectrum of methods that have different levels of supervision involved (Figure 6.1). Supervision in this case is the presence of target variables (known in machine learning as labels). In unsupervised learning, none of the data points have labels. In supervised learning, all data points have labels. In between, either the percentage of examples with labels can vary or the types of labels can vary. We do not cover the weakly supervised and semisupervised methods much in this chapter, but this is an active area of research in machine learning. Zhu [414] provides more details. 6.5. Methods 153 Machine Learning Spectrum Unsupervised Clustering PCA MDS Association Rules … “Weakly” supervised Fully supervised Classification Prediction Regression Figure 6.1. Spectrum of machine learning methods from unsupervised to supervised learning 6.5 Methods We will start by describing unsupervised learning methods and then go on to supervised learning methods. We focus here on the intuition behind the methods and the algorithm, as well as practical tips, rather than on the statistical theory that underlies the methods. We encourage readers to refer to machine learning books listed in Section 6.11 for more details. Box 6.2 gives brief definitions of several terms we will use in this section. 6.5.1 Unsupervised learning methods As mentioned earlier, unsupervised learning methods are used when we do not have a target variable to predict but want to understand “natural” clusters or patterns in the data. These methods are often used for initial data exploration, as in the following examples: 1. When faced with a large corpus of text data—for example, email records, congressional bills, speeches, or open-ended free-text survey responses—unsupervised learning methods are often used to understand and get a handle on what the data contain. 2. Given a data set about students and their behavior over time (academic performance, grades, test scores, attendance, etc.), one might want to understand typical behaviors as well as trajectories of these behaviors over time. Unsupervised learning methods (clustering) can be applied to these data to get student “segments” with similar behavior. 3. Given a data set about publications or patents in different fields, we can use unsupervised learning methods (association 154 6. Machine Learning Box 6.2: Machine learning vocabulary • Learning: In machine learning, you will notice the term learning that will be used in the context of “learning” a model. This is what you probably know as fitting or estimating a function, or training or building a model. These terms are all synonyms and are used interchangeably in the machine learning literature. • Examples: These are data points and instances. 
• Features: These are independent variables, attributes, predictor variables, and explanatory variables. • Labels: These include the response variable, dependent variable, and target variable. • Underfitting: This happens when a model is too simple and does not capture the structure of the data well enough. • Overfitting: This happens when a model is possibly too complex and models the noise in the data, which can result in poor generalization performance. Using in-sample measures to do model selection can result in that. • Regularization: This is a general method to avoid overfit- ting by applying additional constraints to the model that is learned. A common approach is to make sure the model weights are, on average, small in magnitude. Two common regularizations are L1 regularization (used by the lasso), which has a penalty term that encourages the sum of the absolute values of the parameters to be small; and L2 regularization, which encourages the sum of the squares of the parameters to be small. rules) to figure out which disciplines have the most collaboration and which fields have researchers who tend to publish across different fields. Clustering Clustering is the most common unsupervised learning technique and is used to group data points together that are similar to each other. The goal of clustering methods is to produce clusters 6.5. Methods with high intra-cluster (within) similarity and low inter-cluster (between) similarity. Clustering algorithms typically require a distance (or similarity) metric* to generate clusters. They take a data set and a distance metric (and sometimes additional parameters), and they generate clusters based on that distance metric. The most common distance metric used is Euclidean distance, but other commonly used metrics are Manhattan, Minkowski, Chebyshev, cosine, Hamming, Pearson, and Mahalanobis. Often, domain-specific similarity metrics can be designed for use in specific problems. For example, when performing the record linkage tasks discussed in Chapter 3, you can design a similarity metric that compares two first names and assigns them a high similarity (low distance) if they both map to the same canonical name, so that, for example, Sammy and Sam map to Samuel. Most clustering algorithms also require the user to specify the number of clusters (or some other parameter that indirectly determines the number of clusters) in advance as a parameter. This is often difficult to do a priori and typically makes clustering an iterative and interactive task. Another aspect of clustering that makes it interactive is often the difficulty in automatically evaluating the quality of the clusters. While various analytical clustering metrics have been developed, the best clustering is task-dependent and thus must be evaluated by the user. There may be different clusterings that can be generated with the same data. You can imagine clustering similar news stories based on the topic content, based on the writing style or based on sentiment. The right set of clusters depends on the user and the task they have. Clustering is therefore typically used for exploring the data, generating clusters, exploring the clusters, and then rerunning the clustering method with different parameters or modifying the clusters (by splitting or merging the previous set of clusters). 
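To make the distance metrics mentioned above concrete, here is a minimal Python sketch (assuming the scikit-learn and NumPy libraries are installed; the two toy records are invented for illustration) that computes the same pair of points under several metrics:

# Comparing distance metrics on two toy records, assuming scikit-learn.
import numpy as np
from sklearn.metrics import pairwise_distances

# Two "records" described by three numeric features each.
X = np.array([[1.0, 2.0, 0.0],
              [4.0, 0.0, 1.0]])

for metric in ["euclidean", "manhattan", "cosine"]:
    d = pairwise_distances(X, metric=metric)
    # d is a 2x2 matrix; d[0, 1] is the distance between the two records.
    print(metric, round(d[0, 1], 3))

The choice of metric changes which points count as "close," and therefore which clusters any algorithm built on top of it will find.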
Interpreting a cluster can be nontrivial: you can look at the centroid of a cluster, look at the frequency distributions of different features (and compare them to the prior distribution of each feature), or build a decision tree (a supervised learning method we will cover later in this chapter) in which the target variable is the cluster ID and that describes the cluster using the features in your data. A good example of a tool that allows interactive clustering from text data is Ontogen [125].

k-means clustering
The most commonly used clustering algorithm is called k-means, where k defines the number of clusters.

⋆ Distance metrics are mathematical formulas to calculate the distance between two objects. For example, Manhattan distance is the distance a car would drive from one place to another in a grid-based street system, whereas Euclidean distance (in two-dimensional space) is the "straight-line" distance between two points.

The algorithm works as follows:

1. Select k (the number of clusters you want to generate).
2. Initialize by selecting k points as centroids of the k clusters. This is typically done by selecting k points uniformly at random.
3. Assign each point to the cluster with the nearest centroid.
4. Recalculate the cluster centroids based on the assignment in (3), as the mean of all data points belonging to that cluster.
5. Repeat (3) and (4) until convergence.

The algorithm stops when the assignments do not change from one iteration to the next (Figure 6.2).

Figure 6.2. Example of k-means clustering with k = 3. The upper left panel shows the distribution of the data and the three starting points m1, m2, m3 placed at random. The upper right panel shows the first iteration: the cluster means move to more central positions in their respective clusters. The lower left panel shows the second iteration. After six iterations the cluster means have converged to their final positions, shown in the lower right panel.

The final set of clusters, however, depends on the starting points. If they are initialized differently, it is possible that different clusters are obtained. One common practical trick is to run k-means several times, each with different (random) starting points. The k-means algorithm is fast, simple, and easy to use, and is often a good first clustering algorithm to try to see if it fits your needs. When the data are of a form where the mean of the data points cannot be computed, a related method called k-medoids can be used [296].
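As a concrete illustration, here is a minimal k-means sketch (assuming scikit-learn is installed; the simulated two-dimensional data stand in for whatever features you have constructed):

# A minimal k-means sketch, assuming scikit-learn and NumPy.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# 300 simulated points drawn around three artificial centers.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

# n_clusters is the k you must choose; n_init reruns the algorithm with
# different random starting centroids and keeps the best solution.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # the learned centroids
print(kmeans.labels_[:10])       # cluster assignments for the first ten points

The n_init argument implements the practical trick described above of running k-means several times from different random starting points.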
Expectation-maximization (EM) clustering
You may be familiar with the EM algorithm in the context of imputing missing data. EM is a general approach to maximum likelihood estimation in the presence of incomplete data, but it can also be used as a clustering method in which the missing data are the clusters to which the data points belong. Unlike k-means, where each data point is assigned to exactly one cluster, EM performs a soft assignment: each data point gets a probabilistic assignment to every cluster. The EM algorithm iterates until the estimates converge to some (locally) optimal solution. Compared to k-means, EM is fairly good at dealing with outliers and with high-dimensional data. It also has a few limitations. First, it does not work well with a large number of clusters or when a cluster contains few examples. Also, when the value of k is larger than the number of actual clusters in the data, EM may not give reasonable results.

Mean shift clustering
Mean shift clustering works by finding dense regions in the data. It defines a window around each data point, computes the mean of the data points in the window, shifts the center of the window to that mean, and repeats until convergence. After each iteration, the window moves to a denser region of the data set. The algorithm proceeds as follows:

1. Fix a window around each data point (based on a bandwidth parameter that defines the size of the window).
2. Compute the mean of the data within the window.
3. Shift the window to the mean and repeat until convergence.

Mean shift needs a bandwidth parameter h to be tuned, which influences the convergence rate and the number of clusters. A large h might merge distinct clusters; a small h might produce too many clusters. Mean shift might not work well in higher dimensions, since the number of local maxima is high and the algorithm may converge quickly to a local optimum.

One of the most important differences between mean shift and k-means is that k-means makes two broad assumptions: the number of clusters is already known, and the clusters are shaped spherically (or elliptically). Mean shift assumes nothing about the number of clusters (although the value of h indirectly determines it) and can handle arbitrarily shaped clusters. The k-means algorithm is also sensitive to initialization, whereas mean shift is fairly robust to it (typically, mean shift is run for each point, or sometimes points are selected uniformly at random). Similarly, k-means is sensitive to outliers, while mean shift is less so. These benefits come at a cost—speed: the k-means procedure is fast, whereas classic mean shift is computationally slow, although it can be easily parallelized.

Hierarchical clustering
The clustering methods that we have seen so far, often termed partitioning methods, produce a flat set of clusters with no hierarchy. Sometimes we want to generate a hierarchy of clusters, and methods that can do so are of two types:

1. Agglomerative (bottom-up): Start with each point as its own cluster and iteratively merge the closest clusters. The iterations stop either when the clusters are too far apart to be merged (based on a predefined distance criterion) or when there is a sufficient number of clusters (based on a predefined threshold).
2. Divisive (top-down): Start with one cluster and create splits recursively.

Agglomerative clustering is used more often than divisive clustering, in part because it is significantly faster, although both are typically slower than direct partitioning methods such as k-means and EM. Another disadvantage of these methods is that they are greedy: a data point that is assigned to the "wrong" cluster in an early split or merge cannot be reassigned later on.
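For illustration, here is a minimal agglomerative clustering sketch (assuming scikit-learn is installed; the simulated data are invented for the example):

# A minimal agglomerative (bottom-up) clustering sketch, assuming scikit-learn.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.RandomState(1)
# Two simulated groups of points.
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(6, 1, size=(50, 2))])

# linkage controls how the distance between two clusters is computed
# ("ward", "complete", "average", or "single").
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)
print(labels[:10])   # cluster assignments for the first ten points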
Spectral clustering
Figure 6.3 shows the clusters that k-means would generate on the data set in the figure. It is obvious that the clusters produced are not the clusters you would want, and that is one drawback of methods such as k-means: two points that are far away from each other will be put in different clusters even if there are other data points that create a "path" between them. Spectral clustering fixes that problem by clustering data that are connected but not necessarily compact, or clustered within convex boundaries.

Figure 6.3. The same data set can produce drastically different clusters: (a) k-means; (b) spectral clustering.

Spectral clustering methods work by representing the data as a graph (or network), where data points are nodes in the graph and the edges (connections between nodes) represent the similarity between two data points. The algorithm works as follows:

1. Compute a similarity matrix from the data. This involves determining a pairwise distance function (using one of the distance functions we described earlier).
2. With this matrix, we can now perform graph partitioning, where connected graph components are interpreted as clusters. The graph must be partitioned such that edges connecting different clusters have low weights and edges within the same cluster have high values.
3. We can now partition these data, represented by the similarity matrix, in a variety of ways. One common way is to use the normalized cuts method. Another is to compute a graph Laplacian from the similarity matrix.
4. Compute the eigenvectors and eigenvalues of the Laplacian.
5. The k eigenvectors are used as proxy data for the original data set and are fed into k-means clustering to produce cluster assignments for each original data point.

Spectral clustering generally produces better clusters than k-means but is much slower to run in practice. For large-scale problems, k-means is the preferred clustering algorithm because of its efficiency and speed.

Principal components analysis
Principal components analysis (PCA) is another unsupervised method used for finding patterns and structure in data. In contrast to clustering methods, the output is not a set of clusters but a set of principal components that are linear combinations of the original variables. PCA is typically used when you have a large number of variables and want a reduced number that you can analyze; this approach is often called dimensionality reduction. It generates linearly uncorrelated dimensions that can be used to understand the underlying structure of the data. In mathematical terms, given a set of data on n dimensions, PCA aims to find a linear subspace of dimension d lower than n such that the data points lie mainly on this linear subspace. PCA is related to several other methods you may already know about: multidimensional scaling, factor analysis, and independent component analysis differ from PCA in the assumptions they make, but they are often used for the similar purposes of dimensionality reduction and discovering the underlying structure of a data set.

Association rules
Association rules are a different type of analysis method, originating in the data mining and database community and primarily focused on finding frequent co-occurring associations among a collection of items. This method is sometimes referred to as "market basket analysis," since that was the original application area of association rules.
The goal is to find associations of items that occur together more often than you would randomly expect. The classic example (probably a myth) is “men who go to the store to buy diapers will also tend to buy beer at the same time.” This type of analysis would be performed by applying association rules to a set of supermarket purchase data. Association rules take the form X1 , X2 , X3 ⇒ Y with support S and confidence C, implying that when a transaction contains items {X1 , X2 , X3 } C% of the time, they also contain item Y and there are at least S% of transactions where the antecedent is true. This is useful in cases where we want to find patterns that are both frequent and 6.5. Methods statistically significant, by specifying thresholds for support S and confidence C. Support and confidence are useful metrics to generate rules but are often not enough. Another important metric used to generate rules (or reduce the number of spurious patterns generated) is lift. Lift is simply estimated by the ratio of the joint probability of two items, x and y, to the product of their individual probabilities: P (x, y)/[P (x )P (y)]. If the two items are statistically independent, then P (x, y) = P (x )P (y), corresponding to a lift of 1. Note that anticorrelation yields lift values less than 1, which is also an interesting pattern, corresponding to mutually exclusive items that rarely occur together. Association rule algorithms work as follows: Given a set of transactions (rows) and items for that transaction: 1. Find all combinations of items in a set of transactions that occur with a specified minimum frequency. These combinations are called frequent itemsets. 2. Generate association rules that express co-occurrence of items within frequent itemsets. For our purposes, association rule methods are an efficient way to take a basket of features (e.g., areas of publication of a researcher, different organizations an individual has worked at in their career, all the cities or neighborhoods someone may have lived in) and find co-occurrence patterns. This may sound trivial, but as data sets and number of features get larger, it becomes computationally expensive and association rule mining algorithms provide a fast and efficient way of doing it. 6.5.2 Supervised learning We now turn to the problem of supervised learning, which typically involves methods for classification, prediction, and regression. We will mostly focus on classification methods in this chapter since many of the regression methods in machine learning are fairly similar to methods with which you are already familiar. Remember that classification means predicting a discrete (or categorical) variable. Some of the classification methods that we will cover can also be used for regression, a fact that we will mention when describing that method. In general, supervised learning methods take as input pairs of data points (X, Y ) where X are the predictor variables (features) and 161 162 ◮ The topic of causal inference is addressed in more detail in Chapter 10. 6. Machine Learning Y is the target variable (label). The supervised learning method then uses these pairs as training data and learns a model F , where F (X ) ∼ Y . This model F is then used to predict Y s for new data points X . As mentioned earlier, the goal is not to build a model that best fits known data but a model that is useful for future predictions and minimizes future generalization error. 
This is the key goal that differentiates many of the methods that you know from the methods that we will describe next. In order to minimize future error, we want to build models that are not just overfitting on past data. Another goal, often prioritized in the social sciences, that machine learning methods do not optimize for is getting a structural form of the model. Machine learning models for classification can take different structural forms (ranging from linear models, to sets of rules, to more complex forms), and it may not always be possible to write them down in a compact form as an equation. This does not, however, make them incomprehensible or uninterpretable. Another focus of machine learning models for supervised learning is prediction, and not causal inference. Some of these models can be used to help with causal inference, but they are typically optimized for prediction tasks. We believe that there are many social science and policy problems where better prediction methods can be extremely beneficial. In this chapter, we mostly deal with binary classification problems: that is, problems in which the data points are to be classified into one of two categories. Several of the methods that we will cover can also be used for multiclass classification (classifying a data point into one of n categories) or for multi-label classification (classifying a data point into m of n categories where m ≥1). There are also approaches to take multiclass problems and turn them into a set of binary problems that we will mention briefly at the end of the chapter. Before we describe supervised learning methods, we want to recap a few principles as well as terms that we have used and will be using in the rest of the chapter. Training a model Once we have finished data exploration, filled in missing values, created predictor variables (features), and decided what our target variable (label) is, we now have pairs of X, Y to start training (or building) the model. Using the model to score new data We are building this model so we can predict Y for a new set of X s—using the model means, 6.5. Methods 163 Figure 6.4. Example of k -nearest neighbor with k = 1, 3, 5 neighbors. We want to predict the points A and B. The 1-nearest neighbor for both points is red (“Patent not granted”), the 3-nearest neighbor predicts point A (B) to be red (green) with probability 2/3, and the 5-nearest neighbor predicts again both points to be red with probabilities 4/5 and 3/5, respectively. getting new data, generating the same features to get the vector X , and then applying the model to produce Y . One common technique for supervised learning is logistic regression, a method you will already be familiar with. We will give an overview of some of the other methods used in machine learning. It is important to remember that as you use increasingly powerful classification methods, you need more data to train the models. k -nearest neighbor The method k-nearest neighbor (k-NN) is one of the simpler classification methods in machine learning. It belongs to a family of models sometimes known as memory-based models or instance-based models. An example is classified by finding its k nearest neighbors and taking majority vote (or some other aggregation function). We need two key things: a value for k and a distance metric with which to find the k nearest neighbors. Typically, different values of k are used to empirically find the best one. 
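To make the idea of tuning k concrete, here is a minimal k-nearest neighbor sketch (assuming scikit-learn is installed; the bundled breast cancer data set is used only as a convenient toy example):

# Trying several values of k for k-nearest neighbors, assuming scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# In practice you would rescale the features first (see the discussion that follows).
for k in [1, 3, 5, 15]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, round(knn.score(X_test, y_test), 3))   # accuracy on the holdout set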
Small values of k lead to predictions that have high variance but can capture the local structure of the data. Larger values of k build more global models that are lower in variance but may not capture local structure as well. Figure 6.4 provides an example for k = 1, 3, 5 nearest neighbors. The number of neighbors (k) is a parameter, and the prediction depends heavily on how it is determined; in this example, point B is classified differently if k = 3.

Training for k-NN simply means storing the data, making this method useful in applications where data arrive extremely quickly and a model needs to be updated frequently. All the work, however, gets pushed to scoring time, since all the distance calculations happen when a new data point needs to be classified. There are several optimized methods designed to make k-NN more efficient that are worth looking into if that situation applies to your problem.

In addition to selecting k and an appropriate distance metric, we also have to be careful about the scaling of the features. When distances between two data points are large for one feature and small for another, the method will rely almost exclusively on the first feature to find the closest points; the smaller distances on the second feature become nearly irrelevant to the overall distance. A similar problem occurs when continuous and categorical predictors are used together. To resolve such scaling issues, various rescaling options exist; a common approach is to center all features at mean 0 and scale them to variance 1.

There are several variations of k-NN. One of these is weighted nearest neighbors, where different features are weighted differently or different examples are weighted based on their distance from the example being classified. The method also has issues when the data are sparse and high dimensional, which means that every point is far away from virtually every other point, and hence pairwise distances tend to be uninformative. This can also happen when many features are irrelevant and drown out the relevant features' signal in the distance calculations.

Notice that the nearest-neighbor method can easily be applied to regression problems with a real-valued target variable. In fact, the method is completely oblivious to the type of target variable and can potentially be used to predict text documents, images, and videos, depending on the aggregation function applied after the nearest neighbors are found.

Support vector machines
Support vector machines (SVMs) are among the most popular and best-performing classification methods in machine learning today. The mathematics behind SVMs has many prerequisites that are beyond the scope of this book, but we will give you an intuition of how SVMs work, what they are good for, and how to use them.

We are all familiar with linear models that separate two classes by fitting a line in two dimensions (or a hyperplane in higher dimensions) in the middle (see Figure 6.5). An important decision that linear models have to make is which linear separator to prefer when several can be built.

Figure 6.5. Support vector machines. The figure shows candidate separators in a two-dimensional feature space (x1, x2), the optimal hyperplane, and the maximum margin.

You can see in Figure 6.5 that multiple lines offer a solution to the problem. Is any of them better than the others?
We can intuitively define a criterion to estimate the worth of the lines: A line is bad if it passes too close to the points because it will be noise sensitive and it will not generalize correctly. Therefore, our goal should be to find the line passing as far as possible from all points. The SVM algorithm is based on finding the hyperplane that maximizes the margin of the training data. The training examples that are closest to the hyperplane are called support vectors since they are supporting the margin (as the margin is only a function of the support vectors). An important concept to learn when working with SVMs is kernels. SVMs are a specific instance of a class of methods called kernel methods. So far, we have only talked about SVMs as linear models. Linear works well in high-dimensional data but sometimes you need nonlinear models, often in cases of low-dimensional data or in image or video data. Unfortunately, traditional ways of generating nonlinear models get computationally expensive since you have to explicitly generate all the features such as squares, cubes, and all the interactions. Kernels are a way to keep the efficiency of the linear machinery but still build models that can capture nonlinearity in the data without creating all the nonlinear features. You can essentially think of kernels as similarity functions and use them to create a linear separation of the data by (implicitly) mapping the data to a higher-dimensional space. Essentially, we take an n-dimensional input vector X , map it into a high-dimensional 166 6. Machine Learning (possibly infinite-dimensional) feature space, and construct an optimal separating hyperplane in this space. We refer you to relevant papers for more detail on SVMs and nonlinear kernels [334, 339]. SVMs are also related to logistic regression, but use a different loss/penalty function [159]. When using SVMs, there are several parameters you have to optimize, ranging from the regularization parameter C, which determines the tradeoff between minimizing the training error and minimizing model complexity, to more kernel-specific parameters. It is often a good idea to do a grid search to find the optimal parameters. Another tip when using SVMs is to normalize the features; one common approach to doing that is to normalize each data point to be a vector of unit length. Linear SVMs are effective in high-dimensional spaces, especially when the space is sparse such as text classification where the number of data points (perhaps tens of thousands) is often much less than the number of features (a hundred thousand to a million or more). SVMs are also fairly robust when the number of irrelevant features is large (unlike the k-NN approaches that we mentioned earlier) as well as when the class distribution is skewed, that is, when the class of interest is significantly less than 50% of the data. One disadvantage of SVMs is that they do not directly provide probability estimates. They assign a score based on the distance from the margin. The farther a point is from the margin, the higher the magnitude of the score. This score is good for ranking examples, but getting accurate probability estimates takes more work and requires more labeled data to be used to perform probability calibrations. In addition to classification, there are also variations of SVMs that can be used for regression [348] and ranking [70]. Decision trees Decision trees are yet another set of methods that are helpful for prediction. 
Typical decision trees learn a set of rules from training data, represented as a tree. An exemplary decision tree is shown in Figure 6.6. Each level of the tree splits it into branches using a feature and a value (or range of values). In the example tree, the first split is made on the feature number of visits in the past year with the value 4. The second level of the tree has two splits: one using average length of visit with the value 2 days, and the other with the value 10 days.

Figure 6.6. An exemplary decision tree. The top panel is the standard tree representation. The bottom panel offers an alternative view of the same tree: the feature space (number of visits in the past year versus average length of stay) is partitioned into rectangles, which makes the tree's nonlinear character more explicit.

Various algorithms exist to build decision trees; C4.5, CHAID, and CART (Classification and Regression Trees) are the most popular. Each needs to determine the next best feature to split on. The goal is to find feature splits that best reduce class impurity in the data, that is, a split that will ideally put all (or as many as possible) positive class examples on one side and all (or as many as possible) negative examples on the other side. One common measure of impurity that comes from information theory is entropy, calculated as

H(X) = − Σ_x p(x) log p(x).

Entropy is maximum (1) when both classes have equal numbers of examples in a node and minimum (0) when all examples in the node are from the same class. At each node in the tree, we can evaluate all the possible features and select the one that most reduces the entropy given the tree so far. This expected change in entropy is known as information gain and is one of the most common criteria used to create decision trees. Other measures used instead of information gain are the Gini index and chi-squared.

If we keep constructing the tree in this manner, always selecting the next best feature to split on, the tree ends up fairly deep and tends to overfit the data. To prevent overfitting, we can either impose a stopping criterion or prune the tree after it is fully grown. Common stopping criteria include a minimum number of data points required before doing another feature split, a maximum depth, and a maximum purity. Typical pruning approaches use holdout data (or cross-validation, which will be discussed later in this chapter) to cut off parts of the tree.

Once the tree is built, a new data point is classified by running it through the tree and, once it reaches a terminal node, using some aggregation function to give a prediction (classification or regression). A typical approach is the maximum likelihood estimate based on the examples in the leaf: if the leaf node contains 10 examples, 8 positive and 2 negative, any data point that ends up in that node gets an 80% probability of being positive. Trees used for regression are often built as described above, but a linear regression model is then fit at each leaf node.

Decision trees have several advantages. The interpretation of a tree is straightforward as long as the tree is not too large. Trees can be turned into a set of rules that experts in a particular domain can dig into, validate, and modify.
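As an illustration of turning a tree into readable rules, here is a minimal sketch (assuming scikit-learn; the bundled breast cancer data set is used only as a toy example):

# A minimal decision tree sketch, assuming scikit-learn; export_text prints the
# fitted tree as the kind of human-readable rules discussed above.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
# criterion="entropy" uses the information gain idea described above;
# max_depth is one simple stopping criterion to limit overfitting.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(data.data, data.target)

print(export_text(tree, feature_names=list(data.feature_names)))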
Trees also do not require too much feature engineering. There is no need to create interaction terms since trees can implicitly do that by splitting on two features, one after another. Unfortunately, along with these benefits come a set of disadvantages. Decision trees, in general, do not perform well, compared to SVMs, random forests, or logistic regression. They are also unstable: small changes in data can result in very different trees. The lack of stability comes from the fact that small changes in the training data may lead to different splitting points. As a consequence, the whole tree may take a different structure. The suboptimal predictive performance can be seen from the fact that trees partition the predictor space into a few rectangular regions, each one predicting only a single value (see the bottom part of Figure 6.6). 6.5. Methods 169 Ensemble methods Combinations of models are generally known as model ensembles. They are among the most powerful techniques in machine learning, often outperforming other methods, although at the cost of increased algorithmic and model complexity. The intuition behind building ensembles of models is to build several models, each somewhat different. This diversity can come from various sources such as: training models on subsets of the data; training models on subsets of the features; or a combination of these two. Ensemble methods in machine learning have two things in common. First, they construct multiple, diverse predictive models from adapted versions of the training data (most often reweighted or resampled). Second, they combine the predictions of these models in some way, often by simple averaging or voting (possibly weighted). Bagging Bagging stands for “bootstrap aggregation”:* we first create bootstrap samples from the original data and then aggregate the predictions using models trained on each bootstrap sample. Given a data set of size N, the method works as follows: 1. Create k bootstrap samples (with replacement), each of size N, resulting in k data sets. Only about 63% of the original training examples will be represented in any given bootstrapped set. 2. Train a model on each of the k data sets, resulting in k models. 3. For a new data point X , predict the output using each of the k models. 4. Aggregate the k predictions (typically using average or voting) to get the prediction for X . A nice feature of this method is that any underlying model can be used, but decision trees are often the most commonly used base model. One reason for this is that decision tress are typically high variance and unstable, that is, they can change drastically given small changes in data, and bagging is effective at reducing the variance of the overall model. Another advantage of bagging is that each model can be trained in parallel, making it efficient to scale to large data sets. Boosting Boosting is another popular ensemble technique, and it often results in improving the base classifier being used. In fact, ⋆ Bootstrap is a general statistical procedure that draws random samples of the original data with replacement. 170 6. Machine Learning if your only goal is improving accuracy, you will most likely find that boosting will achieve that. The basic idea is to keep training classifiers iteratively, each iteration focusing on examples that the previous one got wrong. At the end, you have a set of classifiers, each trained on smaller and smaller subsets of the training data. 
Given a new data point, all the classifiers predict the target, and a weighted average of those predictions is used to get the final prediction, where the weight is proportional to the accuracy of each classifier. The algorithm works as follows: 1. Assign equal weights to every example. 2. For each iteration: (a) Train classifier on the weighted examples. (b) Predict on the training data. (c) Calculate error of the classifier on the training data. (d) Calculate the new weighting on the examples based on the errors of the classifier. (e) Reweight examples. 3. Generate a weighted classifier based on the accuracy of each classifier. One constraint on the classifier used within boosting is that it should be able to handle weighted examples (either directly or by replicating the examples that need to be overweighted). The most common classifiers used in boosting are decision stumps (singlelevel decision trees), but deeper trees can also work well. Boosting is a common way to boost the performance of a classification method but comes with additional complexity, both in the training time and in interpreting the predictions. A disadvantage of boosting is that it is difficult to parallelize since the next iteration of boosting relies on the results of the previous iteration. A nice property of boosting is its ability to identify outliers: examples that are either mislabeled in the training data, or are inherently ambiguous and hard to categorize. Because boosting focuses its weight on the examples that are more difficult to classify, the examples with the highest weight often turn out to be outliers. On the other hand, if the number of outliers is large (lots of noise in the data), these examples can hurt the performance of boosting by focusing too much on them. 6.5. Methods Random forests Given a data set of size N and containing M features, the random forest training algorithm works as follows: 1. Create n bootstrap samples from the original data of size N. Remember, this is similar to the first step in bagging. Typically n ranges from 100 to a few thousand but is best determined empirically. 2. For each bootstrap sample, train a decision tree using m features (where m is typically much smaller than M ) at each node of the tree. The m features are selected uniformly at random from the M features in the data set, and the decision tree will select the best split among the m features. The value of m is held constant during the forest growing. 3. A new test example/data point is classified by all the trees, and the final classification is done by majority vote (or another appropriate aggregation method). Random forests are probably the most accurate classifiers being used today in machine learning. They can be easily parallelized, making them efficient to run on large data sets, and can handle a large number of features, even with a lot of missing values. Random forests can get complex, with hundreds or thousands of trees that are fairly deep, so it is difficult to interpret the learned model. At the same time, they provide a nice way to estimate feature importance, giving a sense of what features were important in building the classifier. Another nice aspect of random forests is the ability to compute a proximity matrix that gives the similarity between every pair of data points. This is calculated by computing the number of times two examples land in the same terminal node. The more that happens, the closer the two examples are. 
We can use this proximity matrix for clustering, locating outliers, or explaining the predictions for a specific example. Stacking Stacking is a technique that deals with the task of learning a meta-level classifier to combine the predictions of multiple base-level classifiers. This meta-algorithm is trained to combine the model predictions to form a final set of predictions. This can be used for both regression and classification. The algorithm works as follows: 1. Split the data set into n equal-sized sets: set1 , set2 , . . . , setn . 171 172 6. Machine Learning 2. Train base models on all possible combinations of n − 1 sets and, for each model, use it to predict on seti what was left out of the training set. This would give us a set of predictions on every data point in the original data set. 3. Now train a second-stage stacker model on the predicted classes or the predicted probability distribution over the classes from the first-stage (base) model(s). By using the first-stage predictions as features, a stacker model gets more information on the problem space than if it were trained in isolation. The technique is similar to cross-validation, an evaluation methodology that we will cover later in this chapter. Neural networks and deep learning Neural networks are a set of multi-layer classifiers where the outputs of one layer feed into the inputs of the next layer. The layers between the input and output layers are called hidden layers, and the more hidden layers a neural network has, the more complex functions it can learn. Neural networks were popular in the 1980s and early 1990s, but then fell out of fashion because they were slow and expensive to train, even with only one or two hidden layers. Since 2006, a set of techniques has been developed that enable learning in deeper neural networks. These techniques have enabled much deeper (and larger) networks to be trained—people now routinely train networks with five to ten hidden layers. And it turns out that these perform far better on many problems than shallow neural networks (with just a single hidden layer). The reason for the better performance is the ability of deep nets to build up a complex hierarchy of concepts, learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text. Usually, with a supervised neural network you try to predict a target vector, Y , from a matrix of inputs, X . But when you train a deep neural network, it uses a combination of supervised and unsupervised learning. In an unsupervised neural network, you try to predict the matrix X using the same matrix X as the input. In doing this, the network can learn something intrinsic about the data without the help of a separate target or label. The learned information is stored as the weights of the network. Currently, deep neural networks are trendy and a lot of research is being done on them. It is, however, important to keep in mind that they are applicable for a narrow class of problems with which social scientists would deal and that they often require a lot more 6.6. Evaluation data than are available in most problems. Training deep neural networks also requires a lot of computational power, but that is less likely to be an issue for most people. Typical cases where deep learning has been shown to be effective involve lots of images, video, and text data. 
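For completeness, here is a minimal sketch of a small feed-forward neural network with two hidden layers (assuming scikit-learn; this is not a deep learning system, just an illustration of the layered structure described above on a toy data set):

# A minimal feed-forward neural network sketch, assuming scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Neural networks are sensitive to feature scales, so standardize first.
scaler = StandardScaler().fit(X_train)
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
clf.fit(scaler.transform(X_train), y_train)
print(round(clf.score(scaler.transform(X_test), y_test), 3))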
We are still in the early stages of development of deep learning methods, and the next few years will give us a much better understanding of why they are effective and the problems for which they are well suited.

6.6 Evaluation
The previous section introduced a variety of methods, all with certain pros and cons, and no single method is guaranteed to outperform the others for a given problem. This section focuses on evaluation methods, with three primary goals:

1. Model selection: How do we select a method to use? What parameters should we select for that method?
2. Performance estimation: How well will our model do once it is deployed and applied to new data?
3. Model understanding: A deeper understanding of the model can point to inaccuracies of existing methods and provide a better understanding of the data and the problem we are tackling.

This section covers commonly used evaluation methodologies as well as metrics. We start by describing common evaluation methodologies that use existing data and then move on to field trials. The methodologies described below apply to both regression and classification problems.

6.6.1 Methodology

In-sample evaluation
As social scientists, you already evaluate methods on how well they perform in-sample (on the set that the model was trained on). As we mentioned earlier in the chapter, the goal of machine learning methods is to generalize to new data, and validating models in-sample does not allow us to do that. We focus here on evaluation methodologies that allow us to optimize (as best as we can) for generalization performance. The methods are illustrated in Figure 6.7.

Figure 6.7. Validation methodologies: holdout set and cross-validation (out-of-sample holdout split; five-fold cross-validation of a data set of size N).

Out-of-sample and holdout set
The simplest way to focus on generalization is to pretend to generalize to new (unseen) data. One way to do that is to take the original data and randomly split them into two sets: a training set and a test set (sometimes also called the holdout or validation set). We can decide how much to keep in each set (typically the splits range from 50–50 to 80–20, depending on the size of the data set). We then train our models on the training set and classify the data in the test set, allowing us to estimate the relative performance of the methods.

One drawback of this approach is that we may be extremely lucky or unlucky with our random split. One way to get around that problem is to repeatedly create multiple training and test sets. We can then train on TR1 and test on TE1, train on TR2 and test on TE2, and so on. The performance measures on each test set give us an estimate of the performance of different methods and of how much that performance varies across different random splits.

Cross-validation
Cross-validation is a more sophisticated holdout training and testing procedure that removes some of the shortcomings of the holdout set approach. Cross-validation begins by splitting a labeled data set into k partitions (called folds); typically, k is set to 5 or 10. Cross-validation then iterates k times. In each iteration, one of the k folds is held out as the test set, while the other k − 1 folds are combined and used to train the model. A nice property of cross-validation is that every example is used in exactly one test set. Each iteration gives us a performance estimate, and these estimates can then be aggregated (typically averaged) to generate an overall estimate.
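As a concrete illustration, here is a minimal five-fold cross-validation sketch (assuming scikit-learn; the random forest and the bundled toy data set are just examples of a model and data you might evaluate this way):

# Five-fold cross-validation of a random forest, assuming scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=200, random_state=0)

# Each of the five folds is held out once; the result is one score per fold.
scores = cross_val_score(model, X, y, cv=5)
print(scores, round(scores.mean(), 3))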
An extreme case of cross-validation is leave-one-out cross-validation: given a data set of size N, we create N folds, iterating over each data point, holding it out as the test set, and training on the remaining N − 1 examples. This illustrates the benefit of cross-validation: it gives good generalization estimates (by training on as much of the data as possible) while making sure the model is tested on every data point.

Temporal validation
The cross-validation and holdout set approaches described above assume that the data have no time dependencies and that the distribution is stationary over time. This assumption is almost always violated in practice and affects performance estimates for a model. In most practical problems, we want to use a validation strategy that emulates the way in which our models will be used and provides an accurate performance estimate. We will call this temporal validation. For a given point in time ti, we train our models only on information available to us before ti, to avoid training on data from the "future." We then predict and evaluate on data from ti to ti + d and iterate, expanding the training window while keeping the test window size constant at d. Figure 6.8 shows this validation process with ti = 2010 and d = 1 year.

Figure 6.8. Temporal validation.

The test set window d depends on a few factors related to how the model will be deployed, so as to best emulate reality:

1. How far out in the future do predictions need to be made? For example, if the set of students who need to be targeted for interventions has to be finalized at the beginning of the school year for the entire year, then d = 1 year.
2. How often will the model be updated? If the model is being updated daily, then we can move the window by a day at a time to reflect the deployment scenario.
3. How often will the system get new data? If we are getting new data frequently, we can make predictions more frequently.

Temporal validation is similar to how time series models are evaluated and should be the validation approach used for most practical problems.
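Here is a minimal expanding-window sketch of temporal validation (assuming scikit-learn, NumPy, and pandas; the data frame, its column names, and the logistic regression model are hypothetical stand-ins for your own data and method):

# Expanding-window temporal validation on hypothetical yearly data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(0)
df = pd.DataFrame({
    "year": rng.choice([2010, 2011, 2012, 2013, 2014], size=1000),
    "x1": rng.normal(size=1000),
    "x2": rng.normal(size=1000),
})
df["outcome"] = (df["x1"] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Train on everything strictly before the test year, test on that year.
for test_year in [2011, 2012, 2013, 2014]:
    train, test = df[df["year"] < test_year], df[df["year"] == test_year]
    model = LogisticRegression(max_iter=1000).fit(train[["x1", "x2"]], train["outcome"])
    preds = model.predict(test[["x1", "x2"]])
    print(test_year, round(accuracy_score(test["outcome"], preds), 3))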
6.6.2 Metrics
The previous subsection focused on validation methodologies, assuming we have an evaluation metric in mind. This subsection goes over commonly used evaluation metrics.

You are probably familiar with using R², analysis of the residuals, and mean squared error (MSE) to evaluate the quality of regression models. For regression problems, the MSE is the average squared difference between the predictions ŷi and the true values yi; models with smaller MSE are better. The MSE itself is hard to interpret because it measures quadratic differences, so the root mean squared error (RMSE) is more intuitive: it measures differences on the original scale of the response variable. Yet another alternative is the mean absolute error (MAE), which measures the average absolute distance between predictions and true values.

We now describe some additional evaluation metrics commonly used in machine learning for classification. Before we dive into metrics, it is important to highlight that machine learning models for classification typically do not predict 0/1 values directly. SVMs, random forests, and logistic regression all produce a score (which is sometimes a probability) that is then turned into 0 or 1 based on a user-specified threshold. You might find that certain tools (such as sklearn) use a default value for that threshold (often 0.5), but it is important to know that this is an arbitrary threshold; you should select the threshold based on the data, the model, and the problem you are solving. We will cover that a little later in this section.

Once we have turned the real-valued predictions into 0/1 classifications, we can create a confusion matrix from these predictions, shown in Figure 6.9. Each data point belongs to either the positive class or the negative class, and for each data point the prediction of the classifier is either correct or incorrect. This is what the four cells of the confusion matrix represent.

Figure 6.9. A confusion matrix created from real-valued predictions:

                            Predicted class
                            1                       0                       Total
True class    1             True positives (TP)     False negatives (FN)    P′
              0             False positives (FP)    True negatives (TN)     N′
              Total         P                       N

We can use the confusion matrix to define several commonly used evaluation metrics. Accuracy is the ratio of correct predictions (both positive and negative) to all predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (TP + TN) / (P + N) = (TP + TN) / (P′ + N′),

where TP denotes true positives, TN true negatives, FP false positives, FN false negatives, and the remaining symbols denote the row and column totals in Figure 6.9. Accuracy is the most commonly described evaluation metric for classification but is surprisingly the least useful in practical situations (at least by itself). One problem with accuracy is that it does not give us an idea of the lift compared to a baseline. For example, in a classification problem with 95% of the data positive and 5% negative, a classifier with 85% accuracy is performing worse than a dumb classifier that predicts positive all the time (and therefore has 95% accuracy).

Two additional metrics that are often used are precision and recall, defined as

Precision = TP / (TP + FP) = TP / P,
Recall = TP / (TP + FN) = TP / P′

(see also Box 7.3). Precision measures the accuracy of the classifier when it predicts an example to be positive: it is the ratio of correctly predicted positive examples (TP) to all examples predicted as positive (TP + FP). This measure is also called positive predictive value in other fields. Recall measures the ability of the classifier to find positive examples: it is the ratio of correctly predicted positive examples (TP) to all positive examples in the data (TP + FN). This is also called sensitivity in other fields. You might have encountered another metric, specificity, in other fields; it is the true negative rate, the proportion of negatives that are correctly identified.

Another metric in common use is the F1 score, the harmonic mean of precision and recall:

F1 = 2 × Precision × Recall / (Precision + Recall)

(see also equation (7.1)). This is often used when you want to balance precision and recall.

There is often a tradeoff between precision and recall. By selecting different classification thresholds, we can vary and tune the precision and recall of a given classifier. A highly conservative classifier that predicts a 1 only when it is absolutely sure (say, a threshold of 0.9999) will most often be correct when it predicts a 1 (high precision) but will miss most 1s (low recall). At the other extreme, a classifier that says 1 to every data point (a threshold of 0.0001) will have perfect recall but low precision.
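Here is a minimal sketch of thresholding scores and computing these metrics (assuming scikit-learn and NumPy; the scores and labels are invented, but in practice the scores would come from a classifier's predict_proba output):

# Turning scores into 0/1 predictions at different thresholds and computing
# the confusion matrix, precision, recall, and F1, assuming scikit-learn.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true   = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_scores = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05])

for threshold in [0.5, 0.75]:
    y_pred = (y_scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(threshold, "TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn,
          "precision:", round(precision_score(y_true, y_pred), 2),
          "recall:", round(recall_score(y_true, y_pred), 2),
          "F1:", round(f1_score(y_true, y_pred), 2))

Raising the threshold makes the classifier more conservative, which typically raises precision and lowers recall, exactly the tradeoff described above.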
Figure 6.10 shows a precision–recall curve that is often used to represent the performance of a given classifier. If we care about optimizing over the entire precision–recall space, a useful metric is the area under the curve (AUC-PR), the area under the precision–recall curve. AUC-PR must not be confused with AUC-ROC, the area under the related receiver operating characteristic (ROC) curve; the ROC curve is created by plotting recall versus (1 − specificity). Both AUCs can be helpful metrics for comparing the performance of different methods, and the maximum value either AUC can take is 1.

Figure 6.10. Precision–recall curve.

If, however, we care about a specific part of the precision–recall curve, we have to look at finer-grained metrics. Let us consider an example from public health. Most public health agencies conduct inspections of various sorts to detect health hazard violations (lead hazards, for example). The number of possible places (homes or businesses) to inspect far exceeds the inspection resources typically available. Let us assume further that they can only inspect 5% of all possible places; they would clearly want to prioritize the inspection of places that are most likely to contain the hazard. In this case, the model will score and rank all the possible inspection places in order of hazard risk. We would then want to know what percentage of the top 5% (the ones that will get inspected) are likely to be hazards, which translates to the precision in the top 5% of the most confident predictions—precision at 5%, as it is commonly called (see Figure 6.11).

Figure 6.11. Precision or recall at different thresholds, plotted against the percent of the population acted on (classifiers A and B).

Precision at top k percent is a common class of metrics widely used in the information retrieval and search engine literature, where you want to make sure that the results retrieved at the top of the search results are accurate. More generally, this metric is often used in problems in which the class distribution is skewed and only a small percentage of the examples will be examined manually (inspections, investigations for fraud, etc.). The literature provides many case studies of such applications [219, 222, 307].

One last metric we want to mention is the class of cost-sensitive metrics, where different costs (or benefits) can be associated with the different cells of the confusion matrix. So far, we have implicitly assumed that every correct prediction and every error, whether for the positive class or the negative class, has equal costs and benefits. In many practical problems, that is not the case. For example, we may want to predict whether a patient in a hospital emergency room is likely to go into cardiac arrest in the next six hours. The cost of a false positive in this case is the cost of the intervention (which may be a few extra minutes of a physician's time), while the cost of a false negative could be death. This type of analysis allows us to calculate the expected value of the predictions of a classifier and select the model that optimizes this cost-sensitive metric.
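Precision at top k percent is not hard to compute by hand. Here is a minimal sketch (assuming NumPy; the function name, the simulated scores, and the skewed outcome are all invented for illustration):

# "Precision at top k percent": rank by score, keep only the slice that would
# actually be acted on, and compute precision within that slice.
import numpy as np

def precision_at_k(y_true, y_scores, k):
    """Precision among the top k fraction of scores (e.g., k=0.05 for the top 5%)."""
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores)
    n_top = max(1, int(round(k * len(y_scores))))
    top = np.argsort(-y_scores)[:n_top]    # indices of the highest-scoring examples
    return y_true[top].mean()              # share of true positives in that top slice

rng = np.random.RandomState(0)
y_true = rng.binomial(1, 0.05, size=1000)          # a rare, skewed outcome
y_scores = rng.uniform(size=1000) + 0.3 * y_true   # scores weakly related to the truth
print(round(precision_at_k(y_true, y_scores, 0.05), 3))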
6.7 Practical tips

Here we highlight some practical tips that will be helpful when working with machine learning methods.

6.7.1 Features

So far in this chapter, we have focused a lot on methods and process, and we have not discussed features in detail. In social science, they are not called features but instead are known as variables or predictors. Good features are what make machine learning systems effective. Feature generation (or feature engineering, as it is often called) is where the bulk of the time is spent in the machine learning process. As social science researchers or practitioners, you have spent a lot of time constructing features, using transformations, dummy variables, and interaction terms. All of that is still required and critical in the machine learning framework. One difference you will need to get comfortable with is that instead of carefully selecting a few predictors, machine learning systems tend to encourage the creation of lots of features and then empirically use holdout data to perform regularization and model selection. It is common to have models that are trained on thousands of features. Commonly used approaches to create features include:

• Transformations, such as log, square, and square root.
• Dummy (binary) variables: This is often done by taking categorical variables (such as city) and creating a binary variable for each value (one variable for each city in the data). These are also called indicator variables.
• Discretization: Several methods require features to be discrete instead of continuous. Several approaches exist to convert continuous variables into discrete ones, the most common of which is equal-width binning.
• Aggregation: Aggregate features often constitute the majority of features for a given problem. These aggregations use different aggregation functions (count, min, max, average, standard deviation, etc.), often over varying windows of time and space. For example, given urban data, we would want to calculate the number (and min, max, mean, variance) of crimes within an m-mile radius of an address in the past t months for varying values of m and t, and then use all of them as features in a classification problem.

In general, it is a good idea to have the complexity in the features and use a simple model, rather than using more complex models with simple features. Keeping the model simple makes it faster to train and easier to understand.

6.7.2 Machine learning pipeline

When working on machine learning projects, it is a good idea to structure your code as a modular pipeline so you can easily try different approaches and methods without major restructuring. The Python workbooks supporting this book will give you an example of a machine learning pipeline. A good pipeline will contain modules for importing data, doing exploration, feature generation, classification, and evaluation. You can then instantiate a specific workflow by combining these modules.

An important component of the machine learning pipeline is comparing different methods. With all the methods out there and all the hyperparameters they come with, how do we know which model to use and which hyperparameters to select? And what happens when we add new features to the model or when the data have "temporal drift" and change over time? One simple approach is to have a nested set of for loops that loop over all the methods you have access to, then enumerate all the hyperparameters for that method, create a cross-product, and loop over all of them, comparing them across different evaluation metrics and selecting the best one to use going forward. You can even add different feature subsets and time slices to this loop, as the example in the supporting workbooks will show and as the sketch below illustrates.
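Here is a minimal sketch of such a nested loop using scikit-learn. It is illustrative rather than definitive: the data are synthetic stand-ins (generated with make_classification), the two methods and their hyperparameter grids are arbitrary choices, and precision is used as the comparison metric only as an example.

from itertools import product
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Stand-in data; in practice X and y come from your feature generation step
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grids = {
    RandomForestClassifier: {'n_estimators': [100, 500], 'max_depth': [5, 20]},
    LogisticRegression: {'C': [0.01, 1.0, 100.0]},
}

results = []
for method, grid in grids.items():
    # Cross-product of all hyperparameter values for this method
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        clf = method(**params).fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        results.append((method.__name__, params,
                        precision_score(y_test, y_pred)))

# Select the method and hyperparameter combination with the best metric
print(max(results, key=lambda r: r[-1]))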
6.7.3 Multiclass problems

In the supervised learning section, we framed classification problems as binary classification problems with a 0 or 1 output. There are many problems where we have multiple classes, such as classifying companies into their industry codes or predicting whether a student will drop out, transfer, or graduate. Several solutions have been designed to deal with the multiclass classification problem:

• Direct multiclass: Use methods that can directly perform multiclass classification. Examples of such methods are K-nearest neighbor, decision trees, and random forests. Extensions of support vector machines exist for multiclass classification as well [86], but they can often be slow to train.
• Convert to one versus all (OVA): This is a common approach to solving multiclass classification problems using binary classifiers. Any problem with n classes can be turned into n binary classification problems, where each classifier is trained to distinguish between one class and all the other classes. A new example can be classified by combining the predictions from all n classifiers and selecting the class with the highest score. This is a simple and efficient approach, and one that is commonly used, but it suffers from each classification problem possibly having an imbalanced class distribution (due to the negative class being a collection of multiple classes). Another limitation of this approach is that it requires the scores of each classifier to be calibrated so that they are comparable across all of them.
• Convert to pairwise: In this approach, we create binary classifiers to distinguish between each pair of classes, resulting in n(n − 1)/2 binary classifiers. This results in a large number of classifiers, but each classifier usually has a balanced classification problem. A new example is classified by taking the predictions of all the binary classifiers and using majority voting.

6.7.4 Skewed or imbalanced classification problems

A lot of the problems you will deal with will not have uniform (balanced) distributions for both classes. This is often the case with problems in fraud detection, network security, and medical diagnosis, where the class of interest is not very common. The same is true in many social science and public policy problems around behavior prediction, such as predicting which students will not graduate on time, which children may be at risk of getting lead poisoning, or which homes are likely to be abandoned in a given city. You will notice that applying standard machine learning methods may result in all the predictions being for the most frequent category in such situations, making it problematic to detect the infrequent classes. There has been a lot of work in machine learning research on dealing with such problems [73, 217] that we will not cover in detail here. Common approaches to deal with class imbalance include oversampling from the minority class and undersampling from the majority class.
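The following is a minimal sketch of random oversampling of the minority class, written for this chapter rather than drawn from the supporting workbooks; the arrays are synthetic, the target ratio is an illustrative 10:1, and libraries such as imbalanced-learn provide more sophisticated resampling strategies.

import numpy as np

def oversample_minority(X_train, y_train, ratio=10, seed=0):
    """Resample minority-class rows (label 1) with replacement until the
    majority:minority ratio is at most `ratio`:1; returns new arrays."""
    rng = np.random.RandomState(seed)
    minority = np.where(y_train == 1)[0]
    majority = np.where(y_train == 0)[0]
    target = int(np.ceil(len(majority) / float(ratio)))
    n_extra = max(0, target - len(minority))
    extra = rng.choice(minority, size=n_extra, replace=True)
    X_new = np.concatenate([X_train, X_train[extra]])
    y_new = np.concatenate([y_train, y_train[extra]])
    return X_new, y_new

# Synthetic example: roughly 2% of the training labels are positive
rng = np.random.RandomState(0)
X_train = rng.randn(1000, 5)
y_train = (rng.rand(1000) < 0.02).astype(int)
X_res, y_res = oversample_minority(X_train, y_train, ratio=10)
print(y_train.mean(), y_res.mean())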
It is important to keep in mind that these sampling approaches do not need to result in a 1:1 ratio. Many supervised learning methods described in this chapter (such as SVMs) can work well even with a 10:1 imbalance. Also, it is critical to make sure that you only resample the training set; keep the distribution of the test set the same as that of the original data, since you will not know the class labels of new data in practice and will not be able to resample.

6.8 How can social scientists benefit from machine learning?

In this chapter, we have introduced you to some new methods (both unsupervised and supervised), validation methodologies, and evaluation metrics. All of these can benefit social scientists as they tackle problems in research and practice. In this section, we give a few concrete examples where what you have learned so far can be used to improve some social science tasks:

• Use of better prediction methods and methodology: Traditional statistics and the social sciences have not focused much on methods for prediction. Machine learning researchers have spent the past 30 years developing and adapting methods focusing on that task. We believe that there is a lot of value for social science researchers and practitioners in learning more about those methods, applying them, and even augmenting them [210]. Two common tasks that can be improved using better prediction methods are generating counterfactuals (essentially a prediction problem) and matching. In addition, holdout sets and cross-validation can be used as a model selection methodology with any existing regression and classification methods, resulting in improved model selection and error estimates.

• Model misspecification: Linear and logistic regressions are common techniques for data analysis in the social sciences. One fundamental assumption within both is that they are additive over parameters. Machine learning provides tools when this assumption is too limiting. Hainmueller and Hazlett [148], for example, reanalyze data that were originally analyzed with logistic regression and come to substantially different conclusions. They argue that their analysis, which is more flexible and based on supervised learning methodology, provides three additional insights when compared to the original model. First, predictive performance is similar or better, although they do not need an extensive search to find the final model specification as was done in the original analysis. Second, their model allows them to calculate average marginal effects that are mostly similar to the original analysis. However, for one covariate they find a substantially different result, which is due to model misspecification in the original model. Finally, the reanalysis also discovers interactions that were missed in the original publication.

• Better text analysis: Text is everywhere, but unfortunately humans are slow and expensive at analyzing text data. Thus, computers are needed to analyze large collections of text. Machine learning methods can help make this process more efficient. Feldman and Sanger [117] provide an overview of different automatic methods for text analysis. Grimmer and Stewart [141] give examples that are more specific to social scientists, and Chapter 7 provides more details on this topic.

• Adaptive surveys: Some survey questions have a large number of possible answer categories.
For example, international job classifications describe more than 500 occupational categories, and it is prohibitive to ask about all of them during the survey. Instead, respondents answer an open-ended question about their job, and machine learning algorithms can use the verbatim answers to suggest small sets of plausible answer options. The respondents can then select which option best describes their occupation, thus saving the cost of coding after the interview.

• Estimating heterogeneous treatment effects: A standard approach to causal inference is the assignment of different treatments (e.g., medicines) to the units of interest (e.g., patients). Researchers then usually calculate the average treatment effect: the average difference in outcomes for the two groups. It is also of interest whether treatment effects differ for various subgroups (e.g., is a medicine more effective for younger people?). Traditional subgroup analysis has been criticized and challenged by various machine learning techniques [138, 178].

• Variable selection: Although there are many methods for variable selection, regularized methods such as the lasso are highly effective and efficient when faced with large amounts of data. Varian [386] goes into more detail and gives other methods from machine learning that can be useful for variable selection. We can also find interactions between pairs of variables (to feed into other models) using random forests, by looking at variables that co-occur in the same tree and by calculating the strength of the interaction as a function of how many trees they co-occur in, how high they occur in the trees, and how far apart they are in a given tree.

6.9 Advanced topics

This has been a short but intense introduction to machine learning, and we have left out several important topics that are useful and interesting for you to know about and that are being actively researched in the machine learning community. We mention them here so you know what they are, but we will not describe them in detail. These include:

• Semi-supervised learning,
• Active learning,
• Reinforcement learning,
• Streaming data,
• Anomaly detection,
• Recommender systems.

6.10 Summary

Machine learning is an active research field, and in this chapter we have given you an overview of how the work developed in this field can be used by social scientists. We covered the overall machine learning process, methods, evaluation approaches and metrics, and some practical tips, as well as how all of this can benefit social scientists. The material described in this chapter is a snapshot of a fast-changing field, and as we see increasing collaborations between machine learning researchers and social scientists, the hope and expectation is that the next few years will bring advances that allow us to tackle social and policy problems much more effectively using new types of data and improved methods.

6.11 Resources

Literature for further reading that also explains most topics from this chapter in greater depth:

• Hastie et al.'s The Elements of Statistical Learning [159] is a classic and is available online for free.
• James et al.'s An Introduction to Statistical Learning [187], by some of the same authors, includes less mathematics and is more approachable. It is also available online.
• Mitchell's Machine Learning [258] is a good introduction to some of the methods and gives a good motivation underlying them.
• Provost and Fawcett's Data Science for Business [311] is a good practical handbook for using machine learning to solve real-world problems.
• Wu et al.'s "Top 10 Algorithms in Data Mining" [409].

Software:

• Python (with libraries like scikit-learn, pandas, and more).
• R has many relevant packages [168].
• Cloud-based: AzureML, Amazon ML.
• Free: KNIME, Rapidminer, Weka (mostly for research use).
• Commercial: IBM Modeler, SAS Enterprise Miner, Matlab.

Many excellent courses are available online [412], including Hastie and Tibshirani's Statistical Learning [158]. Major conferences in this area include the International Conference on Machine Learning [177], the Annual Conference on Neural Information Processing Systems (NIPS) [282], and the ACM International Conference on Knowledge Discovery and Data Mining (KDD) [204].

Chapter 7
Text Analysis
Evgeny Klochikhin and Jordan Boyd-Graber

This chapter provides an overview of how social scientists can make use of one of the most exciting advances in big data: text analysis. Vast amounts of data that are stored in documents can now be analyzed and searched so that different types of information can be retrieved. Documents (and the underlying activities of the entities that generated the documents) can be categorized into topics or fields as well as summarized. In addition, machine translation can be used to compare documents in different languages.

7.1 Understanding what people write

You wake up and read the newspaper, a Facebook post, or an academic article a colleague sent you. You, like other humans, can digest and understand rich information, but an increasingly central challenge for humans is to cope with the deluge of information we are supposed to read and understand. In our use case of science, even Aristotle struggled with categorizing areas of science; the vast increase in the scope of written research has only made the challenge greater.

One approach is to use rule-based methods to tag documents for categorization. Businesses used to employ human beings to read the news and tag documents on topics of interest for senior management. The rules on how to assign these topics and tags were developed and communicated to these human beings beforehand. Such a manual categorization process is still common in multiple applications, e.g., systematic literature reviews [52].

However, as anyone who has used a search engine knows, newer approaches exist to categorize text and help humans cope with overload: computer-aided text analysis. Text data can be used to enrich "conventional" data sources, such as surveys and administrative data, since the words spoken or written by individuals often provide more nuanced and unanticipated insights. Chapter 3 discusses how to link data to create larger, more diverse data sets; the linked data sets need not just be numeric, but can also include text data.
◮ See Chapter 3.

Example: Using text to categorize scientific fields

The National Center for Science and Engineering Statistics, the US statistical agency charged with collecting statistics on science and engineering, uses a rule-based system to manually create categories of science; these are then used to categorize research as "physics" or "economics" [262, 288]. In a rule-based system there is no ready response to the question "how much do we spend on climate change, food safety, or biofuels?" because existing rules have not created such categories.
Text analysis techniques can be used to provide such detail without manual collation. For example, data about research awards from public sources and about people funded on research grants from UMETRICS can be linked with data about their subsequent publications and related student dissertations from ProQuest. Both award and dissertation data are text documents that can be used to characterize what research has been done, provide information about which projects are similar within or across institutions, and potentially identify new fields of study [368].

Overall, text analysis can help with specific tasks that define application-specific subfields, including the following:

• Searches and information retrieval: Text analysis tools can help find relevant information in large databases. For example, we used these techniques in systematic literature reviews to facilitate the discovery and retrieval of relevant publications related to early grade reading in Latin America and the Caribbean.
• Clustering and text categorization: Tools like topic modeling can provide a big picture of the contents of thousands of documents in a comprehensible format by discovering only the most important words and phrases in those documents.
• Text summarization: Similar to clustering, text summarization can provide value in processing large documents and text corpora. For example, Wang et al. [393] use topic modeling to produce category-sensitive text summaries and annotations on large-scale document collections.
• Machine translation: Machine translation is an example of a text analysis method that provides quick insights into documents written in other languages.

7.2 How to analyze text

Human language is complex and nuanced, which makes analysis difficult. We often make simplifying assumptions: we assume our input is perfect text; we ignore humor [149] and deception [280, 292]; and we assume "standard" English [212]. Recognizing this complexity, the goal of text mining is to reduce the complexity of text and extract important messages in a comprehensible and meaningful way. This objective is usually achieved through text categorization or automatic classification. These tools can be used in multiple applications to gain salient insights into the relationships between words and documents. Examples include using machine learning to analyze the flow and topic segmentation of political debates and behaviors [276, 279] and to assign automated tags to documents [379].

Information retrieval has a similar objective of extracting the most important messages from textual data that answer a particular query. The process analyzes the full text or metadata related to documents and allows only relevant knowledge to be discovered and returned to the query maker. Typical information retrieval tasks include knowledge discovery [263], word sense disambiguation [269], and sentiment analysis [295].

The choice of appropriate tools to address specific tasks depends significantly on the context and application. For example, document classification techniques can be used to gain insights into the general contents of a large corpus of documents [368], to discover a particular knowledge area, or to link corpora based on implicit semantic relationships [54]. In practical terms, some of the questions can be: How much does the US government invest in climate change research and nanotechnology? Or what are the main topics in the political debate on guns in the United States?
Or how can we build a salient and dynamic taxonomy of all scientific research?

We begin with a review of established techniques for analyzing text. Section 7.3 provides an overview of topic modeling, information retrieval and clustering, and other approaches, accompanied by practical examples and applications. Section 7.4 reviews key evaluation techniques used to assess the validity, robustness, and utility of derived results.
◮ See Chapter 6 for a discussion of speech recognition, which can turn spoken language into text.
◮ Classification, a machine learning method, is discussed in Chapter 6.

7.2.1 Processing text data
◮ Cleaning and processing are discussed extensively in Chapter 3.

The first important step in working with text data is cleaning and processing. Textual data are often messy and unstructured, which leads many researchers and practitioners to overlook their value. Depending on the source, cleaning and processing these data can require varying amounts of effort, but they typically involve a set of established techniques.

Text corpora

A set of multiple similar documents is called a corpus. For example, the Brown University Standard Corpus of Present-Day American English, or just the Brown Corpus [128], is a collection of processed documents from works published in the United States in 1961. The Brown Corpus was a historical milestone: it was a machine-readable collection of a million words across 15 balanced genres, with each word tagged with its part of speech (e.g., noun, verb, preposition). The British National Corpus [383] repeated the same process for British English at a larger scale. The Penn Treebank [251] provides additional information: in addition to part-of-speech annotation, it provides syntactic annotation. For example, what is the object of the sentence "The man bought the hat"? These standard corpora serve as training data for the classifiers and machine learning techniques that automatically analyze text [149]. However, not every corpus is effective for every purpose: the number and scope of documents determine the range of questions that you can ask and the quality of the answers you will get back. Too few documents result in a lack of coverage; too many of the wrong kind of documents invite confusing noise.

Tokenization

The first step in processing text is deciding what terms and phrases are meaningful. Tokenization separates sentences and terms from each other. The Natural Language Toolkit (NLTK) [39] provides simple reference implementations of standard natural language processing algorithms such as tokenization; for example, sentences are separated from each other using punctuation such as the period, question mark, or exclamation mark. However, this does not cover all cases, such as quotes, abbreviations, or informal communication on social media. While separating sentences in a single language is hard enough, some documents "code-switch," combining multiple languages in a single document. These complexities are best addressed through data-driven machine learning frameworks [209].
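For instance, a minimal sketch of sentence and word tokenization with NLTK might look like the following; it assumes the punkt tokenizer models have been downloaded (nltk.download('punkt')), and the example text is made up.

import nltk

text = ("Hollywood studios are preparing to let people download and buy "
        "electronic copies of movies over the Internet. Will that change "
        "how films are distributed?")

sentences = nltk.sent_tokenize(text)                  # split text into sentences
tokens = [nltk.word_tokenize(s) for s in sentences]   # split sentences into tokens
print(tokens[0][:6])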
Stop words

Once the tokens are clearly separated, it is possible to perform further text processing at a more granular, token level. Stop words are a category of words that have limited semantic meaning regardless of the document contents. Such words can be prepositions, articles, common nouns, and so on. For example, the word "the" accounts for about 7% of all words in the Brown Corpus, and "to" and "of" are more than 3% each [247]. Hapax legomena are rarely occurring words that might have only one instance in the entire corpus. These words (names, misspellings, or rare technical terms) are also unlikely to bear significant contextual meaning. Similar to stop words, these tokens are often disregarded in further modeling, either by the design of the method or by manual removal from the corpus before the actual analysis.

N-grams

Individual words, however, are sometimes not the correct unit of analysis. For example, blindly removing stop words can obscure important phrases such as "systems of innovation," "cease and desist," or "commander in chief." Identifying these N-grams requires looking for statistical patterns to discover phrases that often appear together in fixed patterns [102]. These combinations of phrases are often called collocations, as their overall meaning is more than the sum of their parts.

Stemming and lemmatization

Text normalization is another important aspect of preprocessing textual data. Given the complexity of natural language, words can take multiple forms depending on the syntactic structure, with limited change to their original meaning. For example, the word "system" morphologically has a plural "systems" or an adjective "systematic." All these words are semantically similar and, for many tasks, should be treated the same. For example, if a document has the word "system" occurring three times, "systems" once, and "systematic" twice, one can assume that the word "system," with similar meaning and morphological structure, can cover all instances, and that variance should be reduced to "system" with six instances.

The process of text normalization is often implemented using established lemmatization and stemming algorithms. A lemma is the original dictionary form of a word. For example, "go," "went," and "goes" will all have the lemma "go." The stem is a central part of a given word bearing its primary semantic meaning and uniting a group of similar lexical units. For example, the words "order" and "ordering" will have the same stem, "ord." Morphy (a lemmatizer provided by the electronic dictionary WordNet), the Lancaster Stemmer, and the Snowball Stemmer are common tools used to derive lemmas and stems for tokens, and all have implementations in the NLTK [39].

All text-processing steps are critical to successful analysis. Some of them bear more importance than others, depending on the specific application, research questions, and properties of the corpus. Having all these tools ready is imperative to producing a clean input for subsequent modeling and analysis. Some simple rules should be followed to prevent typical errors. For example, stop words should not be removed before performing n-gram indexing, and a stemmer should not be used where the data are complex and require accounting for all possible forms and meanings of words. Reviewing interim results at every stage of the process can be helpful.
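As a quick illustration of the lemmatization and stemming tools just mentioned, here is a hedged sketch using NLTK's WordNet lemmatizer and two stemmers; the WordNet lemmatizer requires the wordnet data (nltk.download('wordnet')), and the example words are arbitrary.

from nltk.stem import WordNetLemmatizer, LancasterStemmer, SnowballStemmer

words = ["systems", "went", "ordering"]

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w) for w in words])   # dictionary forms (treated as nouns by default)
print(lemmatizer.lemmatize("went", pos="v"))      # 'go' when tagged as a verb

lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")
print([lancaster.stem(w) for w in words])          # aggressive stems
print([snowball.stem(w) for w in words])           # gentler stems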
7.2.2 How much is a word worth?
◮ Term weighting is an example of feature engineering discussed in Chapter 6.

Not all words are worth the same; in an article about electronics, "capacitor" is more important than "aspect." Appropriately weighting and calibrating words is important for both human and machine consumers of text data: humans do not want to see "the" as the most frequent word of every document in summaries, and classification algorithms benefit from knowing which features are actually important to making a decision.

Weighting words requires balancing how often a word appears in a local context (such as a document) with how much it appears overall in the document collection. Term frequency–inverse document frequency (TFIDF) [322] is a weighting scheme that explicitly balances these factors and prioritizes the most meaningful words. The TFIDF model takes into account both the term frequency of a given token and its document frequency (Box 7.1), so that if a highly frequent word also appears in almost all documents, its meaning for the specific context of the corpus is negligible. Stop words are a good example of highly frequent words that bear limited meaning, since they appear in virtually all documents of a given corpus.

Box 7.1: TFIDF
For every token t and every document d in the corpus D, TFIDF is calculated as

    tfidf(t, d, D) = tf(t, d) × idf(t, D),

where term frequency is either a simple count,

    tf(t, d) = f(t, d),

or a more balanced quantity,

    tf(t, d) = 0.5 + 0.5 × f(t, d) / max{f(t, d) : t ∈ d},

and inverse document frequency is

    idf(t, D) = log( N / |{d ∈ D : t ∈ d}| ),

where N is the total number of documents in D.
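A direct, if inefficient, implementation of the quantities in Box 7.1 can be written in a few lines. This is a minimal sketch for illustration only; production tools such as scikit-learn's TfidfVectorizer or gensim use slightly different weighting conventions, and the example corpus is made up.

import math
from collections import Counter

def tfidf(token, doc, corpus):
    """TFIDF of `token` in `doc`, where `doc` is a list of tokens and
    `corpus` is a list of such documents (following Box 7.1)."""
    counts = Counter(doc)
    tf = 0.5 + 0.5 * counts[token] / float(max(counts.values()))  # balanced term frequency
    df = sum(1 for d in corpus if token in d)                      # document frequency
    idf = math.log(len(corpus) / float(df)) if df else 0.0
    return tf * idf

corpus = [["the", "capacitor", "stores", "charge"],
          ["the", "aspect", "ratio", "of", "the", "capacitor"],
          ["the", "cat", "sat"]]
print(tfidf("capacitor", corpus[0], corpus))
print(tfidf("the", corpus[0], corpus))   # zero: "the" appears in every document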
7.3 Approaches and applications

In this section, we discuss several approaches that allow users to perform an unsupervised analysis of large text corpora; that is, approaches that do not require an extensive investment of time from experts or programmers before one can begin to understand large text corpora. The ease of using these approaches provides additional opportunities for social scientists and policymakers to gain insights into policy and research questions through text analysis. First, we discuss topic modeling, an approach that discovers topics that constitute the high-level themes of a corpus. Topic modeling is often described as an information discovery process: describing what concepts are present in a corpus. Second, we discuss information retrieval, which finds the documents closest to a particular concept a user wants to discover. In contrast to topic modeling (which exposes the primary concepts of the corpus, heretofore unknown), information retrieval finds documents that express already known concepts. Other approaches can be used for document classification, sentiment analysis, and part-of-speech tagging.

7.3.1 Topic modeling

As topic modeling is a broad subfield of natural language processing and machine learning, we will restrict our focus to a single exemplar: latent Dirichlet allocation (LDA) [43]. LDA is a fully Bayesian extension of probabilistic latent semantic indexing [167], itself a probabilistic extension of latent semantic analysis [224]. Blei and Lafferty [42] provide a more detailed discussion of the history of topic models.

[Figure 7.1. Topics are distributions over words. Three example topics learned by latent Dirichlet allocation from a model with 50 topics discovered from the New York Times [324]: Topic 1 (computer, technology, system, service, site, phone, internet, machine) seems to be about technology, Topic 2 (sell, sale, store, product, business, advertising, market, consumer) about business, and Topic 3 (play, film, movie, theater, production, star, director, stage) about the arts.]

LDA, like all topic models, assumes that there are topics that form the building blocks of a corpus. Topics are distributions over words and are often shown as a ranked list of words, with the highest probability words at the top of the list (Figure 7.1). However, we do not know what the topics are a priori; the challenge is to discover what they are (more on this shortly).

In addition to assuming that there exist some number of topics that explain a corpus, LDA also assumes that each document in a corpus can be explained by a small number of topics. For example, taking the example topics from Figure 7.1, a document titled "Red Light, Green Light: A Two-Tone LED to Simplify Screens" would be about Topic 1, which appears to be about technology. However, a document like "Forget the Bootleg, Just Download the Movie Legally" would require all three of the topics. The set of topics that are used by a document is called the document's allocation (Figure 7.2). This terminology explains the name latent Dirichlet allocation: each document has an allocation over latent topics governed by a Dirichlet distribution.

[Figure 7.2. Allocations of documents to topics. New York Times headlines, including "Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens," "Internet Portals Begin to Distinguish among Themselves as Shopping Malls," "Stock Trades: A Better Deal For Investors Isn't Simple," "Forget the Bootleg, Just Download the Movie Legally," "Multiplex Heralded As Linchpin To Growth," "The Shape of Cinema, Transformed At the Click of a Mouse," and "A Peaceful Crew Puts Muppets Where Its Mouth Is," are linked to the topics they use: Topic 1 ("Technology"), Topic 2 ("Business"), and Topic 3 ("Entertainment").]

7.3.1.1 Inferring topics from raw text

Algorithmically, the problem can be viewed as a black box. Given a corpus and an integer K as input, provide the K topics that best describe the document collection: a process called posterior inference. The most common algorithm for solving this problem is a technique called Gibbs sampling [131].

Gibbs sampling works at the word level to discover the topics that best describe a document collection. Each word is associated with a single topic, explaining why that word appeared in a document. For example, consider the sentence "Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet." Each word in this sentence is associated with a topic: "Hollywood" might be associated with an arts topic; "buy" with a business topic; and "Internet" with a technology topic (Figure 7.3).

This is where we should eventually get. However, we do not know this to start, so we can initially assign words to topics randomly. This will result in poor topics, but we can make those topics better. We improve these topics by taking each word, pretending that we do not know its topic, and selecting a new topic for the word. A topic model wants to do two things: it does not want to use many topics in a document, and it does not want to use many words in a topic. So the algorithm will keep track of how many times a document d has used a topic k, N_{d,k}, and how many times a topic k has used a word w, V_{k,w}. For notational convenience, it will also be useful to keep track of the marginal counts of how many words are in a document,

    N_{d,·} ≡ Σ_k N_{d,k},
and how many words are associated with a topic,

    V_{k,·} ≡ Σ_w V_{k,w}.

[Figure 7.3. Each word is associated with a topic. Gibbs sampling inference iteratively resamples the topic assignments for each word to discover the most likely topic assignments that explain the document collection. The figure shows the sentence "Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's iTunes music store and other online services ..." with its words linked to the three example topics from Figure 7.1.]

The algorithm removes the counts for a word from N_{d,k} and V_{k,w} and then changes the topic of that word (hopefully to a better topic than the one it had before). Through many thousands of iterations of this process, the algorithm can find topics that are coherent, useful, and characterize the data well.

The two goals of topic modeling, balancing document allocations to topics and topics' distributions over words, come together in an equation that multiplies them together. A good topic will be both common in a document and explain a word's appearance well.

Example: Gibbs sampling for topic models

The topic assignment z_{d,n} of word n in document d is proportional to

    p(z_{d,n} = k) ∝ [(N_{d,k} + α) / (N_{d,·} + Kα)] × [(V_{k,w_{d,n}} + β) / (V_{k,·} + Vβ)],

where the first factor measures how much the document likes the topic and the second how much the topic likes the word; α and β are smoothing factors that prevent a topic from having zero probability if a topic does not use a word or a document does not use a topic [390], and V is the size of the vocabulary. Recall that we do not include the token that we are sampling in the counts for N or V.

For the sake of concreteness, assume that we have three documents with the following topic assignments, where each token is labeled with a letter and the number after the colon is its currently assigned topic:

Document 1: A dog:3, B cat:2, C cat:3, D pig:1
Document 2: E hamburger:2, F dog:3, G hamburger:1
Document 3: H iron:1, I iron:3, J pig:2, K iron:2

If we want to resample token B (the first instance of "cat" in document 1), we compute the conditional probability for each of the three topics (z = 1, 2, 3), here with α = β = 1:

    p(zB = 1) = (1 + 1)/(3 + 3) × (0 + 1)/(3 + 5) = 0.333 × 0.125 = 0.042,
    p(zB = 2) = (0 + 1)/(3 + 3) × (0 + 1)/(3 + 5) = 0.167 × 0.125 = 0.021, and
    p(zB = 3) = (2 + 1)/(3 + 3) × (1 + 1)/(4 + 5) = 0.500 × 0.222 = 0.111.

To reiterate, we do not include token B in these counts: in computing these conditional probabilities, we consider topic 2 as never appearing in the document and "cat" as never appearing in topic 2. However, "cat" does appear in topic 3 (token C), so it has a higher probability than the other topics. After renormalizing, our conditional probabilities are (0.24, 0.12, 0.64). We then sample the new assignment of token B to be topic 3 two times out of three. Griffiths and Steyvers [140] provide more details on the derivation of this equation.

Listing 7.1 provides a function to compute the conditional probability of a single word and return the (unnormalized) probability to sample from.

7.3.1.2 Applications of topic models

Topic modeling is most often used for topic exploration, allowing users to understand the contents of large text corpora.
Thus, topic models have been used, for example, to understand what the National Institutes of Health funds [368]; to compare and contrast what was discussed in the North and South in the Civil War [270]; and to understand how individuals code in large programming projects [253]. Topic models can also be used as features in more elaborate algorithms such as machine translation [170], detecting objects in images [392], or identifying political polarization [298].

def class_sample(docs, vocab, d, n, alpha, beta, theta, phi, num_topics):
    # Get the vocabulary ID of the word we are sampling
    word_type = docs[d][n]
    # Dictionary to store the final result
    result = {}
    # Consider each topic possibility
    for kk in range(num_topics):
        # theta stores the number of times document d uses
        # each topic kk; alpha is a smoothing parameter
        doc_contrib = (theta[d][kk] + alpha) / \
            (sum(theta[d].values()) + num_topics * alpha)
        # phi stores the number of times topic kk uses
        # this word type; beta is a smoothing parameter
        topic_contrib = (phi[kk][word_type] + beta) / \
            (sum(phi[kk].values()) + len(vocab) * beta)
        result[kk] = doc_contrib * topic_contrib
    return result

Listing 7.1. Python code to compute the conditional probability of a single word and return the probability from which to sample

7.3.2 Information retrieval and clustering

Information retrieval is a large subdiscipline that encompasses a variety of methods and approaches. Its main advantage is using large-scale empirical data to make analytical inferences and class assignments. Compared to topic modeling, discussed above, information retrieval techniques can use external knowledge repositories to categorize a given corpus as well as to discover smaller and emerging areas within a large database.

A central concept of information retrieval is the search query: usually a short phrase presented by a human or machine to retrieve a relevant answer to a question or to discover relevant knowledge. A good example of a large-scale information retrieval system is a search engine, such as Google or Yahoo!, that provides the user with an opportunity to search the entire Internet almost instantaneously. Such fast searches are achieved by complex techniques that are linguistic (set-theoretic), algebraic, probabilistic, or feature-based.

Set-theoretic operations and Boolean logic

Set-theoretic operations proceed from the assumption that any query is a set of linked components, all of which need to be present in the returned result for it to be relevant. Boolean logic serves as the basis for such queries; it uses Boolean operators such as AND, OR, and NOT to combine query components. For example, the query induction AND (physics OR logic) will retrieve all documents in which the word "induction" is used, whether in a physical or logical sense. The extended Boolean model and fuzzy retrieval are enhanced approaches to calculating the relevance of retrieved documents based on such queries [232].

Search queries can also be enriched by wildcards and other connectors. For example, the character "*" typically substitutes for any possible character or characters, depending on the settings of the query engine. (In some instances, search queries can run in non-greedy mode, in which case, for example, the phrase inform* might retrieve only text up to the end of the sentence.
On the other hand, a greedy query might retrieve the full text following the word or part of a word, denoted as inform*, up to the end of the document, which would essentially mean the same as inform*$.) The wildcard "?" expects either one or no character in its place, and the wildcard "." expects exactly one character. Search queries enhanced with such symbols and Boolean operators are referred to as regular expressions.

Various databases and search engines can interpret Boolean operators and wildcards differently depending on their settings and are therefore prone to return rather different results. This behavior should be expected and controlled for while running searches on different data sources.

Example: Discover food safety awards

Food safety is an interdisciplinary research area that spans multiple scientific disciplines, including the biological sciences, agriculture, and food science. To retrieve food safety-related awards, we have to construct a Boolean-based search string that looks for terms and phrases in those documents and returns only relevant results. Such a string is typically subdivided by category or search group, connected by the AND or OR operator:

1. General terms: (food safety OR food securit* OR food insecurit*)
2. Food pathogens: (food*) AND (acanthamoeba OR actinobacteri* OR (anaerobic organ*) OR DDT OR ...)
3. Biochemistry and toxicology: (food*) AND (toxicolog* OR (activated carbon*) OR (acid-hydrol?zed vegetable protein*) OR aflatoxin* OR ...)
4. Food processing and preservation: (food*) AND (process* OR preserv* OR fortif* OR extrac* OR ...)
5. Food quality and quality control: (food*) AND (qualit* OR (danger zon*) OR test* OR (risk analys*) OR ...)
6. Food-related diseases: (food* OR foodbo?rn* OR food-rela*) AND (diseas* OR hygien* OR allerg* OR diarrh?ea* OR nutrit* OR ...)

Different websites and databases use different search functions to return the most relevant results given a query (e.g., "food safety"). Ideally, a user has access to a full database and can apply the same Python code based on regular expressions to all textual data. However, this is not always possible (e.g., when using proprietary databases, such as Web of Science). In those cases, it is important to follow the conventions of the information retrieval system. For example, one source might need phrases to be embedded in parentheses (i.e., (x-ray crystallograph.*)) while another database interface would require such phrases to be contained within quotation marks (i.e., "x-ray crystallograph.*"). It is then critical to explore the search tips and rules on those databases to ensure that the most complete information is gathered and further analyzed.

Python's built-in re package provides all the capability needed to construct and search with complex regular expressions. Listing 7.2 provides an example.
import csv
import re

def fs_regex(nsf_award_abstracts, outfilename):
    # Construct simple search strings divided by search groups
    food = "food.*"
    general = "safety|secur.*|insecur.*"
    pathogens = "toxicolog.*|acid-hydrolyzed vegetable protein.*|activated carbon.*"
    process = "process.*|preserv.*|fortif.*"
    # and so on
    # Open csv table with all NSF award abstracts in 2000-2014
    inpfile = open(nsf_award_abstracts, 'rb')
    inpdata = csv.reader(inpfile)
    outfile = open(outfilename, 'wb')
    output = csv.writer(outfile)
    for line in inpdata:
        award_id = line[0]
        title = line[1]
        abstract = line[2]
        if re.search(food, abstract) and (re.search(general, abstract)
                or re.search(pathogens, abstract)
                or re.search(process, abstract)):
            output.writerow(line + ['food safety award'])

Listing 7.2. Python code to identify food safety-related NSF awards with regular expressions

Algebraic models

Algebraic models turn text into numbers in order to run mathematical operations and discover inherent interdependencies between terms and phrases, also defining the most important and meaningful among them. The vector space representation is a typical way of converting words into numbers, wherein every token is assigned a sequential ID and a respective weight, be it a simple term frequency, a TFIDF value, or any other assigned number. Latent Dirichlet allocation, discussed in the preceding section, is a good example of a probabilistic model, while machine learning techniques, such as random forests, can be used for feature-based modeling and information retrieval.
◮ Random forests are discussed for record linkage in Chapter 3 and described in more detail in Chapter 10.

Similarity measures and approaches

Based on algebraic models, the user can either compare documents with each other or train a model that can then be applied to a different corpus. Typical metrics involved in this process include cosine similarity and Kullback–Leibler divergence [218].

Cosine similarity is a popular measure in document classification. Given two documents d_a and d_b represented as term vectors t_a and t_b, the cosine similarity is

    SIM_C(t_a, t_b) = (t_a · t_b) / (|t_a| × |t_b|).

Example: Measuring cosine similarity between documents

NSF awards are not labeled by scientific field; they are labeled by program. This administrative classification is not always useful for assessing the effects of certain funding mechanisms on disciplines and scientific communities. One approach is to understand how awards align with each other even if they were funded by different programs. Cosine similarity allows us to do just that.

The Python numpy module is a powerful library of tools for efficient linear algebra computation. Among other things, it can be used to compute the cosine similarity of two documents represented by numeric vectors, as described above. The gensim module, often used as a Python-based topic modeling implementation, can be used to produce vector space representations of textual data. Listing 7.3 provides an example of measuring cosine similarity using these modules.
import csv
import numpy as np
from gensim import corpora

# Define cosine similarity function
def coss(v1, v2):
    return np.dot(v1, v2) / (np.sqrt(np.sum(np.square(v1))) *
                             np.sqrt(np.sum(np.square(v2))))

def coss_nsf(nsf_climate_change, nsf_earth_science, outfilename):
    # Open the source documents and the documents they are compared to
    source = csv.reader(open(nsf_climate_change, 'rb'))
    comparison = csv.reader(open(nsf_earth_science, 'rb'))
    # Create an output file
    output = csv.writer(open(outfilename, 'wb'))
    # Read through the source file and store values in a static data container
    data = {}
    for row in source:
        award_id = row[0]
        abstract = row[1]
        data[award_id] = abstract
    # Read through the comparison file and compute similarity
    for row in comparison:
        award_id = row[0]
        # Assuming that the abstract is cleaned, processed, tokenized, and
        # stored as a space-separated string of tokens
        abstract = row[1]
        abstract_tokens = abstract.split(" ")
        # Construct a dictionary mapping tokens to IDs
        dict_abstract = corpora.Dictionary([abstract_tokens])
        # Construct a vector of all token IDs and counts in the abstract
        abstr_vector = dict(dict_abstract.doc2bow(abstract_tokens))
        # Iterate through all stored abstracts in the source corpus
        # and assign the same token IDs using the dictionary
        for source_id, source_abstract in data.items():
            # Get all tokens from the source abstract, assuming it is
            # tokenized and space-separated
            source_tokens = source_abstract.split(" ")
            source_vector = dict(dict_abstract.doc2bow(source_tokens))
            # Cosine similarity requires vectors of the same shape, so
            # impute zeros for any tokens of the target abstract that are
            # missing from the source abstract
            add = {n: 0 for n in abstr_vector.keys()
                   if n not in source_vector.keys()}
            source_vector.update(add)
            source_items = sorted(source_vector.items())
            abstr_items = sorted(abstr_vector.items())
            # Compute cosine similarity between the two count vectors
            similarity = coss(np.array([item[1] for item in abstr_items]),
                              np.array([item[1] for item in source_items]))
            output.writerow([source_id, award_id, similarity])

Listing 7.3. Python code to measure cosine similarity between Climate Change and all other Earth Science NSF awards

Kullback–Leibler (KL) divergence is an asymmetric measure that is often enhanced by averaged calculations to ensure unbiased results when comparing documents with each other or running a classification task. Given two term vectors t_a and t_b, the KL divergence from vector t_a to t_b is

    D_KL(t_a || t_b) = Σ_{t=1}^{m} w_{t,a} × log(w_{t,a} / w_{t,b}),

where w_{t,a} and w_{t,b} are the term weights in the two vectors, respectively. An averaged KL divergence metric is then defined as

    D_AvgKL(t_a || t_b) = Σ_{t=1}^{m} (π1 × D(w_{t,a} || w_t) + π2 × D(w_{t,b} || w_t)),

where π1 = w_{t,a} / (w_{t,a} + w_{t,b}), π2 = w_{t,b} / (w_{t,a} + w_{t,b}), and w_t = π1 × w_{t,a} + π2 × w_{t,b} [171]. The Python scikit-learn library provides an implementation of these measures as well as other machine learning models and approaches.
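To make these definitions concrete, here is a minimal numpy sketch, written for this chapter rather than taken from the supporting workbooks, that computes the KL divergence and its averaged variant for two hypothetical term-weight vectors over the same vocabulary; zero weights are smoothed with a small constant so the logarithms stay finite.

import numpy as np

def kl(wa, wb, eps=1e-12):
    wa = np.asarray(wa, dtype=float) + eps
    wb = np.asarray(wb, dtype=float) + eps
    return np.sum(wa * np.log(wa / wb))

def avg_kl(wa, wb, eps=1e-12):
    wa = np.asarray(wa, dtype=float) + eps
    wb = np.asarray(wb, dtype=float) + eps
    pi1 = wa / (wa + wb)
    pi2 = wb / (wa + wb)
    w = pi1 * wa + pi2 * wb
    return np.sum(pi1 * (wa * np.log(wa / w)) + pi2 * (wb * np.log(wb / w)))

ta = [0.10, 0.40, 0.20, 0.30]   # e.g., term weights for document a
tb = [0.05, 0.35, 0.40, 0.20]   # e.g., term weights for document b
print(kl(ta, tb))       # asymmetric
print(avg_kl(ta, tb))   # averaged, symmetric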
Knowledge repositories

Information retrieval can be significantly enriched by the use of established knowledge repositories, which provide enormous amounts of organized empirical data for modeling and relevance calculations. Established corpora, such as the Brown Corpus and the Lancaster–Oslo–Bergen Corpus, are one type of such preprocessed repositories. Wikipedia and WordNet are examples of another type of lexical and semantic resource: they are dynamic in nature and can provide a valuable basis for consistent and salient information retrieval and clustering.

These repositories have an innate hierarchy, or ontology, of words (and concepts) that are explicitly linked to each other, either by inter-document links (Wikipedia) or by the inherent structure of the repository (WordNet). In Wikipedia, concepts can thus be considered as the titles of individual Wikipedia pages, and the contents of these pages can be considered as their extended semantic representation.

Information retrieval techniques build on these advantages of WordNet and Wikipedia. For example, Meij et al. [256] mapped search queries to the DBpedia ontology (derived from Wikipedia topics and their relationships) and found that this mapping enriches the search queries with additional context and concept relationships. One way of using these ontologies is to retrieve a predefined list of Wikipedia pages that match a specific taxonomy. For example, scientific disciplines are an established way of tagging documents: some are in physics, others in chemistry, engineering, or computer science. If a user retrieves four Wikipedia pages on "Physics," "Chemistry," "Engineering," and "Computer Science," these pages can be further mapped to a given set of scientific documents to label and classify them, such as a corpus of award abstracts from the US National Science Foundation.

Personalized PageRank is a similarity system that can help with this task. The system uses WordNet to assess semantic relationships and relevance between a search query (document d) and possible results (the most similar Wikipedia article or articles). It has been applied to text categorization [269] by comparing documents to semantic model vectors of Wikipedia pages constructed using WordNet. These vectors account for term frequency and the relative importance of terms given their place in the WordNet hierarchy, so that the semantic model vector of a Wikipedia article wiki is defined as

    SMV_wiki(s) = Σ_{w ∈ Synonyms(s)} tf_wiki(w) / |Synsets(w)|,

where w is a token within wiki, s is a WordNet synset that is associated with every token w in the WordNet hierarchy, Synonyms(s) is the set of words (i.e., synonyms) in the synset s, tf_wiki(w) is the term frequency of the word w in the Wikipedia article wiki, and Synsets(w) is the set of synsets for the word w.

The overall probability of a candidate document d (e.g., an NSF award abstract or a PhD dissertation abstract) matching the target query, or in our case a Wikipedia article wiki, is

    wikiBEST = Σ_{w_t ∈ doc} max_{s ∈ Synsets(w_t)} SMV_wiki(s),

where Synsets(w_t) is the set of synsets for the word w_t in the target document (e.g., an NSF award abstract) and SMV_wiki(s) is the semantic model vector of a Wikipedia page, as defined above.

Applications

Information retrieval can be used in a number of applications. Knowledge discovery, or information extraction, is perhaps its primary mission; in contrast, for users, the purpose of information retrieval applications is to retrieve the most relevant response to a query.

Document classification is another popular task where information retrieval methods can be helpful. Such systems, however, typically require a two-step process: the first phase defines all relevant information needed to answer the query; the second phase clusters the documents according to a set of rules or by allowing the machine to actively learn the patterns and classes. For example, one approach is to generate a taxonomy of concepts with associated Wikipedia pages and then map other documents to these pages through Personalized PageRank.
In this case, disciplines such as physics, chemistry, and engineering can be used as the original labels, and NSF award abstracts can be mapped to these disciplinary categories through the similarity metrics (i.e., whichever of these disciplines scores the highest is the most likely to fit the disciplinary profile of an award abstract).

Another approach is to use the Wikipedia structure as a clustering mechanism in itself. For example, the article about "nanotechnology" links to a number of other Wikipedia pages referenced in its content. "Quantum realm," "nanometer," and "National Nanotechnology Initiative" are among the meaningful concepts used in the description of nanotechnology that also have their own individual Wikipedia pages. Using these pages, we can assume that if a scientific document, such as an NSF award abstract, has enough similarity with any one of the articles associated with nanotechnology, it can be tagged as such in the classification exercise.

The process can also be turned around: if the user knows exactly the clusters of documents in a given corpus, these can be mapped to an external knowledge repository, such as Wikipedia, to discover as yet unknown and emerging relationships between concepts that are not explicitly mentioned in the Wikipedia ontology at the current moment. This situation is likely, given the time lag between the discovery of new phenomena, their introduction to the research community, and their adoption by the wider user community responsible for writing Wikipedia pages.

Examples

Some examples from our recent work can demonstrate how Wikipedia-based labeling and labeled LDA [278, 315] cope with the task of document classification and labeling in the scientific domain. See Table 7.1.

Table 7.1. Wikipedia articles as potential labels generated by n-gram indexing of NSF awards

Award 1
  Abstract excerpt: Reconfigurable computing platform for small-scale resource-constrained robot. Specific applications often require robots of small size for reasons such as costs, access, and stealth. Small-scale robots impose constraints on resources such as power or space for modules . . .
  ProQuest subject category: Engineering, Electronics and Electrical; Engineering, Robotics
  Labeled LDA: Motor controller
  Wikipedia-based labeling: Robotics, Robot, Field-programmable gate array

Award 2
  Abstract excerpt: Genetic mechanisms of thalamic nuclei specification and the influence of thalamocortical axons in regulating neocortical area formation. Sensory information from the periphery is essential for all animal species to learn, adapt, and survive in their environment. The thalamus, a critical structure in the diencephalon, receives sensory information . . .
  ProQuest subject category: Biology, Neurobiology
  Labeled LDA: HSD2 neurons
  Wikipedia-based labeling: Sonic hedgehog, Induced stem cell, Nervous system

Award 3
  Abstract excerpt: Poetry 'n acts: The cultural politics of twentieth-century American poets' theater. This study focuses on the disciplinary blind spot that obscures the productive overlap between poetry and dramatic theater and prevents us from seeing the cultural work that this combination can perform . . .
  ProQuest subject category: Literature, American; Theater
  Labeled LDA: Audience
  Wikipedia-based labeling: Counterculture of the 1960s, Novel, Modernism
Labeled LDA Wikipediabased labeling Motor controller Robotics, Robot, Fieldprogrammable gate array HSD2 neurons Sonic hedgehog, Induced stem cell, Nervous system Audience Counterculture of the 1960s, Novel, Modernism are rich, well-developed subdisciplines of computer science that can help analyze text data. While covering these subfields is beyond this chapter, we briefly discuss some of the most widely used approaches to process and understand natural language texts. In contrast to the unsupervised approaches discussed above, most techniques in natural language processing are supervised machine learning algorithms. Supervised machine learning produce labels y given inputs x—the algorithm’s job is to learn how to automatically produce correct labels given automatic inputs x. However, the algorithm must have access to many examples of x and y, often of the order of thousands of examples. This is expensive, as the labels often require linguistic expertise [251]. While it is possible to annotate data using crowdsourcing [350], this is not a panacea, as it often forces compromises in the complexity of the task or the quality of the labels. 7.3. Approaches and applications In the sequel, we discuss how different definitions of x and y— both in the scope and structure of the examples and labels—define unique analyses of linguistic data. Document classification If the examples x are documents and y are what these documents are about, the problem is called document classification. In contrast to the techniques in Section 7.3.1, document classification is used when you know the specific document types for which you are looking and you have many examples of those document types. One simple but ubiquitous example of document classification is spam detection: an email is either an unwanted advertisement (spam) or it is not. Document classification techniques such as naïve Bayes [235] touch essentially every email sent worldwide, making email usable even though most emails are spam. Sentiment analysis Instead of being what a document is about, a label y could also reveal the speaker. A recent subfield of natural language processing is to use machine learning to reveal the internal state of speakers based on what they say about a subject [295]. For example, given an example of sentence x, can we determine whether the speaker is a Liberal or a Conservative? Is the speaker happy or sad? Simple approaches use dictionaries and word counting methods [299], but more nuanced approaches make use of domainspecific information to make better predictions. One uses different approaches to praise a toaster than to praise an air conditioner [44]; liberals and conservatives each frame health care differently from how they frame energy policy [277]. Part-of-speech tagging When the examples x are individual words and the labels y represent the grammatical function of a word (e.g., whether a word is a noun, verb, or adjective), the task is called part-of-speech tagging. This level of analysis can be useful for discovering simple patterns in text: distinguishing between when “hit” is used as a noun (a Hollywood hit) and when “hit” is used as a verb (the car hit the guard rail). Unlike document classification, the examples x are not independent: knowing whether the previous word was an adjective makes it far more likely that the next word will be a noun than a verb. Thus, the classification algorithms need to incorporate structure into the decisions. 
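As a quick, hedged illustration, the short sketch below runs NLTK's pretrained English tagger on two sentences of our own; it assumes the relevant NLTK resources ('punkt' and 'averaged_perceptron_tagger') have been downloaded.

# A minimal sketch of part-of-speech tagging with NLTK's pretrained English tagger.
import nltk

for sentence in ["The movie was a surprise hit.", "The car hit the guard rail."]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
# A well-trained tagger should label "hit" as a noun (NN) in the first sentence
# and as a past-tense verb (VBD) in the second, using the surrounding context.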
7.4 Evaluation

(◮ Chapter 10 discusses how to measure and diagnose errors in big data.)

Evaluation techniques are common in economics, policy analysis, and development. They allow researchers to justify their conclusions using statistical means of validation and assessment. Text, however, is less amenable to standard definitions of error: it is clear that predicting that revenue will be $110 when it is really $100 is far better than predicting $900; however, it is hard to say how far "potato harvest" is from "journalism" if you are attempting to automatically label documents. Documents are hard to transform into numbers without losing semantic meanings and context.

Content analysis, discourse analysis, and bibliometrics are all common tools used by social scientists in their text mining exercises [134, 358]. However, they are rarely accompanied by robust evaluation metrics, such as type I and type II error rates, when retrieving data for further analysis. For example, bibliometricians often rely on search strings derived from expert interviews and workshops, but it is hard to certify that those search strings are optimal. For instance, in nanotechnology research, Porter et al. [305] developed a canonical search strategy for retrieving nano-related papers from major scientific databases. Nevertheless, others adopt their own search string modifications and claim similar validity [144, 374]. Evaluating these methods depends on reference corpora. Below we discuss metrics that help you understand whether a collection of documents retrieved for a query is a good one, and whether a labeling of a document collection is consistent with an existing set of labels.

Purity

Suppose you are tasked with categorizing a collection of documents based on what they are about. Reasonable people may disagree: I might put "science and medicine" together, while another person may create separate categories for "energy," "scientific research," and "health care," none of which is a strict subset of my "science and medicine" category. Nevertheless, we still want to know whether two categorizations are consistent.

Let us first consider the case where the labels differ but all categories match (i.e., even though you call one category "taxes" and I call it "taxation," it has exactly the same constituent documents). This should be the best case; it should have the highest score possible. Let us say that this maximum score should be 1. The opposite case is if we both simply assign labels randomly. There will still be some overlap in our labeling: we will agree sometimes, purely by chance. On average, if we both assign one label, selected from the same set of K labels, to each document, then we should expect to agree on about 1/K of the labels. This is a lower bound on performance. The formalization of this measure is called purity: how much overlap there is between each of my labels and the "best" match from your labels. Box 7.2 shows how to calculate it.

Box 7.2: Purity calculation
We compute purity by assigning each cluster to the class that is most frequent in the cluster, and then measuring the accuracy of this assignment by counting correctly assigned documents and dividing by the number of all documents, N [248]. In formal terms,

Purity(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |w_k \cap c_j|,

where \Omega = \{w_1, w_2, \ldots, w_K\} is the set of candidate clusters and C = \{c_1, c_2, \ldots, c_J\} is the gold set of classes.
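The Box 7.2 definition translates directly into a few lines of Python; the following sketch (with illustrative variable names and made-up data) computes purity from parallel lists of cluster assignments and gold-standard classes.

# A sketch of the purity calculation in Box 7.2, with illustrative data.
from collections import Counter

def purity(cluster_assignments, gold_classes):
    # Assign each cluster to its most frequent gold class and count the
    # correctly assigned documents, then divide by the total N.
    N = len(cluster_assignments)
    correct = 0
    for k in set(cluster_assignments):
        members = [c for w, c in zip(cluster_assignments, gold_classes) if w == k]
        correct += Counter(members).most_common(1)[0][1]
    return correct / N

clusters = [0, 0, 0, 1, 1, 2, 2, 2]
classes  = ["a", "a", "b", "b", "b", "a", "a", "b"]
print(purity(clusters, classes))  # 0.75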
Precision and recall

Chapter 6 already touched on the importance of precision and recall for evaluating the results of information retrieval and machine learning models (Box 7.3 provides a reminder of the formulae). Here we look at a particular example of how these metrics can be computed when working with scientific documents.

Box 7.3: Precision and recall
These two metrics are commonly used in information retrieval and computational linguistics [318]. Precision accounts for type I errors—false positives—in a similar manner to the purity measure; it is formally defined as

Precision = \frac{|\{relevant\ documents\} \cap \{retrieved\ documents\}|}{|\{retrieved\ documents\}|}.

Recall accounts for type II errors—false negatives—and is defined as

Recall = \frac{|\{relevant\ documents\} \cap \{retrieved\ documents\}|}{|\{relevant\ documents\}|}.

We assume that a user has three sets of documents, D_a = \{d_{a1}, d_{a2}, \ldots, d_{an}\}, D_b = \{d_{b1}, d_{b2}, \ldots, d_{bk}\}, and D_c = \{d_{c1}, d_{c2}, \ldots, d_{ci}\}. All three sets are clearly tagged with a disciplinary label: D_a are computer science documents, D_b are physics, and D_c are chemistry. The user also has a different set of documents—Wikipedia pages on "Computer Science," "Chemistry," and "Physics." Knowing that all documents in D_a, D_b, and D_c have clear disciplinary assignments, let us map the given Wikipedia pages to all documents within those three sets. For example, the Wikipedia-based query on "Computer Science" should return all computer science documents and none in physics or chemistry. So, if the query based on the "Computer Science" Wikipedia page returns only 50% of all computer science documents, then 50% of the relevant documents are lost: the recall is 0.5. On the other hand, if the same "Computer Science" query returns 50% of all computer science documents but also 20% of the physics documents and 50% of the chemistry documents, then all of the physics and chemistry documents returned are false positives. Assuming that all document sets are of equal size, so that |D_a| = 10, |D_b| = 10, and |D_c| = 10, the precision is 5/12 ≈ 0.42.

F score

The F score takes the precision and recall measures a step further and considers the general accuracy of the model. In formal terms, the F score is a weighted average of the precision and recall:

F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}.     (7.1)

In terms of type I and type II errors:

F_\beta = \frac{(1 + \beta^2) \cdot \text{true positive}}{(1 + \beta^2) \cdot \text{true positive} + \beta^2 \cdot \text{false negative} + \text{false positive}},

where \beta sets the balance between precision and recall. Thus, F_2 puts more emphasis on the recall measure and F_{0.5} puts more emphasis on precision.
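The worked example above can be checked with a few lines of Python; the helper functions and document identifiers below are illustrative assumptions, not part of the chapter's code.

# A sketch of the precision/recall/F-score arithmetic in the example above.
def precision_recall(retrieved, relevant):
    true_pos = len(retrieved & relevant)
    return true_pos / len(retrieved), true_pos / len(relevant)

def f_beta(precision, recall, beta=1.0):
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# "Computer Science" query: 5 of 10 CS documents retrieved, plus 2 physics and
# 5 chemistry documents returned as false positives.
relevant = {f"cs{i}" for i in range(10)}
retrieved = ({f"cs{i}" for i in range(5)} | {"phys1", "phys2"}
             | {f"chem{i}" for i in range(5)})
p, r = precision_recall(retrieved, relevant)
print(round(p, 2), round(r, 2), round(f_beta(p, r), 2))  # 0.42 0.5 0.45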
7.5 Text analysis tools

We are fortunate to have access to a set of powerful open source text analysis tools. We describe three here.

The Natural Language Toolkit

The NLTK is a commonly used natural language toolkit that provides a large number of relevant solutions for text analysis. It is Python-based and can be easily integrated into data processing and analytical scripts by a simple import nltk (or similar for any one of its submodules). The NLTK includes a set of tokenizers, stemmers, lemmatizers, and other natural language processing tools typically applied in text analysis and machine learning. For example, a user can extract tokens from a document doc by running the command tokens = nltk.word_tokenize(doc).

Useful text corpora are also present in the NLTK distribution. For example, the stop words list can be retrieved by running the command stops = nltk.corpus.stopwords.words(language). These stop words are available for several languages within NLTK, including English, French, and Spanish. Similarly, the Brown Corpus or WordNet can be loaded by running from nltk.corpus import brown or from nltk.corpus import wordnet. After the corpora are loaded, their various properties can be explored and used in text analysis; for example, dogsyn = wordnet.synsets('dog') will return a list of WordNet synsets related to the word "dog."

Term frequency distribution and n-gram indexing are other techniques implemented in NLTK. For example, a user can compute the frequency distribution of individual terms within a document by running fdist = nltk.FreqDist(tokens); this returns a dictionary of all tokens with their associated frequencies. N-gram indexing is implemented as a chain-linked collocations algorithm that takes into account the probability of any given two, three, or more words appearing together in the entire corpus. In general, n-grams can be discovered as easily as running bigrams = nltk.bigrams(tokens). However, a more sophisticated approach is needed to discover statistically significant word collocations, as we show in Listing 7.4. Bird et al. [39] provide a detailed description of NLTK tools and techniques. See also the official NLTK website [284].

import re
import collections
import nltk

def bigram_finder(texts):
    # NLTK bigrams from a corpus of documents separated by new lines
    tokens_list = nltk.word_tokenize(re.sub("\n", " ", texts))
    bgm = nltk.collocations.BigramAssocMeasures()
    finder = nltk.collocations.BigramCollocationFinder.from_words(tokens_list)
    scored = finder.score_ngrams(bgm.likelihood_ratio)
    # Group bigrams by first word in bigram.
    prefix_keys = collections.defaultdict(list)
    for key, scores in scored:
        prefix_keys[key[0]].append((key[1], scores))
    # Sort keyed bigrams by strongest association.
    for key in prefix_keys:
        prefix_keys[key].sort(key=lambda x: -x[1])
    return prefix_keys

Listing 7.4. Python code to find bigrams using NLTK

Stanford CoreNLP

While NLTK's emphasis is on simple reference implementations, Stanford's CoreNLP [249, 354] is focused on fast implementations of cutting-edge algorithms, particularly for syntactic analysis (e.g., determining the subject of a sentence).

MALLET

For probabilistic models of text, MALLET, the MAchine Learning for LanguagE Toolkit [255], often strikes the right balance between usefulness and usability. It is written to be fast and efficient but with enough documentation and easy enough interfaces to be used by novices. It offers fast, popular implementations of conditional random fields (for part-of-speech tagging), text classification, and topic modeling.

7.6 Summary

Much "big data" of interest to social scientists is text: tweets, Facebook posts, corporate emails, and the news of the day. However, the meaning of these documents is buried beneath the ambiguities and noisiness of the informal, inconsistent ways by which humans communicate with each other. Despite attempts to formalize the meaning of text data by asking users to tag people, apply metadata, or create structured representations, these attempts to manually curate meaning are often incomplete, inconsistent, or both. These aspects make text data difficult to work with, but also a rewarding object of study.
Unlocking the meaning of a piece of text helps bring machines closer to human-level intelligence—as language is one of the most quintessentially human activities—and helps overloaded information professionals do their jobs more effectively: understand large corpora, find the right documents, or automate repetitive tasks. And as an added bonus, the better computers become at understanding natural language, the easier it is for information professionals to communicate their needs: one day using computers to grapple with big data may be as natural as sitting down to a conversation over coffee with a knowledgeable, trusted friend.

7.7 Resources

Text analysis is one of the more complex tasks in big data analysis. Because it is unstructured, text (and natural language overall) requires significant processing and cleaning before we can engage in interesting analysis and learning. In this chapter we have referenced several resources that can be helpful in mastering text mining techniques:

• The Natural Language Toolkit is one of the most popular Python-based tools for natural language processing. It has a variety of methods and examples that are easily accessible online [284]. The book by Bird et al. [39], available online, contains multiple examples and tips on how to use NLTK.

• The book Pattern Recognition and Machine Learning by Christopher Bishop [40] is a useful introduction to computational techniques, including probabilistic methods, text analysis, and machine learning. It has a number of tips and examples that are helpful to both learning and experienced researchers.

• A paper by Anna Huang [171] provides a brief overview of the key similarity measures for text document clustering discussed in this chapter, including their strengths and weaknesses in different contexts.

• Materials at the MALLET website [255] can be specialized for the unprepared reader but are helpful when looking for specific solutions with topic modeling and machine classification using this toolkit.

• David Blei, one of the authors of the latent Dirichlet allocation algorithm (topic modeling), maintains a helpful web page with introductory resources for those interested in topic modeling [41].

• We provide an example of how to run topic modeling using MALLET on textual data from the National Science Foundation and Norwegian Research Council award abstracts [49].

• Weka, developed at the University of Waikato in New Zealand, is a useful resource for running both complex text analysis and other machine learning tasks and evaluations [150, 384].

Chapter 8
Networks: The Basics
Jason Owen-Smith

Social scientists are typically interested in describing the activities of individuals and organizations (such as households and firms) in a variety of economic and social contexts. The frame within which data has been collected will typically have been generated from tax or other programmatic sources. The new types of data permit new units of analysis—particularly network analysis—largely enabled by advances in mathematical graph theory. This chapter provides an overview of how social scientists can use network theory to generate measurable representations of patterns of relationships connecting entities.
As the author points out, the value of the new framework is not only in constructing different right-hand-side variables but also in studying an entirely new unit of analysis that lies somewhere between the largely atomistic actors that occupy the markets of neo-classical theory and the tightly managed hierarchies that are the traditional object of inquiry of sociologists and organizational theorists.

8.1 Introduction

This chapter provides a basic introduction to the analysis of large networks. It introduces the basic logic of network analysis, then turns to a summary of data structures and essential measures before presenting a primer on network visualization and a more elaborate descriptive case comparison of the collaboration networks of two research-intensive universities. Both those collaboration networks and a grant co-employment network for a large public university, also examined in this chapter, are derived from data produced by the multi-university Committee on Institutional Cooperation (CIC)'s UMETRICS project [228]. The snippets of code that are provided are from the igraph package for network analysis as implemented in Python.

At their most basic, networks are measurable representations of patterns of relationships connecting entities in an abstract or actual space. This means that there are two fundamental questions to ask of any network presentation or measure: First, what are the nodes? Second, what are the ties? While the network methods sketched in this chapter are equally applicable to technical or biological networks (e.g., the hub-and-spoke structure of the worldwide air travel system, or the neuronal network of a nematode worm), I focus primarily on social networks: patterns of relationships among people or organizations that are created by and come to influence individual action.

This chapter draws most of its examples from the world of science, technology, and innovation. Thus, it focuses particularly on networks developed and maintained by the collaborations of scientists and by the contractual relationships organizations form in pursuit of innovation. This substantive area is of great interest because a great deal of research in sociology, management, and related fields demonstrates that networks of just these sorts are essential to understanding the process of innovation and outcomes at both the individual and the organizational level. In other words, networks offer not just another convenient set of right-hand-side variables, but an entirely new unit of analysis that lies somewhere between the largely atomistic actors that occupy the markets of neo-classical theory and the tightly managed hierarchies that are the traditional object of inquiry of sociologists and organizational theorists. As Walter W. Powell [308] puts it in a description of buyer–supplier networks of small Italian firms: "when the entangling of obligation and reputation reaches a point where the actions of the parties are interdependent, but there is no common ownership or legal framework . . . such a transaction is neither a market exchange nor a hierarchical governance structure, but a separate, different mode of exchange." Existing as they do between the uncoordinated actions of independent individuals and the coordinated work of organizations, networks offer a unique level of analysis for the study of scientific and creative teams [410], collaborations [202], and clusters of entrepreneurial firms [294].
The following sections introduce you to this approach to studying innovation and discovery, focusing on examples drawn from high-technology industries and particularly from the scientific collaborations among grant-employed researchers at UMETRICS universities. I make particular use of a network that connects individual researchers to grants that paid their salaries in 2012 for a large public university. The grants network for university A includes information on 9,206 individuals who were employed on 3,389 research grants from federal science agencies in a single year.

Before turning to those more specific substantive topics, the chapter first introduces the most common structures for large network data, briefly describes three key social "mechanisms of action" by which social networks are thought to have their effects, and then presents a series of basic measures that can be used to quantify characteristics of entire networks and the relative position individual people or organizations hold in the differentiated social structure created by networks. Taken together, these measures offer an excellent starting point for examining how global network structures create opportunities and challenges for the people in them, for comparing and explaining the productivity of collaborations and teams, and for making sense of the differences between organizations, industries, and markets that hinge on the pattern of relationships connecting their participants.

But what is a network? At its simplest, a network is a pattern of concrete, measurable relationships connecting entities engaged in some common activity. While this chapter focuses on social networks, you could easily use the techniques described here to examine the structure of networks such as the World Wide Web, the national railway route map of the USA, the food web of an ecosystem, or the neuronal network of a particular species of animal. Networks can be found everywhere, but the primary example used here is a network connecting individual scientists through their shared work on particular federal grants. The web of partnerships that emerges from university scientists' decentralized efforts to build effective collaborations and teams generates a distinctive social infrastructure for cutting-edge science. Understanding the productivity and effects of university research thus requires an effort to measure and characterize the networks on which it depends. As suggested, those networks influence outcomes in three ways: first, they distinguish among individuals; second, they differentiate among teams; and third, they help to distinguish among research-performing universities. Most research-intensive institutions have departments and programs that cover similar arrays of topics and areas of study. What distinguishes them from one another is not the topics they cover but the ways in which their distinctive collaboration networks lead them to have quite different scientific capabilities.

8.2 Network data

Networks are composed of nodes, which represent things that can be connected to one another, and of ties that represent the relationships connecting nodes. When ties are undirected they are called edges. When they are directed (as when I lend money to you and you do or do not reciprocate) they are called arcs.
Nodes, edges, and arcs can, in principle, be anything: patents and citations, web pages and hypertext links, scientists and collaborations, teenagers and intimate relationships, nations and international trade agreements. The very flexibility of network approaches means that the first step toward doing a network analysis is to clearly define what counts as a node and what counts as a tie. While this seems like an easy move, it often requires deep thought. For instance, an interest in innovation and discovery could take several forms. We could be interested in how universities differ in their capacity to respond to new requests for proposals (a macro question that would require the comparison of full networks across campuses). We could wonder what sorts of training arrangements lead to the best outcomes for graduate students (a more micro-level question that requires us to identify individual positions in larger networks). Or we could ask what team structure is likely to lead to more or less radical discoveries (a decidedly meso-level question that requires us to identify substructures and measure their features). Each of these is a network question that relies on the ways in which people are connected to one another. The first challenge of measurement is to identify the nodes (what is being connected) and the ties (the relationships that matter) in order to construct the relevant networks. The next is to collect and structure the data in a fashion that is sufficient for analysis. Finally, measurement and visualization decisions must be made.

8.2.1 Forms of network data

Network ties can be directed (flowing from one node to another) or undirected. In either case they can be binary (indicating the presence or absence of a tie) or valued (allowing for relationships of different types or strengths). Network data can be represented in matrices or as lists of edges and arcs. All these types of relationships can connect one type of node (what is commonly called one-mode network data) or multiple types of nodes (what is called two-mode or affiliation data). Varied data structures correspond to different classes of network data.

The simplest form of network data represents instances where the same kinds of nodes are connected by undirected ties (edges) that are binary. An example of this type of data is a network where nodes are firms and ties indicate the presence of a strategic alliance connecting them [309]. This network would be represented as a square symmetric matrix or a list of edges connecting nodes. Figure 8.1 summarizes this simple data structure, highlighting the idea that network data of this form can be represented either as a matrix or as an edge list.

[Figure 8.1. Undirected, binary, one-mode network data: ties among a group of actors, represented equivalently as a symmetric square matrix and as an edge list (A–B, A–C, B–C, B–D, B–E, C–D). Example: actors are firms; ties represent the presence of any strategic alliance.]
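As a concrete illustration (not from the chapter's own code), the following minimal sketch builds the Figure 8.1 network in python-igraph from its edge list and prints the corresponding adjacency matrix.

# A minimal sketch: build the undirected, binary network of Figure 8.1 from its
# edge list and inspect it. Illustrative only.
from igraph import Graph

edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("B", "E"), ("C", "D")]
g = Graph.TupleList(edges, directed=False)
print(g.summary())        # five nodes, six edges
print(g.get_adjacency())  # the symmetric, binary adjacency matrix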
A much more complicated network would be one that is both directed and valued. One example might be a network of nations connected by flows of international trade. Goods and services flow from one nation to another, and the value of those goods and services (or their volume) represents ties of different strengths. When networks connecting one class of nodes (in this case nations) are directed and valued, they can be represented as asymmetric valued matrices or as lists of arcs with associated values (see Figure 8.2 for an example).

[Figure 8.2. Directed, valued, one-mode network data: ties among a group of actors, represented equivalently as an asymmetric square matrix and as an arc list with values (A→C 1, B→A 3, B→C 3, B→D 1, C→A 4, D→E 5, E→B 2, E→D 5). Example: nodes are faculty; ties represent the number of payments from one principal investigator's grant to another's.]

While many studies of small- to medium-sized social networks rely on one-mode data, large-scale social network data of this type are relatively rare. One-mode data of this sort are, however, fairly common in relationships among other types of nodes, such as hyperlinks connecting web pages or citations connecting patents or publications. Nevertheless, much "big" social network analysis is conducted using two-mode data. The UMETRICS employee data set is a two-mode network that connects people (research employees) to the grants that pay their wages. These two types of nodes can be represented as a rectangular matrix that is either valued or binary. It is relatively rare to analyze untransformed two-mode network data. Instead, most analyses take advantage of the fact that such networks are dual [398]. In other words, a two-mode network connecting grants and people can be conceptualized (and analyzed) as two one-mode networks, or projections.

8.2.2 Inducing one-mode networks from two-mode data

The most important trick in large-scale social network analysis is that of inducing one-mode, or unipartite, networks (e.g., employee × employee relationships) from two-mode, or bipartite, data. But the ubiquity and potential value of two-mode data can come at a cost. Not all affiliations are equally likely to represent real, meaningful relationships. While it seems plausible to assume that two individuals paid by the same grant have interactions that reasonably pertain to the work funded by the grant, this need not be the case.

For example, consider the two-mode grant × person network for university A. I used SQL to create a representation of this network that is readable by a freeware network visualization program called Pajek [30]. In this format, a network is represented as two lists: a vertex list that lists the nodes in the graph and an edge list that lists the connections between those nodes. In our grant × person network, we have two types of nodes, people and grants, and one kind of edge, used to represent wage payments from grants to individuals. I present a brief snippet of the resulting network file below, showing first the initial 10 elements of the vertex list and then the initial 10 elements of the edge list. (The complete file comprises information on 9,206 employees and 3,389 grants, for a total of 12,595 vertices and 15,255 edges. The employees come first in the vertex list, and so the 10 rows shown all represent employees.) Each vertex is represented by a vertex number–label pair and each edge by a pair of vertices plus an optional value.
*Grant-Person-Network
*Vertices 12595 9206
1 "00100679"
2 "00107462"
3 "00109569"
4 "00145355"
5 "00153190"
6 "00163131"
7 "00170348"
8 "00172339"
9 "00176582"
10 "00203529"
*Edges
1 10419
2 10422
3 9855
3 9873
4 9891
7 10432
7 12226
8 10419
9 11574
10 11196

Thus, the first entry in the edge list (1 10419) specifies that the vertex with identifier 1 (which happens to be the first element of the vertex list, with value "00100679") is connected to the vertex with identifier 10419 by an edge with value 1, indicating that employee "00100679" is paid by the grant described by vertex 10419.

The network excerpted above is two-mode because it represents relationships between two different classes of nodes: grants and people. In order to use data of this form to address questions about patterns of collaboration on UMETRICS campuses, we must first transform it to represent collaborative relationships. A person-by-person projection of the original two-mode network assumes that ties exist between people when they are paid by the same grant. By the same token, a grant-by-grant projection of the original two-mode network assumes that ties exist between grants when they pay the same people. Transforming two-mode data into one-mode projections is a fairly simple matter: if X is a rectangular matrix, p × g, then a one-mode projection, p × p, can be obtained by multiplying X by its transpose X′. Figure 8.3 summarizes this transformation.

[Figure 8.3. Two-mode affiliation data: ties linking nodes of two different types (researchers and grants), represented by a rectangular matrix X (g × h) with transpose X′ (h × g). XX′ yields a g × g symmetric matrix that represents co-employment ties among individuals; X′X yields an h × h symmetric matrix that represents grants connected by the people they pay.]
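As a small illustration of this matrix algebra (not from the chapter's code), the sketch below uses the affiliation pattern shown in Figure 8.3 to form both projections with NumPy.

# A sketch of the matrix view of projection: for the binary person-by-grant
# affiliation matrix X, X @ X.T links people who share grants and X.T @ X links
# grants that share people. Illustrative only.
import numpy as np

# Rows are people A-E, columns are grants 1-3 (the pattern shown in Figure 8.3).
X = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 1],
              [0, 1, 0],
              [0, 1, 1]])

person_by_person = X @ X.T  # co-employment counts; the diagonal is grants per person
grant_by_grant = X.T @ X    # grants connected by shared personnel
print(person_by_person)
print(grant_by_grant)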
While assuming that people who are paid by the same research grants are collaborating on the same project seems plausible, it might be less realistic to assume that all students who take the same university classes have meaningful relationships. For the remainder of this chapter, the examples I discuss are based on UMETRICS employee data rendered as a one-mode person-by-person projection of the original two-mode person-by-grants data. In constructing these networks I assume that a tie exists between two university research employees when they are paid any wages from the same grant during the same year. Other time frames or thresholds might be used to define ties if appropriate for particular analyses.* ⋆ Key insight: Care must be taken when inducing one-mode network projections from two-mode network data because not all affiliations provide equally compelling evidence of actual social relationships. 224 8. Networks: The Basics 8.3 ⋆ Key insight: Structural analysis of outcomes for any individual or group are a function of the complete pattern of connections among them. ⋆ Key insight: Much of the power of networks (and their systemic features) is due to indirect ties that create reachability. Two nodes can reach each other if they are connected by an unbroken chain of relationships. These are often called indirect ties. Network measures The power of networks lies in their unique flexibility and ability to address many phenomena at multiple levels of analysis. But harnessing that power requires the application of measures that take into account the overall structure of relationships represented in a given network. The key insight of structural analysis is that outcomes for any individual or group are a function of the complete pattern of connections among them. In other words, the explanatory power of networks is driven as much by the pathways that indirectly connect nodes as by the particular relationships that directly link members of a given dyad. Indirect ties create reachability in a network.* 8.3.1 Reachability Two nodes are said to be reachable when they are connected by an unbroken chain of relationships through other nodes. For instance, two people who have never met may nonetheless be able to reach each other through a common acquaintance who is positioned to broker an introduction [286] or the transfer of information and resources [60]. It is the reachability that networks create that makes them so important for understanding the work of science and innovation. Consider Figure 8.4, which presents three schematic networks. In each, one focal node, ego, is colored orange. Each ego has four alters, but the fact that each has connections to four other nodes masks important differences in their structural positions. Those differences have to do with the number of other nodes they can reach through the network and the extent to which the other nodes in the network are connected to each other. The orange node (ego) in each network has four partners, but their positions are far from equivalent. Centrality measures on full network data can tease out the differences. The networks also vary in their gross characteristics. Those differences, too, are measurable.* Networks in which more of the possible connections among nodes are realized are denser and more cohesive than networks in which fewer potential connections are realized. Consider the two smaller networks in Figure 8.4, each of which is comprised of five nodes. 
Just five ties connect those nodes in the network on the far right of the figure. One smaller subset of that network, the triangle connecting ego and two alters at the center of the image, represents a more cohesively connected subset of the network. In contrast, eight of the nine possible ties connect the five nodes in the middle figure; no subset of those nodes is clearly more interconnected than any other. While these kinds of differences may seem trivial, they have implications for the orange nodes, and for the functioning of the networks as a whole. Structural differences between the positions of nodes, the presence and characteristics of cohesive "communities" within larger networks [133], and many important properties of entire structures can be quantified using different classes of network measures. Newman [272] provides the most recent and most comprehensive look at measures and algorithms for network research.

The most essential thing to understand about larger-scale networks is the pattern of indirect connections among nodes. What is most important about the structure of networks is not necessarily the ties that link particular pairs of nodes to one another. Instead, it is the chains of indirect connections that make networks function as a system and thus make them worthwhile as new levels of analysis for understanding social and other dynamics.

8.3.2 Whole-network measures

The basic terms needed to characterize whole networks are fairly simple. It is useful to know the size (in terms of nodes and ties) of each network you study. This is true both for the purposes of being able to generally gauge the size and connectivity of an entire network and because many of the measures that one might calculate using such networks should be standardized for analytic use. While the list of possible network measures is long, a few commonly used indices offer useful insights into the structure and implications of entire network structures.

Components and reachability

As we have seen, a key feature of networks is reachability. The reachability of participants in a network is determined by their membership in what network theorists call components, subsets of larger networks where every member of a group is indirectly connected to every other. If you imagine a standard node-and-line drawing of a network, a component is a portion of the network where you can trace paths between every pair of nodes without ever having to lift your pen. Most large networks have a single dominant component that typically includes anywhere from 50% to 90% of its participants, as well as many smaller components and isolated nodes that are disconnected from the larger portion of the network. Because the path-length and centrality measures described below can only be computed on connected subsets of networks, it is typical to analyze the largest component of any given network. Thus any description of a network or any effort to compare networks should report the number of components and the percentage of nodes reachable through the largest component. In the code snippet below, I identify the weakly connected components of the employee network, emp.
# Add component membership
emp.vs["membership"] = emp.clusters(mode="weak").membership

# Add component size
emp.vs["csize"] = [emp.clusters(mode="weak").sizes()[i] for i in
                   emp.clusters(mode="weak").membership]

# Identify the main component: get indices of max clusters
maxSize = max(emp.clusters(mode="weak").sizes())
emp.vs["largestcomp"] = [1 if maxSize == x else 0 for x in emp.vs["csize"]]

The main component of a network is commonly analyzed and visualized because the graph-theoretic distance among unconnected nodes is infinite, which renders calculation of many common network measures impossible without strong assumptions about just how far apart unconnected nodes actually are. While some researchers replace infinite path lengths with a value that is one plus the longest path observed in a given structure (the network's diameter), it is also common to simply analyze the largest connected component of the network.

Path length

One of the most robust and reliable descriptive statistics about an entire network is the average path length, l_G, among nodes. Networks with shorter average path lengths have structures that may make it easier for information or resources to flow among members in the network. Longer path lengths, by comparison, are associated with greater difficulty in the diffusion and transmission of information or resources. Let g be the number of nodes or vertices in a network. Then

l_G = \frac{1}{g(g-1)} \sum_{i \neq j} d(n_i, n_j).

As with other measures based on reachability, it is most common to report the average path length for the largest connected component of the network because the graph-theoretic distance between two unconnected nodes is infinite. In an electronic network such as the World Wide Web, a shorter path length means that any two pages can be reached through fewer hyperlink clicks.

The snippet of code below identifies the distribution of shortest path lengths among all pairs of nodes in a network and the average path length. I also include a line of code that calculates the network distance among all nodes and returns a matrix of those distances. That matrix (saved as empdist) can be used to calculate additional measures or to visualize the graph-theoretic proximities among nodes.

# Calculate distances and construct distance table
dfreq = emp.path_length_hist(directed=False)
print(dfreq)
# N = 12506433, mean +- sd: 5.0302 +- 1.7830
# Each * represents 51657 items
# [ 1,  2): * (65040)
# [ 2,  3): ********* (487402)
# [ 3,  4): *********************************** (1831349)
# [ 4,  5): ********************************************************** (2996157)
# [ 5,  6): **************************************************** (2733204)
# [ 6,  7): ************************************** (1984295)
# [ 7,  8): ************************ (1267465)
# [ 8,  9): ************ (649638)
# [ 9, 10): ***** (286475)
# [10, 11): ** (125695)
# [11, 12): * (52702)
# [12, 13): (18821)
# [13, 14): (5944)
# [14, 15): (1682)
# [15, 16): (403)
# [16, 17): (128)
# [17, 18): (28)
# [18, 19): (5)

print(dfreq.unconnected)
# 29864182

print(emp.average_path_length(directed=False))
# 5.030207

empdist = emp.shortest_paths()
Second, however, many node pairs in this network ($unconnected = 29,864,182) are unconnected and thus unreachable to each other. Figure 8.5 presents a histogram of the distribution of path lengths in the network. It represents the numeric values returned by the distance.table command in the code snippet above. In this case the diameter of the network is 18 and five pairs of nodes are reachable at this distance, but the largest group of dyads is reachable (N = 2,996,157 dyads) at distance 4. In short, nearly 3 million pairs of nodes are collaborators of collaborators of collaborators of collaborators. Degree distribution Another powerful way to describe and compare networks is to look at the distribution of centralities across nodes. While any of the centrality measures described above could be summarized in terms of their distribution, it is most common to plot the degree distribution of large networks. Degree distributions commonly have extremely long tails. The implication of this pattern is that most nodes have a small number of ties (typically one or two) and that a small percentage of nodes account for the lion’s share of a network’s connectivity and reachability. Degree distributions are typically so skewed that it is common practice to plot degree against the percentage of nodes with that degree score on a log–log scale. High-degree nodes are often particularly important actors. In the UMETRICS networks that are employee × employee projections of employee × grant networks, for instance, the nodes with the highest degree seem likely to include high-profile faculty—the investi- 8.3. Network measures 229 3500000 Number of Dyads 3000000 2500000 2000000 1500000 1000000 500000 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 Path Length Figure 8.5. Histogram of path lengths for university A employee network gators on larger institutional grants such as National Institutes of Health-funded Clinical and Translational Science Awards and National Science Foundation-funded Science and Technology Centers, and perhaps staff whose particular skills are in demand (and paid for) by multiple research teams. For instance, the head technician in a core microscopy facility or a laboratory manager who serves multiple groups might appear highly central in the degree distribution of a UMETRICS network. Most importantly, the degree distribution is commonly taken to provide insight into the dynamics by which a network was created. Highly skewed degree distributions often represent scale-free networks [25, 271, 309], which grow in part through a process called preferential attachment, where new nodes entering the network are more likely to attach to already prominent participants. In the kinds of scientific collaboration networks that UMETRICS represents, a scale-free degree distribution might come about as faculty new to an institution attempt to enroll more established colleagues on grants as coinvestigators. In the comparison exercise outlined below, I plot degree distributions for the main components of two different university networks. Clustering coefficient The third commonly used whole-network measure captures the extent to which a network is cohesive, with many 230 8. Networks: The Basics nodes interconnected. In networks that are more cohesively clustered, there are fewer opportunities for individuals to play the kinds of brokering roles that we will discuss below in the context of betweenness centrality. 
Clustering coefficient

The third commonly used whole-network measure captures the extent to which a network is cohesive, with many nodes interconnected. In networks that are more cohesively clustered, there are fewer opportunities for individuals to play the kinds of brokering roles that we will discuss below in the context of betweenness centrality. Less cohesive networks, with lower levels of clustering, are potentially more conducive to brokerage and the kinds of innovation that accompany it. However, the challenge of innovation and discovery involves both the moment of invention, the "aha!" of a good new idea, and the often complicated, uncertain, and collaborative work that is required to turn an initial insight into a scientific finding. While less clustered, open networks are more likely to create opportunities for brokers to develop fresh ideas, more cohesive and clustered networks support the kinds of repeated interactions, trust, and integration that are necessary to do uncertain and difficult collaborative work.

While it is possible to generate a global measure of cohesiveness in networks—generically, the number of closed triangles (groups of three nodes all connected to one another) as a proportion of the number of possible triads—it is more common to take a local measure of connectivity and average it across all nodes in a network. This local connectivity measure more closely approximates the notion of cohesion around nodes that is at the heart of studies of networks as means to coordinate difficult, risky work. The code snippet below calculates both the global clustering coefficient and a vector of node-specific clustering coefficients whose average represents the local measure for the employee × employee network projection of the university A UMETRICS data.

# Calculate clustering coefficients
emp.transitivity_undirected()
# 0.7241

# mode="zero" sets the local clustering coefficient to zero, rather than
# undefined, for nodes with fewer than two neighbors
local_clust = emp.transitivity_local_undirected(mode="zero")

import pandas as pd
print(pd.Series(local_clust).describe())
# count    9206.000000
# mean        0.625161
# std         0.429687
# min         0.000000
# 25%         0.000000
# 50%         0.857143
# 75%         1.000000
# max         1.000000

Together, these summary statistics—number of nodes, average path length, distribution of path lengths, degree distribution, and the clustering coefficient—offer a robust set of measures to examine and compare whole networks. It is also possible to distinguish among the positions nodes hold in a particular network. Some of the most powerful centrality measures also rely on the idea of indirect ties.

Centrality measures

This class of measures is the most common way to distinguish between the positions individual nodes hold in networks. There are many different measures of centrality that capture different aspects of network positions, but they fall into three general types. The most basic and intuitive measure of centrality, degree centrality, simply counts the number of ties that a node has. In a binary undirected network, this measure resolves into the number of unique alters each node is connected to. In mathematical terms it is the row or column sum of the adjacency matrix that characterizes a network. Degree centrality, C_D(n_i), represents a clear measure of the prominence or visibility of a node. Let

C_D(n_i) = \sum_j x_{ij}.

The degree of a node is limited by the size of the network in which it is embedded. In a network of g nodes the maximum degree of any node is g − 1. The two orange nodes in the small networks presented in Figure 8.4 have the maximum degree possible (4). In contrast, the orange node in the larger, 13-node network in that figure has the same number of alters, but the possible number of partners is three times as large (12).
For this reason it is problematic to compare raw degree centrality measures across networks of different sizes. Thus, it is common to normalize degree by the maximum value, defined by g − 1:

C'_D(n_i) = \frac{\sum_j x_{ij}}{g - 1}.

While the normalized degree centrality of the two orange nodes of the smaller networks in Figure 8.4 is 1.0, the normalized value for the node in the large network of 13 nodes is 0.33. Despite the fact that the highlighted nodes in the two smaller networks have the same degree centrality, the pattern of indirect ties connecting their alters means they occupy meaningfully different positions.
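As a quick, hedged illustration (not the book's code), raw and normalized degree centrality for the small Figure 8.1 network can be computed in python-igraph as follows.

# A sketch: raw and normalized degree centrality, using the five-node network
# from Figure 8.1. Illustrative only.
from igraph import Graph

g = Graph.TupleList([("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"),
                     ("B", "E"), ("C", "D")], directed=False)
raw = g.degree()
normalized = [d / (g.vcount() - 1) for d in raw]
for name, d, nd in zip(g.vs["name"], raw, normalized):
    print(name, d, round(nd, 2))
# B reaches every other node directly (degree 4, normalized 1.0), while E has a
# single tie (degree 1, normalized 0.25).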
Where closeness assumes that communication and the flow of information increase with proximity, betweenness captures the idea of brokerage that was made famous by Burt [59]. Here too the idea is that flows of information and resources pass between nodes that are not directly connected through indirect paths. The key to the idea of brokerage is that such paths pass through nodes that can interdict, or otherwise profit from their position “in between” unconnected alters. This idea has been particularly important in network studies of innovation [60,293], where flows of information through strategic alliances among firms or social networks connecting individuals loom large in explanations of why some organizations or individuals are better able to develop creative ideas than others. To calculate betweenness as originally specified, two strong assumptions are required [129]. First, one must assume that when people (or organizations) search for new information through their networks, they are capable of identifying the shortest path to what they seek. When multiple paths of equal length exist, we assume that each path is equally likely to be used. Newman [271] describes an alternative betweenness measure based on random paths through a network, rather than shortest paths, that relaxes these assumptions. For now, let gjk equal the number of geodesic paths linking any two actors. Then 1/gjk is the probability that any given path will be followed on a particular node’s search for information or resources in a network. In order to calculate the betweenness score of a particular actor, i, it is then necessary to determine how many of the geodesic paths connecting j to k include i. That quantity is gjk (ni ). With these (unrealistic) assumptions in place, we calculate CB (ni ) as CB (ni ) = X (n ) gjk i /gjk . j
