Big Data And Social Science (Statistics In The Behavioral Sciences Series) Ian Foster, Rayid Ghani, Ron S. Jarmin



A Practical Guide to Methods and Tools
Statistics in the Social and Behavioral Sciences Series
Aims and scope
Large and complex datasets are becoming prevalent in the social and behavioral
sciences and statistical methods are crucial for the analysis and interpretation of such
data. This series aims to capture new developments in statistical methodology with
particular relevance to applications in the social and behavioral sciences. It seeks to
promote appropriate use of statistical, econometric and psychometric methods in
these applied sciences by publishing a broad range of reference works, textbooks, and handbooks.
The scope of the series is wide, including applications of statistical methodology in
sociology, psychology, economics, education, marketing research, political science,
criminology, public policy, demography, survey methodology and official statistics. The
titles included in the series are designed to appeal to applied statisticians, as well as
students, researchers and practitioners from the above disciplines. The inclusion of real
examples and case studies is therefore essential.
Series Editors
Jeff Gill
Washington University, USA
Steven Heeringa
University of Michigan, USA
Wim J. van der Linden
Pacific Metrics, USA
J. Scott Long
Indiana University, USA
Tom Snijders
Oxford University, UK
University of Groningen, NL
Chapman & Hall/CRC
Published Titles
Analyzing Spatial Models of Choice and Judgment with R
David A. Armstrong II, Ryan Bakker, Royce Carroll, Christopher Hare, Keith T. Poole, and Howard Rosenthal
Analysis of Multivariate Social Science Data, Second Edition
David J. Bartholomew, Fiona Steele, Irini Moustaki, and Jane I. Galbraith
Latent Markov Models for Longitudinal Data
Francesco Bartolucci, Alessio Farcomeni, and Fulvia Pennoni
Statistical Test Theory for the Behavioral Sciences
Dato N. M. de Gruijter and Leo J. Th. van der Kamp
Multivariable Modeling and Multivariate Analysis for the Behavioral Sciences
Brian S. Everitt
Multilevel Modeling Using R
W. Holmes Finch, Jocelyn E. Bolin, and Ken Kelley
Big Data and Social Science: A Practical Guide to Methods and Tools
Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter, and Julia Lane
Ordered Regression Models: Parallel, Partial, and Non-Parallel Alternatives
Andrew S. Fullerton and Jun Xu
Bayesian Methods: A Social and Behavioral Sciences Approach, Third Edition
Jeff Gill
Multiple Correspondence Analysis and Related Methods
Michael Greenacre and Jörg Blasius
Applied Survey Data Analysis
Steven G. Heeringa, Brady T. West, and Patricia A. Berglund
Informative Hypotheses: Theory and Practice for Behavioral and Social Scientists
Herbert Hoijtink
Generalized Structured Component Analysis: A Component-Based Approach to Structural Equation Modeling
Heungsun Hwang and Yoshio Takane
Bayesian Psychometric Modeling
Roy Levy and Robert J. Mislevy
Statistical Studies of Income, Poverty and Inequality in Europe: Computing and Graphics in R Using EU-SILC
Nicholas T. Longford
Foundations of Factor Analysis, Second Edition
Stanley A. Mulaik
Linear Causal Modeling with Structural Equations
Stanley A. Mulaik
Age–Period–Cohort Models: Approaches and Analyses with Aggregate Data
Robert M. O’Brien
Handbook of International Large-Scale Assessment: Background, Technical Issues, and Methods of Data Analysis
Leslie Rutkowski, Matthias von Davier, and David Rutkowski
Generalized Linear Models for Categorical and Continuous Limited Dependent Variables
Michael Smithson and Edgar C. Merkle
Incomplete Categorical Data Design: Non-Randomized Response Techniques for Sensitive Questions in Surveys
Guo-Liang Tian and Man-Lai Tang
Handbook of Item Response Theory, Volume 1: Models
Wim J. van der Linden
Handbook of Item Response Theory, Volume 2: Statistical Tools
Wim J. van der Linden
Handbook of Item Response Theory, Volume 3: Applications
Wim J. van der Linden
Computerized Multistage Testing: Theory and Applications
Duanli Yan, Alina A. von Davier, and Charles Lewis
Statistics in the Social and Behavioral Sciences Series
Chapman & Hall/CRC
Big Data and Social Science
A Practical Guide to Methods and Tools
Edited by
Ian Foster
University of Chicago
Argonne National Laboratory
Rayid Ghani
University of Chicago
Ron S. Jarmin
U.S. Census Bureau
Frauke Kreuter
University of Maryland
University of Mannheim
Institute for Employment Research
Julia Lane
New York University
American Institutes for Research
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
Version Date: 20160414
International Standard Book Number-13: 978-1-4987-5140-7 (Hardback)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but
the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to
trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained.
If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical,
or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without
written permission from the publishers.
For permission to photocopy or use material electronically from this work, please contact the Copyright
Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a
variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
Library of Congress Cataloging‑in‑Publication Data
Names: Foster, Ian, 1959- editor.
Title: Big data and social science : a practical guide to methods and tools /
edited by Ian Foster, University of Chicago, Illinois, USA, Rayid Ghani,
University of Chicago, Illinois, USA, Ron S. Jarmin, U.S. Census Bureau,
USA, Frauke Kreuter, University of Maryland, USA, Julia Lane, New York
University, USA.
Description: Boca Raton, FL : CRC Press, [2017] | Series: Chapman & Hall/CRC
statistics in the social and behavioral sciences series | Includes
bibliographical references and index.
Identifiers: LCCN 2016010317 | ISBN 9781498751407 (alk. paper)
Subjects: LCSH: Social sciences--Data processing. | Social
sciences--Statistical methods. | Data mining. | Big data.
Classification: LCC H61.3 .B55 2017 | DDC 300.285/6312--dc23
LC record available at
Visit the Taylor & Francis Web site at
and the CRC Press Web site at
Contents
Preface xiii
Editors xv
Contributors xix
1 Introduction 1
1.1 Why this book? 1
1.2 Defining big data and its value 3
1.3 Social science, inference, and big data 4
1.4 Social science, data quality, and big data 7
1.5 New tools for new data 9
1.6 The book's "use case" 10
1.7 The structure of the book 13
1.7.1 Part I: Capture and curation 13
1.7.2 Part II: Modeling and analysis 15
1.7.3 Part III: Inference and ethics 16
1.8 Resources 17
I Capture and Curation 21
2 Working with Web Data and APIs 23
Cameron Neylon
2.1 Introduction 23
2.2 Scraping information from the web 24
2.2.1 Obtaining data from the HHMI website 24
2.2.2 Limits of scraping 30
2.3 New data in the research enterprise 31
2.4 A functional view 37
2.4.1 Relevant APIs and resources 38
2.4.2 RESTful APIs, returned data, and Python wrappers 38
2.5 Programming against an API 41
2.6 Using the ORCID API via a wrapper 42
2.7 Quality, scope, and management 44
2.8 Integrating data from multiple sources 46
2.8.1 The Lagotto API 46
2.8.2 Working with a corpus 52
2.9 Working with the graph of relationships 58
2.9.1 Citation links between articles 58
2.9.2 Categories, sources, and connections 60
2.9.3 Data availability and completeness 61
2.9.4 The value of sparse dynamic data 62
2.10 Bringing it together: Tracking pathways to impact 65
2.10.1 Network analysis approaches 66
2.10.2 Future prospects and new data sources 66
2.11 Summary 67
2.12 Resources 69
2.13 Acknowledgements and copyright 70
3 Record Linkage 71
Joshua Tokle and Stefan Bender
3.1 Motivation 71
3.2 Introduction to record linkage 72
3.3 Preprocessing data for record linkage 76
3.4 Indexing and blocking 78
3.5 Matching 80
3.5.1 Rule-based approaches 82
3.5.2 Probabilistic record linkage 83
3.5.3 Machine learning approaches to linking 85
3.5.4 Disambiguating networks 88
3.6 Classification 88
3.6.1 Thresholds 89
3.6.2 One-to-one links 90
3.7 Record linkage and data protection 91
3.8 Summary 92
3.9 Resources 92
4 Databases 93
Ian Foster and Pascal Heus
4.1 Introduction 93
4.2 DBMS: When and why 94
4.3 Relational DBMSs 100
4.3.1 Structured Query Language (SQL) 102
4.3.2 Manipulating and querying data 102
4.3.3 Schema design and definition 105
4.3.4 Loading data 107
4.3.5 Transactions and crash recovery 108
4.3.6 Database optimizations 109
4.3.7 Caveats and challenges 112
4.4 Linking DBMSs and other tools 113
4.5 NoSQL databases 116
4.5.1 Challenges of scale: The CAP theorem 116
4.5.2 NoSQL and key–value stores 117
4.5.3 Other NoSQL databases 119
4.6 Spatial databases 120
4.7 Which database to use? 122
4.7.1 Relational DBMSs 122
4.7.2 NoSQL DBMSs 123
4.8 Summary 123
4.9 Resources 124
5 Programming with Big Data 125
Huy Vo and Claudio Silva
5.1 Introduction 125
5.2 The MapReduce programming model 127
5.3 Apache Hadoop MapReduce 129
5.3.1 The Hadoop Distributed File System 130
5.3.2 Hadoop: Bringing compute to the data 131
5.3.3 Hardware provisioning 134
5.3.4 Programming language support 136
5.3.5 Fault tolerance 137
5.3.6 Limitations of Hadoop 137
5.4 Apache Spark 138
5.5 Summary 141
5.6 Resources 143
II Modeling and Analysis 145
6 Machine Learning 147
Rayid Ghani and Malte Schierholz
6.1 Introduction 147
6.2 What is machine learning? 148
6.3 The machine learning process 150
6.4 Problem formulation: Mapping a problem to machine learning methods 151
6.5 Methods 153
6.5.1 Unsupervised learning methods 153
6.5.2 Supervised learning 161
6.6 Evaluation 173
6.6.1 Methodology 173
6.6.2 Metrics 176
6.7 Practical tips 180
6.7.1 Features 180
6.7.2 Machine learning pipeline 181
6.7.3 Multiclass problems 181
6.7.4 Skewed or imbalanced classification problems 182
6.8 How can social scientists benefit from machine learning? 183
6.9 Advanced topics 185
6.10 Summary 185
6.11 Resources 186
7 Text Analysis 187
Evgeny Klochikhin and Jordan Boyd-Graber
7.1 Understanding what people write 187
7.2 How to analyze text 189
7.2.1 Processing text data 190
7.2.2 How much is a word worth? 192
7.3 Approaches and applications 193
7.3.1 Topic modeling 193 Inferring topics from raw text 194 Applications of topic models 197
7.3.2 Information retrieval and clustering 198
7.3.3 Other approaches 205
7.4 Evaluation 208
7.5 Text analysis tools 210
7.6 Summary 212
7.7 Resources 213
8 Networks: The Basics 215
Jason Owen-Smith
8.1 Introduction 215
8.2 Network data 218
8.2.1 Forms of network data 218
8.2.2 Inducing one-mode networks from two-mode data 220
8.3 Network measures 224
8.3.1 Reachability 224
8.3.2 Whole-network measures 225
8.4 Comparing collaboration networks 234
8.5 Summary 238
8.6 Resources 239
III Inference and Ethics 241
9 Information Visualization 243
M. Adil Yalçın and Catherine Plaisant
9.1 Introduction 243
9.2 Developing effective visualizations 244
9.3 A data-by-tasks taxonomy 249
9.3.1 Multivariate data 249
9.3.2 Spatial data 251
9.3.3 Temporal data 252
9.3.4 Hierarchical data 255
9.3.5 Network data 257
9.3.6 Text data 259
9.4 Challenges 259
9.4.1 Scalability 260
9.4.2 Evaluation 261
9.4.3 Visual impairment 261
9.4.4 Visual literacy 262
9.5 Summary 262
9.6 Resources 263
10 Errors and Inference 265
Paul P. Biemer
10.1 Introduction 265
10.2 The total error paradigm 266
10.2.1 The traditional model 266
10.2.2 Extending the framework to big data 273
10.3 Illustrations of errors in big data 275
10.4 Errors in big data analytics 277
10.4.1 Errors resulting from volume, velocity, and variety, assuming perfect veracity 277
10.4.2 Errors resulting from lack of veracity 279 Variable and correlated error 280 Models for categorical data 282 Misclassification and rare classes 283 Correlation analysis 284 Regression analysis 288
10.5 Some methods for mitigating, detecting, and compensating for errors 290
10.6 Summary 295
10.7 Resources 296
11 Privacy and Confidentiality 299
Stefan Bender, Ron Jarmin, Frauke Kreuter, and Julia Lane
11.1 Introduction 299
11.2 Why is access important? 303
11.3 Providing access 305
11.4 The new challenges 306
11.5 Legal and ethical framework 308
11.6 Summary 310
11.7 Resources 311
12 Workbooks 313
Jonathan Scott Morgan, Christina Jones, and Ahmad Emad
12.1 Introduction 313
12.2 Environment 314
12.2.1 Running workbooks locally 314
12.2.2 Central workbook server 315
12.3 Workbook details 315
12.3.1 Social Media and APIs 315
12.3.2 Database basics 316
12.3.3 Data Linkage 316
12.3.4 Machine Learning 317
12.3.5 Text Analysis 317
12.3.6 Networks 318
12.3.7 Visualization 318
12.4 Resources 319
Bibliography 321
Index 349
Preface
The class on which this book is based was created in response to a very real challenge: how to introduce new ideas and methodologies about economic and social measurement into a workplace focused on producing high-quality statistics. We are deeply grateful for the inspiration and support of Census Bureau Director John Thompson and Deputy Director Nancy Potok in designing and implementing the class content and structure.
As with any book, there are many people to be thanked. We are grateful to Christina Jones, Ahmad Emad, Josh Tokle from the American Institutes for Research, and Jonathan Morgan from Michigan State University, who, together with Alan Marco and Julie Caruso from the US Patent and Trademark Office, Theresa Leslie from the Census Bureau, Brigitte Raumann from the University of Chicago, and Lisa Jaso from Summit Consulting, actually made the class happen.
We are also grateful to the students of three "Big Data for Federal Statistics" classes in which we piloted this material, and to the instructors and speakers beyond those who contributed as authors to this edited volume: Dan Black, Nick Collier, Ophir Frieder, Lee Giles, Bob Goerge, Laure Haak, Madian Khabsa, Jonathan Ozik, Ben Shneiderman, and Abe Usher. The book would not exist without them.
We thank Trent Buskirk, Davon Clarke, Chase Coleman, Stephanie Eckman, Matt Gee, Laurel Haak, Jen Helsby, Madian Khabsa, Ulrich Kohler, Charlotte Oslund, Rod Little, Arnaud Sahuguet, Tim Savage, Severin Thaler, and Joe Walsh for their helpful comments on drafts of this material.
We also owe a great debt to the copyeditor, Richard Leigh; the project editor, Charlotte Byrnes; and the publisher, Rob Calver, for their hard work and dedication.
Editors
Ian Foster is a Professor of Computer Science at the University of Chicago and a Senior Scientist and Distinguished Fellow at Argonne National Laboratory.
Ian has a long record of research contributions in high-performance computing, distributed systems, and data-driven discovery. He has also led US and international projects that have produced widely used software systems and scientific computing infrastructures. He has published hundreds of scientific papers and six books on these and other topics. Ian is an elected fellow of the American Association for the Advancement of Science, the Association for Computing Machinery, and the British Computer Society. His awards include the British Computer Society's Lovelace Medal and the IEEE Tsutomu Kanai Award.
Rayid Ghani is the Director of the Center for Data Science and Public Policy and a Senior Fellow at the Harris School of Public Policy and the Computation Institute at the University of Chicago. Rayid is a reformed computer scientist and wannabe social scientist, but mostly just wants to increase the use of data-driven approaches in solving large public policy and social challenges. He is also passionate about teaching practical data science and started the Eric and Wendy Schmidt Data Science for Social Good Fellowship at the University of Chicago, which trains computer scientists, statisticians, and social scientists from around the world to work on data science problems with social impact.
Before joining the University of Chicago, Rayid was the Chief Scientist of the Obama 2012 Election Campaign, where he focused on data, analytics, and technology to target and influence voters, donors, and volunteers. Previously, he was a Research Scientist and led the Machine Learning group at Accenture Labs. Rayid did his graduate work in machine learning at Carnegie Mellon University and is actively involved in organizing data science related conferences and workshops. In his ample free time, Rayid works with non-profits and government agencies to help them with their data, analytics, and digital efforts and strategy.
Ron S. Jarmin is the Assistant Director for Research and Methodology at the US Census Bureau. He formerly was the Bureau's Chief Economist and Chief of the Center for Economic Studies and a Research Economist. He holds a PhD in economics from the University of Oregon and has published papers in the areas of industrial organization, business dynamics, entrepreneurship, technology and firm performance, urban economics, data access, and statistical disclosure avoidance. He oversees a broad research program in statistics, survey methodology, and economics to improve economic and social measurement within the federal statistical system.
Frauke Kreuter is a Professor in the Joint Program in Survey Methodology at the University of Maryland, Professor of Methods and Statistics at the University of Mannheim, and head of the statistical methods group at the German Institute for Employment Research in Nuremberg. Previously she held positions in the Department of Statistics at the University of California, Los Angeles (UCLA), and the Department of Statistics at the Ludwig Maximilian University of Munich. Frauke serves on several advisory boards for national statistical institutes around the world and within the Federal Statistical System in the United States. She recently served as the co-chair of the Big Data Task Force of the American Association for Public Opinion Research. She is a Gertrude Cox Award winner, recognizing statisticians in early- to mid-career who have made significant breakthroughs in statistical practice, and an elected fellow of the American Statistical Association. Her textbooks on Data Analysis Using Stata and Practical Tools for Designing and Weighting Survey Samples are used at universities worldwide, including Harvard University, Johns Hopkins University, the Massachusetts Institute of Technology, Princeton University, and University College London. Her Massive Open Online Course in Questionnaire Design attracted over 70,000 learners within the first year. Recently Frauke launched an international long-distance professional education program in survey and data science, sponsored by the German Federal Ministry of Education and Research.
Julia Lane is a Professor at the New York University Wagner Graduate School of Public Service and at the NYU Center for Urban Science and Progress, and she is an NYU Provostial Fellow for Innovation.
Julia has led many initiatives, including co-founding the UMETRICS and STAR METRICS programs at the National Science Foundation. She conceptualized and established a data enclave at NORC/University of Chicago. She also co-founded the creation and permanent establishment of the Longitudinal Employer-Household Dynamics Program at the US Census Bureau and the Linked Employer-Employee Database at Statistics New Zealand. Julia has published over 70 articles in leading journals, including Nature and Science, and authored or edited ten books. She is an elected fellow of the American Association for the Advancement of Science and a fellow of the American Statistical Association.
Contributors
Stefan Bender
Deutsche Bundesbank
Frankfurt, Germany
Paul P. Biemer
RTI International
Raleigh, NC, USA
University of North Carolina
Chapel Hill, NC, USA
Jordan Boyd-Graber
University of Colorado
Boulder, CO, USA
Ahmad Emad
American Institutes for Research
Washington, DC, USA
Pascal Heus
Metadata Technology North America
Knoxville, TN, USA
Christina Jones
American Institutes for Research
Washington, DC, USA
Evgeny Klochikhin
American Institutes for Research
Washington, DC, USA
Jonathan Scott Morgan
Michigan State University
East Lansing, MI, USA
Cameron Neylon
Curtin University
Perth, Australia
Jason Owen-Smith
University of Michigan
Ann Arbor, MI, USA
Catherine Plaisant
University of Maryland
College Park, MD, USA
Malte Schierholz
University of Mannheim
Mannheim, Germany
Claudio Silva
New York University
New York, NY, USA
Joshua Tokle
Seattle, WA, USA
Huy Vo
City University of New York
New York, NY, USA
M. Adil Yalçın
University of Maryland
College Park, MD, USA
Chapter 1
Introduction
This chapter provides a brief overview of the goals and structure of the book.
1.1 Why this book?
The world has changed for empirical social scientists. The new types of "big data" have generated an entire new research field: that of data science. That world is dominated by computer scientists who have generated new ways of creating and collecting data, developed new analytical and statistical techniques, and provided new ways of visualizing and presenting information. These new sources of data and techniques have the potential to transform the way applied social science is done.
Research has certainly changed. Researchers draw on data that are "found" rather than "made" by federal agencies; those publishing in leading academic journals are much less likely today to draw on preprocessed survey data (Figure 1.1).
The way in which data are used has also changed for both government agencies and businesses. Chief data officers are becoming as common in federal and state governments as chief economists were decades ago, and in cities like New York and Chicago, mayoral offices of data analytics have the ability to provide rapid answers to important policy questions [233]. But since federal, state, and local agencies lack the capacity to do such analysis themselves [8], they must make these data available either to consultants or to the research community. Businesses are also learning that making effective use of their data assets can have an impact on their bottom line [56].
And the jobs have changed. The new job title of "data scientist" is highlighted in job advertisements in the same category as statisticians, economists, and other quantitative social scientists, if starting salaries are a useful indicator.
[Figure 1.1. Use of pre-existing survey data in publications in leading journals, 1980–2010 [74]. Note: "Pre-existing survey" data sets refer to micro surveys such as the CPS or SIPP and do not include surveys designed by researchers for their study; the sample excludes studies whose primary data source is from developing countries.]
The goal of this book is to provide social scientists with an un-
derstanding of the key elements of this new science, its value, and
the opportunities for doing better work. The goal is also to identify
the many ways in which the analytical toolkits possessed by social
scientists can be brought to bear to enhance the generalizability of
the work done by computer scientists.
We take a pragmatic approach, drawing on our experience of
working with data. Most social scientists set out to solve a real-
world social or economic problem: they frame the problem, identify
the data, do the analysis, and then draw inferences. At all points,
of course, the social scientist needs to consider the ethical ramifi-
cations of their work, particularly respecting privacy and confiden-
tiality. The book follows the same structure. We chose a particular
problem—the link between research investments and innovation—
because that is a major social science policy issue, and one that
social scientists have been addressing using big data techniques.
While the example is specific and intended to show how abstract
concepts apply in practice, the approach is completely generaliz-
able. The web scraping, linkage, classification, and text analysis
methods on display here are canonical in nature. The inference
and privacy and confidentiality issues are no different than in any
other study involving human subjects, and the communication of
results through visualization is similarly generalizable.
1.2 Defining big data and its value
There are almost as many definitions of big data as there are new
types of data. One approach is to define big data as anything too big
to fit onto your computer (this topic is discussed in more detail in
Chapter 5). Another approach is to define it as data
with high volume, high velocity, and great variety. We choose the
description adopted by the American Association of Public Opinion
Research: “The term ‘Big Data’ is an imprecise description of a
rich and complicated set of characteristics, practices, techniques,
ethical issues, and outcomes all associated with data” [188].
The value of the new types of data for social science is quite
substantial. Personal data has been hailed as the “new oil” of the
twenty-first century, and the benefits to policy, society, and public
opinion research are undeniable [139]. Policymakers have found
that detailed data on human beings can be used to reduce crime,
improve health delivery, and manage cities better [205]. The scope
is broad indeed: one of this book’s editors has used such data to
not only help win political campaigns but also show its potential
for public policy. Society can gain as well—recent work shows data-
driven businesses were 5% more productive and 6% more profitable
than their competitors [56]. In short, the vision is that social sci-
ence researchers can potentially, by using data with high velocity,
variety, and volume, increase the scope of their data collection ef-
forts while at the same time reducing costs and respondent burden,
increasing timeliness, and increasing precision [265].
Example: New data enable new analyses
ShotSpotter data, which have fairly detailed information for each gunfire incident,
such as the precise timestamp and the nearest address, as well as the type of
shot, can be used to improve crime data [63]; Twitter data can be used to improve
predictions around job loss, job gain, and job postings [17]; and eBay postings can
be used to estimate demand elasticities [104].
But most interestingly, the new data can change the way we
think about measuring and making inferences about behavior. For
example, they enable the capture of information on a subject’s en-
tire environment—for instance, the effect of fast food caloric
labeling in health interventions [105], or the productivity of a cashier
when he is within eyesight of a highly productive cashier but not
otherwise [252]. They thus offer the potential to understand the effects of
complex environmental inputs on human behavior. In addition, big
data, by its very nature, enables us to study the tails of a distribu-
tion in a way that is not possible with small data. Much of interest
in human behavior is driven by the tails of the distribution—health
care costs by small numbers of ill people [356], economic activity
and employment by a small number of firms [93,109]—and is impos-
sible to study with the small sample sizes available to researchers.
Instead we are still faced with the same challenges and respon-
sibilities as we were before in the survey and small data collection
environment. Indeed, social scientists have a great deal to offer to a
(data) world that is currently looking to computer scientists to pro-
vide answers. Two major areas to which social scientists can con-
tribute, based on decades of experience and work with end users,
are inference and attention to data quality.
1.3 Social science, inference, and big data
The goal of empirical social science is to make inferences about a
population from available data. That requirement exists regardless
of the data source—and is a guiding principle for this book. For
probability-based survey data, methodology has been developed to
overcome problems in the data generating process. A guiding prin-
ciple for survey methodologists is the total survey error framework,
and statistical methods for weighting, calibration, and other forms
of adjustment are commonly used to mitigate errors in the survey
process. Likewise for “broken” experimental data, techniques like
propensity score adjustment and principal stratification are widely
used to fix flaws in the data generating process. Two books provide
frameworks for survey quality [35, 143].
This topic is discussed in
more detail in Chapter 10.
Across the social sciences, including economics, public policy,
sociology, management, (parts of) psychology and the like, we can
identify three categories of analysis with three different inferential
goals: description, causation, and prediction.
Description The job of many social scientists is to provide descrip-
tive statements about the population of interest. These could be
univariate, bivariate, or even multivariate statements. Chapter 6
on machine learning will cover methods that go beyond simple de-
scriptive statistics, known as unsupervised learning methods.
Descriptive statistics are usually created based on census data
or sample surveys to generate some summary statistics like a mean,
median, or a graphical distribution to describe the population of in-
terest. In the case of a census, the work ends right there. With
sample surveys the point estimates come with measures of uncer-
tainties (standard errors). The estimation of standard errors has
been worked out for most descriptive statistics and most common
survey designs, even complex ones that include multiple layers of
sampling and disproportional selection probabilities [154, 385].
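For the simplest design, simple random sampling, the mechanics are straightforward. The following sketch (illustrative only, not code from the book's workbooks, with made-up numbers) computes a point estimate and its standard error, with an optional finite-population correction:

```python
import math

def mean_and_se(values, population_size=None):
    """Mean and standard error under simple random sampling,
    with an optional finite-population correction (fpc)."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance with n - 1 in the denominator.
    var = sum((x - mean) ** 2 for x in values) / (n - 1)
    se = math.sqrt(var / n)
    if population_size is not None:
        # The fpc shrinks the SE when the sample is a
        # non-negligible share of the population.
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return mean, se

# A toy sample of household incomes (thousands of dollars).
sample = [42, 55, 38, 61, 47, 52, 44, 58]
m, se = mean_and_se(sample)  # m = 49.625, se ≈ 2.88
```

Complex designs with multiple sampling layers and unequal selection probabilities require design-specific variance estimators, which is exactly what the cited references [154, 385] work out.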
Example: Descriptive statistics
The US Bureau of Labor Statistics surveys about 60,000 households a month and
from that survey is able to describe national employment and unemployment levels.
For example, in November 2015, total nonfarm payroll employment increased by
211,000, and the unemployment rate was unchanged at 5.0%. Job
gains occurred in construction, professional and technical services, and health
care. Mining and information lost jobs [57].
Proper inference, even for purely descriptive purposes, from a
sample to the population rests usually on knowing that everyone
from the target population had the chance to be included in the
survey, and knowing the selection probability for each element in
the population. The latter does not necessarily need to be known
prior to sampling, but eventually a probability is assigned for each
case. Getting the selection probabilities right is particularly impor-
tant when reporting totals [243]. Unfortunately in practice, samples
that start out as probability samples can suffer from a high rate of
nonresponse. Because the survey designer cannot completely con-
trol which units respond, the set of units that ultimately respond
cannot be considered to be a probability sample [257]. Nevertheless,
starting with a probability sample provides some degree of comfort
that a sample will have limited coverage errors (nonzero probability
of being in the sample), and there are methods for dealing with a
variety of missing data problems [240].
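When the selection probabilities are known, weighting each sampled observation by the inverse of its probability yields a design-unbiased estimate of a population total (the Horvitz–Thompson estimator). A minimal sketch, with made-up numbers:

```python
def ht_total(values, probs):
    """Horvitz-Thompson estimator of a population total:
    weight each sampled value by 1 / (its selection probability)."""
    return sum(y / p for y, p in zip(values, probs))

# Toy example: three firms sampled with known, unequal probabilities.
employment = [10.0, 4.0, 2.0]   # employment at the sampled firms
probs = [0.5, 0.2, 0.1]         # selection probability of each firm
estimated_total = ht_total(employment, probs)
```

Here each sampled firm stands in for 1/p firms like it (10/0.5, 4/0.2, 2/0.1), so the estimated population total is 60. Getting those probabilities wrong is precisely why totals are so sensitive to the selection mechanism.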
Causation In many cases, social scientists wish to test hypotheses,
often originating in theory, about relationships between phenomena
of interest. Ideally such tests stem from data that allow causal infer-
ence: typically randomized experiments or strong nonexperimental
study designs. When examining the effect of X on Y, knowing how
cases were selected into the sample or data set is much less impor-
tant in the estimation of causal effects than for descriptive studies,
for example, population means. What is important is that all ele-
ments of the inferential population have a chance of being selected
for the treatment [179]. In the debate about probability and non-
probability surveys, this distinction is often overlooked. Medical
researchers have operated with unknown study selection mecha-
nisms for years: for example, randomized trials that enroll only
selected samples.
Example: New data and causal inference
One of the major risks with using big data without thinking about the data source
is the misallocation of resources. Overreliance on, say, Twitter data in targeting re-
sources after hurricanes can lead to the misallocation of resources towards young,
Internet-savvy people with cell phones, and away from elderly or impoverished
neighborhoods [340]. Of course, all data collection approaches have had similar
risks. Bad survey methodology led the Literary Digest to incorrectly call the 1936
election [353]. Inadequate understanding of coverage, incentive and quality issues,
together with the lack of a comparison group, has hampered the use of adminis-
trative records—famously in the case of using administrative records on crime to
make inference about the role of death penalty policy in crime reduction [95].
Of course, in practice it is difficult to ensure that results are
generalizable, and there is always a concern that the treatment
effect on the treated is different than the treatment effect in the
full population of interest [365]. Having unknown study selection
probabilities makes it even more difficult to estimate population
causal effects, but substantial progress is being made [99,261]. As
long as we are able to model the selection process, there is no reason
not to do causal inference from so-called nonprobability data.
Prediction Forecasting or prediction tasks are a little less common
among applied social science researchers as a whole, but are cer-
tainly an important element for users of official statistics—in partic-
ular, in the context of social and economic indicators—as generally
for decision-makers in government and business. Here, similar to
the causal inference setting, it is of utmost importance that we do
know the process that generated the data, and we can rule out any
unknown or unobserved systematic selection mechanism.
Example: Learning from the flu
“Five years ago [in 2009], a team of researchers from Google announced a remark-
able achievement in one of the world’s top scientific journals, Nature. Without
needing the results of a single medical check-up, they were nevertheless able to
track the spread of influenza across the US. What’s more, they could do it more
quickly than the Centers for Disease Control and Prevention (CDC). Google’s track-
ing had only a day’s delay, compared with the week or more it took for the CDC
to assemble a picture based on reports from doctors’ surgeries. Google was faster
because it was tracking the outbreak by finding a correlation between what people
searched for online and whether they had flu symptoms. . . .
“Four years after the original Nature paper was published, Nature News had
sad tidings to convey: the latest flu outbreak had claimed an unexpected victim:
Google Flu Trends. After reliably providing a swift and accurate account of flu
outbreaks for several winters, the theory-free, data-rich model had lost its nose for
where flu was going. Google’s model pointed to a severe outbreak but when the
slow-and-steady data from the CDC arrived, they showed that Google’s estimates
of the spread of flu-like illnesses were overstated by almost a factor of two.
“The problem was that Google did not know—could not begin to know—what
linked the search terms with the spread of flu. Google’s engineers weren’t trying to
figure out what caused what. They were merely finding statistical patterns in the
data. They cared about correlation rather than causation” [155].
1.4 Social science, data quality, and big data
Most data in the real world are noisy, inconsistent, and suffer from
missing values, regardless of their source. Even if data collection
is cheap, the costs of creating high-quality data from the source—
cleaning, curating, standardizing, and integrating—are substantial. This topic is discussed in
more detail in Chapter 3.
Data quality can be characterized in multiple ways [76]:
Accuracy: How accurate are the attribute values in the data?
Completeness: Is the data complete?
Consistency: How consistent are the values in and between
the database(s)?
Timeliness: How timely is the data?
Accessibility: Are all variables available for analysis?
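These dimensions can be checked programmatically. A toy sketch for the completeness dimension (the records and field names here are hypothetical):

```python
def completeness(records, field):
    """Share of records with a non-missing value for `field`."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

# Hypothetical records: each field is missing in one record.
records = [
    {"name": "Miller, David A.", "salary": 52000},
    {"name": "Chen, Li", "salary": None},
    {"name": "", "salary": 48000},
]
name_rate = completeness(records, "name")      # 2/3
salary_rate = completeness(records, "salary")  # 2/3
```

Analogous checks can flag inconsistent values across databases or stale timestamps; the point is that each quality dimension can be measured, not just asserted.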
Social scientists have decades of experience in transforming
messy, noisy, and unstructured data into a well-defined, clearly
structured, and quality-tested data set. Preprocessing is a complex
and time-consuming process because it is “hands-on”—it requires
judgment and cannot be effectively automated. A typical workflow
comprises multiple steps from data definition to parsing and ends
with filtering. It is difficult to overstate the value of preprocessing
for any data analysis, but this is particularly true in big data. Data
need to be parsed, standardized, deduplicated, and normalized.
Parsing is a fundamental step taken regardless of the data source,
and refers to the decomposition of a complex variable into compo-
nents. For example, a freeform address field like “1234 E 56th St”
might be broken down into a street number “1234” and a street
name “E 56th St.” The street name could be broken down further
to extract the cardinal direction “E” and the designation “St.” An-
other example would be a combined full name field that takes the
form of a comma-separated last name, first name, and middle initial
as in “Miller, David A.” Splitting these identifiers into components
permits the creation of more refined variables that can be used in
the matching step.
In the simplest case, the distinct parts of a character field are
delimited. In the name field example, it would be easy to create the
separate fields “Miller” and “David A” by splitting the original field
at the comma. In more complex cases, special code will have to
be written to parse the field. Typical steps in a parsing procedure include:
1. Splitting fields into tokens (words) on the basis of delimiters,
2. Standardizing tokens by lookup tables and substitution by a
standard form,
3. Categorizing tokens,
4. Identifying a pattern of anchors, tokens, and delimiters,
5. Calling subroutines according to the identified pattern, therein
mapping of tokens to the predefined components.
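The steps above can be made concrete for the name and address examples from the text. This is a simplified sketch (the helper functions and the regular-expression pattern are ours, not a standard parsing library):

```python
import re

def parse_name(field):
    """Split a 'Last, First M.' field into components (step 1:
    tokenize on the comma delimiter, then on whitespace)."""
    last, _, rest = field.partition(",")
    parts = rest.split()
    return {
        "last": last.strip(),
        "first": parts[0] if parts else "",
        "middle": parts[1] if len(parts) > 1 else "",
    }

def parse_address(field):
    """Map address tokens to predefined components (step 5)
    using a simple number/name/type pattern."""
    m = re.match(r"(?P<number>\d+)\s+(?P<name>.+?)\s+(?P<type>\w+)$", field)
    return m.groupdict() if m else {"number": "", "name": field, "type": ""}

parse_name("Miller, David A.")
# -> {'last': 'Miller', 'first': 'David', 'middle': 'A.'}
parse_address("1234 E 56th St")
# -> {'number': '1234', 'name': 'E 56th', 'type': 'St'}
```

Real address parsers need many more patterns (apartment numbers, PO boxes, cardinal directions), but the structure—tokenize, match a pattern, map tokens to components—is the same.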
Standardization refers to the process of simplifying data by re-
placing variant representations of the same underlying observation
by a default value in order to improve the accuracy of field com-
parisons. For example, “First Street” and “1st St” are two ways of
writing the same street name, but a simple string comparison of
these values will return a poor result. By standardizing fields—and
using the same standardization rules across files!—the number of
true matches that are wrongly classified as nonmatches (i.e., the
number of false nonmatches) can be reduced.
Some common examples of standardization are:
Standardization of different spellings of frequently occurring
words: for example, replacing common abbreviations in street
names (Ave, St, etc.) or titles (Ms, Dr, etc.) with a common
form. These kinds of rules are highly country- and language-specific.
General standardization, including converting character fields
to all uppercase and removing punctuation and digits.
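Both kinds of standardization can be sketched with a token-level lookup table. The table below is a tiny, made-up sample; real tables are far larger and, as noted, country- and language-specific:

```python
import string

# Tiny illustrative lookup table mapping variant spellings
# to a standard form.
LOOKUP = {"street": "st", "avenue": "ave", "first": "1st", "doctor": "dr"}

def standardize(field):
    """Lowercase, strip punctuation, and replace variant spellings
    with a standard form so that string comparisons succeed."""
    cleaned = field.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(LOOKUP.get(tok, tok) for tok in cleaned.split())

standardize("First Street")  # -> '1st st'
standardize("1st St")        # -> '1st st' (now an exact match)
```

Before standardization, a string comparison of “First Street” and “1st St” fails; after it, the two fields match exactly, which is what reduces false nonmatches in the linkage step.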
Deduplication consists of removing redundant records from a
single list, that is, multiple records from the same list that refer to
the same underlying entity. After deduplication, each record in the
first list will have at most one true match in the second list and vice
versa. This simplifies the record linkage process and is necessary if
the goal of record linkage is to find the best set of one-to-one links
(as opposed to a list of all possible links). One can deduplicate a list
by applying record linkage techniques described in this chapter to
link a file to itself.
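After standardization, exact-match deduplication can be as simple as keeping one record per cleaned key. This sketch handles only exact duplicates; probabilistic record linkage techniques are needed for near-duplicates:

```python
def deduplicate(records, key_fields):
    """Remove redundant records: keep the first record seen
    for each standardized key (exact matching after cleaning)."""
    seen = {}
    for r in records:
        key = tuple(r[f].strip().lower() for f in key_fields)
        seen.setdefault(key, r)  # keep the first record per key
    return list(seen.values())

records = [
    {"name": "Miller, David A.", "city": "Chicago"},
    {"name": "miller, david a.", "city": "Chicago "},  # same entity
    {"name": "Chen, Li", "city": "New York"},
]
deduped = deduplicate(records, ["name", "city"])  # 2 records remain
```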
Normalization is the process of ensuring that the fields that are
being compared across files are as similar as possible in the sense
that they could have been generated by the same process. At min-
imum, the same standardization rules should be applied to both
files. For additional examples, consider a salary field in a survey.
There are a number of different ways that salary could be recorded: it
might be truncated as a privacy-preserving measure or rounded to
the nearest thousand, and missing values could be imputed with
the mean or with zero. During normalization we take note of exactly
how fields are recorded.
1.5 New tools for new data
The new data sources that we have discussed frequently require
working at scales for which the social scientist’s familiar tools are
not designed. Fortunately, the wider research and data analytics
community has developed a wide variety of often more scalable and
flexible tools—tools that we will introduce within this book.
Relational database management systems (DBMSs) are used This topic is discussed in
more detail in Chapter 4.
throughout business as well as the sciences to organize, process,
and search large collections of structured data. NoSQL DBMSs are
used for data that is extremely large and/or unstructured, such as
collections of web pages, social media data (e.g., Twitter messages),
and clinical notes. Extensions to these systems and also special-
ized single-purpose DBMSs provide support for data types that are
not easily handled in statistical packages such as geospatial data,
networks, and graphs.
Open source programming systems such as Python (used ex-
tensively throughout this book) and R provide high-quality imple-
mentations of numerous data analysis and visualization methods,
from regression to statistics, text analysis, network analysis, and
much more. Finally, parallel computing systems such as Hadoop
and Spark can be used to harness parallel computer clusters for
extremely large data sets and computationally intensive analyses.
These various components may not always work together as
smoothly as do integrated packages such as SAS, SPSS, and Stata,
but they allow researchers to take on problems of great scale and
complexity. Furthermore, they are developing at a tremendous rate
as the result of work by thousands of people worldwide. For these
reasons, the modern social scientist needs to be familiar with their
characteristics and capabilities.
1.6 The book’s “use case”
This book is about the uses of big data in social science. Our focus
is on working through the use of data as a social scientist normally
approaches research. That involves thinking through how to use
such data to address a question from beginning to end, and thereby
learning about the associated tools—rather than simply engaging in
coding exercises and then thinking about how to apply them to a
potpourri of social science examples.
There are many examples of the use of big data in social science
research, but relatively few that feature all the different aspects
that are covered in this book. As a result, the chapters in the book
draw heavily on a use case based on one of the first large-scale big
data social science data infrastructures. This infrastructure, based
on UMETRICS data (Universities Measuring the Impact of Research
on Innovation and Science [228]), housed at the University of Michi-
gan’s Institute for Research on Innovation and Science (IRIS) and
enhanced with data from the US Census Bureau, provides a new
quantitative analysis and understanding of science policy based on
large-scale computational analysis of new types of data.
The infrastructure was developed in response to a call from the
President’s Science Advisor (Jack Marburger) for a science of science
policy [250]. He wanted a scientific response to the questions that
he was asked about the impact of investments in science.
Example: The Science of Science Policy
Marburger wrote [250]: “How much should a nation spend on science? What
kind of science? How much from private versus public sectors? Does demand for
funding by potential science performers imply a shortage of funding or a surfeit
of performers? These and related science policy questions tend to be asked and
answered today in a highly visible advocacy context that makes assumptions that
are deserving of closer scrutiny. A new ‘science of science policy’ is emerging, and
it may offer more compelling guidance for policy decisions and for more credible
advocacy. . . .
“Relating R&D to innovation in any but a general way is a tall order, but not
a hopeless one. We need econometric models that encompass enough variables
in a sufficient number of countries to produce reasonable simulations of the effect
of specific policy choices. This need won’t be satisfied by a few grants or work-
shops, but demands the attention of a specialist scholarly community. As more
economists and social scientists turn to these issues, the effectiveness of science
policy will grow, and of science advocacy too.”
Responding to this policy imperative is a tall order, because it in-
volves using all the social science and computer science tools avail-
able to researchers. The new digital technologies can be used to
capture the links between the inputs into research, the way in which
those inputs are organized, and the subsequent outputs [396,415].
The social science questions that are addressable with this data in-
frastructure include the effect of research training on the placement
and earnings of doctoral recipients, how university trained scien-
tists and engineers affect the productivity of the firms they work for,
and the return on investments in research. Figure 1.2 provides an
abstract representation of the empirical approach that is needed:
data about grants, the people who are funded on grants, and the
subsequent scientific and economic activities.
First, data must be captured on what is funded, and since the
data are in text format, computational linguistics tools must be
applied (Chapter 7). Second, data must be captured on who is
funded, and how they interact in teams, so network tools and ana-
lysis must be used (Chapter 8). Third, information about the type of
results must be gleaned from the web and other sources (Chapter 2).
Figure 1.2. A visualization of the complex links between what and who is funded, and the results; tracing the direct
link between funding and results is misleading and wrong
Finally, the disparate complex data sets need to be stored in data-
bases (Chapter 4), integrated (Chapter 3), analyzed (Chapter 6), and
used to make inferences (Chapter 10).
The use case serves as the thread that ties many of the ideas
together. Rather than asking the reader to learn how to code “hello
world,” we build on data that have been put together to answer a
real-world question, and provide explicit examples based on that
data. We then provide examples that show how the approach generalizes.
For example, the text analysis chapter (Chapter 7) shows how
to use natural language processing to describe what research is
being done, using proposal and award text to identify the research
topics in a portfolio [110, 368]. But then it also shows how the
approach can be used to address a problem that is not just limited
to science policy—the conversion of massive amounts of knowledge
that is stored in text to usable information.
Similarly, the network analysis chapter (Chapter 8) gives specific
examples using the UMETRICS data and shows how such data can
be used to create new units of analysis—the networks of researchers
who do science, and the networks of vendors who supply research
inputs. It also shows how networks can be used to study a wide
variety of other social science questions.
In another example, we use APIs (application programming inter-
faces) provided by publishers to describe the results generated by
research funding in terms of publications and other measures of
scientific impact, but also provide code that can be repurposed for
many similar APIs.
And, of course, since all these new types of data are provided
in a variety of different formats, some of which are quite large (or
voluminous), and with a variety of different timestamps (or velocity),
we discuss how to store the data in different types of data formats.
1.7 The structure of the book
We organize the book in three parts, based around the way social
scientists approach doing research. The first set of chapters ad-
dresses the new ways to capture, curate, and store data. The sec-
ond set of chapters describes what tools are available to process and
classify data. The last set deals with analysis and the appropriate
handling of data on individuals and organizations.
1.7.1 Part I: Capture and curation
The four chapters in Part I (see Figure 1.3) tell you how to capture
and manage data.
Chapter 2 describes how to extract information from social me-
dia about the transmission of knowledge. The particular applica-
tion will be to develop links to authors’ articles on Twitter using
PLOS articles and to pull information about authors and articles
from web sources by using an API. You will learn how to retrieve
link data from bookmarking services, citations from Crossref, links
from Facebook, and information from news coverage. In keep-
ing with the social science grounding that is a core feature of the
book, the chapter discusses what data can be captured from online
sources, what is potentially reliable, and how to manage data quality.
Big data differs from survey data in that we must typically com-
bine data from multiple sources to get a complete picture of the
activities of interest. Although computer scientists may sometimes
Figure 1.3. The four chapters of Part I focus on data capture and curation:
Chapter 2, different ways of collecting data; Chapter 3, combining different
data sets; Chapter 4, ingest, query, and export data; Chapter 5, output
(creating innovation measures).
simply “mash” data sets together, social scientists are rightfully
concerned about issues of missing links, duplicative links, and
erroneous links. Chapter 3 provides an overview of traditional
rule-based and probabilistic approaches to data linkage, as well
as the important contributions of machine learning to the linkage problem.
Once data have been collected and linked into different files, it
is necessary to store and organize it. Social scientists are used to
working with one analytical file, often in statistical software tools
such as SAS or Stata. Chapter 4, which may be the most impor-
tant chapter in the book, describes different approaches to stor-
ing data in ways that permit rapid and reliable exploration and analysis.
Big data is sometimes defined as data that are too big to fit onto
the analyst’s computer. Chapter 5 provides an overview of clever
programming techniques that facilitate the use of data (often using
parallel computing). While the focus is on one of the most widely
used big data programming paradigms and its most popular imple-
mentation, Apache Hadoop, the goal of the chapter is to provide a
conceptual framework to the key challenges that the approach is
designed to address.
Figure 1.4. The three chapters in Part II focus on data modeling and analysis:
Chapter 6, classifying data in new ways; Chapter 7, creating new data from
text; Chapter 8, creating new measures of social and economic activity.
1.7.2 Part II: Modeling and analysis
The three chapters in Part II (see Figure 1.4) introduce three of the
most important tools that can be used by social scientists to do new
and exciting research: machine learning, text analysis, and social
network analysis.
Chapter 6 introduces machine learning methods. It shows the
power of machine learning in a variety of different contexts, par-
ticularly focusing on clustering and classification. You will get an
overview of basic approaches and how those approaches are applied.
The chapter builds from a conceptual framework and then shows
you how the different concepts are translated into code. There is
a particular focus on random forests and support vector machine
(SVM) approaches.
Chapter 7 describes how social scientists can make use of one of
the most exciting advances in big data—text analysis. Vast amounts
of data that are stored in documents can now be analyzed and
searched so that different types of information can be retrieved.
Documents (and the underlying activities of the entities that gener-
ated the documents) can be categorized into topics or fields as well
as summarized. In addition, machine translation can be used to
compare documents in different languages.
Social scientists are typically interested in describing the activi-
ties of individuals and organizations (such as households and firms)
in a variety of economic and social contexts. The frames within
which data are collected have typically been generated from tax or
other programmatic sources. The new types of data permit new
units of analysis—particularly network analysis—largely enabled by
advances in mathematical graph theory. Thus, Chapter 8 describes
how social scientists can use network theory to generate measurable
representations of patterns of relationships connecting entities. As
the author points out, the value of the new framework is not only in
constructing different right-hand-side variables but also in study-
ing an entirely new unit of analysis that lies somewhere between
the largely atomistic actors that occupy the markets of neo-classical
theory and the tightly managed hierarchies that are the traditional
object of inquiry of sociologists and organizational theorists.
1.7.3 Part III: Inference and ethics
The four chapters in Part III (see Figure 1.5) cover three advanced
topics relating to data inference and ethics—information visualiza-
tion, errors and inference, and privacy and confidentiality—and in-
troduce the workbooks that provide access to the practical exercises
associated with the text.
Figure 1.5. The four chapters in Part III focus on inference and ethics:
Chapter 9, making sense of the data; Chapter 10, drawing statistically valid
conclusions; Chapter 11, handling data appropriately; Chapter 12, applying
new models and tools.
Chapter 9 introduces information visualization methods and de-
scribes how you can use those methods to explore data and com-
municate results so that data can be turned into interpretable, ac-
tionable information. There are many ways of presenting statis-
tical information that convey content in a rigorous manner. The
goal of this chapter is to explore different approaches and exam-
ine the information content and analytical validity of the different
approaches. It provides an overview of effective visualizations.
Chapter 10 deals with inference and the errors associated with
big data. Social scientists know only too well the cost associated
with bad data—we highlighted the classic Literary Digest example
in the introduction to this chapter, as well as the more recent Google
Flu Trends. Although the consequences are well understood, the
new types of data are so large and complex that their properties
often cannot be studied in traditional ways. In addition, the data
generating function is such that the data are often selective, in-
complete, and erroneous. Without proper data hygiene, errors can
quickly compound. This chapter provides a systematic way to think
about the error framework in a big data setting.
Chapter 11 addresses the issue that sits at the core of any study
of human beings—privacy and confidentiality. In a new field, like the
one covered in this book, it is critical that many researchers have
access to the data so that work can be replicated and built on—that
there be a scientific basis to data science. Yet the rules that social
scientists have traditionally used for survey data, namely anonymity
and informed consent, no longer apply when the data are collected
in the wild. This concluding chapter identifies the issues that must
be addressed for responsible and ethical research to take place.
Finally, Chapter 12 provides an overview of the practical work
that accompanies each chapter—the workbooks that are designed,
using Jupyter notebooks, to enable students and interested practitioners
to apply the new techniques and approaches in selected
chapters. We hope you have a lot of fun with them.
1.8 Resources
For more information on the science of science policy, see Husbands
et al.’s book for a full discussion of many issues [175] and the online
resources at the eponymous website [352].
This book is above all a practical introduction to the methods and
tools that the social scientist can use to make sense of big data,
and thus programming resources are also important. We make
extensive use of the Python programming language and the MySQL
database management system in both the book and its supporting
workbooks. We recommend that any social scientist who aspires
to work with large data sets become proficient in the use of these
two systems, and also one more, GitHub. All three, fortunately, are
quite accessible and are supported by excellent online resources.
Time spent mastering them will be repaid many times over in more
productive research.
For Python, Alex Bell’s Python for Economists (available online
[31]) provides a wonderful 30-page introduction to the use of Python
in the social sciences, complete with XKCD cartoons. Economists
Tom Sargent and John Stachurski provide a very useful set of
lectures and examples online. For more detail, we
recommend Charles Severance’s Python for Informatics: Exploring
Information [338], which not only covers basic Python but also pro-
vides material relevant to web data (the subject of Chapter 2) and
MySQL (the subject of Chapter 4). This book is also freely available
online and is supported by excellent online lectures and exercises.
For MySQL, Chapter 4 provides introductory material and point-
ers to additional resources, so we will not say more here.
We also recommend that you master GitHub. A version control
system is a tool for keeping track of changes that have been made
to a document over time. GitHub is a hosting service for projects
that use the Git version control system. As Strasser explains [363],
Git/GitHub makes it straightforward for researchers to create digi-
tal lab notebooks that record the data files, programs, papers, and
other resources associated with a project, with automatic tracking
of the changes that are made to those resources over time. GitHub
also makes it easy for collaborators to work together on a project,
whether a program or a paper: changes made by each contribu-
tor are recorded and can easily be reconciled. For example, we
used GitHub to create this book, with authors and editors check-
ing in changes and comments at different times and from many time
zones. We also use GitHub to provide access to the supporting work-
books. Ram [314] provides a nice description of how Git/GitHub can
be used to promote reproducibility and transparency in research.
One more resource that is outside the scope of this book but that
you may well want to master is the cloud [21,236]. It used to be that
when your data and computations became too large to analyze on
your laptop, you were out of luck unless your employer (or a friend)
had a larger computer. With the emergence of cloud storage and
computing services from the likes of Amazon Web Services, Google,
and Microsoft, powerful computers are available to anyone with a
credit card. We and many others have had positive experiences
using such systems for the analysis and modeling of urban [64],
environmental [107], and genomic [32] data, for example.
Such systems may well represent the future of research computing.
Part I
Capture and Curation
Chapter 2
Working with Web Data and APIs
Cameron Neylon
This chapter will show you how to extract information from social
media about the transmission of knowledge. The particular application
will be to develop links between authors’ articles and their discussion
on Twitter, using PLOS articles and pulling information using an API.
You will get link data from bookmarking services, citations from
Crossref, links from Facebook, and information from news coverage.
The examples used are drawn from Twitter. In keeping with the social
science grounding that is a core feature of the book, the chapter also
discusses what can be captured, what is potentially reliable, and how
to manage data quality issues.
2.1 Introduction
A tremendous lure of the Internet is the availability of vast amounts
of data on businesses, people, and their activity on social media.
But how can we capture the information and make use of it as we
might make use of more traditional data sources? In this chapter,
we begin by describing how web data can be collected, taking the
UMETRICS use case of research output as a readily available
example, and then discuss how to think about the scope, coverage,
and integration issues associated with its collection.
Often a big data exploration starts with information on people or
on a group of people. The web can be a rich source of additional in-
formation. It can also provide pointers to new sources of information,
allowing a pivot from one perspective to another, from one kind of
query to another. Often this is exploratory. You have an existing
core set of data and are looking to augment it. But equally this
exploration can open up whole new avenues. Sometimes the data
are completely unstructured, existing as web pages spread across a
site, and sometimes they are provided in a machine-readable form.
The challenge is in having a sufficiently diverse toolkit to bring all
of this information together.
Using the example of data on researchers and research outputs,
we will explore obtaining information directly from web pages (web
scraping) as well as explore the uses of APIs—web services that allow
an interaction with, and retrieval of, structured data. You will see
how the crucial pieces of integration often lie in making connections
between disparate data sets and how in turn making those connec-
tions requires careful quality control. The emphasis throughout
this chapter is on the importance of focusing on the purpose for
which the data will be used as a guide for data collection. While
much of this is specific to data about research and researchers, the
ideas are generalizable to wider issues of data and public policy.
2.2 Scraping information from the web
With the range of information available on the web, our first ques-
tion is how to access it. The simplest approach is often to manually
go directly to the web and look for data files or other information.
For instance, on the NSF website [268] it is possible to obtain data
dumps of all grant information. Sometimes data are available only
on web pages or we only want a subset of this information. In this
case web scraping is often a viable approach.
Web scraping involves using a program to download and process
web pages directly. This can be highly effective, particularly where
tables of information are made available online. It is also useful in
cases where it is desirable to make a series of very similar queries.
In each case we need to look at the website, identify how to get the
information we want, and then process it. Many websites deliberately
make this difficult to prevent easy access to their underlying data.
2.2.1 Obtaining data from the HHMI website
Let us suppose we are interested in obtaining information on those
investigators that are funded by the Howard Hughes Medical Insti-
tute (HHMI). HHMI has a website that includes a search function
for funded researchers, including the ability to filter by field, state,
and role. But there does not appear to be a downloadable data set
of this information. However, we can automate the process with
code to create a data set that you might compare with other data.
This process involves first understanding how to construct a URL
that will do the search we want. This is most easily done by playing
with search functionality and investigating the URL structures that
are returned. Note that in many cases websites are not helpful
here. However, with HHMI if we do a general search and play with
the structure of the URL, we can see some of the elements of the URL
that we can think of as a query. As we want to see all investigators,
we do not need to limit the search, and so with some fiddling we
come up with a URL built from a base address plus a set of query
terms like those shown in the code below.
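As an aside, the way such query terms attach to a base address can be sketched with Python's standard library. The base URL here is an assumption for illustration, not taken from the text:

```python
from urllib.parse import urlencode

# Parameters mirroring the search form discussed above; the base
# address is assumed for illustration.
base = "http://www.hhmi.org/scientists"
params = {
    "kw": "",
    "sort_by": "field_scientist_last_name",
    "sort_order": "ASC",
    "items_per_page": 20,
    "page": 0,
}
# urlencode turns the dictionary into a key=value&key=value query string
url = base + "?" + urlencode(params)
print(url)
```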
The requests module, available by default in many Jupyter Python
notebook installations, is a useful set of tools for handling interactions
with websites. It lets us construct the request just described in
terms of a base URL and query terms, as follows:
>> BASE_URL = "http://www.hhmi.org/scientists"
>> query = {
       "kw" : "",
       "sort_by" : "field_scientist_last_name",
       "sort_order" : "ASC",
       "items_per_page" : 20,
       "page" : None
   }
With our request constructed we can then make the call to the
web page to get a response.
>> import requests
>> response = requests.get(BASE_URL, params=query)
The first thing to do when building a script that hits a web page
is to make sure that your call was successful. This can be checked
by looking at the response code that the web server sent—and, obvi-
ously, by checking the actual HTML that was returned. A 200 code
means success and that everything should be OK. Other codes may
mean that the URL was constructed wrongly or that there was a
server error.
>> response.status_code
200
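The status-code classes mentioned above can be summarized in a small helper; this function is a hypothetical illustration, not part of the book's code:

```python
def describe_status(code):
    """Map an HTTP status code to its broad class (illustrative helper)."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"  # e.g., a badly constructed URL (404)
    if 500 <= code < 600:
        return "server error"
    return "unknown"

print(describe_status(200))  # success
print(describe_status(404))  # client error
```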
With the page successfully returned, we now need to process
the text it contains into the data we want. This is not a trivial
exercise. It is possible to search through and find things, but there
Figure 2.1. Source HTML from the portion of an HHMI results page containing information on HHMI investigators;
note that web scraping often returns badly formatted HTML that is difficult to read.
are a range of tools that can help with processing HTML and XML
data. Among these, one of the most popular is a module called
BeautifulSoup [319], which provides a number of useful functions
for this kind of processing. The module documentation provides
more details. (Python features many useful libraries; BeautifulSoup
is particularly helpful for web scraping.)
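The same kind of class-based extraction can also be sketched with Python's standard-library html.parser, which needs no third-party install. The HTML fragment below is invented for illustration, shaped loosely like the investigator listing discussed in this section:

```python
from html.parser import HTMLParser

class DivClassCounter(HTMLParser):
    """Count <div> tags whose class attribute contains a target class."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # attrs is a list of (name, value) pairs; class may be absent
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.count += 1

# A made-up fragment shaped like the listing discussed in the text
fragment = """
<div class="view-content">
  <div class="views-row">Investigator A</div>
  <div class="views-row">Investigator B</div>
</div>
"""
parser = DivClassCounter("views-row")
parser.feed(fragment)
print(parser.count)  # 2
```

This is a toy sketch; BeautifulSoup remains the more convenient tool for real pages, which are rarely this tidy.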
We need to check the details of the page source to find where the
information we are looking for is kept (see, for example, Figure 2.1).
Here, all the details on HHMI investigators can be found in a <div>
element with the class attribute view-content. This structure is not
something that can be determined in advance. It requires knowl-
edge of the structure of the page itself. Nested inside this <div>
element are another series of divs, each of which corresponds to
one investigator. These have the class attribute views-row. Again,
there is nothing obvious about finding these; it requires a close
examination of the page HTML itself for any specific case you happen
to be looking at.
We first process the page using the BeautifulSoup module (into
the variable soup) and then find the div element that holds the
information on investigators (investigator_list). As this element
is unique on the page (I checked using my web browser), we can
use the find method. We then process that div (using find_all) to
create an iterator object that contains each of the page segments
detailing a single investigator (investigators).
>> from bs4 import BeautifulSoup
>> soup = BeautifulSoup(response.text, "html5lib")
>> investigator_list = soup.find("div", class_ = "view-content")
>> investigators = investigator_list.find_all("div", class_ = "views-row")
As we specified in our query parameters that we wanted 20 res-
ults per page, we should check whether our list of page sections has
the right length.
>> len(investigators)
20
# Given a request response object, parse for HHMI investigators
def scrape(page_response):
    # Obtain response HTML and the correct <div> from the page
    soup = BeautifulSoup(page_response.text, "html5lib")
    inv_list = soup.find("div", class_ = "view-content")
    # Create a list of all the investigators on the page
    investigators = inv_list.find_all("div", class_ = "views-row")

    data = [] # Make the data object to store scraping results

    # Scrape needed elements from investigator list
    for investigator in investigators:
        inv = {} # Create a dictionary to store results

        # Name and role are in same HTML element; this code
        # separates them into two data elements
        name_role_tag = investigator.find("div",
            class_ = "views-field-field-scientist-classification")
        strings = name_role_tag.stripped_strings
        for string, a in zip(strings, ["name", "role"]):
            inv[a] = string

        # Extract other elements from text of specific divs or from
        # class attributes of tags in the page (e.g., URLs)
        research_tag = investigator.find("div",
            class_ = "views-field-field-scientist-research-abs-nod")
        inv["research"] = research_tag.text.lstrip()
        inv["research_url"] = "http://www.hhmi.org" \
            + research_tag.find("a").get("href")
        institution_tag = investigator.find("div",
            class_ = "views-field-field-scientist-academic-institu")
        inv["institute"] = institution_tag.text.lstrip()
        town_state_tag = investigator.find("div",
            class_ = "views-field-field-scientist-institutionstate")
        inv["town"], inv["state"] = town_state_tag.text.split(",")
        inv["town"] = inv.get("town").lstrip()
        inv["state"] = inv.get("state").lstrip()
        thumbnail_tag = investigator.find("div",
            class_ = "views-field-field-scientist-image-thumbnail")
        inv["thumbnail_url"] = thumbnail_tag.find("img")["src"]
        inv["url"] = "http://www.hhmi.org" \
            + thumbnail_tag.find("a").get("href")

        # Add the new data to the list
        data.append(inv)
    return data
Listing 2.1. Python code to parse for HHMI investigators
Finally, we need to process each of these segments to obtain the
data we are looking for. This is the actual “scraping” of the page
to get the information we want. Again, this involves looking closely
at the HTML itself, identifying where the information is held, what
tags can be used to find it, and often doing some postprocessing to
clean it up (removing spaces, splitting different elements up).
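As a tiny illustration of that cleanup step, here is how a scraped town/state string (the raw value is invented for this sketch) might be split and stripped:

```python
# An invented example of the raw text a scraper might return
raw = " Bar Harbor, ME "

# Split on the comma, then strip the stray whitespace around each piece
town, state = raw.split(",")
town, state = town.strip(), state.strip()
print(town)   # Bar Harbor
print(state)  # ME
```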
Listing 2.1 provides a function to handle all of this. The function
accepts the response object from the requests module as its input,
processes the page text to soup, and then finds the investigator_list
as above and processes it into an actual list of the investigators. For
each investigator it then processes the HTML to find and clean up
the information required, converting it to a dictionary and adding it
to our growing list of data.
Let us check what the first two elements of our data set now look
like. You can see two dictionaries, one relating to Laurence Abbott,
who is a senior fellow at the HHMI Janelia Research Campus, and one
for Susan Ackerman, an HHMI investigator based at the Jackson
Laboratory in Bar Harbor, Maine. Note that we have also obtained
URLs that give more details on the researcher and their research
program (research_url and url keys in the dictionary) that could
provide a useful input to textual analysis or topic modeling (see
Chapter 7).
>> data = scrape(response)
>> data[0:2]
[{'institute': u'Janelia Research Campus ',
  'name': u'Laurence Abbott, PhD',
  'research': u'Computational and Mathematical Modeling of Neurons
       and Neural... ',
  'research_url': u'...',
  'role': u'Janelia Senior Fellow',
  'state': u'VA ',
  'thumbnail_url': u'...',
  'town': u'Ashburn',
  'url': u'...'},
 {'institute': u'The Jackson Laboratory ',
  'name': u'Susan Ackerman, PhD',
  'research': u'Identification of the Molecular Mechanisms
       Underlying... ',
  'research_url': u'...',
  'role': u'Investigator',
  'state': u'ME ',
  'town': u'Bar Harbor',
  'url': u'...'}]
So now we know we can process a page from a website to generate
usefully structured data. However, this was only the first page of
results. We need to do this for each page of results if we want
to capture all the HHMI investigators. We could just look at the
number of pages that our search returned manually, but to make
this more general we can actually scrape the page to find that piece
of information and use that to calculate how many pages we need
to work through.
The number of results is found in a div with the class “view-
header” as a piece of free text (“Showing 1–20 of 493 results”). We
need to grab the text, split it up (I do so based on spaces), find the
right number (the one that is before the word “results”) and convert
that to an integer. Then we can divide by the number of items we
requested per page (20 in our case) to find how many pages we need
to work through. A quick mental calculation confirms that if page
0 had results 1–20, page 24 would give results 481–493.
>> # Check total number of investigators returned
>> view_header = soup.find("div", class_ = "view-header")
>> words = view_header.text.split(" ")
>> count_index = words.index("results.") - 1
>> count = int(words[count_index])
>> # Calculate number of pages, given count & items_per_page
>> num_pages = count/query.get("items_per_page")
>> num_pages
24
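A note of caution: the division above relies on Python 2's integer `/`. In Python 3, where `/` returns a float, the same calculation can be sketched with floor (and, if needed, ceiling) division:

```python
count = 493          # total results reported by the page
items_per_page = 20

# Page numbering starts at 0 and page 0 is already in hand, so the
# number of additional pages is floor(493 / 20) = 24.
extra_pages = count // items_per_page
print(extra_pages)   # 24

# The total number of pages requires ceiling division instead:
total_pages = -(-count // items_per_page)
print(total_pages)   # 25
```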
Then it is a simple matter of putting the function we constructed
earlier into a loop to work through the correct number of pages. As
we start to hit the website repeatedly, we need to consider whether
we are being polite. Most websites have a file in the root directory
called robots.txt that contains guidance on using programs to inter-
act with the website. In the case of the HHMI website, the file states
first that we are allowed (or, more properly, not forbidden) to query
the site programmatically. Thus, you can
pull down all of the more detailed biographical or research informa-
tion, if you so desire. The file also states that there is a requested
“Crawl-delay” of 10. This means that if you are making repeated
queries (as we will be in getting the 24 pages), you should wait for
10 seconds between each query. This request is easily accommo-
dated by adding a timed delay between each page request.
>> import time
>> for page_num in range(num_pages):
>>     # We already have page zero and we need to go to 24:
>>     # range(24) is [0,1,...,23]
>>     query["page"] = page_num + 1
>>     page = requests.get(BASE_URL, params=query)
>>     # We use extend to add list for each page to existing list
>>     data.extend(scrape(page))
>>     print "Retrieved and scraped page number:", query.get("page")
>>     time.sleep(10) # robots.txt specifies a crawl delay of 10 seconds

Retrieved and scraped page number: 1
Retrieved and scraped page number: 2
...
Retrieved and scraped page number: 24
Finally we can check that we have the right number of results
after our scraping. This should correspond to the 493 records that
the website reports.
>> len(data)
493
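The robots.txt conventions described above can also be checked programmatically with Python's standard library. The rules and example.org URLs in this sketch are illustrative, not HHMI's actual file:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules in the style discussed above (not a real file)
rules = """User-agent: *
Crawl-delay: 10
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.modified()   # mark the rules as freshly loaded
rp.parse(rules)

print(rp.can_fetch("*", "http://example.org/scientists"))  # True
print(rp.can_fetch("*", "http://example.org/private/x"))   # False
print(rp.crawl_delay("*"))  # 10
```

Checking can_fetch before each request, and honoring crawl_delay with time.sleep, keeps a scraper within a site's stated rules.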
2.2.2 Limits of scraping
While scraping websites is often necessary, it can be a fragile and
messy way of working. It is problematic for a number of reasons: for
example, many websites are designed in ways that make scraping
difficult or impossible, and other sites explicitly prohibit this kind
of scripted analysis. (Both reasons apply in the case of the NSF and
some other funder websites, which is why we use the HHMI website
in our example.)
In many cases a better choice is to process a data dump from
an organization. For example, the NSF and Wellcome Trust both
provide data sets for each year that include structured data on all
their awarded grants. In practice, integrating data is a continual
challenge of figuring out what is the easiest way to proceed, what is
allowed, and what is practical and useful. The selection of data will
often be driven by pragmatic rather than theoretical concerns.
Increasingly, however, good practice is emerging in which orga-
nizations provide APIs to enable scripted and programmatic access
to the data they hold. These tools are much easier and generally
more effective to work with. They are the focus of much of the rest
of this chapter.
2.3 New data in the research enterprise
The new forms of data we are discussing in this chapter are largely
available because so many human activities—in this case, discus-
sion, reading, and bookmarking—are happening online. All sorts of
data are generated as a side effect of these activities. Some of that
data is public (social media conversations), some private (IP ad-
dresses requesting specific pages), and some intrinsic to the service
(the identity of a user who bookmarks an article). What exactly are
these new forms of data? There are broadly two new directions that
data availability is moving in. The first is information on new forms
of research output, data sets, software, and in some cases physical
resources. There is an interest across the research community in
expanding the set of research outputs that are made available and,
to drive this, significant efforts are being made to ensure that these
nontraditional outputs are seen as legitimate outputs. In particular
there has been a substantial policy emphasis on data sharing and,
coupled with this, efforts to standardize practice around data cita-
tion. This is applying a well-established measure (citation) to a new
form of research output.
The second new direction, which is more developed, takes the
alternate route, providing new forms of information on existing types
of output, specifically research articles. The move online of research
activities, including discovery, reading, writing, and bookmarking,
means that many of these activities leave a digital trace. Often
these traces are public or semi-public and can be collected and
tracked. This certainly raises privacy issues that have not been
comprehensively addressed but also provides a rich source of data
on who is doing what with research articles.
There are a wide range of potential data sources, so it is useful
to categorize them. Figure 2.2 shows one possible categorization,
in which data sources are grouped based on the level of engage-
ment and the stage of use. It starts from the left with “views,”
measures of online views and article downloads, followed by “saves”
where readers actively collect articles into a library of their own,
through online discussion forums such as blogs, social media and
new commentary, formal scholarly recommendations, and, finally,
formal citations.
These categories are a useful way to understand the classes of
information available and to start digging into the sources they can
be obtained from. For each category we will look at the kind of
usage that the indicator is a proxy for, which users are captured by
Figure 2.2. Classes of online activity related to research journal articles, arranged
by increasing engagement: viewed, saved, discussed, recommended, and cited, with
example services such as PLOS Comments, NatureBlogs, F1000 Prime, Crossref, and
Web of Science. Reproduced from Lin and Fenner [237], under a Creative Commons
Attribution v 3.0 license
the indicator, the limitations that the indicator has as a measure,
and the sources of data. We start with the familiar case of formal
literature citations to provide context.
Example: Citations
Most quantitative analyses of research have focused on citations from research ar-
ticles to other research articles. Many familiar measures—such as Impact Factors,
Scimago Journal Rank, or Eigenfactor—are actually measures of journal rather
than article performance. However, information on citations at the article level is
increasingly the basis for much bibliometric analysis.
Kind of usage
Citing a scholarly work is a signal from a researcher that a specific
work has relevance to, or has influenced, the work they are describing.
It implies significant engagement and is a measure that carries some
weight.
Users
Researchers, which means usage by a specific group for a fairly small
range of purposes.
With high-quality data, there are some geographical, career, and dis-
ciplinary demographic details.
Limitations
The citations are slow to accumulate, as they must pass through a
peer-review process.
It is seldom clear from raw data why a paper is being cited.
It provides a limited view of usage, as it only reflects reuse in research,
not application in the community.
Sources
Public sources of citation data include PubMed Central and Europe
PubMed Central, which mine publicly available full text to find
citations.
Proprietary sources of citation data include Thomson Reuters’ Web of
Knowledge and Elsevier’s Scopus.
Some publishers make citation data collected by Crossref available.
Example: Page views and downloads
A major new source of data online is the number of times articles are viewed. Page
views and downloads can be defined in different ways and can be reached via a
range of paths. Page views are an immediate measure of usage. Viewing a paper
may involve less engagement than citation or bookmarking, but it can capture
interactions with a much wider range of users.
The possibility of drawing demographic information from downloads has sig-
nificant potential for the future in providing detailed information on who is reading
an article, which may be valuable for determining, for example, whether research
is reaching a target audience.
Kind of usage
It counts the number of people who have clicked on an article page or
downloaded an article.
Users
Page views and downloads report on use by those who have access
to articles. For publicly accessible articles this could be anyone; for
subscription articles it is likely to be researchers.
Limitations
Page views are calculated in different ways and are not directly com-
parable across publishers. Standards are being developed but are not
yet widely applied.
Counts of page views cannot easily distinguish between short-term
visitors and those who engage more deeply with an article.
There are complications if an article appears in multiple places, for
example at the journal website and a repository.
Sources
Some publishers and many data repositories make page view data
available in some form. Publishers with public data include PLOS,
Nature Publishing Group, Ubiquity Press, Co-Action Press, and
Frontiers.
Data repositories, including Figshare and Dryad, provide page view
and download information.
PubMed Central makes page views of articles hosted on that site avail-
able to depositing publishers. PLOS and a few other publishers make
this available.
Example: Analyzing bookmarks
Tools for collecting and curating personal collections of literature, or web content,
are now available online. They make it easy to make copies and build up indexes of
articles. Bookmarking services can choose to provide information on the number
of people who have bookmarked a paper.
Two important services targeted at researchers are Mendeley and CiteULike.
Mendeley has the larger user base and provides richer statistics. Data include
the number of users who have bookmarked a paper, groups that have collected a
paper, and in some cases demographics of users, which can include discipline,
career stage, and geography.
Bookmarks accumulate rapidly after publication and provide evidence of schol-
arly interest. They correlate quite well with the eventual number of citations. There
are also public bookmarking services that provide a view onto wider interest in re-
search articles.
Kind of usage
Bookmarking is a purposeful act. It may reveal more interest than a
page view, but less than a citation.
Its uses are different from those captured by citations.
The bookmarks may include a variety of documents, such as papers for
background reading, introductory material, position or policy papers,
or statements of community positions.
Users
Academic-focused services provide information on use by researchers.
Each service has a different user profile in, for instance, sciences or
social sciences.
All services have a geographical bias towards North America and
Europe.
There is some demographic information, for instance, on countries
where users are bookmarking the most.
Limitations
There is bias in coverage of services; for instance, Mendeley has good
coverage of biomedical literature.
It can only report on activities of signed-up users.
It is not usually possible to determine why a bookmark has been
created.
Sources
Mendeley and CiteULike both have public APIs that provide data that
are freely available for reuse.
Most consumer bookmarking services provide some form of API, but
this often has restrictions or limitations.
Example: Discussions on social media
Social media are one of the most valuable new services producing information
about research usage. A growing number of researchers, policymakers, and tech-
nologists are on these services discussing research.
There are three major features of social media as a tool. First, among a large
set of conversations, it is possible to discover a discussion about a specific paper.
Second, Twitter makes it possible to identify groups discussing research and to
learn whether they were potential targets of the research. Third, it is possible to
reconstruct discussions to understand what paths research takes to users.
In the future it will be possible to identify target audiences and to ask whether
they are being reached and how modified distribution might maximize that reach.
This could be a powerful tool, particularly for research with social relevance.
Twitter provides the most useful data because discussions and the identity of
those involved are public. Connections between users and the things they say
are often available, making it possible to identify communities discussing work.
However, the 140-character limit on Twitter messages (“tweets”) does not support
extended critiques. Facebook has much less publicly available information—but
being more private, it can be a site for frank discussion of research.
Kind of usage
Those discussing research are showing interest potentially greater than
page views.
Often users are simply passing on a link or recommending an article.
It is possible to navigate to tweets and determine the level and nature
of interest.
Users
The user bases and data sources for Twitter and Facebook are global
and public.
A rising proportion of researchers use Twitter and Facebook for profes-
sional activities.
Many journalists, policymakers, public servants, civil society groups,
and others use social media.
Limitations
Conversations range from highly technical to trivial, so numbers should
be treated with caution.
Highly tweeted or Facebooked papers also tend to have significant
bookmarking and citation.
Professional discussions can be swamped when a piece of research
captures public interest.
There are strong geographical biases.
Frequent lack of explicit links to papers is a serious limitation.
Use of links is biased towards researchers and against groups not
directly engaged in research.
There are demographic issues and reinforcement effects—retweeting
leads to more retweeting in preference to other research—so analysis
of numbers of tweets or likes is not always useful.
Example: Recommendations
A somewhat separate form of usage is direct expert recommendations. The best-
known case of this is the F1000 service on which experts offer recommendations
with reviews of specific research articles. Other services such as collections or
personal recommendation services may be relevant here as well.
Kind of usage
- Recommendations from specific experts show that particular outputs are worth looking at in detail, are important, or have some other value.
- Presumably, recommendations are a result of in-depth reading and a high level of engagement.
- Recommendations are from a selected population of experts depending on the service in question.
- In some cases this might be an algorithmic recommendation service.
- Recommendations are limited to the interests of the selected population of experts.
- The recommendation system may be biased in terms of the interests of recommenders (e.g., towards, or away from, new theories vs. developing methodology) as well as their disciplines.
- Recommendations are slow to build up.
2.4 A functional view
The descriptive view of data types and sources is a good place to
start, but it is subject to change. Sources of data come and go,
and even the classes of data types may expand and contract in the
medium to long term. We also need a more functional perspective
to help us understand how these sources of data relate to activities
in the broader research enterprise.
Consider Figure 1.2 in Chapter 1. The research enterprise has
been framed as being made up of people who are generating out-
puts. The data that we consider in this chapter relate to connections
between outputs, such as citations between research articles and
tweets referring to articles. These connections are themselves cre-
ated by people, as shown in Figure 2.3. The people in turn may be
classed as belonging to certain categories or communities. What is
interesting, and expands on the simplified picture of Figure 1.2, is
that many of these people are not professional researchers. Indeed,
in some cases they may not be people at all but automated systems
of some kind. This means we need to expand the set of actors we are
considering. As described above, we are also expanding the range
of outputs (or objects) that we are considering as well.
In the simple model of Figure 2.3, there are three categories of
things (nodes on the graph): objects, people, and the communities
they belong to. Then there are the relationships between these el-
ements (connections between nodes). Any given data source may
provide information on different parts of this graph, and the in-
formation available is rarely complete or comprehensive. Data from
Figure 2.3. A simplified model of online interactions between research outputs
and the objects that refer to them
different sources can also be difficult to integrate. As with any
data integration (see Section 3.2), combining sources relies on being
able to confidently identify those nodes that are common between
data sources. Therefore identifying unique objects and people is
critical to making this integration work.
These data are not necessarily public but many services choose
to make some data available. An important characteristic of these
data sources is that they are completely in the gift of the service
provider. Data availability, its presentation, and upstream analysis
can change without notice. Data are sometimes provided as a dump
but are also frequently provided through an API.
An API is simply a tool that allows a program to interface with a
service. APIs can take many different forms and be of varying quality
and usefulness. In this section we will focus on one common type
of API and examples of important publicly available APIs relevant
to research communications. We will also cover combining APIs
and the benefits and challenges of bringing multiple data sources
together.
2.4.1 Relevant APIs and resources
There is a wide range of other sources of information that can be
used in combination with the APIs featured above to develop an
overview of research outputs and of where and how they are being
used. There are also other tools that can allow deeper analysis of the
outputs themselves. Table 2.1 gives a partial list of key data sources
and APIs that are relevant to the analysis of research outputs.
2.4.2 RESTful APIs, returned data, and Python wrappers
The APIs we will focus on here are all examples of RESTful services.
REST stands for Representational State Transfer [121, 402], but for
Table 2.1. Popular sources of data relevant to the analysis of research outputs

Bibliographic Data
- PubMed: An online index that combines bibliographic data from Medline and PubMed Central. PubMed Central and Europe PubMed Central also provide information.
- Web of Science: The bibliographic database provided by Thomson Reuters. The ISI Citation Index is also available.
- Scopus: The bibliographic database provided by Elsevier. It also provides citation information.
- Crossref: Provides a range of bibliographic metadata and information obtained from members registering DOIs.
- Google Scholar: Provides a search index for scholarly objects and aggregates citation information.
- Microsoft Academic Search: Provides a search index for scholarly objects and aggregates citation information. Not as complete as Google Scholar, but has an API.

Social Media
- A provider of aggregated data on social media and mainstream media attention of research outputs. Most comprehensive source of information across different social media and mainstream media conversations.
- Twitter: Provides an API that allows a user to search for recent tweets and obtain some information on specific accounts.
- Facebook: The Facebook API gives information on the number of pages, likes, and posts associated with specific web pages.

Author Profiles
- ORCID: Unique identifiers for research authors. Profiles include information on publication lists, grants, and affiliations.
- LinkedIn: CV-based profiles, projects, and publications. (API: Y*)

Funder Information
- Gateway to Research: A database of funding decisions and related outputs from Research Councils UK.
- NIH Reporter: Online search for information on National Institutes of Health grants. Does not provide an API but a downloadable data set is available.
- NSF Award Search: Online search for information on NSF grants. Does not provide an API but downloadable data sets by year are available.

* The data are restricted: sometimes fee based, other times not.
our purposes it is most easily understood as a means of transfer-
ring data using web protocols. Other forms of API require addi-
tional tools or systems to work with, but RESTful APIs work directly
over the web. This has the advantage that a human user can also,
with relative ease, play with the API to understand how it works.
Indeed, some websites work simply by formatting the results of
API calls.
As an example let us look at the Crossref API. This provides a
range of information associated with Digital Object Identifiers (DOIs)
registered with Crossref. DOIs uniquely identify an object, and
Crossref DOIs refer to research objects, primarily (but not entirely)
research articles. If you use a web browser to navigate to
, you should receive back
a webpage that looks something like the following. (We have laid it
out nicely to make it more readable.)
{"status": "ok",
 "message-type": "work",
 "message-version": "1.0.0",
 "message":
   {"subject": ["Genetics"],
    "issued": {"date-parts": [[2005, 10, 24]]},
    "score": 1.0,
    "prefix": "",
    "author": [{"affiliation": [],
                "family": "Whiteford",
                "given": "N."}],
    "container-title": ["Nucleic Acids Research"],
    "reference-count": 0,
    "page": "e171-e171",
    "deposited": {"date-parts": [[2013, 8, 8]],
                  "timestamp": 1375920000000},
    "issue": "19",
    "title":
      ["An analysis of the feasibility of short read sequencing"],
    "type": "journal-article",
    "DOI": "10.1093/nar/gni170",
    "ISSN": ["0305-1048", "1362-4962"],
    "URL": "",
    "source": "Crossref",
    "publisher": "Oxford University Press (OUP)",
    "indexed": {"date-parts": [[2015, 6, 8]],
                "timestamp": 1433777291246},
    "volume": "33",
    "member": ""}}
This is a package of JavaScript Object Notation (JSON) data
returned in response to a query. (JSON is an open standard for
storing and exchanging data.) The query is contained entirely
in the URL, which can be broken up into pieces: the root URL
( and a data “query,” in this case made up of
a “field” (works) and an identifier (the DOI 10.1093/nar/gni170). The
Crossref API provides information about the article identified with
this specific DOI.
2.5 Programming against an API
Programming against an API involves constructing HTTP requests
and parsing the data that are returned. Here we use the Cross-
ref API to illustrate how this is done. Crossref is the provider of
DOIs used by many publishers to uniquely identify scholarly works.
Crossref is not the only organization to provide DOIs; in the
scholarly communication space, DataCite is another important
provider. The API documentation is available at the Crossref
website [394].
Once again the requests Python library provides a series of con-
venience functions that make it easier to make HTTP calls and to
process returned JSON. Our first step is to import the module and
set a base URL variable.
>> import requests
>> BASE_URL = ""
A simple example is to obtain metadata for an article associated
with a specific DOI. This is a straightforward call to the Crossref
API, similar to what we saw earlier.
>> doi = "10.1093/nar/gni170"
>> query = "works/"
>> url = BASE_URL + query + doi
>> response = requests.get(url)
>> url
>> response.status_code
The response object that the requests library has created has a
range of useful information, including the URL called and the re-
sponse code from the web server (in this case 200, which means
everything is OK). We need the JSON body from the response ob-
ject (which is currently text from the perspective of our script) con-
verted to a Python dictionary. The requests module provides a
convenient function for performing this conversion, as the following
code shows. (All strings in the output are in Unicode, hence the u
prefix.)
>> response_dict = response.json()
>> response_dict
{u'message':
  {u'DOI': u'10.1093/nar/gni170',
   u'ISSN': [u'0305-1048', u'1362-4962'],
   u'URL': u'',
   u'author': [{u'affiliation': [],
                u'family': u'Whiteford',
                u'given': u'N.'}],
   u'container-title': [u'Nucleic Acids Research'],
   u'deposited': {u'date-parts': [[2013, 8, 8]],
                  u'timestamp': 1375920000000},
   u'indexed': {u'date-parts': [[2015, 6, 8]],
                u'timestamp': 1433777291246},
   u'issue': u'19',
   u'issued': {u'date-parts': [[2005, 10, 24]]},
   u'member': u'',
   u'page': u'e171-e171',
   u'prefix': u'',
   u'publisher': u'Oxford University Press (OUP)',
   u'score': 1.0,
   u'source': u'Crossref',
   u'subject': [u'Genetics'],
   u'subtitle': [],
   u'title': [u'An analysis of the feasibility of short read sequencing'],
   u'type': u'journal-article',
   u'volume': u'33'},
 u'message-type': u'work',
 u'status': u'ok'}
This data object can now be processed in whatever way the user
wishes, using standard manipulation techniques.
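As a sketch of such processing, the fragment below walks a canned dictionary that mirrors the structure of the response above (abridged by hand, not fetched live) to pull out the author names and the publication year.

```python
# A canned dictionary, abridged by hand from the response shown above.
response_dict = {
    "message": {
        "DOI": "10.1093/nar/gni170",
        "author": [{"affiliation": [], "family": "Whiteford",
                    "given": "N."}],
        "container-title": ["Nucleic Acids Research"],
        "issued": {"date-parts": [[2005, 10, 24]]},
    }
}

msg = response_dict["message"]
# Join given and family names, and read the year from the date parts.
authors = ["%s %s" % (a["given"], a["family"]) for a in msg["author"]]
year = msg["issued"]["date-parts"][0][0]
print(authors)  # ['N. Whiteford']
print(year)     # 2005
```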
The Crossref API can, of course, do much more than simply look
up article metadata. It is also valuable as a search resource and
for cross-referencing information by journal, funder, publisher, and
other criteria. More details can be found at the Crossref website.
2.6 Using the ORCID API via a wrapper
ORCID, which stands for “Open Research and Contributor Identifier”
(see; see also [145]), is a service that provides unique
identifiers for researchers. Researchers can claim an ORCID profile
and populate it with references to their research works, funding and
affiliations. ORCID provides an API for interacting with this infor-
mation. For many APIs there is a convenient Python wrapper that
can be used. The ORCID–Python wrapper works with the ORCID
v1.2 API to make various API calls straightforward. This wrapper
only works with the public ORCID API and can therefore only access
publicly available data.
Using the API and wrapper together provides a convenient means
of getting this information. For instance, given an ORCID, it is
straightforward to get profile information. Here we get a list of
publications associated with my ORCID and look at the first item
on the list.
>> import orcid
>> cn = orcid.get("0000-0002-0068-716X")
>> cn
<Author Cameron Neylon, ORCID 0000-0002-0068-716X>
>> cn.publications[0]
<Publication "Principles for Open Scholarly Infrastructures-v1">
The wrapper has created Python objects that make it easier to
work with and manipulate the data. It is common to take the return
from an API and create objects that behave as would be expected
in Python. For instance, the publications object is a list popu-
lated with publications (which are also Python-like objects). Each
publication in the list has its own attributes, which can then be
examined individually. In this case the external IDs attribute is a
list of further objects that include a DOI for the article and the ISSN
of the journal the article was published in.
>> len(cn.publications)
>> cn.publications[12].external_ids
[<ExternalID DOI:10.1371/journal.pbio.1001677>, <ExternalID ISSN
As a simple example of data processing, we can iterate over the
list of publications to identify those for which a DOI has been pro-
vided. In this case we can see that of the 70 publications listed in
this ORCID profile (at the time of testing), 66 have DOIs.
>> exids = []
>> for pub in cn.publications:
if pub.external_ids:
exids = exids + pub.external_ids
>> DOIs = [ for exid in exids if exid.type == "DOI"]
>> len(DOIs)
Wrappers generally make operating with an API simpler and
cleaner by abstracting away the details of making HTTP requests.
Achieving the same by directly interacting with the ORCID API
would require constructing the appropriate URLs and parsing the
returned data into a usable form. Where a wrapper is available it
is generally much easier to use. However, wrappers may not be
actively developed and may lag the development of the API. Where
possible, use a wrapper that is directly supported or recommended
by the API provider.
2.7 Quality, scope, and management
The examples in the previous section only scratch the surface of the
data available, but we can already see a number of issues emerging.
A great deal of care needs to
be taken when using these data, and a researcher will need to ap-
ply subject matter knowledge as well as broader data management
expertise (see Chapter 10). Some of the core issues are as follows:
Integration In the examples given above with Crossref and ORCID,
we used a known identifier (a DOI or an ORCID). Integrating data
from Crossref to supplement the information from an ORCID profile
is possible, but it depends on the linking of identifiers. Note that
for the profile data we obtained, only 66 of the 70 items had DOIs.
Data integration across multiple data sources that reference DOIs
is straightforward for those objects that have DOIs, and messy or
impossible for those that do not. In general, integration is possible,
but it depends on a means of cross-referencing between data sets.
Unique identifiers that are common to both are extremely powerful
but only exist in certain cases (see also Chapter 3).
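A minimal sketch of identifier-based integration, using entirely made-up records: works carrying a DOI can be joined to a second source keyed on DOI, while works without one simply drop out of the join.

```python
# Two made-up record sets, keyed (where possible) on DOI.
orcid_works = [{"doi": "10.1000/a", "title": "First paper"},
               {"doi": None, "title": "Untracked report"},
               {"doi": "10.1000/b", "title": "Second paper"}]
crossref_meta = {"10.1000/a": {"journal": "Journal A"},
                 "10.1000/b": {"journal": "Journal B"}}

# Merge metadata onto each work that has a matching DOI; works
# without a DOI cannot be joined and are silently dropped.
merged = [dict(w, **crossref_meta[w["doi"]])
          for w in orcid_works
          if w["doi"] in crossref_meta]
print(len(merged))  # 2 of the 3 works survive the join
```

The record without a DOI is exactly the "messy or impossible" case described above: nothing short of fuzzy matching on titles could bring it back in.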
Coverage Without a population frame, it is difficult to know
whether the information that can be captured is comprehensive.
For example, “the research literature” is at best a vague concept. A
variety of indexes, some openly available (PubMed, Crossref), some
proprietary (Scopus, Web of Knowledge, many others), cover differ-
ent partially overlapping segments of this corpus of work. Each in-
dex has differing criteria for inclusion and differing commitments to
completeness. Sampling of “the literature” is therefore impossible,
and the choice of index used for any study can make a substantial
difference to the conclusions.
Completeness Alongside the question of coverage (how broad is
a data source?), with web data and opt-in services we also need to
probe the completeness of a data set. In the example above, 66 of 70
objects have a DOI registered. This does not mean that those four
other objects do not have a DOI, just that there are none included in
the ORCID record. Similarly, ORCID profiles only exist for a subset
of researchers at this stage. Completeness feeds into integration
challenges. While many researchers have a Twitter profile and many
have an ORCID profile, only a small subset of ORCID profiles provide
a link to a Twitter profile. See below for a worked example.
Scope In survey data sets, the scope is defined by the question
being asked. This is not the case with much of these new data.
For example, the challenges listed above for research articles, tra-
ditionally considered the bedrock of research outputs, at least in
the natural sciences, are much greater for other forms of research
outputs. Increasingly, the data generated from research projects,
software, materials, and tools, as well as reports and presentations,
are being shared by researchers in a variety of settings. Some of
these are formal mechanisms for publication, such as large disci-
plinary databases, books, and software repositories, and some are
highly informal. Any study of (a subset of) these outputs has as its
first challenge the question of how to limit the corpus to be studied.
Source and validity The challenges described above relate to the
identification and counting of outputs. As we start to address ques-
tions of how these outputs are being used, the issues are com-
pounded. To illustrate some of the difficulties that can arise, we
examine the number of citations that have been reported for a sin-
gle sample article on a biochemical methodology [68]. This article
has been available for eight years and has accumulated a reason-
able number of citations for such an article over that time.
However, the exact number of citations identified varies radi-
cally, depending on the data source. Scopus finds 40, while Web
of Science finds only 38. A Google Scholar search performed on
the same date identified 59. These differences relate to the size of
the corpus from which inward citations are being counted. Web of
Science has the smallest database, with Scopus being larger and
Google Scholar substantially larger again. Thus the size of the in-
dex not only affects output counting, it can also have a substantial
effect on any analysis that uses that corpus. Alongside the size
of the corpus, the means of analysis can also have an effect. For
the same article, PubMed Central reports 10 citations but Europe
PubMed Central reports 18, despite using a similar corpus. The
distinction lies in differences in the methodology used to mine the
corpus for citations.
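The variation reported above is easy to quantify. Using the counts just quoted for the same article:

```python
# Citation counts reported for one article by different services,
# as quoted in the text above.
counts = {"Web of Science": 38, "Scopus": 40, "Google Scholar": 59,
          "PubMed Central": 10, "Europe PubMed Central": 18}

low, high = min(counts.values()), max(counts.values())
print(high - low)                    # 49: the absolute spread
print(round(high / float(low), 1))   # 5.9: nearly a sixfold difference
```

A spread of this size makes clear that "the citation count" of an article is not a single number but a function of the index consulted.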
Identifying the underlying latent variable These issues multiply as
we move into newer forms of data. These sparse and incomplete
sources of data require different treatment than more traditional
structured and comprehensive forms of data. They are more useful
as a way of identifying activities than of quantifying or comparing
them. Nevertheless, they can provide new insight into the pro-
cesses of knowledge dissemination and community building that
are occurring online.
2.8 Integrating data from multiple sources
We often must work across multiple data sources to gather the in-
formation needed to answer a research question. A common pattern
is to search in one location to create a list of identifiers and then use
those identifiers to query another API. In the ORCID example above,
we created a list of DOIs from a single ORCID profile. We could use
those DOIs to obtain further information from the Crossref API and
other sources. This models a common path for analysis of research
outputs: identifying a corpus and then seeking information on its
performance.
In this example, we will build on the ORCID and Crossref ex-
amples to collect a set of work identifiers from an ORCID profile
and use a range of APIs to identify additional metadata as well as
information on the performance of those articles. In addition to the
ORCID API, we will use the PLOS Lagotto API. Lagotto is the soft-
ware that was built to support the Article Level Metrics program at
PLOS, the open access publisher, and its API provides information
on various metrics of PLOS articles. A range of other publishers
and service providers, including Crossref, also provide an instance
of this API, meaning the same tools can be used to collect informa-
tion on articles from a range of sources.
2.8.1 The Lagotto API
The module pyalm is a wrapper for the Lagotto API, which is served
from a range of hosts. We will work with two instances in particular:
one run by PLOS, and the Crossref DOI Event Tracker (DET, recently
renamed Crossref Event Data) pilot service. We first need to provide
the details of the URLs for these instances to our wrapper. Then
we can obtain some information for a single DOI to see what the
returned data look like.
>> import pyalm
>> pyalm.config.APIS = {'plos': {'url': ''},
>>                      'det': {'url': ''}}
>> det_alm_test = pyalm.get_alm('10.1371/journal.pbio.1001677',
>>                              info='detail', instance='det')
{'articles': [<ArticleALM Expert Failure: Re-evaluating Research
    Assessment, DOI 10.1371/journal.pbio.1001677>],
 'meta': {u'error': None, u'page': 1,
          u'total': 1, u'total_pages': 1}}
The library returns a Python dictionary containing two elements.
The articles key contains the actual data and the meta key includes
general information on the results of the interaction with the API.
In this case the library has returned one page of results containing
one object (because we only asked about one DOI). If we want to
collect a lot of data, this information helps in the process of paging
through results. It is common for APIs to impose some limit on
the number of results returned, so as to ensure performance. By
default the Lagotto API has a limit of 50 results.
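Paging logic has the same shape for most APIs. The generator below is a generic sketch (not part of pyalm) that keeps requesting pages, here via a stub standing in for the HTTP call, until the meta block reports that no pages remain.

```python
def fetch_all(fetch_page):
    """Yield every article across all result pages.

    fetch_page(page) must return a dict with an 'articles' list and a
    'meta' block containing 'page' and 'total_pages', mirroring the
    meta dictionary shown above.
    """
    page = 1
    while True:
        result = fetch_page(page)
        for article in result["articles"]:
            yield article
        if result["meta"]["page"] >= result["meta"]["total_pages"]:
            break
        page += 1

# A stub standing in for the HTTP call: 120 fake records, 50 per page.
def stub_fetch(page, per_page=50, total_items=120):
    items = list(range(total_items))
    pages = (total_items + per_page - 1) // per_page
    chunk = items[(page - 1) * per_page : page * per_page]
    return {"articles": chunk, "meta": {"page": page, "total_pages": pages}}

print(len(list(fetch_all(stub_fetch))))  # 120
```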
The articles key holds a list of ArticleALM objects as its value.
Each ArticleALM object has a set of internal attributes that contain
information on each of the metrics that the Lagotto instance col-
lects. These are derived from various data providers and are called
sources. Each can be accessed by name from a dictionary called
“sources.” The iterkeys() function provides an iterator that lets us
loop over the set of keys in a dictionary. Within the source object
there is a range of information that we will dig into.
>> article = det_alm_test.get('articles')[0]
>> article.title
u'Expert Failure: Re-evaluating Research Assessment'
>> for source in article.sources.iterkeys():
>>     print source, article.sources[source].metrics.total
reddit 0
datacite 0
pmceuropedata 0
wikipedia 1
pmceurope 0
citeulike 0
pubmed 0
facebook 0
wordpress 0
pmc 0
mendeley 0
crossref 0
The DET service only has a record of citations to this article
from Wikipedia. As we will see below, the PLOS service returns
more results. This is because some of the sources are not yet being
queried by DET.
Because this is a PLOS paper we can also query the PLOS Lagotto
instance for the same article.
>> plos_alm_test = pyalm.get_alm('10.1371/journal.pbio.1001677',
>>                               info='detail', instance='plos')
>> article_plos = plos_alm_test.get('articles')[0]
>> article_plos.title
u'Expert Failure: Re-evaluating Research Assessment'
>> for source in article_plos.sources.iterkeys():
>>     print source, article_plos.sources[source].metrics.total
datacite 0
twitter 130
pmc 610
articlecoveragecurated 0
pmceurope 1
pmceuropedata 0
researchblogging 0
scienceseeker 0
copernicus 0
f1000 0
wikipedia 1
citeulike 0
wordpress 2
openedition 0
reddit 0
nature 0
relativemetric 125479
figshare 0
facebook 1
mendeley 14
crossref 3
plos_comments 2
articlecoverage 0
counter 12551
scopus 2
pubmed 1
orcid 3
The PLOS instance provides a greater range of information but
also seems to be giving larger numbers than the DET instance in
many cases. For those sources that are provided by both API in-
stances, we can compare the results returned.
>> for source in article.sources.iterkeys():
>>     print source, article.sources[source].metrics.total, \
>>         article_plos.sources[source].metrics.total
reddit 0 0
datacite 0 0
pmceuropedata 0 0
wikipedia 1 1
pmceurope 0 1
citeulike 0 0
pubmed 0 1
facebook 0 1
wordpress 0 2
pmc 0 610
mendeley 0 14
crossref 0 3
The PLOS Lagotto instance is collecting more information and
has a wider range of information sources. Comparing the results
from the PLOS and DET instances illustrates the issues of coverage
and completeness discussed previously. The data may be sparse for
a variety of reasons, and it is important to have a clear idea of the
strengths and weaknesses of a particular data source or aggregator.
In this case the DET instance is returning information for some
sources for which it does not yet have data.
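Such comparisons can be automated. Using the counts printed above for the sources reported by both instances, a dictionary comprehension flags every source on which the two aggregators disagree.

```python
# Counts for the sources reported by both Lagotto instances,
# transcribed from the output above (DET first, PLOS second).
det = {"reddit": 0, "datacite": 0, "pmceuropedata": 0, "wikipedia": 1,
       "pmceurope": 0, "citeulike": 0, "pubmed": 0, "facebook": 0,
       "wordpress": 0, "pmc": 0, "mendeley": 0, "crossref": 0}
plos = {"reddit": 0, "datacite": 0, "pmceuropedata": 0, "wikipedia": 1,
        "pmceurope": 1, "citeulike": 0, "pubmed": 1, "facebook": 1,
        "wordpress": 2, "pmc": 610, "mendeley": 14, "crossref": 3}

# Sources on which the two aggregators disagree, with both counts.
gaps = {s: (det[s], plos[s]) for s in det if det[s] != plos[s]}
for source in sorted(gaps):
    print(source, gaps[source])
```

Seven of the twelve shared sources disagree, which is a compact way of seeing the coverage gap between the two instances.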
We can dig deeper into the events themselves that these metric
counts aggregate. The API wrapper collects these into an event
object within the source object. These contain the JSON
returned from the API in most cases. For instance, the Crossref
source is a list of JSON objects containing information on an article
that cites our article of interest. The first citation event in the list is
a citation from the Journal of the Association for Information Science
and Technology by Du et al.
>> article_plos.sources['crossref'].events[0]
{u'event':
  {u'article_title': u'The effects of research level and article
type on the differences between citation metrics and F1000
recommendations',
   u'contributors':
     {u'contributor':
       [{u'contributor_role': u'author',
         u'first_author': u'true',
         u'given_name': u'Jian',
         u'sequence': u'first',
         u'surname': u'Du'},
        {u'contributor_role': u'author',
         u'first_author': u'false',
         u'given_name': u'Xiaoli',
         u'sequence': u'additional',
         u'surname': u'Tang'},
        {u'contributor_role': u'author',
         u'first_author': u'false',
         u'given_name': u'Yishan',
         u'sequence': u'additional',
         u'surname': u'Wu'}]},
   u'doi': u'10.1002/asi.23548',
   u'first_page': u'n/a',
   u'fl_count': u'0',
   u'issn': u'23301635',
   u'journal_abbreviation': u'J Assn Inf Sci Tec',
   u'journal_title': u'Journal of the Association for
Information Science and Technology',
   u'publication_type': u'full_text',
   u'year': u'2015'},
 u'event_csl': {
   u'author':
     [{u'family': u'Du', u'given': u'Jian'},
      {u'family': u'Tang', u'given': u'Xiaoli'},
      {u'family': u'Wu', u'given': u'Yishan'}],
   u'container-title': u'Journal of the Association for
Information Science and Technology',
   u'issued': {u'date-parts': [[2015]]},
   u'title': u'The Effects Of Research Level And Article Type
On The Differences Between Citation Metrics And F1000
Recommendations',
   u'type': u'article-journal',
   u'url': u''},
 u'event_url': u''}
Another source in the PLOS data is Twitter. In the case of the
Twitter events (individual tweets), this provides the text of the tweet,
user IDs, user names, URL of the tweet, and the date. We can see
from the length of the events list that there are at least 130 tweets
that link to this article.
>> len(article_plos.sources['twitter'].events)
Again, noting the issues of coverage, scope, and completeness,
it is important to consider the limitations of these data. This is a
lower bound as it represents search results returned by searching
the Twitter API for the DOI or URL of the article. Other tweets that
discuss the article may not include a link, and the Twitter search
API also has limitations that can lead to incomplete results. The
number must therefore be seen as both incomplete and a lower
bound. We can look more closely at data on the first tweet on the
list.
Bear in mind that the order of the list is not necessarily special.
This is not the first tweet about this article chronologically.
>> article_plos.sources['twitter'].events[0]
{u'event': {u'created_at': u'2013-10-08T21:12:28Z',
            u'id': u'387686960585641984',
            u'text': u'We have identified the Higgs boson; it is surely not
beyond our reach to make research assessment useful http://t',
            u'user': u'catmacOA',
            u'user_name': u'Catriona MacCallum'},
 u'event_time': u'2013-10-08T21:12:28Z',
 u'event_url': u''}
We could use the Twitter API to understand more about this
person. For instance, we could look at their Twitter followers and
whom they follow, or analyze the text of their tweets for topic mod-
eling. Much work on social media interactions is done with this
kind of data, using forms of network and text analysis described
elsewhere in this book.
See Chapters 7 and 8.
A different approach is to integrate these data with informa-
tion from another source. We might be interested, for instance,
in whether the author of this tweet is a researcher, or whether they
have authored research papers. One thing we could do is search
the ORCID API to see if there are any ORCID profiles that link to
this Twitter handle.
>> twitter_search ="catmacOA")
>> for result in twitter_search:
>>     print unicode(result)
>>     print result.researcher_urls
<Author Catriona MacCallum, ORCID 0000-0001-9623-2225>
[<Website twitter []>]
So the person with this Twitter handle seems to have an ORCID
profile. That means we can also use ORCID to gather more infor-
mation on their outputs. Perhaps they have authored work which
is relevant to our article?
>> cm = orcid.get("0000-0001-9623-2225")
>> for pub in cm.publications[0:5]:
>> print pub.title
The future is open: opportunities for publishers and institutions
Open Science and Reporting Animal Studies: Who’s Accountable?
Expert Failure: Re-evaluating Research Assessment
Why ONE Is More Than 5
Reporting Animal Studies: Good Science and a Duty of Care
From this analysis we can show that this tweet is actually from
one of my co-authors of the article.
To make this process easier we write the convenience function
shown in Listing 2.2 to go from a Twitter user handle to try and find
an ORCID for that person.
# Take a twitter handle or user name and return an ORCID
def twitter2orcid(twitter_handle,
                  resp='orcid', search_depth=10):
    search =
    s = [r for r in search]
    orc = None
    i = 0
    while i < search_depth and orc == None and i < len(s):
        arr = [('' in website.url)
               for website in s[i].researcher_urls]
        if True in arr:
            index = arr.index(True)
            url = s[i].researcher_urls[index].url
            if url.lower().endswith(twitter_handle.lower()):
                orc = s[i].orcid
                return orc
        i += 1
    return None

Listing 2.2. Python code to find ORCID for Twitter handle
Let us do a quick test of the function.
>> twitter2orcid('catmacOA')
u'0000-0001-9623-2225'
2.8.2 Working with a corpus
In this case we will continue as previously to collect a set of works
from a single ORCID profile. This collection could just as easily be
a date range, or subject search at a range of other APIs. The target
is to obtain a set of identifiers (in this case DOIs) that can be used
to precisely query other data sources. This is a general pattern
that reflects the issues of scope and source discussed above. The
choice of how to construct a corpus to analyze will strongly affect
the results and the conclusions that can be drawn.
>> # As previously, collect DOIs available from an ORCID profile
>> cn = orcid.get("0000-0002-0068-716X")
>> exids = []
>> for pub in cn.publications:
>> if pub.external_ids:
>> exids = exids + pub.external_ids
>> DOIs = [ for exid in exids if exid.type == "DOI"]
>> len(DOIs)
We have recovered 66 DOIs from the ORCID profile. Note that we
have not obtained an identifier for every work, as not all have DOIs.
This result illustrates an important point about data integration. In
practice it is generally not worth the effort of attempting to integrate
data on objects unless they have a unique identifier or key that
can be used in multiple data sources, hence the focus on DOIs and
ORCIDs in these examples. Even in our search of the ORCID API
for profiles that are associated with a Twitter account, we used the
Twitter handle as a unique ID to search on.
While it is possible to work with author names or the titles of
works directly, disambiguating such names and titles is substan-
tially more difficult than working with unique identifiers. Other
chapters (in particular, Chapter 3) deal with issues of data cleaning
and disambiguation. Much work has been done on this basis, but
increasingly you will see that the first step in any analysis is simply
to discard objects without a unique ID that can be used across data sources.
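That first step, keeping only objects that carry a unique ID, can be sketched as a simple filter. The record format here is hypothetical (a list of dicts with an optional 'doi' key), not the structure returned by any of the APIs above:

```python
# Hypothetical records: some works carry a DOI, some do not
records = [
    {'title': 'Article A', 'doi': '10.1371/journal.pone.0000001'},
    {'title': 'Preprint B'},  # no unique ID: discarded
    {'title': 'Article C', 'doi': '10.1371/journal.pone.0000002'},
]

# Keep only objects with a unique identifier usable across sources
with_ids = [r for r in records if r.get('doi')]
print(len(with_ids))  # → 2
```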
We can obtain data for these from the DET API. As is common
with many APIs, there is a limit to how many queries can be simul-
taneously run, in this case 50, so we divide our query into batches.
>> batches = [DOIs[0:50], DOIs[50:]]
>> det_alms = []
>> for batch in batches:
>>     alms_response = pyalm.get_alm(batch, info="detail", instance="det")
>>     det_alms.extend(alms_response.get('articles'))
>> len(det_alms)
24
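The hard-coded slices above can be generalized. A minimal batching helper (our own sketch, not part of pyalm) that splits any list of identifiers into chunks of at most 50:

```python
def batch(items, size=50):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# 66 placeholder identifiers, mirroring the 66 DOIs above
dois = ['doi-%d' % n for n in range(66)]
batches = batch(dois)
print(len(batches))     # → 2
print(len(batches[0]))  # → 50
print(len(batches[1]))  # → 16
```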
The DET API only provides information on a subset of Cross-
ref DOIs. The process that Crossref has followed to populate its
database has focused on more recently published articles, so only
24 responses are received in this case for the 66 DOIs we queried on.
A good exercise would be to look at which of the DOIs are found and
which are not. Let us see how much interesting data is available in
the subset of DOIs for which we have data.
>> for r in [d for d in det_alms if d.sources['wikipedia'].metrics.total != 0]:
>>     print r.title
>>     print '    ', r.sources['pmceurope'].metrics.total, 'pmceurope citations'
>>     print '    ', r.sources['wikipedia'].metrics.total, 'wikipedia citations'
Architecting the Future of Research Communication: Building the
Models and Analytics for an Open Access Future
1 pmceurope citations
1 wikipedia citations
Expert Failure: Re-evaluating Research Assessment
0 pmceurope citations
1 wikipedia citations
LabTrove: A Lightweight, Web Based, Laboratory "Blog" as a Route
towards a Marked Up Record of Work in a Bioscience Research
0 pmceurope citations
1 wikipedia citations
The lipidome and proteome of oil bodies from Helianthus annuus (
common sunflower)
2 pmceurope citations
1 wikipedia citations
As discussed above, this shows that the DET instance, while it
provides information on a greater number of DOIs, has less com-
plete data on each DOI at this stage. Only four of the 24 responses
have Wikipedia references. You can change the code to look at the
full set of 24, which shows only sparse data. The PLOS Lagotto
instance provides more data but only on PLOS articles. However, it
does provide data on all PLOS articles, going back earlier than the
set returned by the DET instance. We can collect the set of articles
from the profile published by PLOS.
>> plos_dois = []
>> for doi in DOIs:
>>     # Quick and dirty, should check Crossref API for publisher
>>     if doi.startswith('10.1371'):
>>         plos_dois.append(doi)
>> len(plos_dois)
7

>> plos_alms = pyalm.get_alm(plos_dois, info='detail', instance='plos')
>> for article in plos_alms:
>>     print article.title
>>     print '    ', article.sources['crossref'].metrics.total, 'Crossref citations'
>>     print '    ', article.sources['twitter'].metrics.total, 'tweets'
Architecting the Future of Research Communication: Building the
Models and Analytics for an Open Access Future
2 Crossref citations
48 tweets
Expert Failure: Re-evaluating Research Assessment
3 Crossref citations
130 tweets
LabTrove: A Lightweight, Web Based, Laboratory "Blog" as a Route
towards a Marked Up Record of Work in a Bioscience Research
6 Crossref citations
1 tweets
More Than Just Access: Delivering on a Network-Enabled Literature
4 Crossref citations
95 tweets
Article-Level Metrics and the Evolution of Scientific Impact
24 Crossref citations
5 tweets
Optimal Probe Length Varies for Targets with High Sequence
Variation: Implications for Probe
Library Design for Resequencing Highly Variable Genes
2 Crossref citations
1 tweets
Covalent Attachment of Proteins to Solid Supports and Surfaces via
Sortase-Mediated Ligation
40 Crossref citations
0 tweets
From the previous examples we know that we can obtain in-
formation on citing articles and tweets associated with these 66
articles. From that initial corpus we now have a collection of up to
86 related articles (cited and citing), a few hundred tweets that refer
to (some of) those articles, and perhaps 500 people if we include
authors of both articles and tweets. Note how for each of these
links our query is limited, so we have a subset of all the related
objects and agents. At this stage we probably have duplicate articles
(one citing article might cite several of the seven in our set) and duplicate
people (authors in common between articles and authors who are
also tweeting).
These data could be used for network analysis, to build up a new
corpus of articles (by following the citation links), or to analyze the
links between authors and those tweeting about the articles. We do
not pursue an in-depth analysis here, but will gather the relevant
objects, deduplicate them as far as possible, and count how many
we have in preparation for future analysis.
>> # Collect all citing DOIs & author names from citing articles
>> citing_dois = []
>> citing_authors = []
>> for article in plos_alms:
>>     for cite in article.sources['crossref'].events:
>>         citing_dois.append(cite['event']['doi'])
>>         # Use 'extend' because the element is a list
>>         citing_authors.extend(cite['event_csl']['author'])
>> print '\nBefore deduplication:'
>> print '  ', len(citing_dois), 'DOIs'
>> print '  ', len(citing_authors), 'citing authors'
>> # Easiest way to deduplicate is to convert to a Python set
>> citing_dois = set(citing_dois)
>> citing_authors = set([author['given'] + author['family']
>>                       for author in citing_authors])
>> print '\nAfter deduplication:'
>> print '  ', len(citing_dois), 'DOIs'
>> print '  ', len(citing_authors), 'citing authors'

Before deduplication:
   81 DOIs
   346 citing authors

After deduplication:
   78 DOIs
   278 citing authors
>> # Collect all tweets, usernames; check for ORCIDs
>> tweet_urls = set()
>> twitter_handles = set()
>> for article in plos_alms:
>>     for tweet in article.sources['twitter'].events:
>>         tweet_urls.add(tweet['event_url'])
>>         twitter_handles.add(tweet['event']['user'])
>> # No need to explicitly deduplicate as we created sets directly
>> print len(tweet_urls), 'tweets'
>> print len(twitter_handles), 'Twitter users'
280 tweets
210 Twitter users
It could be interesting to look at which Twitter users interact
most with the articles associated with this ORCID profile. To do
that we would need to create not a set but a list, and then count the
number of duplicates in the list. The code could be easily modified
to do this. Another useful exercise would be to search ORCID for
profiles corresponding to citing authors. The best way to do this
would be to obtain ORCIDs associated with each of the citing ar-
ticles. However, because ORCID data are sparse and incomplete,
there are two limitations here. First, the author may not have an
ORCID. Second, the article may not be explicitly linked to the
author's ORCID profile. Try searching ORCID for the DOIs associated with each of
the citing articles.
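The counting exercise suggested above can be sketched with collections.Counter. The handles here are invented; in practice the list would be built in the loop over tweet events, using a list rather than a set so that duplicates survive:

```python
from collections import Counter

# Hypothetical list of handles, one entry per tweet event
tweet_users = ['user_a', 'user_b', 'user_a', 'user_c', 'user_a', 'user_b']

# Count how many tweets each handle contributed
counts = Counter(tweet_users)

# The handles interacting most with this set of articles
for handle, n in counts.most_common(2):
    print('%s %d' % (handle, n))  # → user_a 3, then user_b 2
```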
In this case we will look to see how many of the Twitter handles
discussing these articles are associated with an ORCID profile we
can discover. This in turn could lead to more profiles and more
cycles of analysis to build up a network of researchers interacting
through citation and on Twitter. Note that we have inserted a delay
between calls. This is because we are making a larger number of API
calls (one for each Twitter handle). It is considered polite to keep
the pace at which calls are made to an API to a reasonable level.
The ORCID API does not post suggested limits at the moment, but
delaying for a second between calls is reasonable.
>> tweet_orcids = []
>> for handle in twitter_handles:
>>     orc = twitter2orcid(handle)
>>     if orc:
>>         tweet_orcids.append(orc)
>>     time.sleep(1) # wait one second between each call to the ORCID API
>> print len(tweet_orcids)
12
In this case we have identified 12 ORCID profiles that we can
link positively to tweets about this set of articles. This is a substan-
tial underestimate of the likely number of ORCIDs associated with
these tweets. However, relatively few ORCIDs have Twitter accounts
registered as part of the profile. To gain a broader picture a search
and matching strategy would need to be applied. Nevertheless, for
these 12 we can look more closely into the profiles.
The first step is to obtain the actual profile information for each
of the 12 ORCIDs that we have found. Note that at the moment
what we have is the ORCIDs themselves, not the retrieved profiles.
>> orcs = []
>> for id in tweet_orcids:
>>     orcs.append(orcid.get(id))
With the profiles retrieved we can then take a look at who they
are, and check that we do in fact have sensible Twitter handles
associated with them. We could use this to build up the network of
related authors and Twitter users for further analysis.
>> for orc in orcs:
>>     i = [('twitter.com' in website.url)
>>          for website in orc.researcher_urls].index(True)
>>     twitter_url = orc.researcher_urls[i].url
>>     print orc.given_name, orc.family_name, orc.orcid, twitter_url
Catriona MacCallum 0000-0001-9623-2225
John Dupuis 0000-0002-6066-690X
Johannes Velterop 0000-0002-4836-6568
Stuart Lawson 0000-0002-1972-8953
Nelson Piedra 0000-0003-1067-8707
Iryna Kuchma 0000-0002-2064-3439
Frank Huysmans 0000-0002-3468-9032
Salvatore Salvi VICIDOMINI 0000-0001-5086-7401
William Gunn 0000-0002-3555-2054
Stephen Curry 0000-0002-0552-8870
Cameron Neylon 0000-0002-0068-716X
Graham Steel 0000-0003-4681-8011
2.9 Working with the graph of relationships
In the above examples we started with the profile of an individual,
used this to create a corpus of works, which in turn led us to other
citing works (and their authors) and commentary about those works
on Twitter (and the people who wrote those comments). Along the
way we built up a graph of relationships between objects and people.
See Chapter 8.
In this section we will look at this model of the data and how it
reveals limitations and strengths of these forms of data and what
can be done with them.
2.9.1 Citation links between articles
A citation in a research article (or a policy document or working
paper) defines a relationship between that citing article and the
cited article. The exact form of the relationship is generally poorly
defined, at least at the level of large-scale data sets. A citation
might be referring to previous work, indicating the source of data,
or supporting (or refuting) an idea. While efforts have been made to
codify citation types, they have thus far gained little traction.
In our example we used a particular data source (Crossref) for
information about citations. As previously discussed, this will give
different results than other sources (such as Thomson Reuters, Sco-
pus, or Google Scholar) because other sources look at citations from
a different set of articles and collect them in a different way. The
completeness of the data will always be limited. We could use the
data to clearly connect the citing articles and their authors because
author information is generally available in bibliographic metadata.
However, we would have run into problems if we had only had
names. ORCIDs can provide a way to uniquely identify authors
and ensure that our graph of relationships is clean.
A citation is a reference from an object of one type to an object
of the same type. We also sought to link social media activity with
specific articles. Rather than a link between objects that are the
same (articles) we started to connect different kinds of objects to-
gether. We are also expanding the scope of the communities (i.e.,
people) that might be involved. While we focused on the question of
which Twitter handles were connected with researchers, we could
Figure 2.4. A functional view of proxies and relationships (relationship labels: Authored by, Tweeted by, Links to; agent categories: Research group, Patient group, Other Agents/Stakeholders)
just as easily have focused on trying to discover which comments
came from people who are not researchers.
We used the Lagotto API at PLOS to obtain this information.
The PLOS API in turn depends on the Twitter Search API. A tweet
that refers explicitly to a research article, perhaps via a Crossref
DOI, can be discovered, and a range of services do these kinds of
checks. These services generally rely either on Twitter Search or,
more generally, on a search of “the firehose,” a dump of all Twitter
data that are available for purchase. The distinction is important
because Twitter Search does not provide either a complete or a con-
sistent set of results. In addition, there will be many references to
research articles that do not contain a unique identifier, or even a
link. These are more challenging to discover. As with citations, the
completeness of any data set will always be limited.
However, the set of all tweets is a more defined set of objects than
the set of all “articles.” Twitter is a specific social media service with
a defined scope. “Articles” is a broad class of objects served by a
very wide range of services. Twitter is clearly a subset of all discus-
sions and is highly unlikely to be representative of “all discussions.”
Equally the set of all objects with a Crossref DOI, while defined, is
unlikely to be representative of all articles.
Expanding on Figure 2.3, we show in Figure 2.4 agents and ac-
tors (people) and outputs. We place both agents and outputs into
categories that may be more or less well defined. In practice our
analysis is limited to those objects that we discover by using some
“selector” (circles in this diagram), which may or may not have a
close correspondence with the “real” categories (shown with graded
shapes). Our aim is to identify, aggregate, and in some cases count
the relationships between and within categories of objects; for in-
stance, citations are relationships between formal research outputs.
A tweet may have a relationship (“links to”) with a specific formally
published research output. Both tweets and formal research out-
puts relate to specific agents (“authors”) of the content.
2.9.2 Categories, sources, and connections
We can see in this example a distinction between categories of ob-
jects of interest (articles, discussions, people) and sources of infor-
mation on subsets of those categories (Crossref, Twitter, ORCID).
Any analysis will depend on one or more data sources, and in turn
be limited by the coverage of those data sources. The selectors
used to generate data sets from these sources will have their own limitations.
Similar to a query on a structured data set, the selector itself may
introduce bias. The crucial difference between filtering on a com-
prehensive (or at least representative) data set and the data sources
we are discussing here is that these data sources are by their very
nature incomplete. Survey data may include biases introduced in
the way that the survey itself is structured or the sampling is de-
signed, but the intent is to be comprehensive. Many of these new
forms of data make no attempt to be comprehensive or avowedly
avoid such an attempt.
Understanding this incompleteness is crucial to understanding
the forms of inference that can be made from these data. Sampling
is only possible within a given source or corpus, and this limits the
conclusions that can be drawn to the scope of that corpus. It is
frequently possible to advance a plausible argument to claim that
such findings are more broadly applicable, but it is crucial to avoid
assuming that this is the case. In particular, it is important to
be clear about what data sources a finding applies to and where
the boundary between the strongly evidenced finding and a claim
about its generalization lies. Much of the literature on scholarly
communications and research impact is poor on this point.
If this is an issue in identifying the objects of interest, it is even
more serious when seeking to identify the relationships between
them, which are, after all, generally the things of interest. In some
cases there are reasonably good sources of data between objects of
the same class (at least those available from the same data sources)
such as citations between journal articles or links between tweets.
However, as illustrated in this chapter, detecting relationships be-
tween tweets and articles is much more challenging.
These issues can arise due to the completeness of the
data source itself (e.g., ORCID currently covers only a subset of
researchers; therefore, the set of author–article relationships is lim-
ited), due to the challenges of identification (e.g., in the Twitter
case above), or due to technical limitations at source (the difference
between the Twitter search API and the firehose). In addition, be-
cause the source data and the data services are both highly dynamic
and new, there is often a mismatch. Many services tracking Twitter
data only started collecting data relatively recently. There is a range
of primary and secondary data sources working to create more com-
plete data sets. However, once again it is important to treat all of
these data as sparse and limited as well as highly dynamic and
2.9.3 Data availability and completeness
With these caveats in hand and the categorization discussed above,
we can develop a mapping of what data sources exist, what objects
those data sources inform us about, the completeness of those data
sources, and how well the relationships between the different data
sources are tracked. Broadly speaking, data sources concern them-
selves with either agents (mostly people) or objects (articles, books,
tweets, posts), while also providing data about
the relationships of the agents or objects that they describe with
other objects or agents.
The five broad types of data described above are often treated
as ways of categorizing the data source. They are more properly
thought of as relationships between objects, or between objects and
agents. Thus, for example, citations are relationships between ar-
ticles; the tweets that we are considering are actually relationships
between specific Twitter posts and articles; and “views” are an event
associating a reader (agent) with an article. The last case illustrates
that often we do not have detailed information on the relationship
but merely a count of them. Relationships between agents (such as
co-authorship or group membership) can also be important.
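This framing, data as typed relationships between objects and agents, can be sketched as a small edge list. All identifiers here are invented for illustration:

```python
from collections import defaultdict

# Each relationship is a (source, relation-type, target) triple
edges = [
    ('article-A', 'cites',       'article-B'),
    ('tweet-1',   'links_to',    'article-A'),
    ('article-A', 'authored_by', 'person-X'),
    ('tweet-1',   'tweeted_by',  'person-Y'),
]

# Group relationships by type, e.g. to count citations
# separately from tweet links
by_type = defaultdict(list)
for src, rel, dst in edges:
    by_type[rel].append((src, dst))

print(sorted(by_type))  # → ['authored_by', 'cites', 'links_to', 'tweeted_by']
```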
With this framing in hand, we can examine which types of re-
lationships we can obtain data on. We need to consider both the
quality of data available and the completeness of the data availabil-
ity. These metrics are necessarily subjective and any analysis will
be a personal view of a particular snapshot in time. Nevertheless,
some major trends are available.
We have growing and improving data on the relationships be-
tween a wide range of objects and agents and traditional scholarly
outputs. Although it is sparse and incomplete in many places,
nontraditional information on traditional outputs is becoming more
available and increasingly rich. By contrast, references from tradi-
tional outputs to nontraditional outputs are weaker and data that
allow us to understand the relationships between nontraditional
outputs are very sparse.
In the context of the current volume, a major weakness is our
inability to triangulate around people and communities. While it
may be possible to collect a set of co-authors from a bibliographic
data source and to identify a community of potential research users
on Twitter or Facebook, it is extremely challenging to connect these
different sets. If a community is discussing an article or book on
social media, it is almost impossible to ascertain whether the au-
thors (or, more generically, interested parties such as authors of
cited works or funders) are engaged in that conversation.
2.9.4 The value of sparse dynamic data
Two clear messages arise from our analysis. These new forms of
data are incomplete or sparse, both in quality and in coverage, and
they change. A data source that is poor today may be much im-
proved tomorrow. A query performed one minute may give different
results the next. This can be both a strength and a weakness: data
are up to the minute, giving a view of relationships as they form
(and break), but it makes ensuring consistency within analyses and
across analyses challenging. Compared to traditional surveys, these
data sources cannot be relied on to be either representative or stable.
A useful question to ask, therefore, is what kind of statements
these data can support. Questions like this will be necessarily differ-
ent from the questions that can be posed with high-quality survey
data. More often they provide an existence proof that something
has happened—but they cannot, conversely, show that it has not.
They enable some forms of comparison and determination of the
characteristics of activity in some cases.
Provide evidence that . . . Because much of the data that we have
is sparse, the absence of an indicator cannot reliably be taken to
mean an absence of activity. For example, a lack of Mendeley book-
marks may not mean that a paper is not being saved by researchers,
just that those who do save the article are not using Mendeley to
do it. Similarly, a lack of tweets about an article does not mean the
article is not being discussed. But we can use the data that do exist
to show that some activity is occurring. Here are some examples:
Provide evidence that relevant communities are aware of a spe-
cific paper. I identified the fact that a paper by Jewkes et
al. [191] was mentioned by crisis centers, sexual health orga-
nizations, and discrimination support groups in South Africa
when I was looking for University of Cape Town papers that
had South African Twitter activity using
Provide evidence that a relatively under-cited paper is having
a research impact. There is a certain kind of research article,
often a method description or a position paper, that is influ-
ential without being (apparently) heavily cited. For instance,
the PLoS One article by Shen et al. [341] has a respectable
14,000 views and 116 Mendeley bookmarks, but a relatively
(for the number of views) small number of WoS citations (19)
compared to, say, another article, by Leahy et al. [231] and
also in PLoS One, that is similar in age and number of views
but has many more citations.
Provide evidence of public interest in some topic. Many art-
icles at the top of lists ordered by views or social media men-
tions are of ephemeral (or prurient) interest—the usual trilogy
of sex, drugs, and rock and roll. However, if we dig a little
deeper, a wide range of articles surface, often not highly cited
but clearly of wider interest. For example, an article on Y-
chromosome distribution in Afghanistan [146] has high page
views and Facebook activity among papers with a Harvard af-
filiation but is not about sex, drugs, or rock and roll. Unfortunately,
because this is Facebook data we cannot see who is talking about it,
which limits our ability to say which groups are involved, something
that could be quite interesting.
Compare . . . Comparisons using social media or download stat-
istics need real care. As noted above, the data are sparse so it is
important that comparisons are fair. Also, comparisons need to be
on the basis of something that the data can actually tell you: for ex-
ample, “which article is discussed more by this online community,”
not “which article is discussed more.”
Compare the extent to which these articles are discussed by
this online patient group, or possibly specific online communi-
ties in general. Here the online communities might be a proxy
for a broader community, or there might be a specific interest
in knowing whether the dissemination strategy reaches this
community. It is clear that in the longer term social media will
be a substantial pathway for research to reach a wide range
of audiences, and understanding which communities are dis-
cussing what research will help us to optimize the communication.
Compare the readership of these articles in these countries.
One thing that most data sources are weak on at the moment
is demographics, but in principle the data are there. Are these
articles that deal with diseases of specific areas actually being
viewed by readers in those areas? If not, why not? Do they
have Internet access, could lay summaries improve dissemi-
nation, are they going to secondary online sources instead?
Compare the communities discussing these articles online. Is
most conversation driven by science communicators or by re-
searchers? Are policymakers, or those who influence them,
involved? What about practitioner communities? These com-
parisons require care, and simple counting rarely provides
useful information. But understanding which people within
which networks are driving conversations can give insight into
who is aware of the work and whether it is reaching target audiences.
What flavor is it? Priem et al. [310] provide a thoughtful analysis
of the PLOS Article Level Metrics data set. They used principal
component analysis to define different “flavors of impact” based on
the way different combinations of signals seemed to point to different
kinds of interest. Many of the above use cases are variants on this
theme—what kind of article is this? Is it a policy piece, of public
interest? Is it of interest to a niche research community or does it
have wider public implications? Is it being used in education or in
health practice? And to what extent are these different kinds of use
independent from each other?
It is important to realize that these kinds of data are proxies of
things that we do not truly understand. They are signals of the flow
of information down paths that we have not mapped. To me this
is the most exciting possibility and one we are only just starting to
explore. What can these signals tell us about the underlying path-
ways down which information flows? How do different combinations
of signals tell us about who is using that information now, and how
they might be applying it in the future? Correlation analysis cannot
answer these questions, but more sophisticated approaches might.
And with that information in hand we could truly design schol-
arly communication systems to maximize their reach, value, and efficiency.
2.10 Bringing it together: Tracking pathways
to impact
Collecting data on research outputs and their performance clearly
has significant promise. However, there are a series of substantial
challenges in how best to use these data. First, as we have seen,
it is sparse and patchy. Absence of evidence cannot be taken as
evidence of absence. But, perhaps more importantly, it is unclear
in many cases what these various proxies actually mean. Of course
this is also true of more familiar indicators like citations.
Finally, there is a challenge in how to effectively analyze these
data. The sparse nature of the data is a substantial problem in it-
self, but in addition there are a number of significantly confounding
effects. The biggest of these is time. The process of moving research
outputs and their use online is still proceeding, and the uptake and
penetration of online services and social media by researchers and
other relevant communities has increased rapidly over the past few
years and will continue to do so for some time.
These changes are occurring on a timescale of months, or even
weeks, so any analysis must take into account how those changes
may contribute to any observed signal. Much attention has focused
on how different quantitative proxies correlate with each other. In
essence this has continued the mistake that has already been made
with citations. Focusing on proxies themselves implicitly makes the
assumption that it is the proxy that matters, rather than the un-
derlying process that is actually of interest. Citations are irrelevant;
what matters is the influence that a piece of research has had. Cita-
tions are merely a proxy for a particular slice of influence, a (limited)
indicator of the underlying process in which a research output is
used by other researchers.
Of course, these are common challenges for many “big data”
situations. The challenge lies in using large, but disparate and
messy, data sets to provide insight while avoiding the false positives
that will arise from any attempt to mine data blindly for correlations.
Using the appropriate models and tools and careful validation of
findings against other sources of data are the way forward.
2.10.1 Network analysis approaches
One approach is to use these data to dissect and analyze the (visible)
network of relationships between agents and objects. This approach
can be useful in defining how networks of collaborators change over
time, who is in contact with whom, and how outputs are related to
each other. This kind of analysis has been productive with citation
graphs (see Eigenfactor for an example) as well as with small-scale
analysis of grant programs (see, for instance, the Lattes analysis of
the network grant program).
Network analysis techniques and visualization are covered in
Chapter 8 (on networks) and clustering and categorization in Chap-
ter 6 (on machine learning). Networks may be built up from any
combination of outputs, actors/agents, and their relationships to
each other. Analyses that may be particularly useful are those
searching for highly connected (proxy for influential) actors or out-
puts, clustering to define categories that emerge from the data itself
(as opposed to external categorization) and comparisons between
networks, both between those built from specific nodes (people,
outputs) and between networks that are built from data relating
to different time frames.
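The first of those analyses, searching for highly connected actors, can be sketched without a graph library by counting node degrees over an edge list (nodes and edges invented for illustration; Chapter 8 covers real network tooling):

```python
from collections import Counter

# Hypothetical undirected edges linking authors to outputs
edges = [('author_a', 'paper_1'), ('author_b', 'paper_1'),
         ('author_a', 'paper_2'), ('author_c', 'paper_2'),
         ('author_a', 'paper_3')]

# Degree = number of edges touching a node; a high degree is a
# crude proxy for influence
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

print(degree.most_common(1)[0])  # → ('author_a', 3)
```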
Care is needed with such analyses to make sure that compar-
isons are valid. In particular, when doing analyses of different time
frames, it is important to separate changes in the network characteristics
that are due to general trends over time from those due
to specific changes. As noted above, this is particularly important
with networks based on social media data, as any networks are
likely to have increased in size and diversity over the past few years
as more users interested in research have joined. It is important
to distinguish in these cases between changes relating to a spe-
cific intervention or treatment and those that are environmental.
As with any retrospective analysis, a good counterfactual sample is needed.
2.10.2 Future prospects and new data sources
As the broader process of research moves online we are likely to
have more and more information on what is being created, by whom,
and when. As access to these objects increases, both through pro-
vision of open access to published work and through increased data
sharing, it will become more and more feasible to mine the objects
themselves to enrich the metadata. And finally, as the use of unique
identifiers increases for both outputs and people, we will be able to
cross-reference across data sources much more strongly.
Much of the data currently being collected is of poor quality or
is inconsistently processed. Major efforts are underway to develop
standards and protocols for initial processing, particularly for page
view and usage data. With efforts such as the Crossref DOI
Event Tracker Service providing central clearinghouses for data,
both consistency and completeness will continue to improve, making
new and more comprehensive forms of analysis feasible.
Perhaps the most interesting prospect is new data that arise as
more of the outputs and processes of research move online. As the
availability of data outputs, software products, and even potentially
the raw record of lab notebooks increases, we will have opportuni-
ties to query how (and how much) different reagents, techniques,
tools, and instruments are being used. As the process of policy de-
velopment and government becomes more transparent and better
connected, it will be possible to watch in real time as research has
its impact on the public sphere. And as health data moves online
there will be opportunities to see how both chemical and behavioral
interventions affect health outcomes in real time.
In the end, all of these data will also be grist to the mill for further
research. For the first time we will have the opportunity to treat the
research enterprise as a system that is subject to optimization and
engineering. Once again, the question of what we are seeking
to optimize for is one that the data themselves cannot answer, but
the data can help us to have a better-informed debate about what matters.
2.11 Summary
The term research impact is difficult and politicized, and it is used
differently in different areas. At its root it can be described as the
change that a particular part of the research enterprise (e.g., re-
search project, researcher, funding decision, or institute) makes in
the world. In this sense, it maps well to standard approaches in
the social sciences that seek to identify how an intervention has led
to change.
68 2. Working with Web Data and APIs
The link between “impact” and the distribution of limited re-
search resources makes its definition highly political. In fact, there
are many forms of impact, and the pathways to different kinds of
change (further research, economic growth, improvement in health
outcomes, greater engagement of citizenry, or environmental
change) may have little in common beyond our interest in
how they can be traced to research outputs. Most public policy
on research investment has avoided the difficult question of which
impacts are most important.
In part this is due to the historical challenges of providing ev-
idence for these impacts. We have only had good data on for-
mal research outputs, primarily journal articles, and measures
have focused on naïve metrics such as productivity or citations, or
on qualitative peer review. Broader impacts have largely been
evidenced through case studies, an expensive and nonscalable approach.
The move of research processes online is providing much richer
and more diverse information on how research outputs are used
and disseminated. We have the prospect of collecting much more
information around the performance and usage of traditional re-
search outputs as well as greater data on the growing diversity of
nontraditional research outputs that are now being shared.
It is possible to gain quantitative information on the numbers of
people looking at research, different groups talking about research
(in different places), those citing research in different places, and
recommendations and opinions on the value of work. These data
are sparse and incomplete, and their use needs to acknowledge these
limitations, but it is nonetheless possible to gain new and valuable
insights from analysis.
Much of this data is available from web services in the form of ap-
plication programming interfaces. Well-designed APIs make it easy
to search for, gather, and integrate data from multiple sources. A
key aspect of successfully integrating data is the effective use and
application of unique identifiers across data sets that allow straight-
forward cross-referencing. Key among the identifiers currently be-
ing used are ORCIDs to uniquely identify researchers and DOIs,
from both Crossref and increasingly DataCite, to identify research
outputs. With good cross-referencing it is possible to obtain rich
data sets that can be used as inputs to many of the techniques
described elsewhere in the book.
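As a minimal sketch of working with these identifiers, the snippet below normalizes a DOI and builds a request URL for the Crossref REST API (whose works endpoint lives at api.crossref.org/works/{doi}). The helper names are our own illustration, and real code would also need to fetch the URL and handle network errors and rate limits.

```python
def normalize_doi(doi):
    """Strip common URL and scheme prefixes; DOIs are case-insensitive,
    so lowercase for consistent cross-referencing."""
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.lower()

def crossref_url(doi):
    """Build the Crossref REST API URL for a work's metadata record."""
    return "https://api.crossref.org/works/" + normalize_doi(doi)

# 10.1000/xyz123 is the example DOI from the DOI handbook.
print(crossref_url("https://doi.org/10.1000/XYZ123"))
```

Normalizing identifiers before cross-referencing is exactly the kind of small step that makes joins across data sources reliable.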
The analysis of this new data is a nascent field and the quality of
work done so far has been limited. In my view there is a substantial
opportunity to use these rich and diverse data sets to treat the
underlying question of how research outputs flow from the academy
to their sites of use. What are the underlying processes that lead
to various impacts? This means treating these data sets as time
domain signals that can be used to map and identify the underlying
processes. This approach is appealing because it offers the promise
of probing the actual process of knowledge diffusion while making
fewer assumptions about what we think is happening.
2.12 Resources
We talked a great deal here about how to access publications and
other resources via their DOIs. Paskin [297] provides a nice sum-
mary of the problems that DOIs solve and how they work.
ORCIDs are another key piece of this puzzle, as we have seen
throughout this chapter. You might find some of the early articles
describing the need for unique author IDs useful, such as Bourne
et al. [46], as well as more recent descriptions [145]. More recent
initiatives on expanding the scope of identifiers to materials and
software have also been developed [24].
More general discussions of the challenges and opportunities
of using metrics in research assessment may be found in recent
reports such as the HEFCE Expert Group Report [405], and I have
covered some of the broader issues elsewhere [274].
There are many good introductions to web scraping using Beau-
tifulSoup and other libraries as well as API usage in general. Given
the pace at which APIs and Python libraries change, the best and
most up-to-date source of information is likely to be a web search.
In other settings, you may be concerned with assigning DOIs to
data that you generate yourself, so that you and others can easily
and reliably refer to and access that data in their own work. Here we
face an embarrassment of riches, with many systems available that
each meet different needs. Big data research communities such
as climate science [404], high-energy physics [304], and astron-
omy [367] operate their own specialized infrastructures that you
are unlikely to require. For small data sets, Figshare [122] and
DataCite [89] are often used. The Globus publication service [71]
permits an institution or community to build their own publication system.
2.13 Acknowledgements and copyright
Section 2.3 is adapted in part from Neylon et al. [275], copyright In-
ternational Development Research Center, Canada, used here un-
der a Creative Commons Attribution v 4.0 License.
Section 2.9.4 is adapted in part from Neylon [273], copyright
PLOS, used here under a Creative Commons Attribution v 4.0
License.
Chapter 3
Record Linkage
Joshua Tokle and Stefan Bender
Big data differs from survey data in that it is typically necessary
to combine data from multiple sources to get a complete picture
of the activities of interest. Although computer scientists tend to
simply “mash” data sets together, social scientists are rightfully
concerned about issues of missing links, duplicative links, and
erroneous links. This chapter provides an overview of traditional
rule-based and probabilistic approaches, as well as the important
contribution of machine learning to record linkage.
3.1 Motivation
Big data offers social scientists great opportunities to bring together
many different types of data, from many different sources. Merg-
ing different data sets provides new ways of creating population
frames that are generated from the digital traces of human activity
rather than, say, tax records. These opportunities, however, cre-
ate different kinds of challenges from those posed by survey data.
Combining information from different sources about an individual,
business, or geographic entity means that the social scientist must
determine whether or not two entities on two different files are the
same. This determination is not easy. In the UMETRICS data, if
data are to be used to measure the impact of research grants, is
David A. Miller from Stanford, CA, the same as David Andrew Miller
from Fairhaven, NJ, in a list of inventors? Is Google the same as
Alphabet if the productivity and growth of R&D-intensive firms is
to be studied? Or, more generally, is individual A the same person
as the one who appears on a list of terrorists that has been com-
piled? Does the product that a customer is searching for match the
products that business B has for sale?
The consequences of poor record linkage decisions can be sub-
stantial. In the business arena, Christen reports that as much as
12% of business revenues are lost due to bad linkages [76]. In the
security arena, failure to match travelers to a “known terrorist” list
may result in those individuals entering the country, while over-
zealous matching could lead to large numbers of innocent citizens being
detained. In finance, incorrectly detecting a legitimate purchase as
a fraudulent one annoys the customer, but failing to identify a thief
will lead to credit card losses. Less dramatically, in the scientific
arena when studying patenting behavior, if it is decided that two in-
ventors are the same person, when in fact they are not, then records
will be incorrectly grouped together and one researcher’s productiv-
ity will be overstated. Conversely, if the records for one inventor are
believed to correspond to multiple individuals, then that inventor’s
productivity will be understated.
This chapter discusses current approaches to joining multiple
data sets together—commonly called record linkage. Other names
associated with record linkage are entity disambiguation, entity res-
olution, co-reference resolution, statistical matching, and data fu-
sion, meaning that records which are linked or co-referent can be
thought of as corresponding to the same underlying entity. The
number of names is reflective of a vast literature in social science,
statistics, computer science, and information sciences. We draw
heavily here on work by Winkler, Scheuren, and Christen, in par-
ticular [76, 77, 165]. To ground ideas, we use examples from a re-
cent paper examining the effects of different algorithms on studies
of patent productivity [387].
3.2 Introduction to record linkage
There are many reasons to link data sets. Linking to existing data
sources to solve a measurement need instead of implementing a new
survey results in cost savings (and almost certainly time savings as
well) and reduced burden on potential survey respondents. For
some research questions (e.g., a survey of the reasons for death of a
longitudinal cohort of individuals) a new survey may not be possible.
In the case of administrative data or other automatically generated
data, the sample size is much greater than would be possible from
a survey.
Record linkage can be used to compensate for data quality is-
sues. If a large number of observations for a particular field are
missing, it may be possible to link to another data source to fill
in the missing values. For example, survey respondents might not
want to share a sensitive datum like income. If the researcher has
access to an official administrative list with income data, then those
values can be used to supplement the survey [5].
Record linkage is often used to create new longitudinal data sets
by linking the same entities over time [190]. More generally, linking
separate data sources makes it possible to create a combined data
set that is richer in coverage and measurement than any of the
individual data sources [4].
Example: The Administrative Data Research Network
The UK’s Administrative Data Research Network (ADRN) is a major investment by
the United Kingdom to “improve our knowledge and understanding of the society
we live in . . . [and] provide a sound base for policymakers to decide how to tackle a
range of complex social, economic and environmental issues” by linking adminis-
trative data from a variety of sources, such as health agencies, court records, and
tax records in a confidential environment for approved researchers. The linkages
are done by trusted third-party providers. [103] (“Administrative data” typically
refers to data generated by the administration of a government program, as
distinct from deliberate survey collection.)
Linking is straightforward if each entity has a corresponding
unique identifier that appears in the data sets to be linked. For
example, two lists of US employees may both contain Social Secu-
rity numbers. When a unique identifier exists in the data or can be
created, no special techniques are necessary to join the data sets.
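When such a unique identifier exists, linkage really is just an ordinary join. A minimal sketch, using made-up records and a hypothetical "ssn" field as the key:

```python
# Join two lists of records on a shared unique identifier by building a
# dictionary index on the key, then merging matching records.
employees = [{"ssn": "123-45-6789", "name": "Jane Doe"},
             {"ssn": "987-65-4321", "name": "John Roe"}]
earnings = [{"ssn": "123-45-6789", "wage": 52000},
            {"ssn": "555-00-1111", "wage": 61000}]

by_ssn = {rec["ssn"]: rec for rec in earnings}  # index the second file
linked = [{**emp, **by_ssn[emp["ssn"]]}         # merge matching records
          for emp in employees if emp["ssn"] in by_ssn]
print(linked)
# [{'ssn': '123-45-6789', 'name': 'Jane Doe', 'wage': 52000}]
```

With a true unique key there is no uncertainty to model; the difficulties described next arise only when no such key is available.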
If there is no unique identifier available, then the task of identi-
fying unique entities is challenging. One instead relies on fields that
only partially identify the entity, like names, addresses, or dates of
birth. The problem is further complicated by poor data quality and
duplicate records, issues well attested in the record linkage litera-
ture [77] and sure to become more important in the context of big
data. Data quality issues include input errors (typos, misspellings,
truncation, extraneous letters, abbreviations, and missing values)
as well as differences in the way variables are coded between the
two data sets (age versus date of birth, for example). In addition
to record linkage algorithms, we will discuss different data prepro-
cessing steps that are necessary first steps for the best results in
record linkage.
To find all possible links between two data sets it would be neces-
sary to compare each record of the first data set with each record of
the second data set. The computational complexity of this approach
grows quadratically with the size of the data—an important consid-
eration, especially in the big data context. To compensate for this
complexity, the standard second step in record linkage, after pre-
processing, is indexing or blocking, which creates subsets of similar
records and reduces the total number of comparisons.
The outcome of the matching step is a set of predicted links—
record pairs that are likely to correspond to the same entity. After
these are produced, the final stage of the record linkage process is
to evaluate the result and estimate the resulting error rates. Unlike
other areas of application for predictive algorithms, ground truth
or gold standard data sets are rarely available. Sometimes the only
way to create a reliable truth data set is through an expensive
clerical review process that may not be viable for a given application.
Instead, error rates must be estimated.
An input data set may contribute to the linked data in a variety of
ways, such as increasing coverage, expanding understanding of the
measurement or mismeasurement of underlying latent variables, or
adding new variables to the combined data set. It is therefore im-
portant to develop a well-specified reason for linking the data sets,
and to specify a loss function to proxy the cost of false negative
matches versus false positive matches that can be used to guide
match decisions. It is also important to understand the coverage of
the different data sets being linked because differences in coverage
may result in bias in the linked data. For example, consider the
problem of linking Twitter data to a sample-based survey—elderly
adults and very young children are unlikely to use Twitter and so
the set of records in the linked data set will have a youth bias, even
if the original sample was representative of the population. It is also
essential to engage in critical thinking about what latent variables
are being captured by the measures in the different data sets—an
“occupational classification” in a survey data set may be very dif-
ferent from a “job title” in an administrative record or a “current
position” in LinkedIn data. (This topic is discussed in more detail
in Chapter 10.)
Example: Employment and earnings outcomes of
doctoral recipients
A recent paper in Science matched UMETRICS data on doctoral recipients to Cen-
sus data on earnings and employment outcomes. The authors note that some
20% of the doctoral recipients are not matched for several reasons: (i) the recip-
ient does not have a job in the US, either for family reasons or because he/she
goes back to his/her home country; (ii) he/she starts up a business rather than
choosing employment; or (iii) it is not possible to uniquely match him/her to a
Census Bureau record. They correctly note that there may be biases introduced in
case (iii), because Asian names are more likely duplicated and harder to uniquely
match [415]. Improving the linkage algorithm would increase the estimated
effects of investments in research and make the result more accurate.
Comparing the kinds of heterogeneous records associated with
big data is a new challenge for social scientists, who have tradition-
ally used a technique first developed in the 1960s to apply comput-
ers to the problem of medical record linkage. There is a reason why
this approach has survived: it has been highly successful in linking
survey data to administrative data, and efficient implementations
of this algorithm can be applied at the big data scale. However,
the approach is most effective when the two files being linked have
a number of fields in common. In the new landscape of big data,
there is a greater need to link files that have few fields in common
but whose noncommon fields provide additional predictive power to
determine which records should be linked. In some cases, when
sufficient training data can be produced, more modern machine
learning techniques may be applied.
The canonical record linkage workflow process is shown in Fig-
ure 3.1 for two data files, A and B. The goal is to identify all pairs of
records in the two data sets that correspond to the same underlying
individual.
Figure 3.1. The preprocessing pipeline (data file A and data file B are
compared; record pairs are classified as links or nonlinks)
One approach is to compare all data units from file A
with all units in file B and classify all of the comparison outcomes
to decide whether or not the records match. In a perfect statistical
world the comparison would end with a clear determination of links
and nonlinks.
Alas, a perfect world does not exist, and there is likely to be
noise in the variables that are common to both data sets and that
will be the main identifiers for the record linkage. Although the
original files A and B are the starting point, the identifiers must be
preprocessed before they can be compared. Determining identifiers
for the linkage and deciding on the associated cleaning steps are
extremely important, as they result in a necessary reduction of the
possible search space.
In the next section we begin our overview of the record linkage
process with a discussion of the main steps in data preprocess-
ing. This is followed by a section on approaches to record linkage
that includes rule-based, probabilistic, and machine learning algo-
rithms. Next we cover classification and evaluation of links, and we
conclude with a discussion of data privacy in record linkage.
3.3 Preprocessing data for record linkage
As noted in the introductory chapter, all data work involves prepro-
cessing, and data that need to be linked is no exception. Preprocess-
ing refers to a workflow that transforms messy, noisy, and unstruc-
tured data into a well-defined, clearly structured, and quality-tested
data set. Elsewhere in this book, we discuss general strategies for
data preprocessing; the quality of data and preprocessing issues
are discussed in more detail in Section 1.4. In this section, we
focus specifically on pre-
processing steps relating to the choice of input fields for the record
linkage algorithm. Preprocessing for any kind of a new data set is
a complex and time-consuming process because it is “hands-on”: it
requires judgment and cannot be effectively automated. It may be
tempting to minimize this demanding work under the assumption
that the record linkage algorithm will account for issues in the data,
but it is difficult to overstate the value of preprocessing for record
linkage quality. As Winkler notes: “In situations of reasonably
high-quality data, preprocessing can yield a greater improvement
in matching efficiency than string comparators and ‘optimized’ pa-
rameters. In some situations, 90% of the improvement in matching
efficiency may be due to preprocessing” [406].
The first step in record linkage is to develop link keys, which are
the record fields that will be used to predict link status. These can
include common identifiers like first and last name. Survey and ad-
ministrative data sets may include a number of clearly identifying
variables like address, birth date, and sex. Other data sets, like
transaction records or social media data, often will not include ad-
dress or birth date but may still include other identifying fields like
occupation, a list of interests, or connections on a social network.
Consider this chapter’s illustrative example of the US Patent and
Trademark Office (USPTO) data [387]:
USPTO maintains an online database of all patents is-
sued in the United States. In addition to identifying in-
formation about the patent, the database contains each
patent’s list of inventors and assignees, the companies,
organizations, individuals, or government agencies to
which the patent is assigned. . . . However, inventors and
assignees in the USPTO database are not given unique
identification numbers, making it difficult to track inven-
tors and assignees across their patents or link their in-
formation to other data sources.
There are some basic precepts that are useful when considering
identifying fields. The more different values a field can take, the less
likely it is that two randomly chosen individuals in the population
will agree on those values. Therefore, fields that exhibit a wider
range of values are more powerful as link keys: names are much
better link keys than sex or year of birth.
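This intuition can be quantified: the probability that two randomly chosen records agree on a field is the sum of the squared value frequencies, a quantity related to the u-probability used in probabilistic linkage. A sketch with made-up frequency tables:

```python
from collections import Counter

def random_agreement_prob(values):
    """P(two randomly drawn records share a value) = sum over values of p_i^2."""
    n = len(values)
    return sum((count / n) ** 2 for count in Counter(values).values())

sexes = ["F", "M", "F", "M", "F", "M", "F", "M"]          # two values
names = ["Ada", "Ben", "Cho", "Dev", "Eva", "Fay", "Gus", "Hal"]  # all distinct
print(random_agreement_prob(sexes))  # 0.5  -> weak link key
print(random_agreement_prob(names))  # 0.125 -> stronger link key
```

The lower the chance agreement, the more evidential weight an observed agreement on that field carries.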
Example: Link keys in practice
“A Harvard professor has re-identified the names of more than 40 percent of a
sample of anonymous participants in a high-profile DNA study, highlighting the
dangers that ever greater amounts of personal data available in the Internet era
could unravel personal secrets. . . . Of the 1,130 volunteers Sweeney and her
team reviewed, about 579 provided zip code, date of birth and gender, the three
key pieces of information she needs to identify anonymous people combined with
information from voter rolls or other public records. Of these, Sweeney succeeded
in naming 241, or 42 percent of the total. The Personal Genome Project confirmed
that 97 percent of the names matched those in its database if nicknames and first
name variations were included” [369].
Complex link keys like addresses can be broken down into com-
ponents so that the components can be compared independently of
one another. This way, errors due to data quality can be further
isolated. For example, assigning a single comparison value to the
complex fields “1600 Pennsylvania” and “160 Pennsylvania Ave” is
less informative than assigning separate comparison values to the
street number and street name portions of those fields. A record
linkage algorithm that uses the decomposed field can make more
nuanced distinctions by assigning different weights to errors in each
component.
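The decomposition idea can be sketched as follows; the regular expression here is purely illustrative, since real-world address parsing is far messier than splitting off a leading street number.

```python
import re

def split_address(addr):
    """Decompose an address into number and street-name components so
    each can be compared (and weighted) independently."""
    m = re.match(r"\s*(\d+)\s+(.*?)\s*$", addr)
    if m:
        return {"number": m.group(1), "street": m.group(2).lower()}
    return {"number": None, "street": addr.strip().lower()}

a = split_address("1600 Pennsylvania")
b = split_address("160 Pennsylvania Ave")
print(a["number"] == b["number"])  # False: the street numbers differ
print(a["street"], "|", b["street"])
```

A comparison on the decomposed fields can now report "street number disagrees, street name partially agrees" rather than a single opaque mismatch.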
Sometimes a data set can include different variants of a field,
like legal first name and nickname. In these cases match rates
can be improved by including all variants of the field in the record
comparison. For example, if only the first list includes both vari-
ants, and the second list has a single “first name” field that could
be either a legal first name or a nickname, then match rates can be
improved by comparing both variants and then keeping the better
of the two comparison outcomes. It is important to remember, how-
ever, that some record linkage algorithms expect field comparisons
to be somewhat independent. In our example, using the outcome
from both comparisons as separate inputs into the probabilistic
model we describe below may result in a higher rate of false nega-
tives. If a record has the same value in the legal name and nickname
fields, and if that value happens to agree with the first name field
in the second file, then the agreement is being double-counted. By
the same token, if a person in the first list has a nickname that
differs significantly from their legal first name, then a comparison
of that record to the corresponding record will unfairly penalize the
outcome because at least one of those name comparisons will show
a low level of agreement.
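The "keep the better of the two comparison outcomes" strategy can be sketched as below; `difflib.SequenceMatcher` from the standard library stands in here for whichever string comparator the linkage actually uses.

```python
from difflib import SequenceMatcher

def sim(a, b):
    """A generic string similarity in [0, 1]; a stand-in comparator."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_name_score(legal, nickname, other_first):
    """Compare the other file's single first-name field against both
    variants and keep the better outcome."""
    return max(sim(legal, other_first), sim(nickname, other_first))

print(best_name_score("Robert", "Bob", "Bob"))  # 1.0: the nickname matches exactly
```

Taking the maximum avoids penalizing a true match whose nickname differs sharply from the legal name, at the cost of the independence concerns discussed above.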
Preprocessing serves two purposes in record linkage: first, to
correct for the data quality issues described above; second, to
account for the different ways that the input files were generated,
which may result in the same underlying data being recorded on
different scales or according to different conventions.
Once preprocessing is finished, it is possible to start linking the
records in the different data sets. In the next section we describe a
technique to improve the efficiency of the matching step.
3.4 Indexing and blocking
There is a practical challenge to consider when comparing the records
in two files. If both files are roughly the same size, say 100 records
in the first and 100 records in the second file, then there are 10,000
possible comparisons, because the number of pairs is the product
of the number of records in each file. More generally, if the number
of records in each file is approximately n, then the total number of
possible record comparisons is approximately n². Assuming that
there are no duplicate records in the input files, the proportion of
record comparisons that correspond to a link is only 1/n. If we
naively proceed with all n² possible comparisons, the linkage algo-
rithm will spend the bulk of its time comparing records that are not
matches. Thus it is possible to speed up record linkage significantly
by skipping comparisons between record pairs that are not likely to
be linked.
Indexing refers to techniques that determine which of the possible
comparisons will be made in a record linkage application. The
most commonly used technique for indexing is blocking. In this approach you
construct a “blocking key” for each record by concatenating fields
or parts of fields. Two records with identical blocking keys are
said to be in the same block, and only records in the same block
are compared. This technique is effective because performing an
exact comparison of two blocking keys is a relatively quick operation
compared to a full record comparison, which may involve multiple
applications of a fuzzy string comparator.
Example: Blocking in practice
Given two lists of individuals, one might construct the blocking key by concate-
nating the first letter of the last name and the postal code and then “blocking”
on first character of last name and postal code. This reduces the total number of
comparisons by only comparing those individuals in the two files who live in the
same locality and whose last names begin with the same letter.
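The blocking scheme just described can be sketched in a few lines; the tiny record lists are made up for illustration.

```python
from collections import defaultdict

def blocking_key(rec):
    """First letter of last name concatenated with the postal code."""
    return rec["last"][0].upper() + rec["zip"]

file_a = [{"last": "Miller", "zip": "94305"},
          {"last": "Smith", "zip": "10001"}]
file_b = [{"last": "Millar", "zip": "94305"},
          {"last": "Smith", "zip": "60601"}]

# Index file B by blocking key, then compare only within shared blocks.
blocks = defaultdict(list)
for j, rec in enumerate(file_b):
    blocks[blocking_key(rec)].append(j)

candidates = [(i, j) for i, rec in enumerate(file_a)
              for j in blocks.get(blocking_key(rec), [])]
print(candidates)  # [(0, 0)] -- 1 candidate pair instead of 2 x 2 = 4
```

Even on this toy example, blocking cuts the comparisons from four to one; on files of realistic size the savings are what make linkage feasible at all.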
There are important considerations when choosing the blocking
key. First, the choice of blocking key creates a potential bias in
the linked data because true matches that do not share the same
blocking key will not be found. In the example, the blocking strategy
could fail to match records for individuals whose last name changed
or who moved. Second, because blocking keys are compared ex-
actly, there is an implicit assumption that the included fields will
not have typos or other data entry errors. In practice, however, the
blocking fields will exhibit typos. If those typos are not uniformly
distributed over the population, then there is again the possibility
of bias in the linked data set. (This topic is discussed in more
detail in Chapter 10.) One simple strategy for dealing with
imperfect blocking keys is to implement multiple rounds of blocking
and matching. After the first set of matches is produced, a new
blocking strategy is deployed to search for additional matches in the
remaining record pairs.
Blocking based on exact field agreements is common in practice,
but there are other approaches to indexing that attempt to be more
error tolerant. For example, one may use clustering algorithms to
identify sets of similar records. In this approach an index key, which
is analogous to the blocking key above, is generated for both data
sets and then the keys are combined into a single list. A distance
function must be chosen and pairwise distances computed for all
keys. The clustering algorithm is then applied to the combined list,
and only record pairs that are assigned to the same cluster are
compared. This is a theoretically appealing approach but it has the
drawback that the similarity metric has to be computed for all pairs
of records. Even so, computing the similarity measure for a pair of
blocking keys is likely to be cheaper than computing the full record
comparison, so there is still a gain in efficiency. Whang et al. [397]
provide a nice review of indexing approaches.
In addition to reducing the computational burden of record link-
age, indexing plays an important secondary role. Once implemented,
the fraction of comparisons made that correspond to true links will
be significantly higher. For some record linkage approaches that
use an algorithm to find optimal parameters—like the probabilis-
tic approach—having a larger ratio of matches to nonmatches will
produce a better result.
3.5 Matching
The purpose of a record linkage algorithm is to examine pairs of
records and make a prediction as to whether they correspond to the
same underlying entity. (There are some sophisticated algorithms
that examine sets of more than two records at a time [359], but
pairwise comparison remains the standard approach.) At the core
of every record linkage algorithm is a function that compares two
records and outputs a “score” that quantifies the similarity between
those records. Mathematically, the match score is a function of the
output from individual field comparisons: agreement in the first
name field, agreement in the last name field, etc. Field comparisons
may be binary—indicating agreement or disagreement—or they may
output a range of values indicating different levels of agreement.
There are a variety of methods in the statistical and computer sci-
ence literature that can be used to generate a match score, includ-
ing nearest-neighbor matching, regression-based matching, and
propensity score matching. The probabilistic approach to record
linkage defines the match score in terms of a likelihood ratio [118].
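The likelihood-ratio idea, in the style of Fellegi and Sunter, can be sketched as below: each field contributes log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees, where m is the probability of agreement among true matches and u among nonmatches. The m and u values here are invented for illustration, not estimated from data.

```python
from math import log2

# Illustrative (m, u) parameters per field: m = P(agree | match),
# u = P(agree | nonmatch). In practice these are estimated from the data.
params = {"last_name": (0.95, 0.01), "zip": (0.85, 0.05)}

def match_score(agreements):
    """Sum of per-field log-likelihood-ratio weights for a record pair."""
    score = 0.0
    for field, agrees in agreements.items():
        m, u = params[field]
        score += log2(m / u) if agrees else log2((1 - m) / (1 - u))
    return score

print(match_score({"last_name": True, "zip": True}))   # strong positive evidence
print(match_score({"last_name": False, "zip": True}))  # much weaker evidence
```

High total scores indicate likely links, low scores likely nonlinks, with an intermediate band typically sent to clerical review.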
Example: Matching in practice
Long strings, such as assignee and inventor names, are susceptible to typograph-
ical errors and name variations. For example, none of Sony Corporation, Sony
Corporatoin and Sony Corp. will match using simple exact matching. Similarly,
David vs. Dave would not match [387].
Comparing fields whose values are continuous is straightforward:
often one can simply take the absolute difference as the comparison
value. Comparing character fields in a rigorous way is more com-
plicated. For this purpose, different mathematical definitions of
the distance between two character fields have been defined. Edit
distance, for example, is defined as the minimum number of edit
operations—chosen from a set of allowed operations—needed to con-
vert one string to another. When the set of allowed edit operations is
single-character insertions, deletions, and substitutions, the corre-
sponding edit distance is also known as the Levenshtein distance.
When transposition of adjacent characters is allowed in addition
to those operations, the corresponding edit distance is called the
Levenshtein–Damerau distance.
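The dynamic-programming computation of edit distance is short enough to sketch directly. This is a standard textbook implementation of Levenshtein distance, not tied to any particular record linkage package:

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s into t."""
    # prev[j] holds the distance between the current prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

For example, levenshtein("Sony Corporation", "Sony Corporatoin") is 2, because the transposed "io" costs two substitutions here; under the Levenshtein–Damerau distance the same transposition would count as a single operation.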
Edit distance is appealing because of its intuitive definition, but
it is not the most efficient string distance to compute. Another
standard string distance known as Jaro–Winkler distance was de-
veloped with record linkage applications in mind and is faster to
compute. This is an important consideration because in a typical
record linkage application most of the algorithm run time will be
spent performing field comparisons. The definition of Jaro–Winkler
distance is less intuitive than edit distance, but it works as ex-
pected: words with more characters in common will have a higher
Jaro–Winkler value than those with fewer characters in common.
The output value is normalized to fall between 0 and 1. Because
of its history in record linkage applications, there are some stan-
dard variants of Jaro–Winkler distance that may be implemented in
record linkage software. Some variants boost the weight given to
agreement in the first few characters of the strings being compared.
Others decrease the score penalty for letter substitutions that arise
from common typos.
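Jaro–Winkler similarity can also be implemented in a few lines. The sketch below follows one common formulation (variants differ, as noted above); the 0.1 prefix scaling factor and the four-character prefix cap are the conventional choices:

```python
def jaro(s, t):
    """Jaro similarity: normalized to [0, 1], higher means more similar."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(len(s), len(t)) // 2 - 1
    s_match = [False] * len(s)
    t_match = [False] * len(t)
    matches = 0
    for i, c in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_match[j] and t[j] == c:
                s_match[i] = t_match[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # transpositions: matched characters that appear in a different order
    t_indices = [j for j, m in enumerate(t_match) if m]
    s_indices = [i for i, m in enumerate(s_match) if m]
    transpositions = sum(1 for i, j in zip(s_indices, t_indices)
                         if s[i] != t[j]) / 2
    m = matches
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3

def jaro_winkler(s, t, p=0.1):
    """Winkler's variant: boost the score for a shared prefix (up to 4 chars)."""
    j = jaro(s, t)
    prefix = 0
    for cs, ct in zip(s, t):
        if cs != ct or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

On the classic example pair, jaro_winkler("MARTHA", "MARHTA") is about 0.961: six matched characters, one transposition, and a three-character common prefix.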
Once the field comparisons are computed, they must be com-
bined to produce a final prediction of match status. In the following
sections we describe three types of record linkage algorithms: rule-
based, probabilistic, and machine learning.
3.5.1 Rule-based approaches
A natural starting place is for a data expert to create a set of ad hoc
rules that determine which pairs of records should be linked. In the
classical record linkage setting where the two files have a number
of identifying fields in common, this is not the optimal approach.
However, if there are few fields in common but each file contains
auxiliary fields that may inform a linkage decision, then an ad hoc
approach may be appropriate.
Example: Linking in practice
Consider the problem of linking two lists of individuals where both lists contain
first name, last name, and year of birth. Here is one possible linkage rule: link all
pairs of records such that
• the Jaro–Winkler comparison of first names is greater than 0.9,
• the Jaro–Winkler comparison of last names is greater than 0.9, and
• the first three digits of the year of birth are the same.
The result will depend on the rate of data errors in the year of birth field and typos
in the name fields.
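The rule in the example can be written down directly in code. The sketch below uses difflib.SequenceMatcher from the Python standard library as a stand-in for a Jaro–Winkler comparator; the records and 0.9 thresholds are the illustrative values from the example:

```python
from difflib import SequenceMatcher

def similar(a, b):
    # stand-in string comparator; a real application would use a
    # Jaro-Winkler function from a record linkage library
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def rule_link(rec_a, rec_b):
    """rec = (first_name, last_name, year_of_birth); link if all
    three conditions of the ad hoc rule hold."""
    return (similar(rec_a[0], rec_b[0]) > 0.9
            and similar(rec_a[1], rec_b[1]) > 0.9
            and str(rec_a[2])[:3] == str(rec_b[2])[:3])
```

For instance, ("Dave", "Smith", 1962) links to ("Dave", "Smith", 1968) because both names agree and 196 matches 196, but not to ("Dave", "Smith", 1975), where the decade differs.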
By auxiliary field we mean data fields that do not appear on both
data sets, but which may nonetheless provide information about
whether records should be linked. Consider a situation in which the
first list includes an occupation field and the second list includes
educational history. In that case one might create additional rules
to eliminate matches where the education was deemed to be an
unlikely fit for the occupation.
This method may be attractive if it produces a reasonable-looking
set of links from intuitive rules, but there are several pitfalls. As
the number of rules grows it becomes harder to understand the
ways that the different rules interact to produce the final set of
links. There is no notion of a threshold that can be increased or
decreased depending on the tolerance for false positive and false
negative errors. The rules themselves are not chosen to satisfy any
kind of optimality, unlike the probabilistic and machine learning
methods. Instead, they reflect the practitioner’s domain knowledge
about the data sets.
3.5.2 Probabilistic record linkage
In this section we describe the probabilistic approach to record
linkage, also known as the Fellegi–Sunter algorithm [118]. This
approach dominates in traditional record linkage applications and
remains an effective and efficient way to solve the record linkage
problem today.
In this section we give a somewhat formal definition of the statis-
tical model underlying the algorithm. By understanding this model,
one is better equipped to define link keys and record comparisons
in an optimal way.
Example: Usefulness of probabilistic record linkage
In practice, a researcher will typically want to combine two or more
data sets, possibly from different sources, containing records for the
same individuals or units. Unless the sources all contain the same unique
identifiers, linkage will likely require matching on standardized text strings. Even
standardized data are likely to contain small differences that preclude exact match-
ing as in the matching example above. The Census Bureau’s Longitudinal Busi-
ness Database (LBD) links establishment records from administrative and survey
sources. Exact numeric identifiers do most of the heavy lifting, but mergers, ac-
quisitions, and other actions can break these linkages. Probabilistic record linkage
on company names and/or addresses is used to fix these broken linkages that bias
statistics on business dynamics [190].
Let A and B be two lists of individuals whom we wish to link.
The product set A × B contains all possible pairs of records where
the first element of the pair comes from A and the second element
of the pair comes from B. A fraction of these pairs will be matches,
meaning that both records in the pair represent the same underlying
individual, but the vast majority of them will be nonmatches. In
other words, A × B is the disjoint union of the set of matches M
and the set of nonmatches U, a fact that we denote formally by
A × B = M ∪ U.
Let γ be a vector-valued function on A × B such that, for a ∈ A and
b ∈ B, γ(a, b) represents the outcome of a set of field comparisons
between a and b. For example, if both A and B contain data on
individuals’ first names, last names, and cities of residence, then γ
could be a vector of three binary values representing agreement in
first name, last name, and city. In that case γ(a, b) = (1, 1, 0) would
mean that the records a and b agree on first name and last name,
but disagree on city of residence.
For this model, the comparison outcomes in γ(a, b) are not re-
quired to be binary, but they do have to be categorical: each com-
ponent of γ(a, b) should take only finitely many values. This means
that a continuous comparison outcome—such as output from the
Jaro–Winkler string comparator—has to be converted to an ordinal
value representing levels of agreement. For example, one might cre-
ate a three-level comparison, using one level for exact agreement,
one level for approximate agreement defined as a Jaro–Winkler score
greater than 0.85, and one level for nonagreement corresponding to
a Jaro–Winkler score less than 0.85.
If a variable being used in the comparison has a significant num-
ber of missing values, it can help to create a comparison outcome
level to indicate missingness. Consider two data sets that both have
middle initial fields, and suppose that in one of the data sets the
middle initial is filled in only about half of the time. When compar-
ing records, the case where both middle initials are filled in but are
not the same should be treated differently from the case where one
of the middle initials is blank, because the first case provides more
evidence that the records do not correspond to the same person.
We handle this in the model by defining a three-level comparison
for the middle initial, with levels to indicate “equal,” “not equal,”
and “missing.”
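Both three-level comparisons just described are straightforward to construct. A sketch (the 0.85 cutoff is the illustrative value from the text):

```python
def string_level(sim):
    """Map a continuous string similarity in [0, 1] to an ordinal level."""
    if sim == 1.0:
        return 2   # exact agreement
    if sim > 0.85:
        return 1   # approximate agreement
    return 0       # nonagreement

def middle_initial_level(a, b):
    """Three-level comparison that treats a blank field as its own outcome,
    so a missing middle initial is not counted as a disagreement."""
    if not a or not b:
        return "missing"
    return "equal" if a == b else "not equal"
```

The components of γ(a, b) are then built by applying such functions field by field to the pair of records.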
Probabilistic record linkage works by weighing the probability of
seeing the result γ(a, b) if (a, b) belongs to the set of matches M
against the probability of seeing the result if (a, b) belongs to the
set of nonmatches U. Conditional on M or U, the distributions of the
individual comparisons defined by γ are assumed to be mutually
independent. The parameters that define the marginal distributions
of γ|M are called m-weights, and similarly the parameters that define
the marginal distributions of γ|U are called u-weights.
In order to apply the Fellegi–Sunter method, it is necessary to
choose values for these parameters, m-weights and u-weights. With
labeled data—a pair of lists for which the match status is known—
it is straightforward to solve for optimal values. Training data are
not usually available, however, and the typical approach is to use
expectation maximization to find optimal values.
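Under this model the match score for a pair is the log likelihood ratio of its comparison vector. For binary comparisons under the conditional independence assumption, it reduces to a sum of per-field agreement and disagreement weights. A sketch, where m[k] and u[k] are the probabilities that field k agrees conditional on match and nonmatch (the base-2 logarithm is a common convention):

```python
import math

def match_score(gamma, m, u):
    """Log2 likelihood ratio for a binary comparison vector gamma."""
    score = 0.0
    for g, mk, uk in zip(gamma, m, u):
        if g:
            score += math.log2(mk / uk)              # agreement weight
        else:
            score += math.log2((1 - mk) / (1 - uk))  # disagreement weight
    return score
```

For example, match_score([1, 1, 0], [0.95, 0.9, 0.8], [0.01, 0.05, 0.2]) rewards the two agreements (agreement is much likelier among matches) and penalizes the one disagreement.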
We have noted that the primary motivation for record linkage is to
create a linked data set for analysis that will have a richer set of fields
than either of the input data sets alone. A natural application is to
perform a linear regression using a combination of variables from
both files as predictors. With all record linkage approaches it is a
challenge to understand how errors from the linkage process will
manifest in the regression. Probabilistic record linkage has an ad-
vantage over rule-based and machine learning approaches in that
there are theoretical results concerning coefficient bias and errors
[221, 329]. More recently, Chipperfield and Chambers have devel-
oped an approach based on the bootstrap to account for record link-
age errors when making inferences for cross-tabulated variables [75].
3.5.3 Machine learning approaches to linking
Computer scientists have contributed extensively to a parallel litera-
ture focused on linking large data sets [76]. Their focus is on iden-
tifying potential links using approaches that are fast and scalable,
drawing on work in network algorithms and machine learning.
While simple blocking as described in Section 3.4 is standard in
Fellegi–Sunter applications, computer scientists are likely to use the
more sophisticated clustering approach to indexing. Indexing may
also use network information to include, for example, records for in-
dividuals that have a similar place in a social graph. When linking
lists of researchers, one might specify that comparisons should be
made between records that share the same address, have patents
in the same patent class, or have overlapping sets of coinventors.
These approaches are known as semantic blocking, and the com-
putational requirements are similar to standard blocking [76].
In recent years machine learning approaches have been applied
to record linkage following their success in other areas of prediction
and classification. (This topic is discussed in more detail in Chapter 6.)
Computer scientists couch the analytical problem as one of entity
resolution, even though the conceptual problem is identical. As
Wick et al. [400] note:
Entity resolution, the task of automatically determining
which mentions refer to the same real-world entity, is a
crucial aspect of knowledge base construction and man-
agement. However, performing entity resolution at large
scales is challenging because (1) the inference algorithms
must cope with unavoidable system scalability issues and
(2) the search space grows exponentially in the number
of mentions. Current conventional wisdom declares that
performing coreference at these scales requires decom-
posing the problem by first solving the simpler task of
entity-linking (matching a set of mentions to a known set
of KB entities), and then performing entity discovery as a
post-processing step (to identify new entities not present
in the KB). However, we argue that this traditional ap-
proach is harmful to both entity-linking and overall coref-
erence accuracy. Therefore, we embrace the challenge of
jointly modeling entity-linking and entity discovery as a
single entity resolution problem.
Figure 3.2 provides a useful comparison between classical record
linkage and learning-based approaches. In machine learning there
is a predictive model and an algorithm for “learning” the optimal
set of parameters to use in the predictive algorithm. The learning
algorithm relies on a training data set. In record linkage, this would
be a curated data set with true and false matches labeled as such.
See [387] for an example and a discussion of how a training data
set was created for the problem of disambiguating inventors in the
USPTO database. Once optimal parameters are computed from the
training data, the predictive model can be applied to unlabeled data
to find new links. The quality of the training data set is critical; the
model is only as good as the data it is trained on.
An example of a machine learning model that is popular for
record linkage is the random forest model [50] (see Chapter 6).
This is a classification model that fits a large number of classification trees to a
labeled training data set. Each individual tree is trained on a boot-
strap sample of all labeled cases using a random subset of predictor
variables. After creating the classification trees, new cases are la-
beled by giving each tree a vote and keeping the label that receives
the most votes. This highly randomized approach corrects for a
problem with simple classification trees, which is that they may
overfit to training data.
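The bootstrap-and-vote mechanism can be illustrated with a toy forest built from one-variable decision stumps (single-split trees). This is only a sketch of the idea on invented similarity scores; a real application would use a library implementation such as scikit-learn's RandomForestClassifier:

```python
import random

def train_stump(X, y, feature):
    # choose the threshold/polarity pair that best separates this sample
    best_acc, best = -1.0, None
    for t in sorted({row[feature] for row in X}):
        for pol in (1, -1):
            preds = [1 if pol * (row[feature] - t) >= 0 else 0 for row in X]
            acc = sum(p == label for p, label in zip(preds, y)) / len(y)
            if acc > best_acc:
                best_acc, best = acc, (feature, t, pol)
    return best

def train_forest(X, y, n_trees=50, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feature = rng.randrange(d)                   # random feature choice
        forest.append(train_stump([X[i] for i in idx],
                                  [y[i] for i in idx], feature))
    return forest

def predict(forest, row):
    # each stump gets a vote; the majority label wins
    votes = sum(1 if pol * (row[f] - t) >= 0 else 0 for f, t, pol in forest)
    return 1 if votes * 2 > len(forest) else 0

# rows are field-comparison scores for record pairs; 1 = match, 0 = nonmatch
X = [[0.95, 0.90], [0.99, 0.97], [0.92, 0.88],
     [0.20, 0.10], [0.30, 0.15], [0.10, 0.05]]
y = [1, 1, 1, 0, 0, 0]
forest = train_forest(X, y)
```

Each stump sees a different bootstrap sample and a different randomly chosen comparison field, so no single field or sample dominates the final vote, which is the property that makes the full random forest robust to overfitting and to correlated predictors.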
As shown in Figure 3.2, a major difference between probabilistic
and machine learning approaches is the need for labeled training
data to implement the latter approach. Usually training data are
created through a painstaking process of clerical review. After an
initial round of record linkage, a sample of record pairs that are
not clearly matches or nonmatches is given to a research assistant
who makes the final determination. In some cases it is possible to
create training data by automated means, for example when a subset
of the complete data contains strongly identifying fields. Suppose
that both of the candidate lists contain name and date of birth fields
and that in the first list the date of birth data are
[Figure 3.2 is a schematic comparing source-to-target matching workflows, parameterized by choices such as the similarity function, attribute selection, number of examples, selection scheme, threshold, learning algorithm, and matcher selection.]
Figure 3.2. Probabilistic (left) vs. machine learning (right) approaches to linking. Source: Köpcke et al. [213]
complete, but in the second list only about 10% of records contain
date of birth. For reasonably sized lists, name and date of birth
together will be a nearly unique identifier. It is then possible to
perform probabilistic record linkage on the subset of records with
date of birth and be confident that the error rates would be small.
If the subset of records with date of birth is representative of the
complete data set, then the output from the probabilistic record
linkage can be used as “truth” data.
Given a quality training data set, machine learning approaches
may have advantages over probabilistic record linkage. Consider
the random forest model. Random forests are more robust to corre-
lated predictor variables, because only a random subset of predic-
tors is included in any individual classification tree. The conditional
independence assumption, to which we alluded in our discussion
of the probabilistic model, can be dropped. An estimate of the gen-
eralization error can be computed in the form of “out-of-bag error.”
A measure of variable importance is computed that gives an idea of
how powerful a particular field comparison is in terms of correctly
predicting link status. Finally, unlike the Fellegi–Sunter model,
predictor variables can be continuous.
The combination of being robust to correlated variables and pro-
viding a variable importance measure makes random forests a use-
ful diagnostic tool for record linkage models. It is possible to refine
the record linkage model iteratively, by first including many pre-
dictor variables, including variants of the same comparison, and
then using the variable importance measure to narrow down the
predictors to a parsimonious set.
There are many published studies on the effectiveness of random
forests and other machine learning algorithms for record linkage.
Christen and Ahmed et al. provide some pointers [77, 108].
3.5.4 Disambiguating networks
The problem of disambiguating entities in a network is closely re-
lated to record linkage: in both cases the goal is to consolidate mul-
tiple records corresponding to the same entity. Rather than finding
the same entity in two data sets, however, the goal in network dis-
ambiguation is to consolidate duplicate records in a network data
set. By network we mean that the data set contains not only typical
record fields like names and addresses but also information about
how entities relate to one another: entities may be coauthors, coin-
ventors, or simply friends in a social network.
The record linkage techniques that we have described in this
chapter can be applied to disambiguate a network. To do so, one
must convert the network to a form that can be used as input into
a record linkage algorithm. For example, when disambiguating a
social network one might define a field comparison whose output
gives the fraction of friends in common between two records. Ven-
tura et al. demonstrated the relative effectiveness of the probabilis-
tic method and machine learning approaches to disambiguating a
database of inventors in the USPTO database [387]. Another ap-
proach is to apply clustering algorithms from the computer science
literature to identify groups of records that are likely to refer to the
same entity. Huang et al. [172] have developed a successful method
based on an efficient computation of distance between individuals
in the network. These distances are then fed into the DBSCAN
clustering algorithm to identify unique entities.
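The friends-in-common field comparison mentioned above can be sketched as a set overlap. Here it is normalized as a Jaccard fraction, which is one reasonable choice among several:

```python
def friends_in_common(friends_a, friends_b):
    """Fraction of the combined friend lists that two records share
    (Jaccard similarity of the two friend sets)."""
    a, b = set(friends_a), set(friends_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

The output falls in [0, 1] and can be fed into a record linkage algorithm as one more comparison value, alongside name and address comparisons.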
3.6 Classification
Once the match score for a pair of records has been computed us-
ing the probabilistic or random forest method, a decision has to be
made whether the pair should be linked. This requires classifying
the pair as either a “true” or a “false” match. In most cases, a third
classification is required—sending the pair for manual review and
classifying it after that review.
3.6.1 Thresholds
In the probabilistic and random forest approaches, both of which
output a “match score” value, a classification is made by establish-
ing a threshold T such that all record pairs with a match score greater
than T are declared to be links.
are defined, the match scores are not meaningful by themselves and
the threshold used for one linkage application may not be appro-
priate for another application. Instead, the classification threshold
must be established by reviewing the model output.
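This review workflow is easy to sketch: sort the scored pairs, scan from the top, and apply whatever threshold the review supports. The records and scores below are invented:

```python
def review_file(scored_pairs):
    """Sort (record_a, record_b, score) triples from highest to lowest
    score, the order in which a reviewer would scan them."""
    return sorted(scored_pairs, key=lambda triple: triple[2], reverse=True)

def classify(scored_pairs, threshold):
    """Declare links for every pair scoring above the chosen threshold."""
    return [(a, b) for a, b, s in scored_pairs if s > threshold]

scored = [("r1", "s9", 12.3), ("r2", "s4", -3.1), ("r3", "s7", 5.6)]
```

With threshold 0.0, classify(scored, 0.0) links the r1 and r3 pairs and rejects the negatively scored pair.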
Typically one creates an output file that includes pairs of records
that were compared along with the match score. The file is sorted
by match score and the reviewer begins to scan the file from the
highest match scores to the lowest. For the highest match scores the
record pairs will agree on all fields and there is usually no question
about the records being linked. However, as the scores decrease the
reviewer will see more record pairs whose match status is unclear
(or that are clearly nonmatches) mixed in with the clear matches.
There are a number of ways to proceed, depending on the resources
available and the goal of the project.
Rather than set a single threshold, the reviewer may set two