(Studies in Big Data 26) Srinivasan S. (ed.) Guide to Big Data Applications, Springer (2018)


Studies in Big Data 26
S.Srinivasan Editor
Guide to Big Data Applications
Studies in Big Data
Volume 26
Series Editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
e-mail: kacprzyk@ibspan.waw.pl
About this Series
The series “Studies in Big Data” (SBD) publishes new developments and advances
in the various areas of Big Data – quickly and with a high quality. The intent
is to cover the theory, research, development, and applications of Big Data, as
embedded in the fields of engineering, computer science, physics, economics and
life sciences. The books of the series refer to the analysis and understanding of
large, complex, and/or distributed data sets generated from recent digital sources
coming from sensors or other physical instruments as well as simulations,
crowdsourcing, social networks or other internet transactions, such as emails or
video click streams, and others. The series contains monographs, lecture notes and
edited volumes in Big Data spanning the areas of computational intelligence including
neural networks, evolutionary computation, soft computing, fuzzy systems, as well
as artificial intelligence, data mining, modern statistics and operations research, as
well as self-organizing systems. Of particular value to both the contributors and
the readership are the short publication timeframe and the world-wide distribution,
which enable both wide and rapid dissemination of research output.
More information about this series at http://www.springer.com/series/11970
S. Srinivasan
Guide to Big Data Applications
S. Srinivasan
Jesse H. Jones School of Business
Texas Southern University
Houston, TX, USA
ISSN 2197-6503 ISSN 2197-6511 (electronic)
Studies in Big Data
ISBN 978-3-319-53816-7 ISBN 978-3-319-53817-4 (eBook)
DOI 10.1007/978-3-319-53817-4
Library of Congress Control Number: 2017936371
© Springer International Publishing AG 2018
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my wife Lakshmi and grandson Sahaas
Foreword
It gives me great pleasure to write this Foreword for this timely publication on the
topic of the ever-growing list of Big Data applications. The potential for leveraging
existing data from multiple sources has been articulated over and over, in an
almost infinite landscape, yet it is important to remember that in doing so, domain
knowledge is key to success. Naïve attempts to process data are bound to lead to
errors such as accidentally regressing on noncausal variables. As Michael Jordan
at Berkeley has pointed out, in Big Data applications the number of combinations
of the features grows exponentially with the number of features, and so, for any
particular database, you are likely to find some combination of columns that will
predict perfectly any outcome, just by chance alone. It is therefore important that
we do not process data in a hypothesis-free manner or skip sanity checks on our
results.
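Jordan's warning is easy to demonstrate with a small simulation (mine, not from the text; all numbers here are illustrative). Draw a purely random binary outcome for a handful of samples, draw thousands of purely random binary features, and count how many features "predict" every outcome perfectly:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

n_samples, n_features = 10, 10_000

# A purely random binary outcome and purely random binary features:
# by construction, no feature has any real relationship to the outcome.
outcome = [random.randint(0, 1) for _ in range(n_samples)]
features = [[random.randint(0, 1) for _ in range(n_samples)]
            for _ in range(n_features)]

# Count the features that match every outcome value perfectly,
# either directly or inverted.
perfect = [j for j, col in enumerate(features)
           if col == outcome or [1 - v for v in col] == outcome]

print(f"{len(perfect)} of {n_features} random features predict "
      f"all {n_samples} outcomes perfectly")
```

Each random feature matches or anti-matches the outcome with probability 2 × (1/2)^10, roughly 0.002, so among 10,000 features about 20 will look like perfect predictors despite carrying no signal at all. This is exactly the hypothesis-free trap described above.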
In this collection titled “Guide to Big Data Applications,” the editor has
assembled a set of applications in science, medicine, and business where the authors
have attempted to do just this—apply Big Data techniques together with a deep
understanding of the source data. The applications covered give a flavor of the
benefits of Big Data in many disciplines. This book has 19 chapters broadly divided
into four parts. In Part I, there are four chapters that cover the basics of Big
Data, aspects of privacy, and how one could use Big Data in natural language
processing (a particular concern for privacy). Part II covers eight chapters that
look at various applications of Big Data in environmental science, oil and gas, and
civil infrastructure, covering topics such as deduplication, encrypted search, and the
friendship paradox.
Part III covers Big Data applications in medicine, covering topics ranging
from “The Impact of Big Data on the Physician,” written from a purely clinical
perspective, to the often discussed deep dives on electronic medical records.
Perhaps most exciting in terms of the future landscape is the application of Big
Data in healthcare from a developing-country perspective. This is one of the
most promising growth areas in healthcare, due to the current paucity of
services and the explosion of mobile phone usage. The tabula rasa that exists in
many countries holds the potential to leapfrog many of the mistakes we have made
in the west with stagnant silos of information, arbitrary barriers to entry, and the
lack of any standardized schema or nondegenerate ontologies.
In Part IV, the book covers Big Data applications in business, which is perhaps
the unifying subject here, given that none of the above application areas are likely
to succeed without a good business model. The potential to leverage Big Data
approaches in business is enormous, from banking practices to targeted advertising.
The need for innovation in this space is as important as the underlying technologies
themselves. As Clayton Christensen points out in The Innovator’s Prescription,
three revolutions are needed for a successful disruptive innovation:
1. A technology enabler which “routinizes” previously complicated tasks
2. A business model innovation which is affordable and convenient
3. A value network whereby companies with disruptive mutually reinforcing
economic models sustain each other in a strong ecosystem
We see this happening with Big Data almost every week, and the future is
promising.
In this book, the reader will encounter inspiration in each of the above topic
areas and be able to acquire insights into applications that provide the flavor of this
fast-growing and dynamic field.
Atlanta, GA, USA Gari Clifford
December 10, 2016
Preface
Big Data applications are growing very rapidly around the globe. This new approach
to decision making takes into account data gathered from multiple sources. Here my
goal is to show how these diverse sources of data are useful in arriving at actionable
information. In this collection of articles the publisher and I are trying to bring
in one place several diverse applications of Big Data. The goal is for users to see
how a Big Data application in another field could be replicated in their discipline.
With this in mind I have assembled in the “Guide to Big Data Applications” a
collection of 19 chapters written by academics and industry practitioners globally.
These chapters reflect what Big Data is, how privacy can be protected with Big
Data and some of the important applications of Big Data in science, medicine and
business. These applications are intended to be representative and not exhaustive.
For nearly two years I spoke with major researchers around the world and the
publisher. These discussions led to this project. The initial Call for Chapters was
sent to several hundred researchers globally via email. Approximately 40 proposals
were submitted. Out of these came commitments for completion in a timely manner
from 20 people. Most of these chapters are written by researchers while some are
written by industry practitioners. One of the submissions was not included as it
could not provide evidence of use of Big Data. This collection brings together in
one place several important applications of Big Data. All chapters were reviewed
using a double-blind process, and comments were provided to the authors. The
versions included here reflect the authors’ final revisions.
I have arranged the chapters in four parts. Part I includes four chapters that deal
with basic aspects of Big Data and how privacy is an integral component. In this
part I include an introductory chapter that lays the foundation for using Big Data
in a variety of applications. This is followed by a chapter, written by two leading
researchers in the field, on the importance of privacy in Big Data and how privacy
issues can be better addressed by incorporating privacy at the design stage itself.
Next, a team of researchers from a major research university in
the USA addresses the importance of federated Big Data, looking at the
use of distributed data in applications. This part is concluded with a chapter that
shows the importance of word embedding and natural language processing using
Big Data analysis.
In Part II, there are eight chapters on the applications of Big Data in science.
Science is an important area where data analysis can enhance decision making
about how to approach a problem. The applications selected here deal with
Environmental Science; High Performance Computing (HPC); the friendship
paradox, used to note which friend’s influence will be significant; encrypted
search over Big Data; deduplication in Big Data, especially when data is
collected from multiple sources; applications in Oil & Gas; and identifying
bridges that need to be replaced as part of meeting safety requirements. All
these application areas
selected for inclusion in this collection show the diversity of fields in which Big
Data is used today. The Environmental Science application shows how the data
published by the National Oceanic and Atmospheric Administration (NOAA) is
used to study the environment. Since such datasets are very large, specialized tools
are needed to benefit from them. In this chapter the authors show how Big Data
tools help in this effort. A team of industry practitioners discusses the
great similarity between the way HPC and Big Data systems handle low-latency,
massively parallel, and distributed computing, using tools
such as MapReduce, Hadoop and Spark. Quora is a leading provider of answers to
user queries and in this context one of their data scientists addresses how the
friendship paradox plays a significant part in Quora answers. This is a classic
illustration of a Big Data application using social media.
Big Data applications exist in many branches of science, and Big Data is very
heavily used in the Oil and Gas industry. Two chapters that address the Oil and Gas
application are written by two sets of people with extensive industry experience.
Two specific chapters are devoted to how Big Data is used in deduplication practices
involving multimedia data in the cloud and how privacy-aware searches are done
over encrypted data. Today, people are very concerned about the security of data
stored with an application provider. Encryption is the preferred tool to protect such
data and so having an efficient way to search such encrypted data is important.
This chapter’s contribution in this regard will be of great benefit for many users.
We conclude Part II with a chapter that shows how Big Data is used in assessing the
structural safety of the nation’s bridges. This practical application shows how Big Data
is used in many different ways.
Part III considers applications in medicine. A group of expert doctors from
leading medical institutions in the Bay Area discuss how Big Data is used in the
practice of medicine. This is one area where applications abound, and the
interested reader is encouraged to look at such applications. Another chapter looks
at how data scientists are important in analyzing medical data. This chapter reflects
a view from Asia and discusses the roadmap for data science use in medicine.
Smoking has been noted as one of the leading causes of human suffering. This part
includes a chapter on comorbidity aspects related to smokers based on a Big Data
analysis. The details presented in this chapter would help the reader to focus on other
possible applications of Big Data in medicine, especially cancer. Finally, a chapter is
included that shows how scientific analysis of Big Data helps with epileptic seizure
prediction and control.
Part IV of the book deals with applications in Business. This is an area where
Big Data use is expected to provide tangible results quickly to businesses. The
three applications listed under this part include an application in banking, an
application in marketing and an application in Quick Serve Restaurants. The
banking application is written by a group of researchers in Europe. Their analysis
shows that identifying financial fraud early is a global challenge and how Big
Data is used in this effort. The marketing application highlights the
various ways in which Big Data could be used in business. Many large business
sectors such as the airlines industry are using Big Data to set prices. The application
with respect to a Quick Serve Restaurant chain deals with the impact of Yelp ratings
and how they influence people’s use of Quick Serve Restaurants.
As mentioned at the outset, this collection of chapters on Big Data applications
is expected to serve as a sample for other applications in various fields. The readers
will find novel ways in which data from multiple sources is combined to derive
benefit for the general user. Also, in specific areas such as medicine, the use of Big
Data is having profound impact in opening up new areas for exploration based on
the availability of large volumes of data. These are all having practical applications
that help extend people’s lives. I earnestly hope that this collection of applications
will spur the interest of the reader to look at novel ways of using Big Data.
This book is a collective effort of many people. The contributors to this book
come from North America, Europe and Asia. This diversity shows that Big Data
is used truly globally to enhance decision-making
capabilities and to derive practical benefits. The book greatly benefited from the
careful review by many reviewers who provided detailed feedback in a timely
manner. I have carefully checked all chapters for consistency of information in
content and appearance. In spite of careful checking and taking advantage of the
tools provided by technology, it is possible that some errors have crept into
the chapter content. I take responsibility for any such errors and request
your help in bringing them to my attention so that they can be corrected in
future editions.
Houston, TX, USA S. Srinivasan
December 15, 2016
Acknowledgments
A project of this nature would be possible only with the collective efforts of many
people. Initially I proposed the project to Springer, New York, over two years ago.
Springer expressed interest in the proposal and one of their editors, Ms. Mary
James, contacted me to discuss the details. After extensive discussions with major
researchers around the world we finally settled on this approach. A global Call for
Chapters was made in January 2016 both by me and Springer, New York, through
their channels of communication. Ms. Mary James helped throughout the project
by providing answers to questions that arose. In this context, I want to mention
the support of Ms. Brinda Megasyamalan from the printing house of Springer.
Ms. Megasyamalan has been a constant source of information as the project
progressed. Ms. Subhashree Rajan from the publishing arm of Springer has been
extremely cooperative and patient in getting all the page proofs and incorporating
all the corrections. Ms. Mary James provided all the encouragement and support
throughout the project by responding to inquiries in a timely manner.
The reviewers played a very important role in maintaining the quality of this
publication by their thorough reviews. We followed a double-blind review process
whereby the reviewers were unaware of the identity of the authors and vice versa.
This helped in providing quality feedback. All the authors cooperated very well by
incorporating the reviewers’ suggestions and submitting their final chapters within
the time allotted for that purpose. I want to thank individually all the reviewers and
all the authors for their dedication and contribution to this collective effort.
I want to express my sincere thanks to Dr. Gari Clifford of Emory University and
Georgia Institute of Technology for providing the Foreword to this publication. In
spite of his many commitments, Dr. Clifford was able to find the time to go over all
the Abstracts and write the Foreword without delaying the project.
Finally, I want to express my sincere appreciation to my wife for accommodating
the many special needs that arise when working on a project of this nature.
Part I General
1 Strategic Applications of Big Data ........................................ 3
Joe Weinman
2 Start with Privacy by Design in All Big Data Applications ............ 29
Ann Cavoukian and Michelle Chibba
3 Privacy Preserving Federated Big Data Analysis ....................... 49
Wenrui Dai, Shuang Wang, Hongkai Xiong, and Xiaoqian Jiang
4 Word Embedding for Understanding Natural Language:
A Survey ..................................................................... 83
Yang Li and Tao Yang
Part II Applications in Science
5 Big Data Solutions to Interpreting Complex Systems in the
Environment ................................................................ 107
Hongmei Chi, Sharmini Pitter, Nan Li, and Haiyan Tian
6 High Performance Computing and Big Data ............................ 125
Rishi Divate, Sankalp Sah, and Manish Singh
7 Managing Uncertainty in Large-Scale Inversions for the Oil
and Gas Industry with Big Data .......................................... 149
Jiefu Chen, Yueqin Huang, Tommy L. Binford Jr., and Xuqing Wu
8 Big Data in Oil & Gas and Petrophysics ................................. 175
Mark Kerzner and Pierre Jean Daniel
9 Friendship Paradoxes on Quora .......................................... 205
Shankar Iyer
10 Deduplication Practices for Multimedia Data in the Cloud ........... 245
Fatema Rashid and Ali Miri
11 Privacy-Aware Search and Computation Over Encrypted Data
Stores......................................................................... 273
Hoi Ting Poon and Ali Miri
12 Civil Infrastructure Serviceability Evaluation Based on Big Data.... 295
Yu Liang, Dalei Wu, Dryver Huston, Guirong Liu, Yaohang Li,
Cuilan Gao, and Zhongguo John Ma
Part III Applications in Medicine
13 Nonlinear Dynamical Systems with Chaos and Big Data:
A Case Study of Epileptic Seizure Prediction and Control ............ 329
Ashfaque Shafique, Mohamed Sayeed, and Konstantinos Tsakalis
14 Big Data to Big Knowledge for Next Generation Medicine:
A Data Science Roadmap .................................................. 371
Tavpritesh Sethi
15 Time-Based Comorbidity in Patients Diagnosed with Tobacco
Use Disorder................................................................. 401
Pankush Kalgotra, Ramesh Sharda, Bhargav Molaka,
and Samsheel Kathuri
16 The Impact of Big Data on the Physician ................................ 415
Elizabeth Le, Sowmya Iyer, Teja Patil, Ron Li, Jonathan H. Chen,
Michael Wang, and Erica Sobel
Part IV Applications in Business
17 The Potential of Big Data in Banking .................................... 451
Rimvydas Skyrius, Gintarė, Igor Katin,
Michail Kazimianec, and Raimundas Žilinskas
18 Marketing Applications Using Big Data ................................. 487
S. Srinivasan
19 Does Yelp Matter? Analyzing (And Guide to Using) Ratings
for a Quick Serve Restaurant Chain ..................................... 503
Bogdan Gadidov and Jennifer Lewis Priestley
Author Biographies .............................................................. 523
Index............................................................................... 553
List of Reviewers
Maruthi Bhaskar
Jay Brandi
Jorge Brusa
Arnaub Chatterjee
Robert Evans
Aly Farag
Lila Ghemri
Ben Hu
Balaji Janamanchi
Mehmed Kantardzic
Mark Kerzner
Ashok Krishnamurthy
Angabin Matin
Hector Miranda
P. S. Raju
S. Srinivasan
Rakesh Verma
Daniel Vrinceanu
Haibo Wang
Xuqing Wu
Alec Yasinsac
Part I
Chapter 1
Strategic Applications of Big Data
Joe Weinman
1.1 Introduction
For many people, big data is virtually synonymous with one application—
marketing analytics—in one vertical—retail. For example, by collecting purchase
transaction data from shoppers based on loyalty cards or other unique identifiers
such as telephone numbers, account numbers, or email addresses, a company can
segment those customers better and identify promotions that will boost profitable
revenues, either through insights derived from the data, A/B testing, bundling, or
the like. Such insights can be extended almost without bound. For example, through
sophisticated analytics, Harrah’s determined that its most profitable customers
weren’t “gold cuff-linked, limousine-riding high rollers,” but rather teachers, doc-
tors, and even machinists (Loveman 2003). Not only did they come to understand
who their best customers were, but how they behaved and responded to promotions.
For example, their target customers were more interested in an offer of $60 worth of
chips than a total bundle worth much more than that, including a room and multiple
steak dinners in addition to chips.
While marketing such as this is a great application of big data and analytics, the
reality is that big data has numerous strategic business applications across every
industry vertical. Moreover, there are many sources of big data available from a
company’s day-to-day business activities as well as through open data initiatives,
such as data.gov in the U.S., a source with almost 200,000 datasets at the time of
this writing.
To apply big data to critical areas of the firm, there are four major generic
approaches that companies can use to deliver unparalleled customer value and
achieve strategic competitive advantage: better processes, better products and
services, better customer relationships, and better innovation.
J. Weinman
Independent Consultant, Flanders, NJ 07836, USA
e-mail: joeweinman@gmail.com
© Springer International Publishing AG 2018
S. Srinivasan (ed.), Guide to Big Data Applications, Studies in Big Data 26,
DOI 10.1007/978-3-319-53817-4_1
1.1.1 Better Processes
Big data can be used to optimize processes and asset utilization in real time, to
improve them in the long term, and to generate net new revenues by entering
new businesses or at least monetizing data generated by those processes. UPS
optimizes pickups and deliveries across its 55,000 routes by leveraging data ranging
from geospatial and navigation data to customer pickup constraints (Rosenbush
and Stevens 2015). Or consider 23andMe, which has sold genetic data it collects
from individuals. One such deal with Genentech focused on Parkinson’s disease
gained net new revenues of fifty million dollars, rivaling the revenues from its “core”
business (Lee 2015).
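As a toy illustration of the kind of routing problem described above (this sketch is mine; UPS's production optimization is vastly more sophisticated, and the coordinates below are invented), a greedy nearest-neighbor heuristic orders delivery stops by always driving to the closest remaining stop:

```python
from math import hypot

# Invented coordinates for a handful of delivery stops; real route
# optimization handles thousands of stops and many pickup constraints.
stops = {"depot": (0, 0), "A": (2, 3), "B": (5, 1), "C": (1, 6), "D": (4, 4)}

def route(start="depot"):
    """Order stops greedily: always drive to the closest remaining stop."""
    remaining = set(stops) - {start}
    order, current = [start], start
    while remaining:
        nearest = min(remaining,
                      key=lambda s: hypot(stops[s][0] - stops[current][0],
                                          stops[s][1] - stops[current][1]))
        order.append(nearest)
        remaining.remove(nearest)
        current = nearest
    return order

print(route())
```

The greedy heuristic is crude (it can produce routes far from optimal), but it captures the core idea: geospatial data plus an algorithm turns a pile of addresses into an actionable driving plan.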
1.1.2 Better Products and Services
Big data can be used to enrich the quality of customer solutions, moving them up
the experience economy curve from mere products or services to experiences or
transformations. For example, Nike used to sell sneakers, a product. However, by
collecting and aggregating activity data from customers, it can help transform them
into better athletes. By linking data from Nike products and apps with data from
ecosystem solution elements, such as weight scales and body-fat analyzers, Nike
can increase customer loyalty and tie activities to outcomes (Withings 2014).
1.1.3 Better Customer Relationships
Rather than merely viewing data as a crowbar with which to open customers’ wallets
a bit wider through targeted promotions, it can be used to develop deeper insights
into each customer, thus providing better service and customer experience in the
short term and products and services better tailored to customers as individuals
in the long term. Netflix collects data on customer activities, behaviors, contexts,
demographics, and intents to better tailor movie recommendations (Amatriain
2013). Better recommendations enhance customer satisfaction and value which
in turn makes these customers more likely to stay with Netflix in the long term,
reducing churn and customer acquisition costs, as well as enhancing referral (word-
of-mouth) marketing. Harrah’s determined that customers that were “very happy”
with their customer experience increased their spend by 24% annually; those that
were unhappy decreased their spend by 10% annually (Loveman 2003).
1.1.4 Better Innovation
Data can be used to accelerate the innovation process, and make it of higher quality,
all while lowering cost. Data sets can be published or otherwise incorporated as
part of an open contest or challenge, enabling ad hoc solvers to identify a best
solution meeting requirements. For example, GE Flight Quest incorporated data
on scheduled and actual flight departure and arrival times, for a contest intended
to devise algorithms to better predict arrival times, and another one intended to
improve them (Kaggle n.d.). As the nexus of innovation moves from man to
machine, data becomes the fuel on which machine innovation engines run.
These four business strategies are what I call digital disciplines (Weinman
2015), and represent an evolution of three customer-focused strategies called value
disciplines, originally devised by Michael Treacy and Fred Wiersema in their inter-
national bestseller The Discipline of Market Leaders (Treacy and Wiersema 1995).
1.2 From Value Disciplines to Digital Disciplines
The value disciplines originally identified by Treacy and Wiersema are operational
excellence, product leadership, and customer intimacy.
Operational excellence entails processes which generate customer value by being
lower cost or more convenient than those of competitors. For example, Michael
Dell, operating as a college student out of a dorm room, introduced an assemble-
to-order process for PCs by utilizing a direct channel which was originally the
phone or physical mail and then became the Internet and eCommerce. He was
able to drive the price down, make it easier to order, and provide a PC built to
customers’ specifications by creating a new assemble-to-order process that bypassed
indirect channel middlemen that stocked pre-built machines en masse, who offered
no customization but charged a markup nevertheless.
Product leadership involves creating leading-edge products (or services) that
deliver superior value to customers. We all know the companies that do this: Rolex
in watches, Four Seasons in lodging, Singapore Airlines or Emirates in air travel.
Treacy and Wiersema considered innovation as being virtually synonymous with
product leadership, under the theory that leading products must be differentiated in
some way, typically through some innovation in design, engineering, or technology.
Customer intimacy, according to Treacy and Wiersema, is focused on segmenting
markets, better understanding the unique needs of those niches, and tailoring solu-
tions to meet those needs. This applies to both consumer and business markets. For
example, a company that delivers packages might understand a major customer’s
needs intimately, and then tailor a solution involving stocking critical parts at
their distribution centers, reducing the time needed to get those products to their
customers. In the consumer world, customer intimacy is at work any time a tailor
adjusts a garment for a perfect fit, a bartender customizes a drink, or a doctor
diagnoses and treats a medical issue.
Traditionally, the thinking was that a company would do well to excel in a given
discipline, and that the disciplines were to a large extent mutually exclusive. For
example, a fast food restaurant might serve a limited menu to enhance operational
excellence. A product leadership strategy of having many different menu items, or a
customer intimacy strategy of customizing each and every meal, might conflict with
the operational excellence strategy. However, now, the economics of information—
storage prices are exponentially decreasing and data, once acquired, can be lever-
aged elsewhere—and the increasing flexibility of automation—such as robotics—
mean that companies can potentially pursue multiple strategies simultaneously.
Digital technologies such as big data enable new ways to think about the insights
originally derived by Treacy and Wiersema. Another way to think about it is that
digital technologies plus value disciplines equal digital disciplines: operational
excellence evolves to information excellence, product leadership of standalone
products and services becomes solution leadership of smart, digital products and
services connected to the cloud and ecosystems, customer intimacy expands to
collective intimacy, and traditional innovation becomes accelerated innovation. In
the digital disciplines framework, innovation becomes a separate discipline, because
innovation applies not only to products, but also processes, customer relationships,
and even the innovation process itself. Each of these new strategies can be enabled
by big data in profound ways.
1.2.1 Information Excellence
Operational excellence can be viewed as evolving to information excellence, where
digital information helps optimize physical operations including their processes and
resource utilization; where the world of digital information can seamlessly fuse
with that of physical operations; and where virtual worlds can replace physical ones.
Moreover, data can be extracted from processes to enable long term process
improvement, data collected by processes can be monetized, and new forms of
corporate structure based on loosely coupled partners can replace traditional,
monolithic, vertically integrated companies. As one example, location data from
cell phones can be aggregated and analyzed to determine commuter traffic patterns,
thereby helping to plan transportation network improvements.
1.2.2 Solution Leadership
Products and services can become sources of big data, or utilize big data to
function more effectively. Because individual products are typically limited in
storage capacity, and because there are benefits to data aggregation and cloud
processing, normally the data that is collected can be stored and processed in the
cloud. A good example might be the GE GEnx jet engine, which collects 5000
data points each second from each of 20 sensors. GE then uses the data to develop
better predictive maintenance algorithms, thus reducing unplanned downtime for
airlines (GE Aviation n.d.). Mere product leadership becomes solution leadership,
where standalone products become cloud-connected and data-intensive. Services
can also become solutions, because services are almost always delivered through
physical elements: food services through restaurants and ovens; airline services
through planes and baggage conveyors; healthcare services through x-ray machines
and pacemakers. The components of such services connect to each other and exter-
nally. For example, healthcare services can be better delivered through connected
pacemakers, and medical diagnostic data from multiple individual devices can be
aggregated to create a patient-centric view to improve health outcomes.
1.2.3 Collective Intimacy
Customer intimacy is no longer about dividing markets into segments, but rather
dividing markets into individuals, or even further into multiple personas that an
individual might have. Personalization and contextualization offer the ability to
deliver products and services tailored not just to a segment but to an individual.
Doing this effectively requires current as well as historical data, collected at the
level of the individual and his or her activities and characteristics, down to the
granularity of DNA sequences and mouse moves.
Collective intimacy is the notion that algorithms running on collective data from
millions of individuals can generate better tailored services for each individual.
This represents the evolution of intimacy from face-to-face, human-mediated
relationships to virtual, human-mediated relationships over social media, and from
there, onward to virtual, algorithmically mediated products and services.
1.2.4 Accelerated Innovation
Finally, innovation is not just associated with product leadership, but can create new
processes, as Walmart did with cross-docking or Uber with transportation, or new
customer relationships and collective intimacy, as Amazon.com uses data to better
upsell/cross-sell, and as Netflix innovated its Cinematch recommendation engine.
The latter was famously done through the Netflix Prize, a contest with a
million-dollar award for whoever could best improve Cinematch by at least 10% (Bennett
and Lanning 2007). Such accelerated innovation can be faster, cheaper, and better
than traditional means of innovation. Often, such approaches exploit technologies
such as the cloud and big data. The cloud is the mechanism for reaching multiple
potential solvers on an ad hoc basis, with published big data being the fuel for
problem solving. For example, Netflix published anonymized customer ratings of
movies, and General Electric published planned and actual flight arrival times.
8 J. Weinman
Today, machine learning and deep learning based on big data sets are a means
by which algorithms are innovating themselves. Google DeepMind’s AlphaGo Go-
playing system beat the human world champion at Go, Lee Sedol, partly based on
learning how to play by not only “studying” tens of thousands of human games, but
also by playing an increasingly tougher competitor: itself (Moyer 2016).
1.2.5 Value Disciplines to Digital Disciplines
The three classic value disciplines of operational excellence, product leadership and
customer intimacy become transformed in a world of big data and complementary
digital technologies to become information excellence, solution leadership, collective
intimacy, and accelerated innovation. These four generic strategies leverage big
data in the service of strategic competitive differentiation; they represent the
horizontal applications of big data.
1.3 Information Excellence
Most of human history has centered on the physical world: hunting and
gathering, fishing, agriculture, mining, and eventually manufacturing and physical
operations such as shipping, rail, and air transport. It’s not news that the
focus of human affairs is increasingly digital, but the many ways in which digital
information can complement, supplant, enable, optimize, or monetize physical
operations may be surprising. As more of the world becomes digital, the use of
information, which after all comes from data, becomes more important in the
spheres of business, government, and society (Fig. 1.1).
1.3.1 Real-Time Process and Resource Optimization
There are numerous business functions, such as legal, human resources, finance,
engineering, and sales, and a variety of ways in which different companies in a
variety of verticals such as automotive, healthcare, logistics, or pharmaceuticals
configure these functions into end-to-end processes. Examples of processes might
be “claims processing” or “order to delivery” or “hire to retire”. These in turn use a
variety of resources such as people, trucks, factories, equipment, and information.
Data can be used to optimize resource use as well as to optimize processes for
goals such as cycle time, cost, or quality.
Some good examples of the use of big data to optimize processes are inventory
management/sales forecasting, port operations, and package delivery logistics.
Fig. 1.1 High-level architecture for information excellence
Too much inventory is a bad thing, because there are costs to holding inventory:
the capital invested in the inventory, risk of disaster, such as a warehouse fire,
insurance, floor space, obsolescence, shrinkage (i.e., theft), and so forth. Too little
inventory is also bad, because not only may a sale be lost, but the prospect may
go elsewhere to acquire the good, realize that the competitor is a fine place to
shop, and never return. Big data can help with sales forecasting and thus setting
correct inventory levels. It can also help to develop insights, which may be subtle
or counterintuitive. For example, when Hurricane Frances was projected to strike
Florida, analytics helped stock stores, not only with “obvious” items such as
bottled water and flashlights, but non-obvious products such as strawberry Pop-Tarts
(Hayes 2004). This insight was based on mining store transaction data from prior
hurricanes.
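The inventory tradeoff described above, where both overstock and understock carry costs, is captured by the classic newsvendor model. A minimal sketch, assuming normally distributed demand and invented costs (none of these figures come from an actual retailer):

```python
from statistics import NormalDist

def optimal_stock(mean_demand, sd_demand, unit_cost, unit_price, salvage=0.0):
    # Cost of understocking one unit: the margin lost on a missed sale.
    cu = unit_price - unit_cost
    # Cost of overstocking one unit: cost not recovered at salvage value.
    co = unit_cost - salvage
    # Stock up to this quantile of forecast demand (the "critical fractile").
    fractile = cu / (cu + co)
    return NormalDist(mean_demand, sd_demand).inv_cdf(fractile)

# Forecast 500 units with a standard deviation of 100; cost $1, sell at $3.
qty = optimal_stock(500, 100, unit_cost=1.0, unit_price=3.0)
print(round(qty))  # stocks above the mean: a lost sale costs more than excess stock
```

Because the lost margin exceeds the overstock cost here, the model deliberately stocks more than the mean forecast; big data improves the quality of the mean and variance estimates that feed it.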
Consider a modern container port. There are multiple goals, such as minimizing
the time ships are in port to maximize their productivity, minimizing the time ships
or rail cars are idle, ensuring the right containers get to the correct destinations,
maximizing safety, and so on. In addition, there may be many types of structured
and unstructured data, such as shipping manifests, video surveillance feeds of roads
leading to and within the port, data on bridges, loading cranes, weather forecast data,
truck license plates, and so on. All of these data sources can be used to optimize port
operations in line with the multiple goals (Xvela 2016).
Or consider a logistics firm such as UPS. UPS has invested hundreds of millions
of dollars in ORION (On-Road Integrated Optimization and Navigation). It takes
data such as physical mapping data regarding roads, delivery objectives for each
package, customer data such as when customers are willing to accept deliveries,
and the like. For each of 55,000 routes, ORION determines the optimal sequence of
an average of 120 stops per route. The combinatorics are staggering: there are
roughly 10^200 possible sequences, making it impossible to calculate a perfectly
optimal route. Heuristics, however, can take all this data and determine a good
way to sequence stops and route delivery trucks, minimizing idling time, time
waiting to make left turns, fuel consumption and thus carbon footprint, while
maximizing driver labor productivity and truck asset utilization and balancing
customer satisfaction and on-time deliveries. Moreover, real-time data
such as geographic location, traffic congestion, weather, and fuel consumption, can
be exploited for further optimization (Rosenbush and Stevens 2015).
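ORION’s actual algorithms are proprietary, but the combinatorics and the case for heuristics can be illustrated with a toy greedy nearest-neighbor router (the coordinates are hypothetical, and real routing must also weigh left turns, time windows, and traffic):

```python
import math

def nearest_neighbor_route(depot, stops):
    """Greedy heuristic: repeatedly drive to the closest unvisited stop.
    Not optimal, but O(n^2) instead of enumerating n! sequences."""
    route, current, remaining = [depot], depot, list(stops)
    while remaining:
        nxt = min(remaining, key=lambda s: math.dist(current, s))
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    return route

# 120 stops admit 120! possible orderings -- far too many to enumerate.
print(f"{math.factorial(120):.2e}")   # ~6.69e+198, i.e., roughly 10^200
print(nearest_neighbor_route((0, 0), [(2, 0), (1, 0), (3, 0)]))
```

Production systems layer many refinements on top of such heuristics, but the core point stands: good-enough sequencing at scale beats an unattainable optimum.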
Such capabilities could also be used to not just minimize time or maximize
throughput, but also to maximize revenue. For example, a theme park could
determine the optimal location for a mobile ice cream or face painting stand,
based on prior customer purchases and exact location of customers within the
park. Customers’ locations and identities could be identified through dedicated long-
range radios, as Disney does with MagicBands; through smartphones, as Singtel’s
DataSpark unit does (see below); or through their use of related geographically
oriented services or apps, such as Uber or Foursquare.
1.3.2 Long-Term Process Improvement
In addition to such real-time or short-term process optimization, big data can also
be used to optimize processes and resources over the long term.
For example, DataSpark, a unit of Singtel (a Singaporean telecommunications
company), has been extracting data from cell phone locations to improve the
MRT (Singapore’s subway system) and the customer experience (Dataspark 2016).
For example, suppose that GPS data showed that many subway passengers were
traveling between two stops but that they had to travel through a third stop—a hub—
to get there. By building a direct line to bypass the intermediate stop, travelers could
get to their destination sooner, and congestion could be relieved at the intermediate
stop as well as on some of the trains leading to it. Moreover, this data could also be
used for real-time process optimization, by directing customers to avoid a congested
area or line suffering an outage through the use of an alternate route. Obviously a
variety of structured and unstructured data could be used to accomplish both short-
term and long-term improvements, such as GPS data, passenger mobile accounts
and ticket purchases, video feeds of train stations, train location data, and the like.
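The kind of origin-destination analysis described here amounts to aggregating anonymized trips into counts per station pair; the station names and volumes below are invented:

```python
from collections import Counter

def od_matrix(trips):
    """Aggregate anonymized trip records into origin-destination counts."""
    return Counter((origin, dest) for origin, dest in trips)

# Hypothetical tap-in/tap-out records: (boarding station, alighting station).
trips = ([("A", "C")] * 500      # heavy flow from A to C ...
         + [("A", "B")] * 120    # ... currently routed through hub B
         + [("B", "C")] * 90)
flows = od_matrix(trips)
busiest, count = flows.most_common(1)[0]
print(busiest, count)  # the A-to-C flow dominates: evidence for a direct line
```

If the dominant flow is known to pass through an intermediate hub, that count is exactly the evidence a planner needs to justify a bypass line.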
1.3.3 Digital-Physical Substitution and Fusion
The digital world and the physical world can be brought together in a number of
ways. One way is substitution, as when a virtual audio, video, and/or web conference
substitutes for physical airline travel, or when an online publication substitutes for
a physically printed copy. Another way to bring together the digital and physical
worlds is fusion, where both online and offline experiences become seamlessly
merged. An example is in omni-channel marketing, where a customer might browse
online, order online for pickup in store, and then return an item via the mail. Or, a
customer might browse in the store, only to find the correct size out of stock, and
order in store for home delivery. Managing data across the customer journey can
provide a single view of the customer to maximize sales and share of wallet for that
customer. This might include analytics around customer online browsing behavior,
such as what they searched for, which styles and colors caught their eye, or what
they put into their shopping cart. Within the store, patterns of behavior can also be
identified, such as whether people of a certain demographic or gender tend to turn
left or right upon entering the store.
1.3.4 Exhaust-Data Monetization
Processes which are instrumented and monitored can generate massive amounts
of data. This data can often be monetized or otherwise create benefits in creative
ways. For example, Uber’s main business is often referred to as “ride sharing,”
which really amounts to offering short-term ground transportation by matching
passengers who want rides with drivers who can provide them. However, in
an arrangement with the city of Boston, it will provide ride pickup and drop-off
locations, dates, and times. The city will use the data for traffic engineering, zoning,
and even determining the right number of parking spots needed (O’Brien 2015).
Such inferences can be surprisingly subtle. Consider the case of a revolving-door
firm that could predict retail trends and perhaps even recessions. Fewer shoppers
visiting retail stores means fewer shoppers entering via the revolving doors, which
means lower wear on the doors, and thus fewer maintenance calls: a dip in service
requests can foreshadow a dip in retail activity.
Another good example is 23andMe. 23andMe is a firm that was set up to leverage
new low cost gene sequence technologies. A 23andMe customer would take a saliva
sample and mail it to 23andMe, which would then sequence the DNA and inform
the customer about certain genetically based risks they might face, such as markers
signaling increased likelihood of breast cancer due to a variant in the BRCA1 gene.
They also would provide additional types of information based on this sequence,
such as clarifying genetic relationships among siblings or questions of paternity.
After compiling massive amounts of data, they were able to monetize the
collected data outside of their core business. In one $50 million deal, they sold data
from Parkinson’s patients to Genentech, with the objective of developing a cure
for Parkinson’s through deep analytics (Lee 2015). Note that not only is the deal
lucrative, especially since essentially no additional costs were incurred to sell the
data, but it is also highly ethical. Parkinson’s patients would like nothing better than for
Genentech—or anybody else, for that matter—to develop a cure.
1.3.5 Dynamic, Networked, Virtual Corporations
Processes don’t need to be restricted to the four walls of the corporation. For
example, supply chain optimization requires data from suppliers, channels, and
logistics companies. Many companies have focused on their core business and
outsourced or partnered with others to create and continuously improve supply
chains. For example, Apple sells products, but focuses on design and marketing,
not manufacturing. As many people know, Apple products are built by a partner,
Foxconn, with expertise in precision manufacturing electronic products.
One step beyond such partnerships or virtual corporations are dynamic, net-
worked virtual corporations. An example is Li & Fung. Apple sells products such
as iPhones and iPads, without owning any manufacturing facilities. Similarly, Li &
Fung sells products, namely clothing, without owning any manufacturing facilities.
However, unlike Apple, which relies largely on one main manufacturing partner, Li &
Fung relies on a network of over 10,000 suppliers. Moreover, the exact configuration
of those suppliers can change week by week or even day by day, even for the same
garment. A shirt, for example, might be sewn in Indonesia with buttons from
Thailand and fabric from South Korea. That same SKU, a few days later, might be
made in China with buttons from Japan and fabric from Vietnam. The constellation
of suppliers is continuously optimized, by utilizing data on supplier resource
availability and pricing, transportation costs, and so forth (Wind et al. 2009).
1.3.6 Beyond Business
Information excellence also applies to governmental and societal objectives. Earlier
we mentioned using big data to improve the Singapore subway operations and
customer experience; later we’ll mention how it’s being used to improve traffic
congestion in Rio de Janeiro. As an example of societal objectives, consider the
successful delivery of vaccines to remote areas. Vaccines can lose their efficacy
or even become unsafe unless they are refrigerated, but delivery to outlying areas
can mean a variety of transport mechanisms and intermediaries. For this reason,
it is important to ensure that they remain refrigerated across their “cold chain.”
A low-tech method could potentially warn of unsafe vaccines: for example, put a
container of milk in with the vaccines, and if the milk spoils it will smell bad and the
vaccines are probably bad as well. However, by collecting data wirelessly from the
refrigerators throughout the delivery process, not only can it be determined whether
the vaccines are good or bad, but improvements can be made to the delivery process
by identifying the root cause of the loss of refrigeration, for example, loss of power
at a particular port, and thus steps can be taken to mitigate the problem, such as the
deployment of backup power generators (Weinman 2016).
1.4 Solution Leadership
Products (and services) were traditionally standalone and manual, but now have
become connected and automated. Products and services now connect to the cloud
and from there on to ecosystems. The ecosystems can help collect data, analyze it,
provide data to the products or services, or all of the above (Fig. 1.2).
1.4.1 Digital-Physical Mirroring
In product engineering, an emerging approach is to build a data-driven engineering
model of a complex product. For example, GE mirrors its jet engines with “digital
twins” or “virtual machines” (unrelated to the computing concept of the same
name). The idea is that features, engineering design changes, and the like can be
made to the model much more easily and cheaply than building an actual working jet
engine. A new turbofan blade material with different weight, brittleness, and cross
section might be simulated to determine impacts on overall engine performance.
To do this requires product and materials data. Moreover, predictive analytics can
be run against massive amounts of data collected from operating engines (Warwick
Fig. 1.2 High-level architecture for solution leadership
1.4.2 Real-Time Product/Service Optimization
Recall that solutions are smart, digital, connected products that tie over networks to
the cloud and from there onward to unbounded ecosystems. As a result, the actual
tangible, physical product component functionality can potentially evolve over time
as the virtual, digital components adapt. As two examples, consider a browser that
provides “autocomplete” functions in its search bar, i.e., typing shortcuts based
on previous searches, thus saving time and effort. Or, consider a Tesla, whose
performance is improved by evaluating massive quantities of data from all the Teslas
on the road and their performance. As Tesla CEO Elon Musk says, “When one car
learns something, the whole fleet learns” (Coren 2016).
1.4.3 Product/Service Usage Optimization
Customers can use products and services more effectively when those products
collect data and provide feedback. The Ford Fusion’s EcoGuide SmartGauge provides
feedback to drivers on their fuel efficiency. Jackrabbit starts are bad; smooth driving
is good. The EcoGuide SmartGauge grows “green” leaves to provide drivers with
feedback, and is one of the innovations credited with dramatically boosting sales of
the car (Roy 2009).
GE Aviation’s Flight Efficiency Services uses data collected from numerous
flights to determine best practices that maximize fuel efficiency, ultimately improving
airlines’ carbon footprint and profitability. This is an enormous opportunity, because
it’s been estimated that one-fifth of fuel is wasted due to factors such as suboptimal
fuel usage and inefficient routing. For example, voluminous data and quantitative
analytics were used to develop a business case to gain approval from the Malaysian
Directorate of Civil Aviation for AirAsia to use single-engine taxiing. This conserves
fuel because only one engine is used to taxi, rather than all of the engines
running while the plane is still on the ground.
Perhaps one of the most interesting examples of using big data to optimize
products and services comes from a company called Opower, which was acquired
by Oracle. It acquires data on buildings, such as year built, square footage, and
usage, e.g., residence, hair salon, real estate office. It also collects data from smart
meters on actual electricity consumption. By combining all of this together, it
can message customers such as businesses and homeowners with specific, targeted
insights, such as that a particular hair salon’s electricity consumption is higher than
80% of hair salons of similar size in the area built to the same building code (and
thus equivalently insulated). Such “social proof” gamification has been shown to
be extremely effective in changing behavior compared to other techniques such as
rational quantitative financial comparisons (Weinman 2015).
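The “higher than 80% of similar salons” message is essentially a percentile rank within a peer group. A sketch with made-up consumption figures:

```python
def peer_percentile(usage_kwh, peer_usages):
    """Percentage of comparable buildings using less electricity than this one."""
    below = sum(1 for u in peer_usages if u < usage_kwh)
    return 100.0 * below / len(peer_usages)

# Hypothetical monthly kWh for hair salons of similar size and building code.
peers = [900, 950, 1000, 1020, 1100, 1150, 1200, 1300, 1400, 1500]
pct = peer_percentile(1450, peers)
print(f"This salon used more electricity than {pct:.0f}% of similar salons.")
```

The analytics are trivial; the value lies in assembling a credible peer group from building and smart-meter data, which is where the big data work actually happens.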
1.4.4 Predictive Analytics and Predictive Maintenance
Collecting data from things and analyzing it can enable predictive analytics and
predictive maintenance. For example, the GE GEnx jet engine has 20 or so sensors,
each of which collects 5000 data points per second in areas such as oil pressure,
fuel flow, and rotation speed. This data can then be used to build models, identify
anomalies, and predict when the engine will fail.
This in turn means that airline maintenance crews can “fix” an engine before it
fails. This maximizes what the airlines call “time on wing,” in other words, engine
availability. Moreover, engines can be proactively repaired at optimal times and
optimal locations, where maintenance equipment, crews, and spare parts are kept
(Weinman 2015).
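As a simplified illustration of how such sensor data might feed anomaly detection, consider flagging readings that stray far from an engine’s historical baseline. GE’s production models are of course far more sophisticated, and all numbers below are invented:

```python
from statistics import mean, stdev

def anomalous(baseline, reading, z_threshold=3.0):
    """Flag a reading more than z_threshold standard deviations
    from the sensor's historical baseline."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(reading - mu) > z_threshold * sigma

# Hypothetical oil-pressure history (psi) for one engine.
history = [52.1, 51.8, 52.4, 52.0, 51.9, 52.2, 52.3, 51.7, 52.0, 52.1]
print(anomalous(history, 52.3))  # ordinary fluctuation: False
print(anomalous(history, 47.5))  # sudden drop -- schedule maintenance: True
```

An anomaly flagged early is what lets the airline repair the engine at a convenient time and place, before an in-service failure, maximizing “time on wing.”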
1.4.5 Product-Service System Solutions
When formerly standalone products become connected to back-end services and
solve customer problems, they become product-service system solutions. Data can
be the glue that holds the solution together. A good example is Nike and the
Nike has a number of mechanisms for collecting activity tracking data, such as
the NikeCFuelBand, mobile apps, and partner products, such as NikeCKinect
which is a video “game” that coaches you through various workouts. These can
collect data on activities, such as running or bicycling or doing jumping jacks. Data
can be collected, such as the route taken on a run, and normalized into “NikeFuel”
points (Weinman 2015).
Other elements of the ecosystem can measure outcomes. For example, a variety
of scales can measure weight, but the Withings Smart Body Analyzer can also
measure body fat percentage, and link that data to NikeFuel points (Choquel 2014).
By linking devices measuring outcomes to devices monitoring activities—with
the linkages being data traversing networks—individuals can better achieve their
personal goals to become better athletes, lose a little weight, or get more toned.
1.4.6 Long-Term Product Improvement
Actual data on how products are used can ultimately be used for long-term product
improvement. For example, a cable company can collect data on the pattern of
button presses on its remote controls. A repeated pattern of clicking around the
“Guide” button fruitlessly and then finally ordering an “On Demand” movie might
lead to a clearer placement of a dedicated “On Demand” button on the control. Car
companies such as Tesla can collect data on actual usage, say, to determine how
many batteries to put in each vehicle based on the statistics of distances driven;
airlines can determine what types of meals to offer; and so on.
1.4.7 The Experience Economy
In the Experience Economy framework, developed by Joe Pine and Jim Gilmore,
there is a five-level hierarchy of increasing customer value and firm profitability.
At the lowest level are commodities, which may be farmed, fished, or mined, e.g.,
coffee beans. At the next level of value are products, e.g., packaged, roasted coffee
beans. Still one level higher are services, such as a corner coffee bar. One level
above this are experiences, such as a fine French restaurant, which offers coffee
on the menu as part of a “total product” that encompasses ambience, romance, and
professional chefs and services. But, while experiences may be ephemeral, at the
ultimate level of the hierarchy lie transformations, which are permanent, such as
a university education, learning a foreign language, or having life-saving surgery
(Pine and Gilmore 1999).
1.4.8 Experiences
Experiences can be had without data or technology. For example, consider a hike
up a mountain to its summit followed by taking in the scenery and the fresh air.
However, data can also contribute to experiences. For example, Disney MagicBands
are long-range radios that tie to the cloud. Data on theme park guests can be used to
create magical, personalized experiences. For example, guests can sit at a restaurant
without expressly checking in, and their custom order will be brought to their
table, based on tracking through the MagicBands and data maintained in the cloud
regarding the individuals and their orders (Kuang 2015).
1.4.9 Transformations
Data can also be used to enable transformations. For example, the Nike+ family
and ecosystem of solutions mentioned earlier can help individuals lose weight or
become better athletes. This can be done by capturing data from the individual on
steps taken, routes run, and other exercise activities undertaken, as well as results
data through connected scales and body fat monitors. As technology gets more
sophisticated, no doubt such automated solutions will do what any athletic coach
does, e.g., coaching on backswings, grip positions, stride lengths, pronation and the
like. This is how data can help enable transformations (Weinman 2015).
1.4.10 Customer-Centered Product and Service Data
When multiple products and services each collect data, they can provide a 360° view
of the patient. For example, patients are often scanned by radiological equipment
such as CT (computed tomography) scanners and X-ray machines. While individual
machines should be calibrated to deliver a safe dose, too many scans from too
many devices over too short a period can deliver doses over accepted limits, leading
potentially to dangers such as cancer. GE Dosewatch provides a single view of the
patient, integrating dose information from multiple medical devices from a variety
of manufacturers, not just GE (Combs 2014).
Similarly, financial companies are trying to develop a 360° view of their
customers’ financial health. Rather than the brokerage division being run separately
from the mortgage division, which is separate from the retail bank, integrating data
from all these divisions can help ensure that the customer is neither over-leveraged
nor underinvested.
1.4.11 Beyond Business
The use of connected refrigerators to help improve the cold chain was described
earlier in the context of information excellence for process improvement. Another
example of connected products and services comes from cities, such as Singapore,
that reduce carbon footprint through connected parking garages. The parking lots report
how many spaces they have available, so that a driver looking for parking need not
drive all around the city: clearly visible digital signs and a mobile app describe how
many—if any—spaces are available (Abdullah 2015).
This same general strategy can be used with even greater impact in the devel-
oping world. For example, in some areas, children walk an hour or more to a well
to fill a bucket with water for their families. However, the well may have gone dry.
Connected, “smart” pump handles can report their usage, and inferences can be
made as to the state of the well. For example, a few pumps of the handle and then
no usage, another few pumps and then no usage, etc., is likely to signify someone
visiting the well, attempting to get water, then abandoning the effort due to lack of
success (ITU and Cisco 2016).
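The dry-well inference described above can be sketched as a simple heuristic over pump-stroke bursts (the thresholds are illustrative assumptions, not from the cited deployment):

```python
def likely_dry(burst_lengths, short_burst=5, min_bursts=3):
    """A working well sees sustained pumping; a dry well sees repeated
    short bursts as each visitor tries a few strokes and gives up."""
    return (len(burst_lengths) >= min_bursts
            and max(burst_lengths) <= short_burst)

print(likely_dry([3, 2, 4, 3]))    # only abandoned attempts: likely dry -> True
print(likely_dry([3, 40, 35, 2]))  # sustained pumping, water flowing -> False
```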
1.5 Collective Intimacy
At one extreme, a customer “relationship” is a one-time, anonymous transaction.
Consider a couple celebrating their 30th wedding anniversary with a once-in-a-
lifetime trip to Paris. While exploring the left bank, they buy a baguette and some
Brie from a hole-in-the-wall bistro. They will never see the bistro again, nor vice
versa.
At the other extreme, there are companies and organizations that see customers
repeatedly. Amazon.com sees its customers’ patterns of purchases; Netflix sees its
customers’ patterns of viewing; Uber sees its customers’ patterns of pickups and
drop-offs. As other verticals become increasingly digital, they too will gain more
insight into customers as individuals, rather than anonymous masses. For example,
automobile insurers are increasingly pursuing “pay-as-you-drive,” or “usage-based”
insurance. Rather than customers’ premiums being merely based on aggregate,
coarse-grained information such as age, gender, and prior tickets, insurers can
charge premiums based on individual, real-time data such as driving over the speed
limit, weaving in between lanes, how congested the road is, and so forth.
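A usage-based premium might be computed, in highly simplified form, as a surcharge over a base rate driven by telematics events; the event weights and cap below are invented for illustration:

```python
def premium_multiplier(events, miles):
    """Hypothetical usage-based pricing: a surcharge per risky driving
    event, normalized per 100 miles driven and capped at +50%."""
    weights = {"speeding": 0.04, "hard_brake": 0.02, "lane_weave": 0.03}
    risk = sum(weights[e] for e in events) / max(miles / 100.0, 1.0)
    return 1.0 + min(risk, 0.5)

# One month of telematics events over 400 miles of driving.
events = ["speeding", "speeding", "hard_brake"]
print(premium_multiplier(events, miles=400))  # modest surcharge over the base rate
```

The contrast with traditional pricing is the input data: individual, behavioral, and near real-time, rather than coarse actuarial categories.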
Somewhere in between, there are firms that may not have any transaction history
with a given customer, but can use predictive analytics based on statistical insights
derived from large numbers of existing customers. Capital One, for example,
famously disrupted the existing market for credit cards by building models to create
“intimate” offers tailored to each prospect rather than a one-size fits all model
(Pham 2015).
Big data can also be used to analyze and model churn. Actions can be taken to
intercede before a customer defects, thus retaining that customer and his or her
lifetime value.
In short, big data can be used to identify target prospects, determine what to
offer them, maximize revenue and profitability, keep them, decide to let them defect
to a competitor, or win them back (Fig. 1.3).
Fig. 1.3 High-level architecture for collective intimacy
1.5.1 Target Segments, Features and Bundles
A traditional arena for big data and analytics has been better marketing to customers
and market basket analysis. For example, one type of analysis clusters customers,
non-buyers, and prospects into three groups: loyal customers, who will buy your
product no matter what; those who won’t buy no matter what; and those who can
be swayed. Marketing funds for advertising and promotions are best spent on the
last category, which will generate sales uplift.
A related type of analysis is market basket analysis, identifying those products
that might be bought together. Offering bundles of such products can increase profits
(Skiera and Olderog 2000). Even without bundles, better merchandising can goose
sales. For example, if new parents who buy diapers also buy beer, it makes sense to
put them in the same aisle. This may be extended to product features, where the
“bundle” isn’t a market basket but a set of features built into a product, say a
sport-suspension package and a V8 engine in a sporty sedan.
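Market basket analysis rests on metrics such as support, confidence, and lift; a minimal sketch over invented baskets:

```python
def rule_metrics(baskets, a, b):
    """Support, confidence, and lift for the association rule a -> b."""
    n = len(baskets)
    n_a = sum(1 for bk in baskets if a in bk)
    n_b = sum(1 for bk in baskets if b in bk)
    n_ab = sum(1 for bk in baskets if a in bk and b in bk)
    support = n_ab / n               # how often the pair appears at all
    confidence = n_ab / n_a          # P(buy b | bought a)
    lift = confidence / (n_b / n)    # > 1: bought together beyond chance
    return support, confidence, lift

baskets = [{"diapers", "beer"}, {"diapers", "beer", "wipes"},
           {"diapers", "formula"}, {"beer", "chips"}, {"bread", "milk"}]
s, c, lift = rule_metrics(baskets, "diapers", "beer")
print(f"support={s:.2f} confidence={c:.2f} lift={lift:.2f}")
```

A lift above 1 is the statistical justification for merchandising the two items together or bundling them.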
1.5.2 Upsell/Cross-Sell
Amazon.com uses a variety of approaches to maximize revenue per customer. Some,
such as Amazon Prime, which offers free two-day shipping, are low tech and based
on behavioral economics principles such as the “flat-rate bias” and how humans
frame expenses such as sunk costs (Lambrecht and Skiera 2006). But they are
perhaps best known for their sophisticated algorithms, which do everything from
automating pricing decisions, making millions of price changes every day (Falk
2013), to recommending additional products to buy through a variety of
algorithmically generated capabilities such as “People Also Bought These Items”.
Some are reasonably obvious: say, paper and ink suggestions if a copier is
bought, or a mounting plate if a large flat-screen TV is purchased. But many are
subtle, and based on deep analytics at scale of the billions of purchases that have
been made.
1.5.3 Recommendations
If Amazon.com is the poster child for upsell/cross-sell, Netflix is the one for a
pure recommendation engine. Because Netflix charges a flat rate for a household,
there is limited opportunity for upsell without changing the pricing model. Instead,
the primary opportunity is for customer retention, and perhaps secondarily, referral
marketing, i.e., recommendations from existing customers to their friends. The key
to that is maximizing the quality of the total customer experience. This has multiple
dimensions, such as whether DVDs arrive in a reasonable time or a streaming video
plays cleanly at high resolution, as opposed to pausing to rebuffer frequently. But
one very important dimension is the quality of the entertainment recommendations,
because 70% of what Netflix viewers watch comes about through recommendations.
If viewers like the recommendations, they will like Netflix, and if they don’t, they
will cancel service. So, reduced churn and maximal lifetime customer value are
highly dependent on this (Amatriain 2013).
Netflix uses extremely sophisticated algorithms against trillions of data points
to solve the recommendation problem as well as possible. For
example, they must balance out popularity with personalization. Most people like
popular movies; this is why they are popular. But every viewer is an individual,
hence will like different things. Netflix continuously evolves their recommendation
engine(s), which determine which options are presented when a user searches, what
is recommended based on what’s trending now, what is recommended based on
prior movies the user has watched, and so forth. This evolution spans a broad
set of mathematical and statistical methods and machine learning algorithms, such
as matrix factorization, restricted Boltzmann machines, latent Dirichlet allocation,
gradient boosted decision trees, and affinity propagation (Amatriain 2013). In addi-
tion, a variety of metrics—such as member retention and engagement time—and
experimentation techniques—such as offline experimentation and A/B testing—are
tuned for statistical validity and used to measure the success of the ensemble of
algorithms (Gomez-Uribe and Hunt 2015).
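As an illustration of one of the methods named above, matrix factorization approximates the sparse user-by-item rating matrix as a product of low-rank user and item factor matrices, then predicts unobserved ratings from those factors. The following is a minimal sketch with invented toy data and batch gradient descent, not Netflix’s production system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy user x movie rating matrix (0 = unobserved).
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)
mask = R > 0

k = 2  # number of latent factors
P = rng.normal(scale=0.1, size=(R.shape[0], k))  # user factors
Q = rng.normal(scale=0.1, size=(R.shape[1], k))  # item factors

lr, reg = 0.01, 0.02
for _ in range(5000):  # gradient descent on observed entries only
    E = (R - P @ Q.T) * mask       # error restricted to observed ratings
    P += lr * (E @ Q - reg * P)
    Q += lr * (E.T @ P - reg * Q)

pred = P @ Q.T
# Recommend the unseen movie with the highest predicted rating.
user = 0
unseen = np.where(~mask[user])[0]
best = unseen[np.argmax(pred[user, unseen])]
print("recommend movie", best)
```

The production problem differs in scale (hundreds of millions of users), in the ensemble of models involved, and in using implicit viewing signals rather than explicit stars, but the factorization idea is the same.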
1.5.4 Sentiment Analysis
A particularly active current area in big data is the use of sophisticated algorithms to
determine an individual’s sentiment (Yegulalp 2015). For example, textual analysis
of tweets or posts can determine how a customer feels about a particular product.
Emerging techniques include emotional analysis of spoken utterances and even
sentiment analysis based on facial imaging.
Some enterprising companies are using sophisticated algorithms to conduct such
sentiment analysis at scale, in near real time, to buy or sell stocks based on how
sentiment is turning as well as additional analytics (Lin 2016).
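At its simplest, textual sentiment analysis can be sketched as lexicon-based scoring: count positive and negative words and normalize. The word lists below are invented for illustration; production systems use large trained models rather than hand-built lexicons.

```python
# A toy lexicon; real systems learn these weights from labeled data.
POSITIVE = {"love", "great", "awesome", "fast"}
NEGATIVE = {"hate", "broken", "slow", "awful"}

def sentiment(text: str) -> float:
    """Score in [-1, 1]: net fraction of sentiment-bearing words that are positive."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.0  # no sentiment-bearing words found
    return (pos - neg) / (pos + neg)

print(sentiment("Love this phone, awesome camera!"))        # 1.0
print(sentiment("Battery is awful and the app is broken"))  # -1.0
```

Running such scoring over a firehose of tweets, in near real time, is what turns this toy into the trading signal described above.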
1.5.5 Beyond Business
Such an approach is relevant beyond the realm of corporate affairs. For example,
a government could utilize a collective intimacy strategy in interacting with its
citizens, in recommending the best combination of public transportation based on
personal destination objectives, or the best combination of social security benefits,
based on a personal financial destination. Dubai, for example, has released a mobile
app called Dubai Now that will act as a single portal to thousands of government
services, including, for example, personalized, contextualized GPS-based real-time
traffic routing (Al Serkal 2015).
1 Strategic Applications of Big Data
1.6 Accelerated Innovation
Innovation has evolved through multiple stages, from the solitary inventor, such
as the early human who invented the flint hand knife, through shop invention,
a combination of research lab and experimental manufacturing facility, to the
corporate research labs (Weinman 2015). However, even the best research labs can
only hire so many people; ideas can come from anywhere.
The theory of open innovation proposes loosening the firm boundaries to partners
who may have ideas or technologies that can be brought into the firm, and to
distribution partners who may be able to make and sell ideas developed from within
the firm. Open innovation suggests creating relationships that help both of these
approaches succeed (Chesbrough 2003).
However, even preselecting relationships can be overly constricting. A still more
recent approach to innovation lets these relationships be ad hoc and dynamic. I call
it accelerated innovation, but it can be not only faster, but also better and cheaper.
One way to do this is by holding contests or posting challenges, which theoretically
anyone in the world could solve. Related approaches include innovation networks
and idea markets. Increasingly, machines will be responsible for innovation, and we
are already seeing this in systems such as IBM’s Chef Watson, which ingested a
huge database of recipes and now can create its own innovative dishes, and Google
DeepMind’s AlphaGo, which is innovating game play in one of the oldest games in
the world, Go (Fig. 1.4).
Fig. 1.4 High-level architecture for accelerated innovation
1.6.1 Contests and Challenges
Netflix depends heavily on the quality of its recommendations to maximize
customer satisfaction, thus customer retention, and thereby total customer lifetime
value and profitability. The original Netflix Cinematch recommendation system
let Netflix customers rate movies on a scale of one star to five stars. If Netflix
recommended a movie that the customer then rated a one, there was an enormous
discrepancy between Netflix’s recommendations and the user’s delight. With a
perfect algorithm, customers would always rate a Netflix-recommended movie a five.
Netflix launched the Netflix Prize in 2006. It was open to anyone who wished
to compete, and multiple teams did so, from around the world. Netflix made 100
million anonymized movie ratings available. These came from almost half a million
subscribers, across almost 20,000 movie titles. It also withheld 3 million ratings to
evaluate submitted contestant algorithms (Bennett and Lanning 2007). Eventually,
the prize was awarded to a team that did in fact meet the prize objective: a 10%
improvement in the Cinematch algorithm. Since that time, Netflix has continued
to evolve its recommendation algorithms, adapting them to the now prevalent
streaming environment which provides billions of additional data points. While
user-submitted DVD ratings may be reasonably accurate, actual viewing behaviors
and contexts are substantially more accurate and thus better predictors. As one
example, while many viewers say they appreciate foreign documentaries, actual
viewing behavior shows that crude comedies are much more likely to be watched.
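The contest itself hinged on a simple, objective metric: root-mean-square error (RMSE) on the withheld ratings, with the prize awarded for a 10% relative improvement over Cinematch. A sketch of that evaluation, with invented numbers standing in for the real baseline and submissions:

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error over a withheld ratings set."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

# Hypothetical withheld ratings and two algorithms' predictions.
held_out   = [5, 3, 1, 4, 2]
baseline   = [4.0, 3.5, 2.0, 3.0, 2.5]  # stand-in for Cinematch
challenger = [4.6, 3.2, 1.4, 3.8, 2.1]  # stand-in for a contestant

base, chal = rmse(held_out, baseline), rmse(held_out, challenger)
improvement = (base - chal) / base * 100
print(f"{improvement:.1f}% improvement over baseline")
```

Withholding the evaluation set, as Netflix did, is essential: contestants never see the ratings they are scored on, so they cannot simply memorize the answers.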
GE Flight Quest is another example of a big data challenge. For Flight Quest 1,
GE published data on planned and actual flight departures and arrivals, as well
as weather conditions, with the objective of better predicting flight times. Flight
Quest II then attempted to improve flight arrival times through better scheduling
and routing. The key point of both the Netflix Prize and GE’s Quests is that large
data sets were the cornerstone of the innovation process. Methods used by Netflix
were highlighted in Sect. 1.5.3. The methods used by the GE Flight Quest winners
span gradient boosting, random forest models, ridge regressions, and dynamic
programming (GE Quest n.d.).
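As an illustration of one of those methods, ridge regression adds an L2 penalty to ordinary least squares, stabilizing coefficient estimates when features are correlated. The sketch below uses synthetic flight data invented for illustration (these are not GE’s actual features or models) and fits the closed-form ridge solution to predict flight duration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic features per flight: scheduled duration (min),
# headwind (kt), departure delay (min); target: actual duration (min).
n = 200
X = np.column_stack([
    rng.uniform(60, 300, n),   # scheduled duration
    rng.uniform(-40, 40, n),   # headwind
    rng.uniform(0, 45, n),     # departure delay
])
y = 1.05 * X[:, 0] + 0.4 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(0, 5, n)

# Ridge regression, closed form: w = (X'X + lambda*I)^-1 X'y
lam = 1.0
A = X.T @ X + lam * np.eye(X.shape[1])
w = np.linalg.solve(A, X.T @ y)

flight = np.array([180.0, 15.0, 10.0])  # 3-h flight, 15-kt headwind, 10 min late
print("predicted duration (min):", round(float(flight @ w)))
```

The winning entries combined several such models; the point here is only that the published flight and weather data made this kind of supervised learning possible for anyone who entered.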
1.6.2 Contest Economics
Running such a contest exhibits what I call “contest economics.” For example,
rather than paying for effort, as a firm would when paying salaries to its R&D
team, it can now pay only for results. The results may be qualitative, for example,
“best new product idea,” or quantitative, for example, a percentage improvement
in the Cinematch algorithm. Moreover, the “best” idea may be selected, or the best
one surpassing a particular given threshold or judges’ decision. This means that
hundreds or tens of thousands of “solvers” or contestants may be working on your
problem, but you only need to pay out in the event of a sufficiently good solution.
Moreover, because the world’s best experts in a particular discipline may be
working on your problem, the quality of the solution may be higher than if
conducted internally, and the time to reach a solution may be faster than internal
R&D could do it, especially in the fortuitous situation where just the right expert is
matched with just the right problem.
1.6.3 Machine Innovation
Technology is evolving to be not just an enabler of innovation, but the source of
innovation itself. For example, a program called AlphaGo developed by DeepMind,
which has been acquired by Google, bested the European champion, Fan Hui, and
then the world champion, Lee Sedol. Rather than mere brute force examination of
many moves in the game tree together with a board position evaluation metric, it
used a deep learning approach coupled with some game knowledge encoded by its
developers (Moyer 2016).
Perhaps the most interesting development, however, was in Game 2 of the
tournament between AlphaGo and Sedol. Move 37 was so unusual that the human
commentators thought it was a mistake—a bug in the program. Sedol stood up and
left the game table for 15 min to regain his composure. It was several moves later
that the rationale and impact of Move 37 became clear, and AlphaGo ended up
winning that game, and the tournament. Move 37 was “beautiful” (Metz 2016) in
retrospect, the way that the heliocentric theory of the solar system, the Theory of
Relativity, or the concept of quasicrystals now are. To put it another way, a machine
innovated in a way that thousands of years and millions of players had not.
1.6.4 Beyond Business
Of course, such innovation is not restricted to board games. Melvin is a program
that designs experiments in quantum physics, which are notoriously counterintuitive
to design. It takes standard components such as lasers and beam
splitters, and determines new ways to combine them to test various quantum
mechanics hypotheses. It has already been successful in creating such experiments.
In another example of the use of big data for innovation, automated hypothesis
generation software was used to scan almost two hundred thousand scientific paper
abstracts in biochemistry to determine the most promising kinases (a type of
protein) that activate another specific protein, p53, which slows cancer growth.
All but two of the top prospects identified by the software proved to have the desired
effect (The Economist 2014).
1.7 Integrated Disciplines
A traditional precept of business strategy is the idea of focus. As firms select
a focused product area, market segment, or geography, say, they also make a
conscious decision on what to avoid or say “no” to. A famous story concerns
Southwest, an airline known for its no-frills, low-cost service. Its CEO, Herb
Kelleher, explained that every strategic decision could be viewed in the light of
whether it helped achieve that focus. For example, the idea of serving a tasty
chicken Caesar salad on its flights could be instantly nixed, because it wouldn’t be
aligned with low cost (Heath and Heath 2007).
McDonald’s famously ran into trouble by attempting to pursue operational
excellence, product leadership, and customer intimacy at the same time, and these
were in conflict. After all, having the tastiest burgers—product leadership—
would mean foregoing mass pre-processing in factories that created frozen
patties—operational excellence. Having numerous products, combinations and
customizations such as double patty, extra mayo, no onions—customer intimacy—
would take extra time and conflict with a speedy drive-through line—operational
excellence (Weinman 2015).
However, the economics of information and information technology mean that
a company can well orient itself to more than one discipline. The robots that
run Amazon.com’s logistics centers, for example, can use routing and warehouse
optimization programs—operational excellence—that are designed once, and don’t
necessarily conflict with the algorithms that make product recommendations based
on prior purchases and big data analytics across millions or billions of transactions.
The efficient delivery of unicast viewing streams to Netflix streaming
subscribers—operational excellence—doesn’t conflict with the entertainment
suggestions derived by the Netflix recommender—collective intimacy—nor does
it conflict with the creation of original Netflix content—product leadership—nor
does it impact Netflix’s ability to run open contests and challenges such as the
Netflix Prize or the Netflix Cloud OSS (Open Source Software) Prize—accelerated
innovation.
In fact, not only do the disciplines not conflict, but, in such cases, data captured
or derived in one discipline can be used to support the needs of another in the
same company. For example, Netflix famously used data on customer behaviors,
such as rewind or re-watch, contexts, such as mobile device or family TV, and
demographics, such as age and gender, that were part of its collective intimacy
strategy, to inform decisions made about investing in and producing House of Cards,
a highly popular, Emmy-Award-winning show, that supports product leadership.
The data need not even be restricted to a single company. Uber, the “ride-
sharing” company, entered into an agreement with Starwood, the hotel company
(Hirson 2015). A given Uber customer might be dropped off at a competitor’s hotel,
offering Starwood the tantalizing possibility of emailing that customer a coupon for
20% off their next stay at a Sheraton or Westin, say, possibly converting a lifelong
competitor customer into a lifelong Starwood customer. The promotion could be
extremely targeted, along the lines of, say, “Mr. Smith, you’ve now stayed at our
competitor’s hotel at least three times. But did you know that the Westin Times
Square is rated 1 star higher than the competitor? Moreover, it’s only half as far
away from your favorite restaurant as the competitor, and has a health club included
in the nightly fee, which has been rated higher than the health club you go to at the
competitor hotel.”
Such uses are not restricted to analytics for marketing promotions. For example,
Waze and the City of Rio de Janeiro have announced a collaboration. Waze is a
mobile application that provides drivers information such as driving instructions,
based not only on maps but also real-time congestion data derived from all other
Waze users, a great example of crowdsourcing with customer intimacy. In a
bidirectional arrangement with Rio de Janeiro, Waze will improve its real-time routing by
utilizing not only the data produced by Waze users, but additional data feeds offered
by the City. Data will flow in the other direction, as well, as Rio uses data collected
by Waze to plan new roads or to better time traffic signals (Ungerleider 2015).
1.8 Conclusion
A combination of technologies such as the cloud, big data and analytics, machine
learning, social, mobile, and the Internet of Things is transforming the world
around us. Information technologies, of course, have information at their nexus.
Consequently data, and the capabilities to extract information and insight from it
and to make decisions and take action on that insight, are key to applying
information technology strategically: to increase the competitiveness of our firms,
to enhance the value created for our customers, and to excel beyond these domains
in the areas of government and society.
Four generic strategies—information excellence, solution leadership, collective
intimacy, and accelerated innovation—can be used independently or in combination
to utilize big data and related technologies to differentiate and create customer
value—for better processes and resources, better products and services, better
customer relationships, and better innovation, respectively.
References

Abdullah, Z. (2015). New app to help motorists find available parking. http://www.straitstimes.com/singapore/new-app-to-help-motorists-find-available-parking. Accessed 15 September 2016.
Al Serkal, M. (2015). Dubai to launch smart app for 2000 government services. http://ices-1.1625556. Accessed 15 September 2016.
Amatriain, X. (2013). Big and personal: Data and models behind Netflix recommendations. In Proceedings of the 2nd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (pp. 1–6). ACM.
Bennett, J., & Lanning, S. (2007). The Netflix prize. In Proceedings of KDD Cup and Workshop (Vol. 2007, p. 35).
Chesbrough, H. W. (2003). Open innovation: The new imperative for creating and profiting from technology. Boston: Harvard Business School Press.
Choquel, J. (2014). NikeFuel total on your Withings scale. http://blog.withings.com/2014/07/22/new-way-to-fuel-your-motivation-see-your-nikefuel-total-on-your-withings-scale. Accessed 15 September 2016.
Combs, V. (2014). An infographic that works: I want dose watch from GE healthcare. MedCity News. http://medcitynews.com/2014/05/infographic-works-want-ges-dosewatch. Accessed 15 September 2016.
Coren, M. (2016). Tesla has 780 million miles of driving data, and adds another million every 10 hours. http://qz.com/694520/tesla-has-780-million-miles-of-driving-data-and-adds-another-million-every-10-hours/. Accessed 15 September 2016.
Dataspark (2016). Can data science help build better public transport? https://datasparkanalytics.com/insight/can-data-science-help-build-better-public-transport. Accessed 15 September 2016.
Falk, T. (2013). Amazon changes prices millions of times every day. ZDNet.com. http://www.zdnet.com/article/amazon-changes-prices-millions-of-times-every-day. Accessed 15 September 2016.
GE Aviation (n.d.). http://www.geaviation.com/commercial/engines/genx/. Accessed 15 September 2016.
GE Quest (n.d.). http://www.gequest.com/c/flight. Accessed 15 November 2016.
Gomez-Uribe, C., & Hunt, N. (2015). The Netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Information Systems, 6(4), 1–19.
Hayes, C. (2004). What Wal-Mart knows about customers’ habits. The New York Times. customers-habits.html. Accessed 15 September 2016.
Heath, C., & Heath, D. (2007). Made to stick: Why some ideas survive and others die. New York, USA: Random House.
Hirson, R. (2015). Uber: The big data company. http://www.forbes.com/sites/ronhirson/2015/03/23/uber-the-big-data-company/#2987ae0225f4. Accessed 15 September 2016.
ITU and Cisco (2016). Harnessing the Internet of Things for global development. http:/. Accessed 15 September 2016.
Kaggle (n.d.). GE tackles the industrial internet. https://www.kaggle.com/content/kaggle/img/casestudies/Kaggle%20Case%20Study-GE.pdf. Accessed 15 September 2016.
Kuang, C. (2015). Disney’s $1 billion bet on a magical wristband. Wired. http://www.wired.com/2015/03/disney-magicband. Accessed 15 September 2016.
Lambrecht, A., & Skiera, B. (2006). Paying too much and being happy about it: Existence, causes and consequences of tariff-choice biases. Journal of Marketing Research, XLIII, 212–223.
Lee, S. (2015). 23andMe and Genentech in deal to research Parkinson’s treatments. SFGate, January 6, 2015. http://www.sfgate.com/health/article/23andMe-and-Genentech-in-deal-to-research-5997703.php. Accessed 15 September 2016.
Lin, D. (2016). Seeking and finding alpha—Will cloud disrupt the investment management industry? https://thinkacloud.wordpress.com/2016/03/07/seeking-and-finding-alpha-will-cloud-disrupt-the-investment-management-industry/. Accessed 15 September 2016.
Loveman, G. W. (2003). Diamonds in the data mine. Harvard Business Review, 81(5), 109–113.
Metz, C. (2016). In two moves, AlphaGo and Lee Sedol redefined the future. Wired. http://www.wired.com/2016/03/two-moves-alphago-lee-sedol-redefined-future/. Accessed 15 September 2016.
Moyer, C. (2016). How Google’s AlphaGo beat a Go world champion. http://www.theatlantic.com/technology/archive/2016/03/the-invisible-opponent/475611/. Accessed 15 September 2016.
O’Brien, S. A. (2015). Uber partners with Boston on traffic data. http://money.cnn.com/2015/01/13/technology/uber-boston-traffic-data/. Accessed 15 September 2016.
Pham, P. (2015). The impacts of big data that you may not have heard of. Forbes.com. http://-heard-of/#3b1ccc1c957d. Accessed 15 September 2016.
Pine, J., & Gilmore, J. (1999). The experience economy: Work is theatre and every business a stage. Boston: Harvard Business School Press.
Rosenbush, S., & Stevens, L. (2015). At UPS, the algorithm is the driver. The Wall Street Journal. http://www.wsj.com/articles/at-ups-the-algorithm-is-the-driver-1424136536. Accessed 15 September 2016.
Roy, R. (2009). Ford’s green goddess grows leaves. http://www.autoblog.com/2009/10/29/ford-smart-gauge-engineer/. Accessed 15 September 2016.
Skiera, B., & Olderog, T. (2000). The benefits of bundling strategies. Schmalenbach Business Review, 52, 137–159.
The Economist (2014). Computer says try this. http://www.economist.com/news/science-and-king. Accessed 15 September 2016.
Treacy, M., & Wiersema, F. (1995). The discipline of market leaders. Reading, USA: Addison-Wesley.
Ungerleider, N. (2015). Waze is driving into city hall. http://www.fastcompany.com/3045080/waze-is-driving-into-city-hall. Accessed 15 September 2016.
Warwick, G. (2015). GE advances analytical maintenance with digital twins. http://digital-twins. Accessed 15 September 2016.
Weinman, J. (2015). Digital disciplines. New Jersey: John Wiley & Sons.
Weinman, J. (2016). The internet of things for developing economies. CIO. http://www.cio.com/. Accessed 15 September 2016.
Wind, J., Fung, V., & Fung, W. (2009). Network orchestration: Creating and managing global supply chains without owning them. In P. R. Kleindorfer, Y. (Jerry) R. Wind, & R. E. Gunther (Eds.), The network challenge: Strategy, profit, and risk in an interlinked world (pp. 299–314). Upper Saddle River, USA: Wharton School Publishing.
Withings (2014). NikeFuel total on your Withings scale. http://blog.withings.com/2014/07/22/new-way-to-fuel-your-motivation-see-your-nikefuel-total-on-your-withings-scale/. Accessed 15 September 2016.
Xvela (2016). https://xvela.com/solutions.html. Accessed 15 September 2016.
Yegulalp, S. (2015). IBM’s Watson mines Twitter for sentiments. 17 Mar 2015. Infoworld.com. timents-good-bad-and-ugly.html. Accessed 15 September 2016.
Chapter 2
Start with Privacy by Design in All Big Data Applications
Ann Cavoukian and Michelle Chibba
2.1 Introduction
The evolution of networked information and communication technologies has, in
one generation, radically changed the value of and ways to manage data. These
trends carry profound implications for privacy. The creation and dissemination of
data has accelerated around the world, and is being copied and stored indefinitely,
resulting in the emergence of Big Data. The old information destruction paradigm
created in an era of paper records is no longer relevant, because digital bits
and bytes have now attained near immortality in cyberspace, thwarting efforts
to successfully remove them from “public” domains. The practical obscurity of
personal information—the data protection of yesteryear—is disappearing as data
becomes digitized, connected to the grid, and exploited in countless new ways.
We’ve all but given up trying to inventory and classify information, and now rely
more on advanced search techniques and automated tools to manage and “mine”
data. The combined effect is that while information has become cheap to distribute,
copy, and recombine, personal information has also become far more available
and consequential. The challenges to control and protect personal information are
significant. Implementing and following good privacy practices should not be a
hindrance to innovation, to reaping societal benefits or to finding the means to
reinforce the public good from Big Data analytics—in fact, by doing so, innovation
is fostered with doubly-enabling, win–win outcomes. The privacy solution requires
a combination of data minimization techniques, credible safeguards, meaningful
individual participation in data processing life cycles, and robust accountability
measures in place by organizations informed by an enhanced and enforceable set of
universal privacy principles better suited to modern realities. This is where Privacy
by Design becomes an essential approach for Big Data applications. This chapter
begins by defining information privacy, then provides an overview of the
privacy risks associated with Big Data applications. Finally, the authors discuss
Privacy by Design as an international framework for privacy, and provide guidance
on using the Privacy by Design Framework and its 7 Foundational Principles to
achieve both innovation and privacy—not one at the expense of the other.

A. Cavoukian • M. Chibba
Faculty of Science, Privacy and Big Data Institute, Ryerson University, 350 Victoria Street, Toronto, ON M5B 2K3, Canada
e-mail: ann.cavoukian@ryerson.ca; michelle.chibba@ryerson.ca

© Springer International Publishing AG 2018
S. Srinivasan (ed.), Guide to Big Data Applications, Studies in Big Data 26, DOI 10.1007/978-3-319-53817-4_2
2.2 Information Privacy Defined
Information privacy refers to the right or ability of individuals to exercise control
over the collection, use and disclosure by others of their personal information
(Clarke 2000). The ability to determine the fate of one’s personal information is
so important that the authors wish to bring to readers’ attention the
term “informational self-determination,” which underpins the approach taken to
privacy in this chapter. This term was established in 1983 in Germany when the
Constitutional Court ruled that individuals, not governments, determine the fate of
their personal information. More recently, in December 2013, the United Nations
General Assembly adopted resolution 68/167 (UN 2016), which expressed deep
concern at the negative impact that surveillance and interception of communications
may have on human rights. The General Assembly affirmed that the rights held by
people offline must also be protected online, and it called upon all States to respect
and protect the right to privacy in digital communication.
Information privacy makes each of us ‘masters’ of the data that identifies each
of us – as individual, citizen, worker, consumer, patient, student, tourist, investor,
parent, son, or daughter. For this, the notions of empowerment, control, choice and
self-determination are the very essence of what we refer to as information privacy.
We expect that governments and businesses, as ‘custodians’ of our information, can
be trusted with its safekeeping and proper use.
There have also been references to statements such as “If you have nothing to
hide, you have nothing to fear” (Solove 2007). Privacy is not about secrecy. It is
about the freedom to exercise one’s right to decide with whom to share the
personal details of one’s life. Democracy does not begin with intrusions into
one’s personal sphere—it begins with human rights, civil liberties and privacy—all
fundamental to individual freedom.
Sometimes, safekeeping or information security is taken to mean that privacy
has been addressed. To be clear, information security does not equal privacy.
While data security certainly plays a vital role in enhancing privacy, there is an
important distinction to be made—security is about protecting data assets. It is
about achieving the goals of confidentiality, integrity and availability. Privacy-related
goals developed in Europe that complement this security triad are unlinkability,
transparency and intervenability. In other words, information privacy incorporates
a much broader set of protections than security alone. We look to the work on
2 Start with Privacy by Design in All Big Data Applications 31
‘contextual integrity’ (Dwork 2014) that extends the meaning of privacy to a much
broader class of transmission principles that cannot be presumed unless warranted
by other context-specific parameters influenced by other actors and information
types. Privacy relates not only to the way that information is protected and accessed,
but also to the way in which it is collected and used. For example, user access
controls protect personal information from internal threats by preventing even the
possibility of accidental or intentional disclosure or misuse. This protection is
especially needed in the world of Big Data.
2.2.1 Is It Personally Identifiable Information?
Not all data gives rise to privacy concerns. An important first step for any Big
Data application is to determine whether the information involved falls under the
definition of personally identifiable information (PII). Privacy laws around the
world each include a definition of personal information, and it is this definition
that determines whether the rules apply. Here we will use the
NIST definition, under which personal information (also known as personally identifiable
information) is any information, recorded or otherwise, relating
to an identifiable individual (NIST 2010). It is important to note that almost any
information (e.g. biographical, biological, genealogical, historical, transactional,
locational, relational, computational, vocational, or reputational), may become
personal in nature. Privacy laws and associated rules will apply to information
if there is a reasonable possibility of identifying a specific individual—whether
directly, indirectly, or through manipulation or data linkage.
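A classic demonstration of identification through data linkage joins a “de-identified” dataset to a public one on shared quasi-identifiers such as ZIP code, birth date, and sex. A minimal sketch, with all records invented for illustration:

```python
# "De-identified" medical records still carry quasi-identifiers.
medical = [
    {"zip": "02139", "birth": "1965-07-31", "sex": "F", "diagnosis": "condition A"},
    {"zip": "02139", "birth": "1971-01-02", "sex": "M", "diagnosis": "condition B"},
]

# A public voter roll holds the same quasi-identifiers plus names.
voters = [
    {"name": "J. Smith", "zip": "02139", "birth": "1965-07-31", "sex": "F"},
]

QUASI = ("zip", "birth", "sex")

def link(records, roll):
    """Re-identify records whose quasi-identifiers match exactly one voter."""
    out = []
    for rec in records:
        hits = [v for v in roll if all(v[q] == rec[q] for q in QUASI)]
        if len(hits) == 1:  # a unique match pins the record to a person
            out.append((hits[0]["name"], rec["diagnosis"]))
    return out

print(link(medical, voters))
```

This is why the “reasonable possibility of identifying a specific individual” test must consider indirect identification and linkage, not just names and ID numbers.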
Understanding the different forms of non-personal data helps to better understand
what constitutes personal information. One example is de-identified or anonymous
information, which will be dealt with in more detail later in this chapter. NIST
defines de-identified information as records that have had enough personal infor-
mation removed or obscured in some manner such that the remaining information
does not identify an individual, and there is no reasonable basis to believe that the
information can be used to identify an individual (NIST 2015). As an illustration,
under a U.S. law known as the Health Insurance Portability and Accountability Act
(HIPAA), a set of standards exist to determine when health-care information is
no longer ‘individually identifiable’ or de-identified (HHS 2012). If this standard
is achieved, then the health-care information would not be subject to this law
governing the privacy of health care information. Another example is the EU
General Data Protection Regulation (GDPR) that similarly, excludes anonymous
information (EU Commission 2015). Of interest, however, is that this European
law introduces the concept of “pseudonymization” defined as the processing of
personal data in such a way as to prevent attribution to an identified or identifiable
person without additional information that may be held separately.1 For research and
statistical purposes, certain requirements under the GDPR are relaxed if the personal
data is pseudonymized, which is considered an appropriate safeguard alongside
encryption (Official Journal of the European Union 2016).
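One common way to implement pseudonymization in practice (one possible approach, not a mechanism mandated by the GDPR) is to replace each direct identifier with a stable keyed pseudonym, with the key held separately from the data:

```python
import hmac
import hashlib
import secrets

SECRET_KEY = secrets.token_bytes(32)  # held separately from the data set

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a stable keyed pseudonym."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

# The same person maps to the same pseudonym across records,
# so analysis can still link records belonging to one individual.
records = [("alice@example.com", 120), ("bob@example.com", 80),
           ("alice@example.com", 95)]
pseudo = [(pseudonymize(email), value) for email, value in records]

assert pseudo[0][0] == pseudo[2][0]          # linkable across records
assert pseudo[0][0] != "alice@example.com"   # direct identifier removed
```

Because the key is held separately, records remain linkable for analysis, yet anyone holding the data alone cannot attribute them to a person; reversal requires the separately held key or identity table, exactly the property the footnoted NIST definition describes.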
Another form is when personal information is aggregated. Aggregation refers
to summary data that have been generated by performing a calculation across all
individual units as a whole. For example, medical researchers may use aggregated
patient data to assess new treatment strategies; governments may use aggregated
population data for statistical analysis on certain publicly funded programs for
reporting purposes; companies may use aggregated sales data to assist in deter-
mining future product lines. Work has also been done on privacy-preserving data
aggregation in wireless sensor networks, especially relevant in the context of the
Internet of Things (Zhang et al. 2016). By using aggregated data, there is a reduced
risk of connecting this information to a specific person or identify an individual.
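The idea of releasing only aggregates can be sketched as a group-by computation in which raw, individual-level rows never leave the organization. The records below are invented for illustration:

```python
from collections import defaultdict
from statistics import mean

# Individual-level records (identifying) stay inside the organization.
patients = [
    {"clinic": "North", "treatment": "A", "recovery_days": 12},
    {"clinic": "North", "treatment": "A", "recovery_days": 9},
    {"clinic": "South", "treatment": "B", "recovery_days": 15},
    {"clinic": "South", "treatment": "B", "recovery_days": 11},
]

groups = defaultdict(list)
for p in patients:
    groups[(p["clinic"], p["treatment"])].append(p["recovery_days"])

# Only group counts and means are published, not the raw rows.
aggregate = {key: {"n": len(v), "mean_recovery": mean(v)}
             for key, v in groups.items()}
print(aggregate)
```

Note that aggregates over very small groups can still single out individuals, which is why statistical agencies typically enforce minimum cell sizes before publication.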
Lastly, while personal information may be classified as confidential, not all
confidential information should be governed under privacy rules. Confidential
information includes information that should not be publicly available and often
holds tremendous value and importance for organizations, such as strategic business
plans, interim revenue forecasts, proprietary research, or other intellectual property.
The distinction is that while the theft or loss of such confidential information is of
grave concern for an organization, it would not constitute a privacy breach because
it does not involve personal information—rather, it is business information.
The growth in Big Data applications and other information communication
technologies has added to the challenge of defining personal information.
There are times when information architectures, developed by engineers to ensure
the smooth functioning of computer networks and connectivity, lead to unforeseen
uses that have an impact on identity and privacy. These changes broaden
what constitutes personal information, extending it from obvious tombstone
data (name, address, telephone number, date of birth, gender) to the innocuous
computational data or metadata that was once solely the purview of engineering
requirements for communication between devices (Cameron 2013; Mayer et al. 2016).
Metadata, for example, is information generated by our communications devices
and our communications service providers as we use landline or mobile phones,
computers, tablets, or other computing devices. Metadata is essentially information
about other information—in this case, relating to our communications (Mayer
et al. 2016). Using metadata in Big Data analysis requires understanding of context.
1 NIST (2015) defines ‘pseudonymization’ as a specific kind of transformation in which
the names and other information that directly identifies an individual are replaced with
pseudonyms. Pseudonymization allows linking information belonging to an individual across
multiple data records or information systems, provided that all direct identifiers are systematically
pseudonymized. Pseudonymization can be readily reversed if the entity that performed the
pseudonymization retains a table linking the original identities to the pseudonyms, or if the
substitution is performed using an algorithm for which the parameters are known or can be
2 Start with Privacy by Design in All Big Data Applications 33
Metadata reveals detailed patterns of associations that can be far more invasive of
privacy than merely accessing the content of one’s communications (Cavoukian
2013a, b). Addresses such as the Media Access Control (MAC) address, which are
designed to be persistent and unique for the purpose of running software applications
and utilizing Wi-Fi positioning systems to communicate with a local area network,
can now reveal much more about an individual through advances in geo-location
services and the use of smart mobile devices (Cavoukian and Cameron 2011). Another
good example in the mobile environment would be a unique device identifier such as
an International Mobile Equipment Identity (IMEI) number: even though this does
not name the individual, if it is used to treat individuals differently it will fit the
definition of personal data (Information Commissioner’s Office ICO 2013).
No doubt, the mobile ecosystem is extremely complex and architectures that
were first developed to ensure the functioning of wireless network components
now act as geo-location points, thereby transforming the original intent or what
might be an unintended consequence for privacy. As noted by the International
Working Group on Data Protection in Telecommunications (IWGDPT 2004): “The
enhanced precision of location information and its availability to parties other
than the operators of mobile telecommunications networks create unprecedented
threats to the privacy of the users of mobile devices linked to telecommunications
networks.” When a unique identifier may be linked to an individual, it often falls
under the definition of “personal information” and carries with it a set of regulatory
obligations.
2.3 Big Data: Understanding the Challenges to Privacy
Before moving into understanding the challenges and risks to privacy that arise
from Big Data applications and the associated data ecosystem, it is important to
emphasize that these should not be deterrents to extracting value from Big Data.
The authors believe that by understanding these privacy risks early on, Big Data
application developers, researchers, policymakers, and other stakeholders will be
sensitized to the privacy issues and therefore, be able to raise early flags on potential
unintended consequences as part of a privacy/security threat risk analysis.
We know that with advances in Big Data applications, organizations are
developing a more complete understanding of the individuals with whom they
interact because of the growth and development of the data analytical tools and
systems available to them. Public health authorities, for example, need more detailed
information in order to better inform policy decisions related to managing their
increasingly limited resources. Local governments are able to gain insights never
before available into traffic patterns that lead to greater road and pedestrian safety.
These examples and many more demonstrate the ability to extract insights from Big
Data that will, without a doubt, be of enormous socio-economic significance. These
challenges and insights are further examined in the narrative on the impact of Big
Data on privacy (Lane et al. 2014).
With this shift to knowledge creation and service delivery, the value of
information and the need to manage it responsibly have grown dramatically. At the
same time, rapid innovation, global competition and increasing system complexity
present profound challenges for informational privacy. The notion of informational
self-determination seems to be collapsing under the weight, diversity, speed and
volume of Big Data processing in the modern digital era. When a Big Data set is
composed of identifiable information, then a host of customary privacy risks apply.
As technological advances improve our ability to exploit Big Data, potential privacy
concerns could stir a regulatory backlash that would dampen the data economy and
stifle innovation (Tene and Polonetsky 2013). These concerns are reflected in, for
example, the debate around the new European legislation that includes a ‘right to
be forgotten’ that is aimed at helping individuals better manage data protection
risks online by requiring organizations to delete their data if there are no legitimate
grounds for retaining it (EU Commission 2012). The genesis of the incorporation
of this right comes from a citizen complaint to a data protection regulator against
a newspaper and a major search engine concerning outdated information about the
citizen that continued to appear in online search results of the citizen’s name. Under
certain conditions now, individuals have the right to ask search engines to remove
links with personal information about them that is “inaccurate, inadequate, irrelevant
or excessive.” (EU Commission 2012)
Big Data challenges the tenets of information security, which may also be
of consequence for the protection of privacy. Security challenges arise because
Big Data involves several infrastructure layers for data processing and new types
of infrastructure to handle the enormous flow of data, while the encryption of large
data sets does not scale easily. Further, a data breach may have more severe
consequences when enormous datasets are stored. Consider, for example, the value
of a large dataset of identifiable information or confidential information for that
matter, that could make it a target of theft or ransom—the larger the dataset, the
more likely it is to be targeted for misuse. Once unauthorized disclosure takes place,
the impact on privacy will be far greater, because the information is centralized and
contains more data elements. In extreme cases, unauthorized disclosure of personal
information could put public safety at risk.
Outsourcing Big Data analytics and managing data accountability are other
issues that arise when handling identifiable datasets. This is especially true in
a Big Data context, since organizations with large amounts of data may lack
the ability to perform analytics themselves and will outsource this analysis and
reporting (Fogarty and Bell 2014). There is also a growing presence of data
brokers involved in collecting information, including personal information, from
a wide variety of sources other than the individual, for the purpose of reselling
such information to their customers for various purposes, including verifying an
individual’s identity, differentiating records, marketing products, and preventing
financial fraud (FTC 2012). Data governance becomes a sine qua non for the
enterprise and the stakeholders within the Big Data ecosystem.
2.3.1 Big Data: The Antithesis of Data Minimization
To begin, the basis of Big Data is the antithesis of a fundamental privacy principle:
data minimization. The principle of data minimization, or the limitation
principle (Gürses et al. 2011), is intended to ensure that no more personal
information is collected and stored than is necessary to fulfil clearly defined purposes.
This approach extends through the full data lifecycle, where personal data must be
deleted when it is no longer necessary for the original purpose. The challenge to this
is that Big Data entails a new way of looking at data, where data is assigned value in
itself. In other words, the value of the data is linked to its future and potential uses.
In moving from data minimization to what may be termed data maximization
or Big Data, the challenge to privacy is the risk of creating automatic data linkages
between seemingly non-identifiable data which, on its own, may not be sensitive, but
when compiled, may generate a sensitive result. These linkages can result in a broad
portrait of an individual including revelations of a sensitive nature—a portrait once
inconceivable since the identifiers were separated in various databases. Through the
use of Big Data tools, we also know that it is possible to identify patterns which may
predict people’s dispositions, for example related to health, political viewpoints or
sexual orientation (Cavoukian and Jonas 2012).
By connecting key pieces of data that link people to things, the capability of data
analytics can render ordinary data into information about an identifiable individual
and reveal details about a person’s lifestyle and habits. A telephone number or postal
code, for example, can be combined with other data to identify the location of a
person’s home and work; an IP or email address can be used to identify consumer
habits and social networks.
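The linkage risk described above can be illustrated with a toy example (all records are invented): a dataset stripped of names is joined to a public, named dataset on shared quasi-identifiers, re-attaching identities to supposedly anonymous records.

```python
# A "de-identified" health dataset: no names, only quasi-identifiers remain.
health = [
    {"postal": "M5V 2T6", "birth_year": 1975, "sex": "F", "diagnosis": "diabetes"},
    {"postal": "K1A 0B1", "birth_year": 1982, "sex": "M", "diagnosis": "asthma"},
]

# A public dataset (e.g. a voter roll) that does carry names.
public = [
    {"name": "Jane Doe", "postal": "M5V 2T6", "birth_year": 1975, "sex": "F"},
    {"name": "John Roe", "postal": "K1A 0B1", "birth_year": 1982, "sex": "M"},
]

def link(anonymous, identified, keys=("postal", "birth_year", "sex")):
    """Join two datasets on shared quasi-identifiers."""
    matches = []
    for a in anonymous:
        for p in identified:
            if all(a[k] == p[k] for k in keys):
                matches.append({"name": p["name"], "diagnosis": a["diagnosis"]})
    return matches

# Each "anonymous" medical record is now attached to a name.
linked = link(health, public)
```

No single field here names anyone, yet in combination the postal code, birth year, and sex uniquely pick out an individual, which is exactly the compilation effect the text describes.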
An important trend and contribution to Big Data is the movement by government
institutions to open up their data holdings in an effort to enhance citizen participation
in government and at the same time spark innovation and new insights through
access to invaluable government data (Cavoukian 2009).2
With this potential for Big Data to create data linkages being so powerful, the
term “super” data or “super” content has been introduced (Cameron 2013). “Super”
data is more powerful than other data in a Big Data context, because the use of
one piece of “super” data, which on its own would not normally reveal much, can
spark new data linkages that grow exponentially until the individual is identified.
Each new transaction in a Big Data system would compound this effect and spread
identifiability like a contagion.
Indeed, to illustrate the significant implications of data maximization on privacy,
we need only look at the shock of the Snowden revelations and the eventual
repercussions. A top EU court decision in 2015 declared the longstanding Safe Harbor
2 There are many government Open Data initiatives, such as the U.S. Government’s Open Data at
www.data.gov; Canadian Government’s Open Data at http://open.canada.ca/en/open-data; UN
Data at http://data.un.org/; EU Open Data Portal at https://data.europa.eu/euodp/en/data/. This is
just a sample of the many Open Data sources around the world.
data transfer agreement between Europe and the U.S. invalid (Lomas 2015). The
issues had everything to do with concerns about not just government surveillance
but also the relationship with U.S. businesses and their privacy practices. Eventually, a new
agreement was introduced known as the EU-U.S. Privacy Shield (US DOC 2016)
(EU Commission 2016). Among other elements, this new mechanism introduces
greater transparency requirements for the commercial sector regarding their privacy
practices, along with affirmations from U.S. authorities that the collection of
information for intelligence purposes is focussed and targeted.
The authors strongly believe that an important lesson learned for Big Data
success is that when the individual participant is more directly involved in information
collection, the accuracy of the information’s context grows and invariably increases
the quality of the data under analysis. Another observation, that may seem to be
contradictory, is that even in Big Data scenarios where algorithms are tasked with
finding connections within vast datasets, data minimization is not only essential
for safeguarding personally identifiable information—it could help with finding the
needle without the haystack by reducing extraneous irrelevant data.
2.3.2 Predictive Analysis: Correlation Versus Causation
Use of correlation analysis may yield completely incorrect results for individuals.
Correlation is often mistaken for causality (Ritter 2014). If analysis shows that
individuals who like X have an eighty per cent probability of being exposed
to Y, it cannot be concluded that this will occur in 100 per cent of cases.
Thus, discrimination on the basis of statistical analysis may become a privacy issue
(Sweeney 2013). A development where more and more decisions in society are
based on use of algorithms may result in a “Dictatorship of Data”, (Cukier and
Mayer-Schonberger 2013) where we are no longer judged on the basis of our actual
actions, but on the basis of what the data indicate will be our probable actions.
In a survey undertaken by the Annenberg Public Policy Center, the researchers
found that most Americans overwhelmingly consider forms of price discrimination
and behavioral targeting to be ethically wrong (Turow et al. 2015). Not only are these
approaches based on profiling individuals, but they also use personal information about
an individual for purposes the individual is unaware of. The openness of data sources
and the power of not just data mining but now predictive analysis and other complex
algorithms also present a challenge to the process of de-identification. The risks of
re-identification are more apparent, requiring more sophisticated de-identification
techniques (El Emam et al. 2011). In addition, while the concept of “nudging”
is gaining popularity, using identifiable data to profile individuals in order to analyse,
predict, and influence human behaviour may be perceived as invasive and unjustified.
Data determinism and discrimination are also concerns that arise from a
Dictatorship of Data. Extensive use of automated decisions and prediction analyses
may actually result in adverse consequences for individuals. Algorithms are not
neutral, but reflect choices, among others, about data, connections, inferences,
interpretations, and thresholds for inclusion that advance a specific purpose. The
concern is that Big Data may consolidate existing prejudices and stereotyping, as
well as reinforce social exclusion and stratification (Tene and Polonetsky 2013;
IWGDPT 2014; FTC 2016). This is said to have implications for the quality of Big
Data analysis because of “echo chambers”3 in the collection phase (Singer 2011;
Quattrociocchi et al. 2016).
2.3.3 Lack of Transparency/Accountability
As an individual’s personal information spreads throughout the Big Data ecosystem
amongst numerous players, it is easy to see that the individual will have less control
over what may be happening to the data. This secondary use of data raises privacy
concerns. A primary purpose is identified at the time of collection of personal
information. Secondary uses are generally permitted with that person’s consent,
unless otherwise permitted by law. Using personal information in Big Data analytics
may not be permitted under the terms of the original consent as it may constitute a
secondary use—unless consent to the secondary use is obtained from the individual.
This characteristic is often linked with a lack of transparency. Whether deliberate or
inadvertent, a lack of openness and transparency about how data is compiled and used
is contrary to a fundamental privacy principle.
It is clear that organizations participating in the Big Data ecosystem need to
have a strong privacy program in place (responsible information management). If
individuals don’t have confidence that their personal information is being managed
properly in Big Data applications, then their trust will be eroded and they may
withdraw or find alternative mechanisms to protect their identity and privacy. The
consequences of a privacy breach can include reputational harm, legal action, damage
to a company’s brand, regulatory sanctions, and disruption to internal operations.
In more severe cases, it could cause the demise of an organization (Solove 2014).
According to TRUSTe’s Consumer Privacy Confidence Index 2016, 92 per cent of
individuals worry about their privacy online, 44 per cent do not trust companies with
their personal information, and 89 per cent avoid doing business with companies that
they believe do not protect their privacy (TRUSTe/NCSA 2016).
Despite the fact that privacy and security risks may exist, organizations should
not fear pursuing innovation through data analytics. Through the application of
privacy controls and the use of appropriate privacy tools, privacy risks may be mitigated,
thereby enabling organizations to capitalize on the transformative potential of Big
Data—while adequately safeguarding personal information. This is the central
3 In news media, an echo chamber is a metaphorical description of a situation in which information,
ideas, or beliefs are amplified or reinforced by transmission and repetition inside an “enclosed”
system, where different or competing views are censored, disallowed, or otherwise underrepresented.
The term is by analogy with an acoustic echo chamber, where sounds reverberate.
motivation for Privacy by Design, which is aimed at preventing privacy violations
from arising in the first place. Given the necessity of establishing user trust in
order to gain public acceptance of its technologies, any organization seeking to take
advantage of Big Data must apply the Privacy by Design framework as new products
and applications are developed, marketed, and deployed.
2.4 Privacy by Design and the 7 Foundational Principles
The premise of Privacy by Design has at its roots the Fair Information Practices, or
FIPs. Indeed, most privacy laws around the world are based on these practices. By
way of history, the Code of Fair Information Practices (FIPs) was developed in the
1970s and is based on essentially five principles (EPIC n.d.):
1. There must be no personal data record-keeping systems whose very existence is
secret.
2. There must be a way for a person to find out what information about the person
is in a record and how it is used.
3. There must be a way for a person to prevent information about the person that was
obtained for one purpose from being used or made available for other purposes
without the person’s consent.
4. There must be a way for a person to correct or amend a record of identifiable
information about the person.
5. Any organization creating, maintaining, using, or disseminating records of
identifiable personal data must assure the reliability of the data for their intended
use and must take precautions to prevent misuses of the data.
FIPs represented an important development in the evolution of data privacy since
they provided an essential starting point for responsible information management
practices. However, many organizations began to view enabling privacy via FIPs
and associated laws as regulatory burdens that inhibited innovation. This zero-sum
mindset viewed the task of protecting personal information as a “balancing act”
of competing business and privacy requirements. This balancing approach tended
to overemphasize the significance of notice and choice as the primary method
for addressing personal information data management. As technologies developed,
it became more and more difficult for individuals to meaningfully exert control
over their personal information. It became increasingly clear that FIPs
were a necessary but not a sufficient condition for protecting privacy. Accordingly,
the focus of privacy protection began to shift from reactive compliance with
FIPs to proactive system design.
With advances in technologies, it became increasingly apparent that systems
needed to be complemented by a set of norms that reflect broader privacy
dimensions (Damiani 2013). The current challenges to privacy relate to the dynamic
relationship associated with the forces of innovation, competition and the global
adoption of information communications technologies. These challenges have been
mirrored in security by design. Just as users rely on security engineers to ensure the
adequacy of encryption key lengths, for example, data subjects will rely on privacy
engineers to appropriately embed risk-based controls within systems and processes.
Given the complex and rapid nature of these developments, it becomes apparent that
privacy has to become the default mode of design and operation.
Privacy by Design (PbD) is a globally recognized proactive approach to privacy.
It is a framework developed in the late 1990s by co-author Dr. Ann Cavoukian
(Cavoukian 2011). Privacy by Design is a response to compliance-based approaches
to privacy protection that tend to focus on addressing privacy breaches after-the-fact.
Our view is that this reactive approach does not adequately meet the demands of the
Big Data era. Instead, we recommend that organizations consciously and proactively
incorporate privacy strategies into their operations, by building privacy protections
into their technology, business strategies, and operational processes.
By taking a proactive approach to privacy and making privacy the default setting,
PbD can have a wide-ranging impact across an organization. The approach can
result in changes to governance structures, operational and strategic objectives, roles
and accountabilities, policies, information systems and data flows, decision-making
processes, relationships with stakeholders, and even the organization’s culture.
PbD has been endorsed by many public- and private-sector authorities in the
United States, the European Union, and elsewhere (Harris 2015). In 2010, PbD
was unanimously passed as a framework for privacy protection by the International
Assembly of Privacy Commissioners and Data Protection Authorities (CNW 2010).
This approach transforms consumer privacy issues from a pure policy or compliance
issue into a business imperative. Since getting privacy right has become a critical
success factor to any organization that deals with personal information, taking
an approach that is principled and technology-neutral is now more relevant than
ever. Privacy is best interwoven proactively and to achieve this, privacy principles
should be introduced early on—during architecture planning, system design, and
the development of operational procedures. Privacy by Design, where possible,
should be rooted in actual code, with defaults aligning both privacy and business
objectives.
The business case for privacy focuses on gaining and maintaining customer trust,
breeding loyalty, and generating repeat business. The value proposition typically
reflects the following:
1. Consumer trust drives successful customer relationship management (CRM) and
lifetime value—in other words, business revenues;
2. Broken trust will result in a loss of market share and revenue, translating into less
return business and lower stock value; and
3. Consumer trust hinges critically on the strength and credibility of an
organization’s data privacy policies and practices.
In a marketplace where organizations are banding together to offer suites of
goods and services, trust is clearly essential. Of course, trust is not simply an end-
user issue. Companies that have done the work to gain the trust of their customers
cannot risk losing it as a result of another organization’s poor business practices.
2.4.1 The 7 Foundational Principles
Privacy by Design Foundational Principles build upon universal FIPPs in a way
that updates and adapts them to modern information management needs and
requirements. By emphasizing proactive leadership and goal-setting, systematic
and verifiable implementation methods, and demonstrable positive-sum results, the
principles are designed to reconcile the need for robust data protection with an
organization’s desire to unlock the potential of data-driven innovation. Implementing
PbD means focusing on, and living up to, the following 7 Foundational Principles,
which form the essence of PbD (Cavoukian 2011).
Principle 1: Use proactive rather than reactive measures, anticipating and preventing
privacy-invasive events before they happen (Proactive not Reactive; Preventative not
Remedial).
or business practice. If an individual does nothing, their privacy still remains intact
(Privacy as the Default). Data minimization is also a default position for privacy, i.e.
the concept of always starting with the minimum personal data possible and then
justifying additional collection, disclosure, retention, and use on an exceptional and
specific data-by-data basis.
Principle 3: Privacy must be embedded into the design and architecture of IT
systems and business practices. It is not bolted on as an add-on, after the fact. Privacy
is integral to the system, without diminishing functionality (Privacy Embedded into
Design).
Principle 4: All legitimate interests and objectives are accommodated in a
positive-sum manner (Full Functionality—Positive-Sum [win/win], not Zero-Sum).
Principle 5: Security is applied throughout the entire lifecycle of the data
involved—data is securely retained, and then securely destroyed at the end of the
process, in a timely fashion (End-to-End Security—Full Lifecycle Protection).
Principle 6: All stakeholders are assured that whatever the business practice or
technology involved, it is in fact, operating according to the stated promises and
objectives, subject to independent verification; transparency is key (Visibility and
Transparency—Keep it Open).
Principle 7: Architects and operators must keep the interests of the individual
uppermost by offering such measures as strong privacy defaults, appropriate notice,
and empowering user-friendly options (Respect for User Privacy—Keep it
User-Centric).
2.5 Big Data Applications: Guidance on Applying the PbD
Framework and Principles
While the 7 Foundational Principles of PbD should be applied in a holistic manner
as a broad framework, there are specific principles worth pointing out because
they are what define and distinguish this approach to privacy. These are principles
1 (Proactive and Preventative), 2 (By Default/Data Minimization), 3 (Embedded
in Design) and 4 (Positive-sum). Although the two examples provided below are
specific to mobile apps, they are illustrative of the Privacy by Design approach to
being proactive, focussing on data minimization and embedding privacy by default.
2.5.1 Being Proactive About Privacy Through Prevention
Privacy by Design aspires to the highest practicable global standards of privacy and
data protection, going beyond compliance to achieve visible evidence and
recognition of leadership, regardless of jurisdiction. Good privacy doesn’t just
happen by itself—it requires proactive and continuous goal-setting at the earliest
stages. Global leadership in data protection begins with explicit recognition of the
benefits and value of adopting strong privacy practices, early and consistently (e.g.,
preventing data breaches or harms to individuals from occurring in the first place).
Your app’s main purpose is to display maps. These maps are downloaded by a
mobile device from your central server. They are then later used on the device,
when there may be no network connection available. You realise that analytics
would be useful to see which maps are being downloaded by which users.
This in turn would allow you to make targeted suggestions to individual users
about which other maps they might want to download. You consider using
the following to identify individuals who download the maps: i) the device’s
IMEI number; ii) the MAC address of the device’s wireless network interface;
and iii) the mobile phone number used by the device. You realise that any of
those identifiers may constitute personal data, so for simplicity you decide
not to take on the responsibility of dealing with them yourself. Instead, you
decide to gain users’ consent for the map suggestions feature. When a user
consents, they are assigned a randomly generated unique identifier, solely for
use by your app. (Excerpted from Information Commissioner’s Office ICO 2013)
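The ICO example above can be sketched as follows; the class and method names are invented for illustration. The essential point is that no device identifier (IMEI, MAC address, phone number) is ever read, and an identifier exists only after consent.

```python
import uuid

class MapApp:
    """Sketch of the ICO example: a random, app-scoped identifier is
    assigned only when the user consents to the map-suggestions feature."""

    def __init__(self):
        # No device identifier (IMEI, MAC, phone number) is ever collected.
        self.user_id = None

    def consent_to_suggestions(self):
        """Assign a randomly generated identifier on first consent."""
        if self.user_id is None:
            self.user_id = str(uuid.uuid4())  # meaningful only to this app
        return self.user_id

app = MapApp()
assert app.user_id is None  # privacy intact by default, before any consent
uid = app.consent_to_suggestions()
```

Because the identifier is random and scoped to the app, it supports the suggestions feature without carrying any meaning outside it, sidestepping the personal-data obligations that attach to persistent device identifiers.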
2.5.2 Data Minimization as the Default
Personal information that is not collected, retained, or disclosed is data that does
not need to be protected, managed, or accounted for. If the personal information
does not exist, then it cannot be accessed, altered, copied, enriched, shared, lost,
hacked, or otherwise used for secondary and unauthorized purposes. Privacy by
Design is premised on the idea that the starting point for designing information
technologies and systems should always be maximally privacy-enhancing. The
default configuration or settings of technologies, tools, platforms, or services offered
to individuals should be as restrictive as possible regarding use of personally
identifiable data.
When Big Data analytics involves the use of personally identifiable information,
data minimization has the biggest impact on managing data privacy risks, by
effectively eliminating risk at the earliest stage of the information life cycle.
Designing Big Data analytical systems at the front end with no collection of
personally identifiable information—unless and until a specific and compelling
purpose is defined—is the ideal. For example, use(s) of personal information should
be limited to the intended, primary purpose(s) of collection and only extended to
other, non-consistent uses with the explicit consent of the individual (Article 29
Data Protection Working Party 2013). In other cases, organizations may find that
summary or aggregate data may be more than sufficient for their needs.
Your app uses GPS location services to recommend interesting activities near
to where the user is. The database of suggested activities is kept on a central
server under your control. One of your design goals is to keep the amount of
data your app downloads from the central server to a minimum. You therefore
design your app so that each time you use it, it sends location data to the
central server so that only the nearest activities are downloaded. However,
you are also keen to use less privacy-intrusive data where possible. You design
your app so that, by default, the device itself works out where the nearest town
is and uses this location instead, avoiding the need to send exact GPS co-
ordinates of the user’s location back to the central server. Users who want
results based on their accurate location can change the default behaviour.
(Excerpted from Information Commissioner’s Office ICO 2013)
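The coarse-by-default behaviour in the ICO excerpt might look like this in outline; the town table, coordinates, and function names are invented for illustration.

```python
# Hypothetical nearest-town lookup table kept on the device itself.
TOWNS = {"Springfield": (40.0, -89.0), "Shelbyville": (39.4, -88.2)}

def nearest_town(lat, lon):
    """Coarsen an exact GPS fix to the nearest known town centre."""
    return min(TOWNS, key=lambda t: (TOWNS[t][0] - lat) ** 2 + (TOWNS[t][1] - lon) ** 2)

def location_to_send(lat, lon, share_exact=False):
    """By default only a town name leaves the device; exact co-ordinates
    are sent to the server only if the user changes the default."""
    if share_exact:
        return {"lat": lat, "lon": lon}
    return {"town": nearest_town(lat, lon)}

coarse = location_to_send(40.01, -88.95)
exact = location_to_send(40.01, -88.95, share_exact=True)
```

The privacy-protective behaviour is the default path; the precise location is transmitted only on an explicit opt-in, mirroring the design described in the excerpt.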
De-identification strategies are a form of data minimization. De-identification
provides for a set of tools or techniques to strip a dataset of all information that
could be used to identify an individual, either directly or indirectly, through linkages
to other datasets. The techniques involve deleting or masking “direct identifiers,”
such as names or social insurance numbers, and suppressing or generalizing indirect
identifiers, such as postal codes or birthdates. Indirect identifiers may not be
2 Start with Privacy by Design in All Big Data Applications 43
personally identifying in and of themselves, but when linked to other datasets that
contain direct identifiers, may personally identify individuals. If done properly,
de-identified data can be used for research purposes and data analysis—thus
contributing new insights and achieving innovative goals—while minimizing the
risk of disclosure of the identities of the individuals behind the data (Cavoukian and
El Emam 2014).
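The masking and generalization steps described above can be illustrated with a minimal sketch. The field names and generalization rules here are hypothetical; real de-identification protocols (e.g., HIPAA Safe Harbor) specify far more extensive rules.

```python
# Illustrative sketch (not a vetted de-identification protocol): delete
# direct identifiers and generalize indirect ones. Field names are
# hypothetical.

def deidentify(record: dict) -> dict:
    out = dict(record)
    # Delete direct identifiers.
    out.pop("name", None)
    out.pop("sin", None)  # social insurance number
    # Generalize indirect identifiers: keep only the first three
    # characters of a Canadian postal code, and only the year of birth.
    if "postal_code" in out:
        out["postal_code"] = out["postal_code"][:3]
    if "birthdate" in out:          # e.g. "1984-07-21"
        out["birth_year"] = out.pop("birthdate")[:4]
    return out

rec = {"name": "J. Doe", "sin": "046-454-286",
       "postal_code": "M5V 2T6", "birthdate": "1984-07-21",
       "diagnosis": "I10"}
print(deidentify(rec))
# {'postal_code': 'M5V', 'diagnosis': 'I10', 'birth_year': '1984'}
```

Note that, as the text warns, the generalized record is still vulnerable to linkage with other datasets; generalization only raises the cost of re-identification, it does not eliminate it.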
This is not to suggest, of course, that data should be collected exclusively in
instances where it may become useful or that data collected for one purpose may be
repurposed at will. Rather, in a big data world, the principle of data minimization
should be interpreted differently, requiring organizations to de-identify data when
possible, implement reasonable security measures, and limit uses of data to those
that are acceptable from not only an individual but also a societal perspective (Tene
and Polonetsky 2013).
2.5.3 Embedding Privacy at the Design Stage
When privacy commitments and data protection controls are embedded into tech-
nologies, operations, and information architectures in a holistic, integrative manner,
innovation and creativity are often by-products (Cavoukian et al. 2014a,b). By
holistic, we mean that broader contexts should always be considered for a proper
assessment of privacy risks and remedies. An integrative approach takes into
consideration all stakeholder interests as part of the development dialogue. Sometimes,
having to revisit alternatives because existing solutions are unacceptable from
a privacy perspective spurs innovative and creative thinking. Embedding privacy
and data protection requires taking a systematic, principled approach—one that
not only relies on accepted standards and process frameworks, but that can stand
up to external reviews and audits. All of the 7 Foundational Principles should be
applied with equal rigour, at every step in design and operation. By doing so, the
privacy impacts of the resulting technology, process, or information architecture,
and their uses, should be demonstrably minimized, and not easily degraded through
use, misconfiguration, or error. To minimize concerns of untoward data usage,
organizations should disclose the logic underlying their decision-making processes
to the extent possible without compromising their trade secrets or intellectual
property rights.
The concept of “user-centricity” may evoke contradictory meanings in networked
or online environments. Through a privacy lens, it contemplates a right of control by
an individual over his or her personal information when online, usually with the help
of technology. For most system designers, it describes a system built with individual
users in mind that may perhaps incorporate users’ privacy interests, risks and needs.
The first may be considered libertarian (informational self-determination), the other,
paternalistic. Privacy by Design embraces both. It acknowledges that technologies,
processes and infrastructures must be designed not just for individual users, but
also structured by them. Users are rarely, if ever, involved in every design decision
44 A. Cavoukian and M. Chibba
or transaction involving their personal information, but they are nonetheless in an
unprecedented position today to exercise a measure of meaningful control over
those designs and transactions, as well as the disposition and use of their personal
information by others.
User interface designers know that the human-computer interface can often make
or break an application. Function (substance) is important, but the way in which
that function is delivered is equally important. This type of design embeds an
effective user privacy experience. As a quid pro quo for looser data collection
and minimization restrictions, organizations should be prepared to share the
wealth created by individuals’ data with those individuals. This means providing
individuals with access to their data in a “usable” format and allowing them to take
advantage of third party applications to analyze their own data and draw useful
conclusions (e.g., consume less protein, go on a skiing vacation, invest in bonds)
(Tene and Polonetsky 2013).
2.5.4 Aspire for Positive-Sum Without Diminishing Functionality
In Big Data scenarios, networks are more complex and sophisticated, thereby
undermining the dominant “client-server” transaction model, because individuals
are often far removed from the client side of the data processing equation. How
could privacy be assured when the collection, disclosure, and use of personal
information might not even involve the individual at all? Inevitably, a zero-sum
paradigm prevails where more of one good (e.g., public security, fraud detection,
operational control) cancels out another good (individual privacy, freedom). The
authors challenge the premise that privacy and data protection necessarily have to
be ceded in order to gain public, personal, or information security benefits from
Big Data. The opposite of zero-sum is positive-sum, where multiple goals may be
achieved concurrently.
Many security technologies and information systems could be designed (or
redesigned) to be effective while minimizing or even eliminating their privacy-
invasive features. This is the positive-sum paradigm. We need only look to the work
of researchers in the area of privacy preserving data mining (Lindell and Pinkas
2002). In some cases, however, this requires broadening the scope of application
from only information communication technologies (ICTs) to include the “soft”
legal, policy, procedural, and other organizational controls and operating contexts
in which privacy might be embedded.
De-identification tools and techniques are gaining popularity and there are
several commercially available products. Nonetheless, research into
de-identification continues (El Emam 2013a, b). Some emerging research-level
technologies hold much promise for enabling privacy and utility of Big Data analy-
sis to co-exist. Two of these technologies are differential privacy and synthetic data.
Differential privacy is an approach that injects random noise into the results
of dataset queries to provide a mathematical guarantee that the presence of any
one individual in the dataset will be masked—thus protecting the privacy of each
individual in the dataset. Typical implementations of differential privacy work by
creating a query interface or “curator” that stands between the dataset’s personal
information and those wanting access to it. An algorithm evaluates the privacy risks
of the queries. The software determines the level of “noise” to introduce into the
analysis results before releasing it. The distortion that is introduced is usually small
enough that it does not affect the quality of the answers in any meaningful way—yet
it is sufficient to protect the identities of the individuals in the dataset (Dwork 2014).
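A minimal sketch of this "curator" pattern follows, assuming the simplest case of a counting query. A count has sensitivity 1 (adding or removing one individual changes it by at most 1), so Laplace noise with scale 1/ε yields ε-differential privacy; the function and variable names are illustrative.

```python
# Minimal sketch of the curator pattern: a count query answered through
# the Laplace mechanism. Counting queries have sensitivity 1, so noise
# drawn from Laplace(0, 1/epsilon) gives epsilon-differential privacy.

import math
import random

def laplace_noise(scale: float) -> float:
    """Draw one Laplace(0, scale) variate by inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon=0.5):
    """The curator releases only a noisy count, never the raw records."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [34, 41, 29, 67, 52, 45, 71, 38]
noisy = private_count(ages, lambda a: a >= 65, epsilon=0.5)
# The true count is 2; the released value is 2 plus modest noise.
```

Smaller ε means more noise and stronger privacy; the analyst never touches the underlying records, exactly as described above.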
At an administrative level, researchers are not given access to the dataset to
analyze themselves when applying differential privacy. Not surprisingly, this limits
the kinds of questions researchers can ask. Given this limitation, some researchers
are exploring the potential of creating “synthetic” datasets for researchers’ use. As
long as the number of individuals in the dataset is sufficiently large in comparison
to the number of fields or dimensions, it is possible to generate a synthetic dataset
comprised entirely of “fictional” individuals or altered identities that retain the
statistical properties of the original dataset—while delivering differential privacy’s
mathematical “noise” guarantee (Blum et al. 2008). While it is possible to generate
such synthetic datasets, the computational effort required to do so is usually
extremely high. However, there have been important developments into making the
generation of differentially private synthetic datasets more efficient and research
continues to show progress (Thaler et al. 2010).
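The synthetic-data idea can be illustrated with a deliberately simplified sketch: fit a simple statistical model to the real records and release only draws from it. The independent per-field Gaussian below is an assumption made purely for illustration; it is not differentially private by itself, and real generators use much richer models plus a calibrated noise step.

```python
# Toy sketch of synthetic data: "fictional" rows sampled from per-field
# Gaussians fitted to the real records. Illustration only -- this does
# NOT by itself carry differential privacy's mathematical guarantee.

import random
import statistics

def synthesize(rows, n_out):
    """Return n_out fictional rows with similar per-field statistics."""
    cols = list(zip(*rows))  # transpose: one tuple of values per field
    params = [(statistics.mean(c), statistics.stdev(c)) for c in cols]
    return [tuple(random.gauss(mu, sd) for mu, sd in params)
            for _ in range(n_out)]

# Hypothetical records: (age, systolic blood pressure) for five patients.
real = [(34, 120.0), (41, 135.5), (29, 110.2), (67, 150.9), (52, 141.3)]
fake = synthesize(real, 3)  # three fictional individuals
```

The fictional rows track the marginal statistics of the original data, which is the property the text highlights; preserving joint structure and adding the privacy noise is where the computational cost discussed above arises.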
2.6 Conclusion
There are privacy and security risks and challenges that organizations will face
in the pursuit of Big Data nirvana. While a significant portion of this vast digital
universe is not of a personal nature, there are inherent privacy and security risks
that cannot be overlooked. Make no mistake, organizations must seriously consider
not just the use of Big Data but also the implications of a failure to fully realize
the potential of Big Data. Big Data and Big Data analysis promise new insights
and benefits such as medical/scientific discoveries, new and innovative economic
drivers, and predictive solutions to otherwise unknown, complex societal problems.
Misuses and abuses of personal data diminish informational self-determination,
cause harms, and erode the confidence and trust needed for innovative economic
growth and prosperity. By examining success stories and approaches such as Privacy
by Design, the takeaway should be practical strategies to address the question of
‘How do we achieve the value of Big Data and still respect consumer privacy?’
Above all, Privacy by Design requires architects and operators to keep the interests
of the individual uppermost by offering such measures as strong privacy defaults,
appropriate notice, and empowering user-friendly options. Keep it user-centric!
References

Article 29 Data Protection Working Party (2013). Opinion 03/2013 on purpose limitation. http://ec.europa.eu/justice/data-protection/index_en.htm. Accessed 2 August 2016.
Blum, A., Ligett, K., Roth, A. (2008). A learning theory approach to non-interactive database
privacy. In Proceedings of the 40th ACM SIGACT Symposium on Theory of Computing
(pp. 609–618).
Cameron, K. (2013). Afterword. In M. Hildebrandt et al. (Eds.), Digital Enlightenment Yearbook
2013. Amsterdam: IOS Press.
Cavoukian, A. (2009). Privacy and government 2.0: the implications of an open world. http://www.ontla.on.ca/library/repository/mon/23006/293152.pdf. Accessed 22 November 2016.
Cavoukian, A. (2011). Privacy by Design: The 7 Foundational Principles. Ontario: IPC.
Cavoukian, A. (2013a). A Primer on Metadata: Separating Fact from Fiction. Ontario: IPC.
Cavoukian, A. (2013b). Privacy by design: leadership, methods, and results. In S. Gutwirth, R. Leenes, P. de Hert, & Y. Poullet (Eds.), European data protection: Coming of age (pp. 175–202). Dordrecht: Springer.
Cavoukian, A., & Cameron, K. (2011). Wi-Fi Positioning Systems: Beware of Unintended Consequences: Issues Involving Unforeseen Uses of Pre-Existing Architecture. Ontario: IPC.
Cavoukian, A., & El Emam, K. (2014). De-identification Protocols: Essential for Protecting Privacy. Ontario: IPC.
Cavoukian, A., & Jonas, J. (2012). Privacy by Design in the Age of Big Data. Ontario: IPC.
Cavoukian, A., Bansal, N., & Koudas, N. (2014a). Building Privacy into Mobile Location Analytics (MLA) through Privacy by Design. Ontario: IPC.
Cavoukian, A., Dix, A., & El Emam, K. (2014b). The Unintended Consequences of Privacy Paternalism. Ontario: IPC.
Clarke, R. (2000). Beyond OECD guidelines: privacy protection for the 21st century. Xamax Consultancy Pty Ltd. http://www.rogerclarke.com/DV/PP21C.html. Accessed 22 November 2016.
CNW (2010). Landmark resolution passed to preserve the future of privacy. Press Release. Toronto,
ON, Canada. http://www.newswire.ca/news-releases/landmark-resolution-passed-to-preserve-
the-future-of-privacy-546018632.html. Accessed 22 November 2016.
Cukier, K., & Mayer-Schonberger, V. (2013). The dictatorship of data. MIT Technology Review.
https://www.technologyreview.com/s/514591/the-dictatorship-of-data/. Accessed 22 Novem-
ber 2016.
Damiani, M. L. (2013). Privacy enhancing techniques for the protection of mobility patterns in LBS: research issues and trends. In S. Gutwirth, R. Leenes, P. de Hert, & Y. Poullet (Eds.), European data protection: Coming of age (pp. 223–238). Dordrecht: Springer.
Department of Commerce (US DOC) (2016). EU-U.S. privacy shield fact sheet. Office of Public Affairs, US Department of Commerce. https://www.commerce.gov/news/fact-sheets/2016/02/eu-us-privacy-shield. Accessed 22 November 2016.
Dwork, C. (2014). Differential privacy: a cryptographic approach to private data analysis. In J.
Lane, V. Stodden, S. Bender, & H. Nissenbaum (Eds.), Privacy, big data, and the public good:
Frameworks for engagement. New York: Cambridge University Press.
El Emam, K. (2013a). Benefiting from big data while protecting privacy. In K. El Emam (Ed.), Risky business: Sharing health data while protecting privacy. Bloomington, IN: Trafford Publishing.
El Emam, K. (2013b). Who's afraid of big data? In K. El Emam (Ed.), Risky business: Sharing health data while protecting privacy. Bloomington, IN: Trafford Publishing.
El Emam, K., Buckeridge, D., Tamblyn, R., Neisa, A., Jonker, E., & Verma, A. (2011). The re-identification risk of Canadians from longitudinal demographics. BMC Medical Informatics and Decision Making, 11:46. http://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/1472-6947-11-46. Accessed 22 November 2016.
EPIC (n.d.). Website: https://epic.org/privacy/consumer/code_fair_info.html. Accessed 22 Novem-
ber 2016.
EU Commission (2012). Fact sheet on the right to be forgotten. http://ec.europa.eu/justice/data-protection/files/factsheets/factsheet_data_protection_en.pdf. Accessed 22 November 2016.
EU Commission (2015). Fact sheet—questions and answers—data protection reform. Brussels. http://europa.eu/rapid/press-release_MEMO-15-6385_en.htm. Accessed 4 November 2016.
EU Commission (2016). The EU data protection reform and big data factsheet. Accessed 22 November 2016.
Fogarty, D., & Bell, P. C. (2014). Should you outsource analytics? MIT Sloan Management Review,
55(2), Winter.
FTC (2012). Protecting consumer privacy in an era of rapid change: Recommendations for businesses and policymakers. https://www.ftc.gov/sites/default/files/documents/recommendations/120326privacyreport.pdf. Accessed August 2016.
FTC (2016). Big data: A tool for inclusion or exclusion? Understanding the issues. understanding-issues/160106big-data-rpt.pdf. Accessed 23 November 2016.
Gürses, S. F., Troncoso, C., & Diaz, C. (2011). Engineering privacy by design. Computers, Privacy & Data Protection. http://www.cosic.esat.kuleuven.be/publications/article-1542.pdf. Accessed 19 November 2016.
Harris, M. (2015). Recap of Covington's privacy by design workshop. Inside Privacy: Updates on Developments in Data Privacy and Cybersecurity. Covington & Burling LLP, U.S. https://www.insideprivacy.com/united-states/recap-of-covingtons-privacy-by-design-workshop/. Accessed 19 November 2016.
HHS (2012). Guidance regarding methods for de-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. http://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html. Accessed 2 August 2016.
Information Commissioner's Office (ICO) (2013). Privacy in mobile apps: Guide for app developers. guidance.pdf. Accessed 22 November 2016.
International Working Group on Data Protection in Telecommunications (IWGDPT) (2004). Common position on privacy and location information in mobile communications services. adopted-by-the-working-group. Accessed 22 November 2016.
International Working Group on Data Protection in Telecommunications (IWGDPT) (2014). Working Paper on Big Data and Privacy: Privacy principles under pressure in the age of Big Data analytics. 55th Meeting. https://datenschutz-berlin.de/content/europa-international/papers-and-common-positions-adopted-by-the-working-group. Accessed 22 November 2016.
Lane, J., et al. (2014). Privacy, big data and the public good: frameworks for engagement.
Cambridge: Cambridge University Press.
Lindell, Y., & Pinkas, B. (2002). Privacy preserving data mining. Journal of Cryptology, 15,
177–206. International Association for Cryptologic Research.
Lomas, N. (2015). Europe's top court strikes down Safe Harbor data-transfer agreement with U.S. TechCrunch. https://techcrunch.com/2015/10/06/europes-top-court-strikes-down-safe-harbor-data-transfer-agreement-with-u-s/. Accessed 22 November 2016.
Mayer, J., Mutchler, P., & Mitchell, J. C. (2016). Evaluating the privacy properties of telephone metadata. Proceedings of the National Academy of Sciences, USA, 113(20), 5536–5541.
NIST (2010). Guide to protecting the confidentiality of personally identifiable information (PII). NIST Special Publication 800-122. Gaithersburg, MD: Computer Security Division.
NIST (2015). De-identification of personal information. NISTIR 8053. http://dx.doi.org/10.6028/NIST.IR.8053. Accessed 19 November 2016.
Official Journal of the European Union (2016). Regulation (EU) 2016/679 of the European Parliament and of the Council. http://ec.europa.eu/justice/data-protection/reform/files/regulation_oj_en.pdf. Accessed 19 November 2016.
Quattrociocchi, W., Scala, A., & Sunstein, C. R. (2016). Echo chambers on Facebook. Preliminary draft, not yet published. Available at: http://ssrn.com/abstract=2795110. Accessed 19 November 2016.
Ritter, D. (2014). When to act on a correlation, and when not to. Harvard Business Review. https://hbr.org/2014/03/when-to-act-on-a-correlation-and-when-not-to. Accessed 19 November 2016.
Singer, N. (2011). The trouble with the echo chamber online. New York Times online. http://www.nytimes.com/2011/05/29/technology/29stream.html?_r=0. Accessed 19 November 2016.
Solove, D. J. (2007). 'I've got nothing to hide' and other misunderstandings of privacy. San Diego Law Review, 44, 745.
Solove, D. (2014). Why did inBloom die? A hard lesson about education privacy. Privacy + Security Blog. TeachPrivacy. https://www.teachprivacy.com/inbloom-. Accessed 4 August 2016.
Sweeney, L. (2013). Discrimination in online ad delivery. http://dataprivacylab.org/projects/onlineads/1071-1.pdf. Accessed 22 November 2016.
Tene, O., & Polonetsky, J. (2013). Big data for all: Privacy and user control in the age of analytics. Northwestern Journal of Technology and Intellectual Property, 11(5), 239–272.
Thaler, J., Ullman, J., & Vadhan, S. (2010). PCPs and the hardness of generating synthetic data.
Electronic Colloquium on Computational Complexity, Technical Report, TR10–TR07.
TRUSTe/NCSA (2016). Consumer privacy infographic—US Edition. https://www.truste.com/resources/privacy-research/ncsa-consumer-privacy-index-us/. Accessed 4 November 2016.
Turow, J., Feldman, L., & Meltzer, K. (2015). Open to exploitation: American shoppers online and offline. A report from the Annenberg Public Policy Center of the University of Pennsylvania. http://www.annenbergpublicpolicycenter.org/open-to-exploitation-american-shoppers-online-and-offline/. Accessed 22 November 2016.
United Nations General Assembly (2016). Resolution adopted by the General Assembly. The right to privacy in the digital age (68/167). http://www.un.org/ga/search/view_doc.asp?symbol=A/RES/68/167. Accessed 4 November 2016.
Zhang, Y., Chen, Q., & Zhong, S. (2016). Privacy-preserving data aggregation in mobile phone sensing. IEEE Transactions on Information Forensics and Security, 11, 980–992.
Chapter 3
Privacy Preserving Federated Big Data Analysis
Wenrui Dai, Shuang Wang, Hongkai Xiong, and Xiaoqian Jiang
3.1 Introduction
With the introduction of electronic health records (EHRs), massive patient data have
become involved in biomedical research to study the impact of various factors on
disease and mortality. Large clinical data networks have been developed to facilitate
analysis and improve treatment of diseases by collecting healthcare data from
a variety of organizations, including healthcare providers, government agencies,
research institutions and insurance companies. The National Patient-Centered Clin-
ical Research Network, PCORnet (n.d.), facilitates clinical effectiveness research
to provide decision support for prevention, diagnosis and treatment with the data
gathered nationwide. PopMedNet (n.d.) enables the distributed analyses of EHR
held by different organizations without requiring a central repository to collect
data. HMORNnet (Brown et al. 2012) combines PopMedNet platform to provide a
shared infrastructure for distributed querying to allow data sharing between multiple
HMO Research Network projects. Integrating PopMedNet, ESPnet achieves disease
W. Dai ()
Department of Biomedical Informatics, University of California San Diego, La Jolla,
CA 92093, USA
Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
e-mail: wed004@ucsd.edu
S. Wang • X. Jiang
Department of Biomedical Informatics, University of California San Diego, La Jolla,
CA 92093, USA
e-mail: shw070@ucsd.edu;x1jiang@ucsd.edu
H. Xiong
Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
e-mail: xionghongkai@sjtu.edu.cn
© Springer International Publishing AG 2018
S. Srinivasan (ed.), Guide to Big Data Applications, Studies in Big Data 26,
DOI 10.1007/978-3-319-53817-4_3
surveillance by collecting and analyzing EHRs owned by different organizations in
a distributed fashion.
Although data sharing can benefit both biomedical discovery and public health,
it would also pose risks for disclosure of sensitive information and consequent
breach of individual privacy. Leakage of demographic, diagnostic, phenotypic and
genotypic information would lead to unexpected implications like discrimination by
employers and health insurance companies. To protect individually identifiable
health information, the Health Insurance Portability and Accountability Act (HIPAA)
(n.d.) was enacted in the United States, under which the security and privacy of
protected health information (PHI) are guaranteed by the standards and regulations
specified in the HIPAA Privacy Rule. It defines two methods, Expert Determination and
Safe Harbor (Lafky 2010), to meet the de-identification standard. In practice,
the Safe Harbor method is widely adopted, where specific information should be
removed and suppressed according to a predefined checklist. However, the de-identification
standards defined by the HIPAA Privacy Rule do not provide adequate
privacy protection for healthcare data, as argued by McGraw (2008). Taking
advantage of publicly available background information about an individual, it is
possible to infer sensitive information like predisposition to disease and surnames
from de-identified data. Homer et al. (2008) utilized aggregated allele frequencies
in genome-wide association studies (GWAS) to re-identify individual patients in
a case group based on the reference population from the International HapMap
Project. Wang et al. (2009) extended Homer’s attack with two models to identify
patients from a smaller subset of published statistics or under limited precision
and availability of statistics. Sweeney et al. (2013) showed that most (84–97%)
patients could be exactly identified by linking their profiles in the Personal Genome
Project (PGP) with publicly available records like voter lists. Gymrek et al. (2013)
inferred the surnames from personal genome data sets by profiling Y-chromosome
haplotypes based on recreational genetic genealogy databases, which are publicly
accessible online. For the Healthcare Cost and Utilization Project (HCUP), Vaidya et al.
(2013) demonstrated the vulnerability of its querying system by mounting query
inference attacks that derive patient-level information from multiple correlated queries.
Privacy concerns have presented a challenge to efficient collaborative prediction
and analysis for biomedical research that needs data sharing. Patients would be
unwilling to provide their data to research projects or participate in treatments under
insufficient privacy protection. Data custodians, who take responsibility for the
security and confidentiality of their data, might degrade data utility to lower the
potential privacy risk. Due to institutional policies and legislation, it is not
viable to explicitly transfer patient-level data to a centralized repository or share
them among various institutions in many scenarios. For example, without specific
institutional approval, the U.S. Department of Veterans Affairs requires all patient
data to remain in its server. Naive procedure for exchanging patient-level data
would also be restricted in international cross-institutional collaboration. The Data
Protection Act (1998) in the UK, in line with the EU Data Protection Directive,
prohibits transferring clinical data outside the European Economic Area, unless the
protection for data security is sufficiently guaranteed. Therefore, it is necessary to
develop models and algorithms for cross-institutional collaboration with sufficient
protection of patient privacy and full compliance with institutional policies.
Federated data analysis has been developed as an alternative for cross-
institutional collaboration, which proposes to exchange aggregated statistics instead
of patient-level data. It enables a variety of privacy-preserving distributed algorithms
that facilitate computation and analysis with a guarantee of prediction accuracy
and protection of privacy and security of patient-level data. These algorithms are
commonly designed to perform regression, classification and evaluation over data
with different types of partition, i.e., horizontally partitioned data (Kantarcioglu
2008) or vertically partitioned data (Vaidya 2008). Horizontally
partitioned data are composed of data from different patients with the same clinical
variables. Thus, multiple institutions share the same types of attributes for all
the patients in the federated database. Horizontally partitioned data would be
suitable for collaborative analysis and computation over patients from organizations
in different geographical areas, especially for studies of diseases that require a
large number of examples (patient-level data). On the other hand, for vertically
partitioned data, each institution owns a portion of clinical variables for the same
patients. Attributes from all the institutions are collected and aligned as a federated
database for computation and analysis. In fact, vertically partitioned data would
facilitate collaboration among organizations owning different types of patient-
level data. For example, the PCORnet clinical data research network (CDRN),
pSCANNER (Ohno-Machado et al. 2014), allows distributed analysis over data of
31 million patients that are vertically partitioned across Centers for Medicare and
Medicaid Services, Department of Veteran Affairs, insurance companies, and health
systems. The studies of diseases can be jointly performed based on the diagnostic
information from medical centers, demographic and financial data from insurance
companies and genomic data from laboratories. Figure 3.1 provides an illustrative
example for horizontally and vertically partitioned data, respectively.
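The two partition types can be made concrete with a toy example; the patient IDs, covariate names, and site names below are all hypothetical.

```python
# Toy illustration of the two partition types in Fig. 3.1. Each
# "institution" is just a dict of records here; names are hypothetical.

patients = ["p1", "p2", "p3", "p4"]

# Full (virtual) federated table: one row per patient.
table = {
    "p1": {"age": 54, "bmi": 27.1, "ldl": 130, "a1c": 6.1},
    "p2": {"age": 61, "bmi": 31.4, "ldl": 155, "a1c": 7.2},
    "p3": {"age": 47, "bmi": 24.9, "ldl": 118, "a1c": 5.6},
    "p4": {"age": 70, "bmi": 29.0, "ldl": 162, "a1c": 7.8},
}

# Horizontal partition: each site holds different patients, all covariates.
site_A = {p: table[p] for p in ["p1", "p2"]}
site_B = {p: table[p] for p in ["p3", "p4"]}

# Vertical partition: each site holds all patients, different covariates.
clinic = {p: {k: v for k, v in table[p].items() if k in ("age", "bmi")}
          for p in patients}
lab    = {p: {k: v for k, v in table[p].items() if k in ("ldl", "a1c")}
          for p in patients}
```

Horizontally, the union of the sites' patients recovers the table; vertically, merging the sites' covariates for each patient does. Federated algorithms must produce the analysis of the full table without ever materializing it at one location.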
In this chapter, we review the privacy-preserving federated data analysis algo-
rithms for large-scale distributed data, especially biomedical data. To collaborate
on distributed data analysis in a privacy-preserving fashion, institutions might not
be able to explicitly share their patient-level data. Privacy-preserving federated
data analysis algorithms aim to establish global models for analysis and prediction
based on non-sensitive local statistics, e.g., intermediary results for Hessian matrix
and kernel matrix, instead of explicitly transferring sensitive patient-level data to
a central repository. Over the past few decades, a series of federated analysis
algorithms have been developed for regression, classification and evaluation of
distributed data with a protection of data privacy and security. Guided by the
principle of sharing models without sharing data, federated data modeling techniques
have been proposed to securely derive global model parameters and perform
statistical tests over horizontally and vertically partitioned data. In comparison
to centralized realizations, federated modeling techniques achieve equivalent
accuracy in model parameter estimation and statistical tests with no exchange of
patient-level data. Server/client and decentralized architectures have been developed
to realize federated data analysis in a distributed and privacy-preserving fashion.
Fig. 3.1 Illustrative examples for horizontally and vertically partitioned data in federated data
analysis. N records with M covariates are distributed across K institutions. (a) Horizontally
partitioned data: the K institutions have different patients sharing the same type of covariates.
(b) Vertically partitioned data: covariates from the same patients are distributed across all the K
institutions.
For regression tasks, distributed optimization was efficiently achieved using the
Newton-Raphson method. To further improve distributed optimization in federated
models, alternating direction method of multipliers (ADMM) was integrated to
formulate decomposable minimization problems with additional auxiliary variables.
Inheriting the convergence properties of the method of Lagrangian multipliers, it
is robust for a variety of distributed analyses under horizontal and vertical data
partitioning. Recognizing that communication between the server and clients was
not protected, secure multiparty computation (SMC) protocols were adopted for
stronger data security and privacy. Secure protocols were widely considered for
distributed analysis like regression, classification and evaluation. The intermediary
results from multiple institutions were aggregated with secure summation, product
and reordering to support the server/client or decentralized architecture. To handle
real-world network conditions, we discuss asynchronous optimization under the
server/client and decentralized architectures. To support privacy-preserving data
sharing and analysis, this chapter surveys the relevant literature, presenting
state-of-the-art algorithms and applications and highlighting promising improvements
and extensions of current trends.
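The secure-summation primitive mentioned above can be sketched with pairwise cancelling masks. This toy version over plain Python integers is an assumption for illustration; real SMC protocols operate in a finite field and handle institutions that drop out mid-protocol.

```python
# Sketch of secure summation: each pair of institutions (i, j) agrees on
# a random mask r; site i adds r and site j subtracts it, so all masks
# cancel in the total while each masked share reveals nothing alone.
# Toy version over plain integers; real protocols use a finite field.

import random

def masked_shares(values):
    k = len(values)
    shares = list(values)
    for i in range(k):
        for j in range(i + 1, k):
            r = random.randrange(1_000_000)
            shares[i] += r   # site i adds the pairwise mask
            shares[j] -= r   # site j subtracts the same mask
    return shares            # each share looks random in isolation

local_stats = [17, 42, 8]          # e.g., per-institution case counts
shares = masked_shares(local_stats)
print(sum(shares))                  # 67: only the total is revealed
```

The server learns the aggregate (67) but, given any single share, cannot recover the contributing institution's local statistic, which is the property the federated algorithms above rely on.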
The rest of this chapter is organized as follows. Section 3.2 overviews the
architecture and optimization for federated modeling analysis over horizontally and
vertically partitioned data. In Sect. 3.3, we review the applications in regression
and classification based on the Newton-Raphson method and ADMM framework.
Section 3.4 integrates secure multiparty computation protocols with federated
data analysis for protection of intermediary results in distributed analysis and
computation. In Sect. 3.5, we present the asynchronous optimization for the general
fixed-point problem and specific coordinate gradient descent and ADMM-based
methods. Finally, Sect. 3.6 concludes with a discussion.
3.2 Federated Data Analysis: Architecture and Optimization
In this section, we overview the architectures and optimization methods for feder-
ated data analysis. Server/client and decentralized architectures are well established
in privacy-preserving analysis. Under these architectures, the Newton-Raphson
method and alternating direction method of multipliers (ADMM) framework are
leveraged for distributed computation.
3.2.1 Architecture

Server/Client Architecture
In the federated models, the server/client architecture has been established to
estimate global model parameters and perform statistical tests over horizontally and
vertically partitioned data, as shown in Fig. 3.2. Under this architecture, federated
data modeling shares models rather than patient-level information. The server
iteratively optimizes the global model parameters based on aggregated intermediary
results that are decomposable over the clients. Each client can utilize its local data
to separately calculate corresponding intermediary results. Subsequently, instead of
sharing the sensitive patient-level data, these intermediary results are exchanged for
secure computation and analysis. Taking maximum likelihood estimation (MLE)
of binary logistic regression as an example, each institution calculates and exchanges
54 W. Dai et al.
Fig. 3.2 Server/client architecture for federated data analysis. Each institution only exchanges
intermediary results for the estimation of global model parameters and statistical tests. Each
institution does not communicate with the others to prevent unexpected information leakage
the partial Hessian matrix derived from its local records for horizontally partitioned
data and partial kernel matrix of its local attributes for vertically partitioned data.
Thus, federated models exchange only aggregated intermediary results for
parameter estimation, rather than collecting raw data in a central repository.
Moreover, clients will not collude to infer the raw data, as each client separately
performs the computation. These facts imply that distributed optimization can be
performed in a privacy-preserving fashion, as the raw data generally cannot be recovered
from the aggregated intermediary results. It should be noted that the accuracy
of parameter estimation could be guaranteed in federated models, as aggregated
intermediary results do not lose any information in comparison to the centralized
methods. Furthermore, the security of federated models can be further improved
by integrating secure protocols and encryption methods to protect the intermediary
results exchanged between the server and clients.
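As an illustration of why exchanging such aggregates loses nothing, the following NumPy sketch (synthetic data and illustrative names, not any published implementation) checks that the summed per-institution gradients of the binary logistic log-likelihood equal the gradient computed on the pooled data:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical setup: 3 institutions, horizontally partitioned records.
institution_data = [(rng.normal(size=(n, 4)), rng.integers(0, 2, size=n))
                    for n in (30, 50, 20)]
beta = np.zeros(4)  # current global model parameters, broadcast by the server

def local_gradient(A_k, y_k, beta):
    # Each client computes its share of the log-likelihood gradient from
    # local data only; raw records never leave the institution.
    return A_k.T @ (y_k - sigmoid(A_k @ beta))

# The server aggregates only the intermediary results.
aggregated = sum(local_gradient(A_k, y_k, beta)
                 for A_k, y_k in institution_data)

# Centralized computation on the pooled data yields the identical gradient.
A = np.vstack([A_k for A_k, _ in institution_data])
y = np.concatenate([y_k for _, y_k in institution_data])
centralized = A.T @ (y - sigmoid(A @ beta))

print(np.allclose(aggregated, centralized))
```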
For logistic regression and multinomial regression models, federated models can
also support distributed statistical tests over horizontally partitioned data, including
goodness-of-fit test and AUC score estimation (Chambless and Diao 2006). Besides
model parameters, variance-covariance matrix could be similarly obtained by aggre-
gating the decomposable intermediary results. Using global model parameters and
variance-covariance matrix, federated models were able to estimate the statistics of
logistic and multinomial regression, including confidence intervals (CIs), standard
error, Z-test statistics and p-values. Furthermore, goodness-of-fit test and AUC score
estimation can be achieved in a distributed manner. The Hosmer and Lemeshow
(H-L) test (Hosmer et al. 2013) is considered to check model fitness, where each
institution only shares its number of records with positive patient outcomes per
Fig. 3.3 Decentralized architecture for federated data analysis. To estimate global model param-
eters and perform statistical tests, each institution exchanges intermediary results with its
neighboring institutions and updates its local model with received aggregated results
decile. Thus, patient-level information and estimated outcomes will not be exchanged,
for privacy-preserving considerations. For computation of the AUC score, raw data and
estimated outcomes of patients can be protected with peer-to-peer communication.

Decentralized Architecture
Figure 3.3 illustrates the decentralized architecture for federated analysis over
horizontally and vertically partitioned data. Contrary to server/client architec-
tures, decentralized architectures do not require a central node (server) to collect
aggregated intermediary results from all the institutions and make global model
parameter estimation and statistical tests. Each institution only communicates with
its neighbors to exchange messages, e.g. institutions with linked health records.
To prevent leakage of patient-level information, institutions would exchange inter-
mediary results rather than raw data. At each iteration, each institution derives its
intermediary results from local data and the aggregated results from its neighbor-
ing institutions. Taking global consensus optimization under ADMM framework
(Boyd et al. 2011) for example, local model parameters are exchanged among
neighboring institutions for global consensus in applications like sparse linear
regression (Mateos et al. 2010), principal component analysis (PCA) (Schizas and
Aduroja 2015) and support vector machine (SVM) (Forero and Giannakis 2010).
Under such architecture, distributed optimization can be performed in a privacy-
preserving manner, as patient-level information would never be exchanged. It is
worth mentioning that communication cost would be reduced in the decentralized
architecture, as messages are only exchanged among neighboring institutions.
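A minimal sketch of this neighbor-only exchange, assuming a ring of five institutions and plain iterative averaging with doubly stochastic weights (a simplification, not the ADMM-based consensus updates used by the cited methods):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 5 institutions on a ring, each holding a scalar
# local estimate; repeated averaging with neighbors reaches consensus.
x = rng.normal(size=5)
target = x.mean()          # the value all nodes should agree on

n = len(x)
W = np.zeros((n, n))       # doubly stochastic mixing weights (ring topology)
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

for _ in range(300):       # each step exchanges values with neighbors only
    x = W @ x

print(np.allclose(x, target))
```

No node ever sees another node's raw data, only the running local averages of its two neighbors, yet every node converges to the global mean.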
3.2.2 Distributed Optimization

The Newton-Raphson Method
For model parameter estimation, the Newton-Raphson method can be extended
to make distributed optimization for log-likelihood functions over multiple insti-
tutions. It is a powerful technique for finding numerical solutions to nonlinear
algebraic equations with successively linear approximations. Given a twice
differentiable objective function f(x), the Newton-Raphson method iteratively constructs a
sequence of positions converging towards the stationary point x* with gradient-like
optimization. For vector inputs x, the gradient ∇f and Hessian matrix H(f) are computed
from the first and second partial derivatives of f, respectively. At the t-th step, x is
updated as x^(t+1) = x^(t) − H(f(x^(t)))^(−1) ∇f(x^(t)) to maximize the log-likelihood function f.
In federated data modeling, to enable distributed optimization, the first and
second partial derivatives are required to be decomposable over multiple institu-
tions. Thus, the gradient and Hessian matrix can be derived from the aggregated
intermediary results separately obtained from all the institutions. The intermediary
results would vary for different tasks and different data partitions. For example,
each institution holds a portion of records A_k for horizontally partitioned data.
Consequently, the intermediary results exchanged in binary logistic regression
are A_k^T Λ_k A_k with Λ_k = diag(π(A_k x_k)(1 − π(A_k x_k))), where π is the logistic
function related to the logit, while they tend to depend on the set of records at risk
for the Cox regression model. For vertically distributed logistic regression, the
Legendre transform is required for distributed dual optimization, where the kernel
matrix A_k A_k^T is separately calculated based on the portion of attributes A_k held
by the k-th institution.
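The decomposable quantities above can be wired into a complete distributed Newton-Raphson loop. The following NumPy sketch (synthetic data and hypothetical names, not the GLORE code) has a server sum the partial gradients and partial Hessians A_k^T Λ_k A_k, and verifies that the federated estimate matches the centralized one:

```python
import numpy as np

rng = np.random.default_rng(2)

def pi(A, beta):
    # logistic function (inverse logit)
    return 1.0 / (1.0 + np.exp(-A @ beta))

# Hypothetical synthetic data: 3 institutions hold disjoint sets of records.
true_beta = np.array([1.0, -2.0, 0.5])
parts = []
for n in (40, 60, 50):
    A_k = rng.normal(size=(n, 3))
    y_k = (rng.random(n) < pi(A_k, true_beta)).astype(float)
    parts.append((A_k, y_k))

# Federated Newton-Raphson: only partial gradients and partial Hessians
# A_k^T Lambda_k A_k are sent to the server, which sums them and steps.
beta = np.zeros(3)
for _ in range(25):
    grad = sum(A.T @ (y - pi(A, beta)) for A, y in parts)
    hess = sum((A * (pi(A, beta) * (1 - pi(A, beta)))[:, None]).T @ A
               for A, _ in parts)
    beta = beta + np.linalg.solve(hess, grad)

# Centralized Newton-Raphson on the pooled data for comparison.
A = np.vstack([A for A, _ in parts])
y = np.concatenate([y for _, y in parts])
beta_c = np.zeros(3)
for _ in range(25):
    w = pi(A, beta_c) * (1 - pi(A, beta_c))
    beta_c = beta_c + np.linalg.solve((A * w[:, None]).T @ A,
                                      A.T @ (y - pi(A, beta_c)))

print(np.allclose(beta, beta_c))
```

Because the summed partial statistics equal the pooled statistics at every iteration, the two parameter trajectories coincide.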
The Newton-Raphson method can achieve faster convergence towards a local
optimum in comparison to gradient descent, when f is a valid objective function.
Meanwhile, distributed optimization with the Newton-Raphson method can
achieve accuracy in model parameter estimation equivalent to its centralized
realization. However, it requires separable first and second partial derivatives
over multiple institutions, which makes the Newton-Raphson method restrictive
in some distributed optimization scenarios, e.g. distributed Cox regression. In the next
subsection, we introduce the alternating direction method of multipliers (ADMM)
as a ubiquitous distributed optimization framework.

Alternating Direction Method of Multipliers
Alternating direction method of multipliers (ADMM) (Boyd et al. 2011) is a variant
of the augmented Lagrangian scheme that supports decomposable dual ascent
solution for the method of Lagrangian multipliers. It develops a decomposition-
coordination procedure to decompose the large-scale optimization problem into a
set of small local subproblems. Inheriting the convergence properties of the method
of Lagrangian multipliers, ADMM is able to obtain the globally optimal solution.
In comparison to general equality-constrained minimization, ADMM introduces
auxiliary variables to split the objective function for decomposability. To be
concrete, ADMM solves the minimization problem of the summation of two convex
functions f(x) and g(z) under the constraint Ax + Bz = C. Thus, the augmented
Lagrangian function for the optimal solution is formulated with dual variables λ:

L_ρ(x, z, λ) = f(x) + g(z) + λ^T (Ax + Bz − C) + (ρ/2) ‖Ax + Bz − C‖²

where ρ is the positive Lagrangian parameter. The augmented Lagrangian function
can be iteratively solved with three steps that partially update the primal variables
x, auxiliary variables z and dual variables λ. In the t-th iteration, these three steps
are conducted in a sequential manner.
1. x-minimization: partially update x by minimizing the augmented Lagrangian
function L(x, z, λ) with fixed z and dual variables λ, or x^(t+1) = argmin_x L(x, z^(t), λ^(t)).
2. z-minimization: partially update z by minimizing the augmented Lagrangian
function L(x, z, λ) with updated x and fixed λ, or z^(t+1) = argmin_z L(x^(t+1), z, λ^(t)).
3. Update the dual variables with the updated x and z, or
λ^(t+1) = λ^(t) + ρ(Ax^(t+1) + Bz^(t+1) − C).
Here, the dual variables are updated with the step size ρ for each iteration.
According to Steps 1–3, x and z are alternately updated in each iteration based on
f(x) and g(z). This fact implies that the minimization problem over x and z can be
separately solved for distributed data, when f(x) or g(z) are decomposable. Since
x and z can be derived from each other based on the dual variables λ, they can
be exchanged among multiple institutions for distributed optimization. Therefore,
ADMM is desirable for privacy-preserving federated data analysis, where each
institution can solve the decomposable optimization problem based on local data
and submit intermediary results for a global solution.
Under the ADMM framework, the two generic optimization problems, consensus
and sharing, are formulated for distributed analysis over horizontally and vertically
partitioned data, respectively. For horizontally partitioned data, the consensus
problem splits the primal variables x and separately optimizes the decomposable
cost function f(x) for all the institutions under the global consensus constraints.
Considering that the submatrix A_k ∈ R^(N_k × M) of A ∈ R^(N × M) corresponds to
the local data held by the k-th institution, the primal variables x_k ∈ R^(M × 1) for the
K institutions are solved by

min_{x_k, z} Σ_{k=1}^K f_k(A_k x_k) + g(z)   s.t.   x_k − z = 0,   k = 1, …, K.   (3.3)
Here, f_k(A_k x_k) is the cost function for the k-th institution and g(z) commonly
represents the regularization term for the optimization problem. According to the
derived augmented Lagrangian function, x_k is independently solved based on local
data A_k and the corresponding dual variables λ_k, while z is updated by averaging
x_k and λ_k. Thus, the global model can be established by optimizing over the K
institutions under the global consensus constraints.
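A minimal NumPy sketch of consensus ADMM for K = 3 institutions, taking f_k as a local least-squares cost and g(z) = 0 so the result can be checked against the centralized fit (names and parameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup for the consensus problem: K institutions each hold
# 40 records (rows) of the same M features, plus local responses b_k.
K, M, rho = 3, 4, 1.0
parts = [(rng.normal(size=(40, M)), rng.normal(size=40)) for _ in range(K)]

z = np.zeros(M)                        # global consensus variable
u = [np.zeros(M) for _ in range(K)]    # scaled dual variables, one per site

for _ in range(2000):
    # x-minimization: closed form for f_k(A_k x) = 0.5 * ||A_k x - b_k||^2
    xs = [np.linalg.solve(A.T @ A + rho * np.eye(M),
                          A.T @ b + rho * (z - u_k))
          for (A, b), u_k in zip(parts, u)]
    # z-minimization with g(z) = 0: average of x_k + u_k over institutions
    z = np.mean([x_k + u_k for x_k, u_k in zip(xs, u)], axis=0)
    # dual update with step size rho (scaled form)
    u = [u_k + x_k - z for x_k, u_k in zip(xs, u)]

# The consensus solution matches the centralized least-squares fit.
A = np.vstack([A for A, _ in parts])
b = np.concatenate([b for _, b in parts])
beta_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(z, beta_ls, atol=1e-5))
```

Only x_k and u_k ever leave a site; the raw (A_k, b_k) stay local throughout.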
The sharing problem is considered for vertically partitioned data, where A and
x are vertically split into A_k ∈ R^(N × M_k) and x_k ∈ R^(M_k × 1) for the K
institutions. Auxiliary variables z_k ∈ R^(N × 1) are introduced for the k-th institution
based on A_k and x_k. In such a case, the sharing problem is formulated based on the
decomposable cost function

min_{x_k, z_k} Σ_{k=1}^K f_k(x_k) + g(Σ_{k=1}^K z_k)   s.t.   A_k x_k − z_k = 0,   k = 1, …, K.   (3.4)

Under the ADMM framework, x_k and its dual variables λ_k can be separately
solved, while z_k is derived from the aggregated results of A_k x_k and λ_k. This fact
implies that each institution can locally optimize the decomposable cost function
using its own portion of attributes and adjust the model parameters according to the
auxiliary variables derived from the global optimization problem. It is worth mentioning
that the global consensus constraints implicitly exist for the dual variables in the sharing
problem.
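As a concrete sketch of the sharing problem, the hypothetical NumPy example below uses ridge penalties f_k(x_k) = (λ/2)‖x_k‖² and g(w) = ½‖w − b‖², so institutions are coupled only through the sum of their shared pieces A_k x_k, and the result can be checked against centralized ridge regression (all names and parameters are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical setup for the sharing problem: K institutions each hold a
# disjoint block of 2 attributes (columns) for the same N = 50 patients.
K, N, rho, lam = 3, 50, 1.0, 0.1
A_parts = [rng.normal(size=(N, 2)) for _ in range(K)]
b = rng.normal(size=N)

xs = [np.zeros(2) for _ in range(K)]
zs = [np.zeros(N) for _ in range(K)]
us = [np.zeros(N) for _ in range(K)]   # scaled duals, one per constraint

for _ in range(2000):
    # x_k-minimization: ridge cost f_k(x_k) = (lam/2)*||x_k||^2, local data only
    xs = [np.linalg.solve(lam * np.eye(2) + rho * A.T @ A,
                          rho * A.T @ (z - u))
          for A, z, u in zip(A_parts, zs, us)]
    # z-minimization with g(w) = 0.5*||w - b||^2 couples the institutions
    # only through the sum S of the shared pieces A_k x_k + u_k.
    cs = [A @ x + u for A, x, u in zip(A_parts, xs, us)]
    S = (rho * np.sum(cs, axis=0) + K * b) / (rho + K)
    zs = [c - (S - b) / rho for c in cs]
    # dual update
    us = [u + A @ x - z for A, x, z, u in zip(A_parts, xs, zs, us)]

# Compare to centralized ridge regression on column-concatenated data.
A = np.hstack(A_parts)
x_star = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)
print(np.allclose(np.concatenate(xs), x_star, atol=1e-4))
```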
3.3 Federated Data Analysis Applications
In this section, we review federated data analysis models for regression and
classification based on the Newton-Raphson method and ADMM framework.
3.3.1 Applications Based on the Newton-Raphson Method
The Newton-Raphson method is widely adopted in federated data analysis for
generalized linear models, e.g. logistic regression, multinomial regression, and
Cox proportional hazard model, which are widely used in biomedicine. Table 3.1
summarizes the existing federated modeling techniques for distributed data with
their application scenarios, data partitioning, mechanisms for model parameter
estimation, statistical tests and communication protection.
An early federated data analysis paper in biomedical informatics introduced
the Grid binary LOgistic REgression (GLORE) framework (Wu et al. 2012). In
this work, a binary logistic regression model was developed for model parameter
estimation and statistical tests over data horizontally distributed across multiple
institutions in a privacy-preserving manner. For security and confidentiality, the
proposed model shared models rather than patient-level information. Besides model
parameter estimation, distributed algorithms were developed for H-L test and AUC
Table 3.1 Federated modeling techniques for distributed data based on the Newton-Raphson method

| Technique | Application scenario | Data partitioning | Parameter estimation | Statistical test | Communication protection |
| GLORE (Wu et al. 2012) | Logistic regression | Horizontal | MLE | H-L test, AUC score | N/A |
| IPDLR (Wu et al. 2012) | Logistic regression | Horizontal | MLE | AUC score | Secure summation protocol |
| EXPLORER (Wang et al. 2013) | Logistic regression | Horizontal | MAP | N/A | SINE protocol |
| SMAC-GLORE (Shi et al. 2016) | Logistic regression | Horizontal | MLE | N/A | Garbled circuits |
| VERTIGO (Li et al. 2016) | Logistic regression | Vertical | MLE | N/A | N/A |
| Multi-category GLORE (Wu et al. 2015) | Ordinal/multinomial regression | Horizontal | MLE | H-L test, AUC score | N/A |
| HPPCox (Yu et al. 2008) | Cox regression | Horizontal | MLE | N/A | N/A |
| WebDISCO (Lu et al. 2014) | Cox regression | Horizontal | MLE | N/A | N/A |
score estimation in GLORE. It was shown to achieve equivalent model parameter
estimation and statistical tests over simulated and clinical datasets in comparison to
centralized methods. WebGLORE (Jiang et al. 2013) provided a free web service
to implement the privacy-preserving architecture for GLORE, where AJAX, JAVA
Applet/Servlet and PHP technologies were seamlessly integrated for secure and
easy-to-use web service. Consequently, it would benefit biomedical researchers
to deploy the practical collaborative software framework in real-world clinical
settings.
Inspired by GLORE, a series of extensions and improvements have been made
for various regression tasks with different privacy concerns. Despite shedding light
on federated data analysis, GLORE still suffered from two main limitations: privacy
protection of intermediary results and synchronization for iterative distributed opti-
mization. Wu et al. (2012) considered the institutional privacy for GLORE. During
iterative optimization, sensitive information of an institution would be leaked, as
its contribution to each matrix of coefficients is known to the server. Therefore,
institutional privacy-preserving distributed binary logistic regression (IPDLR) was
developed to enhance the institutional privacy in GLORE by masking the ownership
of the intermediary results exchanged between the server and institutions. To make
all the institutions remain anonymous, a secure summation procedure was developed
to integrate all the intermediary results without identifying their ownership. At each
iteration, client-to-client communication was conducted to merge the intermediary
results on a client basis based on the random matrix assigned by the server. Thus,
the server would obtain the aggregated results without knowing the contribution
of each institution. The secure summation procedure was also employed in ROC
curve plotting to securely integrate local contingency tables derived in the insti-
tutions. Wang et al. (2013) proposed a Bayesian extension for GLORE, namely
EXpectation Propagation LOgistic REgRession (EXPLORER) model, to achieve
distributed privacy-preserving online learning. In comparison to frequentist logistic
regression model, EXPLORER made maximum a posteriori (MAP) estimation
using expectation propagation along the derived factor graph. The model parameters
were iteratively updated based on partial posterior function w.r.t. the records
held by each institution (intra-site update) and the messages passing among the
server and institutions (inter-site update). As a result, EXPLORER improved the
security and flexibility for distributed model learning with similar discrimination
and model fit performance. To reduce the information leakage from unprotected
intermediary results, EXPLORER exchanged the encrypted posterior distribution
of coefficients rather than the intermediary results for model parameter estimation.
The sensitive information about individual patients would not be disclosed, as only
statistics like the mean vector and covariance matrix were shared to represent the
aggregated information of the raw data. Moreover, secured intermediate information
exchange (SINE) protocol was adopted to further protect aggregation information.
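The additive-masking idea behind such secure aggregation can be sketched as follows; this is a deliberately simplified illustration (a single server-issued mask carried along a chain of clients), not the actual secure summation or SINE protocol specification:

```python
import numpy as np

rng = np.random.default_rng(5)

# Each institution holds an intermediary result it will not reveal in the clear.
local_results = [rng.normal(size=(3, 3)) for _ in range(4)]

# The server issues a random matrix; the running total is passed client to
# client, so each party sees only a masked partial sum, never a raw value.
mask = rng.normal(size=(3, 3))
running = mask.copy()
for r in local_results:
    running = running + r

aggregate = running - mask   # the server removes the mask it generated

print(np.allclose(aggregate, sum(local_results)))
```

Production protocols use pairwise masks and keep the server from ever seeing the intermediate masked sums; the point here is only that random masking preserves the exact aggregate.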
To guarantee flexibility, EXPLORER leveraged online learning to update the model
based on the newly added records. It also supported asynchronous communication
to avoid coordinating multiple institutions, so that it would be robust to
offline institutions and interrupted communication. Shi et al.
(2016) developed a grid logistic regression framework based on secure multiparty
computation. In addition to raw data, the proposed SMAC-GLORE protected the
decomposable intermediary results based on garbled circuits during iterative model
learning. Secure matrix multiplication and summation protocols were presented for
maximum likelihood estimation using fixed-Hessian methods. For MLE, the Hessian
matrix inversion problem was securely transformed into a recursive procedure of matrix
multiplications and summations using the Strassen algorithm, while the exponential
function was approximated with a Taylor series expansion.
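Such circuit-friendly approximation can be illustrated with a truncated Taylor series for the exponential, which needs only additions and multiplications (a toy sketch, unrelated to the actual SMAC-GLORE circuits):

```python
import math

# Garbled circuits evaluate additions and multiplications, so exp(x) can be
# replaced by a truncated Taylor series: sum over n < terms of x^n / n!.
def exp_taylor(x, terms=10):
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        term *= x / (n + 1)   # builds x^n / n! incrementally
    return total

# Near zero the truncation error is tiny; it grows with |x|.
print(abs(exp_taylor(0.5) - math.exp(0.5)) < 1e-8)
```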
Wu et al. (2015) extended GLORE to address multi-centric modeling of multi-
category response, where grid multi-category response models were developed
for ordinal and multinomial logistic regression over horizontally partitioned data.
Grid Newton method was proposed to make maximum likelihood estimation of
model parameters in a privacy-preserving fashion. At each iteration, each institution
separately calculated partial gradients and Hessian matrix based on its own data
and the server could integrate these intermediary results to derive the global model
parameters. Thus, the proposed models could reduce disclosure risk, as patient-level
data would not be moved outside the institutions. Furthermore, privacy-preserving
distributed algorithms were presented for grid model fit assessment and AUC score
computation in ordinal and multinomial logistic regression models by extending
the corresponding algorithms for binary response models. The proposed models
were demonstrated to achieve the same accuracy with a guarantee of data privacy in
comparison to the corresponding centralized models.
Recently, Li et al. (2016) proposed a novel method that leveraged dual opti-
mization to solve binary logistic regression over vertically partitioned data. The
proposed vertical grid logistic regression (VERTIGO) derived the global solution
with aggregated intermediate results rather than the sensitive patient-level data.
In the server/client architecture, the server iteratively solved the dual problem of
binary logistic regression using the Newton-Raphson method. To compute the
Hessian matrix, each institution transmitted the kernel matrix of its local statistics
to the server for merging. A dot-product kernel matrix was adopted to guarantee
that the global Gram matrix was decomposable over multiple institutions. For
iterative optimization, it was only required to exchange the dot products of patient
records and dual parameters between the server and institutions. This fact implies
that the patient-level information would not be revealed, when the distribution of
covariates is not highly unbalanced. Employed on both synthetic and real datasets,
VERTIGO was shown to achieve equivalent accuracy for binary logistic regression
in comparison to its centralized counterpart.
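The decomposability of the dot-product kernel is easy to check: with attributes split column-wise across sites, the global Gram matrix is the sum of the local kernel matrices A_k A_k^T. A small hypothetical NumPy check:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical check: the same 20 patients, attributes split across 3 sites.
A_parts = [rng.normal(size=(20, m)) for m in (3, 2, 4)]

local_kernels = [A_k @ A_k.T for A_k in A_parts]   # computed on-site
A = np.hstack(A_parts)                             # centralized view
global_gram = A @ A.T

print(np.allclose(sum(local_kernels), global_gram))
```

Each site thus ships only an N × N aggregate of its own attributes, never the attribute values themselves.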
Distributed survival analysis is one of the prevailing topics in biomedical
research; it studies the development of a symptom, disease, or mortality with
distributed time-to-event data. The Cox proportional hazard model (Cox 1972),
widely used in survival analysis, evaluates the significance of time-varying
covariates with a hazard function. Yu et al. (2008) proposed a privacy-preserving
Cox model for horizontally partitioned data, where affine projections of patient
data in a lower dimensional space were shared to learn survival model. The
proposed HPPCox model utilized a rank-deficient projection matrix to hide sensitive
information in raw data, as the lower dimensional projections were commonly
irreversible. The projection matrix was optimized to minimize the information loss
caused by these lower dimensional projections while simultaneously maintaining
the major structures (properties) of the input data and reducing its dimensionality.
Thus, feature selection could be enabled to prevent overfitting in
scenarios with limited training data. This model was shown to achieve nearly
optimal predictive performance for multi-centric survival analysis. O’Keefe et al.
(2012) presented explicit confidentialisation measures for survival models to avoid
exchanging patient-level data in a remote analysis system, but did not consider a
distributed learning model. The work considered and compared confidentialised
outputs for non-parametric survival model with Kaplan-Meier estimates (Kaplan
and Meier 1958), semiparametric Cox proportional hazard model and parametric
survival model with Weibull distribution (Weibull 1951). The confidentialised
outputs would benefit model fit assessment with similar model statistics and
significance in comparison to traditional methods. However, the work focused
on securely generating survival outputs and did not perform distributed model learning.
Lu et al. (2014) proposed a web service WebDISCO for distributed Cox proportional
hazard model over horizontally partitioned data. The global Cox model was
established under the server/client architecture, where each institution separately
calculated its non-sensitive intermediary statistics for model parameter estimation
at the server. WebDISCO investigated the technical feasibility of employing federated
data analysis on survival data. The distributed Cox model was shown to be identical
in mathematical formulation and to achieve equivalent precision for model learning
in comparison to its centralized realization.
3.3.2 Applications Based on ADMM
In this subsection, we review the ADMM-based distributed algorithms for regres-
sion and classification with a brief overview on the convergence analysis for
ADMM-based methods. Table 3.2 summarizes the ADMM-based algorithms for
horizontally and vertically partitioned data, most of which have not yet been applied
in the context of biomedical informatics.
Table 3.2 Federated modeling techniques for distributed data based on ADMM

| Reference | Application scenario | Data partitioning |
| Boyd et al. (2011) | Logistic regression | Horizontal/Vertical |
| Boyd et al. (2011) | Lasso/Group Lasso | Vertical |
| Boyd et al. (2011) | Support vector machine | Vertical |
| Mateos et al. (2010) | Linear regression | Horizontal |
| Mateos and Schizas (2009) | Recursive least-squares | Horizontal |
| Mateos and Giannakis (2012) | Recursive least-squares | Horizontal |
| Forero and Giannakis (2010) | Support vector machine | Horizontal |
| Schizas and Aduroja (2015) | Principal component analysis | Horizontal |
| Scardapane et al. (2016) | Recurrent neural networks | Horizontal |
Regression
Boyd et al. (2011) summarized the ADMM-based distributed ℓ1-penalized logistic
regression model for horizontally and vertically partitioned data, respectively. For
horizontally partitioned data, model parameters were separately solved for each
institution by minimizing an ℓ2-regularized log-likelihood function over local data.
Subsequently, the auxiliary variable for global consensus was found by minimizing the
combination of its ℓ1-norm and the squared difference between the auxiliary
variable and the averaged primal and dual variables. It should be noted that the
auxiliary variable could be computed for each attribute in parallel for improved
efficiency. Each institution would not leak its sensitive information, as only its local
model parameters were exchanged for global consensus. When data are vertically
distributed across multiple institutions, a Lasso problem based on the local attributes
was formulated for each institution under the ADMM framework. Aggregating the
intermediary results from all the institutions, the auxiliary variables were derived
from the ℓ2-regularized logistic loss function. The dual variables were updated
based on the aggregated intermediary results and averaged auxiliary variables, and
remained the same for all the institutions. The distributed logistic regression could be
performed in a privacy-preserving manner, as local data owned by each institution
would not be inferred from the aggregated intermediary results. In both cases, the ℓ2-
regularized minimization based on log-sum-exp functions can be iteratively solved
using the Newton-Raphson method or the L-BFGS algorithm.
Mateos et al. (2010) leveraged alternating direction method of multipliers
(ADMM) to make model parameter estimation in sparse linear regression. The
centralized Lasso model was transformed into a decomposable consensus-based
minimization problem for horizontally partitioned data. The derived minimization
problem can be iteratively solved with ADMM in a decentralized manner, where
each institution only communicates with its neighboring institutions to protect
data privacy. To balance computational complexity and convergence rate, three
iterative algorithms, DQP-Lasso, DCD-Lasso and D-Lasso, were developed based
on ADMM. DQP-Lasso introduced strictly convex quadratic terms to constrain the
model parameters and auxiliary variables for each institution and its neighbors in
augmented Lagrangian function. Thus, quadratic programming (QP) is adopted for
each institution to obtain its model parameters by minimizing the decomposable
Lagrangian function in a cyclic manner. To improve efficiency of iterative opti-
mization, DCD-Lasso introduced coordinate descent to simplify the QP process
for each institution. At each iteration, DCD-Lasso updated the model parameters
for each institution with coordinate descent rather than iteratively obtaining the
exact solution to the corresponding QP problem. Furthermore, D-Lasso enabled
parallelized computation of model parameters corresponding to multiple institutions
to enhance convergence speed. D-Lasso can achieve equivalent model estimation in
comparison to DCD-Lasso without additional relaxation terms. It is demonstrated
that the model parameters locally derived by these algorithms are convergent to the
global solution to Lasso.
Similar to sparse linear regression, Mateos and Schizas (2009) adopted ADMM
to develop a distributed recursive least-squares (D-RLS) algorithm for time series
data horizontally distributed across multiple institutions. The proposed algorithm
reformulated the exponentially weighted least-squares estimation to a consensus-
based optimization problem by introducing auxiliary variables for corresponding
institutions. The reformulated minimization problem was decomposed into a series
of quadratic optimization problems that were recursively solved for each institution.
Furthermore, Mateos and Giannakis (2012) improved the efficiency of D-RLS by
avoiding explicit matrix inversion in recursively solving quadratic problems for each
institution. The proposed algorithms are demonstrated to be stable for time series
data with sufficient temporal samples under the metric of means and mean squared
errors (MSE). These ADMM-based linear regression and least-squares estimation
were validated over the wireless sensor networks.
The sharing formulations have also been studied for Lasso and group Lasso
problems over vertically partitioned data. Similar to the distributed logistic regression
model, sparse least-squares estimation was formulated for each institution to
independently obtain the model parameters corresponding to its own attributes.
In group Lasso, institutions would adopt various regularization parameters for the ℓ1-
regularized problem. Since the auxiliary variables were updated analytically with a
linear combination of the aggregated intermediary results and dual variables, the
computational complexity mainly depended on the decomposable ℓ1-regularized
problem for multiple institutions. To improve efficiency, the x-minimization for
a certain institution could be skipped when its attributes were not considered to be
involved in distributed optimization, based on a threshold w.r.t. the regularization
parameters and the Lagrangian multiplier.

Classification
Forero and Giannakis (2010) leveraged ADMM to develop a distributed support
vector machine (DSVM) classifier for training data horizontally distributed across
multiple institutions. Introducing auxiliary variables for the local model parameters
at each node, the linear SVM classifier was decomposed into a series of convex
sub-problems over these auxiliary variables under the consensus constraints. For
each node, its local model parameters were independently derived from the corre-
sponding sub-problem. The decomposable optimization was iteratively performed
to obtain the unique optimal model parameters in a decentralized manner. Under the
ADMM framework, the global SVM classifier was trained in a privacy-preserving
fashion, as each node exchanges its locally estimated model parameters rather than
the training data it owns. To handle sequential and asynchronous learning tasks, the
linear DSVM classifier supports online updates for time-varying training data. The
classifier could be partially updated to adapt to cases where training samples
were added to or removed from the training set. The ADMM-based distributed
optimization was also generalized to nonlinear SVM classifiers, where consensus
constraints on discriminant functions of local model parameters were shrunk to
a rank-deficient subspace of the reproducing kernel Hilbert space. In analogy
to generalized linear regression, ADMM was also adopted for SVM classifier
over vertically partitioned data. The centralized SVM classifier was decomposed
into a series of quadratic programming problems to separately derive the model
parameters corresponding to the attributes owned by each institution. These model
parameters were iteratively adjusted with the auxiliary variables obtained by soft
thresholding over the aggregated intermediary results and averaged dual variables.
Therefore, each institution only needs to share the aggregated intermediary results
of its attributes for each patient’s record.
Schizas and Aduroja (2015) proposed a distributed principal component anal-
ysis (PCA) framework based on ADMM under the metric of mean-square error
(MSE). An equivalent constrained optimization problem was formulated for the
classical PCA framework, where each node used local covariance matrix to sep-
arately estimate its principal eigenspace in a recursive manner. Coordinate descents
were adopted in the ADMM framework to iteratively minimize the augmented
Lagrangian function under the consensus constraints. The proposed framework
enabled distributed dimensionality reduction over horizontally partitioned data in
a privacy-preserving fashion, as each node only exchanged aggregated intermediary
results with its neighboring nodes. Given sufficient iterations, the estimated
principal eigenspace was demonstrated to converge asymptotically to the subspace
spanned by the actual principal eigenvectors. For validation, the proposed
distributed PCA framework was employed in data denoising over wireless sensor
networks, where evaluations on synthetic and real data showed an enhanced
convergence rate and improved noise resilience.
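As a simplified sketch of this exchange of aggregated intermediary results (not the recursive ADMM scheme of Schizas and Aduroja; the ring topology, mixing weights, and covariance values are hypothetical), nodes can average their local covariance matrices with neighbors only, converging to the global average covariance from which each node could then extract principal components locally:

```python
# Ring-consensus averaging of local 2x2 covariance matrices: each node
# mixes only with its two neighbours, so raw records are never shared,
# yet every node converges to the global average covariance.

def mat_comb(ms, ws):
    """Entrywise weighted combination of 2x2 matrices."""
    return [[sum(w * m[r][c] for m, w in zip(ms, ws)) for c in range(2)]
            for r in range(2)]

# hypothetical local covariance matrices (aggregated statistics only)
covs = [
    [[2.0, 0.3], [0.3, 1.0]],
    [[1.5, -0.2], [-0.2, 2.5]],
    [[3.0, 0.8], [0.8, 0.5]],
    [[0.5, 0.1], [0.1, 1.5]],
]
n = len(covs)

# what a trusted central aggregator would compute
target = mat_comb(covs, [1.0 / n] * n)

# ring consensus: each node mixes with its two neighbours only
for _ in range(100):
    covs = [mat_comb([covs[i], covs[(i - 1) % n], covs[(i + 1) % n]],
                     [0.5, 0.25, 0.25]) for i in range(n)]

# largest deviation of any node from the global average
err = max(abs(covs[i][r][c] - target[r][c])
          for i in range(n) for r in range(2) for c in range(2))
```

Because the mixing matrix is doubly stochastic, every node's copy contracts geometrically toward the global average; the eigendecomposition step would then run entirely on local data.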
Scardapane et al. (2016) adopted ADMM in the decentralized training of recur-
rent neural networks (RNNs) that optimized the global loss function over the data
horizontally distributed across all the nodes. The proposed distributed algorithm
was designed specifically for Echo State Networks. The decomposable optimization
under consensus constraints was formulated for each node to separately calculate
the model parameters based on ℓ2-regularized least-squares estimation. For each
node, its auxiliary variable for global consensus was obtained with a weighted
average of model parameters from its neighboring nodes. Thus, it was not required
to share the training data or the hidden matrix representing the current states of
neural networks among multiple nodes. Since the auxiliary variables were not
calculated over all the nodes in the network, the algorithm did not depend on
a central server to perform global optimization. Evaluations over large-scale synthetic
data showed that the proposed distributed algorithm could achieve comparable
classification performance in comparison to its centralized counterpart.

Convergence and Robustness for Decentralized Data Analysis
Ling et al. (2015) proposed a decentralized linearized ADMM (DLM) method
to reduce computational complexity and enhance convergence speed for stan-
dard ADMM. The proposed DLM method simplified the decomposable optimiza-
tion problems with linearized approximation. Thus, the computational cost for
66 W. Dai et al.
implementation based on gradient descent could be significantly reduced for
applications like distributed logistic regression and least-squares estimation.
The DLM method is demonstrated to converge to the optimal solution at a linear
rate when the decomposable cost functions are strongly convex and their gradients
are Lipschitz continuous. Furthermore, Mokhtari
et al. (2016) considered quadratic approximation of ADMM-based formulation for
logistic regression model to improve the accuracy of model parameter estimation.
The proposed DQM method was shown to obtain more accurate estimates for model
parameters with a linear convergence rate in comparison to DLM.
Investigations have also been made to improve the robustness of ADMM
for various application scenarios. Wang and Banerjee (2012) introduced an online
algorithm into ADMM to yield an enhanced convergence rate. Goldfarb et al.
(2012) developed a fast first-order linearized augmented Lagrangian minimization
to accelerate the convergence of ADMM algorithm. Considering the insufficient
accessibility of true data values, Ouyang et al. (2013) presented a stochastic ADMM
algorithm aiming to minimize non-smooth composite objective functions.
3.4 Secure Multiparty Computation
The previous sections introduced federated technologies reducing the privacy risks
through model decomposition so that we build global models only on locally
aggregated statistics (without accessing the patient-level data). To further mitigate
the privacy risk, we need to ensure the confidentiality of transmitted summary
statistics in each institution as well. This can be achieved using Secure Multiparty
Computation (SMC), which ensures computation and communication security via
advanced cryptographic protocols. Many state-of-the-art SMC schemes are based
upon the idea of translating an algorithm to a binary circuit. Scalable SMC protocols
like Yao’s garbled circuit (Yao 1982) could represent arbitrary functions with a
Boolean circuit to enable masking of inputs and outputs of each gate for sensitive
information. In federated data analysis, two architectures are widely considered to
integrate secure multiparty computation protocols, namely spoke-hub and peer-to-
peer architectures, as shown in Fig. 3.4. The spoke-hub architecture requires one or
more non-colluding institutions to securely perform computation and analysis based
on the data collected from the other institutions, while the peer-to-peer architecture
allows secure exchange of encrypted data in a decentralized manner. In this section,
we introduce some secure multiparty computation protocols for regression,
classification and evaluation tasks.
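Before turning to specific protocols, the secure summation primitive invoked repeatedly below can be sketched as a ring protocol in which the initiating institution masks the running total with a random value, each institution adds its private statistic modulo a large number, and the initiator removes the mask. This is a minimal illustration of the idea (the modulus and the per-institution values are hypothetical), not a production implementation:

```python
import random

# Ring-based secure summation: no institution on the ring sees anything
# but a uniformly random-looking running total, yet the initiator
# recovers the exact sum after removing its mask.

M = 2 ** 61 - 1  # public modulus, larger than any possible sum

def secure_sum(private_values):
    mask = random.randrange(M)      # initiator's secret mask
    running = mask
    for v in private_values:        # total passed around the ring
        running = (running + v) % M
    return (running - mask) % M     # initiator removes the mask

local_counts = [17, 42, 5, 103]     # hypothetical per-institution counts
total = secure_sum(local_counts)    # equals sum(local_counts)
```

Each intermediate value is uniformly distributed modulo `M`, so an honest-but-curious institution learns nothing about the others' inputs from the running total alone.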
3.4.1 Regression
Fienberg et al. (2006) proposed privacy-preserving binary logistic regression for
horizontally partitioned data with categorical covariates. Secure summation and data
integration protocols were adopted to securely compute the maximum likelihood estimates
Fig. 3.4 Illustration of architectures for federated data analysis using secure multiparty computation (SMC) protocols. (a) Spoke-hub architecture: the non-colluding institution is required to perform global computation and operations based on the encrypted data collected from the other institutions. (b) Peer-to-peer architecture: all the institutions exchange and aggregate encrypted data with each other in a successive manner to achieve federated data analysis
and perform contingency table analysis for log-linear models. Under the categorical
covariates, the logistic regression model could be constructed from the log-linear
models in a distributed and privacy-preserving fashion. Slavkovic and Nardi (2007)
presented a secure logistic regression approach on horizontally and vertically
partitioned data without actually combining them. Secure multiparty computation
protocols were developed for the Newton-Raphson method to solve binary logistic
regression with quantitative covariates. At each iteration, multiparty secure matrix
summation and product protocols were employed to compute the gradients and
Hessian matrix. Here, the inverse of Hessian was derived by recursively performing
secure matrix product over its sub-block matrices. The proposed protocol protects
the privacy of each institution under the secure multiparty protocols when the
raw-data matrices all have dimensionality greater than one. However,
Fienberg et al. (2009) showed that sharing intermediary results among multiple
institutions over numerous iterations leads to a serious privacy breach, even
when they are protected by mechanisms like encryption with random shares.
Nardi et al. (2012) developed a secure protocol to fit logistic regression
over vertically partitioned data based on maximum likelihood estimation (MLE).
In comparison to previous works, it enhanced security by revealing only the final
results of the private computation, since shared intermediate values could
compromise the confidentiality of the raw data held by each institution. Two
protocols were proposed for approximating the logistic function using
existing cryptographic primitives including secret sharing, secure summation and
product with random shares, secure interval membership evaluation and secure
matrix inversion. The first protocol approximated the logistic function with the
summation of step functions. However, this protocol would be computationally
prohibitive for high-dimensional large-scale problems due to the secret sharing and
comparison operation over encrypted data. To relieve the computational burden, the
second protocol formulated the approximation by solving an ordinary differential
equation numerically integrated with Euler’s method. Thus, the logistic regression
was iteratively fit with secure summations and products of approximation and
average derivatives of logistic function evaluated on the intermediate values.
The accuracy bound of the two approximation protocols is demonstrated to be
related to the minimum eigenvalue of the Fisher information matrix and the step size
for approximation. However, these protocols would be prohibitive for real-world
applications due to their high computational complexity and communication rounds.
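The ODE idea behind the second protocol is easy to illustrate in the clear: the logistic function satisfies s'(x) = s(x)(1 - s(x)) with s(0) = 1/2, so it can be evaluated by Euler's method using only sums and products, operations that have secure multiparty counterparts. The sketch below (step size and evaluation points are illustrative, and the secure primitives are omitted) shows the numerical approximation:

```python
import math

# Euler integration of s'(x) = s(x) * (1 - s(x)), s(0) = 0.5, which is
# exactly the logistic function. Each step uses one product and one sum,
# the operations a secure protocol would evaluate on shared values.

def logistic_euler(x, h=1e-3):
    n = int(round(abs(x) / h))
    step = h if x >= 0 else -h       # integrate backwards for x < 0
    s = 0.5
    for _ in range(n):
        s += step * s * (1.0 - s)    # one secure product + secure sum
    return s

x = 2.0
approx = logistic_euler(x)
exact = 1.0 / (1.0 + math.exp(-x))   # closed form, for comparison only
```

Euler's method has global error proportional to the step size, which is the step-size dependence in the accuracy bound mentioned above.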
Sanil et al. (2004) addressed privacy-preserving linear regression analysis over
vertically partitioned data. A quadratic optimization problem was formulated to derive the exact
coefficients of global regression model over multiple institutions. Distributed
computation of regression coefficients was achieved by implementing Powell’s
algorithm under the secure multiparty computation framework. At each iteration,
each institution updated its local regression coefficients and its own subset of
search directions. Subsequently, a common vector was generated by aggregating the
products of local attributes and search directions using the secure summation protocol
for computation in the next iteration. Finally, the proposed algorithm obtained the global
coefficients and the vector of residuals. Thus, global linear regression could be
made based on the entire dataset collected from all the institutions without actually
exchanging the raw data owned by each institution. Moreover, basic goodness-of-fit
diagnostics could be performed with the coefficient of determination that measured
the strength of the linear relationship. However, this algorithm is unrealistic
because it assumes the institution holding the response attribute shares it with
the other institutions. Karr et al. (2009) proposed a protocol for secure linear
regression over data vertically distributed across multiple institutions. The proposed
protocol was able to estimate the regression coefficients and related statistics like
standard error in a privacy-preserving manner. In distributed optimization, multiple
institutions collaborated to calculate the off-diagonal blocks of the global covariance
matrix using secure matrix products, while the diagonal blocks were obtained based
on the corresponding local attributes. The global covariance matrix was shared
among multiple institutions for secure linear regression and statistical analyses.
Moreover, model diagnostic measures based on residuals can be similarly derived
from the global covariance matrix using secure summation and matrix products
protocols. Remarkably, the proposed protocol could be generalized to a variety
of regression models, e.g. weighted least squares regression, stepwise regression
and ridge regression, under the constraint that sample means and co-variances are
sufficient statistics.
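The block-assembly idea can be sketched numerically. In the toy example below (data, the two-way attribute split, and the response assignment are hypothetical, and the secure matrix product is stood in for by a plain dot product), the diagonal blocks of the covariance structure are computed locally while the cross terms would come from the secure protocol, after which the normal equations are solved:

```python
# Secure linear regression over vertically partitioned data, in the
# spirit of Karr et al. (2009): diagonal blocks of X^T X are local,
# off-diagonal blocks would use a secure matrix product (computed in
# the clear here as a stand-in).

# institution A holds attribute a (and the response y); B holds b
a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, 0.0, 2.0, 1.0]
y = [5.0, 4.0, 12.0, 11.0]          # generated as y = 2a + 3b

dot = lambda u, v: sum(x * z for x, z in zip(u, v))

# diagonal blocks computed locally; dot(a, b) via secure matrix product
xtx = [[dot(a, a), dot(a, b)],
       [dot(a, b), dot(b, b)]]
xty = [dot(a, y), dot(b, y)]        # dot(b, y) also via secure product

# solve the 2x2 normal equations by Cramer's rule
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
beta = [(xty[0] * xtx[1][1] - xtx[0][1] * xty[1]) / det,
        (xtx[0][0] * xty[1] - xty[0] * xtx[1][0]) / det]
# beta recovers the generating coefficients [2.0, 3.0]
```

Residual-based diagnostics follow the same pattern, since sums of squares are also expressible through the assembled covariance blocks.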
3.4.2 Classification
The naïve Bayes classifier is an effective Bayesian learning method with
consistently reasonable performance, and it is commonly adopted as a benchmark
against which classification methods are evaluated. Vaidya and Clifton (2003b) developed a privacy-preserving
Naive Bayes classifier for vertically partitioned data, where multiple institutions
collaborated to achieve classification with random shares of the global model.
In the proposed Bayes classifier, each institution is only required to share the
class of each instance, rather than the distribution of classes or attribute values.
Secure protocols were developed for training and classification under the secure
multiparty computation framework. In training, random shares of the conditionally
independent probabilities for nominal and numeric attributes were computed for
model parameter estimation. Concretely, for nominal attributes, shares of the
probability estimates of each class given the attribute values were computed.
For numeric attributes, mean and variance for the probability density function of
Normal distribution are required. For each class and attribute value, the shares can
be obtained from the institutions owning them. For evaluation, each new instance
was classified by maximizing its posterior probability using Bayes theorem under
the assumption of conditionally independent attribute values. Here, the probabilities
of class conditioned on attributes are derived based on the model parameters.
It is demonstrated that these protocols for training and evaluation are
able to securely compute shares of nominal and numeric attributes and to classify
instances, respectively. Furthermore, Vaidya et al. (2008) introduced secure logarithm
primitives from secure multiparty computation for naïve Bayes classification over
horizontally partitioned data. The model parameters for nominal and numeric
attributes were directly computed based on the local counts using secure sum
protocols, as each institution held all the attributes required for classifying an
instance. Therefore, classification could be made locally for each institution.
K-means clustering is popular for cluster analysis, which partitions observations
into the clusters with the nearest means. For data vertically distributed across
multiple institutions, privacy-preserving K-means clustering (Vaidya and Clifton
2003a) was studied to perform clustering without sharing raw data. Using the
proposed K-means clustering algorithm, each institution can obtain its projection
of the cluster means and learn the cluster assignment of each record without
revealing its exact attributes. Since the high-dimensional problem cannot simply be
decomposed into a combination of lower-dimensional problems for each institution,
cooperation between multiple institutions is required to learn the cluster that each
record belongs to. To achieve privacy-preserving clustering, a secure
multiparty computation framework using homomorphic encryption was introduced for
multiple institutions. For common distance metrics like Euclidean and Manhattan,
the distances between each record and the means of K clusters can be split over
these institutions. For security, these distances were disguised with random values
from a uniform random distribution and non-colluding institutions were adopted to
compare the randomized distances and permute the comparison results. The secure
permutation algorithm is performed in an asymmetric two-party manner according
to the permutation owned by the non-colluding institution. It should be noted that the
initial values of the K means were assigned to their local shares for the institutions to
obtain a feasible solution. As a result, non-colluding institutions would only know
the selected cluster in the permutation, while the exact attributes owned by each
institution would not be disclosed to the others.
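The distance-splitting step can be illustrated directly: squared Euclidean distance decomposes additively over attribute subsets, so each institution contributes only a scalar partial distance per cluster. The sketch below (records, means, and the two-way split are hypothetical; the secure comparison and permutation steps are omitted) checks the decomposition:

```python
# Distance splitting for K-means over vertically partitioned data:
# each institution computes the squared distance over its own
# attributes, and only these scalar shares are combined.

record_a = [1.0, 4.0]                # attributes held by institution A
record_b = [2.0]                     # attribute held by institution B
means_a = [[0.0, 0.0], [1.0, 5.0]]   # A's projection of the K=2 means
means_b = [[0.0], [2.5]]             # B's projection of the K=2 means

def partial(rec, mean):
    """One institution's local share of the squared distance."""
    return sum((r - m) ** 2 for r, m in zip(rec, mean))

# each institution computes its share; only these scalars are combined
dists = [partial(record_a, means_a[k]) + partial(record_b, means_b[k])
         for k in range(2)]

# sanity check against the full-record computation
full = [partial(record_a + record_b, means_a[k] + means_b[k])
        for k in range(2)]
nearest = min(range(2), key=lambda k: dists[k])
```

In the actual protocol these shares would additionally be randomized and compared through a non-colluding institution, so that only the index of the nearest cluster is revealed.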
Liang et al. (2013) developed a distributed PCA algorithm to estimate the global
covariance matrix for principal components. Each institution leveraged the standard
PCA algorithm to determine its principal components over the local data. These local
principal components were exchanged and collected to obtain global covariance
matrix. The proposed algorithm integrated the distributed coreset-based clustering
to guarantee that the number of vectors for communication was independent of
size and dimension of the federated data. It is demonstrated that the divergence
between approximations on projected and original data for k-means clustering can
be upper bounded. Guo et al. (2013) developed a covariance-free iterative distributed
principal component analysis (CIDPCA) algorithm for vertically partitioned high-
dimensional data. Instead of approximating global PCA with sampled covariance
matrix, the proposed CIDPCA algorithm is designed to directly determine the
principal components by estimating their eigenvalues and eigenvectors. The first
principal component, corresponding to the maximum eigenvalue of the covariance
matrix, was derived by maximizing the Rayleigh quotient with a gradient ascent
method. The iterative method is demonstrated to converge at an exponential rate under arbitrary
initial values of principal components. Subsequently, the remaining principal
components could be iteratively calculated in the orthogonal complement of the
subspace spanned by the previously derived principal components. In comparison
to previous distributed PCA methods, it is shown to achieve higher accuracy
in estimating principal components and better classification performance with a
significant reduction in communication cost. This conclusion is also validated by
a variety of studies over real-world datasets.
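The core numerical idea, ascending the Rayleigh quotient rather than decomposing a sampled covariance matrix, can be sketched in the clear (the 2x2 covariance matrix, step size, and iteration count are illustrative, and the distributed splitting of the computation is omitted):

```python
# Gradient ascent on the Rayleigh quotient R(w) = w'Cw / w'w converges
# to the leading eigenpair of C without forming a decomposition.

C = [[3.0, 1.0], [1.0, 2.0]]

def matvec(M, v):
    return [sum(M[r][c] * v[c] for c in range(len(v)))
            for r in range(len(M))]

def rayleigh(w):
    cw = matvec(C, w)
    return (sum(wi * ci for wi, ci in zip(w, cw))
            / sum(wi * wi for wi in w))

w = [1.0, 0.0]                        # arbitrary non-zero start
for _ in range(500):
    r, cw = rayleigh(w), matvec(C, w)
    # ascent direction: gradient of R(w), up to a positive scalar
    w = [wi + 0.1 * (ci - r * wi) for wi, ci in zip(w, cw)]
    norm = sum(wi * wi for wi in w) ** 0.5
    w = [wi / norm for wi in w]       # keep the iterate on the sphere

# rayleigh(w) converges to the largest eigenvalue, (5 + sqrt(5)) / 2
```

Each iteration shrinks the component along the smaller eigenvector by a constant factor, which is the exponential convergence rate noted above; subsequent components would be found in the orthogonal complement of the ones already extracted.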
Du et al. (2004) studied multivariate statistical analysis for vertically partitioned
data in the secure two-party computation framework, including secure two-party
multivariate linear regression (S2-MLR) and classification (S2-MC). These prob-
lems were addressed in a privacy-preserving manner with secure two-party matrix
computation based on a set of basic protocols. A new security model was proposed
to lower security requirements for higher efficiency, where each institution was
allowed to reveal a part of information about its raw data under the guarantee that
the raw data would not be inferred from the disclosed information. To securely
perform matrix computation, building blocks for matrix product, matrix inverse and
matrix determinant were presented. Thus, S2-MLR and S2-MC could be securely
solved with these building blocks. It is demonstrated that, in the proposed two-party
multivariate statistical analysis, it would be impossible to infer the raw data owned
by each institution, when less than half of its disguised matrix is revealed.
Yu et al. (2006) proposed an efficient privacy-preserving support vector machine
(SVM) classification method, namely PP-SVMV, for vertically partitioned data. In
the proposed method, the global SVM model was constructed from local SVM
models rather than directly exchanging the local data. Thus, both local SVM models
and their corresponding local data for each institution were not disclosed. For
linear kernels, the global kernel matrix is computed by directly merging gram
matrices from multiple institutions to solve the dual problem. This result can
also be extended to ordinary non-linear kernels that can be represented by dot
products of covariates, i.e. polynomial and radial basis function (RBF) kernels.
To guarantee data and model privacy, the merging of local models is performed with
secure summation of scalar integers and matrices. Experimental results demon-
strated the accuracy and scalability of PP-SVMV in comparison to the centralized
SVM over the original data. Similarly, Yu et al. (2006) presented a privacy-
preserving solution to support non-linear SVM classification over horizontally
partitioned data. It required that the nonlinear kernel matrices could be directly
calculated based on the gram matrix. Thus, widely-used nonlinear kernel matrices
like polynomial and RBF kernels can be derived from the dot products of all data
pairs using the proposed solution. Secure set intersection cardinality was adopted
as an equivalency to these dot products based on the data horizontally distributed
across the institutions. Thus, commutative one-way hash functions were utilized to
securely obtain the set intersection cardinality. The proposed method was shown to
achieve equivalent classification performance in comparison to the centralized SVM
classifiers. Mangasarian et al. (2008) constructed a reduced kernel matrix with the
original data and a random matrix to perform classification under the protection of
local data. The random kernel based SVM classifier could support both horizontally
and vertically partitioned data. Yunhong et al. (2009) proposed a privacy-preserving
SVM classifier without using secure multiparty computation. The proposed SVM
classifier built its kernel matrix by combining local gram matrices derived from
the original data owned by the corresponding institutions. It shows that local gram
matrices would not reveal the original data using matrix factorization theory, as it
is not unique for each institution to infer its covariates from local gram matrix. The
proposed classification algorithm is developed for SVM classifier with linear and
nonlinear kernels, where the accuracy of distributed classification is comparable
to the ordinary global SVM classifier. Que et al. (2012) presented a distributed
privacy-preserving SVM (DPP-SVM), where a server/client collaborative learning
framework is developed to securely estimate parameters of covariates based on
the aggregated local kernel matrices from multiple institutions. For security, all
the model operations are performed on the trusted server, including service layer
for server/client communication, task manager for data validation and computation
engine for parameter estimation.
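The gram-matrix merging that makes these vertically partitioned SVMs possible rests on a simple identity: for a linear kernel, the global gram matrix X Xᵀ is the entrywise sum of the local gram matrices computed on each institution's attribute subset. The sketch below (data and the two-way attribute split are hypothetical, and the secure matrix summation is done in the clear) verifies the identity:

```python
# For vertically partitioned data X = [X_a | X_b], the linear-kernel
# gram matrix satisfies X X^T = X_a X_a^T + X_b X_b^T, so institutions
# need only share these aggregates, never the raw covariates.

# four records; institution A holds 2 attributes, institution B holds 1
X_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [2.0, 1.0]]
X_b = [[3.0], [1.0], [0.0], [2.0]]

def gram(X):
    return [[sum(xi * yi for xi, yi in zip(u, v)) for v in X] for u in X]

K_a, K_b = gram(X_a), gram(X_b)      # computed locally at each site
K = [[K_a[i][j] + K_b[i][j] for j in range(4)] for i in range(4)]

# K equals the gram matrix computed over the full records
X_full = [ra + rb for ra, rb in zip(X_a, X_b)]
```

Since polynomial and RBF kernels are functions of these dot products, the same merged aggregate suffices for the common nonlinear kernels as well.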
Vaidya et al. (2008) presented a generalized privacy-preserving algorithm for
building ID3 decision tree over data vertically distributed across multiple insti-
tutions. For efficient and secure classification using the ID3 decision tree, the
proposed algorithm only revealed the basic structure of the tree and the specific
institution responsible for decision making at each node, rather than the exact values
of attributes. For each node in the tree, the basic structure includes its number of
branches and the depths of its subtrees, which represented the number of distinct
values for corresponding attributes. Thus, it is not necessary for each institution
to introduce complex cryptographic protocol at each possible level to securely
classify an instance. It should be noted that the proposed algorithm only needs to
assign the class attribute to one institution, though each interior node could learn the
count of classes. Consequently, the institution owning the class attribute estimated the
distributions throughout the decision tree based on the derived transaction counts,
which would not disclose much new information. The distribution and majority class
was determined based on the cardinality of set intersection protocol. Given multiple
institutions, classification of an instance was conducted by exchanging the control
information based on the decision made for each node, but not the attribute values.
To further enhance the efficiency of the privacy-preserving ID3 algorithm, Vaidya
et al. (2014) developed a random decision tree (RDT) framework suited to parallel and
distributed architectures. Random decision trees are desirable for privacy-preserving
distributed data mining, as it can achieve equivalent effect as perturbation without
diminishing the utility of information from data mining. For horizontally partitioned
data, the structure of RDTs was known to all the institutions that held the same
types of attributes. The RDTs were constructed by considering the accessibility of
the global class distribution vector for leaf nodes. Each institution could derive its
local distribution vector from its own data and submitted the encrypted versions for
aggregation. When the class distribution vectors were known to all the institutions
or the institution owning the RDTs, the aggregation could be directly made based
on homomorphically encrypted data. If the class distribution vectors were forced
to remain unrevealed, a secure electronic voting protocol was presented to make
decision based on the collected encrypted local vectors. For vertically partitioned
data, fully distributed trees with a specified total number were considered, so
that the sensitive attribute information for each institution was not revealed. Each
random tree was split among multiple institutions and constructed recursively using
the BuildTree procedure in a distributed fashion. It is worth mentioning that this
procedure does not require the transaction set. Subsequently, the statistics of each
node were securely updated based on the training set from multiple institutions using
additively homomorphic encryption. Similarly, instance classification was achieved
by averaging the estimated probabilities from multiple RDTs in a distributed
manner. The proposed RDT algorithm is secure, as neither the attribute values nor
the RDT structure is shared during RDT construction and instance classification.
3.4.3 Evaluation
Sorting algorithms are essential for privacy-preserving distributed data analysis,
including ranked-element queries, group-level aggregation and statistical tests. In
the secure multiparty computation framework, oblivious sorting can be implemented
by hiding the propagation of values in a sorting network or by directly using a
sorting algorithm as a basis. Bogdanov et al. (2014) investigated four different
oblivious sorting algorithms for vertically partitioned data, where two algorithms
improved the existing sorting-network and quicksort algorithms, while the other
two were developed to achieve a low round count for short vectors and low
communication cost for large inputs, respectively. For short vectors, a naive sorting
protocol NaiveCompSort was presented based on oblivious shuffling of input data
and vectorized comparison of shuffled data. Given large inputs, oblivious radix
sorting protocol was developed as an efficient alternative. It leveraged binary count
sorting algorithm to rearrange the input integer vectors based on the sorted digits
in the same positions. Thus, the oblivious radix sorting protocol is efficient, as it
does not require oblivious comparisons. Furthermore, optimization methods were
proposed to improve the efficiency of the oblivious sorting algorithms. For example,
bitwise shared representation and vectorization would allow data parallelization
to reduce the communication cost and complexity for SMC. For sorting network
structure, shuffling the inputs and re-using the generated network could optimize
its implementation, while uniqueness transformation for comparison-based sorting
protocols could avoid information leakage in the sortable vector. It should be noted
that these sorting algorithms could also be generalized to support matrix sorting. The
complexity and performance analysis for all four sorting algorithms, including
detailed running time, network and memory usage, was also presented.
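The counting-sort pass at the heart of the radix protocol is worth seeing in the clear: integers are rearranged stably one bit position at a time, so no pairwise comparisons are ever needed, which is exactly the property that makes the protocol cheap to evaluate under secret sharing. A plain-text sketch (inputs are hypothetical; the oblivious shuffling and sharing layers are omitted):

```python
# LSD binary radix sort: repeated stable partition on one bit position.
# Correct for non-negative integers below 2**bits, and comparison-free.

def radix_sort(values, bits=8):
    for b in range(bits):
        zeros = [v for v in values if not (v >> b) & 1]  # stable split
        ones = [v for v in values if (v >> b) & 1]
        values = zeros + ones
    return values

data = [23, 7, 150, 0, 91, 7, 42]
result = radix_sort(data)   # [0, 7, 7, 23, 42, 91, 150]
```

Stability of each per-bit pass is what makes the overall ordering correct, and each pass touches every element the same way regardless of its value, the data-independence that "oblivious" refers to.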
Makri et al. (2014) proposed a privacy-preserving statistical verification for
clinical research based on the aggregated results from statistical computation. It
leveraged secure multiparty computation primitives to perform evaluations for
Student’s and Welch’s t-test, ANOVA (F-test), chi-squared test, Fisher’s exact test
and McNemar’s test over horizontally partitioned data. The proposed statistical
verification could be outsourced to semi-honest third parties, namely verifiers, as no
private data were exchanged during the verification process. Secure protocols based
on secret sharing were utilized to compute the means and variances. Consequently,
Student’s t-test, Welch’s test and the F-test could be evaluated based on the derived
means and variances. Meanwhile, the chi-squared test, Fisher’s exact test and
McNemar’s test could be performed based on the frequencies in the contingency
table, which would not reveal the individual records in the group of data held by
certain institutions. The proposed mechanism is proven to protect the data security
and privacy in the semi-honest model using secure multiparty computation protocols
from Shamir’s secret sharing.
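To see why only aggregates are needed, consider Welch's t-statistic: it depends solely on each group's count, mean, and sample variance, never on individual records. The sketch below (group aggregates are hypothetical, and in the actual protocol they would be secret-shared rather than passed in the clear) evaluates the statistic from those aggregates alone:

```python
import math

# Welch's t-statistic from per-group aggregates (n, mean, variance):
# a verifier can evaluate the test without ever seeing a record.

def welch_t(n1, m1, v1, n2, m2, v2):
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# aggregates reported by two groups of institutions
t = welch_t(40, 5.2, 1.3, 35, 4.6, 1.7)
```

The same pattern covers the other tests listed: t-tests and the F-test need only counts, means, and variances, while the contingency-table tests need only cell frequencies.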
3.5 Asynchronous Optimization
In Sect. 3.2, we discussed the Newton-Raphson method and ADMM framework
for distributed optimization in federated data analysis. In these methods, all
institutions are typically synchronized for computation at each iteration. However,
these synchronous methods would fail due to unexpected communication delay
and interruption under practical network conditions. Asynchronous optimization
algorithms have been developed to make distributed and parallel optimization
based on local updates from institutions with various delays, e.g. institutions with
different computation and processing speeds and access frequencies. Thus, server
and institutions can proceed without waiting for prerequisite information as in
synchronous methods.
3.5.1 Asynchronous Optimization Based on Fixed-Point
Chazan and Miranker (1969) first proposed an asynchronous parallel method to
solve linear systems of equations with chaotic relaxation. They developed
totally (infinite delay) and partially (bounded delay) asynchronous parallel
algorithms to improve the efficiency of iterative schemes based on a limited
number of updates with a guarantee of convergence. Baudet (1978) extended totally
asynchronous chaotic relaxation to solve fixed-point problems with contracting
operators, where convergence to the fixed point was guaranteed under the
sufficient condition that the operators are contracting.
Bertsekas and Tsitsiklis (1989) introduced gradient-like optimization based
on totally and partially asynchronous parallel algorithms in unconstrained and
constrained optimization problems. Totally asynchronous gradient algorithm was
studied for optimization problems with contraction mapping under the metric of
weighted maximum norm. Its convergence is guaranteed when the diagonal dom-
inance condition is satisfied for its Hessian matrix. Provided finite asynchrony for
communication and computation, the partially asynchronous parallel optimization
is shown to converge when the step size for iterative optimization is sufficiently
small in comparison to the asynchrony measure. Tseng (1991) analyzed the rate
of convergence for the partially asynchronous gradient projection algorithm. Given an
objective function with a Lipschitz continuous gradient, a linear convergence rate can
be achieved when its isocost surfaces can be discriminated and the upper Lipschitz
property holds for at least one of its multivalued functions. Tai and Tseng (2002)
studied the rate of convergence for strongly convex optimization problems based on
asynchronous domain decomposition methods.
Recently, Peng et al. (2016) developed an algorithmic framework AROCK for
asynchronous optimization related to a non-expansive operator T with a fixed point.
Under the assumption of atomic updates, AROCK randomly selects a component of
the primal variable x and updates it without memory locking, using the sub-operator
S = I − T applied to the newly updated variables x̂ within a bounded delay.
Here i_t is the random variable indicating the coordinate selected for atomic
writing at time t. AROCK is demonstrated to achieve convergence under a finite-dimensional
operator. It is also guaranteed to converge to the fixed point with a linear rate,
when the difference between the identity matrix and non-expansive operator is
quasi-strongly monotone. Furthermore, Peng et al. (2016) studied coordinate update
methods for specific optimization problems with coordinate friendly operators.
Remarkably, coordinate update methods can derive and leverage a variety of
coordinate-friendly operators to realize asynchronous versions of coordinate
descent, proximal gradient, ADMM and primal-dual methods.
3.5.2 Asynchronous Coordinate Gradient Descent
A series of partially asynchronous parallel methods have been developed for
coordinate gradient descent. Niu et al. (2011) presented an asynchronous strategy
HOGWILD! to perform stochastic gradient descent in parallel without memory
locking. The presented strategy enabled multiple processors to make gradient
updates in shared memory. For sparse learning problems, HOGWILD! achieves
a sublinear convergence rate, as the error introduced by overlapping gradient
updates is negligible. Liu et al. (2015) proposed an asynchronous stochastic coordinate descent
algorithm AsySCD to improve parallel minimization of smooth convex functions.
When the essential strong convexity condition is satisfied, the minimization
converges at a linear rate. Later, Liu and Wright (2015) developed an
asynchronous parallel algorithm, AsySPCD, to minimize composite objective
functions composed of smooth convex functions using stochastic proximal
coordinate descent. AsySPCD considers the scenario in which data are simultaneously
accessed and updated by multiple institutions. Under the relaxed optimal strong
convexity condition, the proposed algorithm achieves a linear convergence rate.
Remarkably, both AsySCD and AsySPCD can be accelerated at a nearly linear
rate given a sufficient number of processors (institutions). Hsieh et al. (2015) developed
a family of parallel algorithms, PASSCoDe, for stochastic dual coordinate descent
based on the LIBLINEAR software to efficiently solve ℓ2-regularized empirical risk
minimization problems. At each iteration, PASSCoDe allows each institution to
randomly select a dual variable to update the primal variables with coordinate
descent. Under asynchronous settings, three parallel algorithms were developed
to handle the possible violation of the primal-dual relationship in ℓ2-regularized
minimization. PASSCoDe-Lock and PASSCoDe-Atomic maintain the primal-dual
relationship with memory locking and atomic writing, respectively. PASSCoDe-Wild
achieves nearly optimal performance without locking or atomic operations
by performing backward error analysis for memory conflicts.
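As an illustration of the lock-free update pattern behind HOGWILD!, here is a minimal Python sketch in which several threads run stochastic gradient descent on a shared, unlocked parameter vector for a toy sparse least-squares problem. The data, learning rate, and thread count are illustrative assumptions, and CPython's GIL means this emulates, rather than fully exercises, hardware-level update races.

```python
import threading
import numpy as np

rng = np.random.default_rng(1)

# Toy sparse least-squares problem: minimize sum_i (x_i . w - y_i)^2.
# Row sparsity is what makes unsynchronized updates mostly conflict-free.
n_samples, n_features = 1000, 50
X = rng.standard_normal((n_samples, n_features))
X *= rng.random((n_samples, n_features)) < 0.1    # keep ~10% of entries
w_true = rng.standard_normal(n_features)
y = X @ w_true                                    # consistent, noise-free system

w = np.zeros(n_features)   # shared parameter vector, updated with no lock

def worker(seed, steps=20000, lr=0.01):
    r = np.random.default_rng(seed)
    for _ in range(steps):
        i = r.integers(n_samples)
        xi = X[i]
        g = 2.0 * (xi @ w - y[i])      # scalar gradient factor for sample i
        nz = np.nonzero(xi)[0]
        w[nz] -= lr * g * xi[nz]       # lock-free write to touched coords only

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

rel_err = np.linalg.norm(X @ w - y) / np.linalg.norm(y)
print(rel_err)   # small: sparse updates rarely conflict
```

Writing only to the nonzero coordinates of each sample is what keeps concurrent updates from clobbering each other on sparse problems, which is the setting in which the sublinear-rate guarantee holds.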
3.5.3 Asynchronous Alternating Direction Method of Multipliers
Recently, asynchronous ADMM frameworks with consensus constraints have been
studied for distributed optimization under practical network conditions. As with
SMC, two architectures have been developed to realize asynchronous optimization
under the ADMM framework. In the peer-to-peer architecture, institutions are
interconnected, with their delay or inconsistency indicated by asynchronous clocks,
so that optimization proceeds in a decentralized manner. Iutzeler et al. (2013) adopted
randomized Gauss-Seidel iterations of the Douglas-Rachford operator to develop a
generalized asynchronous ADMM framework over multiple institutions. The proposed
asynchronous optimization methods allow partial activation of isolated
decision variables (those not involved in the active constraints) to perform distributed
minimization without coordinating all the institutions.
76 W. Dai et al.
Based on monotone operator theory, a randomized ADMM algorithm was formulated
for institutions with different frequencies of activation. It is demonstrated to converge to
the optimal minimum under a connectivity assumption on the graph derived from
the institutions. Wei and Ozdaglar (2013) proposed an asynchronous algorithm based on
ADMM for separable linearly constrained optimization over multiple institutions.
It establishes a general ADMM-based formulation under asynchronous settings that
iteratively determines a pair of random variables representing the active constraints and
coupled decision variables, and uses them to update the primal and dual variables.
By relating the iterative results under the asynchronous setting to those based on all the
institutions, the proposed asynchronous algorithm is demonstrated to converge at
a rate inversely proportional to the number of iterations.
The spoke-hub architecture maintains a master node (server) to reach consensus over
the primal variables distributed across multiple institutions. Zhang and Kwok (2014)
proposed an asynchronous ADMM algorithm that introduces a partial barrier and
bounded delay to control asynchrony and guarantee convergence. The proposed
async-ADMM algorithm updates the local primal and dual variables for each
institution using the most recent consensus variable. Meanwhile, the master node
requires only a partial subset of the updated primal and dual variables, within a bounded
delay, to update the consensus variable in an asynchronous manner. Under the
bounded delay, institutions with different computation and processing speeds can
be properly accounted for in the global consensus. Convergence holds
for compositions of non-smooth convex local objective functions. Hong
(2014) presented a generalized proximal ADMM framework to solve non-convex
optimization problems with non-smooth objective functions in an asynchronous
and distributed manner. An incremental algorithm was developed for iterative
asynchronous optimization with varying active constraints. At each iteration, the
server updates all the primal and dual variables with the most recent gradients of the
local objective functions from the institutions. The consensus variables are derived
with a proximal operation to compute the gradient in each institution. It should be
noted that the server only requires a subset of institutions with newly computed
gradients to update the primal and dual variables. Thus, the proposed Async-PADMM
algorithm is robust to the asynchrony caused by communication delays
and interruptions.
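For reference, the following sketch implements the synchronous spoke-hub consensus ADMM iteration that these asynchronous variants relax: each institution solves a local least-squares subproblem, the server averages, and the dual variables absorb the disagreement. The problem sizes, the penalty parameter rho, and the least-squares objective are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Spoke-hub (global consensus) ADMM for distributed least squares:
# minimize sum_k ||X_k x - y_k||^2  s.t. each institution's local copy
# x_k equals the server's consensus variable z.
d, K = 5, 3                              # features, institutions
w_true = rng.standard_normal(d)
data = []
for _ in range(K):
    Xk = rng.standard_normal((40, d))
    data.append((Xk, Xk @ w_true))       # each site's private (X_k, y_k)

rho = 1.0
x = [np.zeros(d) for _ in range(K)]      # local primal variables
u = [np.zeros(d) for _ in range(K)]      # scaled dual variables
z = np.zeros(d)                          # consensus variable (server)

for _ in range(200):
    # Local updates (done in parallel at each institution):
    # argmin_x ||X_k x - y_k||^2 + (rho/2)||x - z + u_k||^2  (closed form)
    for k, (Xk, yk) in enumerate(data):
        x[k] = np.linalg.solve(2 * Xk.T @ Xk + rho * np.eye(d),
                               2 * Xk.T @ yk + rho * (z - u[k]))
    # Server update: consensus is the average of x_k + u_k.
    z = sum(x[k] + u[k] for k in range(K)) / K
    # Dual updates absorb the remaining disagreement.
    for k in range(K):
        u[k] += x[k] - z

print(np.linalg.norm(z - w_true))  # near 0
```

Only the local estimates x_k + u_k reach the server; the raw (X_k, y_k) never leave the institutions, which is the privacy argument behind the federated ADMM formulations above. The asynchronous variants let the server proceed with a subset of the x_k under a bounded delay instead of waiting for all K.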
3.6 Discussion and Conclusion
In this chapter, we reviewed privacy-preserving federated data analysis algorithms
in the context of biomedical research. Federated data analysis algorithms aim to
facilitate biomedical research and clinical treatment based on large-scale healthcare
data collected from various organizations, in full compliance with institutional
policies and legislation. Federated models are developed to bring computation to
horizontally or vertically partitioned data rather than explicitly transferring the data to a
central repository. Server/client and decentralized architectures are established to
estimate global model parameters and perform statistical tests based on the aggregated
intermediary results from multiple institutions. As a result, patient-level
data cannot be inferred, provided the data are not distributed in an extremely
unbalanced way. Distributed optimization for federated data analysis is achieved and
generalized by integrating the Newton-Raphson method and the ADMM framework.
For generalized linear regression, the Newton-Raphson method is demonstrated to
achieve accuracy in model parameter estimation and statistical tests equivalent
to that of its centralized counterpart. The ADMM framework splits the objective
function for distributed optimization with auxiliary variables to formulate a decomposable
cost function for multiple institutions. Since it is not necessary
to derive the global consensus from all the local data, ADMM can
support decentralized analysis without a data server. The separate estimates from the
institutions are guaranteed to converge under the consensus constraints based on the
auxiliary variables. To further protect the communication among servers and clients,
secure multiparty computation protocols have been adopted for applications
like regression, classification and evaluation over distributed data; they protect the
process of computation and communication for improved data security and privacy.
Finally, asynchronous parallel optimization can be incorporated to handle real-world
network conditions and improve the efficiency of parallel computation.
Privacy-preserving federated data analysis can be further studied for improved
security and efficiency, and for a more generic formulation in practice. One major challenge
for federated data analysis is distributed model parameter estimation and
statistical testing for vertically partitioned data. Although federated models are shown
to be effective for logistic regression models and SVM classifiers over vertically
partitioned data, an analogous formulation based on dual optimization would fail for
analysis tasks with non-separable objective functions such as the Cox proportional
hazards model. Since each institution only holds a portion of the covariates, the kernel
matrix cannot be explicitly decomposed across the institutions. In such cases,
ADMM tends to be a promising alternative. However, ADMM commonly suffers
from a slow convergence rate and consequently a high communication cost when solving
iterative optimization with decomposable subproblems. A two-loop recursion would
be required to solve the optimization problem based on augmented Lagrangian
functions and the regularized optimization problem on the server or client side. Thus,
it is imperative to develop privacy-preserving algorithms that consider both accuracy
and efficiency for distributed analysis over vertically partitioned data.
The second challenge is the communication cost between institutions and the server.
In the first architecture, a server is required to collect the data information or computation
results from each institution and output the final results, while in the second architecture
all the institutions collaborate in a round-robin manner without a server. El Emam et al.
(2013) showed that privacy risks might exist even when exchanging intermediary results
aggregated from a local patient cohort. Taking horizontally partitioned data
as an example, there is a chance of identifying sensitive patient-level data from the Gram
matrix for linear models, from indicator variables, the covariance matrix, intermediary
results from multiple iterations, or the models themselves. On the other hand, encryption
methods can prevent information leakage when exchanging data but noticeably affect
computational and storage efficiency. For example, homomorphic encryption
significantly increases computation and storage costs when securely outsourcing
data and computations. To protect all raw data and intermediary results, SMC-based
distributed analysis methods require a high communication cost to deploy oblivious
transfer protocols. Moreover, many existing encryption-based methods cannot be
scaled up to handle large biomedical data. As a result, a proper protocol or mechanism
is crucial to balance security and efficiency in institution/server communication.
Acknowledgement This research was supported by the Patient-Centered Outcomes Research
Institute (PCORI) under contract ME-1310-07058, and the National Institutes of Health (NIH)
under award numbers R01GM118574, R01GM118609, R00HG008175, R21LM012060, and

References
Data Protection Act (1998). London: The Stationery Office.
Baudet, G. M. (1978). Asynchronous iterative methods for multiprocessors. Journal of the ACM,
25(2), 226–244.
Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and distributed computation: numerical
methods (Vol. 23). Englewood Cliffs, NJ: Prentice Hall.
Bogdanov, D., Laur, S., & Talviste, R. (2014). A practical analysis of oblivious sorting algorithms
for secure multi-party computation. In K. Bernsmed & S. Fischer-Hübner (Eds.), Secure IT
systems (pp. 59–74). Springer International Publishing.
Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and
statistical learning via the alternating direction method of multipliers. Foundations and Trends
in Machine Learning, 3(1), 1–122.
Brown, J., Balaconis, E., Mazza, M., Syat, B., Rosen, R., Kelly, S., et al. (2012). PS1-
46: HMORNnet: Shared infrastructure for distributed querying by HMORN collaboratives.
Clinical Medicine & Research, 10(3), 163.
Chambless, L. E., & Diao, G. (2006). Estimation of time-dependent area under the ROC curve for
long-term risk prediction. Statistics in Medicine, 25(20), 3474–3486.
Chazan, D., & Miranker, W. (1969). Chaotic relaxation. Linear Algebra and Its Applications, 2(2),
Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society,
Series B, Statistical Methodology, 34(2), 187–220.
Du, W., Han, Y. S., & Chen, S. (2004). Privacy-preserving multivariate statistical analysis: Linear
regression and classification. In Proceedings of the 2004 SIAM international conference on
data mining (pp. 222–233).
El Emam, K., Samet, S., Arbuckle, L., Tamblyn, R., Earle, C., & Kantarcioglu, M. (2013). A secure
distributed logistic regression protocol for the detection of rare adverse drug events. Journal of
the American Medical Informatics Association: JAMIA, 20(3), 453–461.
Fienberg, S. E., Fulp, W. J., Slavkovic, A. B., & Wrobel, T. A. (2006). “Secure” log-linear and
logistic regression analysis of distributed databases. In J. Domingo-Ferrer & L. Franconi (Eds.),
Privacy in statistical databases (pp. 277–290). Berlin: Springer.
Fienberg, S. E., Nardi, Y., & Slavković, A. B. (2009). Valid statistical analysis for logistic
regression with multiple sources. In C. S. Gal, P. B. Kantor, & M. E. Lesk (Eds.), Protecting
persons while protecting the people (pp. 82–94). Berlin: Springer.
Forero, P. A., & Giannakis, G. B. (2010). Consensus-based distributed support vector machines.
Journal of Machine Learning Research: JMLR, 11, 1663–1707.
Goldfarb, D., Ma, S., & Scheinberg, K. (2012). Fast alternating linearization methods for
minimizing the sum of two convex functions. Mathematical Programming A Publication of
the Mathematical Programming Society, 141(1–2), 349–382.
Guo, Y.-F., Lin, X., Teng, Z., Xue, X., & Fan, J. (2013). A covariance-free iterative algorithm for
distributed principal component analysis on vertically partitioned data. Pattern Recognition,
45(3), 1211–1219.
Gymrek, M., McGuire, A. L., Golan, D., Halperin, E., & Erlich, Y. (2013). Identifying personal
genomes by surname inference. Science, 339(6117), 321–324.
Health Insurance Portability and Accountability Act (HIPAA). (n.d.). Retrieved from http://
Homer, N., Szelinger, S., Redman, M., Duggan, D., Tembe, W., Muehling, J., et al. (2008).
Resolving individuals contributing trace amounts of DNA to highly complex mixtures using
high-density SNP genotyping microarrays. PLoS Genetics, 4(8), e1000167.
Hong, M. (2014). A distributed, asynchronous and incremental algorithm for nonconvex optimization:
An ADMM based approach. arXiv [cs.IT]. Retrieved from http://arxiv.org/abs/1412.6058.
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression. New York,
NY: Wiley.
Hsieh C-J., Yu H-F., & Dhillon I. S. (2015). Passcode: Parallel asynchronous stochastic dual co-
ordinate descent. In Proceedings of the 32nd international conference on machine learning
(ICML-15) (pp. 2370–2379).
Iutzeler, F., Bianchi, P., Ciblat, P., & Hachem, W. (2013). Asynchronous distributed optimization
using a randomized alternating direction method of multipliers. In 52nd IEEE conference on
decision and control (pp. 3671–3676). ieeexplore.ieee.org.
Jiang, W., Li, P., Wang, S., Wu, Y., Xue, M., Ohno-Machado, L., & Jiang, X. (2013). WebGLORE:
A web service for Grid LOgistic REgression. Bioinformatics, 29(24), 3238–3240.
Kantarcioglu, M. (2008). A survey of privacy-preserving methods across horizontally partitioned
data. In C. C. Aggarwal & P. S. Yu (Eds.), Privacy-preserving data mining (pp. 313–335).
New York, NY: Springer.
Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations.
Journal of the American Statistical Association, 53(282), 457–481.
Karr, A. F., Lin, X., Sanil, A. P., & Reiter, J. P. (2009). Privacy-preserving analysis of vertically
partitioned data using secure matrix products. Journal of Official Statistics, 25(1), 125–138.
Lafky, D. (2010). The Safe Harbor method of de-identification: An empirical test. In Fourth
national HIPAA summit west.
Liang, Y., Balcan, M. F., & Kanchanapally, V. (2013). Distributed PCA and k-means clustering. In
The big learning workshop at NIPS 2013 (pp. 1–8).
Ling, Q., Shi, W., Wu, G., & Ribeiro, A. (2015). DLM: Decentralized linearized alternating
direction method of multipliers. IEEE Transactions on Signal Processing: A Publication of
the IEEE Signal Processing Society, 63(15), 4051–4064.
Liu, J., & Wright, S. J. (2015). Asynchronous stochastic coordinate descent: Parallelism and
convergence properties. SIAM Journal on Optimization, 25(1), 351–376.
Liu, J., Wright, S. J., Ré, C., Bittorf, V., & Sridhar, S. (2015). An asynchronous parallel stochastic
coordinate descent algorithm. Journal of Machine Learning Research JMLR, 16, 285–322.
Retrieved from http://www.jmlr.org/papers/volume16/liu15a/liu15a.pdf.
Li, Y., Jiang, X., Wang, S., Xiong, H., & Ohno-Machado, L. (2016). VERTIcal Grid lOgistic
regression (VERTIGO). Journal of the American Medical Informatics Association, 23(3),
Lu, C.-L., Wang, S., Ji, Z., Wu, Y., Xiong, L., & Jiang, X. (2014). WebDISCO: A web service
for distributed cox model learning without patient-level data sharing. Journal of the American
Medical Informatics Association, 22(6), 1212–1219.
Makri, E., Everts, M. H., de Hoogh, S, Peter, A., H. op den Akker., Hartel, P., & Jonker, W.
(2014). Privacy-preserving verification of clinical research (pp. 481–500). Presented at the
Sicherheit 2014 - Sicherheit, Schutz und Zuverlässigkeit, Beiträge der 7. Jahrestagung des
Fachbereichs Sicherheit der Gesellschaft für Informatik e.V. (GI), Bonn, Germany: Gesellschaft
für Informatik.
Mangasarian, O. L., Wild, E. W., & Fung, G. M. (2008). Privacy-preserving classification of
vertically partitioned data via random kernels. ACM Transactions on Knowledge Discovery
from Data, 2(3), 12:1–12:16.
Mateos, G., Bazerque, J. A., & Giannakis, G. B. (2010). Distributed sparse linear regression.
IEEE Transactions on Signal Processing: A Publication of the IEEE Signal Processing Society,
58(10), 5262–5276.
Mateos, G., & Giannakis, G. B. (2012). Distributed recursive least-squares: Stability and perfor-
mance analysis. IEEE Transactions on Signal Processing: A Publication of the IEEE Signal
Processing Society, 60(612), 3740–3754.
Mateos, G., & Schizas, I. D. (2009). Distributed recursive least-squares for consensus-based in-
network adaptive estimation. IEEE Transactions on Signal Processing, 57(11), 4583–4588.
Retrieved from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5061644.
McGraw, D. (2008). Why the HIPAA privacy rules would not adequately protect personal health
records: Center for Democracy and Technology (CDT) brief, 1–3.
Mokhtari, A., Shi, W., Ling, Q., & Ribeiro, A. (2016). DQM: Decentralized quadratically approx-
imated alternating direction method of multipliers. IEEE Transactions on Signal Processing: A
Publication of the IEEE Signal Processing Society, 64(19), 5158–5173.
Nardi, Y., Fienberg, S. E., & Hall, R. J. (2012). Achieving both valid and secure logistic
regression analysis on aggregated data from different private sources. Journal of Privacy and
Confidentiality, 4(1), 9.
Niu, F., Recht, B., Re, C., & Wright, S. (2011). Hogwild: A lock-free approach to parallelizing
stochastic gradient descent. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q.
Weinberger (Eds.), Advances in neural information processing systems 24 (pp. 693–701). Red
Hook, NY: Curran Associates, Inc.
Ohno-Machado, L., Agha, Z., Bell, D. S., Dahm, L., Day, M. E., & Doctor, J. N. (2014).
pSCANNER: Patient-centered scalable national network for effectiveness research. Journal
of the American Medical Informatics Association: JAMIA, 21(4), 621–626.
O’Keefe, C. M., Sparks, R. S., McAullay, D., & Loong, B. (2012). Confidentialising survival
analysis output in a remote data access system. Journal of Privacy and Confidentiality, 4(1), 6.
Ouyang, H., He, N., Tran, L., & Gray, A. (2013). Stochastic alternating direction method
of multipliers. In Proceedings of the 30th international conference on machine learning
(pp. 80–88).
PCORnet: The National Patient-Centered Clinical Research Network. (n.d.). Retrieved from http:/
Peng, Z., Wu, T., Xu, Y., Yan, M., & Yin, W. (2016a). Coordinate-friendly structures, algorithms
and applications. Annals of Mathematical Sciences and Applications, 1(1), 57–119.
Peng, Z., Xu, Y., Yan, M., & Yin, W. (2016b). ARock: An algorithmic framework for asynchronous
parallel coordinate updates. SIAM Journal of Scientific Computing, 38(5), A2851–A2879.
PopMedNet. (n.d.). Retrieved from http://www.popmednet.org/.
Que, J., Jiang, X., & Ohno-Machado, L. (2012). A collaborative framework for distributed privacy-
preserving support vector machine learning. In AMIA annual symposium (pp. 1350–1359).
Chicago, IL: AMIA.
Sanil, A. P., Karr, A. F., Lin, X., & Reiter, J. P. (2004). Privacy Preserving regression modelling via
distributed computation. In Proceedings of the tenth ACM SIGKDD international conference
on knowledge discovery and data mining (pp. 677–682). New York, NY, USA: ACM.
Scardapane, S., Wang, D., & Panella, M. (2016). A decentralized training algorithm for Echo State
Networks in distributed big data applications. Neural Networks: The Official Journal of the
International Neural Network Society, 78, 65–74.
Schizas, I. D., & Aduroja, A. (2015). A distributed framework for dimensionality reduction
and denoising. IEEE Transactions on Signal Processing: A Publication of the IEEE Signal
Processing Society, 63(23), 6379–6394.
Shi, H., Jiang, C., Dai, W., Jiang, X., Tang, Y., Ohno-Machado, L., & Wang, S. (2016).
Secure multi-pArty computation grid LOgistic REgression (SMAC-GLORE). BMC Medical
Informatics and Decision Making, 16(Suppl 3), 89.
Slavkovic, A. B., & Nardi, Y. (2007). “Secure” logistic regression of horizontally and verti-
cally partitioned distributed databases. Seventh IEEE International. Retrieved from http://
Sweeney, L., Abu, A., & Winn, J. (2013). Identifying participants in the personal genome project
by name. Available at SSRN 2257732. https://doi.org/10.2139/ssrn.2257732.
Tai, X.-C., & Tseng, P. (2002). Convergence rate analysis of an asynchronous space decomposition
method for convex minimization. Mathematics of Computation, 71(239), 1105–1135.
Tseng, P. (1991). On the rate of convergence of a partially asynchronous gradient projection
algorithm. SIAM Journal on Optimization, 1(4), 603–619.
Vaidya, J. (2008). A survey of privacy-preserving methods across vertically partitioned data. In C.
C. Aggarwal & P. S. Yu (Eds.), Privacy-preserving data mining (pp. 337–358). New York, NY: Springer.
Vaidya, J., & Clifton, C. (2003a). Privacy-preserving K-means clustering over vertically partitioned
data. In Proceedings of the ninth ACM SIGKDD international conference on knowledge
discovery and data mining (pp. 206–215). New York, NY: ACM.
Vaidya, J., & Clifton, C. (2003b). Privacy preserving naive bayes classifier for vertically partitioned
data. In Proceedings of the ninth acm SIGKDD international conference on knowledge
discovery and data mining.
Vaidya, J., Clifton, C., Kantarcioglu, M., & Patterson, A. S. (2008a). Privacy-preserving decision
trees over vertically partitioned data. ACM Transactions on Knowledge Discovery from Data,
2(3), 14:1–14:27.
Vaidya, J., Kantarcıoğlu, M., & Clifton, C. (2008b). Privacy-preserving naive bayes classification.
Journal on Very Large Data Bases, 17(4), 879–898.
Vaidya, J., Shafiq, B., Fan, W., Mehmood, D., & Lorenzi, D. (2014). A random decision tree
framework for privacy-preserving data mining. IEEE Transactions on Dependable and Secure
Computing, 11(5), 399–411.
Vaidya, J., Shafiq, B., Jiang, X., & Ohno-Machado, L. (2013). Identifying inference attacks against
healthcare data repositories. In AMIA joint summits on translational science proceedings. AMIA
summit on translational science,2013, (pp. 262–266).
Wang, H., & Banerjee, A. (2012). Online alternating direction method. In J. Langford & J. Pineau
(Eds.), Proceedings of the 29th international conference on machine learning (ICML-12) (pp.
1119–1126). New York, NY: Omnipress.
Wang, R., Li, Y. F., Wang, X., Tang, H., & Zhou, X. (2009). Learning your identity and disease
from research papers: information leaks in genome wide association study. In Proceedings of
the 16th ACM conference on computer and communications security (pp. 534–544). New York,
Wang, S., Jiang, X., Wu, Y., Cui, L., Cheng, S., & Ohno-Machado, L. (2013). EXpectation
propagation LOgistic REgRession (EXPLORER ): Distributed privacy-preserving online
model learning. Journal of Biomedical Informatics, 46(3), 1–50.
Weibull, W. (1951). A statistical distribution function of wide applicability. Journal of Applied
Mechanics, 18(3), 293–297.
Wei, E., & Ozdaglar, A. (2013). On the O(1/k) convergence of asynchronous distributed
alternating direction method of multipliers. In Global conference on signal and information
processing (GlobalSIP), 2013 IEEE (pp. 551–554). ieeexplore.ieee.org.
Wu, Y., Jiang, X., Kim, J., & Ohno-Machado, L. (2012a). Grid binary LOgistic REgression
(GLORE): Building shared models without sharing data. Journal of the American Medical
Informatics Association, 2012(5), 758–764.
Wu, Y., Jiang, X., & Ohno-Machado, L. (2012b). Preserving institutional privacy in distributed
binary logistic regression. In AMIA annual symposium (pp. 1450–1458). Chicago, IL: AMIA.
Wu, Y., Jiang, X., Wang, S., Jiang, W., Li, P., & Ohno-Machado, L. (2015). Grid multi-category
response logistic models. BMC Medical Informatics and Decision Making, 15(1), 1–10.
Yao, A. C. (1982). Protocols for secure computations. In Foundations of computer science, 1982.
SFCS ’82. 23rd annual symposium on (pp. 160–164). ieeexplore.ieee.org.
Yu, H., Jiang, X., & Vaidya, J. (2006a). Privacy-preserving SVM using nonlinear kernels
on horizontally partitioned data. In Proceedings of the 2006 ACM symposium on applied
computing (pp. 603–610). New York, NY: ACM.
Yu, H., Vaidya, J., & Jiang, X. (2006b). Privacy-preserving SVM classification on vertically
partitioned data. Lecture Notes in Computer Science,3918 LNAI, 647–656.
Yunhong, H., Liang, F., & Guoping, H. (2009). Privacy-Preserving svm classification on vertically
partitioned data without secure multi-party computation. In 2009 fifth international conference
on natural computation (Vol. 1, pp. 543–546). ieeexplore.ieee.org.
Yu, S., Fung, G., Rosales, R., Krishnan, S., Rao, R. B., Dehing-Oberije, C., & Lambin, P. (2008).
Privacy-preserving cox regression for survival analysis. In Proceedings of the 14th ACM
SIGKDD international conference on Knowledge discovery and data mining (pp. 1034–1042).
New York, NY: ACM.
Zhang, R., & Kwok, J. (2014). Asynchronous distributed ADMM for consensus optimization. In
Proceedings of the 31st international conference on machine learning (pp. 1701–1709).
Chapter 4
Word Embedding for Understanding Natural
Language: A Survey
Yang Li and Tao Yang
4.1 Introduction
Natural language understanding from text data is an important field in Artificial
Intelligence. As images and acoustic waves can be mathematically modeled by
analog or digital signals, we also need a way to represent text data in order to process
it automatically. For example, the sentence “The cat sat on the mat.” can not be
processed or understood directly by the computer system. The easiest way is to rep-
resent is through a sparse discrete vector f.icat;1/;.imat;1/;.ion;1/;.isit;1/;.ithe;2/g,
where iwdenotes the index of word win the vocabulary. This is called one-hot
embedding. However, there are several disadvantages for this simple model. First,
it generates high dimensional vectors whose length depends on the volume of
the vocabulary, which is usually very large. Meanwhile, the semantic relationship
between words (e.g., “sit on”) cannot be reflected by these separate counts. Due
to the subjectivity of languages, the meaning of word (phrase) varies in different
contexts. This makes it a more challenging task to automatically process text data.
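The sparse count vector above can be produced in a few lines. This is a minimal sketch; the toy vocabulary and its index assignments are assumptions for the example sentence.

```python
from collections import Counter

# Sparse count ("one-hot"-style) representation of "The cat sat on the mat."
vocab = {"cat": 0, "mat": 1, "on": 2, "sat": 3, "the": 4}   # assumed indices
tokens = "the cat sat on the mat".split()

counts = Counter(tokens)
sparse_vec = {vocab[w]: c for w, c in counts.items()}
print(sparse_vec)   # {4: 2, 0: 1, 3: 1, 2: 1, 1: 1}
```

With a realistic vocabulary of 10^5 words, this dictionary would stand in for a 10^5-dimensional vector with only five nonzero entries, which is exactly the high-dimensionality problem the chapter describes.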
A goal of language modeling is to learn the joint probability that a given word
sequence occurs in text. A larger probability value indicates that the sequence is
more commonly used. One candidate solution for capturing the relationship between
words is the “n-gram” model, where n is a manually chosen hyperparameter.
The “n-gram” model takes into consideration phrases of n consecutive words. In
the example above, “the cat”, “cat sat”, “sat on”, “on the”, and “the mat” will be counted
if we set n = 2. The disadvantage of the “n-gram” model is that it is constrained by
the parameter n. Moreover, the intrinsic difficulty of language modeling is that a
word sequence used to test the model is likely to be different from all the
sequences in the training set (Bengio et al. 2003). To make this more concrete, suppose
Y. Li • T. Yang (✉)
School of Automation, Northwestern Polytechnical University, Xi’an, Shaanxi 710072, P.R. China
e-mail: liyangnpu@mail.nwpu.edu.cn; yangtao107@nwpu.edu.cn
© Springer International Publishing AG 2018
S. Srinivasan (ed.), Guide to Big Data Applications, Studies in Big Data 26,
DOI 10.1007/978-3-319-53817-4_4
we want to model the joint distribution of 5-word sequences over a vocabulary of
100,000 words, the possible number of combinations is 100,000^5 − 1 = 10^25 − 1,
which is also the number of free parameters to learn. It is prohibitively large for
further processing. This phenomenon is called “the curse of dimensionality”. The
root of this curse is the “generalization” problem. We need an effective mechanism
to extrapolate the knowledge obtained during training to the new cases. For discrete
spaces mentioned above, the structure of generalization is not obvious, i.e., any
change in the discrete variables, as well as in their combinations, has a drastic
influence on the value of the joint distribution to be estimated.
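The n-gram counts for the example sentence, and the parameter count behind the curse of dimensionality, can be checked directly. This is a minimal sketch; the tokenization is simply the lowercased example sentence.

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
n = 2
# n-gram extraction: zip n staggered copies of the token list.
ngrams = Counter(zip(*(tokens[i:] for i in range(n))))
print(ngrams)
# "the cat", "cat sat", "sat on", "on the", "the mat": one count each

# Curse of dimensionality: the joint distribution of 5-word sequences
# over a 100,000-word vocabulary has 100000**5 - 1 free parameters.
print(100000 ** 5 - 1)  # 10**25 - 1
```

Raising n captures longer phrases but makes the count table exponentially larger and sparser, which is the generalization problem the continuous-space models below are designed to avoid.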
To solve the dilemma, we can model the variables in a continuous space, where
we can think of how training points trigger probability mass and distribute it
smoothly to the neighborhood around them. Then it goes back to the problem
of word embedding, whose concept was first introduced in Hinton (1986). Word
embedding, sometimes named as word representation, is a collective name for
a set of language models and feature selection methods. Its main goal is to
map textual words or phrases into a low-dimensional continuous space. Using
the example above, “cat” can be denoted as [0.1, 0.3, ..., 0.2] and “mat” can
be expressed as [0.1, 0.2, ..., 0.4], both vectors of length M, where M is a
hyperparameter. After that, advanced NLP tasks can be implemented based on
these real-valued vectors. Word embedding encodes the semantic and syntactic
information of words, where semantic information mainly correlates with the
meaning of words, while syntactic information refers to their structural roles; it is
a basic procedure in natural language processing. At a high level, most of the
models optimize a loss function that minimizes the discrepancy between prediction
values and target values. A basic assumption is that words in similar contexts should
have similar meanings (Harris 1954). This hypothesis emphasizes the bond of (w, w̃)
for common word-context pairs (a word w and its contextual word w̃ usually appear
together) and weakens the correlation of rare ones.
The existing word embedding approaches are diverse, and there are several ways to
group them into categories. According to Sun et al. (2015) and Lai et al.
(2015), the models can be classified as either paradigmatic models or syntagmatic
models, based on the word distribution information they use. The text region where words
co-occur is the core of the syntagmatic model, while for the paradigmatic model
it is the similar context that matters. Taking “The tiger is a fierce animal.” and
“The wolf is a fierce animal.” as examples, “tiger-fierce” and “wolf-fierce” are
syntagmatic pairs, while “wolf-tiger” is a paradigmatic pair. Shazeer et al.
(2016) divide the models into two classes, matrix factorization and sliding-window
sampling methods, according to how the word embedding is generated. The former is
based on the word co-occurrence matrix, where word embedding is obtained from
matrix decomposition. For the latter, data sampled from sliding windows are used
to predict the context word.
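As a minimal illustration of the matrix-factorization route (an illustrative sketch, not any specific published model): build a sliding-window word-word co-occurrence matrix from a toy corpus, then take a truncated SVD. Paradigmatic words such as "wolf" and "tiger", which share contexts, receive nearly identical embeddings. The corpus, window size, and dimension M are assumptions for the demonstration.

```python
import numpy as np

# Tiny corpus echoing the chapter's example sentences.
corpus = [
    "the cat sat on the mat",
    "the wolf is a fierce animal",
    "the tiger is a fierce animal",
]
sentences = [s.split() for s in corpus]
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# Word-word co-occurrence counts within a sliding window of size 2.
window = 2
C = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for i, w in enumerate(s):
        for j in range(max(0, i - window), min(len(s), i + window + 1)):
            if j != i:
                C[idx[w], idx[s[j]]] += 1

# Truncated SVD: rows of U[:, :M] * S[:M] serve as M-dimensional embeddings.
M = 3
U, S, _ = np.linalg.svd(C)
emb = U[:, :M] * S[:M]

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "wolf" and "tiger" have identical context rows, so their embeddings align.
print(cos(emb[idx["wolf"]], emb[idx["tiger"]]))   # close to 1.0
```

Real systems replace the raw counts with a reweighted matrix (e.g., PMI) before factorizing, but the pipeline of count, factorize, and compare by cosine similarity is the same.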
In this chapter, word embedding approaches are introduced according to the
method of mapping words to latent spaces. Here we mainly focus on the Neural
Network Language Model (NNLM) (Xu and Rudnicky 2000; Mikolov et al. 2013)
and the Sparse Coding Approach (SPA) (Yogatama et al. 2014a). Vector Space
Model aims at feature expression. A word-document matrix is first constructed,
where each entry counts the occurrence frequency of a word in documents. Then
embedding vectors containing semantic information of words are obtained through
probability generation or matrix decomposition. In the Neural Network Language
Model, fed with training data, word embedding is encoded as the weights of a certain
layer in the neural network. There are several types of network architectures, such
as Restricted Boltzmann Machine, Recurrent Neural Network, Recursive Neural
Network, Convolutional Neural Network and Hierarchical Neural Network. They
are able to capture both the semantic and syntactic information. Sparse coding
model is another state-of-the-art method to get word embedding. Its goal is to
discover a set of bases that can represent the words efficiently.
The remainder of this chapter is organized as follows: the models mentioned above
and the evaluation methods are introduced in Sect. 4.2; the applications of word
embedding are covered in Sect. 4.3; conclusions and future work are in Sect. 4.4.
4.2 Word Embedding Approaches and Evaluations
Generally speaking, the goal of word embedding is to map the words in unlabeled
text data to a continuously-valued, low-dimensional space, in order to capture the
internal semantic and syntactic information. In this section, we first introduce the
background of text representation. Then, according to the specific methods for
generating the mapping, we mainly focus on the Neural Network Language Model
and the Sparse Coding Approach. The former is further introduced in three parts,
depending on the network structures applied in the model. In the end, we provide
evaluation approaches for measuring the performance of word embedding models.
4.2.1 Background
One of the most widely used approaches for expressing text documents is the Vector
Space Model (VSM), where documents are represented as vectors. VSM was
originally developed for the SMART information retrieval system (Salton et al.
1997). Some classical VSMs can be found in Deerwester et al. (1990) and Hofmann.
Y. Li and T. Yang

There are various ways of building VSMs. In information retrieval scenarios,
where people care more about the textual features that facilitate text categorization,
feature selection methods such as document frequency (DF), information
gain (IG), mutual information (MI), the χ2-test (CHI-test) and term strength (TS)
have different effects on text classification (Yang and Pedersen 1997). These
approaches help reduce the dimension of the text data, which can be helpful
for subsequent processing. DF refers to the number of documents in which a word
appears. IG measures the information gained from the presence versus the absence of a term in
the document. MI is the ratio between the term-document joint probability and the
product of their marginal probabilities. The CHI-test applies the sums of squared errors
and tries to find significant differences between observed word frequencies and
expected word frequencies. TS estimates how likely a term is to appear in "closely-
related" documents. In information retrieval systems, these methods are useful for
transforming the raw text into the vector space, and tremendous improvements have
been achieved for information classification. However, feature selection alone does not work well
in applications that require both the semantic and syntactic information of words, since
part of the original text information is already lost in the feature selection procedure.
Nevertheless, various word embedding models, such as sparse coding, are built on top of the VSM.
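The DF and MI measures above can be sketched in a few lines. The toy corpus, labels, and counts below are invented for illustration, and MI is written in its common log-ratio form (joint probability over the product of the marginals):

```python
import math

# Toy labeled corpus; documents, tokens and labels are invented for illustration.
docs = [
    (["cheap", "pills", "buy"], "spam"),
    (["buy", "cheap", "watches"], "spam"),
    (["meeting", "agenda", "notes"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]

def document_frequency(term):
    """DF: the number of documents in which the term appears."""
    return sum(term in set(tokens) for tokens, _ in docs)

def mutual_information(term, label):
    """MI as the log ratio of the joint probability to the product of marginals."""
    n = len(docs)
    p_t = document_frequency(term) / n
    p_c = sum(lab == label for _, lab in docs) / n
    p_tc = sum(term in set(toks) and lab == label for toks, lab in docs) / n
    return math.log(p_tc / (p_t * p_c)) if p_tc > 0 else float("-inf")

print(document_frequency("meeting"))                  # 2
print(round(mutual_information("cheap", "spam"), 3))  # log 2 ~ 0.693
```

Terms whose MI with a class is high are kept as features for that class, which is how such scores reduce the dimension of the text representation.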
In Deerwester et al. (1990), Latent Semantic Analysis (LSA) is proposed to index
and retrieve information from text data automatically. It applies SVD on the
term-document association data to obtain the embedding output. Work in Landauer
et al. (1998) and Landauer and Dumais (1997) gives an explanation of the word
similarity that derives from the Test of English as a Foreign Language (TOEFL).
After that, Landauer (2002) tries to detect synonyms through LSA. Furthermore,
Yih et al. (2012) uses LSA to distinguish between synonyms and antonyms in
documents, and its extension Multiview LSA (MVLSA) (Rastogi et al. 2015)
supports the fusion of arbitrary views of the data. However, all of these methods are
limited by their dependence on a single matrix of term-document co-occurrences.
Usually the word embedding from such models is not flexible enough, because of
the strong reliance on the observed matrix.
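The LSA pipeline just described (term-document matrix, then SVD) can be sketched with a toy count matrix; the matrix values and the choice of k = 2 are invented for illustration:

```python
import numpy as np

# Toy term-document count matrix (5 terms x 4 documents); counts are invented.
X = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
    [1, 1, 1, 1],
], dtype=float)

# LSA: truncated SVD of the term-document matrix gives k-dimensional word vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_embedding = U[:, :k] * s[:k]      # each row is one word's k-dim vector

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Terms 0 and 1 share documents, term 2 does not, so 0 and 1 end up closer.
print(word_embedding.shape)            # (5, 2)
print(cos(word_embedding[0], word_embedding[1]) > cos(word_embedding[0], word_embedding[2]))
```

The "single matrix" limitation mentioned above is visible here: the embedding depends entirely on the observed counts in X.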
Latent Dirichlet Allocation (LDA) (Blei et al. 2003) is another useful
model for feature expression. It assumes that some latent topics are contained in
documents. The latent topic T is characterized by the conditional distribution p(w|T),
i.e. the probability that word w appears in T. Word embedding from traditional LDA
models can only capture topic information but not syntactic information, which does
not fully achieve the goal of word embedding. Moreover, the traditional LDA-based model is
only used for topic discovery, not word representation. However, we can obtain a k-dimensional
word embedding by training a k-topic model: the word embedding comes from the rows of the
word-topic matrix, whose entries are filled with p(w|T) values.
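The last point can be made concrete. Assuming a k-topic model has already been trained, the k-dimensional embedding of word w is just its vector of p(w|T) values across topics; the probabilities below are invented for illustration:

```python
import numpy as np

# Suppose a k-topic LDA model has been trained; topic_word[t, w] holds p(w | T=t).
# This 3-topic x 5-word matrix is invented for illustration.
topic_word = np.array([
    [0.4, 0.3, 0.1, 0.1, 0.1],   # p(w | T=0)
    [0.1, 0.1, 0.4, 0.3, 0.1],   # p(w | T=1)
    [0.1, 0.1, 0.1, 0.1, 0.6],   # p(w | T=2)
])

# Word w is represented by (p(w | T=0), ..., p(w | T=k-1)),
# i.e. its row of the word-topic matrix.
word_embedding = topic_word.T          # shape (vocab_size, k)
print(word_embedding.shape)            # (5, 3)
print(word_embedding[0])               # embedding of word 0
```

Such embeddings carry only topical information, which is exactly the limitation noted above.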
Although both LSA and LDA utilize statistical information directly for
word embedding generation, there are some differences between them. Specifically,
LSA is based on matrix factorization and is subject to the non-negativity constraint.
LDA relies on the word distribution, which is expressed by the Dirichlet prior
distribution, the conjugate prior of the multinomial distribution.
4.2.2 Neural Network Language Model
The concept of word embedding was first introduced with the Neural Network
Language Model (NNLM) in Xu and Rudnicky (2000) and Bengio et al. (2003).
The motivation behind it is that words are more likely to share similar meanings if they
are in the same context (Harris 1954). The probability that the word sequence W
occurs can be formulated according to the Bayes rule:

P(W) = ∏_t P(w_t | h_t)   (4.1)

where P(W) is the joint distribution of the sequence W, and h_t denotes the context
words around word w_t. The goal is to evaluate the probability that the word w_t
appears given its context information. Because there are a huge number of possible
combinations of a word and its context, it is impractical to specify all P(w_t | h_t). In
this case, we can use a function Φ to map the context into equivalence classes,
so that we have

P(w_t | h_t) = P(w_t | Φ(h_t))   (4.2)

where contexts mapped to the same class are treated as statistically equivalent.

If we use the n-gram model to construct a table of conditional probabilities for each
word, then the last n-1 words are combined together as the context information:

P(w_t | h_t) ≈ P(w_t | w_{t-n+1}, ..., w_{t-1})   (4.3)

where n is the number of words considered, with 1 and 2 being the most
commonly used values. Only the successive words occurring before the current word
w_t are taken into account.
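A minimal maximum-likelihood sketch of Eq. (4.3) for n = 2, where the context is just the previous word; the toy corpus reuses the tiger/wolf example from above:

```python
from collections import Counter

# Bigram model (n = 2): the context h_t is just the previous word.
corpus = "the tiger is a fierce animal the wolf is a fierce animal".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_next(word, prev):
    """Maximum-likelihood estimate of P(w_t | w_{t-1})."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_next("fierce", "a"))    # 1.0: "a" is always followed by "fierce"
print(p_next("tiger", "the"))   # 0.5: "the" precedes both "tiger" and "wolf"
```

The table of counts grows quickly with n, which is one motivation for the neural models discussed next.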
Most NNLM-based approaches belong to the class of unsupervised models.
Networks with various architectures, such as the Restricted Boltzmann Machine (RBM),
Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) and
Long Short-Term Memory (LSTM), can be used to build word embedding. The Log-
Bilinear (LBL) model, SENNA, and Word2Vec are some of the most representative
examples of NNLM.
Usually the goal of the NNLM is to maximize or minimize a log-likelihood
objective, sometimes with additional constraints. Suppose we have a sequence
w_1, w_2, ..., w_n in the corpus, and we want to maximize the log-likelihood of
P(w_t | w_{t-n+1}, ..., w_{t-1}) in the Feed Forward Neural Network (Bengio et al. 2003). Let x be
the concatenated embedding vector of the preceding words, so that

x = [e(w_{t-n+1}), ..., e(w_{t-2}), e(w_{t-1})]   (4.4)

where e(w) represents the embedding of word w. In a Feed Forward Neural Network
(without direct edges in the structure) with one hidden layer, the layer function
can be expressed as:

y = b + U tanh(d + Wx)   (4.5)
Fig. 4.1 The basic structure of the Feed Forward Neural Network
where U is the transformation matrix, W is the weight matrix of the hidden layer, and b and
d are bias vectors. Finally, y is fed into a softmax layer to obtain the probability
of the target word; the basic structure of this model is shown in Fig. 4.1. The
parameters in this model are (b, U, d, W). However, since we need to explicitly
normalize (e.g. using softmax) over all of the values in the vocabulary when computing
the probability for the next word, training and testing are very costly
(Morin and Bengio 2005). After estimating the conditional probability in Eq. (4.3),
the total probability P(w_{t-n+1}, ..., w_{t-1}, w_t) can be obtained directly by locating the
mark of the phrase constructed by the words that appear in the same window
(Collobert et al. 2011). The resultant probability value quantifies how natural
such a sentence is in the language (more details can be found in Collobert et al. 2011).
Instead of using matrix multiplication on the final layer (Bengio et al. 2003),
Mnih and Hinton (2007) apply a hierarchical structure (see the hierarchical models below) to reduce the
computational complexity. They construct the Log-BiLinear (LBL) model by adding a
bilinear interaction between word vectors and hidden variables on the basis of the
energy function (see the RBM subsection below).
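Eqs. (4.4)-(4.5) plus the softmax output can be sketched as a single forward pass in NumPy. All sizes and the random initialization below are invented, and no training is performed; the sketch only shows the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_emb, H = 10, 4, 8              # vocabulary size, embedding dim, hidden units
E = rng.normal(size=(V, d_emb))     # embedding table e(w)

context = [3, 7]                    # indices of w_{t-2}, w_{t-1} (n = 3)
x = np.concatenate([E[w] for w in context])    # Eq. (4.4)

# Parameters (b, U, d, W) of Eq. (4.5), randomly initialized (no training here).
W = rng.normal(size=(H, x.size))
d_bias = rng.normal(size=H)
U = rng.normal(size=(V, H))
b = rng.normal(size=V)

y = b + U @ np.tanh(d_bias + W @ x)            # Eq. (4.5)
p = np.exp(y - y.max())
p /= p.sum()                                   # softmax over the vocabulary

print(p.shape, round(float(p.sum()), 6))       # (10,) 1.0
```

The final normalization touches every vocabulary entry, which is exactly the cost that the hierarchical models below are designed to avoid.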
Recently, Word2Vec, proposed by Mikolov et al. (2013), which contains
the two frameworks Skip-Gram and Continuous Bag-of-Words (CBOW), has become very
popular in NLP tasks. It can be seen as a two-layer Neural Network Language
Model. Furthermore, applying Noise Contrastive Estimation (NCE) increases its
efficiency. We can also add prior information to this model: Liu
et al. (2015) extend Skip-Gram by treating topical information as important prior
knowledge for training word embedding and propose the topical word embedding
(TWE) model, where text words and their affiliated topics (contexts) derived from
LDA are combined to obtain the embedding.
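The sliding-window sampling that Skip-Gram relies on can be sketched as pair generation: each center word is paired with every word inside its window. This is a sketch of the data preparation only, not of the training itself:

```python
# Skip-Gram training pairs from a sliding window: each center word
# predicts the words around it within the given window size.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the wolf is a fierce animal".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs[:4])   # [('the', 'wolf'), ('wolf', 'the'), ('wolf', 'is'), ('is', 'wolf')]
```

These (center, context) pairs are then fed to the two-layer network, whose input weights become the word embedding.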
In the subsections below, we will introduce some basic architectures of NNLM
for word embedding.

Restricted Boltzmann Machine (RBM)
The ability to capture the latent features among words usually depends on the
model structure. The Boltzmann Machine (BM) originates from the log-linear Markov
Random Field (MRF), and its energy function is linearly related to its free parameters.
According to statistical dynamics, the energy function is useful in word embedding
generation (Mnih and Hinton 2007). The Restricted Boltzmann Machine (RBM), which
prunes the visible-visible and hidden-hidden connections, is a simplified version of
the BM. A diagram of the RBM is given in Fig. 4.2.

The main components of the energy function in the RBM are the binary variables
(hidden units h and visible units v), the weights W = (w_{i,j}) that establish connections
between h_j and v_i, and the biases a_i for the visible units and b_j for the hidden ones.
Specifically, the energy function is formulated as follows:

E(v, h) = -v^T W h - a^T v - b^T h   (4.6)

During the word embedding generation process, the energy function is used to define the
probability distribution, as shown in Eq. (4.7):

P(v, h) = e^{-E(v,h)} / Z   (4.7)

where Z is the partition function, defined as the sum of e^{-E(v,h)} over all
configurations of v and h.
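Eqs. (4.6)-(4.7) can be checked numerically on a tiny RBM. The sizes and weights below are invented, and the partition function Z is computed by brute force, which is only feasible at this toy scale (2^10 configurations):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
nv, nh = 6, 4                                    # visible and hidden units

v = rng.integers(0, 2, size=nv).astype(float)    # a binary visible vector
h = rng.integers(0, 2, size=nh).astype(float)    # a binary hidden vector
W = rng.normal(scale=0.1, size=(nv, nh))         # visible-hidden weights
a = np.zeros(nv)                                 # visible biases
b = np.zeros(nh)                                 # hidden biases

def energy(v, h):
    """Eq. (4.6): E(v, h) = -v^T W h - a^T v - b^T h."""
    return float(-v @ W @ h - a @ v - b @ h)

# Eq. (4.7): P(v, h) = e^{-E(v,h)} / Z, with Z summed over all configurations.
Z = sum(np.exp(-energy(np.array(vv, float), np.array(hh, float)))
        for vv in product([0, 1], repeat=nv)
        for hh in product([0, 1], repeat=nh))
p = np.exp(-energy(v, h)) / Z
print(0.0 < p < 1.0)
```

For realistic vocabulary sizes Z is intractable, which is why RBM-based language models rely on approximate training procedures.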
Fig. 4.2 The basic structure of the RBM
Mnih and Hinton (2007) propose three models on the basis of the RBM by adding
connections to the previous states. Starting from the undirected graphical model,
they try to estimate the latent conditional distribution of words. To compute the
probability of the next word in a sequence, an energy function is defined in which
v_i^T R denotes the vector representing word w_i from the word dictionary R,
W_i in the hidden units denotes the similarity between two words, and b_h, b_v, b_r are the
biases for the hidden units, words and word features respectively.

The disadvantage of this class of models is that it is costly to estimate the massive
number of parameters during training. To reduce the number of parameters,
Mnih and Hinton (2007) extend the factored RBM language model by adding
connections C among words; the extension also removes the hidden variables and directly
applies stochastic binary variables. In the modified energy function, C_i denotes the
correlation between the word vectors w_i and w_t, and b_r and b_v denote the
word biases. This function quantifies the bilinear interaction between words, which
is also the reason why models of this kind are called Log-BiLinear Language (LBL)
models. In Eq. (4.9), the entries of the vector h = Σ_{i=1}^{t-1} v_i^T R C_i correspond to the nodes
in the hidden layer, and y_i = h R^T v_n represents the set of nodes in the output layer.
We can see that syntactic information is contained in the hidden layer, for C_i here
can be seen as the contribution of word i to the current word n, which resembles a cosine
similarity between two words.

Recurrent and Recursive Neural Network
To reduce the number of parameters, we can unify the layer functions through
shared, repeated parameters. By treating text as sequential data, Mikolov et al. (2010)
propose a language model based on the Recurrent Neural Network, whose structure
is shown in Fig. 4.3. The model has a circular architecture. At time t, the word
embedding w(t) is generated by the first layer. Then it is transferred, together
with the output of the context layer at time t-1, as the new input x(t) to the
context layer, which is formulated as follows:

x(t) = w(t) + s(t-1)   (4.10)

Fig. 4.3 The structure of the recurrent neural network

Fig. 4.4 The structure of the Recursive Neural Network

In each cycle, the outcomes of the context layer and the output layer are denoted as s(t) and
y(t) respectively. Inspired by Bengio et al. (2003), the output layer also applies the
softmax function.
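A minimal NumPy sketch of this recurrence: the word embedding and the previous context state are combined into x(t) (here by addition, cf. Eq. (4.10)), squashed into s(t), and fed through a softmax output layer. All sizes and weights are invented, and no training is performed:

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 8, 5                                  # vocabulary size, state dimension

W_in = rng.normal(scale=0.1, size=(d, V))    # word w(t) -> input contribution
W_ctx = rng.normal(scale=0.1, size=(d, d))   # recurrence on s(t-1)
W_out = rng.normal(scale=0.1, size=(V, d))   # state -> vocabulary scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

s = np.zeros(d)                              # context layer state
for w in [3, 1, 4]:                          # a toy word-index sequence
    x = W_in[:, w] + W_ctx @ s               # combine w(t) and s(t-1), cf. Eq. (4.10)
    s = np.tanh(x)                           # context layer output s(t)
    y = softmax(W_out @ s)                   # output layer y(t)

print(y.shape, round(float(y.sum()), 6))     # (8,) 1.0
```

Because s(t) depends on the whole prefix of the sequence, the context is not limited to a fixed n-gram window.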
Unlike the Recurrent Neural Network, the Recursive Neural Network (Goller and
Kuchler 1996) unfolds over space rather than time, as shown in Fig. 4.4. To fully analyze
the sentence structure, the process starts from two word vectors in the leaf nodes
b, c, which are merged into a hidden output h1; this output then concatenates with another
input word vector a as the next input. The computing order is determined either by
the sentence structure or by the graph structure. Not only can the hidden layers capture
context information, they can also involve syntactic information. Therefore, the
Recursive Neural Network can capture more information than the context-based
model, which is desirable for NLP tasks.

Fig. 4.5 (A) The structure of the Recursive Neural Network model where each node represents
a vector and all the word vectors are in the leaf nodes. (B) The structure of the MV-RNN model
where each node consists of a vector and a matrix
For sequential text data in the Recursive Neural Network, Socher et al. (2011,
2013, 2012) parse the sentences into tree structures through tree bank models
(Klein and Manning 2003; Antony et al. 2010). The model proposed by Socher
et al. (2011) employs the Recursive Neural Network, and its structure is shown
in Fig. 4.5A, where the parent node is computed as p = f(W[a; b]).
Here a, b ∈ R^{d×1} are the word vectors, the parent node p ∈ R^{d×1} is the hidden output,
f = tanh adds element-wise nonlinearity to the model, and W ∈ R^{d×2d} is the
parameter to learn.
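The composition step p = f(W[a; b]) with f = tanh can be sketched directly; the dimension d and the random vectors below are invented, and the bias term is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
W = rng.normal(scale=0.5, size=(d, 2 * d))   # composition parameter, W in R^{d x 2d}

def compose(left, right):
    """Parent vector p = f(W [left; right]) with f = tanh."""
    return np.tanh(W @ np.concatenate([left, right]))

# Invented leaf word vectors for a phrase parsed as (a (b c)):
a, b, c = (rng.normal(size=d) for _ in range(3))
h1 = compose(b, c)        # merge two leaf vectors into a hidden output
root = compose(a, h1)     # the hidden output combines with the next word vector
print(root.shape)         # (4,)
```

Because the same W is reused at every node, the number of parameters stays constant no matter how deep the parse tree grows.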
To incorporate more information in each node, the work in Socher et al. (2012)
proposes the Matrix-Vector Recursive Neural Network (MV-RNN), whose
parsing-tree structure is shown in Fig. 4.5B. Different from Fig. 4.5A, each node
in this parsing tree includes both a vector and a matrix. The vector
represents the word vector in leaf nodes and the phrase in non-leaf ones. The matrix
is applied to neighboring vectors, and it is initialized as the identity matrix I plus
Gaussian noise. The vectors a, b, p capture the inherent meaning transferred in
the parsing tree, while the matrices A, B, P capture the changes of meaning
for the neighboring words or phrases. An apparent drawback of this model is
the large number of parameters to be estimated, so in the work of Socher et al.
(2013), the Recursive Neural Tensor Network (RNTN) uses a tensor-based
composition function to avoid this problem.
In comparison, the main differences between the recurrent and recursive models are
the computing order and the network structure. The former has
a closed loop dealing with the time sequence, while the latter has an open loop
tackling the spatial sequence.

Convolutional Neural Network (CNN)
The Convolutional Neural Network (CNN) consists of several feature extraction layers
and is inspired by biological processes (Matsugu et al. 2003). It has already been
successfully applied to many image recognition problems. The main advantage of
CNN is that it does not depend on prior knowledge or human effort for
feature selection.

CNN can also be applied to extract latent features from text data (Collobert
and Weston 2008; Collobert et al. 2011). As shown in Fig. 4.6, three different
convolutional kernels, in three different colors, select different features from the word
embedding separately. The feature vector is the combination of the max values from
the rows of the selected features.
In the models above, supervised learning is applied to train the whole neural
network, and the convolutional layer models the latent features of the
initialized word embedding. Since a supervised training method is applied, the model
is able to generate word embedding suitable for the specific task. Though the word
embedding is usually a by-product in this case, it still performs soundly in
capturing semantic and syntactic information.
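A sketch of the convolution-plus-max-pooling step of Fig. 4.6: each kernel slides over the sentence's word embeddings and keeps only its maximum response, and the feature vector gathers one value per kernel. All sizes, embeddings, and kernels below are invented:

```python
import numpy as np

rng = np.random.default_rng(4)
sent_len, d = 7, 5                       # sentence length, embedding dimension
X = rng.normal(size=(sent_len, d))       # one sentence's word embeddings (invented)

def conv_feature(X, kernel):
    """Slide a window of kernel-height words over X, then max-pool over time."""
    width = kernel.shape[0]
    scores = [float(np.sum(X[i:i + width] * kernel))
              for i in range(X.shape[0] - width + 1)]
    return max(scores)

# Three convolutional kernels of different widths, each yielding one feature.
kernels = [rng.normal(size=(w, d)) for w in (2, 3, 4)]
feature_vector = np.array([conv_feature(X, k) for k in kernels])
print(feature_vector.shape)              # (3,)
```

Max-pooling makes the feature vector's length independent of the sentence length, which is what lets a fixed classifier sit on top.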
Fig. 4.6 The structure of CNN for feature extraction by using convolutional kernels
Hierarchical Neural Language Model (HNLM)
There are many tricks to speed up the training of NNLM, such as short lists,
hash tables for words, and stochastic gradient descent (Bengio et al. 2003). But
when facing large-scale datasets, further improvement in model efficiency is
still needed.
Transferring knowledge from a small portion of observed data examples to
other cases can save a lot of computational effort, thus speeding up the learning
process. However, this is difficult for neural language models, which involve
computing the joint probability of certain word combinations, because the number of
possible combinations of words from a vocabulary is immensely larger than the number
observed in any actual text. So it is necessary to exploit some prior information before
searching all possible combinations. Based on human knowledge or statistical
information, we can cluster all the words into several classes such that the words
in each class share similar statistical properties. Several hierarchical models
are constructed based on this idea (Mnih and Hinton 2008; Morin and Bengio 2005).
The hierarchical structure has a big influence on the algorithmic complexity. For
example, Mnih and Hinton (2008) and Morin and Bengio (2005) reduce the
complexity of language models by clustering words into a binary tree. Specifically,
given a dictionary V containing |V| words, the speed-up will be O(|V|/log|V|)
(Morin and Bengio 2005) if the tree is binary.
To determine the hierarchical structure, hierarchical word classes (lexical word
categories in a narrow sense) are needed in most cases (Rijkhoff and Jan 2007). The
way word classes are constructed is sensitive to the word distribution. Word classes
can be built based on prior (expert) knowledge (Fellbaum 1998) or through data-
driven methods (Sun et al. 2012), but they can also be generated automatically in
accordance with usage statistics (Mnih and Hinton 2008; McMahon and Smith
1996). Prior knowledge is accurate but limited in application range, since
there are many problems for which expert categories are unavailable. Data-driven
methods can achieve a high level of accuracy but still need a manually tagged seed
corpus. Universality is the major advantage of the automatic generation
method: it can be applied to any natural language scenario without expert
knowledge, though the construction is more complex.
By utilizing the hierarchical structure, many language models have been
proposed (Morin and Bengio 2005; Mnih and Hinton 2008; Yogatama et al.
2014a; Djuric et al. 2015). Morin and Bengio (2005) introduce a hierarchical
decomposition based on the word classes extracted from WordNet. Similar
to Bengio et al. (2003), it uses the layer function to extract the latent features; the
process is shown in Eq. (4.12). Here x is the concatenation of the context word
embeddings [same as Eq. (4.4)], b_i is the bias vector, and U, W, b_i, x play the same
roles as in Eq. (4.5). The disadvantage is the tree construction procedure, which has to
combine manual and data-driven processing (Morin and Bengio 2005). The
Hierarchical Log-BiLinear (HLBL) model proposed by Mnih and Hinton (2008)
overcomes this disadvantage by using a boosting method to generate the tree
automatically. The binary tree with words as leaves consists of two components:
the words in the leaves, each uniquely represented by a sequential top-down binary
code, and the probability of the decision made at each node. Each non-leaf node in
the tree also has an associated vector q which is used for discrimination. The
probability of the predicted word w in HLBL is formulated as follows:

P(w_t = w) = ∏_i P(d_i | q_i)   (4.13)

where d_i is the ith digit in the binary code sequence for word w, and q_i is the feature
vector for the ith node on the path from the root to w. In each non-leaf node,
the probability of the decision is given by

P(d_i = 1) = σ((Σ_j C_j e(w_j))^T q_i + b_i)   (4.14)

where σ(x) is the logistic function for decision making, C_j is the weight matrix
corresponding to the context word w_j, and b_i is the bias that captures the context-
independent tendency to go to one of the children when leaving this node.
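The product form of the hierarchical prediction can be sketched as follows. The binary code and per-node scores below are invented, and each score stands in for the context-dependent term inside the logistic function; summing over all codes of the same length confirms that the probabilities normalize:

```python
import math
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def word_probability(code, node_scores):
    """P(w) as the product of binary decisions along the tree path:
    digit d_i = 1 with probability sigmoid(score_i), else 1 - sigmoid(score_i)."""
    p = 1.0
    for d_i, score in zip(code, node_scores):
        q = sigmoid(score)
        p *= q if d_i == 1 else 1.0 - q
    return p

scores = [0.2, -0.4, 1.3]                    # invented per-node decision scores
p = word_probability([1, 0, 1], scores)      # one word's path through the tree
total = sum(word_probability(list(c), scores) for c in product([0, 1], repeat=3))
print(0.0 < p < 1.0, round(total, 6))        # True 1.0
```

Each prediction only touches the log|V| nodes on one root-to-leaf path instead of all |V| words, which is the source of the speed-up discussed above.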
There are both advantages and disadvantages to the hierarchical structure. Many
models benefit from the reduced computational complexity, while the time consumed
in building the structure is the main drawback.
4.2.3 Sparse Coding Approach
Sparse Coding is an unsupervised model that learns a set of over-complete bases to
represent data efficiently. It generates basis vectors {φ_i} such that the input vector
x ∈ R^n can be represented by a linear combination x = Σ_{i=1}^{k} α_i φ_i, where k
is the number of basis vectors and k > n. Although there are many techniques, such as
Principal Component Analysis (PCA), that help learn a complete set of basis vectors
efficiently, Sparse Coding aims to use the fewest bases to represent the input x. The
over-complete basis vectors are able to capture the structures and patterns inherent
in the data, so the natural characteristics of the input can be captured through Sparse
Coding. To be specific, Sparse Coding constructs an over-complete dictionary D ∈ R^{L×K}
of basis vectors, together with a code matrix A ∈ R^{K×V}, to represent the V words
in contexts X ∈ R^{L×V} by minimizing the following function:

arg min_{D,A} ||X - DA||_2^2 + λ Ω(A)   (4.15)
where λ is a regularization hyperparameter and Ω is the regularizer. The squared
loss is applied for the reconstruction error, but loss functions like the L1-regularized
quadratic function can also be an efficient alternative for this problem (Schölkopf
et al. 2007).
Plenty of algorithms have been proposed to extend the sparse coding framework
starting from the general loss function above. If we apply a non-negativity constraint
on A (i.e. A_{i,j} ≥ 0), the problem turns into Non-Negative Sparse Embedding
(NNSE). Besides adding constraints on A, if we also require that all entries in D are
non-negative, then the problem can be transformed into Non-Negative Sparse
Coding (NNSC). NNSC is a matrix factorization technique previously studied in
the machine learning community (Hoyer 2002).
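For a single input x, the sparse code can be found with the classic iterative soft-thresholding (ISTA) scheme, a standard solver for the l1-penalized least-squares objective; the dictionary, signal, and hyperparameters below are invented:

```python
import numpy as np

rng = np.random.default_rng(5)
L, K = 8, 16                             # signal dimension, dictionary size (K > L)
D = rng.normal(size=(L, K))
D /= np.linalg.norm(D, axis=0)           # unit-norm basis vectors
x = rng.normal(size=L)

# ISTA for  min_a 0.5 * ||x - D a||_2^2 + lam * ||a||_1
lam, step = 0.5, 0.1
a = np.zeros(K)
for _ in range(500):
    a = a - step * (D.T @ (D @ a - x))                        # gradient step
    a = np.sign(a) * np.maximum(np.abs(a) - step * lam, 0.0)  # soft threshold

print(int(np.count_nonzero(a)), "of", K, "coefficients are nonzero")
```

The soft-threshold step zeroes out small coefficients exactly, so only a few of the K over-complete bases are used to reconstruct x.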
For .A/, Yogatama et al. (2014b) designs a forest-structured regularizer that
enables the mixture use of the dimension. The structure of this model is in Fig. 4.7,
there are seven variables in the tree which describes the order that variables ‘enter
the model’. In this work, it sets the rule that a node may take a nonzero value only
if its ancestors do. For example, nodes 3 and 4 may only be nonzero if nodes 1 and
2 are also nonzero. So the regularizer for Ashows in Eq. (4.16).
where avis the vth column of A.
Fig. 4.7 An example of a regularization forest that governs the order in which
variables enter the model

Faruqui et al. (2015) propose two methods that add restrictions on the
dictionary D:

arg min_{D,A} ||X - DA||_2^2 + λ Ω(A) + τ ||D||_2^2   (4.17)

where τ is a regularization hyperparameter. The first method applies an l1 penalty on
A, so that Function 4.17 can be decomposed into a loss for each word vector, which
makes parallel optimization possible:

arg min_{D,A} Σ_i ||x_i - D a_i||_2^2 + λ ||a_i||_1 + τ ||D||_2^2   (4.18)

Here x_i and a_i are the ith column vectors of the matrices X and A respectively. The second
method additionally imposes non-negativity constraints on the variables, so that:

arg min_{D≥0, A≥0} Σ_i ||x_i - D a_i||_2^2 + λ ||a_i||_1 + τ ||D||_2^2   (4.19)
Apart from the work based on Sparse Coding, Sun et al. (2016) add the l1 regularizer
to Word2Vec. The challenge in this work lies in the optimization method,
for stochastic gradient descent (SGD) cannot produce a sparse solution directly with the
l1 regularizer in online training. The method of Regularized Dual Averaging (RDA)
proposed by Lin (2009) keeps track of the online average subgradients at each update,
and optimizes the l1-regularized loss function based on the Continuous Bag-of-Words
(CBOW) model (Mikolov et al. 2013) in online learning.
4.2.4 Evaluations of Word Embedding
Measurements of word embedding quality usually depend on the specific application.
Some representative examples include perplexity, analogy precision and sentiment
classification precision.

Perplexity is an evaluation method that originates from information theory.
It measures how well a distribution or a model predicts a sample. Bengio
et al. (2003) use the perplexity of the corpus as the target: the lower the
perplexity value, the higher the quality of the word embedding, since the
information is more specific. Word or phrase analogy analysis is another important
assessment method. The work in Mikolov et al. (2013) designs the precision of semantic
and syntactic prediction as the standard measurement to evaluate the quality of word
embedding; it is applied at the phrase and word levels independently. All the assessment
methods mentioned above build on the linguistic phenomenon that the
distance between two words in the vector space reflects the correlation between them.
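The analogy test can be sketched with a toy embedding table whose vectors are constructed by hand so that the classic Madrid/Paris example holds; real evaluations use large learned embeddings and thousands of analogy questions:

```python
import numpy as np

# Toy embedding table; the vectors are invented so the analogy holds by
# construction, purely to illustrate the evaluation procedure.
emb = {
    "madrid": np.array([1.0, 0.0, 1.0]),
    "spain":  np.array([1.0, 0.0, 0.0]),
    "france": np.array([0.0, 1.0, 0.0]),
    "paris":  np.array([0.0, 1.0, 1.0]),
    "tokyo":  np.array([0.3, 0.3, 1.0]),
}

def analogy(a, b, c):
    """Word closest (by cosine) to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = emb[a] - emb[b] + emb[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in emb if w not in {a, b, c}), key=lambda w: cos(emb[w], target))

print(analogy("madrid", "spain", "france"))   # paris
```

Analogy precision is then simply the fraction of such questions for which the nearest neighbor of the offset vector is the expected word.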
Besides these linguistic measurements, a lot of work evaluates word embedding
through the task where it is applied. For example, if the word embedding is used in
sentiment classification, then the precision of the classification can be used for
evaluation. If it is applied in machine translation, then the precision of the
translation is what matters. This rule also holds for other tasks like POS tagging,
named entity recognition, textual entailment, and so on.

There is no single best measurement method for all scenarios. The most useful way is to
combine the measurement with the target of the task to achieve a more reasonable
evaluation result.
4.3 Word Embedding Applications
There are many applications of word embedding, especially in NLP tasks where
word embedding is fed directly as the input data or the features of the text. In this
section, we will discuss how word embedding can be applied in scenarios such as
semantic analysis, syntax analysis, idiomaticity analysis, Part-Of-Speech (POS)
tagging, sentiment analysis, named entity recognition, textual entailment and
machine translation.
The goal of syntax analysis is to extract the syntactic structure from sentences.
Recent works (Socher et al. 2011; Huang et al. 2012; Collobert and Weston 2008;
Mikolov et al. 2013) have taken syntax information into consideration to obtain
word embedding, and the result in Andreas and Dan (2014) shows that word
embedding can entail syntactic information directly. A fundamental problem
in syntax analysis is part-of-speech tagging, which is about labeling words
with different categories (e.g., plural, noun, adverb). It requires knowledge of
definitions and contextual information. Part-of-speech tagging is a word-level
NLP task. Collobert and Weston (2008) and Collobert et al. (2011) apply word
embedding for POS tagging and achieve state-of-the-art results. In some scenarios,
in order to figure out the syntax of a text, we need to first recognize the named
entities in it. Named entity recognition aims to find the names of persons,
organizations, time expressions, monetary values, locations and so on. Works in
Collobert and Weston (2008), Collobert et al. (2011), Zou et al. (2013), Luo et al.
(2014), and Pennington et al. (2014) show the expressiveness of word embedding in
these applications.
As a subsequent task, semantic analysis (Goddard 2011) relates the syntactic
structures of words to their language-independent meanings.1 In other words, it
reveals words that are correlated with each other. For example, the vector ("Madrid"
- "Spain" + "France") is close to ("Paris"). Previous approaches (Scott et al.
1999; Yih et al. 2012) mainly use the statistical information of the text. The works
in Mikolov et al. (2013), Socher et al. (2011), Huang et al. (2012), and Collobert
and Weston (2008) apply the analogy precision to measure the quality of word
embedding (see Sect. 4.2.4). It is shown that word embedding can manifest the
semantic information of words in the text. The semantics of each word can help us
understand the combined meaning of several words. The phenomenon of multi-
word expression idiomaticity is common in natural language, and it is difficult to be