# (Studies in Big Data 26) Srinivasan S. (ed.), Guide to Big Data Applications, Springer (2018)


- Foreword
- Preface
- Acknowledgements
- Contents
- List of Reviewers
- Part I General
- 1 Strategic Applications of Big Data
- 1.1 Introduction
- 1.2 From Value Disciplines to Digital Disciplines
- 1.3 Information Excellence
- 1.4 Solution Leadership
- 1.4.1 Digital-Physical Mirroring
- 1.4.2 Real-Time Product/Service Optimization
- 1.4.3 Product/Service Usage Optimization
- 1.4.4 Predictive Analytics and Predictive Maintenance
- 1.4.5 Product-Service System Solutions
- 1.4.6 Long-Term Product Improvement
- 1.4.7 The Experience Economy
- 1.4.8 Experiences
- 1.4.9 Transformations
- 1.4.10 Customer-Centered Product and Service Data Integration
- 1.4.11 Beyond Business

- 1.5 Collective Intimacy
- 1.6 Accelerated Innovation
- 1.7 Integrated Disciplines
- 1.8 Conclusion
- References

- 2 Start with Privacy by Design in All Big Data Applications
- 3 Privacy Preserving Federated Big Data Analysis
- 4 Word Embedding for Understanding Natural Language: A Survey

- Part II Applications in Science
- 5 Big Data Solutions to Interpreting Complex Systems in the Environment
- 6 High Performance Computing and Big Data
- 6.1 Introduction
- 6.2 High Performance in Action
- 6.3 High-Performance and Big Data Deployment Types
- 6.4 Software and Hardware Considerations for Building Highly Performant Data Platforms
- 6.5 Designing Data Pipelines for High Performance
- 6.6 Conclusions
- References

- 7 Managing Uncertainty in Large-Scale Inversions for the Oil and Gas Industry with Big Data
- 8 Big Data in Oil & Gas and Petrophysics
- 8.1 Introduction
- 8.2 The Value of Big Data for the Petroleum Industry
- 8.3 General Explanation of Terms
- 8.4 Steps to Big Data in the Oilfield
- 8.5 Future of Tools, Example
- 8.6 Next Step: Big Data Analytics in the Oil Industry
- 8.7 Big Data is the future of O&G
- 8.8 In Conclusion
- References

- 9 Friendship Paradoxes on Quora
- 9.1 Introduction
- 9.2 A Brief Review of the Statistics of Friendship Paradoxes: What are Strong Paradoxes, and Why Should We Measure Them?
- 9.2.1 Feld's Mathematical Argument
- 9.2.2 What Does Feld's Argument Imply?
- 9.2.3 Friendship Paradox Under Random Wiring
- 9.2.4 Beyond Random-Wiring Assumptions: Why Weak and Strong Friendship Paradoxes are Ubiquitous in Undirected Networks
- 9.2.5 Weak Generalized Paradoxes are Ubiquitous Too
- 9.2.6 Strong Degree-Based Paradoxes in Directed Networks and Strong Generalized Paradoxes are Nontrivial

- 9.3 Strong Paradoxes in the Quora Follow Network
- 9.4 A Strong Paradox in Downvoting
- 9.4.1 What are Upvotes and Downvotes?
- 9.4.2 The Downvoting Network and the Core Questions
- 9.4.3 The Downvoting Paradox is Absent in the Full Downvoting Network
- 9.4.4 The Downvoting Paradox Occurs When The Downvotee and Downvoter are Active Contributors
- 9.4.5 Does a "Content-Contribution Paradox" Explain the Downvoting Paradox?
- 9.4.6 Summary and Implications

- 9.5 A Strong Paradox in Upvoting
- 9.6 Conclusion
- References

- 10 Deduplication Practices for Multimedia Data in the Cloud
- 10.1 Context and Motivation
- 10.2 Data Deduplication Technical Design Issues
- 10.3 Chapter Highlights
- 10.4 Secure Image Deduplication Through Image Compression
- 10.5 Background
- 10.6 Proposed Image Deduplication Scheme
- 10.7 Secure Video Deduplication Scheme in Cloud Storage Environment Using H.264 Compression
- 10.8 Background
- 10.9 Proposed Video Deduplication Scheme
- 10.10 Chapter Summary
- References

- 11 Privacy-Aware Search and Computation Over Encrypted Data Stores
- 11.1 Introduction
- 11.2 Searchable Encryption Models
- 11.3 Text
- 11.4 Range Queries
- 11.5 Media
- 11.6 Other Applications
- 11.7 Conclusions
- References

- 12 Civil Infrastructure Serviceability Evaluation Based on Big Data
- 12.1 Introduction
- 12.2 Implementation Framework About MS-SHM-Hadoop
- 12.3 Acquisition of Sensory Data and Integration of Structure-Related Data
- 12.4 Nationwide Civil Infrastructure Survey
- 12.5 Global Structural Integrity Analysis
- 12.6 Localized Critical Component Reliability Analysis
- 12.7 Civil Infrastructure's Reliability Analysis Based on Bayesian Network
- 12.8 Conclusion and Future Work
- References

- Part III Applications in Medicine
- 13 Nonlinear Dynamical Systems with Chaos and Big Data: A Case Study of Epileptic Seizure Prediction and Control
- 13.1 Introduction
- 13.2 Background
- 13.3 Nonlinear Dynamical Systems with Chaos
- 13.4 Lyapunov Exponents
- 13.5 Rapid Prototyping HPCmatlab Platform
- 13.6 Case Study: Epileptic Seizure Prediction and Control
- 13.7 Future Research Opportunities and Conclusions
- Appendix 1: Electrical Stimulation and Experiment Setup
- Appendix 2: Preparation of Animals
- References

- 14 Big Data to Big Knowledge for Next Generation Medicine: A Data Science Roadmap
- 15 Time-Based Comorbidity in Patients Diagnosed with Tobacco Use Disorder
- 15.1 Introduction
- 15.2 Method
- 15.3 Results
- 15.3.1 Top 20 Diseases in TUD and Non-TUD Patients
- 15.3.2 Top 20 Diseases in TUD Patients and Corresponding Prevalence in Non-TUD Patients
- 15.3.3 Top 15 Comorbidities in TUD Patients Across Two Hospital Visits (Second Iteration) and Corresponding Prevalence in Non-TUD Patients
- 15.3.4 Comorbidities in TUD Patients Across Three Hospital Visits (Third Iteration) and Comparison with Non-TUD Patients

- 15.4 Discussion and Concluding Remarks
- References

- 16 The Impact of Big Data on the Physician
- 16.1 Part 1: The Patient-Physician Relationship
- 16.1.1 Defining Quality Care
- 16.1.2 Choosing the Best Doctor
- 16.1.3 Choosing the Best Hospital
- 16.1.4 Sharing Information: Using Big Data to Expand the Patient History
- 16.1.5 What is mHealth?
- 16.1.6 mHealth from the Provider Side
- 16.1.7 mHealth from the Patient Side
- 16.1.8 Logistical Concerns
- 16.1.9 Accessibility
- 16.1.10 Privacy and Security
- 16.1.11 Regulation and Liability
- 16.1.12 Patient Education and Partnering with Patients
- 16.1.13 Developing Online Health Communities Through Social Media: Creating Data that Fuels Research
- 16.1.14 Translating Complex Medical Data into Patient-Friendly Formats
- 16.1.15 Beyond the Package Insert: Iodine.com
- 16.1.16 Data Inspires the Advent of New Models of Medical Care Delivery

- 16.2 Part II: Physician Uses for Big Data in Clinical Care
- References

- Part IV Applications in Business
- 17 The Potential of Big Data in Banking
- 18 Marketing Applications Using Big Data
- 19 Does Yelp Matter? Analyzing (And Guide to Using) Ratings for a Quick Serve Restaurant Chain

- Author Biographies
- Index

Studies in Big Data 26

S. Srinivasan, Editor

Guide to Big Data Applications

About this Series

The series “Studies in Big Data” (SBD) publishes new developments and advances

in the various areas of Big Data – quickly and with a high quality. The intent

is to cover the theory, research, development, and applications of Big Data, as

embedded in the ﬁelds of engineering, computer science, physics, economics and

life sciences. The books of the series refer to the analysis and understanding of

large, complex, and/or distributed data sets generated from recent digital sources

coming from sensors or other physical instruments as well as simulations, crowd

sourcing, social networks or other internet transactions, such as emails or video

click streams and other. The series contains monographs, lecture notes and edited

volumes in Big Data spanning the areas of computational intelligence including

neural networks, evolutionary computation, soft computing, fuzzy systems, as well

as artiﬁcial intelligence, data mining, modern statistics and Operations research, as

well as self-organizing systems. Of particular value to both the contributors and

the readership are the short publication timeframe and the world-wide distribution,

which enable both wide and rapid dissemination of research output.

More information about this series at http://www.springer.com/series/11970

S. Srinivasan

Editor

Guide to Big Data

Applications


Editor

S. Srinivasan

Jesse H. Jones School of Business

Texas Southern University

Houston, TX, USA

ISSN 2197-6503 ISSN 2197-6511 (electronic)

Studies in Big Data

ISBN 978-3-319-53816-7 ISBN 978-3-319-53817-4 (eBook)

DOI 10.1007/978-3-319-53817-4

Library of Congress Control Number: 2017936371

© Springer International Publishing AG 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of

the material is concerned, speciﬁcally the rights of translation, reprinting, reuse of illustrations, recitation,

broadcasting, reproduction on microﬁlms or in any other physical way, and transmission or information

storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology

now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication

does not imply, even in the absence of a speciﬁc statement, that such names are exempt from the relevant

protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book

are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or

the editors give a warranty, express or implied, with respect to the material contained herein or for any

errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional

claims in published maps and institutional afﬁliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature

The registered company is Springer International Publishing AG

The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

To my wife Lakshmi and grandson Sahaas

Foreword

It gives me great pleasure to write this Foreword for this timely publication on the

topic of the ever-growing list of Big Data applications. The potential for leveraging

existing data from multiple sources has been articulated over and over, in an

almost inﬁnite landscape, yet it is important to remember that in doing so, domain

knowledge is key to success. Naïve attempts to process data are bound to lead to

errors such as accidentally regressing on noncausal variables. As Michael Jordan

at Berkeley has pointed out, in Big Data applications the number of combinations

of the features grows exponentially with the number of features, and so, for any

particular database, you are likely to ﬁnd some combination of columns that will

predict perfectly any outcome, just by chance alone. It is therefore important that

we do not process data in a hypothesis-free manner and skip sanity checks on our

data.

In this collection titled “Guide to Big Data Applications,” the editor has

assembled a set of applications in science, medicine, and business where the authors

have attempted to do just this—apply Big Data techniques together with a deep

understanding of the source data. The applications covered give a ﬂavor of the

beneﬁts of Big Data in many disciplines. This book has 19 chapters broadly divided

into four parts. In Part I, there are four chapters that cover the basics of Big

Data, aspects of privacy, and how one could use Big Data in natural language

processing (a particular concern for privacy). Part II covers eight chapters that


look at various applications of Big Data in environmental science, oil and gas, and

civil infrastructure, covering topics such as deduplication, encrypted search, and the

friendship paradox.

Part III covers Big Data applications in medicine, covering topics ranging

from “The Impact of Big Data on the Physician,” written from a purely clinical

perspective, to the often discussed deep dives on electronic medical records.

Perhaps most exciting in terms of future landscaping is the application of Big Data

in healthcare from a developing country perspective. This is one of the

most promising growth areas in healthcare, due to the current paucity of

services and the explosion of mobile phone usage. The tabula rasa that exists in

many countries holds the potential to leapfrog many of the mistakes we have made

in the west with stagnant silos of information, arbitrary barriers to entry, and the

lack of any standardized schema or nondegenerate ontologies.

In Part IV, the book covers Big Data applications in business, which is perhaps

the unifying subject here, given that none of the above application areas are likely

to succeed without a good business model. The potential to leverage Big Data

approaches in business is enormous, from banking practices to targeted advertising.

The need for innovation in this space is as important as the underlying technologies

themselves. As Clayton Christensen points out in The Innovator’s Prescription,

three revolutions are needed for a successful disruptive innovation:

1. A technology enabler which “routinizes” previously complicated tasks

2. A business model innovation which is affordable and convenient

3. A value network whereby companies with disruptive mutually reinforcing

economic models sustain each other in a strong ecosystem

We see this happening with Big Data almost every week, and the future is

exciting.

In this book, the reader will encounter inspiration in each of the above topic

areas and be able to acquire insights into applications that provide the ﬂavor of this

fast-growing and dynamic ﬁeld.

Atlanta, GA, USA Gari Clifford

December 10, 2016

Preface

Big Data applications are growing very rapidly around the globe. This new approach

to decision making takes into account data gathered from multiple sources. Here my

goal is to show how these diverse sources of data are useful in arriving at actionable

information. In this collection of articles the publisher and I are trying to bring

together in one place several diverse applications of Big Data. The goal is for users to see

how a Big Data application in another ﬁeld could be replicated in their discipline.

With this in mind I have assembled in the “Guide to Big Data Applications” a

collection of 19 chapters written by academics and industry practitioners globally.

These chapters reﬂect what Big Data is, how privacy can be protected with Big

Data and some of the important applications of Big Data in science, medicine and

business. These applications are intended to be representative and not exhaustive.

For nearly two years I spoke with major researchers around the world and the

publisher. These discussions led to this project. The initial Call for Chapters was

sent to several hundred researchers globally via email. Approximately 40 proposals

were submitted. Out of these came commitments for completion in a timely manner

from 20 people. Most of these chapters are written by researchers while some are

written by industry practitioners. One of the submissions was not included as it

could not provide evidence of use of Big Data. This collection brings together in

one place several important applications of Big Data. All chapters were reviewed

using a double-blind process and comments provided to the authors. The chapters

included here are the final revised versions.

I have arranged the chapters in four parts. Part I includes four chapters that deal

with basic aspects of Big Data and how privacy is an integral component. In this

part I include an introductory chapter that lays the foundation for using Big Data

in a variety of applications. This is then followed with a chapter on the importance

of including privacy aspects at the design stage itself. This chapter by two leading

researchers in the ﬁeld shows the importance of Big Data in dealing with privacy

issues and how they could be better addressed by incorporating privacy aspects at

the design stage itself. The team of researchers from a major research university in

the USA addresses the importance of federated Big Data. They are looking at the

use of distributed data in applications. This part is concluded with a chapter that


shows the importance of word embedding and natural language processing using

Big Data analysis.

In Part II, there are eight chapters on the applications of Big Data in science.

Science is an important area where decision making about how to approach a

problem could be enhanced using data analysis. The applications selected here

deal with Environmental Science, High Performance Computing (HPC), the friendship

paradox and noting which friends' influence will be significant, the significance of

using encrypted search with Big Data, the importance of deduplication in Big Data

especially when data is collected from multiple sources, applications in Oil &

Gas, and how decision making can be enhanced in identifying bridges that need

to be replaced to meet safety requirements. All these application areas

selected for inclusion in this collection show the diversity of ﬁelds in which Big

Data is used today. The Environmental Science application shows how the data

published by the National Oceanic and Atmospheric Administration (NOAA) is

used to study the environment. Since such datasets are very large, specialized tools

are needed to beneﬁt from them. In this chapter the authors show how Big Data

tools help in this effort. The team of industry practitioners discuss how there is

great similarity in the way HPC deals with low-latency, massively parallel systems

and distributed systems. These are all typical of how Big Data is processed using tools

such as MapReduce, Hadoop and Spark. Quora is a leading provider of answers to

user queries and in this context one of their data scientists is addressing how the

Friendship paradox is playing a signiﬁcant part in Quora answers. This is a classic

illustration of a Big Data application using social media.

Big Data applications in science exist in many branches and it is very heavily

used in the Oil and Gas industry. Two chapters that address the Oil and Gas

application are written by two sets of people with extensive industry experience.

Two speciﬁc chapters are devoted to how Big Data is used in deduplication practices

involving multimedia data in the cloud and how privacy-aware searches are done

over encrypted data. Today, people are very concerned about the security of data

stored with an application provider. Encryption is the preferred tool to protect such

data and so having an efﬁcient way to search such encrypted data is important.

This chapter’s contribution in this regard will be of great beneﬁt for many users.

We conclude Part II with a chapter that shows how Big Data is used in noting the

structural safety of the nation's bridges. This practical application shows how Big Data

is used in many different ways.

Part III considers applications in medicine. A group of expert doctors from

leading medical institutions in the Bay Area discuss how Big Data is used in the

practice of medicine. This is one area where many more applications abound and the

interested reader is encouraged to look at such applications. Another chapter looks

at how data scientists are important in analyzing medical data. This chapter reﬂects

a view from Asia and discusses the roadmap for data science use in medicine.

Smoking has been noted as one of the leading causes of human suffering. This part

includes a chapter on comorbidity aspects related to smokers based on a Big Data

analysis. The details presented in this chapter would help the reader to focus on other

possible applications of Big Data in medicine, especially cancer. Finally, a chapter is


included that shows how scientiﬁc analysis of Big Data helps with epileptic seizure

prediction and control.

Part IV of the book deals with applications in Business. This is an area where

Big Data use is expected to provide tangible results quickly to businesses. The

three applications listed under this part include an application in banking, an

application in marketing and an application in Quick Serve Restaurants. The

banking application is written by a group of researchers in Europe. Their analysis

shows that identifying financial fraud early is a global problem

and how Big Data is used in this effort. The marketing application highlights the

various ways in which Big Data could be used in business. Many large business

sectors such as the airlines industry are using Big Data to set prices. The application

with respect to a Quick Serve Restaurant chain deals with the impact of Yelp ratings

and how they influence people's use of Quick Serve Restaurants.

As mentioned at the outset, this collection of chapters on Big Data applications

is expected to serve as a sample for other applications in various ﬁelds. The readers

will ﬁnd novel ways in which data from multiple sources is combined to derive

beneﬁt for the general user. Also, in speciﬁc areas such as medicine, the use of Big

Data is having profound impact in opening up new areas for exploration based on

the availability of large volumes of data. These are all having practical applications

that help extend people’s lives. I earnestly hope that this collection of applications

will spur the interest of the reader to look at novel ways of using Big Data.

This book is a collective effort of many people. The contributors to this book

come from North America, Europe and Asia. This diversity shows that Big Data is

a truly global phenomenon in which people use data to enhance their decision-making

capabilities and to derive practical beneﬁts. The book greatly beneﬁted from the

careful review by many reviewers who provided detailed feedback in a timely

manner. I have carefully checked all chapters for consistency of information in

content and appearance. In spite of careful checking and taking advantage of the

tools provided by technology, it is highly likely that some errors might have crept

into the chapter content. In such cases I take responsibility for such errors and request

your help in bringing them to my attention so that they can be corrected in future

editions.

Houston, TX, USA S. Srinivasan

December 15, 2016

Acknowledgements

A project of this nature would be possible only with the collective efforts of many

people. Initially I proposed the project to Springer, New York, over two years ago.

Springer expressed interest in the proposal and one of their editors, Ms. Mary

James, contacted me to discuss the details. After extensive discussions with major

researchers around the world we ﬁnally settled on this approach. A global Call for

Chapters was made in January 2016 both by me and Springer, New York, through

their channels of communication. Ms. Mary James helped throughout the project

by providing answers to questions that arose. In this context, I want to mention

the support of Ms. Brinda Megasyamalan from the printing house of Springer.

Ms. Megasyamalan has been a constant source of information as the project

progressed. Ms. Subhashree Rajan from the publishing arm of Springer has been

extremely cooperative and patient in getting all the page proofs and incorporating

all the corrections. Ms. Mary James provided all the encouragement and support

throughout the project by responding to inquiries in a timely manner.

The reviewers played a very important role in maintaining the quality of this

publication by their thorough reviews. We followed a double-blind review process

whereby the reviewers were unaware of the identity of the authors and vice versa.

This helped in providing quality feedback. All the authors cooperated very well by

incorporating the reviewers’ suggestions and submitting their ﬁnal chapters within

the time allotted for that purpose. I want to thank individually all the reviewers and

all the authors for their dedication and contribution to this collective effort.

I want to express my sincere thanks to Dr. Gari Clifford of Emory University and

Georgia Institute of Technology for providing the Foreword to this publication. In

spite of his many commitments, Dr. Clifford was able to ﬁnd the time to go over all

the Abstracts and write the Foreword without delaying the project.

Finally, I want to express my sincere appreciation to my wife for accommodating

the many special needs when working on a project of this nature.


Contents

Part I General

1 Strategic Applications of Big Data ........................................ 3

Joe Weinman

2 Start with Privacy by Design in All Big Data Applications ............ 29

Ann Cavoukian and Michelle Chibba

3 Privacy Preserving Federated Big Data Analysis ....................... 49

Wenrui Dai, Shuang Wang, Hongkai Xiong, and Xiaoqian Jiang

4 Word Embedding for Understanding Natural Language:

A Survey ..................................................................... 83

Yang Li and Tao Yang

Part II Applications in Science

5 Big Data Solutions to Interpreting Complex Systems in the

Environment ................................................................ 107

Hongmei Chi, Sharmini Pitter, Nan Li, and Haiyan Tian

6 High Performance Computing and Big Data ............................ 125

Rishi Divate, Sankalp Sah, and Manish Singh

7 Managing Uncertainty in Large-Scale Inversions for the Oil

and Gas Industry with Big Data .......................................... 149

Jiefu Chen, Yueqin Huang, Tommy L. Binford Jr., and Xuqing Wu

8 Big Data in Oil & Gas and Petrophysics ................................. 175

Mark Kerzner and Pierre Jean Daniel

9 Friendship Paradoxes on Quora .......................................... 205

Shankar Iyer

10 Deduplication Practices for Multimedia Data in the Cloud ........... 245

Fatema Rashid and Ali Miri


11 Privacy-Aware Search and Computation Over Encrypted Data

Stores......................................................................... 273

Hoi Ting Poon and Ali Miri

12 Civil Infrastructure Serviceability Evaluation Based on Big Data.... 295

Yu Liang, Dalei Wu, Dryver Huston, Guirong Liu, Yaohang Li,

Cuilan Gao, and Zhongguo John Ma

Part III Applications in Medicine

13 Nonlinear Dynamical Systems with Chaos and Big Data:

A Case Study of Epileptic Seizure Prediction and Control ............ 329

Ashfaque Shaﬁque, Mohamed Sayeed, and Konstantinos Tsakalis

14 Big Data to Big Knowledge for Next Generation Medicine:

A Data Science Roadmap .................................................. 371

Tavpritesh Sethi

15 Time-Based Comorbidity in Patients Diagnosed with Tobacco

Use Disorder................................................................. 401

Pankush Kalgotra, Ramesh Sharda, Bhargav Molaka,

and Samsheel Kathuri

16 The Impact of Big Data on the Physician ................................ 415

Elizabeth Le, Sowmya Iyer, Teja Patil, Ron Li, Jonathan H. Chen,

Michael Wang, and Erica Sobel

Part IV Applications in Business

17 The Potential of Big Data in Banking .................................... 451

Rimvydas Skyrius, Gintarė Giriūnienė, Igor Katin, Michail Kazimianec,

and Raimundas Žilinskas

18 Marketing Applications Using Big Data ................................. 487

S. Srinivasan

19 Does Yelp Matter? Analyzing (And Guide to Using) Ratings

for a Quick Serve Restaurant Chain ..................................... 503

Bogdan Gadidov and Jennifer Lewis Priestley

Author Biographies .............................................................. 523

Index............................................................................... 553

List of Reviewers

Maruthi Bhaskar

Jay Brandi

Jorge Brusa

Arnaub Chatterjee

Robert Evans

Aly Farag

Lila Ghemri

Ben Hu

Balaji Janamanchi

Mehmed Kantardzic

Mark Kerzner

Ashok Krishnamurthy

Angabin Matin

Hector Miranda

P. S. Raju

S. Srinivasan

Rakesh Verma

Daniel Vrinceanu

Haibo Wang

Xuqing Wu

Alec Yasinsac


Part I

General

Chapter 1

Strategic Applications of Big Data

Joe Weinman

1.1 Introduction

For many people, big data is somehow virtually synonymous with one application—

marketing analytics—in one vertical—retail. For example, by collecting purchase

transaction data from shoppers based on loyalty cards or other unique identiﬁers

such as telephone numbers, account numbers, or email addresses, a company can

segment those customers better and identify promotions that will boost proﬁtable

revenues, either through insights derived from the data, A/B testing, bundling, or

the like. Such insights can be extended almost without bound. For example, through

sophisticated analytics, Harrah’s determined that its most proﬁtable customers

weren't "gold cuff-linked, limousine-riding high rollers," but rather teachers,

doctors, and even machinists (Loveman 2003). Not only did they come to understand

who their best customers were, but how they behaved and responded to promotions.

For example, their target customers were more interested in an offer of $60 worth of

chips than a total bundle worth much more than that, including a room and multiple

steak dinners in addition to chips.

While marketing such as this is a great application of big data and analytics, the

reality is that big data has numerous strategic business applications across every

industry vertical. Moreover, there are many sources of big data available from a

company’s day-to-day business activities as well as through open data initiatives,

such as data.gov in the U.S., a source with almost 200,000 datasets at the time of

this writing.

To apply big data to critical areas of the ﬁrm, there are four major generic

approaches that companies can use to deliver unparalleled customer value and

J. Weinman ()

Independent Consultant, Flanders, NJ 07836, USA

e-mail: joeweinman@gmail.com

© Springer International Publishing AG 2018

S. Srinivasan (ed.), Guide to Big Data Applications, Studies in Big Data 26,

DOI 10.1007/978-3-319-53817-4_1


achieve strategic competitive advantage: better processes, better products and

services, better customer relationships, and better innovation.

1.1.1 Better Processes

Big data can be used to optimize processes and asset utilization in real time, to

improve them in the long term, and to generate net new revenues by entering

new businesses or at least monetizing data generated by those processes. UPS

optimizes pickups and deliveries across its 55,000 routes by leveraging data ranging

from geospatial and navigation data to customer pickup constraints (Rosenbush

and Stevens 2015). Or consider 23andMe, which has sold genetic data it collects

from individuals. One such deal with Genentech focused on Parkinson’s disease

gained net new revenues of ﬁfty million dollars, rivaling the revenues from its “core”

business (Lee 2015).

1.1.2 Better Products and Services

Big data can be used to enrich the quality of customer solutions, moving them up

the experience economy curve from mere products or services to experiences or

transformations. For example, Nike used to sell sneakers, a product. However, by

collecting and aggregating activity data from customers, it can help transform them

into better athletes. By linking data from Nike products and apps with data from

ecosystem solution elements, such as weight scales and body-fat analyzers, Nike

can increase customer loyalty and tie activities to outcomes (Withings 2014).
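
As a minimal sketch of what linking activity data to ecosystem devices could look like, the following joins a hypothetical per-week running log with hypothetical smart-scale readings and relates training volume to weight; the tables, field names, and numbers are illustrative assumptions, not Nike's or Withings' actual data or APIs.

```python
# Illustrative only: joining activity-tracker data with smart-scale data to
# relate training volume to an outcome. All tables, fields, and numbers are
# hypothetical, not an actual Nike or Withings export.
import pandas as pd

runs = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "week":    ["W01", "W02", "W03", "W01", "W02"],
    "km_run":  [12.0, 18.5, 25.0, 5.0, 4.0],
})

weigh_ins = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "week":      ["W01", "W02", "W03", "W01", "W02"],
    "weight_kg": [82.0, 81.4, 80.6, 95.0, 95.2],
})

# One row per user-week combining activity and body weight (the outcome).
joined = runs.merge(weigh_ins, on=["user_id", "week"])

# Crude per-user link between weekly mileage and weight: the kind of
# "activities to outcomes" view an ecosystem of devices makes possible.
for uid, g in joined.groupby("user_id"):
    print(f"user {uid}: mileage/weight correlation = {g['km_run'].corr(g['weight_kg']):.2f}")
```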

1.1.3 Better Customer Relationships

Rather than merely viewing data as a crowbar with which to open customers’ wallets

a bit wider through targeted promotions, it can be used to develop deeper insights

into each customer, thus providing better service and customer experience in the

short term and products and services better tailored to customers as individuals

in the long term. Netﬂix collects data on customer activities, behaviors, contexts,

demographics, and intents to better tailor movie recommendations (Amatriain

2013). Better recommendations enhance customer satisfaction and value which

in turn makes these customers more likely to stay with Netﬂix in the long term,

reducing churn and customer acquisition costs, as well as enhancing referral

(word-of-mouth) marketing. Harrah's determined that customers that were "very happy"

with their customer experience increased their spend by 24% annually; those that

were unhappy decreased their spend by 10% annually (Loveman 2003).


1.1.4 Better Innovation

Data can be used to accelerate the innovation process, and make it of higher quality,

all while lowering cost. Data sets can be published or otherwise incorporated as

part of an open contest or challenge, enabling ad hoc solvers to identify a best

solution meeting requirements. For example, GE Flight Quest incorporated data

on scheduled and actual ﬂight departure and arrival times, for a contest intended

to devise algorithms to better predict arrival times, and another one intended to

improve them (Kaggle n.d.). As the nexus of innovation moves from man to

machine, data becomes the fuel on which machine innovation engines run.

These four business strategies are what I call digital disciplines (Weinman

2015), and represent an evolution of three customer-focused strategies called value

disciplines, originally devised by Michael Treacy and Fred Wiersema in their

international bestseller The Discipline of Market Leaders (Treacy and Wiersema 1995).

1.2 From Value Disciplines to Digital Disciplines

The value disciplines originally identiﬁed by Treacy and Wiersema are operational

excellence, product leadership, and customer intimacy.

Operational excellence entails processes which generate customer value by being

lower cost or more convenient than those of competitors. For example, Michael

Dell, operating as a college student out of a dorm room, introduced an

assemble-to-order process for PCs by utilizing a direct channel which was originally the

phone or physical mail and then became the Internet and eCommerce. He was

able to drive the price down, make it easier to order, and provide a PC built to

customers’ speciﬁcations by creating a new assemble-to-order process that bypassed

indirect channel middlemen that stocked pre-built machines en masse, who offered

no customization but charged a markup nevertheless.

Product leadership involves creating leading-edge products (or services) that

deliver superior value to customers. We all know the companies that do this: Rolex

in watches, Four Seasons in lodging, Singapore Airlines or Emirates in air travel.

Treacy and Wiersema considered innovation as being virtually synonymous with

product leadership, under the theory that leading products must be differentiated in

some way, typically through some innovation in design, engineering, or technology.

Customer intimacy, according to Treacy and Wiersema, is focused on segmenting

markets, better understanding the unique needs of those niches, and tailoring

solutions to meet those needs. This applies to both consumer and business markets. For

example, a company that delivers packages might understand a major customer’s

needs intimately, and then tailor a solution involving stocking critical parts at

their distribution centers, reducing the time needed to get those products to their

customers. In the consumer world, customer intimacy is at work any time a tailor

adjusts a garment for a perfect ﬁt, a bartender customizes a drink, or a doctor

diagnoses and treats a medical issue.


Traditionally, the thinking was that a company would do well to excel in a given

discipline, and that the disciplines were to a large extent mutually exclusive. For

example, a fast food restaurant might serve a limited menu to enhance operational

excellence. A product leadership strategy of having many different menu items, or a

customer intimacy strategy of customizing each and every meal might conﬂict with

the operational excellence strategy. However, now, the economics of information—

storage prices are exponentially decreasing and data, once acquired, can be

leveraged elsewhere—and the increasing flexibility of automation—such as robotics—

mean that companies can potentially pursue multiple strategies simultaneously.

Digital technologies such as big data enable new ways to think about the insights

originally derived by Treacy and Wiersema. Another way to think about it is that

digital technologies plus value disciplines equal digital disciplines: operational

excellence evolves to information excellence, product leadership of standalone

products and services becomes solution leadership of smart, digital products and

services connected to the cloud and ecosystems, customer intimacy expands to

collective intimacy, and traditional innovation becomes accelerated innovation. In

the digital disciplines framework, innovation becomes a separate discipline, because

innovation applies not only to products, but also processes, customer relationships,

and even the innovation process itself. Each of these new strategies can be enabled

by big data in profound ways.

1.2.1 Information Excellence

Operational excellence can be viewed as evolving to information excellence, where

digital information helps optimize physical operations including their processes and

resource utilization; where the world of digital information can seamlessly fuse

with that of physical operations; and where virtual worlds can replace physical.

Moreover, data can be extracted from processes to enable long term process

improvement, data collected by processes can be monetized, and new forms of

corporate structure based on loosely coupled partners can replace traditional,

monolithic, vertically integrated companies. As one example, location data from

cell phones can be aggregated and analyzed to determine commuter trafﬁc patterns,

thereby helping to plan transportation network improvements.

1.2.2 Solution Leadership

Products and services can become sources of big data, or utilize big data to

function more effectively. Because individual products are typically limited in

storage capacity, and because there are beneﬁts to data aggregation and cloud

processing, normally the data that is collected can be stored and processed in the

cloud. A good example might be the GE GEnx jet engine, which collects 5000


data points each second from each of 20 sensors. GE then uses the data to develop

better predictive maintenance algorithms, thus reducing unplanned downtime for

airlines (GE Aviation n.d.). Mere product leadership becomes solution leadership,

where standalone products become cloud-connected and data-intensive. Services

can also become solutions, because services are almost always delivered through

physical elements: food services through restaurants and ovens; airline services

through planes and baggage conveyors; healthcare services through x-ray machines

and pacemakers. The components of such services connect to each other and

externally. For example, healthcare services can be better delivered through connected

pacemakers, and medical diagnostic data from multiple individual devices can be

aggregated to create a patient-centric view to improve health outcomes.

1.2.3 Collective Intimacy

Customer intimacy is no longer about dividing markets into segments, but rather

dividing markets into individuals, or even further into multiple personas that an

individual might have. Personalization and contextualization offer the ability to

not just deliver products and services tailored to a segment, but to an individual.

To do this effectively requires current, up-to-date information as well as historical

data, collected at the level of the individual and his or her individual activities

and characteristics down to the granularity of DNA sequences and mouse moves.

Collective intimacy is the notion that algorithms running on collective data from

millions of individuals can generate better tailored services for each individual.

This represents the evolution of intimacy from face-to-face, human-mediated

relationships to virtual, human-mediated relationships over social media, and from

there, onward to virtual, algorithmically mediated products and services.

1.2.4 Accelerated Innovation

Finally, innovation is not just associated with product leadership, but can create new

processes, as Walmart did with cross-docking or Uber with transportation, or new

customer relationships and collective intimacy, as Amazon.com uses data to better

upsell/cross-sell, and as Netﬂix innovated its Cinematch recommendation engine.

The latter was famously done through the Netﬂix Prize, a contest with a million

dollar award for whoever could best improve Cinematch by at least 10% (Bennett

and Lanning 2007). Such accelerated innovation can be faster, cheaper, and better

than traditional means of innovation. Often, such approaches exploit technologies

such as the cloud and big data. The cloud is the mechanism for reaching multiple

potential solvers on an ad hoc basis, with published big data being the fuel for

problem solving. For example, Netﬂix published anonymized customer ratings of

movies, and General Electric published planned and actual ﬂight arrival times.


Today, machine learning and deep learning based on big data sets are a means

by which algorithms are innovating themselves. Google DeepMind's AlphaGo

Go-playing system beat the human world champion at Go, Lee Sedol, partly based on

learning how to play by not only “studying” tens of thousands of human games, but

also by playing an increasingly tougher competitor: itself (Moyer 2016).

1.2.5 Value Disciplines to Digital Disciplines

The three classic value disciplines of operational excellence, product leadership and

customer intimacy are transformed in a world of big data and complementary

digital technologies into information excellence, solution leadership, collective

intimacy, and accelerated innovation. These represent four generic strategies

that leverage big data in the service of strategic competitive differentiation; four

generic strategies that represent the horizontal applications of big data.

1.3 Information Excellence

Most of human history has been centered on the physical world. Hunting and

gathering, ﬁshing, agriculture, mining, and eventually manufacturing and physical

operations such as shipping, rail, and eventually air transport. It’s not news that the

focus of human affairs is increasingly digital, but the many ways in which digital

information can complement, supplant, enable, optimize, or monetize physical

operations may be surprising. As more of the world becomes digital, the use of

information, which after all comes from data, becomes more important in the

spheres of business, government, and society (Fig. 1.1).

1.3.1 Real-Time Process and Resource Optimization

There are numerous business functions, such as legal, human resources, ﬁnance,

engineering, and sales, and a variety of ways in which different companies in a

variety of verticals such as automotive, healthcare, logistics, or pharmaceuticals

conﬁgure these functions into end-to-end processes. Examples of processes might

be “claims processing” or “order to delivery” or “hire to ﬁre”. These in turn use a

variety of resources such as people, trucks, factories, equipment, and information

technology.

Data can be used to optimize resource use as well as to optimize processes for

goals such as cycle time, cost, or quality.

Some good examples of the use of big data to optimize processes are inventory

management/sales forecasting, port operations, and package delivery logistics.


Fig. 1.1 High-level architecture for information excellence

Too much inventory is a bad thing, because there are costs to holding inventory:

the capital invested in the inventory, risk of disaster, such as a warehouse ﬁre,

insurance, ﬂoor space, obsolescence, shrinkage (i.e., theft), and so forth. Too little

inventory is also bad, because not only may a sale be lost, but the prospect may

go elsewhere to acquire the good, realize that the competitor is a ﬁne place to

shop, and never return. Big data can help with sales forecasting and thus setting

correct inventory levels. It can also help to develop insights, which may be subtle

or counterintuitive. For example, when Hurricane Frances was projected to strike

Florida, analytics helped stock stores, not only with “obvious” items such as

bottled water and ﬂashlights, but non-obvious products such as strawberry Pop-Tarts

(Hayes 2004). This insight was based on mining store transaction data from prior

hurricanes.
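
A toy version of that kind of mining might compare each item's average daily unit sales inside a hurricane-warning window against its normal baseline and rank items by the resulting lift; the transaction extract, dates, and column names below are hypothetical.

```python
# Illustrative sketch: rank items by the "lift" in average daily unit sales
# during a hurricane-warning window versus a normal baseline. The transaction
# extract, dates, and column names are hypothetical.
import pandas as pd

tx = pd.DataFrame({
    "date":  pd.to_datetime(["2004-08-01", "2004-08-01", "2004-08-01",
                             "2004-08-12", "2004-08-12", "2004-08-12"]),
    "item":  ["pop_tarts", "water", "flashlight",
              "pop_tarts", "water", "flashlight"],
    "units": [3, 40, 2, 21, 120, 35],
})

storm_days = pd.to_datetime(["2004-08-12"])        # hypothetical warning window
tx["regime"] = tx["date"].isin(storm_days).map({False: "baseline", True: "storm"})

# Average units sold per day for each item in each regime.
rates = (tx.groupby(["item", "regime"])
           .agg(units=("units", "sum"), days=("date", "nunique")))
per_day = (rates["units"] / rates["days"]).unstack("regime")
per_day["lift"] = per_day["storm"] / per_day["baseline"]

print(per_day.sort_values("lift", ascending=False))
```

Items whose lift far exceeds the store-wide average are candidates for pre-storm stocking, which is the shape of the Pop-Tarts insight described above.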

Consider a modern container port. There are multiple goals, such as minimizing

the time ships are in port to maximize their productivity, minimizing the time ships

or rail cars are idle, ensuring the right containers get to the correct destinations,

maximizing safety, and so on. In addition, there may be many types of structured

and unstructured data, such as shipping manifests, video surveillance feeds of roads

leading to and within the port, data on bridges, loading cranes, weather forecast data,

truck license plates, and so on. All of these data sources can be used to optimize port

operations in line with the multiple goals (Xvela 2016).

Or consider a logistics ﬁrm such as UPS. UPS has invested hundreds of millions

of dollars in ORION (On-Road Integrated Optimization and Navigation). It takes

data such as physical mapping data regarding roads, delivery objectives for each

package, customer data such as when customers are willing to accept deliveries,

and the like. For each of 55,000 routes, ORION determines the optimal sequence of


an average of 120 stops per route. The combinatorics here are staggering, since there

are roughly 10^200 different possible sequences, making it impossible to calculate

a perfectly optimal route, but heuristics can take all this data and try to determine

the best way to sequence stops and route delivery trucks to minimize idling time,

time waiting to make left turns, fuel consumption and thus carbon footprint, and

to maximize driver labor productivity and truck asset utilization, all the while

balancing out customer satisfaction and on-time deliveries. Moreover real-time data

such as geographic location, trafﬁc congestion, weather, and fuel consumption, can

be exploited for further optimization (Rosenbush and Stevens 2015).
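
The scale claim is easy to verify: an ordering of 120 stops is one of 120! ≈ 7 × 10^198 permutations, so exhaustive search is hopeless and heuristics are used instead. The sketch below shows one classic family of such heuristics (nearest-neighbor construction followed by 2-opt improvement) on random planar stops with straight-line distances; it is only a toy illustration under those assumptions, while ORION's actual models add road networks, time windows, service rules, and many other constraints.

```python
# Illustrative route-sequencing heuristic: nearest-neighbor construction
# followed by 2-opt improvement over straight-line distances.
# Real systems such as ORION work on road networks with time windows and
# many other constraints; this is only a toy sketch.
import math
import random

random.seed(1)
stops = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(120)]

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def route_length(order):
    return sum(dist(stops[order[i]], stops[order[i + 1]])
               for i in range(len(order) - 1))

# Greedy construction: always drive to the nearest unvisited stop.
unvisited = set(range(1, len(stops)))
order = [0]
while unvisited:
    nxt = min(unvisited, key=lambda j: dist(stops[order[-1]], stops[j]))
    order.append(nxt)
    unvisited.remove(nxt)

# 2-opt improvement: keep reversing segments while doing so shortens the route.
improved = True
while improved:
    improved = False
    for i in range(1, len(order) - 2):
        for j in range(i + 1, len(order) - 1):
            before = (dist(stops[order[i - 1]], stops[order[i]]) +
                      dist(stops[order[j]], stops[order[j + 1]]))
            after = (dist(stops[order[i - 1]], stops[order[j]]) +
                     dist(stops[order[i]], stops[order[j + 1]]))
            if after < before - 1e-12:
                order[i:j + 1] = order[i:j + 1][::-1]
                improved = True

print(f"{len(stops)} stops, heuristic route length: {route_length(order):.1f}")
```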

Such capabilities could also be used to not just minimize time or maximize

throughput, but also to maximize revenue. For example, a theme park could

determine the optimal location for a mobile ice cream or face painting stand,

based on prior customer purchases and exact location of customers within the

park. Customers’ locations and identities could be identiﬁed through dedicated long

range radios, as Disney does with MagicBands; through smartphones, as Singtel’s

DataSpark unit does (see below); or through their use of related geographically

oriented services or apps, such as Uber or Foursquare.

1.3.2 Long-Term Process Improvement

In addition to such real-time or short-term process optimization, big data can also

be used to optimize processes and resources over the long term.

For example, DataSpark, a unit of Singtel (a Singaporean telephone company)

has been extracting data from cell phone locations to be able to improve the

MRT (Singapore's subway system) and customer experience (Dataspark 2016).

For example, suppose that GPS data showed that many subway passengers were

traveling between two stops but that they had to travel through a third stop—a hub—

to get there. By building a direct line to bypass the intermediate stop, travelers could

get to their destination sooner, and congestion could be relieved at the intermediate

stop as well as on some of the trains leading to it. Moreover, this data could also be

used for real-time process optimization, by directing customers to avoid a congested

area or line suffering an outage through the use of an alternate route. Obviously a

variety of structured and unstructured data could be used to accomplish both

short-term and long-term improvements, such as GPS data, passenger mobile accounts

and ticket purchases, video feeds of train stations, train location data, and the like.
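
A simplified version of that analysis could aggregate anonymized trip records into an origin-destination matrix and flag heavy flows that currently transfer through a hub as candidates for a direct line, or for rerouting during disruptions; the station names, fields, and rider counts below are hypothetical.

```python
# Illustrative origin-destination analysis over anonymized trip records.
# Station names, fields (including the "via_hub" transfer column), and rider
# counts are hypothetical.
import pandas as pd

trips = pd.DataFrame({
    "origin":      ["Woodlands", "Woodlands", "Tampines", "Woodlands", "Tampines"],
    "destination": ["Changi",    "Changi",    "Changi",   "Orchard",   "Orchard"],
    "via_hub":     ["City Hall", "City Hall", None,       "City Hall", None],
    "riders":      [4200,        3900,        1500,       800,         650],
})

# Total riders per origin-destination pair (the O-D matrix, in long form).
od = trips.groupby(["origin", "destination"])["riders"].sum()
print(od.sort_values(ascending=False))

# Heavy flows that are forced through a transfer hub are candidates for a
# direct line, or for real-time rerouting when that hub is congested or down.
via = (trips.dropna(subset=["via_hub"])
            .groupby(["origin", "destination", "via_hub"])["riders"].sum())
print(via[via > 5000])
```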

1.3.3 Digital-Physical Substitution and Fusion

The digital world and the physical world can be brought together in a number of

ways. One way is substitution, as when a virtual audio, video, and/or web conference

substitutes for physical airline travel, or when an online publication substitutes for


a physically printed copy. Another way to bring together the digital and physical

worlds is fusion, where both online and ofﬂine experiences become seamlessly

merged. An example is in omni-channel marketing, where a customer might browse

online, order online for pickup in store, and then return an item via the mail. Or, a

customer might browse in the store, only to ﬁnd the correct size out of stock, and

order in store for home delivery. Managing data across the customer journey can

provide a single view of the customer to maximize sales and share of wallet for that

customer. This might include analytics around customer online browsing behavior,

such as what they searched for, which styles and colors caught their eye, or what

they put into their shopping cart. Within the store, patterns of behavior can also be

identiﬁed, such as whether people of a certain demographic or gender tend to turn

left or right upon entering the store.

1.3.4 Exhaust-Data Monetization

Processes which are instrumented and monitored can generate massive amounts

of data. This data can often be monetized or otherwise create beneﬁts in creative

ways. For example, Uber’s main business is often referred to as “ride sharing,”

which is really just offering short term ground transportation to passengers desirous

of rides by matching them up with drivers who can give them rides. However, in

an arrangement with the city of Boston, it will provide ride pickup and drop-off

locations, dates, and times. The city will use the data for trafﬁc engineering, zoning,

and even determining the right number of parking spots needed (O’Brien 2015).

Such inferences can be surprisingly subtle. Consider the case of a revolving door

ﬁrm that could predict retail trends and perhaps even recessions. Fewer shoppers

visiting retail stores means fewer shoppers entering via the revolving door. This

means lower usage of the door, and thus fewer maintenance calls.

Another good example is 23andMe. 23andMe is a ﬁrm that was set up to leverage

new low cost gene sequence technologies. A 23andMe customer would take a saliva

sample and mail it to 23andMe, which would then sequence the DNA and inform

the customer about certain genetically based risks they might face, such as markers

signaling increased likelihood of breast cancer due to a variant in the BRCA1 gene.

They also would provide additional types of information based on this sequence,

such as clarifying genetic relationships among siblings or questions of paternity.

After compiling massive amounts of data, they were able to monetize the

collected data outside of their core business. In one $50 million deal, they sold data

from Parkinson’s patients to Genentech, with the objective of developing a cure

for Parkinson’s through deep analytics (Lee 2015). Note that not only is the deal

lucrative, especially since essentially no additional costs were incurred to sell this

data, but also highly ethical. Parkinson’s patients would like nothing better than for

Genentech—or anybody else, for that matter—to develop a cure.


1.3.5 Dynamic, Networked, Virtual Corporations

Processes don’t need to be restricted to the four walls of the corporation. For

example, supply chain optimization requires data from suppliers, channels, and

logistics companies. Many companies have focused on their core business and

outsourced or partnered with others to create and continuously improve supply

chains. For example, Apple sells products, but focuses on design and marketing,

not manufacturing. As many people know, Apple products are built by a partner,

Foxconn, with expertise in precision manufacturing electronic products.

One step beyond such partnerships or virtual corporations are dynamic,

networked virtual corporations. An example is Li & Fung. Apple sells products such

as iPhones and iPads, without owning any manufacturing facilities. Similarly, Li &

Fung sells products, namely clothing, without owning any manufacturing facilities.

However, unlike Apple, who relies largely on one main manufacturing partner; Li &

Fung relies on a network of over 10,000 suppliers. Moreover, the exact conﬁguration

of those suppliers can change week by week or even day by day, even for the same

garment. A shirt, for example, might be sewn in Indonesia with buttons from

Thailand and fabric from S. Korea. That same SKU, a few days later, might be

made in China with buttons from Japan and fabric from Vietnam. The constellation

of suppliers is continuously optimized, by utilizing data on supplier resource

availability and pricing, transportation costs, and so forth (Wind et al. 2009).

1.3.6 Beyond Business

Information excellence also applies to governmental and societal objectives. Earlier

we mentioned using big data to improve the Singapore subway operations and

customer experience; later we’ll mention how it’s being used to improve trafﬁc

congestion in Rio de Janeiro. As an example of societal objectives, consider the

successful delivery of vaccines to remote areas. Vaccines can lose their efﬁcacy

or even become unsafe unless they are refrigerated, but delivery to outlying areas

can mean a variety of transport mechanisms and intermediaries. For this reason,

it is important to ensure that they remain refrigerated across their “cold chain.”

A low-tech method could potentially warn of unsafe vaccines: for example, put a

container of milk in with the vaccines, and if the milk spoils it will smell bad and the

vaccines are probably bad as well. However, by collecting data wirelessly from the

refrigerators throughout the delivery process, not only can it be determined whether

the vaccines are good or bad, but improvements can be made to the delivery process

by identifying the root cause of the loss of refrigeration, for example, loss of power

at a particular port, and thus steps can be taken to mitigate the problem, such as the

deployment of backup power generators (Weinman 2016).


1.4 Solution Leadership

Products (and services) were traditionally standalone and manual, but now have

become connected and automated. Products and services now connect to the cloud

and from there on to ecosystems. The ecosystems can help collect data, analyze it,

provide data to the products or services, or all of the above (Fig. 1.2).

1.4.1 Digital-Physical Mirroring

In product engineering, an emerging approach is to build a data-driven engineering

model of a complex product. For example, GE mirrors its jet engines with “digital

twins” or “virtual machines” (unrelated to the computing concept of the same

name). The idea is that features, engineering design changes, and the like can be

made to the model much more easily and cheaply than building an actual working jet

engine. A new turbofan blade material with different weight, brittleness, and cross

section might be simulated to determine impacts on overall engine performance.

To do this requires product and materials data. Moreover, predictive analytics can

be run against massive amounts of data collected from operating engines (Warwick

2015).

Fig. 1.2 High-level architecture for solution leadership


1.4.2 Real-Time Product/Service Optimization

Recall that solutions are smart, digital, connected products that tie over networks to

the cloud and from there onward to unbounded ecosystems. As a result, the actual

tangible, physical product component functionality can potentially evolve over time

as the virtual, digital components adapt. As two examples, consider a browser that

provides “autocomplete” functions in its search bar, i.e., typing shortcuts based

on previous searches, thus saving time and effort. Or, consider a Tesla, whose

performance is improved by evaluating massive quantities of data from all the Teslas

on the road and their performance. As Tesla CEO Elon Musk says, “When one car

learns something, the whole ﬂeet learns” (Coren 2016).

1.4.3 Product/Service Usage Optimization

Products and services can be used more effectively when data is collected and feedback is provided to customers. The Ford Fusion's EcoGuide SmartGauge provides

feedback to drivers on their fuel efﬁciency. Jackrabbit starts are bad; smooth driving

is good. The EcoGuide SmartGauge grows “green” leaves to provide drivers with

feedback, and is one of the innovations credited with dramatically boosting sales of

the car (Roy 2009).

GE Aviation’s Flight Efﬁciency Services uses data collected from numerous

ﬂights to determine best practices to maximize fuel efﬁciency, ultimately improving

airlines' carbon footprint and profitability. This is an enormous opportunity, because

it’s been estimated that one-ﬁfth of fuel is wasted due to factors such as suboptimal

fuel usage and inefﬁcient routing. For example, voluminous data and quantitative

analytics were used to develop a business case to gain approval from the Malaysian

Directorate of Civil Aviation for AirAsia to use single-engine taxiing. This con-

serves fuel because only one engine is used to taxi, rather than all the engines

running while the plane is essentially stuck on the runway.

Perhaps one of the most interesting examples of using big data to optimize

products and services comes from a company called Opower, which was acquired

by Oracle. It acquires data on buildings, such as year built, square footage, and

usage, e.g., residence, hair salon, real estate ofﬁce. It also collects data from smart

meters on actual electricity consumption. By combining all of this together, it

can message customers such as businesses and homeowners with speciﬁc, targeted

insights, such as that a particular hair salon’s electricity consumption is higher than

80% of hair salons of similar size in the area built to the same building code (and

thus equivalently insulated). Such “social proof” gamiﬁcation has been shown to

be extremely effective in changing behavior compared to other techniques such as

rational quantitative ﬁnancial comparisons (Weinman 2015).
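To make the mechanics of such a peer comparison concrete, here is a minimal sketch in Python. It assumes nothing about Opower's or Oracle's actual systems; the peer group, meter readings, and the 80% threshold are invented for illustration.

```python
# Hedged sketch of a "social proof" energy comparison: rank one customer's
# consumption against a peer group of similar buildings. All figures are
# hypothetical; a real system would draw on smart-meter and building records.

def percentile_rank(value, peers):
    """Fraction of peers whose consumption is below `value`."""
    return sum(1 for p in peers if p < value) / len(peers)

# Monthly kWh for hair salons of similar size and building code (invented).
peer_kwh = [820, 910, 875, 1010, 760, 930, 880, 990, 840, 900]
customer_kwh = 1005

rank = percentile_rank(customer_kwh, peer_kwh)
if rank >= 0.8:
    print(f"Your salon used more electricity than {rank:.0%} of similar salons nearby.")
else:
    print(f"Your salon's usage is in line with similar salons ({rank:.0%} percentile).")
```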


1.4.4 Predictive Analytics and Predictive Maintenance

Collecting data from things and analyzing it can enable predictive analytics and

predictive maintenance. For example, the GE GEnx jet engine has 20 or so sensors,

each of which collects 5000 data points per second in areas such as oil pressure,

fuel ﬂow and rotation speed. This data can then be used to build models and identify

anomalies and predict when the engine will fail.

This in turn means that airline maintenance crews can “ﬁx” an engine before it

fails. This maximizes what the airlines call “time on wing,” in other words, engine

availability. Moreover, engines can be proactively repaired at optimal times and

optimal locations, where maintenance equipment, crews, and spare parts are kept

(Weinman 2015).
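As a rough illustration of the kind of screening such sensor streams make possible, the sketch below flags readings that drift several standard deviations from a rolling baseline. The oil-pressure values, window size, and threshold are invented; this is not GE's method, only a minimal anomaly-detection pattern.

```python
# Hedged sketch: flag anomalous oil-pressure readings against a rolling baseline.
# Sensor values and the 3-sigma threshold are illustrative only.
from collections import deque
from statistics import mean, stdev

def detect_anomalies(readings, window=20, sigmas=3.0):
    baseline = deque(maxlen=window)
    anomalies = []
    for t, value in enumerate(readings):
        if len(baseline) == window:
            mu, sd = mean(baseline), stdev(baseline)
            if sd > 0 and abs(value - mu) > sigmas * sd:
                anomalies.append((t, value))  # candidate for proactive maintenance
        baseline.append(value)
    return anomalies

# Simulated oil-pressure samples with one injected spike.
pressure = [55.0 + 0.1 * (i % 5) for i in range(60)]
pressure[45] = 70.0
print(detect_anomalies(pressure))
```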

1.4.5 Product-Service System Solutions

When formerly standalone products become connected to back-end services and

solve customer problems they become product-service system solutions. Data can

be the glue that holds the solution together. A good example is Nike and the

Nike+ ecosystem.

Nike has a number of mechanisms for collecting activity tracking data, such as

the Nike+ FuelBand, mobile apps, and partner products, such as Nike+ Kinect,

which is a video “game” that coaches you through various workouts. These can

collect data on activities, such as running or bicycling or doing jumping jacks. Data

can be collected, such as the route taken on a run, and normalized into “NikeFuel”

points (Weinman 2015).

Other elements of the ecosystem can measure outcomes. For example, a variety

of scales can measure weight, but the Withings Smart Body Analyzer can also

measure body fat percentage, and link that data to NikeFuel points (Choquel 2014).

By linking devices measuring outcomes to devices monitoring activities—with

the linkages being data traversing networks—individuals can better achieve their

personal goals to become better athletes, lose a little weight, or get more toned.

1.4.6 Long-Term Product Improvement

Actual data on how products are used can ultimately be used for long-term product

improvement. For example, a cable company can collect data on the pattern of

button presses on its remote controls. A repeated pattern of clicking around the

“Guide” button fruitlessly and then ﬁnally ordering an “On Demand” movie might

lead to a clearer placement of a dedicated “On Demand” button on the control. Car

companies such as Tesla can collect data on actual usage, say, to determine how


many batteries to put in each vehicle based on the statistics of distances driven;

airlines can determine what types of meals to offer; and so on.

1.4.7 The Experience Economy

In the Experience Economy framework, developed by Joe Pine and Jim Gilmore,

there is a ﬁve-level hierarchy of increasing customer value and ﬁrm proﬁtability.

At the lowest level are commodities, which may be farmed, ﬁshed, or mined, e.g.,

coffee beans. At the next level of value are products, e.g., packaged, roasted coffee

beans. Still one level higher are services, such as a corner coffee bar. One level

above this are experiences, such as a ﬁne French restaurant, which offers coffee

on the menu as part of a “total product” that encompasses ambience, romance, and

professional chefs and services. But, while experiences may be ephemeral, at the

ultimate level of the hierarchy lie transformations, which are permanent, such as

a university education, learning a foreign language, or having life-saving surgery

(Pine and Gilmore 1999).

1.4.8 Experiences

Experiences can be had without data or technology. For example, consider a hike

up a mountain to its summit followed by taking in the scenery and the fresh air.

However, data can also contribute to experiences. For example, Disney MagicBands

are long-range radios that tie to the cloud. Data on theme park guests can be used to

create magical, personalized experiences. For example, guests can sit at a restaurant

without expressly checking in, and their custom order will be brought to their

table, based on tracking through the MagicBands and data maintained in the cloud

regarding the individuals and their orders (Kuang 2015).

1.4.9 Transformations

Data can also be used to enable transformations. For example, the Nike+ family

and ecosystem of solutions mentioned earlier can help individuals lose weight or

become better athletes. This can be done by capturing data from the individual on

steps taken, routes run, and other exercise activities undertaken, as well as results

data through connected scales and body fat monitors. As technology gets more

sophisticated, no doubt such automated solutions will do what any athletic coach

does, e.g., coaching on backswings, grip positions, stride lengths, pronation and the

like. This is how data can help enable transformations (Weinman 2015).


1.4.10 Customer-Centered Product and Service Data

Integration

When multiple products and services each collect data, they can provide a 360° view

of the patient. For example, patients are often scanned by radiological equipment

such as CT (computed tomography) scanners and X-ray machines. While individual

machines should be calibrated to deliver a safe dose, too many scans from too

many devices over too short a period can deliver doses over accepted limits, leading

potentially to dangers such as cancer. GE Dosewatch provides a single view of the

patient, integrating dose information from multiple medical devices from a variety

of manufacturers, not just GE (Combs 2014).

Similarly, financial companies are trying to develop a 360° view of their

customers’ ﬁnancial health. Rather than the brokerage division being run separately

from the mortgage division, which is separate from the retail bank, integrating data

from all these divisions can help ensure that the customer is neither over-leveraged nor underinvested.

1.4.11 Beyond Business

The use of connected refrigerators to help improve the cold chain was described

earlier in the context of information excellence for process improvement. Another

example of connected products and services is cities, such as Singapore, that help

reduce carbon footprint through connected parking garages. The parking lots report

how many spaces they have available, so that a driver looking for parking need not

drive all around the city: clearly visible digital signs and a mobile app describe how

many—if any—spaces are available (Abdullah 2015).

This same general strategy can be used with even greater impact in the devel-

oping world. For example, in some areas, children walk an hour or more to a well

to ﬁll a bucket with water for their families. However, the well may have gone dry.

Connected, “smart” pump handles can report their usage, and inferences can be

made as to the state of the well. For example, a few pumps of the handle and then

no usage, another few pumps and then no usage, etc., is likely to signify someone

visiting the well, attempting to get water, then abandoning the effort due to lack of

success (ITU and Cisco 2016).
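A toy version of that inference might look like the following, assuming the handle reports hourly stroke counts; the counts and thresholds are invented, and a real deployment would need far more careful modeling.

```python
# Hedged sketch: infer a possibly dry well from connected pump-handle telemetry.
# The pattern "a few strokes, then nothing, repeated" suggests people trying
# and giving up. Counts and thresholds are illustrative.

def looks_dry(hourly_strokes, max_strokes=5, min_attempts=3):
    """True if several short, abandoned pumping attempts occur with no sustained use."""
    attempts = 0
    for count in hourly_strokes:
        if count == 0:
            continue
        if count > max_strokes:      # sustained pumping suggests water is flowing
            return False
        attempts += 1                # short burst, then silence
    return attempts >= min_attempts

print(looks_dry([3, 0, 0, 2, 0, 4, 0]))   # True: repeated abandoned attempts
print(looks_dry([3, 0, 40, 35, 0, 20]))   # False: long pumping sessions
```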

1.5 Collective Intimacy

At one extreme, a customer “relationship” is a one-time, anonymous transaction.

Consider a couple celebrating their 30th wedding anniversary with a once-in-a-

lifetime trip to Paris. While exploring the left bank, they buy a baguette and some

Brie from a hole-in-the-wall bistro. They will never see the bistro again, nor vice

versa.


At the other extreme, there are companies and organizations that see customers

repeatedly. Amazon.com sees its customers’ patterns of purchases; Netﬂix sees its

customers’ patterns of viewing; Uber sees its customers’ patterns of pickups and

drop-offs. As other verticals become increasingly digital, they too will gain more

insight into customers as individuals, rather than anonymous masses. For example,

automobile insurers are increasingly pursuing “pay-as-you-drive,” or “usage-based”

insurance. Rather than customers’ premiums being merely based on aggregate,

coarse-grained information such as age, gender, and prior tickets, insurers can

charge premiums based on individual, real-time data such as driving over the speed

limit, weaving in between lanes, how congested the road is, and so forth.
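The pricing idea can be sketched as a simple event-weighted adjustment; the base premium, per-event surcharges, and cap below are entirely hypothetical, not any insurer's actual model.

```python
# Hedged sketch: a toy usage-based premium adjustment from telematics events.
# Event weights and the base premium are invented for illustration.

BASE_MONTHLY_PREMIUM = 100.0
EVENT_SURCHARGE = {            # per-event surcharges (hypothetical)
    "speeding": 1.50,
    "hard_brake": 0.75,
    "lane_weave": 2.00,
}

def monthly_premium(events, cap=75.0):
    """Base premium plus capped surcharges for risky driving events."""
    surcharge = sum(EVENT_SURCHARGE.get(e, 0.0) for e in events)
    return BASE_MONTHLY_PREMIUM + min(surcharge, cap)

trip_events = ["speeding", "speeding", "lane_weave", "hard_brake"]
print(f"Premium this month: ${monthly_premium(trip_events):.2f}")
```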

Somewhere in between, there are ﬁrms that may not have any transaction history

with a given customer, but can use predictive analytics based on statistical insights

derived from large numbers of existing customers. Capital One, for example,

famously disrupted the existing market for credit cards by building models to create

“intimate” offers tailored to each prospect rather than a one-size ﬁts all model

(Pham 2015).

Big data can also be used to analyze and model churn. Actions can be taken to

intercede before a customer has defected, thus retaining that customer and his or her

proﬁts.

In short, big data can be used to determine target prospects, determine what to

offer them, maximize revenue and proﬁtability, keep them, decide to let them defect

to a competitor, or win them back (Fig. 1.3).

Fig. 1.3 High-level architecture for collective intimacy


1.5.1 Target Segments, Features and Bundles

A traditional arena for big data and analytics has been better marketing to customers

and market basket analysis. For example, one type of analysis segments customers and prospects into three groups: loyal customers, who will buy your product no matter what; those who won't buy no matter what; and those who can be swayed. Marketing funds for advertising and promotions are best spent on the last category, which is where they will generate sales uplift.

A related type of analysis is market basket analysis, identifying those products

that might be bought together. Offering bundles of such products can increase proﬁts

(Skiera and Olderog 2000). Even without bundles, better merchandising can goose

sales. For example, if new parents who buy diapers also buy beer, it makes sense to

put them in the aisle together. This may be extended to product features, where the

“bundle” isn’t a market basket but a basket of features built in to a product, say a

sport suspension package and a V8 engine in a sporty sedan.
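A minimal co-occurrence count over transactions illustrates the idea behind market basket analysis (a simplified stand-in for full association-rule mining); the baskets below are invented.

```python
# Hedged sketch: count item co-occurrence across baskets to surface bundle
# candidates. Transactions are hypothetical.
from itertools import combinations
from collections import Counter

baskets = [
    {"diapers", "beer", "wipes"},
    {"diapers", "beer"},
    {"diapers", "wipes"},
    {"beer", "chips"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs bought together most often are candidates for bundling or co-placement.
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```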

1.5.2 Upsell/Cross-Sell

Amazon.com uses a variety of approaches to maximize revenue per customer. Some,

such as Amazon Prime, which offers free two-day shipping, are low tech and based

on behavioral economics principles such as the “ﬂat-rate bias” and how humans

frame expenses such as sunk costs (Lambrecht and Skiera 2006). But Amazon is perhaps best known for its sophisticated algorithms, which do everything from automating pricing decisions, making millions of price changes every day (Falk 2013), to recommending additional products to buy through a variety of algorithmically generated capabilities such as "People Also Bought These Items".

Some are reasonably obvious, such as, say, paper and ink suggestions if a copier

is bought or a mounting plate if a large ﬂat screen TV is purchased. But many are

subtle, and based on deep analytics at scale of the billions of purchases that have

been made.

1.5.3 Recommendations

If Amazon.com is the poster child for upsell/cross-sell, Netﬂix is the one for a

pure recommendation engine. Because Netﬂix charges a ﬂat rate for a household,

there is limited opportunity for upsell without changing the pricing model. Instead,

the primary opportunity is for customer retention, and perhaps secondarily, referral

marketing, i.e., recommendations from existing customers to their friends. The key

to that is maximizing the quality of the total customer experience. This has multiple

dimensions, such as whether DVDs arrive in a reasonable time or a streaming video


plays cleanly at high resolution, as opposed to pausing to rebuffer frequently. But

one very important dimension is the quality of the entertainment recommendations,

because 70% of what Netﬂix viewers watch comes about through recommendations.

If viewers like the recommendations, they will like Netﬂix, and if they don’t, they

will cancel service. So, reduced churn and maximal lifetime customer value are

highly dependent on this (Amatriain 2013).

Netﬂix uses extremely sophisticated algorithms against trillions of data points,

which attempt to solve as best as possible the recommendation problem. For

example, they must balance out popularity with personalization. Most people like

popular movies; this is why they are popular. But every viewer is an individual,

hence will like different things. Netﬂix continuously evolves their recommendation

engine(s), which determine which options are presented when a user searches, what

is recommended based on what’s trending now, what is recommended based on

prior movies the user has watched, and so forth. This evolution spans a broad

set of mathematical and statistical methods and machine learning algorithms, such

as matrix factorization, restricted Boltzmann machines, latent Dirichlet allocation,

gradient boosted decision trees, and afﬁnity propagation (Amatriain 2013). In addi-

tion, a variety of metrics—such as member retention and engagement time—and

experimentation techniques—such as ofﬂine experimentation and A/B testing—are

tuned for statistical validity and used to measure the success of the ensemble of

algorithms (Gomez-Uribe and Hunt 2015).
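Matrix factorization, one of the methods named above, can be illustrated with a tiny latent-factor model trained by stochastic gradient descent. The ratings, factor count, and hyperparameters are toy values; Netflix's production systems are, of course, vastly more elaborate than this sketch.

```python
# Hedged sketch: latent-factor matrix factorization trained by stochastic
# gradient descent on a toy ratings set. All data and hyperparameters are
# illustrative only.
import random

ratings = [          # (user, item, stars) -- hypothetical data
    (0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 4), (2, 2, 5),
]
n_users, n_items, k = 3, 3, 2
random.seed(0)
P = [[random.uniform(0, 0.1) for _ in range(k)] for _ in range(n_users)]  # user factors
Q = [[random.uniform(0, 0.1) for _ in range(k)] for _ in range(n_items)]  # item factors

lr, reg = 0.05, 0.02
for epoch in range(200):
    for u, i, r in ratings:
        pred = sum(P[u][f] * Q[i][f] for f in range(k))
        err = r - pred
        for f in range(k):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)
            Q[i][f] += lr * (err * pu - reg * qi)

# Predict an unseen rating: user 0 on item 2.
print(round(sum(P[0][f] * Q[2][f] for f in range(k)), 2))
```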

1.5.4 Sentiment Analysis

A particularly active current area in big data is the use of sophisticated algorithms to

determine an individual’s sentiment (Yegulalp 2015). For example, textual analysis

of tweets or posts can determine how a customer feels about a particular product.

Emerging techniques include emotional analysis of spoken utterances and even

sentiment analysis based on facial imaging.

Some enterprising companies are using sophisticated algorithms to conduct such

sentiment analysis at scale, in near real time, to buy or sell stocks based on how

sentiment is turning as well as additional analytics (Lin 2016).
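In its simplest form, textual sentiment scoring can be sketched with a small lexicon, as below; the word lists and posts are invented, and real systems rely on trained language models rather than word counting.

```python
# Hedged sketch: lexicon-based sentiment scoring of short posts. The word
# lists are tiny and illustrative only.

POSITIVE = {"love", "great", "excellent", "fast", "happy"}
NEGATIVE = {"hate", "terrible", "slow", "broken", "angry"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

posts = [
    "love the new camera, great battery",
    "screen broken after a week, terrible support",
]
for post in posts:
    print(sentiment(post), "-", post)
```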

1.5.5 Beyond Business

Such an approach is relevant beyond the realm of corporate affairs. For example,

a government could utilize a collective intimacy strategy in interacting with its

citizens, in recommending the best combination of public transportation based on

personal destination objectives, or the best combination of social security beneﬁts,

based on a personal ﬁnancial destination. Dubai, for example, has released a mobile

app called Dubai Now that will act as a single portal to thousands of government

services, including, for example, personalized, contextualized GPS-based real-time

trafﬁc routing (Al Serkal 2015).


1.6 Accelerated Innovation

Innovation has evolved through multiple stages, from the solitary inventor, such

as the early human who invented the ﬂint hand knife, through shop invention,

a combination of research lab and experimental manufacturing facility, to the

corporate research labs (Weinman 2015). However, even the best research labs can

only hire so many people, but ideas can come from anywhere.

The theory of open innovation proposes loosening the ﬁrm boundaries to partners

who may have ideas or technologies that can be brought into the ﬁrm, and to

distribution partners who may be able to make and sell ideas developed from within

the ﬁrm. Open innovation suggests creating relationships that help both of these

approaches succeed (Chesbrough 2003).

However, even preselecting relationships can be overly constricting. A still more

recent approach to innovation lets these relationships be ad hoc and dynamic. I call

it accelerated innovation, but it can be not only faster, but also better and cheaper.

One way to do this is by holding contests or posting challenges, which theoretically

anyone in the world could solve. Related approaches include innovation networks

and idea markets. Increasingly, machines will be responsible for innovation, and we

are already seeing this in systems such as IBM’s Chef Watson, which ingested a

huge database of recipes and now can create its own innovative dishes, and Google

DeepMind’s AlphaGo, which is innovating game play in one of the oldest games in

the world, Go (Fig. 1.4).

Fig. 1.4 High-level architecture for accelerated innovation


1.6.1 Contests and Challenges

Netﬂix depends heavily on the quality of its recommendations to maximize

customer satisfaction, thus customer retention, and thereby total customer lifetime

value and proﬁtability. The original Netﬂix Cinematch recommendation system

let Netﬂix customers rate movies on a scale of one star to ﬁve stars. If Netﬂix

recommended a movie that the customer then rated a one, there is an enormous

discrepancy between Netﬂix’s recommendations and the user’s delight. With a per-

fect algorithm, customers would always rate a Netﬂix-recommended movie a ﬁve.

Netﬂix launched the Netﬂix Prize in 2006. It was open to anyone who wished

to compete, and multiple teams did so, from around the world. Netﬂix made 100

million anonymized movie ratings available. These came from almost half a million

subscribers, across almost 20,000 movie titles. It also withheld 3 million ratings to

evaluate submitted contestant algorithms (Bennett and Lanning 2007). Eventually,

the prize was awarded to a team that did in fact meet the prize objective: a 10%

improvement in the Cinematch algorithm. Since that time, Netﬂix has continued

to evolve its recommendation algorithms, adapting them to the now prevalent

streaming environment, which provides billions of additional data points. While user-submitted DVD ratings may be reasonably accurate, actual viewing behaviors and contexts are substantially more accurate and thus better predictors. As one example, while many viewers say they appreciate foreign documentaries, actual viewing behavior shows that crude comedies are much more likely to be watched.
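The scoring mechanics of such a contest can be sketched as follows: submissions are compared to the withheld ratings by root-mean-square error (RMSE), the quality metric used for the Netflix Prize. The held-out ratings and predictions here are invented.

```python
# Hedged sketch: scoring a contest submission against withheld ratings by
# root-mean-square error (RMSE). All numbers are hypothetical.
from math import sqrt

def rmse(predicted, actual):
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

held_out = [4, 3, 5, 2, 4]            # withheld ratings (invented)
baseline = [3.6, 3.6, 3.6, 3.6, 3.6]  # e.g., predict the global mean rating
entry    = [4.2, 2.8, 4.6, 2.3, 3.9]  # a contestant's predictions (invented)

improvement = 1 - rmse(entry, held_out) / rmse(baseline, held_out)
print(f"RMSE improvement over baseline: {improvement:.1%}")
```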

GE Flight Quest is another example of a big data challenge. For Flight Quest 1,

GE published data on planned and actual ﬂight departures and arrivals, as well

as weather conditions, with the objective of better predicting ﬂight times. Flight

Quest II then attempted to improve ﬂight arrival times through better scheduling

and routing. The key point of both the Netﬂix Prize and GE’s Quests is that large

data sets were the cornerstone of the innovation process. Methods used by Netﬂix

were highlighted in Sect. 1.5.3. The methods used by the GE Flight Quest winners

span gradient boosting, random forest models, ridge regressions, and dynamic

programming (GE Quest n.d.).

1.6.2 Contest Economics

Running such a contest exhibits what I call “contest economics.” For example,

rather than paying for effort, as a ﬁrm would when paying salaries to its R&D

team, it can now pay only for results. The results may be qualitative, for example,

“best new product idea,” or quantitative, for example, a percentage improvement

in the Cinematch algorithm. Moreover, the winner may be the single "best" idea, or any entry surpassing a given threshold, or one selected by judges' decision. This means that

hundreds or tens of thousands of “solvers” or contestants may be working on your

problem, but you only need to pay out in the event of a sufﬁciently good solution.


Moreover, because the world’s best experts in a particular discipline may be

working on your problem, the quality of the solution may be higher than if

conducted internally, and the time to reach a solution may be faster than internal

R&D could do it, especially in the fortuitous situation where just the right expert is

matched with just the right problem.

1.6.3 Machine Innovation

Technology is evolving to be not just an enabler of innovation, but the source of

innovation itself. For example, a program called AlphaGo developed by DeepMind,

which has been acquired by Google, bested the European champion, Fan Hui, and

then the world champion, Lee Sedol. Rather than mere brute force examination of

many moves in the game tree together with a board position evaluation metric, it

used a deep learning approach coupled with some game knowledge encoded by its

developers (Moyer 2016).

Perhaps the most interesting development, however, was in Game 2 of the

tournament between AlphaGo and Sedol. Move 37 was so unusual that the human

commentators thought it was a mistake—a bug in the program. Sedol stood up and

left the game table for 15 min to regain his composure. It was several moves later

that the rationale and impact of Move 37 became clear, and AlphaGo ended up

winning that game, and the tournament. Move 37 was "beautiful" (Metz 2016) in retrospect, the way that the heliocentric theory of the solar system, the Theory of Relativity, and the concept of quasicrystals now are. To put it another way, a machine innovated in a way that thousands of years and millions of human players had not.

1.6.4 Beyond Business

Of course, such innovation is not restricted to board games. Melvin is a program

that designs experiments in quantum physics, which are notoriously counterintuitive to design. It takes standard components such as lasers and beam

splitters, and determines new ways to combine them to test various quantum

mechanics hypotheses. It has already been successful in creating such experiments.

In another example of the use of big data for innovation, automated hypothesis

generation software was used to scan almost two hundred thousand scientiﬁc paper

abstracts in biochemistry to determine the most promising "kinases" (a type of protein) that activate another specific protein, "p53", which slows cancer growth.

All but two of the top prospects identiﬁed by the software proved to have the desired

effect (The Economist 2014).


1.7 Integrated Disciplines

A traditional precept of business strategy is the idea of focus. As ﬁrms select

a focused product area, market segment, or geography, say, they also make a

conscious decision on what to avoid or say “no” to. A famous story concerns

Southwest, an airline known for its no-frills, low-cost service. Its CEO, Herb Kelleher, explained that every strategic decision could be viewed in the light of whether it helped achieve that focus. For example, the idea of

serving a tasty chicken Caesar salad on its ﬂights could be instantly nixed, because

it wouldn’t be aligned with low cost (Heath and Heath 2007).

McDonald’s famously ran into trouble by attempting to pursue operational

excellence, product leadership, and customer intimacy at the same time, and these

were in conﬂict. After all, having the tastiest burgers—product leadership—

would mean foregoing mass pre-processing in factories that created frozen

patties—operational excellence. Having numerous products, combinations and

customizations such as double patty, extra mayo, no onions—customer intimacy—

would take extra time and conﬂict with a speedy drive through line—operational

excellence (Weinman 2015).

However, the economics of information and information technology mean that

a company can well orient itself to more than one discipline. The robots that

run Amazon.com’s logistics centers, for example, can use routing and warehouse

optimization programs—operational excellence—that are designed once, and don’t

necessarily conﬂict with the algorithms that make product recommendations based

on prior purchases and big data analytics across millions or billions of transactions.

The efﬁcient delivery of unicast viewing streams to Netﬂix streaming

subscribers—operational excellence—doesn’t conﬂict with the entertainment

suggestions derived by the Netﬂix recommender—collective intimacy—nor does

it conﬂict with the creation of original Netﬂix content—product leadership—nor

does it impact Netﬂix’s ability to run open contests and challenges such as the

Netﬂix Prize or the Netﬂix Cloud OSS (Open Source Software) Prize—accelerated

innovation.

In fact, not only do the disciplines not conﬂict, but, in such cases, data captured

or derived in one discipline can be used to support the needs of another in the

same company. For example, Netﬂix famously used data on customer behaviors,

such as rewind or re-watch, contexts, such as mobile device or family TV, and

demographics, such as age and gender, that were part of its collective intimacy

strategy, to inform decisions made about investing in and producing House of Cards,

a highly popular, Emmy-Award-winning show, that supports product leadership.

The data need not even be restricted to a single company. Uber, the “ride-

sharing” company, entered into an agreement with Starwood, the hotel company

(Hirson 2015). A given Uber customer might be dropped off at a competitor’s hotel,

offering Starwood the tantalizing possibility of emailing that customer a coupon for

20% off their next stay at a Sheraton or Westin, say, possibly converting a lifelong

competitor customer into a lifelong Starwood customer. The promotion could be

extremely targeted, along the lines of, say, “Mr. Smith, you’ve now stayed at our


competitor’s hotel at least three times. But did you know that the Westin Times

Square is rated 1 star higher than the competitor? Moreover, it’s only half as far

away from your favorite restaurant as the competitor, and has a health club included

in the nightly fee, which has been rated higher than the health club you go to at the

competitor hotel.”

Such uses are not restricted to analytics for marketing promotions. For example,

Waze and the City of Rio de Janeiro have announced a collaboration. Waze is a

mobile application that provides drivers with information such as driving directions,

based not only on maps but also real-time congestion data derived from all other

Waze users, a great example of crowdsourcing with customer intimacy. In a bidirec-

tional arrangement with Rio de Janeiro, Waze will improve its real time routing by

utilizing not only the data produced by Waze users, but additional data feeds offered

by the City. Data will ﬂow in the other direction, as well, as Rio uses data collected

by Waze to plan new roads or to better time trafﬁc signals (Ungerleider 2015).

1.8 Conclusion

A combination of technologies such as the cloud, big data and analytics, machine

learning, social, mobile, and the Internet of Things is transforming the world

around us. Information technologies, of course, have information at their nexus. Consequently, data, together with the capabilities to extract information and insight from it and to make decisions and take action on that insight, is key to the strategic application of information technology: to increase the competitiveness of our firms, to enhance the value created for our customers, and to excel beyond these domains into the areas of government and society.

Four generic strategies—information excellence, solution leadership, collective

intimacy, and accelerated innovation—can be used independently or in combination

to utilize big data and related technologies to differentiate and create customer

value—for better processes and resources, better products and services, better

customer relationships, and better innovation, respectively.

References

Abdullah, Z. (2015). New app to help motorists find available parking. http://www.straitstimes.com/singapore/new-app-to-help-motorists-find-available-parking. Accessed 15 September 2016.
Al Serkal, M. (2015). Dubai to launch smart app for 2000 government services. http://gulfnews.com/news/uae/government/dubai-to-launch-smart-app-for-2-000-government-services-1.1625556. Accessed 15 September 2016.
Amatriain, X. (2013). Big and personal: data and models behind Netflix recommendations. In Proceedings of the 2nd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (pp. 1–6). ACM.
Bennett, J., & Lanning, S. (2007). The Netflix prize. In Proceedings of KDD Cup and Workshop (Vol. 2007, p. 35).
Chesbrough, H. W. (2003). Open innovation: The new imperative for creating and profiting from technology. Boston: Harvard Business School Press.
Choquel, J. (2014). NikeFuel total on your Withings scale. http://blog.withings.com/2014/07/22/new-way-to-fuel-your-motivation-see-your-nikefuel-total-on-your-withings-scale. Accessed 15 September 2016.
Combs, V. (2014). An infographic that works: I want DoseWatch from GE Healthcare. Med City News. http://medcitynews.com/2014/05/infographic-works-want-ges-dosewatch. Accessed 15 September 2016.
Coren, M. (2016). Tesla has 780 million miles of driving data, and adds another million every 10 hours. http://qz.com/694520/tesla-has-780-million-miles-of-driving-data-and-adds-another-million-every-10-hours/. Accessed 15 September 2016.
Dataspark (2016). Can data science help build better public transport? https://datasparkanalytics.com/insight/can-data-science-help-build-better-public-transport. Accessed 15 September 2016.
Falk, T. (2013). Amazon changes prices millions of times every day. ZDNet.com. http://www.zdnet.com/article/amazon-changes-prices-millions-of-times-every-day. Accessed 15 September 2016.
GE Aviation (n.d.). http://www.geaviation.com/commercial/engines/genx/. Accessed 15 September 2016.
GE Quest (n.d.). http://www.gequest.com/c/flight. Accessed 15 November 2016.
Gomez-Uribe, C., & Hunt, N. (2015). The Netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Information Systems, 6(4), 1–19.
Hayes, C. (2004). What Wal-Mart knows about customers' habits. The New York Times. http://www.nytimes.com/2004/11/14/business/yourmoney/what-walmart-knows-about-customers-habits.html. Accessed 15 September 2016.
Heath, C., & Heath, D. (2007). Made to stick: Why some ideas survive and others die. New York, USA: Random House.
Hirson, R. (2015). Uber: The big data company. http://www.forbes.com/sites/ronhirson/2015/03/23/uber-the-big-data-company/#2987ae0225f4. Accessed 15 September 2016.
ITU and Cisco (2016). Harnessing the Internet of Things for global development. http://www.itu.int/en/action/broadband/Documents/Harnessing-IoT-Global-Development.pdf. Accessed 15 September 2016.
Kaggle (n.d.). GE tackles the industrial internet. https://www.kaggle.com/content/kaggle/img/casestudies/Kaggle%20Case%20Study-GE.pdf. Accessed 15 September 2016.
Kuang, C. (2015). Disney's $1 billion bet on a magical wristband. Wired. http://www.wired.com/2015/03/disney-magicband. Accessed 15 September 2016.
Lambrecht, A., & Skiera, B. (2006). Paying too much and being happy about it: Existence, causes and consequences of tariff-choice biases. Journal of Marketing Research, XLIII, 212–223.
Lee, S. (2015). 23andMe and Genentech in deal to research Parkinson's treatments. SFGate, January 6, 2015. http://www.sfgate.com/health/article/23andMe-and-Genentech-in-deal-to-research-5997703.php. Accessed 15 September 2016.
Lin, D. (2016). Seeking and finding alpha: Will cloud disrupt the investment management industry? https://thinkacloud.wordpress.com/2016/03/07/seeking-and-finding-alpha-will-cloud-disrupt-the-investment-management-industry/. Accessed 15 September 2016.
Loveman, G. W. (2003). Diamonds in the data mine. Harvard Business Review, 81(5), 109–113.
Metz, C. (2016). In two moves, AlphaGo and Lee Sedol redefined the future. Wired. http://www.wired.com/2016/03/two-moves-alphago-lee-sedol-redefined-future/. Accessed 15 September 2016.
Moyer, C. (2016). How Google's AlphaGo beat a Go world champion. http://www.theatlantic.com/technology/archive/2016/03/the-invisible-opponent/475611/. Accessed 15 September 2016.
O'Brien, S. A. (2015). Uber partners with Boston on traffic data. http://money.cnn.com/2015/01/13/technology/uber-boston-traffic-data/. Accessed 15 September 2016.
Pham, P. (2015). The impacts of big data that you may not have heard of. Forbes.com. http://www.forbes.com/sites/peterpham/2015/08/28/the-impacts-of-big-data-that-you-may-not-have-heard-of/#3b1ccc1c957d. Accessed 15 September 2016.
Pine, J., & Gilmore, J. (1999). The experience economy: Work is theatre and every business a stage. Boston: Harvard Business School Press.
Rosenbush, S., & Stevens, L. (2015). At UPS, the algorithm is the driver. The Wall Street Journal. http://www.wsj.com/articles/at-ups-the-algorithm-is-the-driver-1424136536. Accessed 15 September 2016.
Roy, R. (2009). Ford's green goddess grows leaves. http://www.autoblog.com/2009/10/29/ford-smart-gauge-engineer/. Accessed 15 September 2016.
Skiera, B., & Olderog, T. (2000). The benefits of bundling strategies. Schmalenbach Business Review, 52, 137–159.
The Economist (2014). Computer says try this. http://www.economist.com/news/science-and-technology/21621704-new-type-software-helps-researchers-decide-what-they-should-be-looking. Accessed 15 September 2016.
Treacy, M., & Wiersema, F. (1995). The discipline of market leaders. Reading, USA: Addison-Wesley.
Ungerleider, N. (2015). Waze is driving into city hall. http://www.fastcompany.com/3045080/waze-is-driving-into-city-hall. Accessed 15 September 2016.
Warwick, G. (2015). GE advances analytical maintenance with digital twins. http://aviationweek.com/optimizing-engines-through-lifecycle/ge-advances-analytical-maintenance-digital-twins. Accessed 15 September 2016.
Weinman, J. (2015). Digital disciplines. New Jersey: John Wiley & Sons.
Weinman, J. (2016). The internet of things for developing economies. CIO. http://www.cio.com/article/3027989/internet-of-things/the-internet-of-things-for-developing-economies.html. Accessed 15 September 2016.
Wind, J., Fung, V., & Fung, W. (2009). Network orchestration: Creating and managing global supply chains without owning them. In P. R. Kleindorfer, Y. (Jerry) R. Wind, & R. E. Gunther (Eds.), The network challenge: Strategy, profit, and risk in an interlinked world (pp. 299–314). Upper Saddle River, USA: Wharton School Publishing.
Withings (2014). NikeFuel total on your Withings scale. http://blog.withings.com/2014/07/22/new-way-to-fuel-your-motivation-see-your-nikefuel-total-on-your-withings-scale/. Accessed 15 September 2016.
Xvela (2016). https://xvela.com/solutions.html. Accessed 15 September 2016.
Yegulalp, S. (2015). IBM's Watson mines Twitter for sentiments. Infoworld.com, 17 March 2015. http://www.infoworld.com/article/2897602/big-data/ibms-watson-now-mines-twitter-for-sentiments-good-bad-and-ugly.html. Accessed 15 September 2016.

Chapter 2

Start with Privacy by Design in All Big Data

Applications

Ann Cavoukian and Michelle Chibba

2.1 Introduction

The evolution of networked information and communication technologies has, in

one generation, radically changed the value of and ways to manage data. These

trends carry profound implications for privacy. Data creation and dissemination have accelerated around the world, and data is being copied and stored indefinitely, resulting in the emergence of Big Data. The old information destruction paradigm

created in an era of paper records is no longer relevant, because digital bits

and bytes have now attained near immortality in cyberspace, thwarting efforts

to successfully remove them from “public” domains. The practical obscurity of

personal information—the data protection of yesteryear—is disappearing as data

becomes digitized, connected to the grid, and exploited in countless new ways.

We’ve all but given up trying to inventory and classify information, and now rely

more on advanced search techniques and automated tools to manage and “mine”

data. The combined effect is that while information has become cheap to distribute,

copy, and recombine, personal information has also become far more available

and consequential. The challenges to control and protect personal information are

signiﬁcant. Implementing and following good privacy practices should not be a

hindrance to innovation, to reaping societal beneﬁts or to ﬁnding the means to

reinforce the public good from Big Data analytics—in fact, by doing so, innovation

is fostered with doubly-enabling, win–win outcomes. The privacy solution requires

a combination of data minimization techniques, credible safeguards, meaningful

individual participation in data processing life cycles, and robust accountability

measures in place by organizations informed by an enhanced and enforceable set of


universal privacy principles better suited to modern realities. This is where Privacy

by Design becomes an essential approach for Big Data applications. This chapter

begins by deﬁning information privacy, then it will provide an overview of the

privacy risks associated with Big Data applications. Finally, the authors will discuss

Privacy by Design as an international framework for privacy, then provide guidance

on using the Privacy by Design Framework and the 7 Foundational Principles, to

achieve both innovation and privacy—not one at the expense of the other.

2.2 Information Privacy Deﬁned

Information privacy refers to the right or ability of individuals to exercise control

over the collection, use and disclosure by others of their personal information

(Clarke 2000). The ability to determine the fate of one’s personal information is

so important that the authors wish to bring to the attention of the readers, the

term “informational self-determination” which underpins the approach taken to

privacy in this chapter. This term was established in 1983 in Germany when the

Constitutional Court ruled that individuals, not governments, determine the fate of

their personal information. Since this time, in December 2013, the United Nations

General Assembly adopted resolution 68/167 (UN 2016), which expressed deep

concern at the negative impact that surveillance and interception of communications

may have on human rights. The General Assembly afﬁrmed that the rights held by

people ofﬂine must also be protected online, and it called upon all States to respect

and protect the right to privacy in digital communication.

Information privacy makes each of us ‘masters’ of the data that identiﬁes each

of us – as individual, citizen, worker, consumer, patient, student, tourist, investor,

parent, son, or daughter. For this, the notions of empowerment, control, choice and

self-determination are the very essence of what we refer to as information privacy.

As 'custodians' of our information, we expect that governments and businesses can be trusted with its safekeeping and proper use.

There have also been references to statements such as “If you have nothing to

hide, you have nothing to fear.” (Solove 2007) Privacy is not about secrecy. It is

about the freedom to exercise one's right to decide with whom to share the personal details of one's life. Democracy does not begin with intrusions into

one’s personal sphere—it begins with human rights, civil liberties and privacy—all

fundamental to individual freedom.

Sometimes, safekeeping or information security is taken to mean that privacy

has been addressed. To be clear, information security does not equal privacy.

While data security certainly plays a vital role in enhancing privacy, there is an

important distinction to be made—security is about protecting data assets. It is

about achieving the goals of conﬁdentiality, integrity and availability. Privacy related

goals developed in Europe that complement this security triad are: unlinkability,

transparency and intervenability. In other words, information privacy incorporates

a much broader set of protections than security alone. We look to the work on


‘contextual integrity’ (Dwork 2014) that extends the meaning of privacy to a much

broader class of transmission principles that cannot be presumed unless warranted

by other context-speciﬁc parameters inﬂuenced by other actors and information

types. Privacy relates not only to the way that information is protected and accessed,

but also to the way in which it is collected and used. For example, user access

controls protect personal information from internal threats by preventing even the

possibility of accidental or intentional disclosure or misuse. This protection is

especially needed in the world of Big Data.

2.2.1 Is It Personally Identiﬁable Information?

Not all data gives rise to privacy concerns. An important ﬁrst step for any Big

Data application is to determine whether the information involved falls under the

deﬁnition of personally identiﬁable information (PII). Privacy laws around the

world include a definition of personal information, and it is this definition that determines whether or not the rules apply. Although each law has its own definition, we will use the NIST definition, where personal information (also known as personally identifiable

information) may be deﬁned as any information, recorded or otherwise, relating

to an identiﬁable individual (NIST 2010). It is important to note that almost any

information (e.g. biographical, biological, genealogical, historical, transactional,

locational, relational, computational, vocational, or reputational), may become

personal in nature. Privacy laws and associated rules will apply to information

if there is a reasonable possibility of identifying a speciﬁc individual—whether

directly, indirectly, or through manipulation or data linkage.

Understanding the different forms of non-personal data helps to better understand

what constitutes personal information. One example is de-identiﬁed or anonymous

information, which will be dealt with in more detail later in this chapter. NIST

deﬁnes de-identiﬁed information as records that have had enough personal infor-

mation removed or obscured in some manner such that the remaining information

does not identify an individual, and there is no reasonable basis to believe that the

information can be used to identify an individual (NIST 2015). As an illustration,

under a U.S. law known as the Health Insurance Portability and Accountability Act

(HIPAA), a set of standards exist to determine when health-care information is

no longer ‘individually identiﬁable’ or de-identiﬁed (HHS 2012). If this standard

is achieved, then the health-care information would not be subject to this law

governing the privacy of health care information. Another example is the EU

General Data Protection Regulation (GDPR) that similarly, excludes anonymous

information (EU Commission 2015). Of interest, however, is that this European

law introduces the concept of “pseudonymization” deﬁned as the processing of

personal data in such a way as to prevent attribution to an identiﬁed or identiﬁable


person without additional information that may be held separately.1 For research and

statistical purposes, certain requirements under the GDPR are relaxed if the personal

data is pseudonymized, which is considered an appropriate safeguard alongside

encryption (Ofﬁcial Journal of the European Union 2016).
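A minimal sketch of the pseudonymization concept just described, assuming a keyed hash (HMAC) over the direct identifier, with the key held separately by the data custodian; the records and key are invented, and this is only one of several possible pseudonymization techniques.

```python
# Hedged sketch: keyed pseudonymization of a direct identifier. The same
# person maps to the same pseudonym across records; re-identification is
# possible only for whoever holds the separately stored key or lookup table.
# Key and records are illustrative.
import hmac, hashlib

SECRET_KEY = b"held-separately-by-the-data-custodian"

def pseudonym(identifier: str) -> str:
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:12]

records = [
    {"name": "Alice Ng", "postal_code": "M5B 2K3", "diagnosis": "asthma"},
    {"name": "Alice Ng", "postal_code": "M5B 2K3", "diagnosis": "fracture"},
]
for r in records:
    r["patient_id"] = pseudonym(r.pop("name"))   # replace the name with a pseudonym
    print(r)
```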

Another form is when personal information is aggregated. Aggregation refers

to summary data that have been generated by performing a calculation across all

individual units as a whole. For example, medical researchers may use aggregated

patient data to assess new treatment strategies; governments may use aggregated

population data for statistical analysis on certain publicly funded programs for

reporting purposes; companies may use aggregated sales data to assist in deter-

mining future product lines. Work has also been done on privacy-preserving data

aggregation in wireless sensor networks, especially relevant in the context of the

Internet of Things (Zhang et al. 2016). By using aggregated data, there is a reduced

risk of connecting this information to a specific person or identifying an individual.

Lastly, while personal information may be classiﬁed as conﬁdential, not all

conﬁdential information should be governed under privacy rules. Conﬁdential

information includes information that should not be publicly available and often

holds tremendous value and importance for organizations, such as strategic business

plans, interim revenue forecasts, proprietary research, or other intellectual property.

The distinction is that while the theft or loss of such conﬁdential information is of

grave concern for an organization it would not constitute a privacy breach because

it does not involve personal information—rather, it is business information.

The growth in Big Data applications and other information communication technologies has added to the challenge of defining personal information.

There are times when information architectures, developed by engineers to ensure

the smooth functioning of computer networks and connectivity, lead to unforeseen

uses that have an impact on identity and privacy. These changes present challenges

to what constitutes personal information, extending it from obvious tombstone

data (name, address, telephone number, date of birth, gender) to the innocuous

computational or metadata once the purview of engineering requirements for

communicating between devices (Cameron 2013; Mayer et al. 2016).

Metadata, for example, is information generated by our communications devices

and our communications service providers as we use landline or mobile phones,

computers, tablets, or other computing devices. Metadata is essentially information

about other information—in this case, relating to our communications (Mayer

et al. 2016). Using metadata in Big Data analysis requires understanding of context.

1 NIST (2015) defines 'pseudonymization' as a specific kind of transformation in which

the names and other information that directly identiﬁes an individual are replaced with

pseudonyms. Pseudonymization allows linking information belonging to an individual across

multiple data records or information systems, provided that all direct identiﬁers are systematically

pseudonymized. Pseudonymization can be readily reversed if the entity that performed the

pseudonymization retains a table linking the original identities to the pseudonyms, or if the

substitution is performed using an algorithm for which the parameters are known or can be

discovered.


Metadata reveals detailed patterns of association that can be far more invasive of

privacy than merely accessing the content of one’s communications (Cavoukian

2013a,b). Addresses, such as the Media Access Control (MAC) number that are

designed to be persistent and unique for the purpose of running software applica-

tions and utilizing Wi-Fi positioning systems to communicate to a local area network

can now reveal much more about an individual through advances in geo-location

services and uses of smart mobile devices (Cavoukian and Cameron 2011). Another

good example in the mobile environment would be a unique device identiﬁer such as

an International Mobile Equipment Identity (IMEI) number: even though this does

not name the individual, if it is used to treat individuals differently it will ﬁt the

deﬁnition of personal data (Information Commissioner’s Ofﬁce ICO 2013).

No doubt, the mobile ecosystem is extremely complex and architectures that

were ﬁrst developed to ensure the functioning of wireless network components

now act as geo-location points, thereby transforming the original intent or what

might be an unintended consequence for privacy. As noted by the International

Working Group on Data Protection in Telecommunications (IWGDPT 2004): "The

enhanced precision of location information and its availability to parties other

than the operators of mobile telecommunications networks create unprecedented

threats to the privacy of the users of mobile devices linked to telecommunications

networks.” When a unique identiﬁer may be linked to an individual, it often falls

under the deﬁnition of “personal information” and carries with it a set of regulatory

responsibilities.

2.3 Big Data: Understanding the Challenges to Privacy

Before moving into understanding the challenges and risks to privacy that arise

from Big Data applications and the associated data ecosystem, it is important to

emphasize that these should not be deterrents to extracting value from Big Data.

The authors believe that by understanding these privacy risks early on, Big Data

application developers, researchers, policymakers, and other stakeholders will be

sensitized to the privacy issues and therefore, be able to raise early ﬂags on potential

unintended consequences as part of a privacy/security threat risk analysis.

We know that with advances in Big Data applications, organizations are devel-

oping a more complete understanding of the individuals with whom they interact

because of the growth and development of data analytical tools, and systems avail-

able to them. Public health authorities, for example, have a need for more detailed

information in order to better inform policy decisions related to managing their

increasingly limited resources. Local governments are able to gain insights never

before available into trafﬁc patterns that lead to greater road and pedestrian safety.

These examples and many more demonstrate the ability to extract insights from Big

Data that will, without a doubt, be of enormous socio-economic signiﬁcance. These

challenges and insights are further examined in the narrative on the impact of Big

Data on privacy (Lane et al. 2014).


With this shift to knowledge creation and service delivery, the value of infor-

mation and the need to manage it responsibly have grown dramatically. At the

same time, rapid innovation, global competition and increasing system complexity

present profound challenges for informational privacy. The notion of informational

self-determination seems to be collapsing under the weight, diversity, speed and

volume of Big Data processing in the modern digital era. When a Big Data set is

comprised of identiﬁable information, then a host of customary privacy risks apply.

As technological advances improve our ability to exploit Big Data, potential privacy

concerns could stir a regulatory backlash that would dampen the data economy and

stiﬂe innovation (Tene and Polonetsky 2013). These concerns are reﬂected in, for

example, the debate around the new European legislation that includes a ‘right to

be forgotten’ that is aimed at helping individuals better manage data protection

risks online by requiring organizations to delete their data if there are no legitimate

grounds for retaining it (EU Commission 2012). The genesis of the incorporation

of this right comes from a citizen complaint to a data protection regulator against

a newspaper and a major search engine concerning outdated information about the

citizen that continued to appear in online search results of the citizen’s name. Under

certain conditions now, individuals have the right to ask search engines to remove

links with personal information about them that is “inaccurate, inadequate, irrelevant

or excessive.” (EU Commission 2012)

Big Data challenges the tenets of information security, which may also be

of consequence for the protection of privacy. Security challenges arise because

Big Data involves several infrastructure layers for data processing, new types of infrastructure to handle the enormous flow of data, and the encryption of large data sets, which does not scale easily. Further, a data breach may have more severe

consequences when enormous datasets are stored. Consider, for example, the value of a large dataset of identifiable information (or confidential information, for that matter), which could make it a target of theft or ransom: the larger the dataset, the more likely it is to be targeted for misuse. Once unauthorized disclosure takes place,

the impact on privacy will be far greater, because the information is centralized and

contains more data elements. In extreme cases, unauthorized disclosure of personal

information could put public safety at risk.

Outsourcing Big Data analytics and managing data accountability are other

issues that arise when handling identiﬁable datasets. This is especially true in

a Big Data context, since organizations with large amounts of data may lack

the ability to perform analytics themselves and will outsource this analysis and

reporting (Fogarty and Bell 2014). There is also a growing presence of data

brokers involved in collecting information, including personal information, from

a wide variety of sources other than the individual, for the purpose of reselling

such information to their customers for various purposes, including verifying an

individual’s identity, differentiating records, marketing products, and preventing

ﬁnancial fraud (FTC 2012). Data governance becomes a sine qua non for the

enterprise and the stakeholders within the Big Data ecosystem.


2.3.1 Big Data: The Antithesis of Data Minimization

To begin, the basis of Big Data is the antithesis of a fundamental privacy principle: data minimization. The principle of data minimization, or the limitation

principle (Gürses et al. 2011) is intended to ensure that no more personal informa-

tion is collected and stored than what is necessary to fulﬁl clearly deﬁned purposes.

This approach follows through the full data lifecycle, where personal data must be

deleted when it is no longer necessary for the original purpose. The challenge to this

is that Big Data entails a new way of looking at data, where data is assigned value in

itself. In other words, the value of the data is linked to its future and potential uses.

In moving from data minimization to what may be termed data maximization

or Big Data, the challenge to privacy is the risk of creating automatic data linkages

between seemingly non-identiﬁable data which, on its own, may not be sensitive, but

when compiled, may generate a sensitive result. These linkages can result in a broad

portrait of an individual including revelations of a sensitive nature—a portrait once

inconceivable since the identiﬁers were separated in various databases. Through the

use of Big Data tools, we also know that it is possible to identify patterns which may

predict people’s dispositions, for example related to health, political viewpoints or

sexual orientation (Cavoukian and Jonas 2012).

By connecting key pieces of data that link people to things, the capability of data

analytics can render ordinary data into information about an identiﬁable individual

and reveal details about a person’s lifestyle and habits. A telephone number or postal

code, for example, can be combined with other data to identify the location of a

person’s home and work; an IP or email address can be used to identify consumer

habits and social networks.
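As a toy illustration of how such linkages arise, the sketch below joins two small tables that share only quasi-identifiers such as postal code and birth year; the datasets and column names are hypothetical, invented purely for illustration.

```python
import pandas as pd

# "De-identified" usage records: no names, but quasi-identifiers remain.
usage = pd.DataFrame({
    "postal_code":    ["M5V 2T6", "M4B 1B3"],
    "birth_year":     [1982, 1990],
    "night_location": ["43.64,-79.39", "43.70,-79.31"],  # inferred "home" area
})

# A separate, public-style directory keyed by the same quasi-identifiers.
directory = pd.DataFrame({
    "name":        ["A. Smith", "B. Jones"],
    "postal_code": ["M5V 2T6", "M4B 1B3"],
    "birth_year":  [1982, 1990],
})

# A simple join on the shared quasi-identifiers re-attaches identities to
# records that contained no direct identifiers on their own.
linked = usage.merge(directory, on=["postal_code", "birth_year"])
print(linked[["name", "night_location"]])
```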

An important trend and contribution to Big Data is the movement by government

institutions to open up their data holdings in an effort to enhance citizen participation

in government and at the same time spark innovation and new insights through

access to invaluable government data (Cavoukian 2009).²

With this potential for Big Data to create data linkages being so powerful, the

term “super” data or “super” content has been introduced (Cameron 2013). “Super”

data is more powerful than other data in a Big Data context, because the use of

one piece of “super” data, which on its own would not normally reveal much, can

spark new data linkages that grow exponentially until the individual is identiﬁed.

Each new transaction in a Big Data system would compound this effect and spread

identiﬁability like a contagion.

Indeed, to illustrate the signiﬁcant implications of data maximization on privacy

we need only look at the shock of the Snowden revelations and the eventual reper-

cussions. A top EU court decision in 2015 declared the longstanding Safe Harbor

²There are many government Open Data initiatives, such as the U.S. Government's Open Data at www.data.gov; the Canadian Government's Open Data at http://open.canada.ca/en/open-data; UN Data at http://data.un.org/; and the EU Open Data Portal at https://data.europa.eu/euodp/en/data/. This is just a sample of the many Open Data sources around the world.


data transfer agreement between Europe and the U.S. invalid (Lomas 2015). The

issues had everything to do with concerns about not just government surveillance

but the relationship with U.S. businesses and their privacy practices. Eventually, a new

agreement was introduced known as the EU-U.S. Privacy Shield (US DOC 2016)

(EU Commission 2016). This new mechanism introduces greater transparency

requirements for the commercial sector on their privacy practices among a number

of other elements including U.S. authorities afﬁrming that collection of information

for intelligence is focussed and targeted.

The authors strongly believe that an important lesson learned for Big Data suc-

cess is that when the individual participant is more directly involved in information

collection, the accuracy of the information’s context grows and invariably increases

the quality of the data under analysis. Another observation, that may seem to be

contradictory, is that even in Big Data scenarios where algorithms are tasked with

ﬁnding connections within vast datasets, data minimization is not only essential

for safeguarding personally identiﬁable information—it could help with ﬁnding the

needle without the haystack by reducing extraneous irrelevant data.

2.3.2 Predictive Analysis: Correlation Versus Causation

Use of correlation analysis may yield completely incorrect results for individuals.

Correlation is often mistaken for causality (Ritter 2014). If the analyses show that

individuals who like X have an eighty per cent probability rating of being exposed

to Y, it is impossible to conclude that this will occur in 100 per cent of the cases.

Thus, discrimination on the basis of statistical analysis may become a privacy issue

(Sweeney 2013). A development where more and more decisions in society are

based on use of algorithms may result in a “Dictatorship of Data” (Cukier and Mayer-Schonberger 2013), where we are no longer judged on the basis of our actual

actions, but on the basis of what the data indicate will be our probable actions.

In a survey undertaken by the Annenberg Public Policy Center, the researchers

found that most Americans overwhelmingly consider forms of price discrimination

and behavioral targeting ethically wrong (Turow et al. 2015). Not only are these

approaches based on profiling individuals, but they also use personal information about an

individual for purposes the individual is unaware of. The openness of data sources

and the power of not just data mining but now predictive analysis and other complex

algorithms also present a challenge to the process of de-identiﬁcation. The risks of

re-identiﬁcation are more apparent, requiring more sophisticated de-identiﬁcation

techniques (El Emam et al. 2011). In addition, while the concept of “nudging”

is gaining popularity, using identiﬁable data for proﬁling individuals to analyse,

predict, and inﬂuence human behaviour may be perceived as invasive and unjustiﬁed

surveillance.

Data determinism and discrimination are also concerns that arise from a Dic-

tatorship of Data. Extensive use of automated decisions and prediction analyses

may actually result in adverse consequences for individuals. Algorithms are not

neutral, but reﬂect choices, among others, about data, connections, inferences,


interpretations, and thresholds for inclusion that advance a specific purpose. The

concern is that Big Data may consolidate existing prejudices and stereotyping, as

well as reinforce social exclusion and stratiﬁcation (Tene and Polonetsky 2013;

IWGDPT 2014; FTC 2016). This is said to have implications for the quality of Big Data analysis because of “echo chambers”³ in the collection phase (Singer 2011;

Quattrociocchi et al. 2016).

2.3.3 Lack of Transparency/Accountability

As an individual’s personal information spreads throughout the Big Data ecosystem

amongst numerous players, it is easy to see that the individual will have less control

over what may be happening to the data. This secondary use of data raises privacy

concerns. A primary purpose is identiﬁed at the time of collection of personal

information. Secondary uses are generally permitted with that person’s consent,

unless otherwise permitted by law. Using personal information in Big Data analytics

may not be permitted under the terms of the original consent as it may constitute a

secondary use—unless consent to the secondary use is obtained from the individual.

This characteristic is often linked with a lack of transparency. Whether deliberate or

inadvertent, lack of openness and transparency on how data is compiled and used is contrary to a fundamental privacy principle.

It is clear that organizations participating in the Big Data ecosystem need to

have a strong privacy program in place (responsible information management). If

individuals don’t have conﬁdence that their personal information is being managed

properly in Big Data applications, then their trust will be eroded and they may with-

draw or ﬁnd alternative mechanisms to protect their identity and privacy. The con-

sequences of a privacy breach can include reputational harm, legal action, damage to a company's brand, regulatory sanctions, and disruption to internal operations.

In more severe cases, it could cause the demise of an organization (Solove 2014).

According to TRUSTe’s Consumer Privacy Conﬁdence Index 2016, 92 per cent of

individuals worry about their privacy online, 44 per cent do not trust companies with

their personal information, and 89 per cent avoid doing business with companies that

they believe do not protect their privacy (TRUSTe/NCSA 2016).

Despite the fact that privacy and security risks may exist, organizations should

not fear pursuing innovation through data analytics. Through the application of

privacy controls and use of appropriate privacy tools, privacy risks may be mitigated,

thereby enabling organizations to capitalize on the transformative potential of Big

Data—while adequately safeguarding personal information. This is the central

³In news media, an echo chamber is a metaphorical description of a situation in which information, ideas, or beliefs are amplified or reinforced by transmission and repetition inside an “enclosed” system, where different or competing views are censored, disallowed, or otherwise underrepresented.

The term is by analogy with an acoustic echo chamber, where sounds reverberate.


motivation for Privacy by Design, which is aimed at preventing privacy violations

from arising in the ﬁrst place. Given the necessity of establishing user trust in

order to gain public acceptance of its technologies, any organization seeking to take

advantage of Big Data must apply the Privacy by Design framework as new products

and applications are developed, marketed, and deployed.

2.4 Privacy by Design and the 7 Foundational Principles

The premise of Privacy by Design has at its roots the Fair Information Practices, or FIPs. Indeed, most privacy laws around the world are based on these practices. By

way of history, the Code of Fair Information Practices (FIPs) was developed in the

1970s and based on essentially ﬁve principles (EPIC n.d.):

1. There must be no personal data record-keeping systems whose very existence is

secret.

2. There must be a way for a person to ﬁnd out what information about the person

is in a record and how it is used.

3. There must be a way for a person to prevent information about the person that was

obtained for one purpose from being used or made available for other purposes

without the person’s consent.

4. There must be a way for a person to correct or amend a record of identiﬁable

information about the person.

5. Any organization creating, maintaining, using, or disseminating records of

identiﬁable personal data must assure the reliability of the data for their intended

use and must take precautions to prevent misuses of the data.

FIPs represented an important development in the evolution of data privacy since

they provided an essential starting point for responsible information management

practices. However, many organizations began to view enabling privacy via FIPs

and associated laws as regulatory burdens that inhibited innovation. This zero-sum

mindset viewed the task of protecting personal information as a “balancing act”

of competing business and privacy requirements. This balancing approach tended

to overemphasize the signiﬁcance of notice and choice as the primary method

for addressing the management of personal information. As technologies developed,

the possibility for individuals to meaningfully exert control over their personal

information became more and more difﬁcult. It became increasingly clear that FIPs

were a necessary but not a sufﬁcient condition for protecting privacy. Accordingly,

the attention of privacy protection had begun to shift from reactive compliance with

FIPs to proactive system design.

With advances in technologies, it became increasingly apparent that systems

needed to be complemented by a set of norms that reﬂect broader privacy dimen-

sions (Damiani 2013). The current challenges to privacy relate to the dynamic relationship among the forces of innovation, competition and the global adoption of information and communications technologies. These challenges have been


mirrored in security by design. Just as users rely on security engineers to ensure the

adequacy of encryption key lengths, for example, data subjects will rely on privacy

engineers to appropriately embed risk-based controls within systems and processes.

Given the complex and rapid nature of these developments, it becomes apparent that

privacy has to become the default mode of design and operation.

Privacy by Design (PbD) is a globally recognized, proactive approach to privacy.

It is a framework developed in the late 1990s by co-author Dr. Ann Cavoukian

(Cavoukian 2011). Privacy by Design is a response to compliance-based approaches

to privacy protection that tend to focus on addressing privacy breaches after-the-fact.

Our view is that this reactive approach does not adequately meet the demands of the

Big Data era. Instead, we recommend that organizations consciously and proactively

incorporate privacy strategies into their operations, by building privacy protections

into their technology, business strategies, and operational processes.

By taking a proactive approach to privacy and making privacy the default setting,

PbD can have a wide-ranging impact across an organization. The approach can

result in changes to governance structures, operational and strategic objectives, roles

and accountabilities, policies, information systems and data ﬂows, decision-making

processes, relationships with stakeholders, and even the organization’s culture.

PbD has been endorsed by many public- and private-sector authorities in the

United States, the European Union, and elsewhere (Harris 2015). In 2010, PbD

was unanimously passed as a framework for privacy protection by the International

Assembly of Privacy Commissioners and Data Protection Authorities (CNW 2010).

This approach transforms consumer privacy issues from a pure policy or compliance

issue into a business imperative. Since getting privacy right has become a critical

success factor to any organization that deals with personal information, taking

an approach that is principled and technology-neutral is now more relevant than

ever. Privacy is best interwoven proactively, and to achieve this, privacy principles should be introduced early on—during architecture planning, system design, and the development of operational procedures. Privacy by Design, where possible, should be rooted in actual code, with defaults aligning both privacy and business

imperatives.

The business case for privacy focuses on gaining and maintaining customer trust,

breeding loyalty, and generating repeat business. The value proposition typically

reﬂects the following:

1. Consumer trust drives successful customer relationship management (CRM) and

lifetime value—in other words, business revenues;

2. Broken trust will result in a loss of market share and revenue, translating into less

return business and lower stock value; and

3. Consumer trust hinges critically on the strength and credibility of an organiza-

tion’s data privacy policies and practices.

In a marketplace where organizations are banding together to offer suites of

goods and services, trust is clearly essential. Of course, trust is not simply an end-

user issue. Companies that have done the work to gain the trust of their customers

cannot risk losing it as a result of another organization’s poor business practices.


2.4.1 The 7 Foundational Principles

Privacy by Design Foundational Principles build upon universal FIPPs in a way

that updates and adapts them to modern information management needs and

requirements. By emphasizing proactive leadership and goal-setting, systematic

and veriﬁable implementation methods, and demonstrable positive-sum results, the

principles are designed to reconcile the need for robust data protection and an orga-

nization’s desire to unlock the potential of data-driven innovation. Implementing

PbD means focusing on, and living up to, the following 7 Foundational Principles,

which form the essence of PbD (Cavoukian 2011).

Principle 1: Use proactive rather than reactive measures, anticipate and prevent

privacy invasive events before they happen (Proactive not Reactive; Preventative not

Remedial).

Principle 2: Personal data must be automatically protected in any given IT system

or business practice. If an individual does nothing, their privacy still remains intact

(Privacy as the Default). Data minimization is also a default position for privacy, i.e.

the concept of always starting with the minimum personal data possible and then

justifying additional collection, disclosure, retention, and use on an exceptional and

speciﬁc data-by-data basis.

Principle 3: Privacy must be embedded into the design and architecture of IT

systems and business practices. It is not bolted on as an add-on, after the fact. Privacy

is integral to the system, without diminishing functionality (Privacy Embedded into

Design).

Principle 4: All legitimate interests and objectives are accommodated in a

positive-sum manner (Full Functionality—Positive-Sum [win/win], not Zero-Sum

[win/lose]).

Principle 5: Security is applied throughout the entire lifecycle of the data

involved—data is securely retained, and then securely destroyed at the end of the

process, in a timely fashion (End-to-End Security—Full Lifecycle Protection).

Principle 6: All stakeholders are assured that whatever the business practice or

technology involved, it is in fact, operating according to the stated promises and

objectives, subject to independent veriﬁcation; transparency is key (Visibility and

Transparency—Keep it Open).

Principle 7: Architects and operators must keep the interests of the individual

uppermost by offering such measures as strong privacy defaults, appropriate notice,

and empowering user-friendly options (Respect for User Privacy—Keep it User-

Centric).


2.5 Big Data Applications: Guidance on Applying the PbD

Framework and Principles

While the 7 Foundational Principles of PbD should be applied in a holistic manner

as a broad framework, there are speciﬁc principles worthy of pointing out because

they are what deﬁnes and distinguishes this approach to privacy. These are principles

1 (Proactive and Preventative), 2 (By Default/Data Minimization), 3 (Embedded

in Design) and 4 (Positive-sum). Although the two examples provided below are

speciﬁc to mobile apps, they are illustrative of the Privacy by Design approach to

being proactive, focussing on data minimization and embedding privacy by default.

2.5.1 Being Proactive About Privacy Through Prevention

Privacy by Design aspires to the highest global standards of practical privacy and

data protection possible and to go beyond compliance and achieve visible evidence

and recognition of leadership, regardless of jurisdiction. Good privacy doesn’t just

happen by itself—it requires proactive and continuous goal-setting at the earliest

stages. Global leadership in data protection begins with explicit recognition of the

beneﬁts and value of adopting strong privacy practices, early and consistently (e.g.,

preventing data breaches or harms to individuals from occurring in the ﬁrst place).

Your app’s main purpose is to display maps. These maps are downloaded by a

mobile device from your central server. They are then later used on the device,

when there may be no network connection available. You realise that analytics

would be useful to see which maps are being downloaded by which users.

This in turn would allow you to make targeted suggestions to individual users

about which other maps they might want to download. You consider using

the following to identify individuals who download the maps: i) the device’s

IMEI number; ii) the MAC address of the device’s wireless network interface;

and iii) the mobile phone number used by the device. You realise that any of

those identiﬁers may constitute personal data, so for simplicity you decide

not to take on the responsibility of dealing with them yourself. Instead, you

decide to gain users’ consent for the map suggestions feature. When a user

consents, they are assigned a randomly generated unique identiﬁer, solely for

use by your app. (Excerpted from Information Commissioner’s Ofﬁce ICO

2013)
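A minimal sketch of the consent-gated identifier described in the excerpt follows, assuming a simple in-app consent flag and an app-local store; the function names and the upload stub are our own illustrative assumptions, not part of the ICO guidance.

```python
import uuid

_analytics_id = None  # app-specific identifier; never the IMEI, MAC address or phone number


def send_to_server(event: dict) -> None:
    """Placeholder for the app's upload call (hypothetical transport)."""
    print("upload:", event)


def enable_map_suggestions(user_consented: bool):
    """Assign a random, app-only identifier, but only after explicit consent."""
    global _analytics_id
    if not user_consented:
        _analytics_id = None                 # no consent: no identifier, no analytics
        return None
    if _analytics_id is None:
        _analytics_id = str(uuid.uuid4())    # random value, unlinkable to device identifiers
    return _analytics_id


def report_map_download(map_name: str) -> None:
    """Report a download event only when a consent-based identifier exists."""
    if _analytics_id is not None:
        send_to_server({"user": _analytics_id, "map": map_name})


enable_map_suggestions(user_consented=True)
report_map_download("central_london")
```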


2.5.2 Data Minimization as the Default Through

De-identiﬁcation

Personal information that is not collected, retained, or disclosed is data that does

not need to be protected, managed, or accounted for. If the personal information

does not exist, then it cannot be accessed, altered, copied, enriched, shared, lost,

hacked, or otherwise used for secondary and unauthorized purposes. Privacy by

Design is premised on the idea that the starting point for designing information

technologies and systems should always be maximally privacy-enhancing. The

default conﬁguration or settings of technologies, tools, platforms, or services offered

to individuals should be as restrictive as possible regarding use of personally

identiﬁable data.

When Big Data analytics involves the use of personally identiﬁable information,

data minimization has the biggest impact on managing data privacy risks, by

effectively eliminating risk at the earliest stage of the information life cycle.

Designing Big Data analytical systems at the front end with no collection of

personally identifiable information, unless and until a specific and compelling

purpose is deﬁned, is the ideal. For example, use(s) of personal information should

be limited to the intended, primary purpose(s) of collection and only extended to

other, non-consistent uses with the explicit consent of the individual (Article 29

Data Protection Working Party 2013). In other cases, organizations may ﬁnd that

summary or aggregate data may be more than sufﬁcient for their needs.

Your app uses GPS location services to recommend interesting activities near

to where the user is. The database of suggested activities is kept on a central

server under your control. One of your design goals is to keep the amount of

data your app downloads from the central server to a minimum. You therefore

design your app so that each time you use it, it sends location data to the

central server so that only the nearest activities are downloaded. However,

you are also keen to use less privacy-intrusive data where possible. You design

your app so that, by default, the device itself works out where the nearest town

is and uses this location instead, avoiding the need to send exact GPS co-

ordinates of the user's location back to the central server. Users who want

results based on their accurate location can change the default behaviour.

(Excerpted from Information Commissioner’s Ofﬁce ICO 2013)
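The default behaviour in this excerpt can be sketched as follows, with the town-level lookup simplified to coordinate rounding (roughly 11 km at one decimal place); this simplification and the parameter names are our assumptions.

```python
def location_for_server(lat: float, lon: float, precise_opt_in: bool = False) -> dict:
    """Return the location payload sent to the activities server.

    By default, coordinates are coarsened on the device so that exact GPS
    fixes never leave the phone; exact values are sent only on explicit opt-in.
    """
    if precise_opt_in:
        return {"lat": lat, "lon": lon}                       # user explicitly chose accuracy
    return {"lat": round(lat, 1), "lon": round(lon, 1)}       # privacy-preserving default


# Example: the default request reveals only the approximate area.
print(location_for_server(43.6426, -79.3871))         # {'lat': 43.6, 'lon': -79.4}
print(location_for_server(43.6426, -79.3871, True))   # exact coordinates, only on opt-in
```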

De-identiﬁcation strategies are considered data minimization. De-identiﬁcation

provides for a set of tools or techniques to strip a dataset of all information that

could be used to identify an individual, either directly or indirectly, through linkages

to other datasets. The techniques involve deleting or masking “direct identiﬁers,”

such as names or social insurance numbers, and suppressing or generalizing indirect

identiﬁers, such as postal codes or birthdates. Indirect identiﬁers may not be


personally identifying in and of themselves, but when linked to other datasets that

contain direct identiﬁers, may personally identify individuals. If done properly,

de-identiﬁed data can be used for research purposes and data analysis—thus

contributing new insights and achieving innovative goals—while minimizing the

risk of disclosure of the identities of the individuals behind the data (Cavoukian and

El Emam 2014).
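A small sketch of these two steps, deleting direct identifiers and generalizing indirect ones, is shown below using pandas; the column names and generalization levels are illustrative assumptions rather than a prescribed protocol.

```python
import pandas as pd

records = pd.DataFrame({
    "name":        ["A. Smith", "B. Jones"],        # direct identifier
    "sin":         ["123-456-789", "987-654-321"],  # direct identifier
    "postal_code": ["M5V 2T6", "M4B 1B3"],          # indirect identifier
    "birth_date":  ["1982-03-14", "1990-11-02"],    # indirect identifier
    "diagnosis":   ["asthma", "diabetes"],          # analysis variable
})


def deidentify(df: pd.DataFrame) -> pd.DataFrame:
    out = df.drop(columns=["name", "sin"])                           # delete direct identifiers
    out["postal_code"] = out["postal_code"].str[:3]                  # generalize to a coarse area
    out["birth_year"] = pd.to_datetime(out["birth_date"]).dt.year    # generalize date to year
    return out.drop(columns=["birth_date"])


print(deidentify(records))
```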

This is not to suggest, of course, that data should be collected exclusively in

instances where it may become useful or that data collected for one purpose may be

repurposed at will. Rather, in a big data world, the principle of data minimization

should be interpreted differently, requiring organizations to de-identify data when

possible, implement reasonable security measures, and limit uses of data to those

that are acceptable from not only an individual but also a societal perspective (Tene

and Polonetsky 2013).

2.5.3 Embedding Privacy at the Design Stage

When privacy commitments and data protection controls are embedded into tech-

nologies, operations, and information architectures in a holistic, integrative manner,

innovation and creativity are often by-products (Cavoukian et al. 2014a,b). By

holistic, we mean that broader contexts should always be considered for a proper

assessment of privacy risks and remedies. An integrative approach takes into con-

sideration all stakeholder interests as part of the development dialogue. Sometimes,

having to re-look at alternatives because existing solutions are unacceptable from

a privacy perspective spurs innovative and creative thinking. Embedding privacy

and data protection requires taking a systematic, principled approach—one that

not only relies on accepted standards and process frameworks, but that can stand

up to external reviews and audits. All of the 7 Foundational Principles should be

applied with equal rigour, at every step in design and operation. By doing so, the

privacy impacts of the resulting technology, process, or information architecture,

and their uses, should be demonstrably minimized, and not easily degraded through

use, misconﬁguration, or error. To minimize concerns of untoward data usage,

organizations should disclose the logic underlying their decision-making processes

to the extent possible without compromising their trade secrets or intellectual

property rights.

The concept of “user-centricity” may evoke contradictory meanings in networked

or online environments. Through a privacy lens, it contemplates a right of control by

an individual over his or her personal information when online, usually with the help

of technology. For most system designers, it describes a system built with individual

users in mind that may perhaps incorporate users’ privacy interests, risks and needs.

The ﬁrst may be considered libertarian (informational self-determination), the other,

paternalistic. Privacy by Design embraces both. It acknowledges that technologies,

processes and infrastructures must be designed not just for individual users, but

also structured by them. Users are rarely, if ever, involved in every design decision


or transaction involving their personal information, but they are nonetheless in an

unprecedented position today to exercise a measure of meaningful control over

those designs and transactions, as well as the disposition and use of their personal

information by others.

User interface designers know that the human-computer interface can often make or break an application. Function (substance) is important, but the way in which that function is delivered is equally important. This type of design embeds an

effective user privacy experience. As a quid pro quo for looser data collection

and minimization restrictions, organizations should be prepared to share the

wealth created by individuals’ data with those individuals. This means providing

individuals with access to their data in a “usable” format and allowing them to take

advantage of third party applications to analyze their own data and draw useful

conclusions (e.g., consume less protein, go on a skiing vacation, invest in bonds)

(Tene and Polonetsky 2013).

2.5.4 Aspire for Positive-Sum Without Diminishing

Functionality

In Big Data scenarios, networks are more complex and sophisticated, thereby

undermining the dominant “client-server” transaction model because individuals

are often far removed from the client side of the data processing equation. How

could privacy be assured when the collection, disclosure, and use of personal

information might not even involve the individual at all? Inevitably, a zero-sum

paradigm prevails where more of one good (e.g., public security, fraud detection,

operational control) cancels out another good (individual privacy, freedom). The

authors challenge the premise that privacy and data protection necessarily have to

be ceded in order to gain public, personal, or information security beneﬁts from

Big Data. The opposite of zero-sum is positive-sum, where multiple goals may be

achieved concurrently.

Many security technologies and information systems could be designed (or

redesigned) to be effective while minimizing or even eliminating their privacy-

invasive features. This is the positive-sum paradigm. We need only look to the work

of researchers in the area of privacy preserving data mining (Lindell and Pinkas

2002). In some cases, however, this requires broadening the scope of application

from only information communication technologies (ICTs) to include the “soft”

legal, policy, procedural, and other organizational controls and operating contexts

in which privacy might be embedded.

De-identiﬁcation tools and techniques are gaining popularity and there are

several commercially available products. Nonetheless, research into de-identification continues (El Emam 2013a, b). Some emerging research-level

technologies hold much promise for enabling privacy and utility of Big Data analy-

sis to co-exist. Two of these technologies are differential privacy and synthetic data.


Differential privacy is an approach that injects random noise into the results

of dataset queries to provide a mathematical guarantee that the presence of any

one individual in the dataset will be masked—thus protecting the privacy of each

individual in the dataset. Typical implementations of differential privacy work by

creating a query interface or “curator” that stands between the dataset’s personal

information and those wanting access to it. An algorithm evaluates the privacy risks

of the queries. The software determines the level of “noise” to introduce into the

analysis results before releasing them. The distortion that is introduced is usually small

enough that it does not affect the quality of the answers in any meaningful way—yet

it is sufﬁcient to protect the identities of the individuals in the dataset (Dwork 2014).
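A minimal sketch of this curator pattern for a single counting query is given below, using the standard Laplace mechanism with sensitivity 1; the dataset, predicate and epsilon value are placeholders.

```python
import numpy as np

rng = np.random.default_rng()


def private_count(dataset, predicate, epsilon: float) -> float:
    """Answer 'how many records satisfy predicate?' with differential privacy.

    Adding or removing one individual changes the true count by at most 1
    (sensitivity = 1), so Laplace noise with scale 1/epsilon masks any single
    person's presence in the dataset.
    """
    true_count = sum(1 for record in dataset if predicate(record))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise


# Example query through the "curator": how many records have age >= 65?
ages = [23, 67, 45, 70, 81, 34]
print(private_count(ages, lambda a: a >= 65, epsilon=0.5))
```

Smaller values of epsilon add more noise and therefore give a stronger privacy guarantee at the cost of accuracy.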

At an administrative level, researchers are not given access to the dataset to

analyze themselves when applying differential privacy. Not surprisingly, this limits

the kinds of questions researchers can ask. Given this limitation, some researchers

are exploring the potential of creating “synthetic” datasets for researchers’ use. As

long as the number of individuals in the dataset is sufﬁciently large in comparison

to the number of ﬁelds or dimensions, it is possible to generate a synthetic dataset

comprised entirely of “ﬁctional” individuals or altered identities that retain the

statistical properties of the original dataset—while delivering differential privacy’s

mathematical “noise” guarantee (Blum et al. 2008). While it is possible to generate

such synthetic datasets, the computational effort required to do so is usually

extremely high. However, there have been important developments into making the

generation of differentially private synthetic datasets more efﬁcient and research

continues to show progress (Thaler et al. 2010).
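As a toy sketch of the general idea, the perturbed-histogram approach below adds Laplace noise to a joint contingency table of the original records and then samples "fictional" individuals from the noisy table; it illustrates the concept only and is not the specific construction of Blum et al. (2008) or Thaler et al. (2010).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)


def synthetic_from_histogram(df: pd.DataFrame, epsilon: float, n_out: int) -> pd.DataFrame:
    """Sample 'fictional' records from a Laplace-noised joint histogram of df."""
    counts = df.value_counts()                        # joint histogram over all columns
    noisy = counts.to_numpy() + rng.laplace(scale=1.0 / epsilon, size=len(counts))
    probs = np.clip(noisy, 0, None)
    probs = probs / probs.sum()                       # noisy histogram -> sampling distribution
    picks = rng.choice(len(counts), size=n_out, p=probs)
    return pd.DataFrame(list(counts.index[picks]), columns=df.columns)


original = pd.DataFrame({
    "age_band": ["18-34", "35-64", "65+", "35-64", "18-34", "65+"],
    "smoker":   ["no", "yes", "no", "no", "yes", "no"],
})
print(synthetic_from_histogram(original, epsilon=1.0, n_out=6))
```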

2.6 Conclusion

There are privacy and security risks and challenges that organizations will face

in the pursuit of Big Data nirvana. While a signiﬁcant portion of this vast digital

universe is not of a personal nature, there are inherent privacy and security risks

that cannot be overlooked. Make no mistake, organizations must seriously consider

not just the use of Big Data but also the implications of a failure to fully realize

the potential of Big Data. Big Data and Big Data analysis promise new insights and benefits such as medical/scientific discoveries, new and innovative economic drivers, and predictive solutions to otherwise unknown, complex societal problems.

Misuses and abuses of personal data diminish informational self-determination,

cause harms, and erode the conﬁdence and trust needed for innovative economic

growth and prosperity. By examining success stories and approaches such as Privacy

by Design, the takeaway should be practical strategies to address the question of

‘How do we achieve the value of Big Data and still respect consumer privacy?’

Above all, Privacy by Design requires architects and operators to keep the interests

of the individual uppermost by offering such measures as strong privacy defaults,

appropriate notice, and empowering user-friendly options. Keep it user-centric!


References

Article 29 Data Protection Working Party (2013). Opinion 03/2013 on purpose limitation. http://ec.europa.eu/justice/data-protection/index_en.htm. Accessed 2 August 2016.

Blum, A., Ligett, K., Roth, A. (2008). A learning theory approach to non-interactive database

privacy. In Proceedings of the 40th ACM SIGACT Symposium on Theory of Computing

(pp. 609–618).

Cameron, K. (2013). Afterword. In M. Hildebrandt et al. (Eds.), Digital Enlightenment Yearbook

2013. Amsterdam: IOS Press.

Cavoukian, A. (2009). Privacy and government 2.0: the implications of an open world. http://www.ontla.on.ca/library/repository/mon/23006/293152.pdf. Accessed 22 November 2016.
Cavoukian, A. (2011). Privacy by Design: The 7 Foundational Principles. Ontario: IPC.
Cavoukian, A. (2013a). A Primer on Metadata: Separating Fact from Fiction. Ontario: IPC. http://www.ipc.on.ca/images/Resources/metadata.pdf.

Cavoukian, A. (2013b). Privacy by design: leadership, methods, and results. In S. Gutwirth, R.

Leenes, P. de Hert, & Y. Poullet (Eds.), Chapter in European Data Protection: Coming of Age

(pp. 175–202). Dordrecht: Springer Science & Business Media Dordrecht.

Cavoukian, A., & Cameron, K. (2011). Wi-Fi Positioning Systems: Beware of Unintended Consequences: Issues Involving Unforeseen Uses of Pre-Existing Architecture. Ontario: IPC.
Cavoukian, A., & El Emam, K. (2014). De-identification Protocols: Essential for Protecting Privacy. Ontario: IPC.
Cavoukian, A., & Jonas, J. (2012). Privacy by Design in the Age of Big Data. Ontario: IPC.
Cavoukian, A., Bansal, N., & Koudas, N. (2014a). Building Privacy into Mobile Location Analytics (MLA) through Privacy by Design. Ontario: IPC.
Cavoukian, A., Dix, A., & El Emam, K. (2014b). The Unintended Consequences of Privacy Paternalism. Ontario: IPC.

Clarke, R. (2000). Beyond OECD guidelines; privacy protection for the 21st century. Xamax

Consultancy Pty Ltd. http://www.rogerclarke.com/DV/PP21C.html. Accessed 22 November

2016.

CNW (2010). Landmark resolution passed to preserve the future of privacy. Press Release. Toronto,

ON, Canada. http://www.newswire.ca/news-releases/landmark-resolution-passed-to-preserve-

the-future-of-privacy-546018632.html. Accessed 22 November 2016.

Cukier, K., & Mayer-Schonberger, V. (2013). The dictatorship of data. MIT Technology Review.

https://www.technologyreview.com/s/514591/the-dictatorship-of-data/. Accessed 22 Novem-

ber 2016.

Damiani, M. L. (2013). Privacy enhancing techniques for the protection of mobility patterns in

LBS: research issues and trends. In S. Gutwirth, R. Leenes, P. de Hert, & Y. Poullet (Eds.),

Chapter in european data protection: coming of age (pp. 223–238). Dordrecht: Springer

Science & Business Media Dordrecht.

Department of Commerce (US DOC) (2016). EU-U.S. Privacy Shield fact sheet. Office of Public Affairs, US Department of Commerce. https://www.commerce.gov/news/fact-sheets/2016/02/eu-us-privacy-shield. Accessed 22 November 2016.

Dwork, C. (2014). Differential privacy: a cryptographic approach to private data analysis. In J.

Lane, V. Stodden, S. Bender, & H. Nissenbaum (Eds.), Privacy, big data, and the public good:

Frameworks for engagement. New York: Cambridge University Press.

El Emam, K. (2013a). Beneﬁting from big data while protecting privacy. In K. El Emam

(Ed.), Chapter in risky business: sharing health data while protecting privacy. Bloomington,

IN: Trafford Publishing.

El Emam, K. (2013b). In K. El Emam (Ed.), Who’s afraid of big data? chapter in risky business:

Sharing health data while protecting privacy. Bloomington, IN, USA: Trafford Publishing.


El Emam, K., Buckeridge, D., Tamblyn, R., Neisa, A., Jonker, E., & Verma, A. (2011). The

re-identiﬁcation risk of Canadians from longitudinal demographics. BMC Medical Informat-

ics and Decision Making,11:46. http://bmcmedinformdecismak.biomedcentral.com/articles/

10.1186/1472-6947-11-46. Accessed 22 November 2016.

EPIC (n.d.). Website: https://epic.org/privacy/consumer/code_fair_info.html. Accessed 22 Novem-

ber 2016.

EU Commission (2012). Fact sheet on the right to be forgotten. http://ec.europa.eu/justice/data-protection/files/factsheets/factsheet_data_protection_en.pdf. Accessed 22 November 2016.
EU Commission (2015). Fact sheet—questions and answers—data protection reform. Brussels. http://europa.eu/rapid/press-release_MEMO-15-6385_en.htm. Accessed 4 November 2016.
EU Commission (2016). The EU data protection reform and big data factsheet. http://ec.europa.eu/justice/data-protection/files/data-protection-big-data_factsheet_web_en.pdf. Accessed 22 November 2016.

Fogarty, D., & Bell, P. C. (2014). Should you outsource analytics? MIT Sloan Management Review,

55(2), Winter.

FTC (2012). Protecting consumer privacy in an era of rapid change: Recommendations for businesses and policymakers. https://www.ftc.gov/sites/default/files/documents/reports/federal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations/120326privacyreport.pdf. Accessed August 2016.
FTC (2016). Big data: A tool for inclusion or exclusion? Understanding the issues. https://www.ftc.gov/system/files/documents/reports/big-data-tool-inclusion-or-exclusion-understanding-issues/160106big-data-rpt.pdf. Accessed 23 November 2016.

Gürses, S. F., Troncoso, C., & Diaz, C. (2011). Engineering privacy by design. Computers, Privacy & Data Protection. http://www.cosic.esat.kuleuven.be/publications/article-1542.pdf. Accessed 19 November 2016.
Harris, M. (2015). Recap of Covington's privacy by design workshop. Inside Privacy: Updates on Developments in Data Privacy and Cybersecurity. Covington & Burling LLP, U.S. https://www.insideprivacy.com/united-states/recap-of-covingtons-privacy-by-design-workshop/. Accessed 19 November 2016.

HHS (2012). Guidance regarding methods for de-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. http://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html. Accessed 2 August 2016.
Information Commissioner's Office (ICO) (2013). Privacy in Mobile Apps: Guide for app developers. https://ico.org.uk/media/for-organisations/documents/1596/privacy-in-mobile-apps-dp-guidance.pdf. Accessed 22 November 2016.

International Working Group on Data Protection in Telecommunications (IWGDPT) (2004). Common position on privacy and location information in mobile communications services. https://datenschutz-berlin.de/content/europa-international/international-working-group-on-data-protection-in-telecommunications-iwgdpt/working-papers-and-common-positions-adopted-by-the-working-group. Accessed 22 November 2016.


International Working Group on Data Protection in Telecommunications (IWGDPT) (2014). Working Paper on Big Data and Privacy: Privacy principles under pressure in the age of Big Data analytics. 55th Meeting. https://datenschutz-berlin.de/content/europa-international/international-working-group-on-data-protection-in-telecommunications-iwgdpt/working-papers-and-common-positions-adopted-by-the-working-group. Accessed 22 November 2016.

Lane, J., et al. (2014). Privacy, big data and the public good: frameworks for engagement.

Cambridge: Cambridge University Press.

Lindell, Y., & Pinkas, B. (2002). Privacy preserving data mining. Journal of Cryptology, 15,

177–206. International Association for Cryptologic Research.

Lomas, N. (2015). Europe's top court strikes down Safe Harbor data-transfer agreement with U.S. TechCrunch. https://techcrunch.com/2015/10/06/europes-top-court-strikes-down-safe-harbor-data-transfer-agreement-with-u-s/. Accessed 22 November 2016.

Mayer, J., Mutchler, P., & Mitchell, J. C. (2016). Evaluating the privacy properties of telephone

metadata. Proceedings of the National Academies of Science, U S A, 113(20), 5536–5541.

NIST. (2010). Guide to protecting the conﬁdentiality of personally identiﬁable information (PII).

NIST special publication 800–122. Gaithersburg, MD: Computer Science Division.

NIST (2015). De-identiﬁcation of Personal Information. NISTR 8053. This publication is available

free of charge from: http://dx.doi.org/10.6028/NIST.IR.8053. Accessed 19 November 2016.

Official Journal of the European Union (2016). Regulation (EU) 2016/679 of the European Parliament and of the Council. http://ec.europa.eu/justice/data-protection/reform/files/regulation_oj_en.pdf. Accessed 19 November 2016.

Quattrociocchi, W., Scala, A., & Sunstein, C. R. (2016). Echo Chambers on Facebook. Preliminary draft, not yet published. Available at: http://ssrn.com/abstract=2795110. Accessed 19 November 2016.

Ritter, D. (2014). When to act on a correlation, and when not to. Harvard Business Review. https://hbr.org/2014/03/when-to-act-on-a-correlation-and-when-not-to. Accessed 19 November 2016.
Singer, N. (2011). The trouble with the echo chamber online. New York Times online. http://www.nytimes.com/2011/05/29/technology/29stream.html?_r=0. Accessed 19 November 2016.

Solove, D. J. (2007). 'I've got nothing to hide' and other misunderstandings of privacy. San Diego Law Review, 44, 745.
Solove, D. (2014). Why did InBloom die? A hard lesson about education privacy. Privacy + Security Blog. TeachPrivacy. https://www.teachprivacy.com/inbloom-die-hard-lesson-education-privacy/. Accessed 4 August 2016.
Sweeney, L. (2013). Discrimination in online ad delivery. http://dataprivacylab.org/projects/onlineads/1071-1.pdf. Accessed 22 November 2016.

Tene, O., & Polonetsky, J. (2013). Big data for all: Privacy and user control in the age of analytics.

New Journal of Technology and Intellectual Property, 11(5), 239–272.

Thaler, J., Ullman, J., & Vadhan, S. (2010). PCPs and the hardness of generating synthetic data.

Electronic Colloquium on Computational Complexity, Technical Report, TR10–TR07.

TRUSTe/NCSA (2016). Consumer privacy infographic—US Edition. https://www.truste.com/resources/privacy-research/ncsa-consumer-privacy-index-us/. Accessed 4 November 2016.
Turow, J., Feldman, L., & Meltzer, K. (2015). Open to exploitation: American shoppers online and offline. A report from the Annenberg Public Policy Center of the University of Pennsylvania. http://www.annenbergpublicpolicycenter.org/open-to-exploitation-american-shoppers-online-and-offline/. Accessed 22 November 2016.
United Nations General Assembly (2016). Resolution adopted by the General Assembly. The right to privacy in the digital age (68/167). http://www.un.org/ga/search/view_doc.asp?symbol=A/RES/68/167. Accessed 4 November 2016.

Zhang, Y., Chen, Q., & Zhong, S. (2016). Privacy-preserving data aggregation in mobile phone

sensing. Information Forensics and Security IEEE Transactions on, 11, 980–992.

Chapter 3

Privacy Preserving Federated Big Data Analysis

Wenrui Dai, Shuang Wang, Hongkai Xiong, and Xiaoqian Jiang

3.1 Introduction

With the introduction of electronic health records (EHRs), massive patient data have

been involved in biomedical research to study the impact of various factors on

disease and mortality. Large clinical data networks have been developed to facilitate

analysis and improve treatment of diseases by collecting healthcare data from

a variety of organizations, including healthcare providers, government agencies,

research institutions and insurance companies. The National Patient-Centered Clin-

ical Research Network, PCORnet (n.d.), facilitates clinical effectiveness research

to provide decision support for prevention, diagnosis and treatment with the data

gathered nationwide. PopMedNet (n.d.) enables the distributed analyses of EHR

held by different organizations without requiring a central repository to collect

data. HMORNnet (Brown et al. 2012) combines PopMedNet platform to provide a

shared infrastructure for distributed querying to allow data sharing between multiple

HMO Research Network projects. Integrating PopMedNet, ESPnet achieves disease


surveillance by collecting and analyzing EHRs owned by different organizations in

a distributed fashion.

Although data sharing can beneﬁt both biomedical discovery and public health,

it would also pose risks for disclosure of sensitive information and consequent

breach of individual privacy. Leakage of demographic, diagnostic, phenotypic and

genotypic information would lead to unexpected implications like discrimination by

employers and health insurance companies. To protect the individually identiﬁable

health information, Health Insurance Portability and Accountability Act (HIPAA)

(n.d.) was enacted in the United States, in which the security and privacy of pro-

tected health information (PHI) are guaranteed under the standards and regulations

speciﬁed by HIPAA Privacy Rule. It deﬁnes two methods, Expert Determination and

Safe Harbor (Lafky 2010), to meet with the de-identiﬁcation standard. In practice,

the Safe Harbor method is widely adopted, where speciﬁc information should be

removed and suppressed according to a predeﬁned checklist. However, these de-

identiﬁcation standards deﬁned by HIPAA Privacy Rule do not provide adequate

privacy protection for healthcare data, as argued by McGraw (2008). Taking

advantage of publicly available background information about an individual, it is

possible to infer sensitive information like predisposition to disease and surnames

from de-identiﬁed data. Homer et al. (2008) utilized aggregated allele frequencies

in genome-wide association studies (GWAS) to re-identify individual patients in

a case group based on the reference population from the International HapMap

Project. Wang et al. (2009) extended Homer’s attack with two models to identify

patients from a smaller subset of published statistics or under limited precision

and availability of statistics. Sweeney et al. (2013) showed that most (84–97%)

patients could be exactly identiﬁed by linking their proﬁles in the Personal Genome

Project (PGP) with publicly available records like voter lists. Gymrek et al. (2013)

inferred the surnames from personal genome data sets by proﬁling Y-chromosome

haplotypes based on recreational genetic genealogy databases, which are public and

online accessible. For Healthcare Cost and Utilization Project (HCUP), Vaidya et al.

(2013) demonstrated the vulnerability of its querying system by making query infer-

ence attacks to infer patient-level information based on multiple correlated queries.

Privacy concerns have presented a challenge to efﬁcient collaborative prediction

and analysis for biomedical research that needs data sharing. Patients would be

unwilling to provide their data to research projects or participate in treatments under insufficient privacy protection. For data custodians, data utility might be degraded

to lower the potential privacy risk, as they take responsibility for the security and

conﬁdentiality of their data. Due to institutional policies and legislation, it is not

viable to explicitly transfer patient-level data to a centralized repository or share

them among various institutions in many scenarios. For example, without speciﬁc

institutional approval, the U.S. Department of Veterans Affairs requires all patient

data to remain in its server. Naive procedure for exchanging patient-level data

would also be restricted in international cross-institutional collaboration. The Data

Protection Act (1998) in the UK, in line with the EU Data Protection Directive,

prohibits transferring clinical data outside the European Economic Area, unless the

protection for data security is sufficiently guaranteed. It is therefore necessary to


develop models and algorithms for cross-institutional collaboration with sufﬁcient

protection of patient privacy and full compliance with institutional policies.

Federated data analysis has been developed as an alternative for cross-

institutional collaboration, which proposes to exchange aggregated statistics instead

of patient-level data. It enables a variety of privacy-preserving distributed algorithms

that facilitate computation and analysis with a guarantee of prediction accuracy

and protection of privacy and security of patient-level data. These algorithms are

commonly designed to perform regression, classiﬁcation and evaluation over data

with different types of partition, i.e. horizontally partitioned data (Kantarcioglu

2008) and vertically partitioned data (Vaidya 2008), respectively. Horizontally

partitioned data are composed of data of different patients with the same clinical

variables. Thus, multiple institutions share the same types of attributes for all

the patients in the federated database. Horizontally partitioned data would be

suitable for collaborative analysis and computation over patients from organizations

in different geographical areas, especially for studies of diseases that require a

large number of examples (patient-level data). On the other hand, for vertically

partitioned data, each institution owns a portion of clinical variables for the same

patients. Attributes from all the institutions are collected and aligned as a federated

database for computation and analysis. In fact, vertically partitioned data would

facilitate collaboration among organizations owning different types of patient-

level data. For example, the PCORnet clinical data research network (CDRN),

pSCANNER (Ohno-Machado et al. 2014), allows distributed analysis over data of

31 million patients that are vertically partitioned across Centers for Medicare and

Medicaid Services, Department of Veteran Affairs, insurance companies, and health

systems. The studies of diseases can be jointly performed based on the diagnostic

information from medical centers, demographic and ﬁnancial data from insurance

companies and genomic data from laboratories. Figure 3.1 provides an illustrative

example for horizontally and vertically partitioned data, respectively.
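The two partitioning schemes can be sketched on a toy matrix of N records and M covariates, mirroring Fig. 3.1; the array sizes below are arbitrary illustrative choices.

```python
import numpy as np

N, M, K = 6, 4, 3                      # records, covariates, institutions
X = np.arange(N * M).reshape(N, M)     # toy federated dataset

# Horizontal partitioning: each institution holds different patients (rows)
# but the same covariates (columns).
horizontal = np.array_split(X, K, axis=0)   # K blocks of shape (N/K, M)

# Vertical partitioning: each institution holds different covariates (columns)
# for the same patients (rows).
vertical = np.array_split(X, K, axis=1)     # K blocks of shape (N, ~M/K)

print([blk.shape for blk in horizontal])    # [(2, 4), (2, 4), (2, 4)]
print([blk.shape for blk in vertical])      # [(6, 2), (6, 1), (6, 1)]
```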

In this chapter, we review the privacy-preserving federated data analysis algo-

rithms for large-scale distributed data, especially biomedical data. To collaborate

on distributed data analysis in a privacy-preserving fashion, institutions might not

be able to explicitly share their patient-level data. Privacy-preserving federated

data analysis algorithms aim to establish global models for analysis and prediction

based on non-sensitive local statistics, e.g., intermediary results for Hessian matrix

and kernel matrix, instead of explicitly transferring sensitive patient-level data to

a central repository. Over the past few decades, a series of federated analysis

algorithms have been developed for regression, classiﬁcation and evaluation of

distributed data with a protection of data privacy and security. Enlightened by the

principle of sharing models without sharing data, federated data modeling techniques

have been proposed to securely derive global model parameters and perform

statistical tests over horizontally and vertically partitioned data. In comparison

to centralized realizations, federated modeling techniques achieved equivalent

accuracy in model parameter estimation and statistical tests with no exchange of

patient-level data. Server/client and decentralized architectures have been developed

to realize federated data analysis in a distributed and privacy-preserving fashion.


Fig. 3.1 Illustrative examples for horizontally and vertically partitioned data in federated data analysis. N records with M covariates are distributed across K institutions. (a) Horizontally partitioned data: the K institutions have different patients sharing the same type of covariates. (b) Vertically partitioned data: covariates from the same patients are distributed across all the K institutions.

For regression tasks, distributed optimization was efﬁciently achieved using the

Newton-Raphson method. To further improve distributed optimization in federated

models, alternating direction method of multipliers (ADMM) was integrated to

formulate decomposable minimization problems with additional auxiliary variables.

Inheriting the convergence properties of methods of Lagrangian multipliers, it

is robust for a variety of distributed analyses under horizontal and vertical data


partitioning. Recognizing that communication between the server and clients was

not protected, secure multiparty computation (SMC) protocols were adopted for

stronger data security and privacy. Secure protocols were widely considered for

distributed analysis like regression, classiﬁcation and evaluation. The intermediary

results from multiple institutions were aggregated with secure summation, product

and reordering to support the server/client or decentralized architecture. To handle

real-world network conditions, we discussed the asynchronous optimization for

distributed optimization under server/client or decentralized architecture. To support

privacy-preserving data sharing and analysis, this chapter summarizes the relevant literature to present the state-of-the-art algorithms and applications and to outline promising improvements and extensions along current trends.
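As a toy illustration of the secure summation mentioned above, the ring-style protocol sketched below lets an initiating site hide its contribution behind a random mask that only it can remove, so each site sees only a masked running total; this is a simplified illustration, not the specific SMC protocols used in the cited systems.

```python
import secrets


def secure_sum(local_values):
    """Ring-based secure summation over the sites' local intermediary results."""
    mask = secrets.randbelow(2**64)      # known only to the initiating site
    running = local_values[0] + mask     # site 1 hides its value behind the mask
    for v in local_values[1:]:           # each site adds its value and passes the total on
        running = running + v
    return running - mask                # initiator removes the mask: only the sum is revealed


# Three institutions contribute local counts without revealing them individually.
print(secure_sum([120, 87, 342]))        # 549
```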

The rest of the chapter is organized as follows. Section 3.2 overviews the

architecture and optimization for federated modeling analysis over horizontally and

vertically partitioned data. In Sect. 3.3, we review the applications in regression

and classiﬁcation based on the Newton-Raphson method and ADMM framework.

Section 3.4 integrates secure multiparty computation protocols with federated

data analysis for protection of intermediary results in distributed analysis and

computation. In Sect. 3.5, we present the asynchronous optimization for general

ﬁxed-point problem and speciﬁc coordinate gradient descent and ADMM-based

methods. Finally, Sect. 3.6 makes discussion and conclusion.

3.2 Federated Data Analysis: Architecture and Optimization

In this section, we overview the architectures and optimization methods for feder-

ated data analysis. Server/client and decentralized architectures are well established

in privacy-preserving analysis. Under these architectures, the Newton-Raphson

method and alternating direction method of multipliers (ADMM) framework are

leveraged for distributed computation.

3.2.1 Architecture

3.2.1.1 Server/Client Architecture

In the federated models, the server/client architecture has been established to

estimate global model parameters and perform statistical tests over horizontally and

vertically partitioned data, as shown in Fig. 3.2. Under this architecture, federated

data modeling shares models rather than patient-level information. The server

iteratively optimizes the global model parameters based on aggregated intermediary

results that are decomposable over the clients. Each client can utilize its local data

to separately calculate corresponding intermediary results. Subsequently, instead of

sharing the sensitive patient-level data, these intermediary results are exchanged for

secure computation and analysis. Taking maximum likelihood estimation (MLE)

of binary logistic regression for example, each institution calculates and exchanges


Fig. 3.2 Server/client architecture for federated data analysis. Each institution only exchanges

intermediary results for the estimation of global model parameters and statistical tests. Each

institution does not communicate with the others to prevent unexpected information leakage

the partial Hessian matrix derived from its local records for horizontally partitioned

data and partial kernel matrix of its local attributes for vertically partitioned data.

Thus, federated models only exchange aggregated intermediary results rather than collecting raw data in a central repository for parameter estimation. Moreover, clients will not collude to infer the raw data, as each client separately performs the computation. These facts imply that distributed optimization can be performed in a privacy-preserving fashion, as the raw data generally cannot be recovered

from the aggregated intermediary results. It should be noted that the accuracy

of parameter estimation could be guaranteed in federated models, as aggregated

intermediary results do not lose any information in comparison to the centralized

methods. Furthermore, the security of federated models can be further improved

by integrating secure protocols and encryption methods to protect the intermediary

results exchanged between the server and clients.

For logistic regression and multinomial regression models, federated models can

also support distributed statistical tests over horizontally partitioned data, including

goodness-of-ﬁt test and AUC score estimation (Chambless and Diao 2006). Besides

model parameters, variance-covariance matrix could be similarly obtained by aggre-

gating the decomposable intermediary results. Using global model parameters and

variance-covariance matrix, federated models were able to estimate the statistics of

logistic and multinomial regression, including conﬁdence intervals (CIs), standard

error, Z-test statistics and p-values. Furthermore, goodness-of-ﬁt test and AUC score

estimation can be achieved in a distributed manner. The Hosmer and Lemeshow

(H-L) test (Hosmer et al. 2013) is considered to check model ﬁtness, where each

institution only shares its number of records with positive patient outcomes per


Fig. 3.3 Decentralized architecture for federated data analysis. To estimate global model param-

eters and perform statistical tests, each institution exchanges intermediary results with its

neighboring institutions and updates its local model with received aggregated results

decile. Thus, patient-level information and estimated outcomes will not be exchanged, in order to preserve privacy. For computation of the AUC score, raw data and estimated outcomes of patients can be protected with peer-to-peer communication.
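As a hedged illustration of this distributed goodness-of-fit idea, the sketch below (function names are hypothetical, and it assumes all institutions bin patients into the same globally defined risk deciles and that scipy is available) shows how a server could assemble the H-L statistic from three length-10 vectors per institution: decile sizes, observed positive outcomes and summed predicted probabilities.

```python
# Hypothetical sketch: a distributed Hosmer-Lemeshow (H-L) test in which only
# per-decile aggregates leave each institution, never patient-level records.
import numpy as np
from scipy.stats import chi2

def hl_from_site_aggregates(per_site):
    """per_site: list of (sizes, observed, expected) length-10 arrays, one triple per institution."""
    sizes = sum(np.asarray(s) for s, _, _ in per_site)       # decile group sizes
    observed = sum(np.asarray(o) for _, o, _ in per_site)    # observed positives per decile
    expected = sum(np.asarray(e) for _, _, e in per_site)    # summed predicted probabilities
    stat = np.sum((observed - expected) ** 2 / (expected * (1.0 - expected / sizes)))
    return stat, chi2.sf(stat, df=len(sizes) - 2)            # 10 deciles -> 8 degrees of freedom
```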

3.2.1.2 Decentralized Architecture

Figure 3.3 illustrates the decentralized architecture for federated analysis over

horizontally and vertically partitioned data. Contrary to server/client architec-

tures, decentralized architectures do not require a central node (server) to collect

aggregated intermediary results from all the institutions and make global model

parameter estimation and statistical tests. Each institution only communicates with

its neighbors to exchange messages, e.g. institutions with linked health records.

To prevent leakage of patient-level information, institutions would exchange inter-

mediary results rather than raw data. At each iteration, each institution derives its

intermediary results from local data and the aggregated results from its neighbor-

ing institutions. Taking global consensus optimization under ADMM framework

(Boyd et al. 2011) for example, local model parameters are exchanged among

neighboring institutions for global consensus in applications like sparse linear

regression (Mateos et al. 2010), principal component analysis (PCA) (Schizas and

Aduroja 2015) and support vector machine (SVM) (Forero and Giannakis 2010).

Under such architecture, distributed optimization can be performed in a privacy-

preserving manner, as patient-level information would never be exchanged. It is

worth mentioning that communication cost would be reduced in the decentralized

architecture, as messages are only exchanged among neighboring institutions.
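The sketch below is purely illustrative of this neighbor-only message pattern (it is not any of the specific decentralized algorithms cited in this chapter): each institution repeatedly averages its current estimate with those of its direct neighbors, and for a connected topology the local estimates converge to a common consensus value without a central server or raw-data exchange.

```python
# Illustrative sketch of decentralized, neighbor-only averaging (gossip-style).
import numpy as np

def gossip_average(local_estimates, neighbors, num_rounds=50):
    """local_estimates: {institution: vector}; neighbors: {institution: [adjacent institutions]}."""
    x = {k: np.asarray(v, dtype=float) for k, v in local_estimates.items()}
    for _ in range(num_rounds):
        # each institution only sees messages from its neighbors, never raw data
        x = {k: np.mean([x[j] for j in neighbors[k]] + [x[k]], axis=0) for k in x}
    return x

# ring of 4 institutions, each starting from its own local parameter estimate;
# on this regular topology the consensus value equals the exact global average
estimates = {0: [1.0, 0.0], 1: [3.0, 2.0], 2: [5.0, 4.0], 3: [7.0, 6.0]}
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(gossip_average(estimates, ring))
```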


3.2.2 Distributed Optimization

3.2.2.1 The Newton-Raphson Method

For model parameter estimation, the Newton-Raphson method can be extended to perform distributed optimization of log-likelihood functions over multiple institutions. It is a powerful technique for finding numerical solutions to nonlinear algebraic equations via successive linear approximations. Given a twice-differentiable objective function f(x), the Newton-Raphson method iteratively constructs a sequence of positions converging towards the stationary point with gradient-like optimization. For vector inputs x, the gradient ∇f and Hessian matrix H(f) are computed from the first and second partial derivatives of f, respectively. At the t-th step, x is updated to maximize the log-likelihood function f:

$$x^{(t+1)} = x^{(t)} - \big[ H\big(f(x^{(t)})\big) \big]^{-1} \nabla f\big(x^{(t)}\big) \qquad (3.1)$$

In federated data modeling, to enable distributed optimization, the ﬁrst and

second partial derivatives are required to be decomposable over multiple institu-

tions. Thus, the gradient and Hessian matrix can be derived from the aggregated

intermediary results separately obtained from all the institutions. The intermediary

results would vary for different tasks and data partitionings. For example, each institution holds a portion of the records, A_k, for horizontally partitioned data. Consequently, the intermediary results exchanged in binary logistic regression are A_k^T Λ_k A_k with Λ_k = diag(π(A_k, x_k)(1 − π(A_k, x_k))) related to the logit function π, while they tend to depend on the set of records at risk for the Cox regression model. For vertically partitioned logistic regression, a Legendre transform is required for distributed dual optimization, where the kernel matrix A_k A_k^T is separately calculated based on the portion of attributes A_k held by the k-th institution.
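A minimal sketch of one such federated Newton-Raphson iteration for binary logistic regression over horizontally partitioned data is shown below (function names are hypothetical; this is not the exact GLORE implementation): each institution returns only its partial gradient A_k^T(y_k − π_k) and partial Hessian A_k^T Λ_k A_k, and the server aggregates them to apply the update in Eq. (3.1).

```python
# Sketch of a federated Newton-Raphson step for binary logistic regression over
# horizontally partitioned data; only aggregated statistics leave each site.
import numpy as np

def local_summaries(A_k, y_k, x):
    """Institution-side computation of the decomposable intermediary results."""
    p = 1.0 / (1.0 + np.exp(-A_k @ x))           # predicted probabilities pi_k
    grad_k = A_k.T @ (y_k - p)                   # partial gradient of the log-likelihood
    w = p * (1.0 - p)                            # diagonal of Lambda_k
    hess_k = A_k.T @ (A_k * w[:, None])          # A_k^T Lambda_k A_k
    return grad_k, hess_k

def server_newton_step(x, summaries):
    """Server-side aggregation and global update (Eq. 3.1, maximizing the log-likelihood)."""
    grad = sum(g for g, _ in summaries)
    hess = sum(h for _, h in summaries)
    return x + np.linalg.solve(hess, grad)

# toy run with K = 3 institutions holding different patients, same covariates
rng = np.random.default_rng(0)
sites = [(rng.normal(size=(50, 4)), rng.integers(0, 2, size=50)) for _ in range(3)]
x = np.zeros(4)
for _ in range(10):
    x = server_newton_step(x, [local_summaries(A, y, x) for A, y in sites])
print(x)
```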

The Newton-Raphson method can achieve faster convergence towards a local optimum than gradient descent when f is a valid objective function. Meanwhile, distributed optimization with the Newton-Raphson method can achieve equivalent accuracy in model parameter estimation compared to its centralized realization. However, it requires separable first and second partial derivatives over multiple institutions, which makes the Newton-Raphson method restrictive in some distributed optimization scenarios, e.g. distributed Cox regression. In Sect. 3.2.2.2, we introduce the alternating direction method of multipliers (ADMM) as a more broadly applicable distributed optimization framework.

3.2.2.2 Alternating Direction Method of Multipliers

Alternating direction method of multipliers (ADMM) (Boyd et al. 2011) is a variant of the augmented Lagrangian scheme that supports a decomposable dual-ascent solution for the method of Lagrangian multipliers. It develops a decomposition-

coordination procedure to decompose the large-scale optimization problem into a

set of small local subproblems. Inheriting the convergence properties of the method


of Lagrangian multipliers, ADMM is able to obtain the globally optimal solution.

In comparison to general equality-constrained minimization, ADMM introduces

auxiliary variables to split the objective function for decomposability. To be

concrete, ADMM solves the minimization problem of the summation of two convex

functions f(x) and g(z) under the constraint Ax + Bz = C. Thus, the augmented Lagrangian function for the optimal solution is formulated with the dual variable λ:

$$L_\rho(x, z, \lambda) = f(x) + g(z) + \lambda^{T}(Ax + Bz - C) + \frac{\rho}{2}\,\lVert Ax + Bz - C \rVert_2^2 \qquad (3.2)$$

where ρ is the positive Lagrangian parameter. The augmented Lagrangian function can be iteratively solved with three steps that partially update the primal variables x, the auxiliary variables z and the dual variables λ. In the t-th iteration, these three steps are conducted in a sequential manner.

1. x-minimization: partially update x by minimizing the augmented Lagrangian function with z and the dual variable λ fixed, i.e. $x^{(t+1)} = \arg\min_{x} L_\rho\big(x, z^{(t)}, \lambda^{(t)}\big)$;
2. z-minimization: partially update z by minimizing the augmented Lagrangian function with the updated x and fixed λ, i.e. $z^{(t+1)} = \arg\min_{z} L_\rho\big(x^{(t+1)}, z, \lambda^{(t)}\big)$; and
3. Dual update: update the dual variables with the updated x and z, i.e. $\lambda^{(t+1)} = \lambda^{(t)} + \rho\big(Ax^{(t+1)} + Bz^{(t+1)} - C\big)$.

Here, the dual variables are updated with the step size ρ at each iteration.

According to Steps 1–3, x and z are alternately updated in each iteration based on f(x) and g(z). This fact implies that the minimization problems over x and z can be separately solved for distributed data when f(x) or g(z) is decomposable. Since x and z can be derived from each other through the dual variables λ, they can be exchanged among multiple institutions for distributed optimization. Therefore, ADMM is desirable for privacy-preserving federated data analysis, where each institution solves a decomposable optimization problem based on its local data and submits intermediary results for a global solution.

Under the ADMM framework, the two generic optimization problems, consensus

and sharing, are formulated for distributed analysis over horizontally and vertically

partitioned data, respectively. For horizontally partitioned data, the consensus

problem splits the primal variables x and separately optimizes the decomposable cost function f(x) for all the institutions under the global consensus constraints. Considering that the submatrix A_k ∈ ℝ^{N_k×M} of A ∈ ℝ^{N×M} corresponds to the local data held by the k-th institution, the primal variables x_k ∈ ℝ^{M×1} for the K institutions are solved by

$$\min_{x,z} \sum_{k=1}^{K} f_k(A_k x_k) + g(z) \quad \text{s.t.} \quad x_k - z = 0, \; k = 1, \ldots, K. \qquad (3.3)$$


Here, f_k(A_k x_k) is the cost function for the k-th institution and g(z) commonly represents the regularization term of the optimization problem. According to the derived augmented Lagrangian function, x_k is independently solved based on the local data A_k and the corresponding dual variables λ_k, while z is updated by averaging x_k and λ_k. Thus, the global model can be established by optimizing over the K institutions under the global consensus constraints.
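A hedged sketch of this consensus formulation is given below (not a published implementation): for a distributed Lasso over horizontally partitioned data, each institution solves a ridge-like subproblem on its own block (A_k, y_k), only the local parameters x_k and scaled dual variables u_k leave the institution, and the consensus variable z is obtained by soft-thresholding their averages.

```python
# Sketch of consensus ADMM for a distributed Lasso (horizontally partitioned data).
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def consensus_lasso(sites, lam=0.1, rho=1.0, num_iters=100):
    K, M = len(sites), sites[0][0].shape[1]
    z = np.zeros(M)
    x = [np.zeros(M) for _ in range(K)]
    u = [np.zeros(M) for _ in range(K)]
    for _ in range(num_iters):
        for k, (A_k, y_k) in enumerate(sites):    # local x_k-updates inside each institution
            lhs = A_k.T @ A_k + rho * np.eye(M)
            rhs = A_k.T @ y_k + rho * (z - u[k])
            x[k] = np.linalg.solve(lhs, rhs)
        x_bar, u_bar = np.mean(x, axis=0), np.mean(u, axis=0)
        z = soft_threshold(x_bar + u_bar, lam / (rho * K))   # global consensus update
        for k in range(K):
            u[k] = u[k] + x[k] - z                           # dual updates
    return z

rng = np.random.default_rng(1)
true_x = np.array([1.5, 0.0, -2.0, 0.0])
sites = []
for _ in range(3):
    A = rng.normal(size=(40, 4))
    sites.append((A, A @ true_x + 0.01 * rng.normal(size=40)))
print(consensus_lasso(sites))
```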

The sharing problem is considered for vertically partitioned data, where A and x are vertically split into A_k ∈ ℝ^{N×M_k} and x_k ∈ ℝ^{M_k×1} for the K institutions. Auxiliary variables z_k ∈ ℝ^{N×1} are introduced for the k-th institution based on A_k and x_k. In such a case, the sharing problem is formulated based on the decomposable cost function f_k(x_k):

$$\min_{x,z} \sum_{k=1}^{K} f_k(x_k) + g\!\left(\sum_{k=1}^{K} z_k\right) \quad \text{s.t.} \quad A_k x_k - z_k = 0, \; k = 1, \ldots, K \qquad (3.4)$$

Under the ADMM framework, x_k and its dual variables λ_k can be separately solved, while z_k is derived from the aggregated results of A_k x_k and λ_k. This fact implies that each institution can locally optimize the decomposable cost function using its own portion of attributes and adjust the model parameters according to the auxiliary variables derived from the global optimization problem. It is worth mentioning that the global consensus constraints implicitly exist for the dual variables in the sharing problem.

3.3 Federated Data Analysis Applications

In this section, we review federated data analysis models for regression and

classiﬁcation based on the Newton-Raphson method and ADMM framework.

3.3.1 Applications Based on the Newton-Raphson Method

The Newton-Raphson method is widely adopted in federated data analysis for

generalized linear models, e.g. logistic regression, multinomial regression, and

Cox proportional hazard model, which are widely used in biomedicine. Table 3.1

summarizes the existing federated modeling techniques for distributed data with

their application scenarios, data partitioning, mechanisms for model parameter

estimation, statistical tests and communication protection.

Table 3.1 Federated modeling techniques for distributed data based on the Newton-Raphson method

| Method | Application scenario | Data partitioning | Parameter estimation | Statistical test | Communication protection |
| --- | --- | --- | --- | --- | --- |
| GLORE (Wu et al. 2012) | Logistic regression | Horizontal | MLE | H-L test, AUC score estimation | N/A |
| IPDLR (Wu et al. 2012) | Logistic regression | Horizontal | MLE | H-L test, AUC score estimation | Secure summation protocol |
| EXPLORER (Wang et al. 2013) | Logistic regression | Horizontal | MAP | N/A | SINE protocol |
| SMAC-GLORE (Shi et al. 2016) | Logistic regression | Horizontal | MLE | N/A | Garbled circuits |
| VERTIGO (Li et al. 2016) | Logistic regression | Vertical | MLE | N/A | N/A |
| Multi-category GLORE (Wu et al. 2015) | Multinomial regression | Horizontal | MLE | H-L test, AUC score estimation | N/A |
| HPPCox (Yu et al. 2008) | Cox regression | Horizontal | MLE | N/A | N/A |
| WebDISCO (Lu et al. 2014) | Cox regression | Horizontal | MLE | N/A | N/A |

An early federated data analysis paper in biomedical informatics introduced the Grid binary LOgistic REgression (GLORE) framework (Wu et al. 2012). In this work, a binary logistic regression model was developed for model parameter estimation and statistical tests over data horizontally distributed across multiple institutions in a privacy-preserving manner. For security and confidentiality, the proposed model shared models rather than patient-level information. Besides model parameter estimation, distributed algorithms were developed for the H-L test and AUC score estimation in GLORE. It was shown to achieve equivalent model parameter

estimation and statistical tests over simulated and clinical datasets in comparison to

centralized methods. WebGLORE (Jiang et al. 2013) provided a free web service

to implement the privacy-preserving architecture for GLORE, where AJAX, JAVA

Applet/Servlet and PHP technologies were seamlessly integrated for secure and

easy-to-use web service. Consequently, it can help biomedical researchers deploy a practical collaborative software framework in real-world clinical applications.

Inspired by GLORE, a series of extensions and improvements have been made

for various regression tasks with different privacy concerns. Despite shedding light

on federated data analysis, GLORE still suffered from two main limitations: privacy

protection of intermediary results and synchronization for iterative distributed opti-

mization. Wu et al. (2012) considered the institutional privacy for GLORE. During

iterative optimization, sensitive information of an institution would be leaked, as

its contribution to each matrix of coefﬁcients is known to the server. Therefore,

institutional privacy-preserving distributed binary logistic regression (IPDLR) was

developed to enhance the institutional privacy in GLORE by masking the ownership

of the intermediary results exchanged between the server and institutions. To keep all the institutions anonymous, a secure summation procedure was developed

to integrate all the intermediary results without identifying their ownership. At each

iteration, client-to-client communication was conducted to merge the intermediary

results on a client basis based on the random matrix assigned by the server. Thus,

the server would obtain the aggregated results without knowing the contribution

of each institution. The secure summation procedure was also employed in ROC

curve plotting to securely integrate local contingency tables derived in the insti-

tutions. Wang et al. (2013) proposed a Bayesian extension for GLORE, namely

EXpectation Propagation LOgistic REgRession (EXPLORER) model, to achieve

distributed privacy-preserving online learning. In comparison to frequentist logistic

regression model, EXPLORER made maximum a posteriori (MAP) estimation

using expectation propagation along the derived factor graph. The model parameters

were iteratively updated based on partial posterior function w.r.t. the records

held by each institution (intra-site update) and the messages passing among the

server and institutions (inter-site update). As a result, EXPLORER improved the

security and ﬂexibility for distributed model learning with similar discrimination

and model ﬁt performance. To reduce the information leakage from unprotected

intermediary results, EXPLORER exchanged the encrypted posterior distribution

of coefﬁcients rather than the intermediary results for model parameter estimation.

The sensitive information about individual patient would not be disclosed, as only

statistics like mean vector and covariance matrix were shared to represent the

aggregated information of the raw data. Moreover, secured intermediate information

exchange (SINE) protocol was adopted to further protect aggregation information.

To guarantee ﬂexibility, EXPLORER leveraged online learning to update the model

based on the newly added records. It also supported asynchronous communication

to avoid coordinating multiple institutions, so that it remains robust to offline institutions and interrupted communication. Shi et al.

(2016) developed a grid logistic regression framework based on secure multiparty


computation. In addition to raw data, the proposed SMAC-GLORE protected the

decomposable intermediary results based on garbled circuits during iterative model

learning. Secure matrix multiplication and summation protocols were presented for

maximum likelihood estimation using fixed-Hessian methods. For MLE, the Hessian matrix inversion problem was securely transformed into a recursive procedure of matrix multiplications and summations using the Strassen algorithm, while the exponential function was approximated with a Taylor series expansion.

Wu et al. (2015) extended GLORE to address multi-centric modeling of multi-

category response, where grid multi-category response models were developed

for ordinal and multinomial logistic regression over horizontally partitioned data.

Grid Newton method was proposed to make maximum likelihood estimation of

model parameters in a privacy-preserving fashion. At each iteration, each institution

separately calculated partial gradients and Hessian matrix based on its own data

and the server could integrate these intermediary results to derive the global model

parameters. Thus, the proposed models could reduce disclosure risk, as patient-level

data would not be moved outside the institutions. Furthermore, privacy-preserving

distributed algorithms were presented for grid model ﬁt assessment and AUC score

computation in ordinal and multinomial logistic regression models by extending

the corresponding algorithms for binary response models. The proposed models

were demonstrated to achieve the same accuracy with a guarantee of data privacy in

comparison to the corresponding centralized models.

Recently, Li et al. (2016) proposed a novel method that leveraged dual opti-

mization to solve binary logistic regression over vertically partitioned data. The

proposed vertical grid logistic regression (VERTIGO) derived the global solution

with aggregated intermediate results rather than the sensitive patient-level data.

In the server/client architecture, the server iteratively solved the dual problem of

binary logistic regression using the Newton-Raphson method. To compute the

Hessian matrix, each institution transmitted the kernel matrix of its local statistics

to the server for merging. Dot product kernel matrix was adopted to guarantee

that the global gram matrix was decomposable over multiple institutions. For

iterative optimization, it was only required to exchange the dot products of patient

records and dual parameters between the server and institutions. This fact implies

that the patient-level information would not be revealed, when the distribution of

covariates is not highly unbalanced. Employed on both synthetic and real datasets,

VERTIGO was shown to achieve equivalent accuracy for binary logistic regression

in comparison to its centralized counterpart.
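A minimal sketch of this kernel-aggregation idea follows (a simplified illustration rather than the full VERTIGO procedure): when A = [A_1 | ... | A_K] is split by columns across institutions, the global Gram matrix A A^T is exactly the sum of the local Gram matrices A_k A_k^T, so the server can assemble it for the dual Newton-Raphson solver without seeing individual covariate values.

```python
# Sketch: assembling the global Gram (kernel) matrix from local Gram matrices
# for vertically partitioned data, as used in dual logistic regression.
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(20, 9))                     # 20 shared patients, 9 covariates in total
site_blocks = np.split(A, [3, 6], axis=1)        # 3 institutions, 3 covariates each

local_grams = [A_k @ A_k.T for A_k in site_blocks]   # computed inside each institution
global_gram = sum(local_grams)                        # aggregated at the server

assert np.allclose(global_gram, A @ A.T)              # identical to the centralized Gram matrix
# The server would then run the Newton-Raphson method on the dual objective
# using global_gram in place of the raw covariates.
```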

Distributed survival analysis is one of the prevailing topics in biomedical

research, which studies the development of a symptom, disease, or mortality with

distributed time-to-event data. Cox proportional hazard model (Cox 1972) is widely

concerned in survival analysis, which evaluates the signiﬁcance of time-varying

covariates with a hazard function. Yu et al. (2008) proposed a privacy-preserving

Cox model for horizontally partitioned data, where afﬁne projections of patient

data in a lower dimensional space were shared to learn survival model. The

proposed HPPCox model utilized a rank-deﬁcient projection matrix to hide sensitive

information in raw data, as the lower dimensional projections were commonly

irreversible. The projection matrix was optimized to minimize the loss of information incurred by these lower dimensional projections while simultaneously maintaining the major structures (properties) and reducing the dimensionality of the input data. Thus, feature selection could be enabled to prevent overfitting in scenarios with limited training data. This model was shown to achieve nearly

optimal predictive performance for multi-centric survival analysis. O’Keefe et al.

(2012) presented explicit conﬁdentialisation measures for survival models to avoid

exchanging patient-level data in a remote analysis system, but did not consider

a distributed learning model. The work considered and compared confidentialised

outputs for non-parametric survival model with Kaplan-Meier estimates (Kaplan

and Meier 1958), semiparametric Cox proportional hazard model and parametric

survival model with Weibull distribution (Weibull 1951). The conﬁdentialised

outputs would beneﬁt model ﬁt assessment with similar model statistics and

significance in comparison to traditional methods. However, the work focused on securely generating survival outputs and did not perform distributed model learning.

Lu et al. (2014) proposed a web service WebDISCO for distributed Cox proportional

hazard model over horizontally partitioned data. The global Cox model was

established under the server/client architecture, where each institution separately

calculated its non-sensitive intermediary statistics for model parameter estimation

at the server. WebDISCO investigated the technical feasibility of employing federated

data analysis on survival data. The distributed Cox model was shown to be identical

in mathematical formulation and achieve an equivalent precision for model learning

in comparison to its centralized realization.

3.3.2 Applications Based on ADMM

In this subsection, we review the ADMM-based distributed algorithms for regres-

sion and classiﬁcation with a brief overview on the convergence analysis for

ADMM-based methods. Table 3.2 summarizes the ADMM-based algorithms for

horizontally and vertically partitioned data and most of them have not been applied

in the context of biomedical informatics.

Table 3.2 Federated modeling techniques for distributed data based on ADMM

| Reference | Application scenario | Data partitioning |
| --- | --- | --- |
| Boyd et al. (2011) | Logistic regression | Horizontal/Vertical |
| Boyd et al. (2011) | Lasso/Group Lasso | Vertical |
| Boyd et al. (2011) | Support vector machine | Vertical |
| Mateos et al. (2010) | Linear regression | Horizontal |
| Mateos and Schizas (2009) | Recursive least-squares | Horizontal |
| Mateos and Giannakis (2012) | Recursive least-squares | Horizontal |
| Forero and Giannakis (2010) | Support vector machine | Horizontal |
| Schizas and Aduroja (2015) | Principal component analysis | Horizontal |
| Scardapane et al. (2016) | Recurrent neural networks | Horizontal |


3.3.2.1 Regression

Boyd et al. (2011) summarized the ADMM-based distributed ℓ1-penalized logistic regression model for horizontally and vertically partitioned data, respectively. For

horizontally partitioned data, model parameters were separately solved for each

institution by minimizing an ℓ2-regularized log-likelihood function over local data. Subsequently, the auxiliary variable for global consensus was found by minimizing the combination of its ℓ1-norm and the squared difference between the auxiliary variable and the averaged primal and dual variables. It should be noted that the

auxiliary variable could be computed for each attribute in parallel for improved

efﬁciency. Each institution would not leak its sensitive information, as only its local

model parameters were exchanged for global consensus. When data are vertically

distributed across multiple institutions, a Lasso problem based on the local attributes

was formulated for each institution under the ADMM framework. Aggregating the

intermediary results from all the institutions, the auxiliary variables were derived

from the ℓ2-regularized logistic loss function. The dual variables were updated based on the aggregated intermediary results and averaged auxiliary variables and remained the same for all the institutions. The distributed logistic regression could be

performed in a privacy-preserving manner, as local data owned by each institution

would not be inferred from the aggregated intermediary results. In both cases, the ℓ2-regularized minimization based on log-sum-exp functions can be iteratively solved using the Newton-Raphson method or the L-BFGS algorithm.

Mateos et al. (2010) leveraged alternating direction method of multipliers

(ADMM) to make model parameter estimation in sparse linear regression. The

centralized Lasso model was transformed into a decomposable consensus-based

minimization problem for horizontally partitioned data. The derived minimization

problem can be iteratively solved with ADMM in a decentralized manner, where

each institution only communicates with its neighboring institutions to protect

data privacy. To balance computational complexity and convergence rate, three

iterative algorithms, DQP-Lasso, DCD-Lasso and D-Lasso, were developed based

on ADMM. DQP-Lasso introduced strictly convex quadratic terms to constrain the

model parameters and auxiliary variables for each institution and its neighbors in

augmented Lagrangian function. Thus, quadratic programming (QP) is adopted for

each institution to obtain its model parameters by minimizing the decomposable

Lagrangian function in a cyclic manner. To improve efﬁciency of iterative opti-

mization, DCD-Lasso introduced coordinate descent to simplify the QP process

for each institution. At each iteration, DCD-Lasso updated the model parameters

for each institution with coordinate descent rather than iteratively obtaining the

exact solution to the corresponding QP problem. Furthermore, D-Lasso enabled

parallelized computation of model parameters corresponding to multiple institutions

to enhance convergence speed. D-Lasso can achieve equivalent model estimation in

comparison to DCD-Lasso without additional relaxation terms. It is demonstrated that the model parameters locally derived by these algorithms converge to the global Lasso solution.


Similar to sparse linear regression, Mateos and Schizas (2009) adopted ADMM

to develop a distributed recursive least-squares (D-RLS) algorithm for time series

data horizontally distributed across multiple institutions. The proposed algorithm

reformulated the exponentially weighted least-squares estimation to a consensus-

based optimization problem by introducing auxiliary variables for corresponding

institutions. The reformulated minimization problem was decomposed into a series

of quadratic optimization problems that were recursively solved for each institution.

Furthermore, Mateos and Giannakis (2012) improved the efﬁciency of D-RLS by

avoiding explicit matrix inversion in recursively solving quadratic problems for each

institution. The proposed algorithms are demonstrated to be stable for time series

data with sufﬁcient temporal samples under the metric of means and mean squared

errors (MSE). These ADMM-based linear regression and least-squares estimation methods were validated on wireless sensor networks.

The sharing formulations have also been studied for Lasso and group Lasso

problems over vertically partitioned data. Similar to distributed logistic regression

model, sparse least-squares estimation was formulated for each institution to

independently obtain the model parameters corresponding to its own attributes.

In group Lasso, institutions would adopt various regularization parameters for the ℓ1-regularized problem. Since the auxiliary variables were updated analytically with a

linear combination of the aggregated intermediary results and dual variables, the

computational complexity mainly depended on the decomposable ℓ1-regularized

problem for multiple institutions. To improve the efﬁciency, the x-minimization for

certain institution could be skipped, when its attributes were not considered to be

involved in distributed optimization based on a threshold w.r.t. the regularization

parameters and Lagrangian multiplier.

3.3.2.2 Classiﬁcation

Forero and Giannakis (2010) leveraged ADMM to develop a distributed support

vector machine (DSVM) classiﬁer for training data horizontally distributed across

multiple institutions. Introducing auxiliary variables for the local model parameters

at each node, the linear SVM classiﬁer was decomposed into a series of convex

sub-problems over these auxiliary variables under the consensus constraints. For

each node, its local model parameters were independently derived from the corre-

sponding sub-problem. The decomposable optimization was iteratively performed

to obtain the unique optimal model parameters in a decentralized manner. Under the

ADMM framework, the global SVM classiﬁer was trained in a privacy-preserving

fashion, as each node exchanges its locally estimated model parameters rather than

the training data it owns. To handle sequential and asynchronous learning tasks, the

linear DSVM classifier supports online updates for time-varying training data. The classifier could be partially updated to adapt to cases where training samples were added to or removed from the training set. The ADMM-based distributed

optimization was also generalized to nonlinear SVM classiﬁers, where consensus

constraints on discriminant functions of local model parameters were shrunk to

a rank-deﬁcient subspace of the reproducing kernel Hilbert space. In analogy


to generalized linear regression, ADMM was also adopted for SVM classiﬁer

over vertically partitioned data. The centralized SVM classiﬁer was decomposed

into a series of quadratic programming problems to separately derive the model

parameters corresponding to the attributes owned by each institution. These model

parameters were iteratively adjusted with the auxiliary variables obtained by soft

thresholding over the aggregated intermediary results and averaged dual variables.

Therefore, each institution only needs to share the aggregated intermediary results

of its attributes for each patient’s record.

Schizas and Aduroja (2015) proposed a distributed principal component anal-

ysis (PCA) framework based on ADMM under the metric of mean-square error

(MSE). An equivalent constrained optimization problem was formulated for the

classical PCA framework, where each node used its local covariance matrix to separately estimate the principal eigenspace in a recursive manner. Coordinate descent was adopted in the ADMM framework to iteratively minimize the augmented

Lagrangian function under the consensus constraints. The proposed framework

enabled distributed dimensionality reduction over horizontally partitioned data in

a privacy-preserving fashion, as each node only exchanged aggregated intermediary

results with its neighboring nodes. Under sufﬁcient iterative optimization, the

estimated principal eigenspace is demonstrated to asymptotically converge to

the subspace spanned by the actual principal eigenvectors. For validation, the

proposed distributed PCA framework was employed in data denoising over wireless

sensor networks, where synthetic and practical evaluations showed it could achieve

enhanced convergence rate and improved performance for noise resilience.

Scardapane et al. (2016) adopted ADMM in the decentralized training of recur-

rent neural networks (RNNs) that optimized the global loss function over the data

horizontally distributed across all the nodes. The proposed distributed algorithm

was designed speciﬁcally for Echo State Networks. The decomposable optimization

under consensus constraints was formulated for each node to separately calculate

the model parameter based on the `2-regularized least-squares estimation. For each

node, its auxiliary variable for global consensus was obtained with a weighted

average of model parameters from its neighboring nodes. Thus, it was not required

to share the training data or the hidden matrix representing the current states of

neural networks among multiple nodes. Since the auxiliary variables were not calculated based on all the nodes in the network, the method did not need to rely on a server to perform global optimization. Evaluations over large-scale synthetic

data showed that the proposed distributed algorithm could achieve comparable

classiﬁcation performance in comparison to its centralized counterpart.

3.3.2.3 Convergence and Robustness for Decentralized Data Analysis

Ling et al. (2015) proposed a decentralized linearized ADMM (DLM) method

to reduce computational complexity and enhance convergence speed for stan-

dard ADMM. The proposed DLM method simpliﬁed the decomposable optimiza-

tion problems with linearized approximation. Thus, the computational cost for


implementation based on gradient descent could be signiﬁcantly reduced for appli-

cations like distributed logistic regression and least-squares estimation. It is demon-

strated that the DLM method converges to the optimal solution at a linear rate when the strong convexity and Lipschitz continuity conditions are satisfied for the decomposable cost functions and their gradients, respectively. Furthermore, Mokhtari

et al. (2016) considered quadratic approximation of ADMM-based formulation for

logistic regression model to improve the accuracy of model parameter estimation.

The proposed DQM method was shown to obtain more accurate estimates for model

parameters with a linear convergence rate in comparison to DLM.

Investigations have also been made to improve the robustness of ADMM

for various application scenarios. Wang and Banerjee (2012) introduced an online algorithm into ADMM to yield an enhanced convergence rate. Goldfarb et al.

(2012) developed a fast ﬁrst-order linearized augmented Lagrangian minimization

to accelerate the convergence of ADMM algorithm. Considering the insufﬁcient

accessibility of true data values, Ouyang et al. (2013) presented a stochastic ADMM

algorithm aiming to minimize non-smooth composite objective functions.

3.4 Secure Multiparty Computation

The previous sections introduced federated technologies reducing the privacy risks

through model decomposition so that we build global models only on locally

aggregated statistics (without accessing the patient-level data). To further mitigate

the privacy risk, we need to ensure the conﬁdentiality of transmitted summary

statistics in each institution as well. This can be achieved using Secure Multiparty

Computation (SMC), which ensures computation and communication security via advanced cryptographic protocols. Many state-of-the-art SMC schemes are based

upon the idea of translating an algorithm to a binary circuit. Scalable SMC protocols

like Yao’s garbled circuit (Yao 1982) could represent arbitrary functions with a

Boolean circuit to enable masking of inputs and outputs of each gate for sensitive

information. In federated data analysis, two architectures are widely considered to

integrate secure multiparty computation protocols, namely spoke-hub and peer-to-

peer architectures, as shown in Fig. 3.4. The spoke-hub architecture requires one or more non-colluding institutions to securely perform computation and analysis based on the data collected from the other institutions, while the peer-to-peer architecture

allows secure exchange of encrypted data in a decentralized manner. In this section,

we will introduce some secure multiparty computation protocols for regression,

classiﬁcation and evaluation tasks.
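Before turning to specific tasks, the toy sketch below illustrates one classic secure-summation idea that several of the protocols in this section build on; it is a didactic simplification (real protocols rely on secret sharing or homomorphic encryption and integer encodings of the statistics), not a production SMC implementation.

```python
# Toy ring-based secure summation: the initiator masks its integer-encoded
# statistic with a random offset, parties add their own values to the masked
# running total, and only the initiator can remove the mask at the end.
import secrets

MODULUS = 2 ** 64

def ring_secure_sum(values):
    mask = secrets.randbelow(MODULUS)            # known only to the initiating party
    running = (mask + values[0]) % MODULUS       # initiator adds its masked value
    for v in values[1:]:                         # every other party adds only its own value
        running = (running + v) % MODULUS
    return (running - mask) % MODULUS            # initiator removes the mask

print(ring_secure_sum([12, 7, 30, 1]))           # -> 50
```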

3.4.1 Regression

Fienberg et al. (2006) proposed privacy-preserving binary logistic regression for

horizontally partitioned data with categorical covariates. Secure summation and data

integration protocols were adopted to securely compute the maximum likelihood


Fig. 3.4 Illustration of architectures for federated data analysis using secure multiparty compu-

tation (SMC) protocols. (a) Spoke-Hub architecture. The non-colluding institution is required

to perform global computation and operation based on the encrypted data collected from the other

institutions; (b) Peer-to-peer architecture. All the institutions exchanged and aggregated encrypted

data with each other in a successive manner to achieve federated data analysis

and perform contingency table analysis for log-linear models. With categorical covariates, the logistic regression model could be constructed from the log-linear

models in a distributed and privacy-preserving fashion. Slavkovic and Nardi (2007)

presented a secure logistic regression approach on horizontally and vertically

partitioned data without actually combining them. Secure multiparty computation

protocols were developed for the Newton-Raphson method to solve binary logistic

regression with quantitative covariates. At each iteration, multiparty secure matrix

summation and product protocols were employed to compute the gradients and

Hessian matrix. Here, the inverse of Hessian was derived by recursively performing

secure matrix product over its sub-block matrices. The proposed protocol would

protect the privacy of each institution under the secure multiparty protocols, when the matrices of raw data all have a dimensionality greater than one. However,

Fienberg et al. (2009) showed that it would lead to serious privacy breach when

intermediary results are shared among multiple institutions in numerous iterations,

even though they can be protected by mechanism like encryption with random

shares. Nardi et al. (2012) developed a secure protocol to ﬁt logistic regression

over vertically partitioned data based on maximum likelihood estimation (MLE).

In comparison to previous works, it enhanced security by providing only the final results of the private computation, since the shared intermediate values could otherwise compromise the confidentiality of the raw data held by each institution. Two protocols were proposed for approximating the logistic function using

existing cryptographic primitives including secret sharing, secure summation and

product with random shares, secure interval membership evaluation and secure

matrix inversion. The ﬁrst protocol approximated the logistic function with the

summation of step functions. However, this protocol would be computationally

prohibitive for high-dimensional large-scale problems due to the secret sharing and


comparison operation over encrypted data. To relieve the computational burden, the

second protocol formulated the approximation by solving an ordinary differential

equation numerically integrated with Euler’s method. Thus, the logistic regression

was iteratively ﬁt with secure summations and products of approximation and

average derivatives of logistic function evaluated on the intermediate values.

The accuracy bound of the two approximation protocols is demonstrated to be related to the minimum eigenvalue of the Fisher information matrix and the step size of the approximation. However, these protocols would be prohibitive for real-world applications due to their high computational complexity and number of communication rounds.

Sanil et al. (2004) addressed privacy-preserving linear regression analysis over

vertically partitioned data. Quadratic optimization was formulated to derive the exact

coefﬁcients of global regression model over multiple institutions. Distributed

computation of regression coefﬁcients was achieved by implementing Powell’s

algorithm under the secure multiparty computation framework. At each iteration,

each institution updated its local regression coefﬁcients and its own subset of

search directions. Subsequently, a common vector was generated by aggregating the

products of local attributes and search directions using secure summation protocol

for computation in next iteration. Finally, the proposed algorithm obtained the global

coefﬁcients and the vector of residuals. Thus, global linear regression could be

made based on the entire dataset collected from all the institutions without actually

exchanging the raw data owned by each institution. Moreover, basic goodness-of-ﬁt

diagnostics could be performed with the coefﬁcient of determination that measured

the strength of the linear relationship. However, this algorithm would be unrealistic

due to the assumption that the institution holding the response attribute should share

it with the other institutions. Karr et al. (2009) proposed a protocol for secure linear

regression over data vertically distributed across multiple institutions. The proposed

protocol was able to estimate the regression coefﬁcients and related statistics like

standard error in a privacy-preserving manner. In distributed optimization, multiple

institutions collaborated to calculate the off-diagonal blocks of the global covariance

matrix using secure matrix products, while the diagonal blocks were obtained based

on the corresponding local attributes. The global covariance matrix was shared

among multiple institutions for secure linear regression and statistical analyses.

Moreover, model diagnostic measures based on residuals can be similarly derived

from the global covariance matrix using secure summation and matrix products

protocols. Remarkably, the proposed protocol could be generalized to a variety

of regression models, e.g. weighted least squares regression, stepwise regression

and ridge regression, under the constraint that sample means and co-variances are

sufﬁcient statistics.

3.4.2 Classiﬁcation

Naïve Bayes classiﬁer is an effective Bayes learning method with consistent and rea-

sonable performance, which is commonly adopted as a benchmark when evaluating classification methods. Vaidya and Clifton (2003b) developed a privacy-preserving


Naive Bayes classiﬁer for vertically partitioned data, where multiple institutions

collaborated to achieve classiﬁcation with random shares of the global model.

In the proposed Bayes classiﬁer, each institution is only required to share the

class of each instance, rather than the distribution of classes or attribute values.

Secure protocols were developed for training and classiﬁcation under the secure

multiparty computation framework. In training, random shares of the conditionally

independent probabilities for nominal and numeric attributes were computed for

model parameter estimation. To be concrete, the probability estimates for classes

of instances given the attributes were computed for shares of nominal attributes.

For numeric attributes, the mean and variance of the Normal probability density function are required. For each class and attribute value, the shares can

be obtained from the institutions owning them. For evaluation, each new instance

was classiﬁed by maximizing its posterior probability using Bayes theorem under

the assumption of conditionally independent attribute values. Here, the probabilities

of class conditioned on attributes are derived based on the model parameters.

It is demonstrated that these protocols for training and evaluation are able to securely compute shares of nominal and numeric attributes and to classify instances, respectively. Furthermore, Vaidya et al. (2008) introduced secure logarithm

primitives from secure multiparty computation for naïve Bayes classiﬁcation over

horizontally partitioned data. The model parameters for nominal and numeric

attributes were directly computed based on the local counts using secure sum

protocols, as each institution held all the attributes required for classifying an

instance. Therefore, classiﬁcation could be made locally for each institution.
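The sketch below (names are hypothetical) illustrates the counting idea for nominal attributes over horizontally partitioned data: each institution produces local class and class-attribute-value counts, a merge step stands in for the secure summation protocol, and the aggregated counts yield the naive Bayes parameters without any record leaving a site.

```python
# Sketch: naive Bayes parameters for nominal attributes from per-institution counts.
from collections import Counter

def local_counts(records):
    """records: list of (class_label, {attribute: value}) held by one institution."""
    class_counts, joint_counts = Counter(), Counter()
    for label, attrs in records:
        class_counts[label] += 1
        for attribute, value in attrs.items():
            joint_counts[(label, attribute, value)] += 1
    return class_counts, joint_counts

def merge_and_estimate(all_counts):
    """Stand-in for secure summation: merge counts and derive probability estimates."""
    class_counts, joint_counts = Counter(), Counter()
    for c, j in all_counts:
        class_counts += c
        joint_counts += j
    total = sum(class_counts.values())
    priors = {c: n / total for c, n in class_counts.items()}
    likelihoods = {key: n / class_counts[key[0]] for key, n in joint_counts.items()}
    return priors, likelihoods

site1 = [("flu", {"fever": "yes"}), ("cold", {"fever": "no"})]
site2 = [("flu", {"fever": "yes"}), ("flu", {"fever": "no"})]
print(merge_and_estimate([local_counts(site1), local_counts(site2)]))
```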

K-means clustering is popular for cluster analysis, which partitions observations

into the clusters with the nearest means. For data vertically distributed across

multiple institutions, privacy-preserving K-means clustering (Vaidya and Clifton

2003a) was studied to perform clustering without sharing raw data. Using the

proposed K-means clustering algorithm, each institution can obtain its projection

of the cluster means and learn the cluster assignment of each record without

revealing its exact attributes. Since the high-dimensional problem cannot simply be decomposed into a combination of lower dimensional problems for each institution,

cooperation between multiple institutions is required to learn the cluster that each

record belongs to. To achieve privacy-preserving clustering, a secure multiparty computation framework using homomorphic encryption is introduced for

multiple institutions. For common distance metrics like Euclidean and Manhattan,

the distances between each record and the means of K clusters can be split over

these institutions. For security, these distances were disguised with random values

from a uniform random distribution and non-colluding institutions were adopted to

compare the randomized distances and permute the comparison results. The secure

permutation algorithm is performed in an asymmetric two-party manner according

to the permutation owned by the non-colluding institution. It should be noted that the

initial values of the K means were assigned to their local shares for the institutions to

obtain a feasible solution. As a result, non-colluding institutions would only know

the selected cluster in the permutation, while the exact attributes owned by each

institution would not be disclosed to the others.
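The decomposition that makes this possible is easy to see in code; the sketch below shows only the non-secure core of the idea (in the actual protocol the partial distances would be masked with random values and compared by non-colluding institutions): the squared Euclidean distance from a record to each cluster mean splits into per-institution partial distances over their own attributes.

```python
# Sketch: splitting squared Euclidean distances across vertically partitioned attributes.
import numpy as np

def partial_distances(local_attrs, local_means):
    """Institution-side: squared distances restricted to this institution's attributes."""
    return np.sum((local_means - local_attrs) ** 2, axis=1)

rng = np.random.default_rng(3)
record = rng.normal(size=6)
means = rng.normal(size=(3, 6))                  # 3 cluster means over 6 attributes in total

d1 = partial_distances(record[:3], means[:, :3]) # attributes 0-2 held by institution 1
d2 = partial_distances(record[3:], means[:, 3:]) # attributes 3-5 held by institution 2
total = d1 + d2                                  # only these sums need to be compared securely

assert np.allclose(total, np.sum((means - record) ** 2, axis=1))
print(int(np.argmin(total)))                     # cluster assignment for this record
```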


Liang et al. (2013) developed a distributed PCA algorithm to estimate the global

covariance matrix for principal components. Each institution leveraged the standard PCA algorithm to determine its principal components over the local data. These local principal components were exchanged and collected to obtain the global covariance

matrix. The proposed algorithm integrated the distributed coreset-based clustering

to guarantee that the number of vectors for communication was independent of

size and dimension of the federated data. It is demonstrated that the divergence

between approximations on projected and original data for k-means clustering can

be upper bounded. Guo et al. (2013) developed a covariance-free iterative distributed

principal component analysis (CIDPCA) algorithm for vertically partitioned high-

dimensional data. Instead of approximating global PCA with sampled covariance

matrix, the proposed CIDPCA algorithm is designed to directly determine principal components by estimating their eigenvalues and eigenvectors. The first principal component, corresponding to the maximum eigenvalue of the covariance matrix,

was derived by maximizing the Rayleigh quotient using a gradient ascent method. The iterative method is demonstrated to converge at an exponential rate under arbitrary

initial values of principal components. Subsequently, the remaining principal

components could be iteratively calculated in the orthogonal complement of the

subspace spanned by the previously derived principal components. In comparison

to previous distributed PCA methods, it is shown to achieve higher accuracy

in estimating principal components and better classification performance with a significant reduction in communication cost. This conclusion is also validated with

a variety of studies over real-world datasets.
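As a simplified illustration of the general idea of exchanging only aggregated statistics for a global PCA (this is neither the coreset-based construction of Liang et al. nor the CIDPCA algorithm), the sketch below has each institution holding different records share only its record count, attribute sums and scatter matrix, from which a server can rebuild the global covariance matrix and extract principal components.

```python
# Simplified sketch: global PCA from locally aggregated moments (horizontal partitioning).
import numpy as np

def local_moments(X_k):
    """Institution-side summaries: record count, attribute sums, scatter matrix."""
    return len(X_k), X_k.sum(axis=0), X_k.T @ X_k

def global_pca(moments, num_components=2):
    n = sum(m[0] for m in moments)
    mean = sum(m[1] for m in moments) / n
    scatter = sum(m[2] for m in moments)
    cov = scatter / n - np.outer(mean, mean)        # global covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    return eigvecs[:, ::-1][:, :num_components]     # top principal components

rng = np.random.default_rng(4)
sites = [rng.normal(size=(100, 5)) for _ in range(3)]
print(global_pca([local_moments(X) for X in sites]).shape)   # (5, 2)
```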

Du et al. (2004) studied multivariate statistical analysis for vertically partitioned

data in the secure two-party computation framework, including secure two-party

multivariate linear regression (S2-MLR) and classiﬁcation (S2-MC). These prob-

lems were addressed in a privacy-preserving manner with secure two-party matrix

computation based on a set of basic protocols. A new security model was proposed

to lower security requirements for higher efﬁciency, where each institution was

allowed to reveal a part of information about its raw data under the guarantee that

the raw data would not be inferred from the disclosed information. To securely

perform matrix computation, building blocks for matrix product, matrix inverse and

matrix determinant were presented. Thus, S2-MLR and S2-MC could be securely

solved with these building blocks. It is demonstrated that, in the proposed two-party

multivariate statistical analysis, it would be impossible to infer the raw data owned

by each institution, when less than half of its disguised matrix is revealed.

Yu et al. (2006) proposed an efﬁcient privacy-preserving support vector machine

(SVM) classiﬁcation method, namely PP-SVMV, for vertically partitioned data. In

the proposed method, the global SVM model was constructed from local SVM

models rather than directly exchanging the local data. Thus, neither the local SVM models nor the corresponding local data of each institution were disclosed. For

linear kernels, the global kernel matrix is computed by directly merging gram

matrices from multiple institutions to solve the dual problem. This result can

also be extended to ordinary non-linear kernels that can be represented by dot

products of covariates, i.e. polynomial and radial basis function (RBF) kernels.


To guarantee data and model privacy, the merging of local models is performed with

secure summation of scalar integers and matrices. Experimental results demon-

strated the accuracy and scalability of PP-SVMV in comparison to the centralized

SVM over the original data. Similarly, Yu et al. (2006) presented a privacy-

preserving solution to support non-linear SVM classiﬁcation over horizontally

partitioned data. It required that the nonlinear kernel matrices could be directly

calculated based on the gram matrix. Thus, widely-used nonlinear kernel matrices

like polynomial and RBF kernels can be derived from the dot products of all data

pairs using the proposed solution. Secure set intersection cardinality was adopted

as an equivalency to these dot products based on the data horizontally distributed

across the institutions. Thus, commutative one-way hash functions were utilized to

securely obtain the set intersection cardinality. The proposed method was shown to

achieve equivalent classiﬁcation performance in comparison to the centralized SVM

classiﬁers. Mangasarian et al. (2008) constructed a reduced kernel matrix with the

original data and a random matrix to perform classiﬁcation under the protection of

local data. The random kernel based SVM classiﬁer could support both horizontally

and vertically partitioned data. Yunhong et al. (2009) proposed a privacy-preserving

SVM classiﬁer without using secure multiparty computation. The proposed SVM

classiﬁer built its kernel matrix by combining local gram matrices derived from

the original data owned by the corresponding institutions. Using matrix factorization theory, it is shown that local Gram matrices would not reveal the original data, as the factorization is not unique and each institution's covariates cannot be inferred from its local Gram matrix. The

proposed classiﬁcation algorithm is developed for SVM classiﬁer with linear and

nonlinear kernels, where the accuracy of distributed classiﬁcation is comparable

to the ordinary global SVM classiﬁer. Que et al. (2012) presented a distributed

privacy-preserving SVM (DPP-SVM), where a server/client collaborative learning framework is developed to securely estimate parameters of covariates based on

the aggregated local kernel matrices from multiple institutions. For security, all

the model operations are performed on the trusted server, including service layer

for server/client communication, task manager for data validation and computation

engine for parameter estimation.

Vaidya et al. (2008) presented a generalized privacy-preserving algorithm for

building ID3 decision tree over data vertically distributed across multiple insti-

tutions. For efﬁcient and secure classiﬁcation using the ID3 decision tree, the

proposed algorithm only revealed the basic structure of the tree and the speciﬁc

institution responsible for decision making at each node, rather than the exact values

of attributes. For each node in the tree, the basic structure includes its number of

branches and the depths of its subtrees, which represented the number of distinct

values for corresponding attributes. Thus, it is not necessary for each institution to introduce a complex cryptographic protocol at each possible level to securely classify an instance. It should be noted that the proposed algorithm only needs to assign the class attribute to one institution, but each interior node could learn the count of classes. Consequently, the institution owning the class attribute estimated the

distributions throughout the decision tree based on the derived transaction counts,

which would not disclose much new information. The distribution and majority class


was determined based on the set intersection cardinality protocol. Given multiple

institutions, classiﬁcation of an instance was conducted by exchanging the control

information based on the decision made for each node, but not the attribute values.

To further enhance the efficiency of the privacy-preserving ID3 algorithm, Vaidya et al. (2014) developed a random decision tree (RDT) framework to fit parallel and distributed architectures. Random decision trees are desirable for privacy-preserving distributed data mining, as they can achieve an effect equivalent to perturbation without diminishing the utility of information from data mining. For horizontally partitioned

data, the structure of RDTs was known to all the institutions that held the same

types of attributes. The RDTs were constructed by considering the accessibility of

the global class distribution vector for leaf nodes. Each institution could derive its

local distribution vector from its own data and submitted the encrypted versions for

aggregation. When the class distribution vectors were known to all the institutions

or the institution owning the RDTs, the aggregation could be directly made based

on homomorphically encrypted data. If the class distribution vectors were forced

to remain unrevealed, a secure electronic voting protocol was presented to make decisions based on the collected encrypted local vectors. For vertically partitioned

data, fully distributed trees with a speciﬁed total number were considered, so

that the sensitive attribute information for each institution was not revealed. Each

random tree was split among multiple institutions and constructed recursively using

the BuildTree procedure in a distributed fashion. It is worth mentioning that this

procedure does not require the transaction set. Subsequently, the statistics of each node were securely updated based on the training set from multiple institutions using

additively homomorphic encryption. Similarly, instance classiﬁcation was achieved

by averaging the estimated probabilities from multiple RDTs in a distributed

manner. The proposed RDT algorithm is secure, as neither the attribute values nor

the RDT structure is shared during RDT construction and instance classiﬁcation.

3.4.3 Evaluation

Sorting algorithms are essential for privacy-preserving distributed data analysis, including ranked-element queries, group-level aggregation and statistical tests. In the secure multiparty computation framework, an oblivious sorting algorithm can be implemented by hiding the propagation of values in the sorting network or by directly using a sorting algorithm as a basis. Bogdanov et al. (2014) investigated four different

oblivious sorting algorithms for vertically partitioned data, where two algorithms

improved the existing sorting network and quicksort algorithms and the other

two were developed to achieve a low round count for short vectors and low

communication cost for large inputs, respectively. For short vectors, a naive sorting

protocol NaiveCompSort was presented based on oblivious shufﬂing of input data

and vectorized comparison of shufﬂed data. Given large inputs, oblivious radix

sorting protocol was developed as an efﬁcient alternative. It leveraged binary count

sorting algorithm to rearrange the input integer vectors based on the sorted digits


in the same positions. Thus, the oblivious radix sorting protocol is efﬁcient, as it

does not require oblivious comparisons. Furthermore, optimization methods were

proposed to improve the efﬁciency of the oblivious sorting algorithms. For example,

bitwise shared representation and vectorization would allow data parallelization

to reduce the communication cost and complexity for SMC. For sorting network

structure, shufﬂing the inputs and re-using the generated network could optimize

its implementation, while uniqueness transformation for comparison-based sorting

protocols could avoid information leakage in the sortable vector. It should be noted

that these sorting algorithms could also be generalized to support matrix sorting. The

complexity and performance analysis for all the four sorting algorithms, including

detailed running-time, network and memory usage, was also presented.

Makri et al. (2014) proposed a privacy-preserving statistical verification for

clinical research based on the aggregated results from statistical computation. It

leveraged secure multiparty computation primitives to perform evaluations for

Student’s and Welch’s t-test, ANOVA (F-test), chi-squared test, Fisher’s exact test

and McNemar’s test over horizontally partitioned data. The proposed statistical

verification could be outsourced to a semi-honest third party, namely verifiers, as no

private data were exchanged during the veriﬁcation process. Secure protocols based

on secret sharing were utilized to compute the means and variance. Consequently,

Student’s t-test, Welch’s test and F-test could be evaluated based on the derived

means and variances. Meanwhile, the chi-squared test, Fisher's exact test and

McNemar’s test could be performed based on the frequencies in the contingency

table, which would not reveal the individual records in the group of data held by

certain institutions. The proposed mechanism is proven to protect the data security

and privacy in the semi-honest model using secure multiparty computation protocols

from Shamir’s secret sharing.
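To make the flavor of such aggregate-based testing concrete, the sketch below (plain Python, no cryptography, all names our own) computes a pooled two-sample Student's t statistic from per-institution aggregates (count, sum, sum of squares) over horizontally partitioned data; in the protocol described above these aggregates would be secret-shared and combined with SMC primitives rather than exchanged in the clear.

```python
import math

def site_aggregates(values):
    """Each institution reports only (n, sum, sum of squares) for its local cohort."""
    n = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    return n, s, ss

def pooled_t_statistic(aggs_a, aggs_b):
    """Two-sample Student's t-test, each group given as a list of per-site aggregates."""
    def combine(aggs):
        n = sum(a[0] for a in aggs)
        s = sum(a[1] for a in aggs)
        ss = sum(a[2] for a in aggs)
        mean = s / n
        var = (ss - n * mean * mean) / (n - 1)      # unbiased sample variance
        return n, mean, var

    n1, m1, v1 = combine(aggs_a)
    n2, m2, v2 = combine(aggs_b)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)   # pooled variance
    t = (m1 - m2) / math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return t, n1 + n2 - 2                            # t statistic and degrees of freedom

# Horizontally partitioned toy data: two institutions per treatment group.
group_a = [site_aggregates([5.1, 4.8, 5.6]), site_aggregates([5.0, 5.3])]
group_b = [site_aggregates([4.2, 4.5]), site_aggregates([4.0, 4.7, 4.3])]
print(pooled_t_statistic(group_a, group_b))
```

The same pattern applies to Welch's test and the F-test, since all of them depend only on group-level counts, means and variances.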

3.5 Asynchronous Optimization

In Sect. 3.2, we discussed the Newton-Raphson method and ADMM framework

for distributed optimization in federated data analysis. In these methods, all

institutions are commonly synchronized for computation at each iteration. However,

these synchronous methods would fail due to unexpected communication delay

and interruption under practical network conditions. Asynchronous optimization

algorithms have been developed to make distributed and parallel optimization

based on local updates from institutions with various delays, e.g. institutions with

different computation and processing speeds and access frequencies. Thus, server

and institutions can proceed without waiting for prerequisite information as in

synchronous methods.


3.5.1 Asynchronous Optimization Based on Fixed-Point

Algorithms

Chazan and Miranker (1969) first proposed an asynchronous parallel method to solve linear systems of equations with chaotic relaxation. The work developed totally (infinite delay) and partially (bounded delay) asynchronous parallel algorithms to improve the efficiency of iterative schemes based on a limited number of updates, with a guarantee of convergence. Baudet (1978) extended totally asynchronous chaotic relaxation to solve fixed-point problems with contracting operators, where convergence to the fixed point is guaranteed under the sufficient condition of contracting operators.

Bertsekas and Tsitsiklis (1989) introduced gradient-like optimization based

on totally and partially asynchronous parallel algorithms in unconstrained and

constrained optimization problems. Totally asynchronous gradient algorithm was

studied for optimization problems with contraction mapping under the metric of

weighted maximum norm. Its convergence is guaranteed when the diagonal dom-

inance condition is satisﬁed for its Hessian matrix. Provided ﬁnite asynchrony for

communication and computation, the partially asynchronous parallel optimization

is shown to converge when the step size for iterative optimization is sufﬁciently

small in comparison to the asynchrony measure. Tseng (1991) analyzed the rate

of convergence for partially asynchronous gradient projection algorithm. Given an

objective function with Lipschitz continuous gradient, a linear convergence rate can

be achieved, when its isocost surfaces can be discriminated and upper Lipschitz

property holds for at least one of its multivalued functions. Tai and Tseng (2002)

studied the rate of convergence for strongly convex optimization problems based on

asynchronous domain decomposition methods.

Recently, Peng et al. (2016) developed an algorithmic framework, ARock, for asynchronous optimization related to a non-expansive operator T with a fixed point. Under the assumption of atomic updates, ARock randomly selected a component of the primal variable x and updated it without memory locking, using the sub-operator S = I - T evaluated at the newly updated (possibly delayed) variable \hat{x} within a bounded delay:

x^{(t+1)} = x^{(t)} - \eta\, S_{i_t} \hat{x}^{(t)}    (3.5)

Here i_t is the random variable indicating the coordinate chosen for atomic writing at time t, and \eta is a step-size parameter. ARock is demonstrated to achieve convergence for finite-dimensional operators. It is also guaranteed to converge to the fixed point at a linear rate when the difference between the identity matrix and the non-expansive operator is

quasi-strongly monotone. Furthermore, Peng et al. (2016) studied coordinate update

methods for speciﬁc optimization problems with coordinate friendly operators.

Remarkably, coordinate update methods could derive and leverage a variety of

coordinate friendly operators to make asynchronous realization for coordinate

descent, proximal gradient, ADMM and primal-dual methods.
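A purely didactic, sequential simulation of the update in Eq. (3.5) is sketched below: T is taken to be the gradient-descent operator of a small least-squares problem (so S = I - T is a scaled gradient), a short buffer of past iterates emulates the bounded delay, and one randomly chosen coordinate is updated from the stale copy x̂ at each step. This is an illustration of the scheme under our own assumptions, not the ARock implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

gamma = 1.0 / np.linalg.norm(A.T @ A, 2)   # step size making T = I - gamma*grad non-expansive
def S(x):                                   # S = I - T: here a scaled least-squares gradient
    return gamma * (A.T @ (A @ x - b))

x = np.zeros(5)
eta = 0.9                                   # relaxation parameter of the coordinate update
max_delay = 3
history = [x.copy()]                        # past iterates, used to emulate bounded delay

for t in range(2000):
    delay = rng.integers(0, min(max_delay, len(history) - 1) + 1)
    x_hat = history[-1 - delay]             # stale read \hat{x}(t), at most max_delay steps old
    i = rng.integers(0, x.size)             # coordinate i_t chosen for atomic writing
    x[i] -= eta * S(x_hat)[i]               # x_i(t+1) = x_i(t) - eta * (S \hat{x}(t))_i
    history.append(x.copy())
    history = history[-(max_delay + 1):]

print("asynchronous solution:", x)
print("direct least squares: ", np.linalg.lstsq(A, b, rcond=None)[0])
```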


3.5.2 Asynchronous Coordinate Gradient Descent

A series of partially asynchronous parallel methods have been developed for

coordinate gradient descent. Niu et al. (2011) presented an asynchronous strategy

HOGWILD! to perform stochastic gradient descent in parallel without memory

locking. The presented strategy enabled multiple processors to make gradient

updates in the shared memory. For sparse learning problems, HOGWILD! achieves a sublinear convergence rate, as the error introduced by conflicting gradient updates is trivial. Liu et al. (2015) proposed an asynchronous stochastic coordinate descent

algorithm AsySCD to improve parallel minimization of smooth convex functions.

When essential strong convexity condition is satisﬁed, the minimization can be

solved in a linear convergence rate. Later, Liu and Wright (2015) developed an

asynchronous parallel algorithm AsySPCD to minimize the composite objective

function composed of smooth convex functions using stochastic proximal coor-

dinate descent. AsySPCD considered the scenario that data were simultaneously

accessed and updated by multiple institutions. Under the relaxed optimal strong

convexity condition, the proposed algorithm can achieve a linear convergence rate.

Remarkably, both AsySCD and AsySPCD could be accelerated at a nearly linear rate given a sufficient number of processors (institutions). Hsieh et al. (2015) developed

a family of parallel algorithms PASSCoDe for stochastic dual coordinate descent

based on the LIBLINEAR software to efficiently solve ℓ2-regularized empirical risk

minimization problems. At each iteration, PASSCoDe allowed each institution to

randomly select a dual variable to update the primal variables with coordinate

descent. Under asynchronous settings, three parallel algorithms were developed

to handle the possible violation of the primal-dual relationship for ℓ2-regularized minimization. PASSCoDe-Lock and PASSCoDe-Atomic were proposed to maintain

the primal-dual relationship with memory locking and atomic writing. PASSCoDe-

Wild achieved nearly optimal performance with no locking and atomic operations

by performing backward error analysis for memory conﬂicts.
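The lock-free pattern of HOGWILD! can be imitated with Python threads that update a shared NumPy weight vector without any synchronization; because each sparse sample touches only a few coordinates, occasional overwrites do little harm. This toy sketch (least-squares objective, all names ours; the GIL serializes much of the Python-level work, so it only illustrates the update pattern) is not the original implementation.

```python
import threading
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 50
X = rng.standard_normal((n, d)) * (rng.random((n, d)) < 0.1)   # sparse-ish features
w_true = rng.standard_normal(d)
y = X @ w_true

w = np.zeros(d)            # shared parameter vector, updated by all threads without locks
lr = 0.05

def worker(num_steps, seed):
    local_rng = np.random.default_rng(seed)
    for _ in range(num_steps):
        i = local_rng.integers(0, n)
        xi = X[i]
        grad = (xi @ w - y[i]) * xi            # stochastic gradient of 0.5*(x_i.w - y_i)^2
        idx = np.nonzero(xi)[0]                # only the coordinates this sample touches
        w[idx] -= lr * grad[idx]               # lock-free (possibly racy) update

threads = [threading.Thread(target=worker, args=(5000, s)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("relative error:", np.linalg.norm(w - w_true) / np.linalg.norm(w_true))
```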

3.5.3 Asynchronous Alternating Direction Method

of Multipliers

Recently, asynchronous ADMM framework with consensus constraints has been

studied for distributed optimization under practical network conditions. Similar to

SMC, two architectures have been developed to realize asynchronous optimization

under the ADMM framework. In the peer-to-peer architecture, institutions are

interconnected with their delay or inconsistency indicated by asynchronous clock,

so that optimization is made in a decentralized manner. Iutzeler et al. (2013) adopted

randomized Gauss-Seidel iterations of Douglas-Rachford operator to develop a

generalized asynchronous ADMM framework over multiple institutions. The pro-

posed asynchronous optimization methods allowed partial activation of isolated decision variables (not involved in the active constraints) to perform distributed


minimization without coordinating all the institutions. Based on the monotone

operator theory, a randomized ADMM algorithm was formulated for various

institutions with different frequencies of activation. It is demonstrated to converge to

the optimal minimum under the connectivity assumption for the graph derived from

institutions. Wei and Ozdaglar (2013) proposed an asynchronous algorithm based on

ADMM for separable linearly constrained optimization over multiple institutions.

It established a general ADMM-based formulation under asynchronous settings to

iteratively determine a pair of random variables representing active constraints and

coupled decision variables and utilize them to update primal and dual variables.

Relating the iterative results under asynchronous setting to those based on all the

institutions, the proposed asynchronous algorithm is demonstrated to converge with

a rate inversely proportional to the number of iterations.

Spoke-hub architecture maintains a master node (server) to make consensus over

the primal variables distributed across multiple institutions. Zhang and Kwok (2014)

proposed an asynchronous ADMM algorithm that introduced partial barrier and

bounded delay to control asynchrony and guarantee convergence. The proposed

async-ADMM algorithm updated the local primal and dual variables for each

institution with the most recent consensus variable. Meanwhile, the master node

required only a partial subset of updated primal and dual variables with bounded

delay to update the consensus variable in an asynchronous manner. Under the

bounded delay, institutions with different computation and processing speed can

be properly accounted for in the global consensus. The convergence property holds for a composition of non-smooth convex local objective functions. Hong

(2014) presented a generalized proximal ADMM framework to solve non-convex

optimization problems with nonsmooth objective functions in an asynchronous

and distributed manner. An incremental algorithm was developed for iterative

asynchronous optimization with varying active constraints. At each iteration, the

server updated all the primal and dual variables with the most recent gradients of

local objective functions from institutions. The consensus variables were derived

with proximal operation to compute the gradient in each institution. It should be

noted that the server only required a subset of institutions with newly-computed

gradients to update the primal and dual variables. Thus, the proposed Async-

PADMM algorithm is robust in handling the asynchrony caused by communication delay

and interruption.
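A much-simplified simulation of the spoke-hub scheme is sketched below for a toy consensus least-squares problem: each institution keeps its own (x_i, u_i) and refreshes them with the consensus variable it last received, while the master recomputes z after only a subset of institutions has reported. The quadratic objective, parameter values and names are our own assumptions, and the bounded-delay bookkeeping of the cited algorithms is omitted.

```python
import numpy as np

rng = np.random.default_rng(2)
d, rho = 3, 1.0
sites = []
for _ in range(5):                               # five institutions, horizontally partitioned rows
    A = rng.standard_normal((30, d))
    b = rng.standard_normal(30)
    sites.append({"A": A, "b": b, "x": np.zeros(d), "u": np.zeros(d)})

z = np.zeros(d)                                  # consensus variable held by the master

def local_update(s, z):
    """x_i = argmin ||A_i x - b_i||^2 + (rho/2)||x - z + u_i||^2, then dual ascent on u_i."""
    A, b = s["A"], s["b"]
    s["x"] = np.linalg.solve(2 * A.T @ A + rho * np.eye(d),
                             2 * A.T @ b + rho * (z - s["u"]))
    s["u"] += s["x"] - z

for it in range(200):
    active = rng.choice(len(sites), size=2, replace=False)   # partial barrier: only 2 of 5 report
    for i in active:
        local_update(sites[i], z)
    # master recomputes the consensus from the most recent (possibly stale) local variables
    z = np.mean([s["x"] + s["u"] for s in sites], axis=0)

A_all = np.vstack([s["A"] for s in sites])
b_all = np.concatenate([s["b"] for s in sites])
print("consensus z :", z)
print("centralized :", np.linalg.lstsq(A_all, b_all, rcond=None)[0])
```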

3.6 Discussion and Conclusion

In this chapter, we review the privacy-preserving federated data analysis algorithms

in the context of biomedical research. Federated data analysis algorithms aim to

facilitate biomedical research and clinical treatments based on large-scale healthcare

data collected from various organizations with full compliance with the institutional

policies and legislation. Federated models are developed to bring computation to

horizontally or vertically partitioned data rather than explicitly transfer them to a

central repository. Server/client and decentralized architectures are established to


estimate global model parameters and perform statistical tests based on the aggre-

gated intermediary results from multiple institutions. As a result, the patient-level

data cannot be inferred, as long as they are not distributed in an extremely unbalanced way. Distributed optimization for federated data analysis is achieved and

generalized by integrating the Newton-Raphson method and ADMM framework.

For generalized linear regression, the Newton-Raphson method is demonstrated to

achieve equivalent accuracy in model parameter estimation and statistical tests in

comparison to its centralized counterparts. ADMM framework splits the objective

function for distributed optimization with auxiliary variables to formulate decom-

posable cost function for multiple institutions. Since it is not necessarily required

to derive the global consensus from all the local data, ADMM can support decentralized analysis without a data server. The separate estimates from the

institutions are guaranteed to converge under the consensus constraints based on

auxiliary variables. To further protect the communication among servers and clients,

secure multiparty computation protocols were adopted for applications like regression, classification and evaluation over distributed data. This protects the process of computation and communication for improved data security and privacy.

Finally, asynchronous parallel optimization can be incorporated to handle real-world

network conditions and improve the efﬁciency for parallel computation.

Privacy-preserving federated data analysis can be further studied for improved

security and efficiency, and for more generic formulations in practice. One major challenge

for the federated data analysis is distributed model parameter estimation and

statistical tests for vertically partitioned data. Although federated models are shown

to be effective for logistic regression models and SVM classiﬁers over vertically

partitioned data, an analogous formulation based on dual optimization would fail for

the analysis tasks with non-separable objective functions like the Cox proportional

hazard model. Since each institution only holds a portion of covariates, the kernel

matrix cannot be explicitly decomposed across the institutions. Under such cases,

ADMM tends to be a promising alternative. However, ADMM commonly suffers

from low convergence rate and consequent high communication cost to solve the

iterative optimization with decomposable subproblems. Two-loop recursion would

be required to solve the optimization problem based on augmented Lagrangian

functions and regularized optimization problem in the server or client side. Thus,

it is imperative to develop privacy-preserving algorithms to consider both accuracy

and efﬁciency for distributed analysis over vertically partitioned data.

The second challenge is the communication cost between institutions and server.

A server is required to collect the data information or computation results from

each institution and output the ﬁnal results, while all the institutions collaborate in

a round robin manner without a server in the second architecture. El Emam et al.

(2013) showed that privacy risk might exist even for exchanging intermediary results

that are aggregated from a local patient cohort. Taking horizontally partitioned data

for example, there is a chance to identify sensitive patient-level data from the Gram matrix for linear models, or from indicator variables, the covariance matrix, intermediary

results from multiple iterations, or models. On the other hand, encryption methods

can prevent information leakage in exchanging data but will noticeably affect


computational and storage efﬁciency. For example, homomorphic encryption would

signiﬁcantly increase computation and storage costs, when securely outsourcing

data and computations. To protect all raw data and intermediary results, SMC-based

distributed analysis methods require a high communication cost to deploy oblivious

transfer protocols. Moreover, many existing encryption-based methods cannot be

scaled up to handle large biomedical data. As a result, a proper protocol or mechanism

is crucial to balance security and efﬁciency in institution/server communication.

Acknowledgement This research was supported by the Patient-Centered Outcomes Research

Institute (PCORI) under contract ME-1310-07058 and the National Institutes of Health (NIH) under award numbers R01GM118574, R01GM118609, R00HG008175, R21LM012060, and

U01EB023685.

References

Act, D. P. (1998). Data protection act. London: The Stationery Ofﬁce.

Baudet, G. M. (1978). Asynchronous iterative methods for multiprocessors. Journal of the ACM,

25(2), 226–244.

Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and distributed computation: numerical

methods (Vol. 23). Englewood Cliffs, NJ: Prentice hall.

Bogdanov, D., Laur, S., & Talviste, R. (2014). A practical analysis of oblivious sorting algorithms

for secure multi-party computation. In K. Bernsmed & S. Fischer-Hübner (Eds.), Secure IT

systems (pp. 59–74). Springer International Publishing.

Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and

statistical learning via the alternating direction method of multipliers. Foundations and Trends

in Machine Learning, 3(1), 1–122.

Brown, J., Balaconis, E., Mazza, M., Syat, B., Rosen, R., Kelly, S., et al. (2012). PS1-

46: HMORNnet: Shared infrastructure for distributed querying by HMORN collaboratives.

Clinical Medicine & Research, 10(3), 163.

Chambless, L. E., & Diao, G. (2006). Estimation of time-dependent area under the ROC curve for

long-term risk prediction. Statistics in Medicine, 25(20), 3474–3486.

Chazan, D., & Miranker, W. (1969). Chaotic relaxation. Linear Algebra and Its Applications, 2(2),

199–222.

Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society,

Series B, Statistical Methodology, 34(2), 187–220.

Du, W., Han, Y. S., & Chen, S. (2004). Privacy-preserving multivariate statistical analysis: Linear

regression and classiﬁcation. In Proceedings of the 2004 SIAM international conference on

data mining (pp. 222–233).

El Emam, K., Samet, S., Arbuckle, L., Tamblyn, R., Earle, C., & Kantarcioglu, M. (2013). A secure

distributed logistic regression protocol for the detection of rare adverse drug events. Journal of

the American Medical Informatics Association: JAMIA, 20(3), 453–461.

Fienberg, S. E., Fulp, W. J., Slavkovic, A. B., & Wrobel, T. A. (2006). “Secure” log-linear and

logistic regression analysis of distributed databases. In J. Domingo-Ferrer & L. Franconi (Eds.),

Privacy in statistical databases (pp. 277–290). Berlin: Springer.

Fienberg, S. E., Nardi, Y., & Slavković, A. B. (2009). Valid statistical analysis for logistic

regression with multiple sources. In C. S. Gal, P. B. Kantor, & M. E. Lesk (Eds.), Protecting

persons while protecting the people (pp. 82–94). Berlin: Springer.

Forero, P. A., & Giannakis, G. B. (2010). Consensus-based distributed support vector machines.

Journal of Machine Learning Research: JMLR, 11, 1663–1707.


Goldfarb, D., Ma, S., & Scheinberg, K. (2012). Fast alternating linearization methods for

minimizing the sum of two convex functions. Mathematical Programming A Publication of

the Mathematical Programming Society, 141(1–2), 349–382.

Guo, Y.-F., Lin, X., Teng, Z., Xue, X., & Fan, J. (2013). A covariance-free iterative algorithm for

distributed principal component analysis on vertically partitioned data. Pattern Recognition,

45(3), 1211–1219.

Gymrek, M., McGuire, A. L., Golan, D., Halperin, E., & Erlich, Y. (2013). Identifying personal

genomes by surname inference. Science, 339(6117), 321–324.

Health Insurance Portability and Accountability Act (HIPAA). (n.d.). Retrieved from http://

www.hhs.gov/ocr/hipaa.

Homer, N., Szelinger, S., Redman, M., Duggan, D., Tembe, W., Muehling, J., et al. (2008).

Resolving individuals contributing trace amounts of DNA to highly complex mixtures using

high-density SNP genotyping microarrays. PLoS Genetics, 4(8), e1000167.

Hong, M. (2014). A distributed, asynchronous and incremental algorithm for nonconvex optimiza-

tion: An ADMM based approach.arXiv [cs.IT]. Retrieved from http://arxiv.org/abs/1412.6058.

Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression.NewYork,

NY: Wiley.

Hsieh C-J., Yu H-F., & Dhillon I. S. (2015). Passcode: Parallel asynchronous stochastic dual co-

ordinate descent. In Proceedings of the 32nd international conference on machine learning

(ICML-15) (pp. 2370–2379).

Iutzeler, F., Bianchi, P., Ciblat, P., & Hachem, W. (2013). Asynchronous distributed optimization

using a randomized alternating direction method of multipliers. In 52nd IEEE conference on

decision and control (pp. 3671–3676). ieeexplore.ieee.org.

Jiang, W., Li, P., Wang, S., Wu, Y., Xue, M., Ohno-Machado, L., & Jiang, X. (2013). WebGLORE:

A web service for Grid LOgistic REgression. Bioinformatics, 29(24), 3238–3240.

Kantarcioglu, M. (2008). A survey of privacy-preserving methods across horizontally partitioned

data. In C. C. Aggarwal & P. S. Yu (Eds.), Privacy-preserving data mining (pp. 313–335).

New York, NY: Springer.

Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations.

Journal of the American Statistical Association, 53(282), 457–481.

Karr, A. F., Lin, X., Sanil, A. P., & Reiter, J. P. (2009). Privacy-preserving analysis of vertically

partitioned data using secure matrix products. Journal of Ofﬁcial Statistics, 25(1), 125–138.

Lafky, D. (2010). The Safe Harbor method of de-identiﬁcation: An empirical test. In Fourth

national HIPAA summit west.

Liang, Y., Balcan, M. F., & Kanchanapally, V. (2013). Distributed PCA and k-means clustering. In

The big learning workshop at NIPS 2013 (pp. 1–8).

Ling, Q., Shi, W., Wu, G., & Ribeiro, A. (2015). DLM: Decentralized linearized alternating

direction method of multipliers. IEEE Transactions on Signal Processing: A Publication of

the IEEE Signal Processing Society, 63(15), 4051–4064.

Liu, J., & Wright, S. J. (2015). Asynchronous stochastic coordinate descent: Parallelism and

convergence properties. SIAM Journal on Optimization, 25(1), 351–376.

Liu, J., Wright, S. J., Ré, C., Bittorf, V., & Sridhar, S. (2015). An asynchronous parallel stochastic

coordinate descent algorithm. Journal of Machine Learning Research JMLR, 16, 285–322.

Retrieved from http://www.jmlr.org/papers/volume16/liu15a/liu15a.pdf.

Li, Y., Jiang, X., Wang, S., Xiong, H., & Ohno-Machado, L. (2016). VERTIcal Grid lOgistic

regression (VERTIGO). Journal of the American Medical Informatics Association, 23(3),

570–579.

Lu, C.-L., Wang, S., Ji, Z., Wu, Y., Xiong, L., & Jiang, X. (2014). WebDISCO: A web service

for distributed cox model learning without patient-level data sharing. Journal of the American

Medical Informatics Association, 22(6), 1212–1219.

Makri, E., Everts, M. H., de Hoogh, S., Peter, A., op den Akker, H., Hartel, P., & Jonker, W.

(2014). Privacy-preserving veriﬁcation of clinical research (pp. 481–500). Presented at the

Sicherheit 2014 - Sicherheit, Schutz und Zuverlässigkeit, Beiträge der 7. Jahrestagung des

Fachbereichs Sicherheit der Gesellschaft für Informatik e.V. (GI), Bonn, Germany: Gesellschaft

für Informatik.


Mangasarian, O. L., Wild, E. W., & Fung, G. M. (2008). Privacy-preserving classiﬁcation of

vertically partitioned data via random kernels. ACM Transactions on Knowledge Discovery

from Data, 2(3), 12:1–12:16.

Mateos, G., Bazerque, J. A., & Giannakis, G. B. (2010). Distributed sparse linear regression.

IEEE Transactions on Signal Processing: A Publication of the IEEE Signal Processing Society,

58(10), 5262–5276.

Mateos, G., & Giannakis, G. B. (2012). Distributed recursive least-squares: Stability and perfor-

mance analysis. IEEE Transactions on Signal Processing: A Publication of the IEEE Signal

Processing Society, 60(612), 3740–3754.

Mateos, G., & Schizas, I. D. (2009). Distributed recursive least-squares for consensus-based in-

network adaptive estimation. IEEE Transactions on Signal Processing, 57(11), 4583–4588.

Retrieved from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5061644.

McGraw, D. (2008). Why the HIPAA privacy rules would not adequately protect personal health

records: Center for Democracy and Technology (CDT) brief. http, 1–3.

Mokhtari, A., Shi, W., Ling, Q., & Ribeiro, A. (2016). DQM: Decentralized quadratically approx-

imated alternating direction method of multipliers. IEEE Transactions on Signal Processing: A

Publication of the IEEE Signal Processing Society, 64(19), 5158–5173.

Nardi, Y., Fienberg, S. E., & Hall, R. J. (2012). Achieving both valid and secure logistic

regression analysis on aggregated data from different private sources. Journal of Privacy and

Conﬁdentiality, 4(1), 9.

Niu, F., Recht, B., Re, C., & Wright, S. (2011). Hogwild: A lock-free approach to parallelizing

stochastic gradient descent. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q.

Weinberger (Eds.), Advances in neural information processing systems 24 (pp. 693–701). Red

Hook, NY: Curran Associates, Inc.

Ohno-Machado, L., Agha, Z., Bell, D. S., Dahm, L., Day, M. E., & Doctor, J. N. (2014).

pSCANNER: Patient-centered scalable national network for effectiveness research. Journal

of the American Medical Informatics Association: JAMIA, 21(4), 621–626.

O’Keefe, C. M., Sparks, R. S., McAullay, D., & Loong, B. (2012). Conﬁdentialising survival

analysis output in a remote data access system. Journal of Privacy and Conﬁdentiality, 4(1), 6.

Ouyang, H., He, N., Tran, L., & Gray, A. (2013). Stochastic alternating direction method

of multipliers. In Proceedings of the 30th international conference on machine learning

(pp. 80–88).

PCORnet: The National Patient-Centered Clinical Research Network. (n.d.). Retrieved from http://www.pcornet.org/.

Peng, Z., Wu, T., Xu, Y., Yan, M., & Yin, W. (2016a). Coordinate-friendly structures, algorithms

and applications. Annals of Mathematical Sciences and Applications, 1(1), 57–119.

Peng, Z., Xu, Y., Yan, M., & Yin, W. (2016b). ARock: An algorithmic framework for asynchronous

parallel coordinate updates. SIAM Journal of Scientiﬁc Computing, 38(5), A2851–A2879.

PopMedNet. (n.d.). Retrieved from http://www.popmednet.org/.

Que, J., Jiang, X., & Ohno-Machado, L. (2012). A collaborative framework for distributed privacy-

preserving support vector machine learning. In AMIA annual symposium (pp. 1350–1359).

Chicago, IL: AMIA.

Sanil, A. P., Karr, A. F., Lin, X., & Reiter, J. P. (2004). Privacy Preserving regression modelling via

distributed computation. In Proceedings of the tenth ACM SIGKDD international conference

on knowledge discovery and data mining (pp. 677–682). New York, NY, USA: ACM.

Scardapane, S., Wang, D., & Panella, M. (2016). A decentralized training algorithm for Echo State

Networks in distributed big data applications. Neural Networks: The Ofﬁcial Journal of the

International Neural Network Society, 78, 65–74.

Schizas, I. D., & Aduroja, A. (2015). A distributed framework for dimensionality reduction

and denoising. IEEE Transactions on Signal Processing: A Publication of the IEEE Signal

Processing Society, 63(23), 6379–6394.

Shi, H., Jiang, C., Dai, W., Jiang, X., Tang, Y., Ohno-Machado, L., & Wang, S. (2016).

Secure multi-pArty computation grid LOgistic REgression (SMAC-GLORE). BMC Medical

Informatics and Decision Making, 16(Suppl 3), 89.


Slavkovic, A. B., & Nardi, Y. (2007). “Secure” logistic regression of horizontally and verti-

cally partitioned distributed databases. Seventh IEEE International. Retrieved from http://

ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4476748.

Sweeney, L., Abu, A., & Winn, J. (2013). Identifying participants in the personal genome project

by name. Available at SSRN 2257732. https://doi.org/10.2139/ssrn.2257732.

Tai, X.-C., & Tseng, P. (2002). Convergence rate analysis of an asynchronous space decomposition

method for convex minimization. Mathematics of Computation, 71(239), 1105–1135.

Tseng, P. (1991). On the rate of convergence of a partially asynchronous gradient projection

algorithm. SIAM Journal on Optimization, 1(4), 603–619.

Vaidya, J. (2008). A survey of privacy-preserving methods across vertically partitioned data. In C.

C. Aggarwal & P. S. Yu (Eds.), Privacy-preserving data mining (pp. 337–358). New York, NY:

Springer.

Vaidya, J., & Clifton, C. (2003a). Privacy-preserving K-means clustering over vertically partitioned

data. In Proceedings of the ninth ACM SIGKDD international conference on knowledge

discovery and data mining (pp. 206–215). New York, NY: ACM.

Vaidya, J., & Clifton, C. (2003b). Privacy preserving naive bayes classiﬁer for vertically partitioned

data. In Proceedings of the ninth acm SIGKDD international conference on knowledge

discovery and data mining.

Vaidya, J., Clifton, C., Kantarcioglu, M., & Patterson, A. S. (2008a). Privacy-preserving decision

trees over vertically partitioned data. ACM Transactions on Knowledge Discovery from Data,

2(3), 14:1–14:27.

Vaidya, J., Kantarcıoğlu, M., & Clifton, C. (2008b). Privacy-preserving naive bayes classification.

Journal on Very Large Data Bases, 17(4), 879–898.

Vaidya, J., Shaﬁq, B., Fan, W., Mehmood, D., & Lorenzi, D. (2014). A random decision tree

framework for privacy-preserving data mining. IEEE Transactions on Dependable and Secure

Computing, 11(5), 399–411.

Vaidya, J., Shaﬁq, B., Jiang, X., & Ohno-Machado, L. (2013). Identifying inference attacks against

healthcare data repositories. In AMIA joint summits on translational science proceedings. AMIA

summit on translational science,2013, (pp. 262–266).

Wang, H., & Banerjee, A. (2012). Online alternating direction method. In J. Langford & J. Pineau

(Eds.), Proceedings of the 29th international conference on machine learning (ICML-12) (pp.

1119–1126). New York, NY: Omnipress.

Wang, R., Li, Y. F., Wang, X., Tang, H., & Zhou, X. (2009). Learning your identity and disease

from research papers: information leaks in genome wide association study. In Proceedings of

the 16th ACM conference on computer and communications security (pp. 534–544). New York,

NY, USA: ACM.

Wang, S., Jiang, X., Wu, Y., Cui, L., Cheng, S., & Ohno-Machado, L. (2013). EXpectation

propagation LOgistic REgRession (EXPLORER): Distributed privacy-preserving online

model learning. Journal of Biomedical Informatics, 46(3), 1–50.

Weibull, W. (1951). A statistical distribution function of wide applicability. Journal of Applied

Mathematics, 18(3), 293–297.

Wei, E., & Ozdaglar, A. (2013). On the O(1/k) convergence of asynchronous distributed

alternating direction method of multipliers. In Global conference on signal and information

processing (GlobalSIP), 2013 IEEE (pp. 551–554). ieeexplore.ieee.org.

Wu, Y., Jiang, X., Kim, J., & Ohno-Machado, L. (2012a). Grid binary LOgistic REgression

(GLORE): Building shared models without sharing data. Journal of the American Medical

Informatics Association, 2012(5), 758–764.

Wu, Y., Jiang, X., & Ohno-Machado, L. (2012b). Preserving institutional privacy in distributed

binary logistic regression. In AMIA annual symposium (pp. 1450–1458). Chicago, IL: AMIA.

Wu, Y., Jiang, X., Wang, S., Jiang, W., Li, P., & Ohno-Machado, L. (2015). Grid multi-category

response logistic models. BMC Medical Informatics and Decision Making, 15(1), 1–10.

Yao, A. C. (1982). Protocols for secure computations. In 23rd annual symposium on foundations of computer science (SFCS 1982) (pp. 160–164). ieeexplore.ieee.org.


Yu, H., Jiang, X., & Vaidya, J. (2006a). Privacy-preserving SVM using nonlinear kernels

on horizontally partitioned data. In Proceedings of the 2006 ACM symposium on applied

computing (pp. 603–610). New York, NY: ACM.

Yu, H., Vaidya, J., & Jiang, X. (2006b). Privacy-preserving SVM classiﬁcation on vertically

partitioned data. Lecture Notes in Computer Science,3918 LNAI, 647–656.

Yunhong, H., Liang, F., & Guoping, H. (2009). Privacy-Preserving svm classiﬁcation on vertically

partitioned data without secure multi-party computation. In 2009 ﬁfth international conference

on natural computation (Vol. 1, pp. 543–546). ieeexplore.ieee.org.

Yu, S., Fung, G., Rosales, R., Krishnan, S., Rao, R. B., Dehing-Oberije, C., & Lambin, P. (2008).

Privacy-preserving cox regression for survival analysis. In Proceedings of the 14th ACM

SIGKDD international conference on Knowledge discovery and data mining (pp. 1034–1042).

New York, NY: ACM.

Zhang, R., & Kwok, J. (2014). Asynchronous distributed ADMM for consensus optimization. In

Proceedings of the 31st international conference on machine learning (pp. 1701–1709).

Chapter 4

Word Embedding for Understanding Natural

Language: A Survey

Yang Li and Tao Yang

4.1 Introduction

Natural language understanding from text data is an important ﬁeld in Artiﬁcial

Intelligence. As images and acoustic waves can be mathematically modeled by

analog or digital signals, we also need a way to represent text data in order to process

it automatically. For example, the sentence “The cat sat on the mat.” can not be

processed or understood directly by the computer system. The easiest way is to rep-

resent is through a sparse discrete vector f.icat;1/;.imat;1/;.ion;1/;.isit;1/;.ithe;2/g,

where iwdenotes the index of word win the vocabulary. This is called one-hot

embedding. However, there are several disadvantages for this simple model. First,

it generates high dimensional vectors whose length depends on the volume of

the vocabulary, which is usually very large. Meanwhile, the semantic relationship

between words (e.g., “sit on”) cannot be reﬂected by these separate counts. Due

to the subjectivity of languages, the meaning of word (phrase) varies in different

contexts. This makes it a more challenging task to automatically process text data.
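The sparse count representation above can be written down directly; the toy sketch below builds the vocabulary index i_w and the corresponding count vector for the example sentence (illustrative code, not a library API).

```python
from collections import Counter

sentence = "the cat sat on the mat".split()
vocab = {w: i for i, w in enumerate(sorted(set(sentence)))}   # i_w: index of word w

counts = Counter(sentence)
count_vector = [0] * len(vocab)            # dense view of the sparse discrete vector
for word, c in counts.items():
    count_vector[vocab[word]] = c

print(vocab)          # e.g. {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
print(count_vector)   # counts in vocabulary order; 'the' appears twice
```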

A goal of language modeling is to learn the joint probability that a given word

sequence occurs in texts. A larger probability value indicates that the sequence is

more commonly used. One candidate solution to involve the relationship between

words is called “n-gram” model, where nis the hyperparameter manually chosen.

The “n-gram” model takes into consideration the phrases of nconsecutive words. In

the example above, “the cat”, “cat sat”, “sat on”, “on the”, “the mat” will be counted,

if we set nD2. The disadvantage of “n-gram” model is that it is constrained by

the parameter n. Moreover, the intrinsic difﬁculty of language modeling is that: a

word sequence that is used to test the model is likely to be different from all the

sequences in the training set (Bengio et al. 2003). To make it more concrete, suppose

Y. Li • T. Yang (✉)
School of Automation, Northwestern Polytechnical University, Xi'an, Shaanxi 710072, P.R. China
e-mail: liyangnpu@mail.nwpu.edu.cn; yangtao107@nwpu.edu.cn

© Springer International Publishing AG 2018

S. Srinivasan (ed.), Guide to Big Data Applications, Studies in Big Data 26,

DOI 10.1007/978-3-319-53817-4_4



we want to model the joint distribution of 5-word sequences in a vocabulary of

100,000 words, the possible number of combinations is $100{,}000^5 - 1 = 10^{25} - 1$,

which is also the number of free parameters to learn. It is prohibitively large for

further processing. This phenomenon is called “the curse of dimensionality”. The

root of this curse is the “generalization” problem. We need an effective mechanism

to extrapolate the knowledge obtained during training to the new cases. For discrete

spaces mentioned above, the structure of generalization is not obvious, i.e. any

change on the discrete variables, as well as their combinations, will have a drastic

inﬂuence on the value of joint distribution to be estimated.

To solve the dilemma, we can model the variables in a continuous space, where

we can think of how training points trigger probability mass and distribute it

smoothly to the neighborhood around them. Then it goes back to the problem

of word embedding, whose concept was ﬁrst introduced in Hinton (1986). Word

embedding, sometimes named as word representation, is a collective name for

a set of language models and feature selection methods. Its main goal is to

map textual words or phrases into a low-dimensional continuous space. Using

the example above, “cat” can be denoted as $[0.1, 0.3, \ldots, 0.2] \in \mathbb{R}^M$ and “mat” can be expressed as $[0.1, 0.2, \ldots, 0.4] \in \mathbb{R}^M$, where M is a hyperparameter. After that, advanced NLP tasks can be implemented based on these real-valued

vectors. Word embedding encodes the semantic and syntactic information of words,

where semantic information mainly correlates with the meaning of words, while

syntactic information refers to their structural roles. Word embedding is a basic procedure in natural language processing. At a high level, most of the models optimize a loss function that minimizes the discrepancy between prediction values and target values. A basic assumption is that words in similar contexts should have similar meanings (Harris 1954). This hypothesis emphasizes the bond of $(w, \tilde{w})$ for common word-context pairs (word w and its contextual word $\tilde{w}$ usually appear together) and weakens the correlation of rare ones.

The existing word embedding approaches are diverse. There are several ways to

group them into different categories. According to Sun et al. (2015) and Lai et al.

(2015), the models can be classiﬁed as either paradigmatic models or syntagmatic

models, based on the word distribution information. The text region where words

co-occur is the core of the syntagmatic model, but for the paradigmatic model

it is the similar context that matters. Take “The tiger is a ﬁerce animal.” and

“The wolf is a ﬁerce animal.” as examples, “tiger-ﬁerce” and “wolf-ﬁerce” are the

syntagmatic words, while “wolf-tiger” are the paradigmatic words. Shazeer et al.

(2016) divides the models into two classes, matrix factorization and sliding-window

sampling method, according to how word embedding is generated. The former is

based on the word co-occurrence matrix, where word embedding is obtained from

matrix decomposition. For the latter one, data sampled from sliding windows is used

to predict the context word.

In this chapter, word embedding approaches are introduced according to the

method of mapping words to latent spaces. Here we mainly focus on the Neural

Network Language Model (NNLM) (Xu and Rudnicky 2000; Mikolov et al. 2013)

and the Sparse Coding Approach (SPA) (Yogatama et al. 2014a). Vector Space


Model aims at feature expression. A word-document matrix is ﬁrst constructed,

where each entry counts the occurrence frequency of a word in documents. Then

embedding vectors containing semantic information of words are obtained through

probability generation or matrix decomposition. In the Neural Network Language

Model, fed with training data, word embedding is encoded as the weights of a certain

layer in the neural network. There are several types of network architectures, such

as Restricted Boltzmann Machine, Recurrent Neural Network, Recursive Neural

Network, Convolutional Neural Network and Hierarchical Neural Network. They

are able to capture both the semantic and syntactic information. Sparse coding

model is another state-of-the-art method to get word embedding. Its goal is to

discover a set of bases that can represent the words efﬁciently.

The main structure of this chapter is as follows: the models mentioned above

and the evaluation methods are introduced in Sect. 4.2; the applications of word

embedding are included in Sect. 4.3; conclusion and future work are in Sect. 4.4.

4.2 Word Embedding Approaches and Evaluations

Generally speaking, the goal of word embedding is mapping the words in unlabeled

text data to a continuously-valued low dimensional space, in order to capture the

internal semantic and syntactic information. In this section, we ﬁrst introduce the

background of text representation. Then, according to the speciﬁc methods for

generating the mapping, we mainly focus on the Neural Network Language Model

and Sparse Coding Approach. The former is further introduced in three parts,

depending on the network structures applied in the model. In the end, we provide

evaluation approaches for measuring the performance of word embedding models.

4.2.1 Background

One of the widely used approaches for expressing text documents is the Vector

Space Model (VSM), where documents are represented as vectors. VSM was

originally developed for the SMART information retrieval system (Salton et al.

1997). Some classical VSMs can be found in Deerweste et al. (1990) and Hofmann

(2001).

There are various ways of building VSMs. In scenarios of information retrieval

where people care more about the textual features that facilitate text categorization,

various feature selection methods such as document frequency (DF), information gain (IG), mutual information (MI), the χ²-test (CHI-test) and term strength (TS)

have different effects on text classiﬁcation (Yang and Pedersen 1997). These

approaches help reduce the dimension of the text data, which could be helpful

for the subsequent processing. DF refers to the number of documents that a word

appears in. IG measures the information difference between the presence and absence of a term in


the document. MI is the ratio between term-document joint probability and the

product of their marginal probability. CHI-test applies the sums of squared errors

and tries to ﬁnd out the signiﬁcant difference between observed word frequency and

expected word frequency. TS estimates how likely a term will appear in “closely-

related” documents. In information retrieval systems, these methods are useful for

transforming the raw text to the vector space, and tremendous improvement has

been achieved for information classiﬁcation. However, it does not work well alone

in applications that require both semantic and syntactic information of words, since

part of the original text information is already lost in feature selection procedures.

However, various word embedding models, such as sparse coding, are built based

on the VSM.

In Deerweste et al. (1990), Latent Semantic Analysis (LSA) is proposed to index

and retrieve the information from text data automatically. It applies SVD on the

term-document association data to gain the embedding output. Work in Landauer

et al. (1998) and Landauer and Dumais (1997) gives an explanation to the word

similarity that derives from the Test of English as a Foreign Language (TOEFL).

After that, Landauer (2002) tries to detect synonyms through LSA. Furthermore,

Yih et al. (2012) uses LSA to distinguish between synonyms and antonyms in

the documents and its extension Multiview LSA (MVLSA) (Rastogi et al. 2015)

supports the fusion of arbitrary views of data. However, all of them have to confront

the limitation that depends on a single matrix of term-document co-occurrences.

Usually the word embedding from those models is not ﬂexible enough, because of

the strong reliance on the observed matrix.
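A minimal LSA pipeline is a truncated SVD of the term-document count matrix, after which the rows of U_k Σ_k serve as k-dimensional word vectors. The NumPy sketch below uses toy documents and our own variable names; it illustrates the idea rather than any of the cited systems.

```python
import numpy as np

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "dogs and cats are animals"]
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# term-document matrix: X[i, j] = count of word i in document j
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        X[index[w], j] += 1

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k] * s[:k]          # k-dimensional LSA word embeddings
print(dict(zip(vocab, np.round(word_vectors, 2))))
```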

Latent Dirichlet Allocation (LDA) (Bengio et al. 2003) is another useful

model for feature expression. It assumes that some latent topics are contained in

documents. The latent topic T is derived from the conditional distribution p(w|T), i.e. the probability that word w appears in T. Word embedding from traditional LDA

models can only capture the topic information but not the syntactic one, which does

not fully achieve its goal. Moreover, traditional LDA-based model is only used for

topic discovery, but not word representation. However, we can get a k-dimensional word embedding by training a k-topic model; the word embedding comes from the rows of the word-topic matrix, which is filled with p(w|T) values.
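Under this view, the k-dimensional “LDA embedding” of a word is simply its entry pattern across topics in the row-normalized word-topic matrix of a k-topic model. One way to obtain such a matrix is sketched below with scikit-learn (our choice of tooling; a recent version providing get_feature_names_out is assumed), on a tiny illustrative corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stocks fell as markets closed",
        "investors sold stocks and bonds"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

k = 2                                            # number of latent topics
lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)

# components_[t, w] is an unnormalized word-topic weight; normalize each row to get p(w|T)
p_w_given_t = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# the k-dimensional embedding of word w is its vector of p(w|T) values across topics
words = vec.get_feature_names_out()
embeddings = {w: p_w_given_t[:, i] for i, w in enumerate(words)}
print(embeddings["cat"], embeddings["stocks"])
```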

Although both of LSA and LDA utilize the statistical information directly for

word embedding generation, there are some differences between them. Speciﬁcally,

LSA is based on matrix factorization and subject to the non-negativity constraint.

LDA relies on the word distribution, and it is expressed by the Dirichlet prior distribution, which is the conjugate of the multinomial distribution.

4.2.2 Neural Network Language Model

The concept of word embedding was ﬁrst introduced with the Neural Networks

Language Model (NNLM) in Xu and Rudnicky (2000) and Bengio et al. (2003).

The motivation behind it is that words are more likely to share similar meanings if they


are in the same context (Harris 1954). The probability that the word sequence W

occurs can be formulated according to the Bayes rule:

P(W) = \prod_{t=1}^{N} P(w_t \mid w_1, \ldots, w_{t-1}) = \prod_{t=1}^{N} P(w_t \mid h_t)    (4.1)

where P(W) is the joint distribution of sequence W, and h_t denotes the context words around word w_t. The goal is to evaluate the probability that the word w_t appears given its context information. Because there are a huge number of possible combinations of a word and its context, it is impractical to specify all P(w_t | h_t). In this case, we can use a function \Phi to map the context into some equivalence classes, so that we have

P(w_t \mid h_t) = P(w_t \mid \Phi(h_t))    (4.2)

where all of the words are statistically independent.

If we use the n-gram model to construct a table of conditional probability for each

word, then the last n-1 words are combined together as the context information:

P(w_t \mid h_t) \approx P(w_t \mid \tilde{w}_{t-n+1}^{t-1})    (4.3)

where n is the number of words considered, with 1 and 2 being the most commonly used values, and \tilde{w}_{t-n+1}^{t-1} denotes the window of preceding words. Only the successive words occurring before the current word w_t are taken into account.
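For n = 2, Eq. (4.3) reduces to a table of bigram conditional probabilities that can be estimated by simple counting. The toy sketch below computes maximum-likelihood estimates with no smoothing; the corpus and names are illustrative.

```python
from collections import Counter

corpus = ["the cat sat on the mat", "the cat lay on the mat", "the dog sat on the log"]

bigrams = Counter()
unigrams = Counter()
for sent in corpus:
    words = ["<s>"] + sent.split()
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

def p(word, prev):
    """Maximum-likelihood estimate of P(w_t | w_{t-1})."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p("cat", "the"))   # 2/6: 'the' is followed by 'cat' in two of its six occurrences
print(p("sat", "cat"))   # 1/2
```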

Most NNLM-based approaches belong to the class of unsupervised models.

Networks with various architectures such as the Restricted Boltzmann Machine (RBM),

Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) and

Long-Short Term Memory (LSTM) can be used to build word embedding. Log-

Bilinear (LBL) model, SENNA, and Word2Vector are some of the most representa-

tive examples of NNLM.

Usually the goal of the NNLM is to maximize or minimize the function of Log-

Likelihood, sometimes with additional constraints. Suppose we have a sequence

w_1, w_2, \ldots, w_n in the corpus, and we want to maximize the log-likelihood of P(w_t \mid \tilde{w}_{t-n+1}^{t-1}) in the Feed Forward Neural Network (Bengio et al. 2003). Let x be the embedding vector of the preceding words, so that

x = [e(w_{t-n+1}), \ldots, e(w_{t-2}), e(w_{t-1})]    (4.4)

where e(w) represents the embedding of word w. In a Feed Forward Neural Network (without direct edges in the structure) with one hidden layer, the layer function can be expressed as:

y = b + U \tanh(d + Wx)    (4.5)


Fig. 4.1 The basic structure of the Feed Forward Neural Network

where U is the transformation matrix, W is the weight matrix of the hidden layer, and b and d are bias vectors. Finally, y is fed into a softmax layer to obtain the probability of the target word; the basic structure of this model is shown in Fig. 4.1. The parameters in this model are (b, U, d, W). However, since we need to explicitly normalize (e.g. with softmax) over all of the values in the vocabulary when computing the probability for the next word, training and testing will be very costly (Morin and Bengio 2005). After estimating the conditional probability in Eq. (4.3), the total probability P(w_{t-n+1}, \ldots, w_{t-1}, w_t) can be obtained directly by locating the mark of the phrase constructed by the words that appear in the same window (Collobert et al. 2011). The resultant probability value quantifies how natural such a sentence is in the language (more details can be found in

Sect. 4.2.2.3).
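A single forward pass of Eqs. (4.4) and (4.5), followed by the softmax layer, can be written in a few lines of NumPy; the dimensions and random initialization below are placeholders of our own choosing, not the settings of Bengio et al. (2003).

```python
import numpy as np

V, M, H, n = 1000, 30, 50, 4          # vocabulary size, embedding size, hidden size, context length
rng = np.random.default_rng(0)

E = rng.normal(0, 0.1, (V, M))        # embedding table, e(w) = E[w]
W = rng.normal(0, 0.1, (H, (n - 1) * M))
d = np.zeros(H)
U = rng.normal(0, 0.1, (V, H))
b = np.zeros(V)

context = [3, 17, 42]                 # indices of w_{t-n+1}, ..., w_{t-1}
x = np.concatenate([E[w] for w in context])   # Eq. (4.4)
y = b + U @ np.tanh(d + W @ x)                # Eq. (4.5)

probs = np.exp(y - y.max())
probs /= probs.sum()                  # softmax over the whole vocabulary (the costly step)
print(probs.shape, probs.sum())
```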

Instead of using matrix multiplication on the final layer (Bengio et al. 2003), Mnih and Hinton (2007) apply a hierarchical structure (see Sect. 4.2.2.4) to reduce the computational complexity. They construct the Log-BiLinear (LBL) model by adding

bilinear interaction between word vectors and hidden variables on the basis of the

energy function (see Sect. 4.2.2.1).

Recently, Word2Vector, proposed by Mikolov et al. (2013), which contains the two frameworks Skip-Gram and Continuous Bag-of-Words (CBOW), has become very popular in NLP tasks. It can be seen as a two-layer Neural Network Language Model. Furthermore, applying Noise Contrastive Estimation (NCE) increases its efficiency. We can also add prior information into this model: Liu et al. (2015) extends Skip-Gram by treating topical information as important prior


knowledge for training word embedding and proposes the topical word embedding

(TWE) model, where text words and their afﬁliated topic (context) derived from

LDA are combined to obtain the embedding.
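In practice, Skip-Gram and CBOW embeddings are usually trained with an off-the-shelf implementation. The snippet below shows a typical usage pattern of the gensim library (our choice of tooling; argument names follow the 4.x API and have changed across versions) on a toy corpus.

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "animals"],
]

# sg=1 selects Skip-Gram (sg=0 would be CBOW); negative sampling is the default objective
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=200, seed=1)

print(model.wv["cat"][:5])                  # first entries of the learned 50-dimensional vector
print(model.wv.most_similar("cat", topn=3))
```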

In the subsections below, we will introduce some basic architectures of NNLM

for word embedding.

4.2.2.1 Restricted Boltzmann Machine (RBM)

The ability of capturing the latent features among words usually depends on the

model structure. Boltzmann Machine (BM) originates from the log-linear Markov

Random Field (MRF) and its energy function is linearly related to free parameters.

According to statistical dynamics, the energy function is useful in word embedding

generation (Mnih and Hinton 2007). Restricted Boltzmann Machine (RBM), which

prunes the visible-visible and hidden-hidden connections, is a simpliﬁed version of

BM. A diagram of RBM is given in Fig. 4.2.

The main components of the energy function in the RBM include binary variables (hidden units h and visible units v), weights W = (w_{i,j}) that establish connections between h_j and v_i, and biases a_i for the visible units and b_j for the hidden ones. Specifically, the energy function is formulated as follows,

E(v, h) = -v^T W h - a^T v - b^T h    (4.6)

During the generation of word embedding, the energy function is used to define the probability distribution shown in Eq. (4.7),

P(v, h) = e^{-E(v, h)} / Z    (4.7)

where Z is a partition function defined as the sum of e^{-E(v, h)} over all configurations of (v, h).

Fig. 4.2 The basic structure of the RBM


Mnih and Hinton (2007) proposes three models on the basis of RBM by adding

connections to the previous status. Starting from the undirected graphical model,

they try to estimate the latent conditional distribution of words. To compute the

probability of the next word in a sequence, the energy function is deﬁned as follows:

E(w_t, h; w_{t-n+1:t-1}) = -\Big(\sum_{i=t-n+1}^{t-1} v_i^T R W_i\Big) h - b_h^T h - b_r^T R^T v_n - b_v^T v_n    (4.8)

where v_i^T R denotes the vector that represents word w_i from the word dictionary R, W_i in the hidden units denotes the similarity between two words, and b_h, b_v, b_r are the biases for the hidden units, words and word features, respectively.

The disadvantage of this class of models is that it is costly to estimate massive

parameters during the training process. To reduce the number of parameters,

Mnih and Hinton (2007) extend the factored RBM language model by adding connections C among the words. They also remove the hidden variables and directly apply stochastic binary variables. The modified energy function is as below:

E(w_t; w_{t-n+1:t-1}) = -\Big(\sum_{i=t-n+1}^{t-1} v_i^T R C_i\Big) R^T v_n - b_r^T R^T v_n - b_v^T v_n    (4.9)

Here C_i denotes the correlation between word vectors w_i and w_t, and b_r and b_v denote the word biases. This function quantifies the bilinear interaction between words, which is also the reason why models of this kind are called Log-BiLinear Language (LBL) models. In Eq. (4.9), the entries of the vector h = \sum_{i=t-n+1}^{t-1} v_i^T R C_i correspond to the nodes in the hidden layer, and y_i = h R^T v_n represents the set of nodes in the output layer. We can see that the syntactic information is contained in the hidden layer, for C_i here can be seen as the contribution of word i to the current word n, which is like the cosine similarity between two words.

4.2.2.2 Recurrent and Recursive Neural Network

To reduce the number of parameters, we could unify the layer functions with some

repetitive parameters. By treating the text as sequential data, Mikolov et al. (2010)

proposes a language model based on the Recurrent Neural Network, the structure of

which is shown in Fig. 4.3. The model has a circular architecture. At time t, the word embedding w(t) is generated out of the first layer. Then, it is transferred together with the output from the context layer at time t-1 as the new input x(t) to the context layer, which is formulated as follows:

x(t) = w(t) + s(t-1)    (4.10)


Fig. 4.3 The structure of the recurrent neural network

Fig. 4.4 The structure of the Recursive Neural Network

In each cycle, the outcomes of the context layer and the output layer are denoted as s(t) and y(t) respectively. Inspired by Bengio et al. (2003), the output layer also applies the

softmax function.
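One time step of this recurrent model can be sketched as follows: the current word vector and the previous context-layer state form the input x(t), the context layer applies a sigmoid, and the output layer applies a softmax over the vocabulary. The sizes, initialization and names below are illustrative assumptions, not the configuration of Mikolov et al. (2010).

```python
import numpy as np

V, H = 1000, 100                      # vocabulary size and context-layer size
rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (H, V + H))    # weights from x(t) = [w(t); s(t-1)] to the context layer
W_out = rng.normal(0, 0.1, (V, H))    # weights from the context layer to the output layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(word_index, s_prev):
    w_t = np.zeros(V)
    w_t[word_index] = 1.0                    # 1-of-V encoding of the current word
    x_t = np.concatenate([w_t, s_prev])      # Eq. (4.10): input combines w(t) and s(t-1)
    s_t = sigmoid(U @ x_t)                   # context layer s(t)
    y_t = softmax(W_out @ s_t)               # output layer y(t): distribution over the next word
    return s_t, y_t

s = np.zeros(H)
for w in [3, 17, 42]:                        # a toy word-index sequence
    s, y = step(w, s)
print(y.shape, y.argmax())
```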

Unlike Recurrent Neural Network, Recursive Neural Network (Goller and

Kuchler 1996) is a linear chain in space as shown in Fig. 4.4. To fully analyze

the sentence structure, the process iterates from two word vectors in leaf nodes b, c, which merge into a hidden output h_1; this output then concatenates with another input word vector a as the next input. The computing order is determined either by the sentence structure or by the graph structure. Not only can the hidden layers capture the context information, they can also involve the syntactic information. Therefore,


Fig. 4.5 (A) The structure of the Recursive Neural Network model where each node represents

a vector and all the word vectors are in the leaf nodes. (B) The structure of the MV-RNN model

where each node consists of a vector and a matrix

Recursive Neural Network can capture more information than the context-based

model, which is desirable for NLP tasks.

For the sequential text data in the Recursive Neural Network, Socher et al. (2011, 2012, 2013) parse the sentences into tree structures through tree bank models (Klein and Manning 2003; Antony et al. 2010). The model proposed by Socher et al. (2011) employs the Recursive Neural Network, and its structure is shown in Fig. 4.5A,

p = f(W [a; b])    (4.11)

where [a; b] stacks the two child vectors, a, b \in \mathbb{R}^{d \times 1} are the word vectors, the parent node p \in \mathbb{R}^{d \times 1} is the hidden output, and f = tanh adds element-wise nonlinearity to the model. W \in \mathbb{R}^{d \times 2d} is the parameter to learn.
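Equation (4.11) is a single affine map applied to the concatenation of the two children followed by tanh; composing it bottom-up over a parse tree yields phrase vectors. A two-step toy composition with random weights (names ours) is shown below.

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (d, 2 * d))            # shared composition matrix of Eq. (4.11)

def compose(a, b):
    """Parent vector p = f(W [a; b]) with f = tanh."""
    return np.tanh(W @ np.concatenate([a, b]))

a, b, c = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
h1 = compose(b, c)        # merge the two leaf words first
p = compose(a, h1)        # then merge with the remaining word, following the parse order
print(p)
```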

To incorporate more information in each node, work in (Socher et al. 2012)

proposes the model of Matrix-Vector Recursive Neural Network (MV-RNN), for

which the parsing-tree structure is shown in Fig. 4.5B. Different from that in

Fig. 4.5A, each node in this parsing tree includes a vector and a matrix. The vector

represents the word vector in leaf nodes and phrase in non-leaf ones. The matrix

is applied to neighboring vectors, and it is initialized by the identity matrix I plus Gaussian noise. The vectors a, b, p capture the inherent meaning transferred in the parsing tree, and the matrices A, B, P capture the changes of the meaning

for the neighboring words or phrases. Apparently, the drawback of this model is

the large number of parameters to be estimated. So in the work of Socher et al.

(2013), the model of Recursive Neural Tensor Network (RNTN) uses the tensor-

based composition function to avoid this problem.
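The chapter does not spell out the MV-RNN composition. As a rough sketch following the general recipe in Socher et al. (2012), each node carries a vector and a matrix, each child's matrix is applied to the other child's vector before the usual composition, and a second map produces the parent matrix; the dimensions and the weight names W and W_M below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
W = rng.normal(size=(d, 2 * d))     # vector-composition weights (illustrative)
W_M = rng.normal(size=(d, 2 * d))   # matrix-composition weights (illustrative)

def init_matrix():
    # Per the chapter: node matrices start as the identity plus Gaussian noise.
    return np.eye(d) + 0.01 * rng.normal(size=(d, d))

a, b = rng.normal(size=d), rng.normal(size=d)   # child vectors
A, B = init_matrix(), init_matrix()             # child matrices

# Each child's matrix modifies the *other* child's vector before composing.
p = np.tanh(W @ np.concatenate([B @ a, A @ b]))   # parent vector
P = W_M @ np.vstack([A, B])                       # parent matrix (d x d)
print(p.shape, P.shape)
```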

In comparison, the main differences between the recurrent and recursive models are the computing order and the network structure. The former contains a closed loop that deals with the temporal sequence, while the latter is an open, tree-shaped structure that deals with the spatial sequence.


4.2.2.3 Convolutional Neural Network (CNN)

The Convolutional Neural Network (CNN) consists of several feature extraction layers and is inspired by biological processes (Matsugu et al. 2003). It has already been applied successfully to many image recognition problems. The main advantage of the CNN is that it does not depend on prior knowledge or manual effort for feature selection.

CNN can also be applied to extract latent features from text data (Collobert and Weston 2008; Collobert et al. 2011). As shown in Fig. 4.6, three different convolutional kernels, drawn in three different colors, select different features from the word embeddings separately. The feature vector is the combination of the max values taken over the rows of the selected features.

In the models above, supervised learning is applied to train the whole neural network, and the convolutional layer models the latent features of the initialized word embeddings. Since a supervised training method is applied, the network is able to generate word embeddings suited to the specific task. Though the word embedding is usually a by-product in this case, it still performs soundly in capturing semantic and syntactic information.

Fig. 4.6 The structure of CNN for feature extraction by using convolutional kernels
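The mechanism of Fig. 4.6 can be sketched in plain numpy: several convolutional kernels slide over the sentence's embedding matrix, and the feature vector collects the maximum response of each kernel (max pooling over positions). The kernel widths and dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
d, T = 5, 9                          # embedding size and sentence length (toy values)
sentence = rng.normal(size=(T, d))   # one row per word embedding

def conv_max_feature(kernel):
    """Slide a (width x d) kernel over the sentence and keep the max response."""
    width = kernel.shape[0]
    responses = [
        np.sum(kernel * sentence[t:t + width])   # one convolution window
        for t in range(T - width + 1)
    ]
    return max(responses)                        # max pooling over positions

# Three kernels of different widths, standing in for the three colors in Fig. 4.6.
kernels = [rng.normal(size=(w, d)) for w in (2, 3, 4)]

feature_vector = np.array([conv_max_feature(k) for k in kernels])
print(feature_vector)    # one pooled feature per kernel
```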


4.2.2.4 Hierarchical Neural Language Model (HNLM)

There are many tricks to speed up the training of an NNLM, such as short lists, hash tables for the words, and stochastic gradient descent (Bengio et al. 2003). But when facing large-scale datasets, further improvement in model efficiency is still needed.

Transferring knowledge from a small portion of observed data examples to other cases can save a lot of computational effort, thus speeding up the learning process. However, this is a difficult task for neural language models, which involve computing the joint probability of certain word combinations: the number of possible combinations of words from a vocabulary is immensely larger than the number of combinations that actually occur in texts. So it is necessary to exploit some prior information rather than searching all possible combinations. Based on human knowledge or statistical information, we can cluster all the words into several classes, where the words in each class share similar statistical properties. Some hierarchical models are constructed based on this idea (Mnih and Hinton 2008; Morin and Bengio 2005).

Hierarchical structures have a big influence on the algorithmic complexity. For example, Mnih and Hinton (2008) and Morin and Bengio (2005) reduce the complexity of language models by clustering words into a binary tree. Specifically, given a dictionary $V$ containing $|V|$ words, the speed-up will be $O(|V| / \log |V|)$ (Morin and Bengio 2005) if the tree is binary.
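The source of that speed-up is that a prediction only walks one root-to-leaf path of a balanced binary tree, about $\log_2 |V|$ binary decisions, instead of scoring all $|V|$ words. The sketch below, with an arbitrary logistic decision model at each node (the parameters and the code assignment are illustrative), shows the path-product computation:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4
depth = 16                       # a balanced tree over 2**16 = 65536 words
h = rng.normal(size=d)           # some context representation

# One decision vector per node on the path (illustrative parameters).
path_nodes = rng.normal(size=(depth, d))
code = rng.integers(0, 2, size=depth)   # the word's binary code d_1 ... d_depth

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# P(word | context) = product over the path of P(d_i | node_i, context).
prob = 1.0
for q, d_i in zip(path_nodes, code):
    p_right = sigmoid(q @ h)
    prob *= p_right if d_i == 1 else (1.0 - p_right)

print(prob)   # only 16 binary decisions instead of a 65536-way softmax
```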

To determine the hierarchical structure, hierarchical word classes (lexical word categories in a narrow sense) are needed in most cases (Rijkhoff and Jan 2007). The way word classes are constructed is sensitive to the word distribution. Word classes can be built based on prior (expert) knowledge (Fellbaum 1998) or through data-driven methods (Sun et al. 2012), but they can also be generated automatically from usage statistics (Mnih and Hinton 2008; McMahon and Smith 1996). Prior knowledge is accurate but limited in its application range, since there are many problems for which expert categories are unavailable. Data-driven methods can reach a high level of accuracy, but they still need a manually tagged seed corpus. Universality is the major advantage of the automatic generation method: it can be applied to any natural language scenario without expert knowledge, though it is more complex to construct.

By utilizing the hierarchical structure, many language models have been proposed (Morin and Bengio 2005; Mnih and Hinton 2008; Yogatama et al. 2014a; Djuric et al. 2015). Morin and Bengio (2005) introduce a hierarchical decomposition based on word classes extracted from WordNet. Similar to Bengio et al. (2003), it uses a layer function to extract the latent features; the process is shown in Eq. (4.12):

$$P(d_i = 1 \mid q_i, w_{t-n+1:t-1}) = \mathrm{sigmoid}\big(U(\tanh(d + Wx)\, q_i) + b_i\big) \qquad (4.12)$$

Here $x$ is the concatenation of the context words [same as in Eq. (4.4)], $b_i$ is the bias vector, and $U, W, b_i, x$ play the same roles as in Eq. (4.5). The disadvantage is the procedure of tree construction, which has to combine manual and data-driven


processing (Morin and Bengio 2005). The Hierarchical Log-BiLinear (HLBL) model proposed by Mnih and Hinton (2008) overcomes this disadvantage by using a boosting method to generate the tree automatically. The binary tree with words as leaves consists of two components: the words in the leaves, each of which is uniquely represented by a sequential binary code read from top to bottom, and the probability of the decision made at each node. Each non-leaf node in the tree also has an associated vector $q$ which is used for discrimination. The probability of the predicted word $w$ in HLBL is formulated as follows:

$$P(w_t = w \mid w_{t-n+1:t-1}) = \prod_i P(d_i \mid q_i, w_{t-n+1:t-1}) \qquad (4.13)$$

where $d_i$ is the $i$th digit in the binary code sequence for word $w$, and $q_i$ is the feature vector for the $i$th node on the path from the root to word $w$. In each non-leaf node, the probability of the decision is given by

$$P(d_i = 1 \mid q_i, w_{t-n+1:t-1}) = \sigma\left(\Big(\sum_{j=t-n+1}^{t-1} C_j w_j\Big)^{\top} q_i + b_i\right) \qquad (4.14)$$

where $\sigma(x)$ is the logistic function for decision making, $C_j$ is the weight matrix corresponding to the context word $w_j$, and $b_i$ is the bias that captures the context-independent tendency to go to one of the node's children when leaving this node.
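Equation (4.14) is easy to compute once the context prediction $\sum_j C_j w_j$ is formed. The sketch below uses toy dimensions and random parameters and assumes the transpose (dot-product) reading of the reconstructed equation:

```python
import numpy as np

rng = np.random.default_rng(6)
d, n_ctx = 4, 2                       # embedding size and number of context words

C = [rng.normal(size=(d, d)) for _ in range(n_ctx)]   # per-position weight matrices C_j
w_ctx = [rng.normal(size=d) for _ in range(n_ctx)]    # context word embeddings w_j
q_i = rng.normal(size=d)                              # feature vector of tree node i
b_i = 0.1                                             # node bias b_i

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Predicted representation from the context: sum_j C_j w_j.
w_hat = sum(Cj @ wj for Cj, wj in zip(C, w_ctx))

# Eq. (4.14): probability of taking the "1" branch at node i.
p_branch = sigmoid(w_hat @ q_i + b_i)
print(p_branch)
```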

The hierarchical structure has both advantages and disadvantages: many models benefit from the reduced computational complexity, while the time spent building the structure is the main drawback.

4.2.3 Sparse Coding Approach

Sparse Coding is an unsupervised model that learns a set of over-complete bases to represent data efficiently. It generates basis vectors $\{\phi_i\}$ such that the input vector $x \in \mathbb{R}^n$ can be represented by a linear combination of them, $x = \sum_{i=1}^{k} \alpha_i \phi_i$, where $k$ is the number of basis vectors and $k > n$. Although there are many techniques, such as Principal Component Analysis (PCA), that help learn a complete set of basis vectors efficiently, Sparse Coding aims to use the fewest bases to represent the input $x$. The over-complete basis vectors are able to capture the structures and patterns inherent in the data, so the intrinsic characteristics of the input can be captured through Sparse Coding.

To be specific, Sparse Coding constructs an over-complete dictionary $D \in \mathbb{R}^{L \times K}$ of basis vectors, together with a code matrix $A \in \mathbb{R}^{K \times V}$, to represent the $V$ words of the corpus, $X \in \mathbb{R}^{L \times V}$, by minimizing the following function:

$$\arg\min_{D,A} \|X - DA\|_2^2 + \lambda \Omega(A) \qquad (4.15)$$


where $\lambda$ is a regularization hyperparameter and $\Omega$ is the regularizer. The squared loss is applied to the reconstruction error, but loss functions like the L1-regularized quadratic function can also be an efficient alternative for this problem (Schölkopf et al. 2007).
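The objective in Eq. (4.15) can be evaluated directly. The sketch below assumes an l1 regularizer for $\Omega$ and uses random toy matrices; it only computes the loss, and any off-the-shelf dictionary-learning solver could be used to minimize it:

```python
import numpy as np

rng = np.random.default_rng(7)
L, K, V = 5, 8, 20        # embedding size, dictionary size (K > L), vocabulary size
lam = 0.1                 # regularization hyperparameter (lambda)

X = rng.normal(size=(L, V))   # input word representations, one column per word
D = rng.normal(size=(L, K))   # over-complete dictionary
A = rng.normal(size=(K, V))   # sparse codes, one column per word

def sparse_coding_loss(X, D, A, lam):
    """Eq. (4.15) with an l1 regularizer as Omega: ||X - DA||_2^2 + lam * ||A||_1."""
    reconstruction = np.sum((X - D @ A) ** 2)
    regularizer = np.sum(np.abs(A))
    return reconstruction + lam * regularizer

print(sparse_coding_loss(X, D, A, lam))
```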

Plenty of algorithms have been proposed to extend the sparse coding framework starting from the general loss function above. If we apply a non-negativity constraint on $A$ (i.e., $A_{i,j} \geq 0$), the problem turns into Non-Negative Sparse Embedding (NNSE). Besides the constraints on $A$, if we also require that all entries of $D$ be greater than 0, then the problem can be transformed into Non-Negative Sparse Coding (NNSC). NNSC is a matrix factorization technique previously studied in the machine learning community (Hoyer 2002).

For $\Omega(A)$, Yogatama et al. (2014b) design a forest-structured regularizer that enables mixed use of the dimensions. The structure of this model is shown in Fig. 4.7: there are seven variables in the tree, which describes the order in which variables "enter the model". The work sets the rule that a node may take a nonzero value only if its ancestors do. For example, nodes 3 and 4 may only be nonzero if nodes 1 and 2 are also nonzero. The resulting regularizer for $A$ is shown in Eq. (4.16):

$$\Omega(a_v) = \sum_{i=1}^{V} \big\| \langle a_{v,i},\, a_{v,\mathrm{Descendants}(i)} \rangle \big\|_2 \qquad (4.16)$$

where $a_v$ is the $v$th column of $A$.
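The forest regularizer groups each dimension with all of its descendants, so a dimension is only cheap to make nonzero when its ancestors already are. A sketch for a seven-node tree in the spirit of Fig. 4.7 follows; the particular parent/child layout is illustrative, not necessarily the one drawn in the figure:

```python
import numpy as np

rng = np.random.default_rng(8)

# Illustrative 7-node tree (indices 1..7): children[i] lists the direct children of i.
children = {1: [2, 5], 2: [3, 4], 3: [], 4: [], 5: [6, 7], 6: [], 7: []}

def descendants(i):
    """All nodes below i in the tree."""
    out = []
    for c in children[i]:
        out.append(c)
        out.extend(descendants(c))
    return out

def forest_regularizer(a):
    """Eq. (4.16): sum over nodes of the l2 norm of (a_i together with its descendants)."""
    total = 0.0
    for i in children:
        group = [a[i - 1]] + [a[j - 1] for j in descendants(i)]
        total += np.linalg.norm(group, 2)
    return total

a_v = rng.normal(size=7)          # one column of A: the code for one word
print(forest_regularizer(a_v))
```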

Faruqui et al. (2015) propose two methods that add restrictions on the dictionary $D$:

$$\arg\min_{D,A} \|X - DA\|_2^2 + \lambda \Omega(A) + \tau \|D\|_2^2 \qquad (4.17)$$

Fig. 4.7 An example of a

regularization forest that

governs the order in which

variables enter the model


where $\tau$ is another regularization hyperparameter. The first method applies an $\ell_1$ penalty on $A$, so that Function (4.17) can be broken into a loss for each word vector, which makes parallel optimization possible:

$$\arg\min_{D,A} \sum_{i=1}^{V} \|x_i - D a_i\|_2^2 + \lambda \|a_i\|_1 + \tau \|D\|_2^2 \qquad (4.18)$$

Here $x_i$ and $a_i$ are the $i$th column vectors of the matrices $X$ and $A$ respectively. The second method adds non-negativity constraints on the variables, so that:

$$\arg\min_{D \in \mathbb{R}^{L \times K}_{\geq 0},\; A \in \mathbb{R}^{K \times V}_{\geq 0}} \sum_{i=1}^{V} \|x_i - D a_i\|_2^2 + \lambda \|a_i\|_1 + \tau \|D\|_2^2 \qquad (4.19)$$
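Because Eq. (4.18) decomposes over the columns of $X$ and $A$, each word's loss can be computed and optimized independently, which is what enables the parallel optimization; Eq. (4.19) additionally restricts $D$ and $a_i$ to be non-negative. A toy sketch of the per-word loss (random matrices, illustrative hyperparameter values):

```python
import numpy as np

rng = np.random.default_rng(9)
L, K = 5, 8
lam, tau = 0.1, 0.01              # lambda and tau from Eqs. (4.18)-(4.19), toy values

D = rng.normal(size=(L, K))
x_i = rng.normal(size=L)          # the i-th word's input vector (column of X)
a_i = rng.normal(size=K)          # the i-th word's sparse code (column of A)

def per_word_loss(x_i, a_i, D, lam, tau):
    """One summand of Eq. (4.18); the tau*||D||^2 term is shared across all words."""
    return (np.sum((x_i - D @ a_i) ** 2)
            + lam * np.sum(np.abs(a_i))
            + tau * np.sum(D ** 2))

print(per_word_loss(x_i, a_i, D, lam, tau))

# Eq. (4.19): the non-negative variant simply keeps D and a_i >= 0,
# e.g. by projecting onto the non-negative orthant after each update step.
D_nn, a_nn = np.maximum(D, 0.0), np.maximum(a_i, 0.0)
print(per_word_loss(x_i, a_nn, D_nn, lam, tau))
```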

Apart from the work based on Sparse Coding, Sun et al. (2016) add the $\ell_1$ regularizer to Word2Vec. The challenge in this work lies in the optimization method, since stochastic gradient descent (SGD) cannot directly produce sparse solutions with an $\ell_1$ regularizer in online training. The Regularized Dual Averaging (RDA) method proposed by Lin (2009) keeps track of the online average of the subgradients at each update, and optimizes the $\ell_1$-regularized loss function based on the Continuous Bag-of-Words (CBOW) model (Mikolov et al. 2013) in an online fashion.
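The chapter only names RDA, so the following is a rough, illustrative sketch of the general regularized-dual-averaging idea rather than Sun et al.'s exact training code: the update keeps a running average of subgradients and applies a closed-form soft-threshold, which is what produces exact zeros that plain SGD with $\ell_1$ cannot. The gradient source, step-size rule, and hyperparameters are placeholder assumptions:

```python
import numpy as np

rng = np.random.default_rng(10)
dim = 6
lam, gamma = 0.05, 1.0            # l1 strength and RDA step-size parameter (illustrative)

g_avg = np.zeros(dim)             # running average of subgradients
w = np.zeros(dim)

for t in range(1, 101):
    g = rng.normal(size=dim)      # stand-in for the CBOW gradient at step t
    g_avg = ((t - 1) * g_avg + g) / t

    # Closed-form RDA update with l1: soft-threshold the averaged gradient.
    shrink = np.maximum(np.abs(g_avg) - lam, 0.0)
    w = -(np.sqrt(t) / gamma) * np.sign(g_avg) * shrink

print(w)                          # entries with |averaged gradient| <= lam are exactly zero
```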

4.2.4 Evaluations of Word Embedding

Measurements for word embedding usually depend on the speciﬁc applications.

Some representative examples include perplexity, analogy precision and sentiment

classiﬁcation precision.

Perplexity is an evaluation method that originates from information theory. It measures how well a distribution or a model can predict a sample. Bengio et al. (2003) use the perplexity of the corpus as the target: the lower the perplexity value, the higher the quality of the word embedding, since the information it conveys is more specific. Word or phrase analogy analysis is another important assessment method. The work in Mikolov et al. (2013) uses the precision of semantic and syntactic prediction as the standard measurement to evaluate the quality of word embeddings; it is applied at the phrase and word level independently. All the assessment methods mentioned above rest on the linguistic phenomenon that the distance between two words in the vector space reflects the correlation between them.
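Analogy precision follows the "a is to b as c is to ?" pattern: the offset vector b − a + c is compared against all embeddings by cosine similarity, and a question counts as correct when the nearest word (excluding the three query words) matches the expected answer. A minimal sketch with a hypothetical embedding dictionary; real evaluations use trained embeddings and benchmark question sets such as the one released with Mikolov et al. (2013):

```python
import numpy as np

def analogy_precision(emb, questions):
    """emb: dict word -> vector; questions: list of (a, b, c, expected) tuples."""
    words = list(emb)
    M = np.stack([emb[w] for w in words])
    M = M / np.linalg.norm(M, axis=1, keepdims=True)   # unit vectors for cosine similarity

    correct = 0
    for a, b, c, expected in questions:
        query = emb[b] - emb[a] + emb[c]               # e.g. Madrid - Spain + France
        query = query / np.linalg.norm(query)
        scores = M @ query
        for w in (a, b, c):                            # exclude the question words themselves
            scores[words.index(w)] = -np.inf
        if words[int(np.argmax(scores))] == expected:
            correct += 1
    return correct / len(questions)

# Toy usage with random vectors, so the score is not meaningful here.
rng = np.random.default_rng(11)
emb = {w: rng.normal(size=8) for w in ["Madrid", "Spain", "France", "Paris", "Rome"]}
print(analogy_precision(emb, [("Spain", "Madrid", "France", "Paris")]))
```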

Besides measurements of linguistic phenomena, a lot of work evaluates word embeddings through the task to which they are applied. For example, if the word embedding is used in sentiment classification, then the precision of the classification can be used for evaluation. If it is applied in machine translation, then the precision of the translation is what matters. This rule also holds for other tasks like POS tagging, named entity recognition, textual entailment, and so on.

There is no single best measurement method for all scenarios. The most useful approach is to combine the evaluation with the target of the task at hand to achieve a more reasonable result.

4.3 Word Embedding Applications

There are many applications for word embeddings, especially in NLP tasks where the word embedding is fed in directly as the input data or as the features of the text. In this section, we will discuss how word embeddings can be applied in scenarios such as semantic analysis, syntax analysis, idiomaticity analysis, Part-Of-Speech (POS) tagging, sentiment analysis, named entity recognition, textual entailment, as well as machine translation.

The goal of syntax analysis is to extract the syntactic structure of sentences. Recent works (Socher et al. 2011; Huang et al. 2012; Collobert and Weston 2008; Mikolov et al. 2013) have taken syntax information into consideration to obtain word embeddings, and the result in Andreas and Dan (2014) shows that word embeddings can entail syntactic information directly. A fundamental problem in syntax analysis is part-of-speech tagging, which labels words with different categories (e.g., plural, noun, adverb). It requires knowledge of definitions and contextual information. Part-of-speech tagging is a word-level NLP task. Collobert and Weston (2008) and Collobert et al. (2011) apply word embeddings to POS tagging and achieve state-of-the-art results. In some scenarios, in order to figure out the syntax of a text, we first need to recognize the named entities in it. Named entity recognition aims to find the names of persons, organizations, time expressions, monetary values, locations and so on. The works in Collobert and Weston (2008), Collobert et al. (2011), Zou et al. (2013), Luo et al. (2014), and Pennington et al. (2014) show the expressiveness of word embeddings in these applications.

As a subsequent task, semantic analysis (Goddard 2011) relates the syntactic structures of words to their language-independent meanings.1 In other words, it reveals words that are correlated with each other. For example, the vector ("Madrid" − "Spain" + "France") is close to ("Paris"). Previous approaches (Scott et al. 1999; Yih et al. 2012) mainly use the statistical information of the text. The works in Mikolov et al. (2013), Socher et al. (2011), Huang et al. (2012), and Collobert and Weston (2008) apply the analogy precision to measure the quality of word embeddings (see Sect. 4.2.4). It is shown that word embeddings can manifest the semantic information of the words in a text. The semantics of each word can help us understand the combinatory meaning of several words. The phenomenon of multi-word expression idiomaticity is common in natural language, and it is difficult to be

1https://en.wikipedia.org/wiki/Semantic_analysis_(linguistics).