Fraud Analytics Using
Descriptive, Predictive,
and Social Network
Techniques

Wiley & SAS Business
Series
The Wiley & SAS Business Series presents books that help senior-level
managers with their critical management decisions.
Titles in the Wiley & SAS Business Series include:
Analytics in a Big Data World: The Essential Guide to Data Science and Its
Applications by Bart Baesens
Bank Fraud: Using Technology to Combat Losses by Revathi Subramanian
Big Data Analytics: Turning Big Data into Big Money by Frank Ohlhorst
Big Data, Big Innovation: Enabling Competitive Differentiation through
Business Analytics by Evan Stubbs
Business Analytics for Customer Intelligence by Gert Laursen
Business Intelligence Applied: Implementing an Effective Information and
Communications Technology Infrastructure by Michael Gendron
Business Intelligence and the Cloud: Strategic Implementation Guide by
Michael S. Gendron
Business Transformation: A Roadmap for Maximizing Organizational
Insights by Aiman Zeid
Connecting Organizational Silos: Taking Knowledge Flow Management to
the Next Level with Social Media by Frank Leistner
Data-Driven Healthcare: How Analytics and BI Are Transforming the
Industry by Laura Madsen
Delivering Business Analytics: Practical Guidelines for Best Practice by
Evan Stubbs
Demand-Driven Forecasting: A Structured Approach to Forecasting,
second edition by Charles Chase
Demand-Driven Inventory Optimization and Replenishment: Creating a
More Efficient Supply Chain by Robert A. Davis

Developing Human Capital: Using Analytics to Plan and Optimize Your
Learning and Development Investments by Gene Pease, Barbara Beresford, and Lew Walker
The Executive’s Guide to Enterprise Social Media Strategy: How Social Networks Are Radically Transforming Your Business by David Thomas and
Mike Barlow
Economic and Business Forecasting: Analyzing and Interpreting Econometric Results by John Silvia, Azhar Iqbal, Kaylyn Swankoski, Sarah
Watt, and Sam Bullard
Financial Institution Advantage and The Optimization of Information
Processing by Sean C. Keenan
Foreign Currency Financial Reporting from Euros to Yen to Yuan: A Guide
to Fundamental Concepts and Practical Applications by Robert Rowan
Harness Oil and Gas Big Data with Analytics: Optimize Exploration and
Production with Data Driven Models by Keith Holdaway
Health Analytics: Gaining the Insights to Transform Health Care by Jason
Burke
Heuristics in Analytics: A Practical Perspective of What Influences Our Analytical World by Carlos Andre Reis Pinheiro and Fiona McNeill
Human Capital Analytics: How to Harness the Potential of Your Organization’s Greatest Asset by Gene Pease, Boyce Byerly, and Jac Fitz-enz
Implement, Improve and Expand Your Statewide Longitudinal Data System: Creating a Culture of Data in Education by Jamie McQuiggan and
Armistead Sapp
Killer Analytics: Top 20 Metrics Missing from Your Balance Sheet by Mark
Brown
Predictive Analytics for Human Resources by Jac Fitz-enz and John
Mattox II
Predictive Business Analytics: Forward-Looking Capabilities to Improve
Business Performance by Lawrence Maisel and Gary Cokins
Retail Analytics: The Secret Weapon by Emmett Cox
Social Network Analysis in Telecommunications by Carlos Andre Reis
Pinheiro

Statistical Thinking: Improving Business Performance, second edition by
Roger W. Hoerl and Ronald D. Snee
Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data
Streams with Advanced Analytics by Bill Franks
Too Big to Ignore: The Business Case for Big Data by Phil Simon
The Value of Business Analytics: Identifying the Path to Profitability by
Evan Stubbs
The Visual Organization: Data Visualization, Big Data, and the Quest for
Better Decisions by Phil Simon
Understanding the Predictive Analytics Lifecycle by Al Cordoba
Unleashing Your Inner Leader: An Executive Coach Tells All by Vickie
Bevenour
Using Big Data Analytics: Turning Big Data into Big Money by Jared
Dean
Win with Advanced Business Analytics: Creating Business Value from Your
Data by Jean Paul Isson and Jesse Harriott
For more information on these and other titles in the series, please
visit www.wiley.com.

Fraud Analytics
Using Descriptive,
Predictive, and
Social Network
Techniques
A Guide to Data Science
for Fraud Detection

Bart Baesens
Véronique Van Vlasselaer
Wouter Verbeke

Copyright © 2015 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, electronic, mechanical, photocopying,
recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the
1976 United States Copyright Act, without either the prior written permission of the
Publisher, or authorization through payment of the appropriate per-copy fee to the
Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978)
750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the
Publisher for permission should be addressed to the Permissions Department, John
Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201)
748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used
their best efforts in preparing this book, they make no representations or warranties
with respect to the accuracy or completeness of the contents of this book and
specifically disclaim any implied warranties of merchantability or fitness for a particular
purpose. No warranty may be created or extended by sales representatives or written
sales materials. The advice and strategies contained herein may not be suitable for your
situation. You should consult with a professional where appropriate. Neither the
publisher nor author shall be liable for any loss of profit or any other commercial
damages, including but not limited to special, incidental, consequential, or other
damages.
For general information on our other products and services or for technical support,
please contact our Customer Care Department within the United States at (800)
762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley publishes in a variety of print and electronic formats and by print-on-demand.
Some material included with standard print versions of this book may not be included
in e-books or in print-on-demand. If this book refers to media such as a CD or DVD
that is not included in the version you purchased, you may download this material at
http://booksupport.wiley.com. For more information about Wiley products, visit
www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Baesens, Bart.
Fraud analytics using descriptive, predictive, and social network techniques : a guide
to data science for fraud detection / Bart Baesens, Veronique Van Vlasselaer, Wouter
Verbeke.
pages cm. — (Wiley & SAS business series)
Includes bibliographical references and index.
ISBN 978-1-119-13312-4 (cloth) — ISBN 978-1-119-14682-7 (epdf) —
ISBN 978-1-119-14683-4 (epub)
1. Fraud—Statistical methods. 2. Fraud—Prevention. 3. Commercial
crimes—Prevention. I. Title.
HV6691.B34 2015
364.16′3015195—dc23
2015017861
Cover Design: Wiley
Cover Image: ©iStock.com/aleksandarvelasevic
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

To my wonderful wife, Katrien, and kids, Ann-Sophie, Victor,
and Hannelore.
To my parents and parents-in-law.

To my husband and soul mate, Niels, for his never-ending
support.
To my parents, parents-in-law, and siblings-in-law.

To Luit and Titus.

Contents
List of Figures xv
Foreword xxiii
Preface xxv
Acknowledgments xxix
Chapter 1 Fraud: Detection, Prevention, and Analytics! 1
Introduction 2
Fraud! 2
Fraud Detection and Prevention 10
Big Data for Fraud Detection 15
Data-Driven Fraud Detection 17
Fraud-Detection Techniques 19
Fraud Cycle 22
The Fraud Analytics Process Model 26
Fraud Data Scientists 30
A Fraud Data Scientist Should Have Solid Quantitative
Skills 30
A Fraud Data Scientist Should Be a Good Programmer 31
A Fraud Data Scientist Should Excel in
Communication and Visualization Skills 31
A Fraud Data Scientist Should Have a Solid Business
Understanding 32
A Fraud Data Scientist Should Be Creative 32
A Scientific Perspective on Fraud 33
References 35
Chapter 2 Data Collection, Sampling, and Preprocessing 37
Introduction 38
Types of Data Sources 38
Merging Data Sources 43
Sampling 45
Types of Data Elements 46

Visual Data Exploration and Exploratory Statistical
Analysis 47
Benford’s Law 48
Descriptive Statistics 51
Missing Values 52
Outlier Detection and Treatment 53
Red Flags 57
Standardizing Data 59
Categorization 60
Weights of Evidence Coding 63
Variable Selection 65
Principal Components Analysis 68
RIDITs 72
PRIDIT Analysis 73
Segmentation 74
References 75
Chapter 3 Descriptive Analytics for Fraud Detection 77
Introduction 78
Graphical Outlier Detection Procedures 79
Statistical Outlier Detection Procedures 83
Break-Point Analysis 84
Peer-Group Analysis 85
Association Rule Analysis 87
Clustering 89
Introduction 89
Distance Metrics 90
Hierarchical Clustering 94
Example of Hierarchical Clustering Procedures 97
k-Means Clustering 104
Self-Organizing Maps 109
Clustering with Constraints 111
Evaluating and Interpreting Clustering Solutions 114
One-Class SVMs 117
References 118
Chapter 4 Predictive Analytics for Fraud Detection 121
Introduction 122
Target Definition 123
Linear Regression 125
Logistic Regression 127
Basic Concepts 127
Logistic Regression Properties 129
Building a Logistic Regression Scorecard 131

Variable Selection for Linear and Logistic Regression 133
Decision Trees 136
Basic Concepts 136
Splitting Decision 137
Stopping Decision 140
Decision Tree Properties 141
Regression Trees 142
Using Decision Trees in Fraud Analytics 143
Neural Networks 144
Basic Concepts 144
Weight Learning 147
Opening the Neural Network Black Box 150
Support Vector Machines 155
Linear Programming 155
The Linear Separable Case 156
The Linear Nonseparable Case 159
The Nonlinear SVM Classifier 160
SVMs for Regression 161
Opening the SVM Black Box 163
Ensemble Methods 164
Bagging 164
Boosting 165
Random Forests 166
Evaluating Ensemble Methods 167
Multiclass Classification Techniques 168
Multiclass Logistic Regression 168
Multiclass Decision Trees 170
Multiclass Neural Networks 170
Multiclass Support Vector Machines 171
Evaluating Predictive Models 172
Splitting Up the Data Set 172
Performance Measures for Classification Models 176
Performance Measures for Regression Models 185
Other Performance Measures for Predictive Analytical
Models 188
Developing Predictive Models for Skewed Data Sets 189
Varying the Sample Window 190
Undersampling and Oversampling 190
Synthetic Minority Oversampling Technique (SMOTE) 192
Likelihood Approach 194
Adjusting Posterior Probabilities 197
Cost-sensitive Learning 198
Fraud Performance Benchmarks 200
References 201

Chapter 5 Social Network Analysis for Fraud Detection 207
Networks: Form, Components, Characteristics, and Their
Applications 209
Social Networks 211
Network Components 214
Network Representation 219
Is Fraud a Social Phenomenon? An Introduction to
Homophily 222
Impact of the Neighborhood: Metrics 227
Neighborhood Metrics 228
Centrality Metrics 238
Collective Inference Algorithms 246
Featurization: Summary Overview 254
Community Mining: Finding Groups of Fraudsters 254
Extending the Graph: Toward a Bipartite Representation 266
Multipartite Graphs 269
Case Study: Gotcha! 270
References 277
Chapter 6 Fraud Analytics: Post-Processing 279
Introduction 280
The Analytical Fraud Model Life Cycle 280
Model Representation 281
Traffic Light Indicator Approach 282
Decision Tables 283
Selecting the Sample to Investigate 286
Fraud Alert and Case Management 290
Visual Analytics 296
Backtesting Analytical Fraud Models 302
Introduction 302
Backtesting Data Stability 302
Backtesting Model Stability 305
Backtesting Model Calibration 308
Model Design and Documentation 311
References 312
Chapter 7 Fraud Analytics: A Broader Perspective 313
Introduction 314
Data Quality 314
Data-Quality Issues 314
Data-Quality Programs and Management 315
Privacy 317
The RACI Matrix 318
Accessing Internal Data 319

Label-Based Access Control (LBAC) 324
Accessing External Data 325
Capital Calculation for Fraud Loss 326
Expected and Unexpected Losses 327
Aggregate Loss Distribution 329
Capital Calculation for Fraud Loss Using Monte Carlo
Simulation 331
An Economic Perspective on Fraud Analytics 334
Total Cost of Ownership 334
Return on Investment 335
In Versus Outsourcing 337
Modeling Extensions 338
Forecasting 338
Text Analytics 340
The Internet of Things 342
Corporate Fraud Governance 344
References 346
About the Authors 347
Index 349

List of Figures
Figure 1.1 Fraud Triangle 7
Figure 1.2 Fire Incident Claim-Handling Process 13
Figure 1.3 The Fraud Cycle 23
Figure 1.4 Outlier Detection at the Data Item Level 25
Figure 1.5 Outlier Detection at the Data Set Level 25
Figure 1.6 The Fraud Analytics Process Model 26
Figure 1.7 Profile of a Fraud Data Scientist 33
Figure 1.8 Screenshot of Web of Science Statistics for Scientific Publications on Fraud between 1996 and 2014 34
Figure 2.1 Aggregating Normalized Data Tables into a Non-Normalized Data Table 44
Figure 2.2 Pie Charts for Exploratory Data Analysis 49
Figure 2.3 Benford’s Law Describing the Frequency Distribution of the First Digit 50
Figure 2.4 Multivariate Outliers 54
Figure 2.5 Histogram for Outlier Detection 54
Figure 2.6 Box Plots for Outlier Detection 55
Figure 2.7 Using the z-Scores for Truncation 57
Figure 2.8 Default Risk Versus Age 60
Figure 2.9 Illustration of Principal Component Analysis in a Two-Dimensional Data Set 68
Figure 3.1 3D Scatter Plot for Detecting Outliers 80
Figure 3.2 OLAP Cube for Fraud Detection 80
Figure 3.3 Example Pivot Table for Credit Card Fraud Detection 82
Figure 3.4 Break-Point Analysis 84
Figure 3.5 Peer-Group Analysis 86
Figure 3.6 Cluster Analysis for Fraud Detection 91
Figure 3.7 Hierarchical Versus Nonhierarchical Clustering Techniques 91
Figure 3.8 Euclidean Versus Manhattan Distance 92
Figure 3.9 Divisive Versus Agglomerative Hierarchical Clustering 94
Figure 3.10 Calculating Distances between Clusters 95
Figure 3.11 Example for Clustering Birds. The Numbers Indicate the Clustering Steps 96
Figure 3.12 Dendrogram for Birds Example. The Thick Black Line Indicates the Optimal Clustering 96
Figure 3.13 Scree Plot for Clustering 97
Figure 3.14 Scatter Plot of Hierarchical Clustering Data 98
Figure 3.15 Output of Hierarchical Clustering Procedures 98
Figure 3.16 k-Means Clustering: Start from Original Data 105
Figure 3.17 k-Means Clustering Iteration 1: Randomly Select Initial Cluster Centroids 105
Figure 3.18 k-Means Clustering Iteration 1: Assign Remaining Observations 106
Figure 3.19 k-Means Iteration Step 2: Recalculate Cluster Centroids 107
Figure 3.20 k-Means Clustering Iteration 2: Reassign Observations 107
Figure 3.21 k-Means Clustering Iteration 3: Recalculate Cluster Centroids 108
Figure 3.22 k-Means Clustering Iteration 3: Reassign Observations 108
Figure 3.23 Rectangular Versus Hexagonal SOM Grid 109
Figure 3.24 Clustering Countries Using SOMs 111
Figure 3.25 Component Plane for Literacy 112
Figure 3.26 Component Plane for Political Rights 113
Figure 3.27 Must-Link and Cannot-Link Constraints in Semi-Supervised Clustering 113
Figure 3.28 δ-Constraints in Semi-Supervised Clustering 114
Figure 3.29 ε-Constraints in Semi-Supervised Clustering 114
Figure 3.30 Cluster Profiling Using Histograms 115
Figure 3.31 Using Decision Trees for Clustering Interpretation 116
Figure 3.32 One-Class Support Vector Machines 117
Figure 4.1 A Spider Construction in Tax Evasion Fraud 124
Figure 4.2 Regular Versus Fraudulent Bankruptcy 124
Figure 4.3 OLS Regression 126
Figure 4.4 Bounding Function for Logistic Regression 128
Figure 4.5 Linear Decision Boundary of Logistic Regression 130
Figure 4.6 Other Transformations 131
Figure 4.7 Fraud Detection Scorecard 133
Figure 4.8 Calculating the p-Value with a Student’s t-Distribution 135
Figure 4.9 Variable Subsets for Four Variables V1, V2, V3, and V4 135
Figure 4.10 Example Decision Tree 137
Figure 4.11 Example Data Sets for Calculating Impurity 138
Figure 4.12 Entropy Versus Gini 139
Figure 4.13 Calculating the Entropy for Age Split 139
Figure 4.14 Using a Validation Set to Stop Growing a Decision Tree 140
Figure 4.15 Decision Boundary of a Decision Tree 142
Figure 4.16 Example Regression Tree for Predicting the Fraud Percentage 142
Figure 4.17 Neural Network Representation of Logistic Regression 145
Figure 4.18 A Multilayer Perceptron (MLP) Neural Network 145
Figure 4.19 Local Versus Global Minima 148
Figure 4.20 Using a Validation Set for Stopping Neural Network Training 149
Figure 4.21 Example Hinton Diagram 151
Figure 4.22 Backward Variable Selection 152
Figure 4.23 Decompositional Approach for Neural Network Rule Extraction 153
Figure 4.24 Pedagogical Approach for Rule Extraction 154
Figure 4.25 Two-Stage Models 155
Figure 4.26 Multiple Separating Hyperplanes 157
Figure 4.27 SVM Classifier for the Perfectly Linearly Separable Case 157
Figure 4.28 SVM Classifier in Case of Overlapping Distributions 159
Figure 4.29 The Feature Space Mapping 160
Figure 4.30 SVMs for Regression 162
Figure 4.31 Representing an SVM Classifier as a Neural Network 163
Figure 4.32 One-Versus-One Coding for Multiclass Problems 171
Figure 4.33 One-Versus-All Coding for Multiclass Problems 172
Figure 4.34 Training Versus Test Sample Set Up for Performance Estimation 173
Figure 4.35 Cross-Validation for Performance Measurement 174
Figure 4.36 Bootstrapping 175
Figure 4.37 Calculating Predictions Using a Cut-Off 176
Figure 4.38 The Receiver Operating Characteristic Curve 178
Figure 4.39 Lift Curve 179
Figure 4.40 Cumulative Accuracy Profile 180
Figure 4.41 Calculating the Accuracy Ratio 181
Figure 4.42 The Kolmogorov-Smirnov Statistic 181
Figure 4.43 A Cumulative Notch Difference Graph 184
Figure 4.44 Scatter Plot: Predicted Fraud Versus Actual Fraud 185
Figure 4.45 CAP Curve for Continuous Targets 187
Figure 4.46 Regression Error Characteristic (REC) Curve 188
Figure 4.47 Varying the Time Window to Deal with Skewed Data Sets 190
Figure 4.48 Oversampling the Fraudsters 191
Figure 4.49 Undersampling the Nonfraudsters 191
Figure 4.50 Synthetic Minority Oversampling Technique (SMOTE) 193
Figure 5.1a Königsberg Bridges 210
Figure 5.1b Schematic Representation of the Königsberg Bridges 211
Figure 5.2 Identity Theft. The Frequent Contact List of a Person is Suddenly Extended with Other Contacts (Light Gray Nodes). This Might Indicate that a Fraudster (Dark Gray Node) Took Over that Customer’s Account and “shares” his/her Contacts 213
Figure 5.3 Network Representation 214
Figure 5.4 Example of a (Un)Directed Graph 215
Figure 5.5 Follower–Followee Relationships in a Twitter Network 215
Figure 5.6 Edge Representation 216
Figure 5.7 Example of a Fraudulent Network 218
Figure 5.8 An Egonet. The Ego is Surrounded by Six Alters, of Whom Two are Legitimate (White Nodes) and Four are Fraudulent (Gray Nodes) 218
Figure 5.9 Toy Example of Credit Card Fraud 220
Figure 5.10 Mathematical Representation of (a) a Sample Network; (b) the Adjacency or Connectivity Matrix; (c) the Weight Matrix; (d) the Adjacency List; and (e) the Weight List 221
Figure 5.11 A Real-Life Example of a Homophilic Network 224
Figure 5.12 A Homophilic Network 225
Figure 5.13 Sample Network 229
Figure 5.14a Degree Distribution 230
Figure 5.14b Illustration of the Degree Distribution for a Real-Life Network of Social Security Fraud. The Degree Distribution Follows a Power Law (log-log axes) 230
Figure 5.15 A 4-Regular Graph 231
Figure 5.16 Example Social Network for a Relational Neighbor Classifier 233
Figure 5.17 Example Social Network for a Probabilistic Relational Neighbor Classifier 235
Figure 5.18 Example of Social Network Features for a Relational Logistic Regression Classifier 236
Figure 5.19 Example of Featurization with Features Describing Intrinsic Behavior and Behavior of the Neighborhood 237
Figure 5.20 Illustration of Dijkstra’s Algorithm 241
Figure 5.21 Illustration of the Number of Connecting Paths Between Two Nodes 242
Figure 5.22 Illustration of Betweenness Between Communities of Nodes 245
Figure 5.23 PageRank Algorithm 247
Figure 5.24 Illustration of Iterative Process of the PageRank Algorithm 249
Figure 5.25 Sample Network 254
Figure 5.26 Community Detection for Credit Card Fraud 259
Figure 5.27 Iterative Bisection 261
Figure 5.28 Dendrogram of the Clustering of Figure 5.27 by the Girvan-Newman Algorithm. The Modularity Q is Maximized When Splitting the Network into Two Communities ABC – DEFG 262
Figure 5.29 Complete (a) and Partial (b) Communities 264
Figure 5.30 Overlapping Communities 265
Figure 5.31 Unipartite Graph 266
Figure 5.32 Bipartite Graph 267
Figure 5.33 Connectivity Matrix of a Bipartite Graph 268
Figure 5.34 A Multipartite Graph 269
Figure 5.35 Sample Network of Gotcha! 270
Figure 5.36 Exposure Score of the Resources Derived by a Propagation Algorithm. The Results are Based on a Real-Life Data Set in Social Security Fraud 273
Figure 5.37 Egonet in Social Security Fraud. A Company Is Associated with Its Resources 274
Figure 5.38 ROC Curve of the Gotcha! Model, which Combines both Intrinsic and Relational Features 275
Figure 6.1 The Analytical Model Life Cycle 280
Figure 6.2 Traffic Light Indicator Approach 282
Figure 6.3 SAS Social Network Analysis Dashboard 293
Figure 6.4 SAS Social Network Analysis Claim Detail Investigation 294
Figure 6.5 SAS Social Network Analysis Link Detection 295
Figure 6.6 Distribution of Claim Amounts and Average Claim Value 297
Figure 6.7 Geographical Distribution of Claims 298
Figure 6.8 Zooming into the Geographical Distribution of Claims 299
Figure 6.9 Measuring the Efficiency of the Fraud-Detection Process 300
Figure 6.10 Evaluating the Efficiency of Fraud Investigators 301
Figure 7.1 RACI Matrix 318
Figure 7.2 Anonymizing a Database 321
Figure 7.3 Different SQL Views Defined for a Database 323
Figure 7.4 Aggregate Loss Distribution with Indication of Expected Loss, Value at Risk (VaR) at 99.9 Percent Confidence Level and Unexpected Loss 331
Figure 7.5 Snapshot of a Credit Card Fraud Time Series Data Set and Associated Histogram of the Fraud Amounts 332
Figure 7.6 Aggregate Loss Distribution Resulting from a Monte Carlo Simulation with Poisson Distributed Monthly Fraud Frequency and Associated Pareto Distributed Fraud Loss 334

Foreword
Fraud will always be with us. It is linked both to organized crime and to
terrorism, and it inflicts substantial economic damage. The perpetrators
of fraud play a dynamic cat and mouse game with those trying to stop
them. Preventing a particular kind of fraud does not mean the fraudsters give up, but merely that they change their tactics: they are constantly on the lookout for new avenues for fraud, for new weaknesses
in the system. And given that our social and financial systems are
forever developing, there are always new opportunities to be exploited.
This book is a clear and comprehensive outline of the current state-of-the-art in fraud-detection and prevention methodology. It describes
the data necessary to detect fraud, and then takes the reader from
the basics of fraud-detection data analytics, through advanced pattern
recognition methodology, to cutting-edge social network analysis and
fraud ring detection.
If we cannot stop fraud altogether, an awareness of the contents of
this book will at least enable readers to reduce the extent of fraud, and
make it harder for criminals to take advantage of the honest. The readers’ organizations, be they public or private, will be better protected if
they implement the strategies described in this book. In short, this book
is a valuable contribution to the well-being of society and of the people
within it.
Professor David J. Hand
Imperial College, London

Preface
It is estimated that a typical organization loses about 5 percent of its
revenues due to fraud each year. In this book, we will discuss how
state-of-the-art descriptive, predictive and social network analytics can
be used to fight fraud by learning fraud patterns from historical data.
The focus of this book is not on the mathematics or theory, but
on the practical applications. Formulas and equations will only be
included when absolutely needed from a practitioner’s perspective.
It is also not our aim to provide exhaustive coverage of all analytical
techniques previously developed but, rather, give coverage of the ones
that really provide added value in a practical fraud detection setting.
Being targeted at the business professional in the first place, the
book is written in a condensed, focused way. Prerequisite knowledge
consists of some basic exposure to descriptive statistics (e.g., mean,
standard deviation, correlation, confidence intervals, hypothesis
testing), data handling (using for example, Microsoft Excel, SQL,
etc.), and data visualization (e.g., bar plots, pie charts, histograms,
scatter plots, etc.). Throughout the discussion, many examples of
real-life fraud applications will be included in, for example, insurance
fraud, tax evasion fraud, and credit card fraud. The authors will also
integrate both their research and consulting experience throughout
the various chapters. The book is aimed at (senior) data analysts,
(aspiring) data scientists, consultants, analytics practitioners, and
researchers (e.g., PhD candidates) starting to explore the field.
Chapter 1 sets the stage on fraud detection, prevention, and analytics. It starts by defining fraud and then zooms into fraud detection and
prevention. The impact of big data for fraud detection and the fraud
analytics process model are reviewed next. The chapter concludes by
summarizing the key skills of a fraud data scientist.
Chapter 2 provides extensive discussion on the basic ingredient
of any fraud analytical model: data! It introduces various types of
data sources and discusses how to merge and sample them. The next
sections discuss the different types of data elements, visual exploration,
Benford’s law, and descriptive statistics. These are all essential tools
to start understanding the characteristics and limitations of the data
available. Data preprocessing activities are also extensively covered:
handling missing values, detecting and treating outliers, defining red
flags, standardizing data, categorizing variables, weights of evidence
coding, and variable selection. Principal component analysis is outlined as a technique to reduce the dimensionality of the input data.
This is then further illustrated with RIDIT and PRIDIT analysis. The
chapter ends by reviewing segmentation and the risks thereof.
Chapter 3 continues by exploring the use of descriptive analytics
for fraud detection. The idea here is to look for unusual patterns or
outliers in a fraud data set. Both graphical and statistical outlier detection procedures are reviewed first. This is followed by an overview of
break-point analysis, peer group analysis, association rules, clustering,
and one-class SVMs.
Chapter 4 zooms into predictive analytics for fraud detection. We
start from a labeled data set of transactions whereby each transaction
has a target of interest that can either be binary (e.g., fraudulent or
not) or continuous (e.g., amount of fraud). We then discuss various
analytical techniques to build predictive models: linear regression,
logistic regression, decision trees, neural networks, support vector
machines, ensemble methods, and multiclass classification techniques.
A next section reviews how to measure the performance of a predictive analytical model by first deciding on the data set split-up and
then on the performance metric. The class imbalance problem is
also extensively elaborated. The chapter concludes by giving some
performance benchmarks.
Chapter 5 introduces the reader to social network analysis and
its use for fraud detection. Stating that the propensity to fraud is
often influenced by the social neighborhood, we describe the main
components of a network and illustrate how transactional data sources
can be transformed into networks. In the next section, we elaborate
on featurization, the process of extracting a set of meaningful
features from the network. We distinguish between three main types
of features: neighborhood metrics, centrality metrics, and collective
inference algorithms. We then zoom into community mining, where
we aim at finding groups of fraudsters closely connected in the
network. By introducing multipartite graphs, we address the fact that
fraud often depends on a multitude of different factors and that the
inclusion of all these factors in a network representation contributes to
a better understanding and analysis of the detection problem at hand.
The chapter is concluded with a real-life example of social security
fraud.
Chapter 6 deals with the postprocessing of fraud analytical models.
It starts by giving an overview of the analytical fraud model lifecycle. It
then discusses the traffic light indicator approach and decision tables as
two popular model representations. This is followed by a set of guidelines to appropriately select the fraud sample to investigate. Fraud alert
and case management are covered next. We also illustrate how visual
analytics can contribute to the postprocessing activities. We describe
how to backtest analytical fraud models by considering data stability,
model stability, and model calibration. The chapter concludes by giving
some guidelines about model design and documentation.
Chapter 7 provides a broader perspective on fraud analytics. We
provide some guidelines for setting up and managing data quality
programs. We zoom into privacy and discuss various ways to ensure
appropriate access to both internal and external data. We discuss how
analytical fraud estimates can be used to calculate both expected and
unexpected losses, which can then help to determine provisioning
and capital buffers. A discussion of total cost of ownership and return
on investment provides an economic perspective on fraud analytics.
This is followed by a discussion of in- versus outsourcing of analytical
model development. We briefly zoom into some interesting modeling
extensions, such as forecasting and text analytics. The potential and
danger of the Internet of Things for fraud analytics is also covered.
The chapter concludes by giving some recommendations for corporate
fraud governance.

Acknowledgments
It is a great pleasure to acknowledge the contributions and assistance of
various colleagues, friends, and fellow analytics lovers to the writing of
this book. This book is the result of many years of research and teaching
in analytics, risk management, and fraud. We first would like to thank
our publisher, John Wiley & Sons, for accepting our book proposal less
than one year ago.
We are grateful to the active and lively analytics and fraud detection community for providing various user fora, blogs, online lectures,
and tutorials, which proved very helpful.
We would also like to acknowledge the direct and indirect contributions of the many colleagues, fellow professors, students, researchers,
and friends with whom we collaborated during the past years.
Last but not least, we are grateful to our partners, parents, and
families for their love, support, and encouragement.
We have tried to make this book as complete, accurate, and
enjoyable as possible. Of course, what really matters is what you, the
reader, think of it. Please let us know your views by getting in touch.
The authors welcome all feedback and comments—so do not hesitate
to let us know your thoughts!
Bart Baesens
Véronique Van Vlasselaer
Wouter Verbeke
August 2015

Fraud Analytics Using
Descriptive, Predictive,
and Social Network
Techniques

C H A P T E R 1

Fraud: Detection, Prevention, and Analytics!

INTRODUCTION
In this first chapter, we set the scene for what’s ahead by introducing
fraud analytics using descriptive, predictive, and social network techniques. We start off by defining and characterizing fraud and discussing
different types of fraud. Next, fraud detection and prevention is discussed as a means to address and limit the amount and overall impact of
fraud. Big data and analytics provide powerful tools that may improve
an organization’s fraud detection system. We discuss in detail how and
why these tools complement traditional expert-based fraud-detection
approaches. Subsequently, the fraud analytics process model is introduced, providing a high-level overview of the steps that are followed
in developing and implementing a data-driven fraud-detection system. The chapter concludes by discussing the characteristics and skills
of a good fraud data scientist, followed by a scientific perspective on
the topic.

FRAUD!
Since a thorough discussion or investigation requires clear and precise
definitions of the subject of interest, this first section starts by defining
fraud and by highlighting a number of essential characteristics. Subsequently, an explanatory conceptual model will be introduced that
provides deeper insight into the underlying drivers of fraudsters, the
individuals committing fraud. Insight into the field of application—or
in other words, expert knowledge—is crucial for analytics to be successfully applied in any setting, and eventually matters as much as
technical skill. Expert knowledge or insight into the problem at hand
helps an analyst in gathering and processing the right information in
the right manner, and to customize data allowing analytical techniques
to perform as well as possible in detecting fraud.
The Oxford Dictionary defines fraud as follows:
Wrongful or criminal deception intended to result in
financial or personal gain.
On the one hand, this definition captures the essence of fraud
and covers the many different forms and types of fraud that will be
discussed in this book. On the other hand, it does not very precisely
describe the nature and characteristics of fraud, and as such, does not
provide much direction for discussing the requirements of a fraud
detection system. A more useful definition will be provided below.
Fraud is definitely not a recent phenomenon unique to modern
society, nor is it even unique to mankind. Animal species also engage in what could be called fraudulent activities, although the behavior displayed by, for instance, chameleons, stick insects, and apes should perhaps be classified as manipulative rather than fraudulent, since wrongful and criminal are human categories or concepts that do not straightforwardly apply to animals. Indeed, whether
activities are wrongful or criminal depends on the applicable rules or
legislation, which defines explicitly and formally these categories that
are required in order to be able to classify behavior as being fraudulent.
A more thorough and detailed characterization of the multifaceted
phenomenon of fraud is provided by Van Vlasselaer et al. (2015):
Fraud is an uncommon, well-considered,
imperceptibly concealed, time-evolving and often
carefully organized crime which appears in many
types of forms.
This definition highlights five characteristics that are associated
with particular challenges related to developing a fraud-detection
system, which is the main topic of this book. The first emphasized
characteristic and associated challenge concerns the fact that fraud
is uncommon. Independent of the exact setting or application, only
a minority of the involved population of cases typically concerns
fraud, of which furthermore only a limited number will be known to
concern fraud. This makes it difficult both to detect fraud, since the
fraudulent cases are covered by the nonfraudulent ones, and to learn
from historical cases to build a powerful fraud-detection system,
since only a few examples are available.
In fact, fraudsters deliberately try to blend in and not to behave differently
from others in order not to get noticed and to remain covered by nonfraudsters. This effectively makes fraud imperceptibly concealed, since
fraudsters do succeed in hiding by carefully considering and planning how
to precisely commit fraud. Their behavior is definitely not impulsive
and unplanned, since if it were, detection would be far easier.
They also adapt and refine their methods, which they need to do
in order to remain undetected. Fraud-detection systems improve and
learn by example. Therefore, the techniques and tricks fraudsters adopt
evolve over time along with, or rather ahead of, fraud-detection mechanisms. This cat-and-mouse play between fraudsters and fraud fighters
may seem to be an endless game, yet there is no alternative solution
so far. By adopting and developing advanced fraud-detection and prevention mechanisms, organizations do manage to reduce losses due to
fraud because fraudsters, like other criminals, tend to look for the easy
way and will look for other, easier opportunities. Therefore, fighting
fraud by building advanced and powerful detection systems is definitely not a pointless effort, but admittedly, it is very likely an effort
without end.
Fraud is often a carefully organized crime as well, meaning that
fraudsters often do not operate independently, have allies, and may
induce copycats. Moreover, several fraud types such as money laundering and carousel fraud involve complex structures that are set up in
order to commit fraud in an organized manner. Fraud is thus not an
isolated event, and in order to detect it the context
(e.g., the social network of fraudsters) should be taken into account.
Research shows that fraudulent companies indeed are more connected
to other fraudulent companies than to nonfraudulent companies, as
shown in a company tax-evasion case study by Van Vlasselaer et al.
(2015). Social network analytics for fraud detection, as discussed
in Chapter 5, appears to be a powerful tool for unmasking fraud by
making clever use of contextual information describing the network or
environment of an entity.
A final element in the description of fraud provided by Van
Vlasselaer et al. indicates the many different types of forms in which
fraud occurs. This refers both to the wide set of techniques and
approaches used by fraudsters and to the many different settings
in which fraud occurs or economic activities that are susceptible to
fraud. Table 1.1 provides a nonexhaustive overview and description of
a number of important fraud types—important being defined in terms of
frequency of occurrence as well as the total monetary value involved.

Table 1.1 Nonexhaustive List of Fraud Categories and Types

Credit card fraud

In credit card fraud there is an unauthorized taking of another’s credit.
Some common credit card fraud subtypes are counterfeiting credit cards
(for the definition of counterfeit, see below), using lost or stolen cards, or
fraudulently acquiring credit through mail (definition adopted from
definitions.uslegal.com). Two subtypes can be identified, as described
by Bolton and Hand (2002): (1) Application fraud, involving individuals
obtaining new credit cards from issuing companies by using false
personal information, and then spending as much as possible in a short
space of time; (2) Behavioral fraud, where details of legitimate cards are
obtained fraudulently and sales are made on a “Cardholder Not Present”
basis. This does not necessarily require stealing the physical card, only
stealing the card credentials. Behavioral fraud concerns most of the credit
card fraud. Debit card fraud also occurs, although less frequently. Credit
card fraud is a form of identity theft, as will be defined below.

Insurance fraud

A broad category spanning fraud related to any type of insurance, from the
side of either the buyer or the seller of an insurance contract. Insurance fraud
from the issuer (seller) includes selling policies from nonexistent
companies, failing to submit premiums and churning policies to create
more commissions. Buyer fraud includes exaggerated claims (property
insurance: obtaining payment that is worth more than the value of the
property destroyed), falsified medical history (healthcare insurance: fake
injuries), postdated policies, faked death, kidnapping or murder (life
insurance fraud), and faked damage (automobile insurance: staged
collision) (definition adopted from www.investopedia.com).

Corruption

Corruption is the misuse of entrusted power (by heritage, education,
marriage, election, appointment, or whatever else) for private gain. This
definition is similar to the definition of fraud provided by the Oxford
Dictionary discussed before, in that the objective is personal gain. It is
different in that it focuses on misuse of entrusted power. The definition
covers as such a broad range of different subtypes of corruption, so does
not only cover corruption by a politician or a public servant, but also, for
example, by the CEO or CFO of a company, the notary public, the team
leader at a workplace, the administrator or admissions-officer to a private
school or hospital, the coach of a soccer team, and so on (definition
adopted from www.corruptie.org).

Counterfeit

An imitation intended to be passed off fraudulently or deceptively as
genuine. Counterfeit typically concerns valuable objects, credit cards,
identity cards, popular products, money, etc. (definition adopted from
www.dictionary.com).

Product warranty fraud

A product warranty is a type of guarantee that a manufacturer or similar
party makes regarding the condition of its product, and also refers to the
terms and situations in which repairs or exchanges will be made in the
event that the product does not function as originally described or
intended (definition adopted from www.investopedia.com). When a
product fails to offer the described functionalities or displays deviating
characteristics or behavior that are a consequence of the production
process and not a consequence of misuse by the customer, compensation
or remuneration by the manufacturer or provider can be claimed. When
the conditions of the product have been altered due to the customer’s use
of the product, then the warranty does not apply. Intentionally wrongly
claiming compensation or remuneration based on a product warranty is
called product warranty fraud.

Healthcare fraud

Healthcare fraud involves the filing of dishonest healthcare claims in
order to make profit. Practitioner schemes include: individuals obtaining
subsidized or fully covered prescription pills that are actually unneeded
and then selling them on the black market for a profit; billing by
practitioners for care that they never rendered; filing duplicate claims for
the same service rendered; billing for a noncovered service as a covered
service; modifying medical records, and so on. Members can commit
healthcare fraud by providing false information when applying for
programs or services, forging or selling prescription drugs, loaning or
using another’s insurance card, and so on (definition adopted from
www.law.cornell.edu).

Telecommunications fraud

Telecommunication fraud is the theft of telecommunication services
(telephones, cell phones, computers, etc.) or the use of
telecommunication services to commit other forms of fraud (definition
adopted from itlaw.wikia.com). An important example concerns cloning
fraud (i.e. the cloning of a phone number and the related call credit by a
fraudster), which is an instance of superimposition fraud in which
fraudulent usage is superimposed on (added to) the legitimate usage of
an account (Fawcett and Provost 1997).

Money laundering

The process of taking the proceeds of criminal activity and making them
appear legal. Laundering allows criminals to transform illegally obtained
gain into seemingly legitimate funds. It is a worldwide problem, with an
estimated $300 billion going through the process annually in the United
States (definition adopted from legal-dictionary.thefreedictionary.com).

Click fraud

Click fraud is an illegal practice that occurs when individuals click on a
website’s click-through advertisements (either banner ads or paid text
links) to increase the payable number of clicks to the advertiser. The
illegal clicks could either be performed by having a person manually click
the advertising hyperlinks or by using automated software or online bots
that are programmed to click these banner ads and pay-per-click text ad
links (definition adopted from www.webopedia.com).

Identity theft

The crime of obtaining the personal or financial information of another
person for the purpose of assuming that person’s name or identity in order
to make transactions or purchases. Some identity thieves sift through
trash bins looking for bank account and credit card statements; other more
high-tech methods involve accessing corporate databases to steal lists of
customer information (definition adopted from www.investopedia.com).

Tax evasion

Tax evasion is the illegal act or practice of failing to pay taxes that are
owed. In businesses, tax evasion can occur in connection with income
taxes, employment taxes, sales and excise taxes, and other federal, state,
and local taxes. Examples of practices that are considered tax evasion
include knowingly not reporting income or underreporting income (i.e.,
claiming less income than you actually received from a specific source)
(definition adopted from biztaxlaw.about.com).

Plagiarism

Plagiarizing is defined by Merriam Webster’s online dictionary as to steal
and pass off (the ideas or words of another) as one’s own, to use
(another’s production) without crediting the source, to commit literary
theft, to present as new and original an idea or product derived from an
existing source. It involves both stealing someone else’s work and lying
about it afterward (definition adopted from www.plagiarism.org).

In the end, fraudulent activities are intended to result in gains or
benefits for the fraudster, as emphasized by the definition of fraud provided by the Oxford Dictionary. The potential, usually monetary, gain or
benefit forms, in the large majority of cases, the basic driver for committing fraud.
The so-called fraud triangle as depicted in Figure 1.1 provides a
more elaborate explanation for the underlying motives or drivers for
committing occupational fraud.

Figure 1.1 Fraud Triangle (a triangle with vertices Pressure, Opportunity, and Rationalization)

The fraud triangle originates from a
hypothesis formulated by Donald R. Cressey in his 1953 book Other
People’s Money: A Study of the Social Psychology of Embezzlement:

Trusted persons become trust violators when they conceive
of themselves as having a financial problem which is
non-shareable, are aware this problem can be secretly
resolved by violation of the position of financial trust, and
are able to apply to their own conduct in that situation
verbalizations which enable them to adjust their
conceptions of themselves as trusted persons with their
conceptions of themselves as users of the entrusted funds
or property.
This basic conceptual model explains the factors that together cause
or explain the drivers for an individual to commit occupational fraud,
yet provides a useful insight into the fraud phenomenon from a broader
point of view as well. The model has three legs that together institute
fraudulent behavior:
1. Pressure is the first leg and concerns the main motivation for
committing fraud. An individual will commit fraud because he or she
experiences a pressure or a problem of a financial, social, or any
other nature that cannot be resolved or relieved in an authorized manner.
2. Opportunity is the second leg of the model, and concerns
the precondition for an individual to be able to commit
fraud. Fraudulent activities can only be committed when
the opportunity exists for the individual to resolve or relieve
the experienced pressure or problem in an unauthorized but
concealed or hidden manner.
3. Rationalization is the psychological mechanism that explains
why fraudsters do not refrain from committing fraud and think
of their conduct as acceptable.
An essay by Duffield and Grabosky (2001) further explores
the motivational basis of fraud from a psychological perspective.

It concludes that a number of psychological factors may be present
in those persons who commit fraud, but that these factors are also
associated with entirely legitimate forms of human endeavor. And so
fraudsters cannot be distinguished from nonfraudsters purely based
on psychological characteristics or patterns.
Fraud is a social phenomenon in the sense that the potential benefits for the fraudsters come at the expense of the victims. These victims
are individuals, enterprises, or the government, and as such society as
a whole. Some recent numbers give an indication of the estimated size
and the financial impact of fraud:
◾ A typical organization loses 5 percent of its revenues to fraud each year (www.acfe.com).
◾ The total cost of insurance fraud (non–health insurance) in the United States is estimated to be more than $40 billion per year (www.fbi.gov).
◾ Fraud is costing the United Kingdom £73 billion a year (National Fraud Authority).
◾ Credit card companies “lose approximately seven cents per every hundred dollars of transactions due to fraud” (Andrew Schrage, Money Crashers Personal Finance, 2012).
◾ The average size of the informal economy, as a percent of official GNI in the year 2000, in developing countries is 41 percent, in transition countries 38 percent, and in OECD countries 18 percent (Schneider 2002).

Even though these numbers are rough estimates rather than exact
measurements, they are based on evidence and do indicate the importance and impact of the phenomenon, and therefore as well the need
for organizations and governments to actively fight and prevent fraud
with all means they have at their disposal. As will be further elaborated
in the final chapter, these numbers also indicate that it is likely worthwhile to invest in fraud-detection and fraud-prevention systems, since
a significant financial return on investment can be made.
The importance and need for effective fraud-detection and fraud-prevention systems are furthermore highlighted by the many different
forms or types of fraud of which a number have been summarized in
Table 1.1, which is not exhaustive but, rather, indicative, and which
illustrates the widespread occurrence across different industries and
product and service segments. The broad fraud categories enlisted and
briefly defined in Table 1.1 can be further subdivided into more specific
subtypes, which, although interesting, would lead us too far into the
particularities of each of these forms of fraud. One may refer to the further reading sections at the end of each chapter of this book, providing
selected references to specialized literature on different forms of fraud.
A number of particular fraud types will also be further elaborated in
real-life case studies throughout the book.

FRAUD DETECTION AND PREVENTION
Two components that are essential parts of almost any effective strategy to fight fraud concern fraud detection and fraud prevention. Fraud
detection refers to the ability to recognize or discover fraudulent activities, whereas fraud prevention refers to measures that can be taken
to avoid or reduce fraud. The difference between both is clear-cut; the
former is an ex post approach whereas the latter an ex ante approach.
Both tools may and likely should be used in a complementary manner
to pursue the shared objective, fraud reduction.
However, as will be discussed in more detail further on, preventive
actions will change fraud strategies and consequently impact detection
power. Installing a detection system will cause fraudsters to adapt and
change their behavior, and so the detection system itself will impair
eventually its own detection power. So although complementary, fraud
detection and prevention are not independent and therefore should be
aligned and considered a whole.
The classic approach to fraud detection is an expert-based approach,
meaning that it builds on the experience, intuition, and business
or domain knowledge of the fraud analyst. Such an expert-based
approach typically involves a manual investigation of a suspicious
case, which may have been signaled, for instance, by a customer
complaining of being charged for transactions he did not make. Such
a disputed transaction may indicate that a new fraud mechanism has
been discovered or developed by fraudsters, and therefore requires
a detailed investigation for the organization to understand and subsequently address the new mechanism.
Comprehension of the fraud mechanism or pattern allows extending the fraud detection and prevention mechanism that is often implemented as a rule base or engine, meaning in the form of a set of If-Then
rules, by adding rules that describe the newly detected fraud mechanism. These rules, together with rules describing previously detected
fraud patterns, are applied to future cases or transactions and trigger
an alert or signal when fraud is or may be committed by use of this
mechanism. A simple, yet possibly very effective, example of a fraud
detection rule in an insurance claim fraud setting goes as follows:
IF:
◾ Amount of claim is above threshold OR
◾ Severe accident, but no police report OR
◾ Severe injury, but no doctor report OR
◾ Claimant has multiple versions of the accident OR
◾ Multiple receipts submitted
THEN:
◾ Flag claim as suspicious AND
◾ Alert fraud investigation officer.
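
Expressed in code, such a rule base is simply a set of predicates evaluated against the attributes of each claim. The following minimal sketch in Python mirrors the rule above; the claim fields and the threshold value are hypothetical illustrations, not an actual production rule engine.

# Minimal sketch of the expert rule above; the field names and the
# threshold value are hypothetical illustrations.
CLAIM_AMOUNT_THRESHOLD = 10_000  # assumed value, chosen by the fraud expert

def is_suspicious(claim: dict) -> bool:
    """Return True as soon as any expert-defined red flag fires (OR logic)."""
    return (
        claim["amount"] > CLAIM_AMOUNT_THRESHOLD
        or (claim["severe_accident"] and not claim["police_report"])
        or (claim["severe_injury"] and not claim["doctor_report"])
        or claim["versions_of_accident"] > 1
        or claim["num_receipts"] > 1
    )

claim = {"amount": 12_500, "severe_accident": True, "police_report": False,
         "severe_injury": False, "doctor_report": True,
         "versions_of_accident": 1, "num_receipts": 1}

if is_suspicious(claim):
    print("Flag claim as suspicious and alert the fraud investigation officer.")

Encoding the rules explicitly like this also makes the maintenance challenge discussed below tangible: adding, removing, updating, or merging a rule becomes a visible, reviewable change.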

Such an expert approach suffers from a number of disadvantages.
Rule bases or engines are typically expensive to build, since they
require advanced manual input by the fraud experts, and often turn
out to be difficult to maintain and manage. Rules have to be kept
up to date and should only or mostly trigger real fraudulent cases, since
every signaled case requires human follow-up and investigation.
Therefore, the main challenge concerns keeping the rule base lean
and effective—in other words, deciding when and which rules to add,
remove, update, or merge.
It is important to realize that fraudsters can, for instance by trial
and error, learn the business rules that block or expose them and will
devise inventive workarounds. Since the rules in the rule-based detection system are based on past experience, new emerging fraud patterns are not automatically flagged or signaled. Fraud is a dynamic
phenomenon, as will be discussed below in more detail, and therefore
needs to be traced continuously. Consequently, a fraud detection and
prevention system also needs to be continuously monitored, improved,
and updated to remain effective.
An expert-based fraud-detection system relies on human expert
input, evaluation, and monitoring, and as such involves a great deal
of labor-intensive human interventions. An automated approach to
build and maintain a fraud-detection system, requiring less human
involvement, could lead to a more efficient and effective system
for detecting fraud. The next section in this chapter will introduce
several alternative approaches to expert systems that leverage the
massive amounts of data that nowadays can be gathered and processed at very low cost, in order to develop, monitor, and update a
high-performing fraud-detection system in a more automated and
efficient manner. These alternative approaches still require and build
on expert knowledge and input, which remains crucial in order to
build an effective system.

EXAMPLE CASE: EXPERT-BASED APPROACH TO
INTERNAL FRAUD DETECTION IN AN INSURANCE
CLAIM-HANDLING PROCESS
An example of an expert-based detection and prevention system to signal potential fraud committed by claim-handling officers concerns the business process depicted in Figure 1.2, illustrating the handling of fire incident claims without any form of bodily injury (including death) (Caron et al. 2013). The process involves the following types of activities:
◾ Administrative activities
◾ Evaluation-related activities
◾ In-depth assessment by internal and external experts
◾ Approval activities
◾ Leniency-related activities
◾ Fraud investigation activities

[Figure 1.2 shows the fire incident claim-handling process, including activities such as Consult Policy, Send claim acceptance letter, Register compensation agreement, Update provision, Compensate, and Discard provisions, together with clusters of additional evaluation activities, leniency-related activities, and an approval cycle.]

Figure 1.2 Fire Incident Claim-Handling Process

A number of harmful process deviations and related risks can be
identified regarding these activities:
◾ Forgetting to discard provisions
◾ Multiple partial compensations (exceeding limit)
◾ Collusion between administrator and experts
◾ Lack of approval cycle
◾ Suboptimal task allocation
◾ Fraud investigation activities
◾ Processing rejected claim
◾ Forced claim acceptance, absence of a timely primary evaluation

Deviations marked in bold may relate to, and therefore indicate, fraud. By adopting business policies as a governance instrument and prescribing procedures and guidelines, the insurer may reduce the risks involved in processing the insurance claims. For instance:
Business policy excerpt 1 (customer relationship management related): If the insured requires immediate assistance (e.g., to prevent the development of additional damage), arrangements will be made for a single partial advanced compensation (maximum x% of expected covered loss).
◾ Potential risk: The expected (covered) loss could be exceeded through partial advanced compensations.

Business policy excerpt 2 (avoid financial loss): Settlements need to be approved.
◾ Potential risk: Collusion between the drafter of the settlement and the insured.

Business policy excerpt 3 (avoid financial loss): The proposal of a settlement and its approval must be performed by different actors.
◾ Potential risk: A person might hold both the team-leader and the expert role in the information system.

Business policy excerpt 4 (avoid financial loss): After approval of the decision (settlement or claim rejection) no changes may occur.
◾ Potential risk: The modifier and the insured might collude.
◾ Potential risk: A rejected claim could undergo further processing.


Since detecting fraud based on specified business rules requires prior knowledge of the fraud scheme, the existence of fraud issues will be the direct result of either:
◾ An inadequate internal control system (controls fail to prevent fraud); or
◾ Risks accepted by the management (no preventive or corrective controls are in place).

Some examples of fraud-detection rules that can be derived from these business policy excerpts and process deviations, and that may be added to the fraud-detection rule engine, are as follows:

Business policy excerpt 1:
◾ IF multiple advanced payments for one claim, THEN suspicious case.

Business policy excerpt 2:
◾ IF settlement was not approved before it was paid, THEN suspicious case.

Business policy excerpt 3:
◾ IF settlement is proposed AND approved by the same person, THEN suspicious case.

Business policy excerpt 4:
◾ IF settlement is approved AND changed afterward, THEN suspicious case.
◾ IF claim is rejected AND processed afterward (e.g., look for a settlement proposal, payment, … activity), THEN suspicious case.
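To illustrate, the rule derived from business policy excerpt 3 could be checked against a claim's event log roughly as sketched below; the event-log format (activity, actor) is a hypothetical illustration.

# Minimal sketch of the four-eyes check from business policy excerpt 3.
# The event-log format (activity, actor) is a hypothetical illustration.

def violates_four_eyes(event_log):
    """Flag a claim if the same actor both proposed and approved a settlement."""
    proposers = {actor for activity, actor in event_log if activity == "propose_settlement"}
    approvers = {actor for activity, actor in event_log if activity == "approve_settlement"}
    return bool(proposers & approvers)

log = [("propose_settlement", "clerk_7"), ("approve_settlement", "clerk_7")]
print(violates_four_eyes(log))  # True: same person proposed and approved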

BIG DATA FOR FRAUD DETECTION
When fraudulent activities have been detected and confirmed to indeed concern fraud, two types of measures are typically taken:
1. Corrective measures, which aim to resolve the fraud and correct the wrongful consequences, for instance by pursuing restitution or compensation for the incurred losses. These corrective measures might also include actions to retrospectively detect and subsequently address similar fraud cases that made use of the same mechanism or of loopholes in the fraud detection and prevention system the organization has in place.
2. Preventive measures, which may include both actions that aim at preventing future fraud by the caught fraudster (e.g., by terminating a contractual agreement with a customer) and actions that aim at preventing fraud of the same type by other individuals. When an expert-based approach is adopted, an example preventive measure is to extend the rule engine with additional rules that allow detecting and preventing future use of the uncovered fraud mechanism. A fraud case must be investigated thoroughly so that the underlying mechanism can be unraveled, extending the available expert knowledge; adjusting the detection and prevention system accordingly makes the organization more robust and less vulnerable to this type of fraud.
Typically, the sooner corrective measures are taken and therefore the sooner fraud is detected, the more effective such measures may be and the more losses can be avoided or recompensed. On the other hand, fraud becomes easier to detect the more time has passed, for a number of particular reasons.
When a fraud mechanism or path exists, meaning a loophole in the detection and prevention system of an organization, the number of times this path will be followed (i.e., the fraud mechanism used) grows over time, and therefore so does the number of occurrences of this particular type of fraud. The more often a fraud path is taken, the more apparent it becomes and, statistically, the easier it is to detect. The number of occurrences of a particular type of fraud can be expected to grow since many fraudsters appear to be repeat offenders. As the expression goes, "Once a thief, always a thief." Moreover, a fraud mechanism may well be discovered by several individuals, or the knowledge may be shared between fraudsters. As will be shown in Chapter 5 on social network analytics for fraud detection, some types of fraud certainly tend to
spread virally and display what are called social network effects,
indicating that fraudsters share their knowledge on how to commit
fraud. This effect, too, leads to a growing number of occurrences and,
therefore, a higher risk or chance, depending on one’s perspective, of
detection.
Once a case of a particular type of fraud has been revealed, this will lead to the exposure of similar fraud cases committed in the past that made use of the same mechanism. Typically, a retrospective screening is performed to assess the size or impact of the newly detected type of fraud, as well as to resolve (by means of corrective measures, cf. supra) as many fraud cases as possible. As such, fraud becomes easier to detect the more time has passed, since more similar fraud cases will occur over time, increasing the probability that the particular fraud type will be uncovered, and because fraudsters committing repeated fraud increase their individual risk of being exposed. The individual risk increases the more fraud a fraudster commits, for the same basic reason: The chances of getting noticed get larger.
A final reason why fraud becomes easier to detect as more time passes is that better detection techniques are being developed, are becoming readily available, and are being implemented and applied by a growing number of organizations. An important driver for improvements in detection techniques is growing data availability. The informatization and digitalization of almost every aspect of society and daily life lead to an abundance of available data. This so-called big data can be explored and exploited for a range of purposes, including fraud detection (Baesens 2014), at a very low cost.

DATA-DRIVEN FRAUD DETECTION
Although classic, expert-based fraud-detection approaches as discussed
before are still in widespread use and definitely represent a good starting point and complementary tool for an organization to develop an
effective fraud-detection and prevention system, a shift is taking place
toward data-driven or statistically based fraud-detection methodologies for three apparent reasons:
1. Precision. Statistically based fraud-detection methodologies offer an increased detection power compared to classic approaches. By processing massive volumes of information, fraud patterns may be uncovered that are not sufficiently apparent to the human eye. It is important to note that the improved power of data-driven approaches over human processing has also been observed in similar applications such as credit scoring and customer churn prediction. Most organizations have only a limited capacity to have cases checked by an inspector to confirm whether or not a case effectively concerns fraud. The goal of a fraud-detection system may therefore be to make optimal use of the limited available inspection capacity, or in other words to maximize the fraction of fraudulent cases among the inspected cases (and possibly, in addition, the detected amount of fraud). A system with higher precision, as delivered by data-based methodologies, directly translates into a higher fraction of fraudulent inspected cases.
2. Operational efficiency. In certain settings, there is an increasing number of cases to be analyzed, requiring an automated process as offered by data-driven fraud-detection methodologies. Moreover, in several applications operational requirements exist that impose time constraints on the processing of a case. For instance, when evaluating a transaction with a credit card, an almost immediate decision is required with respect to approving or blocking the transaction because of suspicion of fraud. Another example concerns fraud detection for customs in a harbor, where a decision has to be made within a confined time window whether to let a container pass and be shipped inland, or whether to inspect it further, possibly causing delays. Automated data-driven approaches offer such functionality and are able to comply with stringent operational requirements.
3. Cost efficiency. As already mentioned in the previous section,
developing and maintaining an effective and lean expert-based
fraud-detection system is both challenging and labor intensive.
A more automated and, as such, more efficient approach to
develop and maintain a fraud-detection system, as offered
by data-driven methodologies, is preferred. Chapters 6 and
7 discuss the cost efficiency and return on investment of
data-driven fraud-detection models.


An additional driver for the development of improved fraud-detection technologies concerns the growing interest that fraud detection is attracting from the general public, the media, governments, and enterprises. This increasing awareness of and attention to fraud is likely due to its large negative social and financial impact, and leads to growing investments and research into the matter from academia, industry, and government alike.
Although fraud-detection approaches have gained significant power over the past years by adopting potent statistically based methodologies and by analyzing massive amounts of data in order to discover fraud patterns and mechanisms, fraud remains hard to detect. The Pareto principle, or principle of decreasing returns, appears to hold with respect to the required effort and difficulty of detecting fraud. To explain the hardness and complexity of the problem, it is important to acknowledge that fraud is a dynamic phenomenon, meaning that its nature changes over time. Not only do fraud-detection mechanisms evolve, but fraudsters also adapt their approaches and are inventive in finding more refined and less apparent ways to commit fraud without being exposed. Fraudsters probe fraud-detection and prevention systems to understand their functioning and to discover their weaknesses, allowing them to adapt their methods and strategies.

FRAUD-DETECTION TECHNIQUES
Indeed, fraudsters develop advanced strategies to cleverly cover their tracks in order to avoid being uncovered. Fraudsters tend to try to blend into their surroundings as much as possible. Such an approach is reminiscent of camouflage techniques as used by the military or by animals such as chameleons and stick insects. This is clearly not fraud by opportunity, but rather carefully planned fraud, leading to a need for new techniques that are able to detect and address patterns that initially seem to comply with normal behavior, but in reality instigate fraudulent activities.
Detection mechanisms based on unsupervised learning techniques or descriptive analytics, as discussed in Chapter 3, typically aim at finding behavior that deviates from normal behavior, or in other words at detecting anomalies. These techniques learn from historical observations and are called unsupervised since they do not require these observations to be labeled as either fraudulent or nonfraudulent example cases. An example of behavior that does not comply with normal behavior in a telecommunications subscription fraud setting is provided by the transaction data set with call detail records of a particular subscriber shown in Table 1.2 (Fawcett and Provost 1997). Note that the calls found to be fraudulent (marked "Bandit" in the last column of the table) are not suspicious by themselves; however, they deviate from the normal behavior of this particular subscriber.
Outlier-detection techniques have great value and allow detecting a significant fraction of fraudulent cases. In particular, they might allow detecting fraud that is different in nature from historical fraud, or in other words fraud that makes use of new, unknown mechanisms resulting in a novel fraud pattern.
Table 1.2 Call Detail Records of a Customer with Outliers Indicating Suspicious Activity (deviating behavior starting at a certain moment in time) at the Customer Subscription (Fawcett and Provost 1997)

Date (m/d)  Time      Day  Duration  Origin            Destination       Fraud
1/01        10:05:01  Mon  13 mins   Brooklyn, NY      Stamford, CT
1/05        14:53:27  Fri  5 mins    Brooklyn, NY      Greenwich, CT
1/08        09:42:01  Mon  3 mins    Bronx, NY         White Plains, NY
1/08        15:01:24  Mon  9 mins    Brooklyn, NY      Brooklyn, NY
1/09        15:06:09  Tue  5 mins    Manhattan, NY     Stamford, CT
1/09        16:28:50  Tue  53 sec    Brooklyn, NY      Brooklyn, NY
1/10        01:45:36  Wed  35 sec    Boston, MA        Chelsea, MA       Bandit
1/10        01:46:29  Wed  34 sec    Boston, MA        Yonkers, MA       Bandit
1/10        01:50:54  Wed  39 sec    Boston, MA        Chelsea, MA       Bandit
1/10        11:23:28  Wed  24 sec    White Plains, NY  Congers, NY
1/11        22:00:28  Thu  37 sec    Boston, MA        East Boston, MA   Bandit
1/11        22:04:01  Thu  37 sec    Boston, MA        East Boston, MA   Bandit
These new patterns are not discovered by expert systems, and as such descriptive analytics may be a first complementary tool to be adopted by an organization in order to improve its expert rule-based fraud-detection system. Descriptive techniques, however, prove to be prone to deception, precisely by the camouflage-like fraud strategies already discussed. Therefore, the detection system can be further improved by complementing it with a tool that is able to unmask fraudsters adopting such camouflage-like techniques.
To this end, Chapter 4 introduces a second type of technique. Supervised learning techniques or predictive analytics aim to learn from historical information or observations in order to retrieve patterns that allow differentiating between normal and fraudulent behavior. These techniques aim precisely at finding the silent alarms, the parts of their tracks that fraudsters cannot cover up. Supervised learners can be applied to predict or detect fraud as well as to estimate the amount of fraud.
Predictive analytics has limitations as well, probably the most important one being the need for historical examples to learn from (i.e., a labeled data set of historically observed fraud behavior). This reduces its detection power with respect to drastically different fraud types that make use of new mechanisms or methods and that have not been detected thus far, and are therefore not included in the historical database of fraud cases from which the predictive model was learned. As already discussed, descriptive analytics may perform better at detecting such new fraud mechanisms, at least if a new fraud mechanism leads to detectable deviations from normality. This illustrates the complementarity of supervised and unsupervised methods and motivates the use of both types of methods as complementary tools in developing a powerful fraud-detection and prevention system.
A third type of complementary tool concerns social network analysis, which further extends the abilities of the fraud-detection system by learning and detecting characteristics of fraudulent behavior in a network of linked entities. Social network analytics is the newest tool in our toolbox to fight fraud, and proves to be a very powerful means, as will become apparent from the discussion and case study presented in Chapter 5. Social network analytics allows including an extra source of information in the analysis, namely the relationships between entities, and as such may contribute to uncovering particular patterns indicating fraud.
It is important to stress that these three different types of techniques may complement each other, since they focus on different aspects of fraud and are not to be considered exclusive alternatives. An effective fraud-detection and prevention system will make use of and combine these different tools, which have different possibilities and limitations and therefore reinforce each other when applied in a combined setup. When developing a fraud-detection system, an organization will likely follow the order in which the different tools have been introduced: as a first step, an expert-based rule engine may be developed, which in a second step may be complemented by descriptive analytics, and subsequently by predictive and social network analytics. Developing a fraud-detection system in this order allows the organization to gain expertise and insight in a stepwise manner, thereby facilitating each next step. However, the exact order of adopting the different techniques may depend on the characteristics of the type of fraud an organization is faced with.

FRAUD CYCLE
Figure 1.3 introduces the fraud cycle, and depicts four essential activities:
◾ Fraud detection: Applying detection models on new, unseen observations and assigning a fraud risk to every observation.

◾ Fraud investigation: A human expert is often required to investigate suspicious, flagged cases, given the subtlety and complexity involved.

◾ Fraud confirmation: Determining the true fraud label, possibly involving field research.

◾ Fraud prevention: Preventing fraud from being committed in the future. This might even result in detecting fraud before the fraudster knows s/he will commit fraud, which is exactly the premise of the 1956 science fiction short story Minority Report by Philip K. Dick.

[Figure 1.3 depicts an automated detection algorithm feeding a cycle of four activities: fraud detection, fraud investigation, fraud confirmation, and fraud prevention, embedded in the current process.]

Figure 1.3 The Fraud Cycle
Note the feedback loop in Figure 1.3 from the fraud confirmation activity toward the fraud-detection activity. Newly detected cases should be added (as soon as possible!) to the database of historical fraud cases, which is used to learn or induce the detection model. The fraud-detection model may not be retrained every time a new case is confirmed; however, a regular update of the model is recommended, given the dynamic nature of fraud and the importance of detecting fraud as soon as possible. The required frequency of retraining or updating the detection model depends on several factors:
◾ The volatility of the fraud behavior
◾ The detection power of the current model, which is related to the volatility of the fraud behavior
◾ The amount of (similar) confirmed cases already available in the database
◾ The rate at which new cases are being confirmed
◾ The required effort to retrain the model

Depending on the emerging need for retraining as determined by these factors, as well as possible additional factors, an automated approach such as reinforcement learning may be considered, which continuously updates the detection model by learning from the newest observations.
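As a rough illustration of such continuous updating, the sketch below folds newly confirmed cases into an online logistic regression model with scikit-learn's partial_fit; note that this is incremental learning rather than full reinforcement learning, and the data-loading helper and feature layout are hypothetical assumptions.

# Minimal sketch of incrementally updating a detection model as newly
# confirmed cases arrive; the data-loading helper is hypothetical.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")  # logistic regression trained online

def fetch_newly_confirmed_cases():
    """Hypothetical stand-in for querying the database of confirmed cases."""
    X = np.random.rand(10, 5)          # 10 new cases, 5 features
    y = np.random.randint(0, 2, 10)    # confirmed labels: 1 = fraud
    return X, y

# At each update moment (frequency driven by the factors listed above):
X_new, y_new = fetch_newly_confirmed_cases()
model.partial_fit(X_new, y_new, classes=[0, 1])  # learn from the newest observations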


EXAMPLE CASE: SUPERVISED AND UNSUPERVISED
LEARNING FOR DETECTING CREDIT CARD FRAUD
In order to fight fraud and given the abundant data availability, credit card
companies have been among the early adopters of big data approaches to
develop effective fraud-detection and prevention systems. A typical credit card
transaction is registered in the systems of the credit card company by logging
up to a hundred or more characteristics describing the details of a transaction.
Table 1.3 provides for illustrative purposes a number of such characteristics
or variables that are being captured (Hand 2007).
Table 1.3 Example Credit Card Transaction Data Fields

Transaction ID           Transaction type      Date of transaction
Time of transaction      Amount                Currency
Local currency amount    Merchant ID           Merchant category
Card issuer ID           ATM ID                Cheque account prefix

By logging this information over a period of time, a data set is created that allows descriptive analytics to be applied. This includes outlier-detection techniques, which allow detecting abnormal or anomalous behavior and/or characteristics in a data set. So-called outliers may indicate suspicious activities, and may occur at the data item level or the data set level.
Figure 1.4 provides an illustration of outliers at the data item level, in this example transactions that deviate from the normal behavior of a customer. The scatter plot clearly shows three clusters of regular, frequently occurring transaction types, characterized by the time and place dimensions of the transactions of one particular customer, as well as two deviating transactions marked in black. These outliers are suspicious and possibly concern fraudulent transactions, and therefore may be flagged for further human investigation.
An outlier at the data set level means that the behavior of a person or instance does not comply with the overall behavior. Figure 1.5 plots the age and income characteristics of customers as provided when applying for a credit card. The two outliers marked in black in the plot may indicate so-called subscription fraud (cf. Table 1.1, definition of credit card fraud), since these combinations of age and income strongly deviate from the normal behavior.
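A minimal sketch of how such data item level outliers could be flagged automatically is given below, clustering a customer's transactions and flagging those far from every cluster centroid; the two-feature representation, the synthetic data, and the cutoff are illustrative assumptions.

# Minimal sketch of outlier detection on one customer's transactions,
# assuming a hypothetical two-feature (time, place) representation.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[20, 30], scale=2, size=(50, 2)),   # regular cluster 1
    rng.normal(loc=[60, 90], scale=2, size=(50, 2)),   # regular cluster 2
    rng.normal(loc=[40, 140], scale=2, size=(50, 2)),  # regular cluster 3
    [[5.0, 150.0], [65.0, 10.0]],                      # two deviating transactions
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dist = kmeans.transform(X).min(axis=1)   # distance to the nearest centroid
threshold = np.percentile(dist, 99)      # assumed cutoff for flagging
print("Flagged for investigation:", np.where(dist > threshold)[0])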

[Figure 1.4 shows a scatter plot of Time versus Place for one customer's transactions, with three dense clusters of regular transactions and two black-marked outliers.]

Figure 1.4 Outlier Detection at the Data Item Level

[Figure 1.5 shows a scatter plot of Income versus Age across customers, with two black-marked outliers deviating from the overall population.]

Figure 1.5 Outlier Detection at the Data Set Level

The addition of a field to the available transaction data set indicating whether a transaction was fraudulent allows predictive analytics to be applied to yield a model that predicts or classifies an instance as fraudulent or not. As will be discussed in the next section as well as in Chapter 4, such models may be interpreted to understand the underlying credit card fraud behavior patterns that lead the model to predict whether a transaction might be fraudulent. Such patterns may be:
◾ Small purchase followed by a big one
◾ Large number of online purchases in a short period
◾ Spending as much as possible as quickly as possible
◾ Spending smaller amounts, spread across time

Such patterns may be harder to detect and concern advanced methods adopted by fraudsters, developed precisely to avoid detection.
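The following sketch illustrates the basic supervised setup on such a labeled data set; the two features, the simulated labels, and the class weighting are hypothetical illustrations rather than a recommended model.

# Minimal sketch of predictive analytics on labeled transactions; the
# features and simulated data are hypothetical illustrations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
amount = rng.exponential(100, 1000)        # transaction amount
recent_count = rng.poisson(1, 1000)        # purchases in the past hour
y = rng.binomial(1, 0.02, 1000)            # label: 1 = confirmed fraud
                                           # (rate inflated for illustration)

X = np.column_stack([amount, recent_count])
model = LogisticRegression(class_weight="balanced").fit(X, y)  # compensate skew
# Score a new transaction: small amount but a burst of recent activity.
print("Fraud probability:", model.predict_proba([[5.0, 12]])[0, 1])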

THE FRAUD ANALYTICS PROCESS MODEL
Figure 1.6 provides a high-level overview of the analytics process
model (Han and Kamber 2011; Hand, Mannila, and Smyth 2001; Tan,
Steinbach, and Kumar 2005). As a first step, a thorough definition of the business problem to be solved with analytics is needed. Next, all source data that could be of potential interest must be identified. This
is a very important step, as data are the key ingredient to any analytical
exercise and the selection of data will have a deterministic impact on
the analytical models that will be built in a subsequent step. All data
will then be gathered in a staging area that could be a data mart or data
warehouse. Some basic exploratory analysis can be considered here
using for instance OLAP (online analytical processing, see Chapter 3)
facilities for multidimensional data analysis (e.g., roll-up, drill down,
slicing and dicing).

[Figure 1.6 depicts the fraud analytics process model as a sequence of steps: Identify Business Problem, Identify Data Sources, Select the Data, Clean the Data, and Transform the Data (preprocessing); Analyze the Data (analytics); and Interpret, Evaluate, and Deploy the Model (post-processing).]

Figure 1.6 The Fraud Analytics Process Model

This will be followed by a data-cleaning step to
get rid of all inconsistencies, such as missing values and duplicate
data. Additional transformations may also be considered, such as
binning, alphanumeric to numeric coding, geographical aggregation,
and so on. In the analytics step, an analytical model will be estimated
on the preprocessed and transformed data. In this stage, the actual
fraud-detection model is built. Finally, once the model has been built,
it will be interpreted and evaluated by the fraud experts.
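As an illustration of the cleaning and transformation steps just mentioned, the sketch below applies de-duplication, missing-value imputation, binning, and alphanumeric-to-numeric coding with pandas; the column names and values are hypothetical.

# Minimal sketch of common preprocessing steps on a transaction table;
# column names and values are hypothetical illustrations.
import pandas as pd

df = pd.DataFrame({
    "amount": [120.0, None, 80.0, 80.0, 5300.0],
    "merchant_category": ["grocery", "travel", "grocery", "grocery", "travel"],
})

df = df.drop_duplicates()                                   # remove duplicates
df["amount"] = df["amount"].fillna(df["amount"].median())   # impute missing values
df["amount_bin"] = pd.cut(df["amount"], bins=3,
                          labels=["low", "medium", "high"]) # binning
df = pd.get_dummies(df, columns=["merchant_category"])      # alphanumeric-to-numeric coding
print(df)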
Trivial patterns that may be detected by the model, for instance patterns similar to expert rules, are interesting as they provide some validation of the model. But of course, the key issue is to find the unknown
yet interesting and actionable patterns (sometimes also referred to as
knowledge diamonds) that can provide added insight and detection
power. Once the analytical model has been appropriately validated and
approved, it can be put into production as an analytics application (e.g.,
decision support system, scoring engine). Important to consider here
is how to represent the model output in a user-friendly way, how to
integrate it with other applications (e.g., detection and prevention system, risk engines), and how to make sure the analytical model can
be appropriately monitored and backtested on an ongoing basis. These
post-processing steps will be discussed in detail in Chapter 6.
It is important to note that the process model outlined in Figure 1.6
is iterative in nature in the sense that one may have to go back to
previous steps during the exercise. For example, during the analytics step, the need for additional data may be identified, which may
necessitate additional cleaning, transformation, and so on. The most time-consuming step is typically the data selection and preprocessing step, which usually takes around 80 percent of the total effort needed to build an analytical model.
A fraud-detection model must be thoroughly evaluated before being adopted. Depending on the exact setting and usage of the model, different aspects may need to be assessed during evaluation in order to ensure that the model is acceptable for implementation. Table 1.4 reviews several key characteristics of successful fraud analytics models that may or may not apply, depending on the exact application.


Table 1.4 Key Characteristics of Successful Fraud Analytics Models

Statistical accuracy: Refers to the detection power and the correctness of the statistical model in flagging cases as being suspicious. Several statistical evaluation criteria exist and may be applied to evaluate this aspect, such as the hit rate, lift curves, AUC, etc. A number of suitable measures will be discussed in detail in Chapter 4. Statistical accuracy may also refer to statistical significance, meaning that the patterns that have been found in the data have to be real and not the consequence of coincidence. In other words, we need to make sure that the model generalizes well and is not overfitted to the historical data set.

Interpretability: When a deeper understanding of the detected fraud patterns is required, for instance to validate the model before it is adopted for use, a fraud-detection model may have to be interpretable. This aspect involves a certain degree of subjectivism, since interpretability may depend on the user's knowledge. The interpretability of a model depends on its format, which, in turn, is determined by the adopted analytical technique. Models that allow the user to understand the underlying reasons why the model signals a case to be suspicious are called white-box models, whereas complex incomprehensible mathematical models are often referred to as black-box models. Black-box models may well be acceptable in a fraud-detection setting, although in most settings some level of understanding, and in fact validation, which is facilitated by interpretability, is required for the management to have confidence in the model and allow its effective operationalization.

Operational efficiency: Refers to the time that is required to evaluate the model, or in other words, the time required to evaluate whether a case is suspicious or not. When cases need to be evaluated in real time, for instance to signal possible credit card fraud, operational efficiency is crucial and is a main concern during model performance assessment. Operational efficiency also entails the efforts needed to collect and preprocess the data, evaluate the model, monitor and backtest the model, and reestimate it when necessary.

Economical cost: Developing and implementing a fraud-detection model involves a significant cost to an organization. The total cost includes the costs to gather, preprocess, and analyze the data, and the costs to put the resulting analytical models into production. In addition, the software costs, as well as human and computing resources, should be taken into account. Possibly external data has to be bought as well to enrich the available in-house data. Clearly, it is important to perform a thorough cost-benefit analysis at the start of the project, and to gain insight into the constituent factors of the return on investment of building an advanced fraud-detection system.

Regulatory compliance: Depending on the context, there may be internal or organization-specific and external regulation and legislation that applies to the development and application of a model. Clearly, a fraud-detection model should be in line and comply with all applicable regulation and legislation, for instance with respect to privacy, the use of cookies in a web browser, etc.
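For the statistical accuracy dimension, a measure such as AUC can be computed as in the brief sketch below; the labels and model scores are randomly generated purely for illustration.

# Minimal sketch of computing AUC on a held-out set; labels and model
# scores are randomly generated purely for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
y_true = rng.binomial(1, 0.01, 5000)        # skewed labels: about 1% fraud
scores = rng.random(5000) + 0.5 * y_true    # hypothetical model scores
print("AUC:", roc_auc_score(y_true, scores))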


A number of particular challenges may present themselves when developing and implementing a fraud-detection model, possibly leading to difficulties in meeting the objectives expressed by the characteristics discussed in Table 1.4. A first key challenge concerns the dynamic nature of fraud. Fraudsters constantly try to beat detection and prevention systems by developing new strategies and methods. Therefore, adaptive analytical models and detection and prevention systems are required, in order to detect and resolve fraud as soon as possible. Detecting fraud as early as possible is crucial, as discussed before.
Clearly, it is also crucial to detect fraud as accurately as possible, and not to miss out on too many fraud cases, especially fraud cases with a large financial impact. The cost of missing a fraudulent case or a fraud mechanism may be significant. Related to having good detection power is the requirement of having, at the same time, a low false alarm rate, since we also want to avoid harassing good customers and prevent accounts or transactions from being blocked unnecessarily.
In developing analytical models with good detection power and a low false alarm rate, an additional difficulty concerns the skewedness of the data, meaning that we typically have plenty of historical examples of nonfraudulent cases, but only a limited number of fraudulent cases. For instance, in a credit card fraud setting, typically less than 0.5 percent of transactions are fraudulent. Such a problem is commonly referred to as a needle-in-a-haystack problem and might cause an analytical technique to experience difficulties in learning an accurate model. A number of approaches to address the skewedness of the data will be discussed in Chapter 4.
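One simple way to mitigate such skew, undersampling the majority class, is sketched below; it is only one of the options covered in Chapter 4, and the data and the 5:1 ratio are illustrative assumptions.

# Minimal sketch of undersampling the nonfraudulent majority class;
# labels are randomly generated for illustration.
import numpy as np

rng = np.random.default_rng(3)
y = rng.binomial(1, 0.005, 100_000)    # about 0.5% fraudulent transactions
fraud_idx = np.where(y == 1)[0]
legit_idx = np.where(y == 0)[0]

# Keep all fraud cases plus a random subset of legitimate ones (assumed 5:1 ratio).
keep_legit = rng.choice(legit_idx, size=5 * len(fraud_idx), replace=False)
train_idx = np.concatenate([fraud_idx, keep_legit])
print(len(fraud_idx), "fraud vs.", len(keep_legit), "legitimate cases retained")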
Depending on the exact application, operational efficiency may also be a key requirement, meaning that the fraud-detection system
might only have a limited amount of time available to reach a decision
and let a transaction pass or not. As an example, in a credit card
fraud-detection setting the decision time has to be typically less than
eight seconds. Such a requirement clearly impacts the design of the
operational IT systems, but also the design of the analytical model.
The analytical model should not take too long to be evaluated, and the information or variables used by the model should not take too long to gather or calculate. Calculating trend variables in real time, for instance, may not be feasible from an operational perspective, since it takes too much valuable time. This also relates to the final challenge of dealing with the massive volumes of
data that are available and need to be processed.

FRAUD DATA SCIENTISTS
Whereas the previous section discussed the characteristics of a good fraud-detection model, this section elaborates on the key characteristics of a good fraud data scientist from the perspective of the hiring manager. It is based on our consulting and research experience, having collaborated with many companies worldwide on the topics of big data, analytics, and fraud detection.

A Fraud Data Scientist Should Have Solid Quantitative Skills
Obviously, a fraud data scientist should have a thorough background
in statistics, machine learning and/or data mining. The distinction
between these various disciplines is getting more and more blurred
and is actually not that relevant. They all provide a set of quantitative techniques to analyze data and find business relevant patterns
within a particular context such as fraud detection. A data scientist
should be aware of which technique can be applied when and how.
He/she should not focus too much on the underlying mathematical
(e.g., optimization) details but, rather, have a good understanding
of what analytical problem a technique solves, and how its results
should be interpreted. In this context, the education of engineers
in computer science and/or business/industrial engineering should
aim at an integrated, multidisciplinary view, with graduates formed
in both the use of the techniques, and with the business acumen
necessary to bring new endeavors to fruition. Also important is to
spend enough time validating the analytical results obtained so as
to avoid situations often referred to as data massage and/or data
torture whereby data is (intentionally) misrepresented and/or too
much focus is spent discussing spurious correlations. When selecting
the optimal quantitative technique, the fraud data scientist should

FRAUD: DETECTION, PREVENTION, AND ANALYTICS!

31

take into account the specificities of the context and the problem
or fraud-detection application at hand. Typical requirements for
fraud-detection models have been discussed in the previous section
and the fraud data scientist should have a basic understanding and
feeling for all of those. Based on a combination of these requirements,
the data scientist should be capable of selecting the best analytical
technique to solve the particular business problem.

A Fraud Data Scientist Should Be a Good Programmer
By definition, data scientists work with data. This involves plenty of activities such as sampling and preprocessing of data, model estimation, and post-processing (e.g., sensitivity analysis, model deployment, backtesting, model validation). Although many user-friendly software
tools are on the market nowadays to automate and support these tasks,
every analytical exercise requires tailored steps to tackle the specificities of a particular business problem and setting. In order to successfully
perform these steps, programming needs to be done. Hence, a good
data scientist should possess sound programming skills (e.g., SAS, R,
Python, etc.). The programming language itself is not that important as such, as long as the data scientist is familiar with the basic concepts of programming and knows how to use them to automate repetitive tasks or perform specific routines.
A Fraud Data Scientist Should Excel in Communication and Visualization Skills
Like it or not, analytics is a technical exercise. At this moment, there
is a huge gap between the analytical models and the business users.
To bridge this gap, communication and visualization facilities are
key. Hence, data scientists should know how to represent analytical
models and their accompanying statistics and reports in user-friendly
ways using traffic-light approaches, OLAP (online analytical processing) facilities, If-then business rules, and so on. They should be
capable of communicating the right amount of information without getting lost in complex (e.g., statistical) details, which would inhibit a model's successful deployment. By doing so, business users will better understand the characteristics and behavior in their (big) data, which will improve their attitude toward, and acceptance of, the resulting analytical models. Educational institutions must learn to strike a balance, since many academic degrees produce students who are skewed toward either too much analytical or too much practical knowledge.

A Fraud Data Scientist Should Have a Solid Business Understanding
While this might be obvious, we have witnessed (too) many data science projects that failed because the respective analyst did not understand the business problem at hand. By "business" we refer to the respective application area. Several examples of such application areas of fraud-detection techniques were summarized in Table 1.1. Each of those fields has its own particularities that are important for a fraud data scientist to know and understand in order to be able to design and implement a customized fraud-detection system. The more the detection system is aligned with the environment, the better its performance will be, as evaluated on each of the dimensions already discussed.

A Fraud Data Scientist Should Be Creative
A data scientist needs creativity on at least two levels. First, on a technical level, it is important to be creative with regard to feature selection, data transformation, and cleaning. These steps of the standard analytics process have to be adapted to each particular application, and often the "right guess" can make a big difference. Second, big data and analytics is a fast-evolving field. New problems, technologies, and corresponding challenges pop up on an ongoing basis. Moreover, fraudsters too are very creative and adapt their tactics and methods on an ongoing basis. Therefore, it is crucial that a fraud data scientist keeps up with these new evolutions and technologies and has enough creativity to see how they can create new opportunities.
Figure 1.7 summarizes the key characteristics and strengths constituting the ideal fraud data scientist profile.

[Figure 1.7 shows a radar chart comparing a "close to ideal" fraud data scientist profile with a "too specialized" one along five dimensions: modeling, programming, business understanding, communication, and creativity.]

Figure 1.7 Profile of a Fraud Data Scientist

A SCIENTIFIC PERSPECTIVE ON FRAUD
To conclude this chapter, let's provide a scientific perspective on the research on fraud. Figure 1.8 shows a screenshot of the Web of Science statistics when querying all scientific publications between 1996 and 2014 for the key term fraud. It shows the total number of papers published each year, the number of citations, and the top five most-cited papers.
A couple of conclusions can be drawn as follows:
◾ 6,174 scientific papers have been published on the topic of fraud during the period reported.
◾ The h-index is 44, implying that there are at least 44 papers with 44 citations on the topic of fraud.
◾ The number of publications is steadily increasing, which shows a growing interest from the academic community and research on the topic.
◾ The citations are growing exponentially, which is associated with the increasing number of publications.
◾ Two of the five papers mentioned study the use of analytics for fraud detection, clearly illustrating the growing attention in the field for data-driven approaches.

Figure 1.8 Screenshot of Web of Science Statistics for Scientific Publications on Fraud between 1996 and 2014


REFERENCES
Armstrong, J. S. (2001). Selecting Forecasting Methods. In J.S. Armstrong, ed.
Principles of Forecasting: A Handbook for Researchers and Practitioners. New
York: Springer Science + Business Media, pp. 365–386.
Baesens, B. (2014). Analytics in a Big Data World: The Essential Guide to Data Science
and Its Applications. Hoboken, NJ: John Wiley & Sons.
Bolton, R. J., & Hand, D. J. (2002). Statistical Fraud Detection: A Review.
Statistical Science, 17 (3): 235–249.
Caron, F., Vanden Broucke, S., Vanthienen, J., & Baesens, B. (2013). Advanced
Rule-Based Process Analytics: Applications for Risk Response Decisions
and Management Control Activities. Expert Systems with Applications,
Submitted.
Chakraborty, G., Murali, P., & Satish, G. (2013). Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS. Cary, NC: SAS Institute.
Cressey, D. R. (1953). Other People’s Money; A Study of the Social Psychology of
Embezzlement. New York: Free Press.
Duffield, G., & Grabosky, P. (2001). The Psychology of Fraud. In Trends and
Issues in Crime and Criminal Justice, Australian Institute of Criminology (199).
Elder IV, J., & Thomas, H. (2012). Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications. New York: Academic Press.
Fawcett, T., & Provost, F. (1997). Adaptive Fraud Detection. Data Mining and Knowledge Discovery 1 (3): 291–316.
Grabosky, P., & Duffield, G. (2001). Red Flags of Fraud. Trends and Issues in Crime
and Criminal Justice, Australian Institute of Criminology (200).
Han, J., & Kamber, M. (2011). Data Mining: Concepts and Techniques (3rd ed.). Morgan Kaufmann.
Hand, D. (2007, September). Statistical Techniques for Fraud Detection, Prevention,
and Evaluation. Paper presented at the NATO ASI: Mining Massive Data
sets for Security, London, England.
Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining.
Cambridge, MA: Bradford.
Jamain, A. (2001). Benford’s Law. London: Imperial College.
Junqué de Fortuny, E., Martens, D., & Provost, F. (2013). Predictive Modeling
with Big Data: Is Bigger Really Better? Big Data 1 (4): 215–226.
Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data
(2nd ed.). Hoboken, NJ: Wiley.
Maydanchik, A. (2007). Data Quality Assessment. Bradley Beach, NC: Technics
Publications.

36

FRAUD ANALYTICS

Navarette, E. (2006). Practical Calculation of Expected and Unexpected Losses
in Operational Risk by Simulation Methods (Banca & Finanzas: Documentos de Trabajo, 1 (1): pp. 1–12).
Petropoulos, F., Makridakis, S., Assimakopoulos, V., & Nikolopoulos, K. (2014).
“Horses for courses” in demand forecasting. European Journal of Operational
Research, 237 (1): 152–163.
Schneider, F. (2002). Size and Measurement of the Informal Economy in 110
Countries around the World. In Workshop of Australian National Tax Centre,
ANU, Canberra, Australia.
Tan, P.-N. N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining.
Boston: Addison Wesley.
Van Gestel, T., & Baesens, B. (2009). Credit Risk Management: Basic Concepts:
Financial Risk Components, Rating Analysis, Models, Economic and Regulatory
Capital. Oxford: Oxford University Press.
Van Vlasselaer, V., Eliassi-Rad, T., Akoglu, L., Snoeck, M., & Baesens, B. (2015).
Gotcha! Network-based Fraud Detection for Social Security Fraud. Management Science, Submitted.
Verbeke, W., Dejaeger, K., Martens, D., Hur, J., & Baesens, B. (2012). New
Insights into Churn Prediction in the Telecommunication Sector: A Profit
Driven Data Mining Approach. European Journal of Operational Research
218: 211–229.

CHAPTER 2

Data Collection, Sampling, and Preprocessing

INTRODUCTION
Data is a key ingredient for any analytical exercise. Hence, it is of
key importance to thoroughly consider and list all data sources that
are potentially of interest and relevant before starting the analysis.
Large experiments as well as broad experience in different fields indicate that, when it comes to data, bigger is better (see de Fortuny, Martens, & Provost, 2013). However, real-life data can be, and typically is,
dirty because of inconsistencies, incompleteness, duplication, merging,
and many other problems. Hence, throughout the analytical modeling
steps, various data-filtering mechanisms will be applied to clean up
and reduce the data to a manageable and relevant size. Worth mentioning here is the garbage in, garbage out (GIGO) principle, which
essentially states that messy data will yield messy analytical models.
Hence, it is of utmost importance that every data preprocessing step is
carefully justified, carried out, validated, and documented before proceeding with further analysis. Even the slightest mistake can make the
data totally unusable for further analysis and the results invalid and
of no use whatsoever. In what follows, we will elaborate on the most
important data preprocessing steps that should be considered during
an analytical modeling exercise to build a fraud detection model. But
first, let us have a closer look at what data to gather.

TYPES OF DATA SOURCES
Data can originate from a variety of different sources and provide different types of information that might be useful for the purpose of fraud detection, as will be further discussed in this section. Note that the following mixed discussion of different sources and types of data concerns a broad, non-exhaustive, and non-mutually exclusive categorization: the most prominent data sources and types of information available in a typical organization are listed, but clearly not all possible data sources and types of information are discussed, and some overlap may exist between the listed categories.


Transactional data is a first important source of data. It consists
of structured and detailed information capturing the key characteristics
of a customer transaction (e.g., a purchase, claim, cash transfer, credit
card payment). It is usually stored in massive OLTP (online transaction processing) relational databases. This data can also be
summarized over longer time horizons by aggregating it into averages,
(absolute or relative) trends, maximum or minimum values, etc.
An important type of such aggregated transactional information in a fraud detection setting concerns RFM variables, which stands for recency (R), frequency (F), and monetary (M) variables. RFM variables are often used for clustering in fraud detection, as will be discussed in Chapter 3, but are useful as well in a supervised learning setting, as will be discussed in Chapter 4. RFM variables were originally introduced in a marketing setting (Cullinan, 1977), but they clearly may also come in handy for fraud detection.
RFM variables can be operationalized in various ways. Let us take
the example of credit card fraud detection. The recency can be measured in a continuous way such as how much time elapsed since the
most recent transaction, or in a binary way such as was there a transaction made during the past day, week, month, and so on. The frequency
can be quantified as the number of transactions per day, week, month,
and so on. Similarly, the monetary variable can be quantified as the
minimum, maximum, average, median, or most recent value of a
transaction. Although these variables can be meaningful when interpreted individually (e.g., fraudsters make more frequent transactions
than nonfraudsters), also their interaction can be very useful for fraud
detection. For example, credit card fraudsters often try out a stolen
credit card for a low amount to see whether it works, before making a
big purchase, resulting in a recent and low monetary value transaction
followed by a recent and high monetary value transaction. RFM
variables can also prove their worth in an anti-money laundering
context by considering recency, frequency, and amount of cash
transfers between accounts, which possibly allows uncovering characteristic money laundering patterns. A final illustration concerns RFM variables, which may be operationalized for insurance claim fraud
detection by constructing variables such as time since previous claim,
number of claims submitted in the previous twelve months, and total monetary
amount of claims since subscription of insurance contract.
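To make the RFM idea concrete, the sketch below derives one recency, frequency, and monetary variable per card from a transaction table; the column names, the data, and the reference date are hypothetical illustrations.

# Minimal sketch of deriving RFM variables from a transaction table;
# column names, data, and the reference date are hypothetical.
import pandas as pd

tx = pd.DataFrame({
    "card_id": ["A", "A", "B", "A", "B"],
    "timestamp": pd.to_datetime(["2015-01-02", "2015-01-20", "2015-01-05",
                                 "2015-01-28", "2015-01-30"]),
    "amount": [12.5, 230.0, 45.0, 8.0, 610.0],
})

now = pd.Timestamp("2015-02-01")  # assumed reference date
rfm = tx.groupby("card_id").agg(
    recency_days=("timestamp", lambda t: (now - t.max()).days),  # R
    frequency=("timestamp", "size"),                             # F
    monetary_avg=("amount", "mean"),                             # M
)
print(rfm)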
Contractual, subscription, or account data may complement
transactional data if a contractual relation exists, which is often the
case for utilities such as gas, electricity, telecommunication, and so on.
Examples of subscription data are the start date of the relation, information on subscription renewals, characteristics of a subscription such
as type of services or products delivered, levels of service, cost of service, product guarantees and insurances, and so on. The moment when
customers subscribe to a service offers a unique opportunity for organizations to get to know their customers. Unique in the sense that it may
be the only time when a direct contact exists between an employee
and the customer, either in person, over the phone, or online, and as
such offers the opportunity for the organization to gather additional
information that is nonessential to the contract but may be useful
for purposes such as marketing but as well for fraud detection. Such
information is typically stored in an account management or customer
relationship management (CRM) database.
Subscription data may also be a source of sociodemographic
information, since typically subscription or registration requires identification. Examples of socioeconomic characteristics of a population
consisting of individuals are age, gender, marital status, income level, education level, occupation, religion, and so on. Although not very advanced
or complex measures, sociodemographic information may significantly relate to fraudulent behavior. For instance, it appears that both gender and age are very often related to an individual's likelihood to commit fraud: female and older individuals are less likely to commit fraud than male and younger individuals. Similar characteristics can
also be defined when the basic entities for which fraud is to be detected
do not concern individuals but instead companies or organizations.
In such a setting one rather speaks of slow-moving data dimensions,
factual data or static characteristics. Examples include the address,
year of foundation, industrial sector, activity type, and so on. These do
not change over time at all or as often as do other characteristics such
as turnover, solvency, number of employees, etc. These latter variables
are examples of what we will call below behavioral information.


Several data sources may be consulted for retrieving sociodemographic or factual data, including subscription data sources as
discussed above, and data poolers, survey data, and publicly available
data sources as discussed below.
Nowadays, data poolers are becoming more and more important in the industry. Examples are Experian, Equifax, CIFAS, Dun & Bradstreet, Thomson Reuters, etc. The core business of these companies is to gather data (e.g., sociodemographic information) in particular settings or for particular purposes (e.g., fraud detection, credit risk assessment, and marketing) and sell it to interested customers looking to enrich or extend their data sources. In addition to selling data, these data poolers typically also build predictive models themselves and sell the output of these models as risk scores. This is common practice in credit risk: in the United States, for instance, the FICO score is a credit score ranging between 300 and 850, provided by the three most important credit data poolers or credit bureaus: Experian, Equifax, and TransUnion. Many financial institutions, as well as commercial vendors that give credit to customers, use these FICO scores either as their final internal model to assess creditworthiness, or to benchmark them against an internally developed credit scorecard to better understand the weaknesses of the latter. The use of generic, predefined fraud risk scores is not yet common practice, but may become possible in the near future.
Surveys are another source of data, that is, survey data. Such data are gathered by surveying the target population by offline means (mail, letter, etc.) or online means (via phone call or the Internet, which offers different contact channels such as the organization's website, the online helpdesk, and social media profiles on Facebook, LinkedIn, or Twitter). Surveys may aim at gathering sociodemographic data, but also behavioral information.
Behavioral information concerns any information describing
the behavior of an individual or an entity in the particular context
under research. Such data are also called fast-moving data or dynamic
characteristics. Examples of behavioral variables include information
with regards to preferences of customers, usage information, frequencies of events, trend variables, and so on. When dealing with
organizations, examples of behavioral characteristics or dynamic
characteristics are turnover, solvency, number of employees, and so on.
Marketing data results from monitoring the impact of marketing
actions on the target population, and concerns a particular type of
behavioral information.
Also, unstructured data embedded in text documents (e.g.,
emails, Web pages, claim forms) or multimedia content can be interesting to analyze. However, these sources typically require extensive
preprocessing before they can be successfully included in an analytical
exercise. Analyzing textual data is the goal of a particular branch of
analytics, called text analytics. Given the high level of specialization
involved, this book does not provide an extensive discussion of text
mining techniques, although a brief introduction is provided in the
final chapter of the book. For more information on this topic, one
may refer to academic textbooks on the subject (Chakraborty, Murali,
& Satish 2013).
A second type of unstructured information is contextual or
network information, meaning the context of a particular entity.
An example of such contextual information concerns relations of
a particular type that exist between an entity and other entities of
the same type or of another type. How to gather and represent such
information, as well as how to make use of it will be described in
detail in Chapter 5, which will focus on social network analytics for
fraud detection.
Another important source of data is qualitative, expert-based
data. An expert is a person with a substantial amount of subject
matter expertise within a particular setting (e.g., credit portfolio manager, brand manager). The expertise stems from both common sense
and business experience, and it is important to elicit this knowledge
as much as possible before the analytical model-building exercise is
started. It allows steering the modeling in the right direction and
interpreting the analytical results from the right perspective. A popular
example of applying expert-based validation is checking the univariate
signs of a regression model. For instance, an example already discussed
before concerns the observation that a higher age often results in
a lower likelihood of being fraudulent. Consequently, a negative
sign is expected when including age in a fraud prediction model
yielding the probability of an individual committing fraud. If this turns
out not to be the case for whatever reason (e.g., bad data quality,
multicollinearity), the expert or business user will be reluctant to use
the analytical model at all, since it contradicts prior expectations.
A final source of information concerns publicly available data
sources, which can provide external information, that is, contextual
information that is not related to a particular entity, such as
macroeconomic data (GDP, inflation, unemployment, etc.) and
weather observations. By enriching the data set with such information,
one may see, for example, how the model and the model outputs vary
as a function of the state of the economy. Possibly fraud rates and
total amounts of fraud increase during economic downturns, although
no scientific evidence (or counterevidence, for that matter) of such an
effect is, to our knowledge, available (yet) in the scientific literature.
Publicly available social media data from, for example, Facebook,
Twitter, and LinkedIn can also be an important source of information.
However, one needs to be careful in both gathering and using such
data and make sure that local and international privacy regulations
are respected at all times. Privacy concerns in a data analytics context
will be discussed in the final chapter.

MERGING DATA SOURCES
The application of both descriptive and predictive analytics typically
requires or presumes the data to be presented in a single table containing and representing all the data in a structured manner. A structured
data table allows straightforward processing and analysis. (Learning
from multiple tables that are related—that is, learning directly from
relational databases without merging the normalized tables—is a particular branch within the field of data mining called relational learning,
and shares techniques and approaches with social network analytics as
discussed in Chapter 5.)
Typically, the rows of a data table represent the basic entities to
which the analysis applies (e.g., customers, transactions, enterprises,
claims, cases). The rows are referred to as instances, observations, or
lines. The columns in the data table contain information about the basic
entities. Plenty of synonyms are used to denote the columns of the data
table, such as (explanatory) variables, fields, characteristics, attributes,
indicators, features, and so on.
In order to construct the aggregated, non-normalized data table to
facilitate further analysis, often several normalized source data tables
have to be merged. Merging tables involves selecting information
from different tables related to an individual entity, and copying it
to the aggregated data table. The individual entity can be recognized
and selected in the different tables by making use of keys, which
are attributes included in the tables precisely to allow identifying and relating observations from different source tables pertaining to the same entity. Figure 2.1 illustrates the process of merging
two tables—that is, transactions data and customer data—into a single
non-normalized data table by making use of the key attribute ID, which
allows connecting observations in the transactions table with observations in the customer table. The same approach can be taken to merge
as many tables as required, but clearly the more tables are merged,
the more duplicate data might be included in the resulting table.
When merging data tables, it is crucial that no errors occur, so
checks should be applied to verify the resulting table and to make
sure that all information is correctly integrated.
Transactions

  ID    Date         Amount
  XWV   2/01/2015    52 €
  XWV   6/02/2015    21 €
  XWV   3/03/2015    13 €
  BBC   17/02/2015   45 €
  BBC   1/03/2015    75 €
  VVQ   2/03/2015    56 €

Customer data

  ID    Age   Start date
  XWV   31    1/01/2015
  BBC   49    10/02/2015
  VVQ   21    15/02/2015

Non-normalized data table

  ID    Date         Amount   Age   Start date
  XWV   2/01/2015    52 €     31    1/01/2015
  XWV   6/02/2015    21 €     31    1/01/2015
  XWV   3/03/2015    13 €     31    1/01/2015
  BBC   17/02/2015   45 €     49    10/02/2015
  BBC   1/03/2015    75 €     49    10/02/2015
  VVQ   2/03/2015    56 €     21    15/02/2015

Figure 2.1 Aggregating Normalized Data Tables into a Non-Normalized Data Table


SAMPLING
The aim of sampling is to take a subset of historical data (e.g., past
transactions) and use that to build an analytical model. A first obvious question that comes to mind concerns the need for sampling.
Obviously, with the availability of high-performance computing facilities (e.g., grid and cloud computing), one could also try to directly
analyze the full data set. However, a key requirement for a good sample is that it should be representative of the future entities on which
the analytical model will be run. Hence, the timing aspect becomes
important since, for instance, transactions of today are more similar
to transactions of tomorrow than to transactions of yesterday. Choosing
the optimal time window of the sample involves a trade-off between
lots of data (and hence a more robust analytical model) and recent
data (which may be more representative). The sample should also be
taken from an average business period to get as accurate as possible a
picture of the target population.
Obviously, sampling bias should be avoided as much as possible,
but this is not always straightforward. For instance, in a credit card
context one may take a month’s data as input, which will already
result in a substantial amount of data to be processed.
But which month is representative of future months? Clearly,
customers may use their credit card differently during the month of
December when buying gifts for the holiday period. When looking at
this example more closely, we discover in fact two sources of bias or
deviations from normal business periods. Credit card customers may
spend more during this period, both in total as well as on individual
products. Additionally, different types of products may be bought in
different stores usually frequented by the customer.
Let us consider two concrete solutions to address such a seasonality effect or bias, although other solutions may exist. Since every
month may in fact deviate from normal, if normal is defined as average,
it could make sense to build separate models for different months, or
for homogeneous time frames. This is a rather complex and demanding
solution from an operational perspective, since multiple models have
to be developed, run, maintained, and monitored.


Alternatively, a sample may be gathered by sampling observations
over a period covering a full business cycle. Then only a single model
has to be developed, run, maintained, and monitored, which may possibly come at a cost of reduced fraud detection power since less tailored
to a particular time frame, yet clearly will be less complex and costly
to operate.
The sample to be gathered as such depends on the choice that is
made between these two alternative solutions in addressing potential sampling bias. This example illustrates the importance and direct
impact of sampling, not in the least on the performance of the model
that is built based on the gathered sample (i.e., the fraud detection
power).
In stratified sampling, a sample is taken according to predefined
strata. In a fraud detection context, data sets are typically very skewed
(e.g., 99 percent nonfraudulent and 1 percent fraudulent transactions).
When stratifying according to the target fraud indicator, the sample
will contain exactly the same percentages of (non)fraudulent transactions as the original data. Additional stratification can be applied
on predictor variables as well—for instance, so that the number
of observations across different product categories closely resembles
the real product transaction distribution. However, as long as no large
deviations exist between the sample and the observed distribution
of the predictor variables, it will usually be sufficient to limit stratification
to the target variable.
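As an illustration, the following Python sketch uses pandas and scikit-learn to draw a sample stratified on the target fraud indicator; the transactions data frame and its fraud column are hypothetical:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical, heavily simplified transactions data with a skewed target.
    transactions = pd.DataFrame({
        "amount": [52, 21, 13, 45, 75, 56, 10, 99],
        "fraud":  [0, 0, 0, 1, 0, 0, 0, 1],
    })

    # Stratifying on the target preserves the (non)fraud percentages.
    sample, _ = train_test_split(
        transactions,
        train_size=0.5,
        stratify=transactions["fraud"],
        random_state=42,
    )
    print(sample["fraud"].mean())  # same fraud rate as the full data set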

TYPES OF DATA ELEMENTS
It is important to appropriately consider the different types of data
elements at the start of the analysis. The following types of data
elements can be considered:
◾ Continuous data
  ◾ These are data elements that are defined on an interval, which can be both limited and unlimited.
  ◾ A distinction is sometimes made between continuous data with and without a natural zero value, which are respectively referred to as ratio data (e.g., amounts) and interval data (e.g., temperature in degrees Celsius or Fahrenheit). In the latter case, you cannot make a statement like, “It is double or twice as hot as last month”; since the value zero has no meaning, you cannot take the ratio of two values, hence the name ratio data for data measured on a scale with a natural zero value. Most continuous data in a fraud detection setting concern ratio data, since we are often dealing with amounts.
  ◾ Examples: amount of transaction; balance on savings account; (dis)similarity index
◾ Categorical data
  ◾ Nominal
    ◾ These are data elements that can only take on a limited set of values with no meaningful ordering in between.
    ◾ Examples: marital status; payment type; country of origin
  ◾ Ordinal
    ◾ These are data elements that can only take on a limited set of values with a meaningful ordering in between.
    ◾ Examples: age coded as young, middle-aged, and old
  ◾ Binary
    ◾ These are data elements that can only take on two values.
    ◾ Examples: daytime transaction (yes/no); online transaction (yes/no)

Appropriately distinguishing between these different data elements
is of key importance when importing the data into an analytics tool
at the start of the analysis. For example, if marital status were incorrectly specified as a continuous data element, the software would
calculate its mean, standard deviation, and so on, which is obviously
meaningless and may perturb the analysis.
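As a small illustration, the following Python sketch (with hypothetical pandas data) shows how declaring appropriate data types avoids such meaningless calculations:

    import pandas as pd

    df = pd.DataFrame({
        "marital_status": pd.Categorical(["single", "married", "widowed"]),
        "amount": [52.0, 21.0, 13.0],    # continuous (ratio) data
        "online": [True, False, True],   # binary data
    })

    print(df.dtypes)
    # describe() summarizes the numeric column "amount"; no mean or
    # standard deviation is computed for the categorical marital status.
    print(df.describe())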

VISUAL DATA EXPLORATION AND EXPLORATORY
STATISTICAL ANALYSIS
Visual data exploration is a very important part of getting to know
your data in an “informal” way. It allows gaining some initial insights
into the data, which can then be usefully adopted throughout the
modeling stage. Different plots and graphs can be useful here. Pie charts
are a popular example. A pie chart represents a variable’s distribution
as a pie, whereby each section represents the portion of the total
taken by each value of the variable. Figure 2.2 represents a
pie chart for a variable payment type with possible values credit card,
debit card, or cheque. A separate pie chart analysis for the fraudsters and
nonfraudsters indicates that relatively more fraud occurs for payment
type cheque, which can be a very useful starting insight. Bar charts
represent the frequency of each of the values (either absolute or
relative) as bars. Other handy visual tools are histograms and scatter
plots. A histogram provides an easy way to visualize the central
tendency and to determine the variability or spread of the data.
It also allows contrasting the observed data with standard known
distributions (e.g., the normal distribution). Scatter plots allow users to
visualize one variable against another to see whether there are any
correlation patterns in the data. OLAP-based multidimensional
data analysis can also be usefully adopted to explore patterns in the data
(see Chapter 3).
A next step after visual analysis could be inspecting some basic
statistical measurements such as averages, standard deviations, minimum, maximum, percentiles, confidence intervals, and so on. One
could calculate these measures separately for each of the target classes
(e.g., fraudsters versus nonfraudsters) to see whether there are any
interesting patterns present (e.g., do fraudsters usually have a lower
average age than nonfraudsters?).
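As a sketch of this type of analysis, the following Python snippet (using pandas and matplotlib on hypothetical data) draws a separate payment-type pie chart per target class, mimicking Figure 2.2:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.DataFrame({
        "payment_type": ["credit card", "cheque", "debit card",
                         "cheque", "credit card", "debit card"],
        "fraud": [0, 1, 0, 1, 0, 0],
    })

    fig, axes = plt.subplots(1, 2)
    for ax, (label, group) in zip(axes, df.groupby("fraud")):
        # One pie per target class; differing shares hint at useful predictors.
        group["payment_type"].value_counts().plot.pie(ax=ax, title=f"fraud={label}")
    plt.show()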

BENFORD’S LAW
A data exploration technique that is both visual and numerical and that is
particularly interesting in a fraud detection setting relates to what is
commonly known as Benford’s law. This law describes the frequency
distribution of the first digit in many real-life data sets and is shown
in Figure 2.3. When comparing the expected distribution following
Benford’s law with the observed distribution in a data set, strong
deviations from the expected frequencies may indicate the data to be
suspicious and possibly manipulated. For instance, government aid or

Figure 2.2 Pie Charts for Exploratory Data Analysis (payment type shares of credit card, debit card, and check for the total population, the fraudsters, and the nonfraudsters)

Figure 2.3 Benford’s Law Describing the Frequency Distribution of the First Digit (the expected relative frequency decreases from about 0.30 for leading digit 1 to under 0.05 for leading digit 9)

support eligibility typically depends on whether the applicant meets
certain requirements, such as an income below a certain threshold.
Therefore, data may be tampered with in order for the application
to comply with these requirements. It is exactly such types of fraud
that are prone to detection using Benford’s law, since the manipulated
or made-up numbers will not comply with the expected frequency of
the first digit as expressed by Benford’s law.
The mathematical formula describing this law expresses the
probability P(d) of observing leading digit d as:

P(d) = log10(1 + 1/d)
Benford’s law can be used as a screening tool for fraud detection
(Jamain 2001). It is a partially negative rule, like many other rules,
meaning that if Benford’s law is not satisfied, then it is probable that
the involved data were manipulated and further investigation or
testing is required. Conversely, if a data set complies with Benford’s
law, it can still be fraudulent. Note that Benford’s law applies to a
data set, meaning that a sufficient amount of numbers related to an
individual case need to be gathered in order for Benford’s law to be
meaningful, which is typically the case when dealing with financial
statements, for instance.
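As a minimal illustration of such a screen, the following Python sketch (using numpy; the amounts are made-up input data) compares observed first-digit frequencies with the frequencies expected under Benford’s law:

    import numpy as np

    amounts = np.array([523.10, 1200.00, 187.25, 964.80, 132.50, 118.00])

    first_digits = np.array([int(str(abs(a)).lstrip("0.")[0]) for a in amounts])
    observed = np.array([(first_digits == d).mean() for d in range(1, 10)])
    expected = np.log10(1 + 1 / np.arange(1, 10))  # P(d) = log10(1 + 1/d)

    # Strong deviations may flag manipulated data; compliance proves nothing.
    print(np.round(observed - expected, 3))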
A deviation from Benford’s law does not necessarily mean that
the data have been tampered with. It may also attract the analyst’s
attention toward data quality issues resulting from merging several
data sets in a single structured data table, toward duplicate data,
and so on. Hence, it may definitely be worthwhile for several purposes
to check compliance with Benford’s law during the data-preprocessing phase.

DESCRIPTIVE STATISTICS
Similar to Benford’s law and in addition to the preparatory visual
data exploration, several descriptive statistics might be calculated
that provide basic insight or feeling for the data. Plenty of descriptive
statistics exist to summarize or provide information with respect
to a particular characteristic of the data, and therefore descriptive
statistics should be assessed together—in support and completion of
each other.
Basic descriptive statistics are the mean and the median of continuous variables, with the median less sensitive to extreme values but
also providing less information about the full distribution. Complementary to the mean, the variance or the standard deviation provides
insight into how much the data are spread around the mean. Likewise,
percentile values, such as the 10th, 25th, 75th, and 90th percentiles,
provide further information about the distribution and complement
the median.
Specific descriptive statistics exist to express the symmetry or
asymmetry of a distribution, such as the skewness measure, as well
as the peakedness or flatness of a distribution, such as the kurtosis
measure. However, the exact values of these measures are likely a bit
harder to interpret than, for instance, the value of the mean and
standard deviation. This limits their practical use. Instead, one can
more easily assess these aspects by inspecting visual plots of the
distributions of the involved variables.
When dealing with categorical variables, instead of the median and
the mean value one may calculate the mode, which is the most frequently occurring value. In other words, the mode is the most typical
value for the variable at hand. The mode is not necessarily unique,
since multiple values can result in the same maximum frequency.
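As a sketch, this per-class inspection is straightforward in Python with pandas (hypothetical data):

    import pandas as pd

    df = pd.DataFrame({
        "age": [34, 28, 22, 60, 58, 44],
        "fraud": ["yes", "no", "no", "yes", "no", "no"],
    })

    # Mean, standard deviation, percentiles, and so on per target class.
    print(df.groupby("fraud")["age"].describe())
    # Mode of a categorical variable (not necessarily unique).
    print(df["fraud"].mode())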


MISSING VALUES
Missing values can occur for various reasons. The information
can be nonapplicable. For example, when modeling the amount of fraud
for users, this information is only available for the fraudulent
accounts and not for the nonfraudulent accounts, since it is not applicable there. The information can also be undisclosed, such as a
customer who decided not to disclose his or her income for privacy reasons.
Missing data can also originate from errors during merging
(e.g., typos in names or IDs).
Some analytical techniques (e.g., decision trees) can deal directly
with missing values. Other techniques need some additional preprocessing. The following are the most popular schemes to deal with
missing values (Little and Rubin 2002).
◾ Replace (impute): This implies replacing the missing value with a known value. For example, consider Table 2.1: one could impute the missing credit bureau scores with the average or the median of the known values. For marital status, the mode can be used. One could also apply regression-based imputation, whereby a regression model is estimated to model a target variable (e.g., credit bureau score) based on the other information available (e.g., age, income). The latter is more sophisticated, although the added value from an empirical viewpoint (e.g., in terms of model performance) is questionable.

◾ Delete: This is the most straightforward option and consists of deleting observations or variables with lots of missing values. This, of course, assumes that information is missing at random and has no meaningful interpretation and/or relationship to the target.

◾ Keep: Missing values can be meaningful. For example, a customer may not disclose his or her income because he or she is currently unemployed. This fact may have a relation with fraud and needs to be considered as a separate category.

Table 2.1 Dealing with Missing Values

  ID   Age   Income   Marital Status   Credit Bureau Score   Fraud
  1    34    1,800    ?                620                   Yes
  2    28    1,200    Single           ?                     No
  3    22    1,000    Single           ?                     No
  4    60    2,200    Widowed          700                   Yes
  5    58    2,000    Married          ?                     No
  6    44    ?        ?                ?                     No
  7    22    1,200    Single           ?                     No
  8    26    1,500    Married          350                   No
  9    34    ?        Single           ?                     Yes
  10   50    2,100    Divorced         ?                     No

As a practical way of working, one can start by statistically
testing whether the missing information is related to the target variable or
not (using, e.g., a Chi-squared test; see the section on categorization).
If it is, one can adopt the keep strategy and make a special category for it. If not, one can, depending on the number of observations
available, decide to either delete or impute.
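The following Python sketch illustrates this way of working on hypothetical data (using pandas and scipy; the 0.05 significance level is a conventional choice):

    import pandas as pd
    from scipy.stats import chi2_contingency

    df = pd.DataFrame({
        "income": [1800, 1200, None, 2200, None, 1500, None, 2100],
        "fraud":  [1, 0, 0, 1, 1, 0, 1, 0],
    })

    # Test whether missingness of income is related to the target.
    table = pd.crosstab(df["income"].isna(), df["fraud"])
    chi2, p_value, dof, _ = chi2_contingency(table)

    if p_value < 0.05:
        # Keep: missingness is informative, encode it as a separate indicator.
        df["income_missing"] = df["income"].isna()
    else:
        # Impute: e.g., replace with the median of the known values.
        df["income"] = df["income"].fillna(df["income"].median())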

OUTLIER DETECTION AND TREATMENT
Outliers are extreme observations that are very dissimilar to the rest of
the population. Actually, two types of outliers can be considered:
◾ Valid observations, e.g., salary of boss is US$1,000,000
◾ Invalid observations, e.g., age is 300 years

Both are univariate outliers in the sense that they are outlying on
one dimension. However, outliers can be hidden in unidimensional
views of the data. Multivariate outliers are observations that are outlying in multiple dimensions. For example, Figure 2.4 gives an example
of two outlying observations considering both the dimensions of
income and age.

Figure 2.4 Multivariate Outliers (scatter plot of income versus age showing two observations that are only outlying when both dimensions are considered jointly)

Two important steps in dealing with outliers are detection and
treatment. A first obvious check for outliers is to calculate the minimum and maximum values for each of the data elements. Various
graphical tools can be used to detect outliers. Histograms are a first
example. Figure 2.5 presents an example of a distribution for age
whereby the circled areas clearly represent outliers.
Another useful visual mechanism is a box plot. A box plot represents three key quartiles of the data: the first quartile (25 percent of
the observations have a lower value), the median (50 percent of the
observations have a lower value), and the third quartile (75 percent

Figure 2.5 Histogram for Outlier Detection (frequency distribution of age in 5-year bins; the circled bins, such as 0–5 and 150–200, represent outliers)

of the observations have a lower value). All three quartiles are represented as a box. The minimum and maximum value are then also
added unless they are too far away from the edges of the box. Too far
away is then quantified as more than 1.5 × Interquartile Range (IQR =
Q3 – Q1 ). Figure 2.6 gives an example of a box plot where three outliers
can be seen.
Another way is to calculate z-scores, measuring how many standard deviations an observation lies away from the mean, as follows:

zi = (xi − 𝜇)/𝜎,

where 𝜇 represents the average of the variable and 𝜎 its standard deviation. An example is given in Table 2.2. Note that, by definition, the
z-scores will have a mean of zero and unit standard deviation.

Figure 2.6 Box Plots for Outlier Detection (box spanning Q1, the median M, and Q3; the minimum is shown, and observations further than 1.5 × IQR from the box are marked as outliers)

Table 2.2 z-Scores for Outlier Detection

  ID   Age      z-Score
  1    30       (30 – 40)/10 = –1
  2    50       (50 – 40)/10 = +1
  3    10       (10 – 40)/10 = –3
  4    40       (40 – 40)/10 = 0
  5    60       (60 – 40)/10 = +2
  6    80       (80 – 40)/10 = +4
  …
       𝜇 = 40   𝜇 = 0
       𝜎 = 10   𝜎 = 1


A practical rule of thumb then flags outliers when the absolute
value of the z-score |z| is bigger than three. Note that the z-score relies
on the normal distribution.
These methods all focus on univariate outliers. Multivariate
outliers can be detected by fitting regression lines and inspecting the
observations with large errors (using, e.g., a residual plot). Alternative
methods are clustering or calculating the Mahalanobis distance. Note,
however, that although potentially useful, multivariate outlier detection is typically not considered in many modeling exercises due to its
usually marginal impact on model performance.
Some analytical techniques (e.g., decision trees, neural networks,
SVMs) are fairly robust with respect to outliers. Others (e.g., linear/
logistic regression) are more sensitive to them. Various schemes exist
to deal with outliers; which one is appropriate highly depends on whether the outlier represents a valid or invalid observation. For invalid observations (e.g., age
is 300 years), one could treat the outlier as a missing value using any of
the schemes discussed in the previous section. For valid observations
(e.g., income is US$1,000,000), other schemes are needed. A popular
scheme is truncation/capping/winsorizing. One hereby imposes both
a lower and upper limit on a variable and any values below/above
are brought back to these limits. The limits can be calculated using
the z-scores (see Figure 2.7), or the IQR (which is more robust than the
z-scores) as follows:
Upper/Lower limit = M ± 3s, with M = median and s = IQR/(2 × 0.6745) (Van Gestel and Baesens 2009). A sigmoid transformation
ranging between 0 and 1 can also be used for capping as follows:

f(x) = 1/(1 + e⁻ˣ)

In addition, expert-based limits based on business knowledge
and/or experience can be imposed.
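As an illustration, the following Python sketch applies winsorizing with the robust IQR-based limits given above (the income values are hypothetical):

    import numpy as np

    income = np.array([1000, 1200, 1500, 1800, 2000, 1_000_000], dtype=float)

    # Limits M +/- 3s, with M the median and s = IQR/(2 x 0.6745); these are
    # barely affected by the extreme value itself.
    q1, median, q3 = np.percentile(income, [25, 50, 75])
    s = (q3 - q1) / (2 * 0.6745)
    lower, upper = median - 3 * s, median + 3 * s

    # Winsorizing: values outside the limits are brought back to the limits.
    income_capped = np.clip(income, lower, upper)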
An important remark concerning outliers is the fact that not all
invalid values are outlying and, as such, may go unnoted if not explicitly looked into. For instance, a clear issue exists when observing customers with values gender = male and pregnant = yes. Which value is
invalid, either the value for gender or pregnant, cannot be determined,
but it needs to be noted that both values are not outlying and therefore

Figure 2.7 Using the z-Scores for Truncation (values below 𝜇 − 3𝜎 or above 𝜇 + 3𝜎 are brought back to these limits)

such a conflict will not be noted by the analyst unless explicit
precautions are taken. In order to detect such invalid combinations,
one may construct a set of rules formulated based on expert knowledge
and experience (similar to a fraud detection rule engine, in fact),
which is applied to the data to check for and alert on issues. In this
particular context, a network representation of the variables may be
of use to construct the rule set and to reason about the relations that
exist between the different variables, with links representing constraints
that apply to combinations of variable values, resulting in rules added
to the rule set.

RED FLAGS
An important remark with respect to outlier treatment is to be made,
which particularly holds in a fraud detection setting. As discussed
in the introductory chapter, fraudsters may be detected by the very
fact that their behavior is different or deviant from nonfraudsters,
although most likely only slightly or in a complex (multivariate)
manner since they will cover their tracks to remain undetected.
These deviations from normality are called red flags of fraud and are
probably the most successful and widespread tool that is being used
to detect fraud. As discussed in Grabosky and Duffield (2001) in the
broadest terms, the fundamental red flag of fraud is the anomaly,
that is, a variation from predictable patterns of behavior or, simply,
something that seems out of place. Some examples of red flags follow:
Tax evasion fraud red flags:

◾ An identical financial statement, since fraudulent companies copy financial statements of nonfraudulent companies to look less suspicious
◾ Name of an accountant is unique, since this might concern a nonexisting accountant

Credit card fraud red flags:

◾ A small payment followed by a large payment immediately after, since a fraudster might first check whether the card is still active before placing a bet
◾ Regular rather small payments, which is a technique to avoid getting noticed

Telecommunications-related fraud may be reflected in the following red-flag activities (Grabosky and Duffield 2001):

◾ Long-distance access followed by reverse call charges accepted from overseas
◾ High-volume usage over short periods before disconnection
◾ A large volume of calls where one call begins shortly after the termination of another
◾ The nonpayment of bills

Such red-flag activities are typically translated into expert rules and
included in a rule engine, as discussed in Chapter 1 and sketched below.
When red flags and expert rules are defined, the reasoning behind each
red flag or rule should be documented in order to inform the inspectors
about the underlying causes of suspicion, so they can focus their
investigations on these causes.
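As a toy illustration, the following Python sketch (with hypothetical pandas columns and thresholds that an expert would have to set) encodes the small-payment-then-large-payment red flag as a rule:

    import pandas as pd

    tx = pd.DataFrame({
        "card_id": ["XWV", "XWV", "BBC"],
        "amount": [1.0, 950.0, 20.0],
    })

    # Red flag: a small payment immediately followed by a large one on the
    # same card, possibly a fraudster testing whether the card is active.
    tx["flag_test_payment"] = (
        (tx["amount"] > 500)
        & (tx["amount"].shift(1) < 5)
        & (tx["card_id"] == tx["card_id"].shift(1))
    )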
Descriptive or unsupervised learning techniques will be discussed
in Chapter 3, including several approaches that aim at detecting
such slight or complex deviations from regular behavior associated
with fraud. When handling valid outliers in the data set using the
treatment techniques discussed before, we may in fact impair the
ability of descriptive analytics to find anomalous fraud patterns.
Therefore, one should be extremely careful in treating valid outliers
when applying unsupervised learning techniques to build a fraud
detection model. This is true even when handling univariate outliers,
since these might be related to fraud not by themselves but in
combination with other variables in a multivariate manner, and as
such be used to uncover fraud. Invalid outliers, on the contrary, can
straightforwardly be treated as missing values as discussed above,
preferably by including an indicator that the value was missing or,
even more precisely, was an invalid outlier. This allows users to test
if there is any relation at all between an invalid value and fraud.

STANDARDIZING DATA
Standardizing data is a data preprocessing activity targeted at scaling
variables to a similar range. Consider, for example, two variables gender (coded as 0/1) and income (ranging between 0 and US$1,000,000).
When building logistic regression models using both information elements, the coefficient for income might become very small. Hence,
it could make sense to bring them back to a similar scale. The following
standardization procedures could be adopted:
◾ Min/max standardization:

  Xnew = (Xold − min(Xold))/(max(Xold) − min(Xold)) × (newmax − newmin) + newmin,

  whereby newmax and newmin are the newly imposed maximum and minimum (e.g., 1 and 0).

◾ z-score standardization: calculate the z-scores (see the previous section).

◾ Decimal scaling: divide by a power of 10 as follows: Xnew = Xold/10ⁿ, with n the number of digits of the maximum absolute value.

Again, note that standardization is especially useful for regression-based approaches but is not needed for decision trees, for example.
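The three procedures can be illustrated with a short Python sketch (numpy, hypothetical income values):

    import numpy as np

    x = np.array([0.0, 250_000.0, 1_000_000.0])

    # Min/max standardization to [0, 1].
    new_min, new_max = 0.0, 1.0
    x_minmax = (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

    # z-score standardization.
    x_z = (x - x.mean()) / x.std()

    # Decimal scaling: n is the number of digits of the maximum absolute value.
    n = len(str(int(np.abs(x).max())))
    x_decimal = x / 10**n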


CATEGORIZATION
Categorization (also known as coarse-classification, classing, grouping,
or binning) can be done for various reasons. For categorical variables, it is needed to reduce the number of categories. Consider, for
example, the variable country of origin having 50 different values. If
this variable were put into a regression model, one would need
49 dummy variables (50 – 1 because of the collinearity), which would
necessitate the estimation of 49 parameters for only one variable. With
categorization, one would create categories of values such that fewer
parameters have to be estimated and a more robust model is
obtained.
For continuous variables, categorization may also be very beneficial. Consider, for example, the age variable and the observed number
of fraudulent cases as depicted in Figure 2.8. Clearly, there is a nonmonotonic relation between risk of fraud and age. If a nonlinear
model (e.g., neural network, support vector machine) were used,
the nonlinearity could be perfectly modeled. However, if a regression model were used (which is typically more common because
of its interpretability), then, since it can only fit a line, it would miss out
on the nonmonotonicity. By categorizing the variable into ranges, part
of the nonmonotonicity can be taken into account in the regression.
Hence, categorization of continuous variables can be useful to model
nonlinear effects in linear models.
Figure 2.8 Fraud Risk Versus Age (observed fraud risk, roughly 0 to 0.035, plotted for ages 15 to 75; the relation is clearly nonmonotonic)


Various methods can be used to do categorization. Two very basic
methods are equal interval binning and equal frequency binning.
Consider, for example, the income values 1,000, 1,200, 1,300, 2,000,
1,800, and 1,400. Equal interval binning would create two bins with
the same range (Bin 1: 1,000–1,500; Bin 2: 1,500–2,000), whereas equal
frequency binning would create two bins with the same number of
observations (Bin 1: 1,000, 1,200, 1,300; Bin 2: 1,400, 1,800, 2,000).
However, both methods are quite basic and do not take into
account a target variable (e.g., churn, fraud, credit risk).
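Both basic methods are illustrated in the following Python sketch, which applies pandas to the income values from the text:

    import pandas as pd

    income = pd.Series([1000, 1200, 1300, 2000, 1800, 1400])

    equal_interval = pd.cut(income, bins=2)   # two bins with the same range
    equal_frequency = pd.qcut(income, q=2)    # two bins with the same counts

    print(equal_interval.value_counts())
    print(equal_frequency.value_counts())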
Many analytics software tools have built-in facilities to do categorization using Chi-squared analysis. A very handy and simple approach
(available in Microsoft Excel) is to use pivot tables. Consider the
examples in Tables 2.3 and 2.4.

Table 2.3 Coarse Classifying the Product Type Variable

  Customer ID   Age   Product Type   Fraud
  C1            44    clothes        No
  C2            20    books          No
  C3            58    music          Yes
  C4            26    clothes        No
  C5            30    electro        Yes
  C6            32    games          No
  C7            48    books          Yes
  C8            60    clothes        No
  …

Table 2.4 Pivot Table for Coarse Classifying the Product Type Variable

         Clothes   Books   Music   Electro   Games   …
  Good   1,000     2,000   3,000   100       5,000
  Bad    500       100     200     80        800
  Odds   2         20      15      1.25      6.25


Starting from data as in Table 2.3, one can construct a pivot table
with the odds per value, as in Table 2.4, and then categorize the values
based on similar odds: for example, category 1 (clothes, electro),
category 2 (games), and category 3 (books and music).
Chi-squared analysis is a more sophisticated way to do coarse classification. Consider, for example, Table 2.5 for coarse classifying a variable product type.
Suppose we want three categories and consider the following
options:

◾ Option 1: clothes; books and music; others
◾ Option 2: clothes; electro; others

Both options can now be investigated using Chi-squared analysis. The purpose hereby is to compare the empirically observed
frequencies with the frequencies expected under independence. For
option 1, the empirically observed frequencies are depicted in Table 2.6.
The independence frequencies can be calculated as follows. The
number of nonfraud observations given that the odds are the same as
in the whole population is 6,300/10,000 × 9,000/10,000 × 10,000 =
5,670. One then obtains Table 2.7.

Table 2.5 Coarse Classifying the Product Type Variable

  Attribute             Clothes   Books   Music   Electro   Games   Movies   Total
  No-fraud              6,000     1,600   350     950       90      10       9,000
  Fraud                 300       400     140     100       50      10       1,000
  No-fraud:Fraud odds   20:1      4:1     2.5:1   9.5:1     1.8:1   1:1      9:1

Table 2.6 Empirical Frequencies Option 1 for Coarse Classifying Product Type

  Attribute   Clothes   Books & Music   Others   Total
  No-fraud    6,000     1,950           1,050    9,000
  Fraud       300       540             160      1,000
  Total       6,300     2,490           1,210    10,000


Table 2.7 Independence Frequencies Option 1 for Coarse Classifying Product Type

  Attribute   Clothes   Books & Music   Others   Total
  No-fraud    5,670     2,241           1,089    9,000
  Fraud       630       249             121      1,000
  Total       6,300     2,490           1,210    10,000

The more the numbers in both tables differ, the less independence
and the stronger the dependence, hence the better the coarse classification. Formally,
one can calculate the Chi-squared distance as follows:

𝜒² = (6,000 − 5,670)²/5,670 + (300 − 630)²/630 + (1,950 − 2,241)²/2,241 + (540 − 249)²/249 + (1,050 − 1,089)²/1,089 + (160 − 121)²/121 = 583

Likewise, for option 2, the calculation becomes:

𝜒² = (6,000 − 5,670)²/5,670 + (300 − 630)²/630 + (950 − 945)²/945 + (100 − 105)²/105 + (2,050 − 2,385)²/2,385 + (600 − 265)²/265 = 662

So, based on the Chi-squared values, option 2 is the better categorization. Note that formally, one needs to compare the value with
a Chi-squared distribution with k−1 degrees of freedom with k the
number of values of the characteristic.
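As an illustration, the following Python sketch (using scipy) reproduces these Chi-squared values from the observed frequencies of Tables 2.5 and 2.6; for option 2, books, music, games, and movies are grouped as others:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows: no-fraud / fraud; columns: the three candidate categories.
    option1 = np.array([[6000, 1950, 1050],
                        [ 300,  540,  160]])   # clothes; books & music; others
    option2 = np.array([[6000,  950, 2050],
                        [ 300,  100,  600]])   # clothes; electro; others

    for name, table in [("option 1", option1), ("option 2", option2)]:
        chi2, p, dof, expected = chi2_contingency(table, correction=False)
        print(name, round(chi2))   # ~583 and ~662; the higher, the better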

WEIGHTS OF EVIDENCE CODING
Categorization reduces the number of categories for categorical variables. For continuous variables, categorization will introduce new
variables. Consider, for example, a regression model with age (four
categories, so three parameters) and product type (five categories, so
four parameters) characteristics. The model then looks as follows:
Y = 𝛽0 + 𝛽1 Age1 + 𝛽2 Age2 + 𝛽3 Age3 + 𝛽4 Prod1 + 𝛽5 Prod2 + 𝛽6 Prod3 + 𝛽7 Prod4


Despite having only two characteristics, the model still needs
eight parameters to be estimated. It would be handy to have a monotonic transformation f(.) such that our model could be rewritten
as follows:
Y = 𝛽0 + 𝛽1 f (Age1 , Age2 , Age3 ) + 𝛽2 f (Prod1 , Prod2 , Prod3 , Prod4 )
The transformation should have a monotonically increasing or
decreasing relationship with Y. Weights-of-evidence coding is one
example of a transformation that can be used for this purpose. This is
illustrated in Table 2.8.
The WOE is calculated as: ln(Dist No-fraud/Dist Fraud). Because of
the logarithmic transformation, a positive (negative) WOE means Dist
No-fraud > (<) Dist Fraud. The WOE transformation thus implements
a transformation monotonically related to the target variable.
The model can then be reformulated as follows:
Y = 𝛽0 + 𝛽1 WOEage + 𝛽2 WOEproduct-type
This gives a more concise model than the model that we started
this section with. However, note that the interpretability of the model
becomes somewhat less straightforward when WOE variables are
being used.
Table 2.8 Calculating Weights of Evidence (WOE)

  Age       Count   Distr. Count   No-fraud   Distr. No-fraud   Fraud   Distr. Fraud   WOE
  Missing   50      2.50%          42         2.33%             8       4.12%          −57.28%
  18–22     200     10.00%         152        8.42%             48      24.74%         −107.83%
  23–26     300     15.00%         246        13.62%            54      27.84%         −71.47%
  27–29     450     22.50%         405        22.43%            45      23.20%         −3.38%
  30–35     500     25.00%         475        26.30%            25      12.89%         71.34%
  35–44     350     17.50%         339        18.77%            11      5.67%          119.71%
  44+       150     7.50%          147        8.14%             3       1.55%          166.08%
  Total     2,000                  1,806                        194
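As an illustration, the following Python sketch (pandas and numpy) reproduces the WOE column of Table 2.8 from the category counts (the WOE is multiplied by 100, as in the table):

    import numpy as np
    import pandas as pd

    counts = pd.DataFrame(
        {"no_fraud": [42, 152, 246, 405, 475, 339, 147],
         "fraud":    [ 8,  48,  54,  45,  25,  11,   3]},
        index=["Missing", "18-22", "23-26", "27-29", "30-35", "35-44", "44+"],
    )

    dist_no_fraud = counts["no_fraud"] / counts["no_fraud"].sum()
    dist_fraud = counts["fraud"] / counts["fraud"].sum()
    counts["woe"] = 100 * np.log(dist_no_fraud / dist_fraud)
    print(counts)   # reproduces the WOE column of Table 2.8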


VARIABLE SELECTION
Many analytical modeling exercises start with tons of variables, of
which typically only a few actually contribute to the prediction of
the target variable. For example, the average fraud detection model
has somewhere between 10 and 15 variables. The key question is how to find these variables. Filters are a very handy variable
selection mechanism. They work by measuring univariate correlations
between each variable and the target. As such, they allow for a quick
screening of which variables should be retained for further analysis.
Various filter measures have been suggested in the literature. One can
categorize them as depicted in Table 2.9.
The Pearson correlation 𝜌P is calculated as follows:

𝜌P = Σi=1..n (Xi − X̄)(Yi − Ȳ) / (√(Σi=1..n (Xi − X̄)²) × √(Σi=1..n (Yi − Ȳ)²))

It measures a linear dependency between two variables and always
varies between −1 and +1. To apply it as a filter, one could select
all variables for which the Pearson correlation is significantly different from 0 (according to the p-value), or for example, the ones where
|ρP | > 0.50.
The Fisher score can be calculated as follows:

|X̄G − X̄B| / √(s²G + s²B),
Table 2.9 Filters for Variable Selection

                         Continuous Target     Categorical Target (e.g.,
                         (e.g., CLV, LGD)      churn, fraud, credit risk)
  Continuous variable    Pearson correlation   Fisher score
  Categorical variable   Fisher score/ANOVA    Information value
                                               Cramer’s V
                                               Gain/entropy


where X̄G (X̄B) represents the average value of the variable for the nonfraudsters (fraudsters) and s²G (s²B) the corresponding variances. High
values of the Fisher score indicate a predictive variable. To apply it as
a filter, one could, for example, keep the top 10 percent. Note that
the Fisher score generalizes to the well-known analysis of variance
(ANOVA) in case a variable has multiple categories.
The information value (IV) filter is based on weights of evidence
and is calculated as follows:

IV = Σi=1..k (Dist No-fraudi − Dist Fraudi) × WOEi,

whereby k represents the number of categories of the variable. For the
example discussed in Table 2.10, the calculation becomes as shown.
The following rules of thumb apply for the information value:
◾ < 0.02: unpredictive
◾ 0.02–0.1: weak predictive
◾ 0.1–0.3: medium predictive
◾ > 0.3: strong predictive

Note that the information value assumes that the variable has been
categorized. It can actually also be used to adjust/steer the categorization so as to optimize the IV. Many software tools will provide
Table 2.10 Calculating the Information Value Filter Measure

  Age       Count   Distr. Count   No-fraud   Distr. No-fraud   Fraud   Distr. Fraud   WOE        IV
  Missing   50      2.50%          42         2.33%             8       4.12%          −57.28%    0.0103
  18–22     200     10.00%         152        8.42%             48      24.74%         −107.83%   0.1760
  23–26     300     15.00%         246        13.62%            54      27.84%         −71.47%    0.1016
  27–29     450     22.50%         405        22.43%            45      23.20%         −3.38%     0.0003
  30–35     500     25.00%         475        26.30%            25      12.89%         71.34%     0.0957
  35–44     350     17.50%         339        18.77%            11      5.67%          119.71%    0.1568
  44+       150     7.50%          147        8.14%             3       1.55%          166.08%    0.1095
  Information Value                                                                               0.6502


interactive support to do this, whereby the modeler can adjust the
categories and gauge the impact on the IV. To apply it as a filter,
one can calculate the information value of all (categorical) variables
and only keep those for which the IV > 0.1 or, for example, the top
10 percent.
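As an illustration, the following Python sketch computes the information value from the same category counts as Table 2.10 and reproduces its total of about 0.65:

    import numpy as np
    import pandas as pd

    counts = pd.DataFrame(
        {"no_fraud": [42, 152, 246, 405, 475, 339, 147],
         "fraud":    [ 8,  48,  54,  45,  25,  11,   3]},
        index=["Missing", "18-22", "23-26", "27-29", "30-35", "35-44", "44+"],
    )

    dist_nf = counts["no_fraud"] / counts["no_fraud"].sum()
    dist_f = counts["fraud"] / counts["fraud"].sum()
    woe = np.log(dist_nf / dist_f)

    iv = ((dist_nf - dist_f) * woe).sum()
    print(round(iv, 4))   # ~0.65: "strong predictive" by the rule of thumb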
Another filter measure based on Chi-squared analysis is Cramer’s V.
Consider, for example, the contingency table depicted in Table 2.11 for
online/offline transaction versus nonfraud/fraud.
Similar to the example discussed in the section on categorization,
the Chi-squared value for independence can then be calculated as
follows:
𝜒² = (500 − 480)²/480 + (100 − 120)²/120 + (300 − 320)²/320 + (100 − 80)²/80 = 10.42

This follows a Chi-squared distribution with k – 1 degrees of
freedom, with k being the number of classes of the characteristic.
The Cramer’s V measure can then be calculated as follows:

Cramer’s V = √(𝜒²/n) = 0.10,

with n the number of observations in the data set. Cramer’s V is always
bounded between 0 and 1 and higher values indicate better predictive
power. As a rule of thumb, a cut-off of 0.1 is commonly adopted. One
can then again select all variables where Cramer’s V is bigger than 0.1,
or consider, for example, the top 10 percent. Note that the Information Value and Cramer’s V typically consider the same characteristics
as most important.

Table 2.11 Contingency Table for Online/Offline
Transactions versus Nonfraud/Fraud

            Nonfraud   Fraud   Total
  Offline   500        100     600
  Online    300        100     400
  Total     800        200     1,000
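As an illustration, the following Python sketch (numpy and scipy) computes Cramer’s V for the counts of Table 2.11:

    import numpy as np
    from scipy.stats import chi2_contingency

    table = np.array([[500, 100],    # offline: nonfraud, fraud
                      [300, 100]])   # online:  nonfraud, fraud

    chi2, p, dof, _ = chi2_contingency(table, correction=False)
    cramers_v = np.sqrt(chi2 / table.sum())
    print(round(cramers_v, 2))   # 0.10, right at the commonly adopted cut-off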


Filters are very handy, as they allow a quick reduction of the
number of dimensions of the data set early in the analysis.
Their main drawback is that they work univariately and typically
do not consider correlations among the dimensions.
Hence, a follow-up input selection step during the
modeling phase will be necessary to further refine the set of characteristics.
Also worth mentioning here is that other criteria may play a role in
selecting variables, such as regulatory compliance and privacy issues.
Note that different regulations may apply in different geographical
regions and hence should be checked. Also operational issues could be
considered. For example, trend variables could be very predictive but
might require too much time to be computed in a real-time, online
fraud detection environment.

PRINCIPAL COMPONENTS ANALYSIS
An alternative method for input or variable selection is principal component analysis, which is a technique to reduce the dimensionality of
data by forming new variables that are linear composites of the original variables. These new variables describe the main components or
dimensions that are present in the original data set, hence its name.
The main dimensions may be, and often are, different from the imposed
measurement dimensions, and are obtained as linear combinations of those. Figure 2.9 provides a two-dimensional illustration
of this. The two measurement dimensions represented by the X and Y
axes do not adequately capture the actual dimensions or components
present in the data. These are clearly situated in a 45-degree angle

Figure 2.9 Illustration of Principal Component Analysis in a Two-Dimensional Data Set (the principal components PC1 and PC2 lie at a 45-degree angle to the measurement axes X and Y)


compared to the X and Y dimensions and are described or captured
by the two principal components PC1 and PC2 .
The maximum number of new variables that can be formed or
derived from the original data (i.e., the number of principal components) is equal to the number of original variables. However, when the
aim is data reduction, then typically a reduced set of principal components is sufficient to replace the original larger set of variables, since
most of the variance in the original set of variables will be explained
by a limited number of principal components. In other words, most
of the information that is contained in the large set of original variables typically can be summarized by a small number of new variables.
To explain all the variance in the original data set, the full set of principal components is needed, but some of these will only account for a
very small fraction of variance and therefore can be left out, leading to
a reduced dimensionality or number of variables in the data set.
Example: A data set contains 80 financial ratio variables describing the financial situation or health of a firm, which may be indicative or relevant for detecting fraud. However, many of these financial
ratios are typically strongly correlated. In other words, many of these
ratios overlap, meaning that they basically express the same information. This may be explained by the fact that these ratios are derived
from and summarize the same basic information. Therefore, one could
prefer to combine or summarize the original large set of ratios by a
reduced number of financial indices, as can be done by performing a
principal component analysis.
Moreover, the new limited set of financial indices should preferably be uncorrelated, such that they can be included in a fraud-detection
model without causing the final model to become unstable. Correlation
among the explanatory or predictor variables, which is called multicollinearity, may result in unstable models. The stability or robustness
of a model refers to the stability of the exact values of the parameters of the model that are being estimated based on the sample of
observations. If the values of these parameters heavily depend on the
exact sample of observations used to induce the model, then the model
is called unstable. The values of the parameters, in fact, express the
relation between the explanatory or predictor variables and the dependent or target variable. When the exact relation differs strongly for
different samples of observations, then questions arise with respect to
the exact nature and reliability of this presumed relation. When the
explanatory variables included in a model are correlated, typically, the
resulting model is unstable. Therefore, an input selection procedure is
often performed—for example, using the filter approach discussed in
the previous paragraph, or alternatively a new set of factors may be
derived using principal component analysis to address this problem,
since the resulting new variables (i.e., the principal components, will
be uncorrelated among themselves).
Principal components are calculated by making use of the
eigenvector decomposition (which will not be explained within
the scope of this book; interested readers may refer to specialized
literature on principal component analysis and eigenvector decomposition). Let X1, X2, …, Xp be the mean-corrected, standardized original
variables, and Σ = cov(X) the corresponding covariance matrix. Let
𝜆1 ≥ 𝜆2 ≥ … ≥ 𝜆p ≥ 0 be the p eigenvalues of Σ and e1, e2, …, ep the
corresponding eigenvectors. The principal components PCj, j = 1, …, p,
corresponding with and in fact replacing the variables X1, X2, …, Xp,
are then given by:

PCj = e′jX = ej1X1 + ej2X2 + ⋯ + ejpXp = X′ej
The eigenvectors express the importance of each of the original
variables in the construction of the new variables. The eigenvectors
determine how the original variables are combined into new variables,
and in what proportions.
The observed variance in data set X that is explained by principal
component PCj is equal to the corresponding eigenvalue 𝜆j. The total
variance or information in the data remains constant, as the sum of the variances of the principal components—that
is, the new variables—is equal to the sum of the variances of the original variables. Also note that the covariance or correlation between two
principal components is equal to zero, cov(PCi, PCj) = 0 for i ≠ j.
For each observation in the data set, the values for the new variables can be calculated. These values are called the PC scores and can
be calculated by making use of the eigenvectors as weights on the
mean-corrected data.


Example: A data set consists of two original variables X1 and X2 ,
which we will replace by two principal components. The first step
consists of calculating the two eigenvectors e1 and e2 of the covariance
matrix Σ, which can be done straightforwardly using the available observations in the data set. As such, we get e1 = (e11 , e12 ) =
(0.562, 0.345) and e2 = (e21 , e22 ) = (−0.345, 0.562). Subsequently, in
a second step the PC scores for the two principal components (i.e.,
the two new variables) are calculated as a function of the original,
mean-corrected, values (x1 , x2 ):
PC1 = e11 x1 + e12 x2 = 0.562x1 + 0.345x2
PC2 = e21 x1 + e22 x2 = −0.345x1 + 0.562x2
Filling out the values x1 and x2 for the two original variables for
each observation in the data set then gives the new observations with
the values for the new variables.
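As an illustration, the following Python sketch (using scikit-learn on hypothetical, strongly correlated data) shows the typical data-reduction decision based on the eigenvalues:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    x2 = x1 + 0.1 * rng.normal(size=200)   # strongly correlated with x1
    X = np.column_stack([x1, x2])

    pca = PCA()
    scores = pca.fit_transform(X)           # the PC scores (new variables)
    print(pca.explained_variance_ratio_)    # lambda_1 >> lambda_2: drop PC2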
How does the transformation of the two original variables in the
above example lead to reducing the data set and to selecting inputs?
By looking at the eigenvalues, one may decide about leaving out new
variables. If, in the above example, the eigenvalue corresponding to PC1
is significantly larger than the eigenvalue corresponding to PC2 (i.e., if
𝜆1 ≫ 𝜆2), then it can be decided to drop PC2 from the further analysis.
Note from this simple example that replacing the original variables
with a (reduced) set of uncorrelated principal components comes at
a price—reduced interpretability. The principal component variables
derived from the original set of variables cannot easily be interpreted,
since they are calculated as a weighted linear combination of the original variables. In the previous example, only two original variables were
combined into principal components, still allowing some interpretation
of the resulting principal components. But when the analysis spans
tens or even hundreds or thousands of variables, any interpretation of the resulting components becomes practically impossible. As discussed
in Chapter 1, in certain settings this might be unacceptable, since the
analysts using the resulting model can no longer interpret it. However,
when interpretability is no concern, then principal component analysis
is a powerful data reduction tool that will yield a better model in terms
of stability as well as predictive performance.


RIDITS
As an alternative to weights of evidence values, one may adopt another
approach to assign numerical values to categorical ordinal variables,
called RIDIT scoring, introduced by Bross (1958), who coined the term
RIDIT in analogy to logit and probit. The following discussion of RIDITs
and PRIDITs is based on a study by Brocket et al. (2002), who adopted
and adapted RIDITs for fraud detection in an unsupervised setting.
The RIDIT scoring mechanism incorporates the ranked nature of
responses, that is, categories of an ordinal categorical variable. Assume
the different response categories are ordered in decreasing likelihood of
fraud suspicion so that a higher categorical response indicates a lesser
suspicion of fraud. In ranking the categories from high- to low-fraud
risk, one may use expert input or historical observed fraud rates. The
RIDIT score for a categorical response value i to variable t, with p̂tj indicating the proportion of the population having value j for variable t, is
then calculated as follows:

Bti = Σj<i p̂tj − Σj>i p̂tj,   i = 1, 2, …, kt

The above formula transforms a set of categorical responses into
a set of meaningful numerical values in the interval [–1,1], reflecting the relative abnormality of a particular response. Intuitively, the
RIDIT score can be interpreted to be an adjusted or transformed percentile score.
Example: A binary response fraud indicator variable with value yes
occurring for 10 percent of the cases and considered by experts more
indicative of fraud than a value no, occurring for the other 90 percent of
the cases, results in RIDIT scores Bt1 (“yes”) = −0.9 and Bt2 (“no”) = 0.1.
For a similar binary fraud indicator with 50 percent of the cases having
a value “yes” and 50 percent having a value “no,” the resulting RIDIT
scores are Bt1 (“yes”) = −0.5 and Bt2 (“no”) = 0.5. This clearly indicates
that a response “yes” on the first indicator variable is more abnormal
or indicative of fraud than a response “yes” on the second indicator,
and as such the transformation yields RIDIT scores that can be easily
included in a quantitative model and make sense from an operational
or expert perspective.


RIDIT scores can also be calculated for ordinal categorical variables
with more than two categories using the above formula.
RIDIT scores may be used to replace the categorical fraud indicator
values, and as such allow these categorical variables to be directly
integrated in any numerical analysis for fraud detection.
Note that calculating RIDIT scores does not require the actual target
values to be known, as is the case for the weights-of-evidence calculation.
Therefore, RIDIT scores can be used in an unsupervised learning setting
and when no labeled historical observations are available.
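As an illustration, the following Python sketch (numpy) implements the RIDIT formula and reproduces the binary example above (10 percent “yes”, 90 percent “no”, ordered from most to least fraud-suspicious):

    import numpy as np

    p = np.array([0.10, 0.90])   # proportions per ordered category

    cum_below = np.concatenate(([0.0], np.cumsum(p)[:-1]))  # sum over j < i
    cum_above = 1.0 - cum_below - p                          # sum over j > i
    ridit = cum_below - cum_above
    print(ridit)   # [-0.9  0.1], as in the example in the text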

PRIDIT ANALYSIS
PRIDIT analysis combines the two techniques described in the
previous two sections and results in overall fraud suspicion scores calculated from a set of ordinal categorical fraud indicators. As such, PRIDIT
analysis may be used to assemble these indicators into a single variable
that can be included in any further analysis. Alternatively, PRIDIT analysis can be used as a filter approach to reduce the number of indicator
variables included in the further analysis, as well as to produce the final
fraud suspicion score. Given the first two uses, we include PRIDIT
analysis in this chapter, although it could also be considered an unsupervised learning technique for fraud detection as discussed in Chapter 3.
The reader may refer to Brocket et al. (2002) for an extensive discussion regarding the mathematical derivation and interpretation of
PRIDIT scores.
Assume that only a set of ordinal categorical fraud indicators is
available, transformed into RIDIT scores as discussed in the above
section. Let F = (fit ) denote the matrix of individual RIDIT variable
scores for each of the variables t = 1, 2,…, m, for each of the cases
i = 1, 2,…, n to be analyzed and scored for fraud. A straightforward
overall fraud suspicion score aggregating these individual RIDIT scores
for the available fraud indicator variables can simply be calculated
by summing all the individual RIDITs. We then get the PRIDIT score
vector by multiplying the matrix F with a unity weight vector W =
(1, 1,…, 1)′ , with the prime indicating the transposed vector, that is:
S = FW


Note that these simple aggregated suspicion scores are equally
dependent on each of the indicator variables, since the weights in
vector W, which determine the impact of an indicator on the resulting
score, are all set equal to one. However, clearly not every indicator is
equally related to fraud and thus equally useful as a predictor or warning
signal of fraud. Hence, an effective overall suspicion score should not
assign equal importance to each indicator.
A smarter aggregation of the individual indicators assesses their relative
importance and weighs the indicators accordingly when aggregating
them into a single overall suspicion score.
The intuition underlying the calculation of PRIDIT scores is to
adapt the weights according to the correlation or consistency between
the individual RIDIT scores and the resulting overall score. Basically,
PRIDIT scores assign higher weights to an individual fraud indicator
variable if the RIDIT scores of this variable over all cases included in
the analysis are in line with the resulting overall suspicion score. On
the other hand, when a variable is less consistent with the overall
score, then it receives a lower weight. As elaborated in Brockett et al.
(2002), a meaningful set of weights can be obtained by calculating the
first principal component, as discussed in a previous section, of the
matrix F ′ F. The first principal component is the weight vector that is
used in calculating the PRIDIT scores, assigning a relative importance
to each indicator according to the intuitive consistency principle
discussed in this paragraph.
The PRIDIT approach can as such be used for variable selection,
since indicators receiving weights that are not significantly different
from zero may be removed from the data set. Alternatively, similar to the use of principal component analysis for variable reduction, the set of ordinal
indicators aggregated into the PRIDIT score may be replaced by this
score, depending on the purpose and setup of the analysis.
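A sketch of the PRIDIT computation along these lines: take the dominant eigenvector of F′F as the weight vector and score S = FW. The toy RIDIT matrix below is made up purely for illustration:

```python
import numpy as np

def pridit_scores(F):
    """PRIDIT scores S = F w, where w is the first principal component
    (dominant eigenvector) of F'F."""
    eigvals, eigvecs = np.linalg.eigh(F.T @ F)   # F'F is symmetric
    w = eigvecs[:, np.argmax(eigvals)]           # dominant eigenvector
    if w.sum() < 0:
        w = -w                                   # sign convention for readability
    return F @ w, w

# Toy RIDIT matrix: 5 cases x 3 fraud indicators (illustrative values only)
F = np.array([
    [-0.9, -0.9, -0.5],
    [-0.9,  0.1, -0.5],
    [ 0.1,  0.1,  0.5],
    [ 0.1, -0.9,  0.5],
    [ 0.1,  0.1,  0.5],
])
scores, weights = pridit_scores(F)
print(weights)  # relative importance of each indicator
print(scores)   # overall suspicion score per case (more negative = more suspicious)
```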

SEGMENTATION
Sometimes the data are segmented before the analytical modeling
starts. A first reason for this could be strategic. For example, banks
might want to adopt special strategies to specific segments of customers. It could also be motivated from an operational viewpoint.


For example, new customers must have separate models because the
characteristics in the standard model do not make sense operationally
for them. Segmentation could also be needed to take into account
significant variable interactions. For example, if one variable strongly
interacts with a number of others, it might be sensible to segment
according to this variable.
The segmentation can be conducted using the experience and knowledge of a business expert, or it could be based on statistical analysis using, for example, decision trees (cf. infra), k-means clustering, or self-organizing maps (cf. infra).
Segmentation is a very useful preprocessing activity since one
can now estimate different analytical models each tailored to a
specific segment. However, one needs to be careful with it since, by
segmenting, the number of analytical models to estimate will increase,
which will obviously also increase the production, monitoring, and
maintenance costs.

REFERENCES
Armstrong, J. S. (2001). Selecting Forecasting Methods. In J.S. Armstrong,
ed. Principles of Forecasting: A Handbook for Researchers and Practitioners.
New York: Springer Science + Business Media, pp. 365–386.
Baesens, B. (2014). Analytics in a Big Data World: The Essential Guide to Data Science
and Its Applications. Hoboken, NJ: John Wiley & Sons.
Bolton, R. J., & Hand, D. J. (2002). Statistical Fraud Detection: A Review.
Statistical Science, 17 (3): 235–249.
Caron, F., Vanden Broucke, S., Vanthienen, J., & Baesens, B. (2013). Advanced
Rule-Based Process Analytics: Applications for Risk Response Decisions
and Management Control Activities. Expert Systems with Applications,
Submitted.
Chakraborty, G., Murali, P., & Satish, G. (2013). Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS. Cary, N.C.: SAS Institute.
Cressey, D. R. (1953). Other People’s Money; A Study of the Social Psychology of
Embezzlement. New York: Free Press.
Cullinan, G. J. (1977). Picking Them by Their Batting Averages’ Recency–Frequency–
Monetary Method of Controlling Circulation, Manual Release 2103. New York:
Direct Mail/Marketing Association.
Duffield, G., & Grabosky, P. (2001). The Psychology of Fraud. In Trends and
Issues in Crime and Criminal Justice, Australian Institute of Criminology (199).


Elder IV, J., & Thomas, H. (2012). Practical Text Mining and Statistical Analysis for
Non-Structured Text Data Applications. New York: Academic Press.
Fawcett, T., & Provost, F. (1997). Adaptive Fraud Detection. Data Mining and
Knowledge Discovery 1–3(3): 291–316.
Grabosky, P., & Duffield, G. (2001). Red Flags of Fraud. Trends and Issues in Crime
and Criminal Justice, Australian Institute of Criminology (200).
Han, J., & Kamber, M. (2007). Data Mining: Concepts and Techniques, Third Edition. Morgan Kaufmann.
Hand, D. (2007, September). Statistical Techniques for Fraud Detection,
Prevention, and Evaluation. Paper presented at the NATO ASI: Mining
Massive Data sets for Security, London, England.
Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. Cambridge, MA: Bradford.
Jamain, A. (2001). Benford’s Law. London: Imperial College.
Junqué de Fortuny, E., Martens, D., & Provost, F. (2013). Predictive Modeling
with Big Data: Is Bigger Really Better? Big Data 1(4): 215–226.
Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data, Second
Edition, New York: John Wiley & Sons, p. 408.
Maydanchik, A. (2007). Data Quality Assessment. Bradley Beach, NC: Technics
Publications.
Navarette, E. (2006). Practical Calculation of Expected and Unexpected Losses
in Operational Risk by Simulation Methods (Banca & Finanzas: Documentos de Trabajo, 1 (1): pp. 1–12).
Petropoulos, F., Makridakis, S., Assimakopoulos, V., & Nikolopoulos, K. (2014).
“Horses for Courses” in Demand Forecasting. European Journal of Operational Research.
Schneider, F. (2002). Size and measurement of the informal economy in 110
countries around the world. In Workshop of Australian National Tax Centre,
ANU, Canberra, Australia.
Tan, P.-N. N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining.
Boston: Addison Wesley.
Van Gestel, T., & Baesens, B. (2009). Credit Risk Management: Basic Concepts:
Financial Risk Components, Rating Analysis, Models, Economic and Regulatory
Capital. Oxford: Oxford University Press.
Van Vlasselaer, V., Eliassi-Rad, T., Akoglu, L., Snoeck, M., & Baesens, B.
(2015). Gotcha! Network-based Fraud Detection for Social Security
Fraud. Management Science, Submitted.
Verbeke, W., Dejaeger, K., Martens, D., Hur, J., & Baesens, B. (2012). New
Insights into Churn Prediction in the Telecommunication Sector: A Profit
Driven Data Mining Approach. European Journal of Operational Research
218: 211–229.

CHAPTER 3

Descriptive Analytics for Fraud Detection

INTRODUCTION
Descriptive analytics or unsupervised learning aims at finding unusual, anomalous behavior deviating from the average behavior or norm
(Bolton and Hand 2002). This norm can be defined in various ways.
It can be defined as the behavior of the average customer at a
snapshot in time, or as the average behavior of a given customer
across a particular time period, or as a combination of both. Predictive
analytics or supervised learning, as will be discussed in the following
chapter, assumes the availability of a historical data set with known
fraudulent transactions. The analytical models built can thus only
detect fraud patterns as they occurred in the past. Consequently, it
will be impossible to detect previously unknown fraud. Predictive
analytics can however also be useful to help explain the anomalies
found by descriptive analytics, as we will discuss later.
When used for fraud detection, unsupervised learning is often
referred to as anomaly detection, since it aims at finding anomalous
and thus suspicious observations. In the literature, anomalies are commonly described as outliers or exceptions. One of the first definitions
of an outlier was provided by Grubbs (1969), as follows:

“An outlying observation, or outlier, is one that appears to
deviate markedly from other members of the sample in
which it occurs.”
A first challenge when using unsupervised learning is to define the
average behavior or norm. Typically, this will highly depend on the
application field considered. Also the boundary between the norm and
the outliers is typically not clear-cut. As said earlier, fraudsters will try
to blend into the average or norm as well as possible, hereby substantially complicating their detection and the corresponding definition of
the norm. Furthermore, the norm may change over time, so the analytical models built need to be continuously monitored and updated,
possibly in real-time. Finally, anomalies do not necessarily represent
fraudulent observations. Hence, the usage of unsupervised learning
for fraud detection requires extensive follow-up and validation of the
identified suspicious observations.

Unsupervised learning can be useful for organizations that start
doing fraud detection and thus have no labeled historical data set
available. It can also complement existing fraud models by uncovering
new fraud mechanisms. This is especially relevant in environments
where fraudsters are continuously adapting their strategies to beat
the detection methods. A first example of this is credit card fraud
whereby fraudsters continuously try out new ways of committing
fraud. Another example is intrusion detection in a cyber-fraud setting.
Supervised methods are based on known intrusion patterns, whereas
unsupervised methods or anomaly detection can identify emerging
cyber threats.
In this chapter, we will explore various unsupervised techniques
to detect fraud.

GRAPHICAL OUTLIER DETECTION PROCEDURES
To detect one-dimensional outliers, a histogram or box plot can be used
(see Chapter 2). Two-dimensional outliers can be detected using a scatter plot. The latter can also be extended to a three-dimensional setting,
whereby spinning facilities can be handy to rotate the graph so as to
facilitate finding the outliers. This is illustrated in Figure 3.1. The plot
clearly shows three outliers marked by asterisks (*) representing claims
with an unusually high amount, a high number of cars of the claimant,
and a small number of days since the previous claim. Clearly, these
claims are suspicious and should be further investigated.
Ideally, graphical methods should be complemented with multidimensional data analysis and online analytical processing (OLAP) facilities. Figure 3.2 shows an example of an OLAP cube representing the
distribution or count of claims based on amount of claim, number of
cars, and recency of previous claim. Once the cube has been populated from a data warehouse or transactional data source, the following
OLAP operations can be performed:
◾ Roll-up: The idea here is to aggregate across one or more dimensions. An example of this is the distribution of amount of claim and recency aggregated across all number of cars (roll-up of the number of cars dimension). Another example is the distribution of amount of claim aggregated across all number of cars and all number of days since previous claim (roll-up of the number of cars and number of days since previous claim dimensions).

◾ Drill-down: This is the opposite of roll-up, whereby more detail is asked for by adding another dimension to the analysis.

◾ Slicing: The idea here is to pick a slice along one of the dimensions. An example is to show the distribution of amount and recency for all claims where the claimant has four or more cars (slice along the number of cars dimension).

◾ Dicing: The idea here is to fix values for all the dimensions and create a sub-cube. An example is to show the distribution of amount, number of cars, and recency for all claims where the amount is between 1,000 and 10,000, the number of cars is two or three, and the recency is between one and six months.

Figure 3.1 3D Scatter Plot for Detecting Outliers (amount of claim, number of cars, and number of days since previous claim)

Figure 3.2 OLAP Cube for Fraud Detection (claim counts by amount of claim, most recent claim, and number of cars)

Many software tools are available to support OLAP analysis. They
excel in offering powerful visualizations, sometimes even augmented
with virtual reality technology for better detecting relationships in
the data. OLAP tools will also typically implement pivot tables for
multidimensional data analysis. Pivot tables allow analysts to summarize tabular data by cross-tabulating user-specified dimensions using
a drag and drop interface. This is illustrated in Figure 3.3 for a credit
card fraud detection data set analyzed in Microsoft Excel. You can see
that the data set has 2,397 observations with 2,364 nonfrauds and
33 frauds. The columns depict the average for the recency, frequency,
and monetary (RFM) variables. Note that the data depicted has
been filtered to non-EU transactions as depicted in the upper-left
corner cell B2. From the results, it can be seen that when looking at
non-EU transactions, fraudulent transactions have a lower average
recency, higher average frequency, and higher average monetary
value when compared to nonfraudulent transactions. These are very
interesting starting insights to further explore fraud patterns in the
data. The pivot table can be easily manipulated using the panel to the
right. Other filters can be defined, columns and rows can be added,
and descriptive statistics or summarization measures can be varied
(e.g., count, minimum, maximum).
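A comparable summary is easy to produce programmatically; the following is a sketch with pandas, where the file name and the columns Region, IsFraud, Recency, Frequency, and Monetary are hypothetical:

```python
import pandas as pd

# Hypothetical transaction data set with RFM variables and a fraud label
df = pd.read_csv("transactions.csv")

# Restrict to non-EU transactions, then average the RFM variables per label
pivot = pd.pivot_table(
    df[df["Region"] == "non-EU"],
    index="IsFraud",                       # rows: fraud versus no fraud
    values=["Recency", "Frequency", "Monetary"],
    aggfunc="mean",                        # could also be count, min, max
)
print(pivot)
```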
Graphical and OLAP methods are handy and easy to work with.
They are ideal tools to explore the data and get preliminary insights.

Figure 3.3 Example Pivot Table for Credit Card Fraud Detection


However, they are less formal and limited to only a few dimensions. They require active involvement of the end user in detecting the anomalies. In other words, the user should select the dimensions and decide on the OLAP routines to be performed. For a high-dimensional data set, this may be a cumbersome exercise. Besides being used during preprocessing, OLAP facilities are also becoming more and more popular for model post-processing and monitoring, as we will discuss later.

STATISTICAL OUTLIER DETECTION PROCEDURES
A first well-known statistical outlier detection method is calculating
the z-scores, as previously discussed in Chapter 2. Remember, observations for which the absolute value of the z-score is bigger than 3 can
be considered as outliers. A more formal test is the Grubbs test, which
is formulated as follows (see Grubbs 1950):
H0 : There are no outliers in the data set.
HA : There is at least one outlier in the data set.
It starts by calculating the z-score for every observation. Let’s say
that the maximum absolute value of the observed z-scores equals G.
The corresponding observation is then considered an outlier at significance level 𝛼 if:
$$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}},$$

where N represents the number of observations, and $t_{\alpha/(2N),\,N-2}$ is the critical value of a Student's t-distribution with N − 2 degrees of freedom and significance level equal to α/(2N). If an outlier is detected, it
is removed from the data set and the test can be run again. The test
can also be run for multivariate outliers whereby the z-score can be
replaced by the Mahalanobis distance defined as follows:
$$\sqrt{(x - \bar{x})^T S^{-1} (x - \bar{x})},$$

where $x$ represents the observation, $\bar{x}$ the mean vector, and S the
covariance matrix. A key weakness of this test is that it assumes an
underlying normal distribution, which is not always the case.
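A minimal sketch of the univariate Grubbs test as formulated above (the function name is ours; the data are illustrative, and the test can be rerun after removing a detected outlier):

```python
import numpy as np
from scipy import stats

def grubbs_outlier(x, alpha=0.05):
    """Return the index of a Grubbs outlier, or None if no outlier is found."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = (x - x.mean()) / x.std(ddof=1)      # z-score per observation
    idx = int(np.argmax(np.abs(z)))         # largest absolute z-score: G
    G = abs(z[idx])
    # Critical value based on the Student's t-distribution
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return idx if G > G_crit else None

x = [12.1, 11.9, 12.3, 12.0, 25.0, 12.2]
print(grubbs_outlier(x))   # flags the 25.0 observation (index 4)
```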


Other statistical procedures fit a distribution, or mixture of distributions (using, e.g., maximum likelihood or expectation-maximization
procedures) and label the observations with small values for the probability density function as outliers.

Break-Point Analysis


Break-point analysis is an intra-account fraud detection method
(Bolton and Hand, 2001). A break point indicates a sudden change in
account behavior, which merits further inspection. The method starts
from defining a fixed time window. This time window is then split into
an old and new part. The old part represents the local model or profile
against which the new observations will be compared. For example, in
Bolton and Hand (2001), the time window was set to 24 transactions
whereby 20 transactions made up the local model, and 4 transactions
were used for testing. A Student’s t-test can be used to compare the
averages of the new and old parts. Observations can then be ranked
according to their value of the t-statistic. This is illustrated in Figure 3.4.

Figure 3.4 Break-Point Analysis (amount spent over time; the local model precedes the break point)
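A sketch of the break-point test on a single account, using the 24-transaction window with a 20/4 split from Bolton and Hand (2001); the transaction amounts are made up:

```python
import numpy as np
from scipy import stats

def break_point_t(amounts, window=24, new=4):
    """t-statistic comparing the last `new` transactions against the
    preceding `window - new` transactions (the local model)."""
    recent = np.asarray(amounts[-window:], dtype=float)
    old, fresh = recent[: window - new], recent[window - new :]
    t, _ = stats.ttest_ind(fresh, old, equal_var=False)
    return t   # rank accounts by this value; a large t signals a sudden change

# 20 ordinary transactions followed by 4 unusually large ones
amounts = [25, 30, 22, 28, 31, 27, 24, 26, 29, 23,
           25, 28, 30, 27, 22, 26, 31, 24, 29, 25,
           180, 220, 195, 240]
print(break_point_t(amounts))
```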


Peer-Group Analysis
Peer-group analysis was also introduced by Bolton and Hand (2001).
A peer group is a group of accounts that behave similarly to the target
account. When the behavior of the latter starts to deviate substantially
from its peers, an anomaly can be signaled. Peer-group analysis proceeds in two steps. In step 1, the peer group of a particular account
needs to be identified. This can be accomplished either by using prior
business knowledge or in a statistical way.
For example, in an employee fraud context, people sharing similar
jobs can be grouped as peers. Another example is healthcare fraud
detection, whereby the aim is to detect fraudulent claim behavior of
doctors across geographical regions. Suppose the target observation is
a cardiologist in San Francisco; then its peers are defined as all other
cardiologists in San Francisco. Statistical similarity metrics can also be
used to define peers, but these are typically application specific. Popular
examples here are Euclidean-based metrics (see, e.g., Bolton and Hand
2001; Weston et al. 2008). The number of peers should be carefully
selected. It cannot be too small, or the method becomes too local and
thus sensitive to noise, and also not too large, or the method becomes
too global and thus insensitive to important local irregularities.
In step 2, the behavior of the target account is contrasted with its
peers using a statistical test such as a Student's t-test, or a distance metric such as the Mahalanobis distance, which works similarly.
Let’s work out an example in a credit card context. Assume our
target account has the following time series:
$$y_1, y_2, \dots, y_{n-1}, y_n,$$

where $y_i$ represents the amount spent at time (e.g., day or week) i.
The aim is now to verify whether the amount spent at time n, yn , is
anomalous. We start by identifying the k peers of the target account.
These are the rows $x_{1,\cdot}$ through $x_{k,\cdot}$ in Table 3.1 below, whereby all accounts have been sorted according to their similarity to the target.
To see whether $y_n$ is an outlier, a t-score can be calculated as follows:

$$\frac{y_n - \bar{x}_{1:k,n}}{s},$$


Table 3.1 Transaction Data Set for Peer-Group Analysis

x_{m,1}    x_{m,2}    …    x_{m,n−1}    x_{m,n}
…
x_{k,1}    x_{k,2}    …    x_{k,n−1}    x_{k,n}
…
x_{2,1}    x_{2,2}    …    x_{2,n−1}    x_{2,n}
x_{1,1}    x_{1,2}    …    x_{1,n−1}    x_{1,n}
y_1        y_2        …    y_{n−1}      y_n
Figure 3.5 Peer-Group Analysis (amount spent over time for the target account and its peer group, with the outlier indicated)

where $\bar{x}_{1:k,n}$ represents the average of $x_{1,n}, \dots, x_{k,n}$ and s the corresponding standard deviation. Although a Student's t-distribution can
be used for statistical interpretation, it is recommended to simply order
the observations in terms of their t-score and further inspect the ones
with the highest scores. This is illustrated in Figure 3.5.
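The peer-group t-score is equally simple to compute; a sketch assuming the latest amounts of the k peers are already available (all numbers are illustrative):

```python
import numpy as np

def peer_group_score(y_n, peers_n):
    """t-score of the target's latest amount y_n against the latest
    amounts of its k peers."""
    peers_n = np.asarray(peers_n, dtype=float)
    return (y_n - peers_n.mean()) / peers_n.std(ddof=1)

# Latest weekly amounts of the k = 5 most similar accounts
peers_latest = [120, 135, 110, 140, 125]
print(peer_group_score(480, peers_latest))   # a large score warrants inspection
```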
A key advantage of peer-group analysis when compared to break-point analysis is that it tracks anomalies by considering inter-account instead of intra-account behavior. For example, if one were to compare


transaction amounts for a particular account with previous amounts
on that same account (intra-account), then the spending behavior during Christmas will definitely be flagged as anomalous. By considering
peers instead (inter-account), this problem is avoided. It is important
to note that both break-point and peer-group analysis will detect local
anomalies rather than global anomalies. In other words, patterns or
sequences that are not unusual in the global population might still be
flagged as suspicious when they appear to be unusual compared to
their local profile or peer behavior.

Association Rule Analysis
Association rules detect frequently occurring relationships between
items (Agrawal, Imielinski et al. 1993). They were originally introduced in a market basket analysis context to detect which items are
frequently purchased together. The key input is a transactions database
D consisting of a transaction identifier and a set of items {i1 , i2 , …, in }
selected from all possible items I. An association rule is then an implication of the form X ⇒ Y, whereby X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. X is referred to as the rule antecedent, whereas Y is referred to as the rule consequent. Examples of association rules could be:
◾ If a customer has a car loan and car insurance, then the customer has a checking account in 80 percent of the cases.

◾ If a customer buys spaghetti, then the customer buys red wine in 70 percent of the cases.

◾ If a customer visits web page A, then the customer will visit web page B in 90 percent of the cases.

It is hereby important to note that association rules are stochastic
in nature. This means that they should not be interpreted as a universal truth, and are characterized by statistical measures quantifying the
strength of the association. Furthermore, the rules measure correlational associations and should not be interpreted in a causal way.
In a fraud setting, association rules can be used to detect fraud
rings in insurance. The transaction identifier then corresponds to a
claim identifier and the items to the various parties involved such as
the insured, claim adjuster, police officer, and claim service provider (e.g., auto repair shop, medical provider, home repair contractor, etc.). Let's consider an example of a transactions database as depicted in Table 3.2.

Table 3.2 Transactions Database for Insurance Fraud Detection

Claim Identifier    Parties Involved
1     insured A, police officer X, claim adjuster 1, auto repair shop 1
2     insured A, claim adjuster 2, police officer X
3     insured A, police officer Y, auto repair shop 1
4     insured A, claim adjuster 1, police officer Y
5     insured B, claim adjuster 2, auto repair shop 2, police officer Z
6     insured A, auto repair shop 2, auto repair shop 1, police officer X
7     insured C, police officer X, auto repair shop 1
8     insured A, auto repair shop 1, police officer Z
9     insured A, auto repair shop 1, police officer X, claim adjuster 1
10    insured B, claim adjuster 3, auto repair shop 1
The goal is now to find frequently occurring relationships or association rules between the various parties involved in the handling of
the claim. This will be solved using a two-step procedure. In step 1,
the frequent item sets will be identified. The frequency of an item set
is measured by means of its support, which is the percentage of total
transactions in the database that contains the item set. Hence, the item
set X has support s if 100×s percent of the transactions in D contain X.
It can be formally defined as follows:
$$\text{support}(X) = \frac{\text{number of transactions supporting } X}{\text{total number of transactions}}$$

Consider the item set {insured A, police officer X, auto repair
shop 1}. This item set occurs in transactions 1, 6, and 9 hereby giving a
support of 3/10 or 30 percent. A frequent item set can now be defined
as an item set of which the support is higher than a minimum value
as specified by the data scientist (e.g., 10 percent). Computationally
efficient procedures have been developed to identify the frequent item
sets (Agrawal, Imielinski et al. 1993).


Once the frequent item sets have been found, the association rules
can be derived in step 2. Multiple association rules can be defined based
on the same item set. Consider the item set {insured A, police officer
X, auto repair shop 1}. Example association rules could be:
If insured A And police officer X ⇒ auto repair shop 1
If insured A And auto repair shop 1 ⇒ police officer X
If insured A ⇒ auto repair shop 1 And police officer X
The strength of an association rule can be quantified by means of
its confidence. The confidence measures the strength of the association
and is defined as the conditional probability of the rule consequent,
given the rule antecedent. The rule X ⇒ Y has confidence c if 100×c
percent of the transactions in D that contain X also contain Y. It can be
formally defined as follows:
$$\text{confidence}(X \Rightarrow Y) = P(Y|X) = \frac{\text{support}(X \cup Y)}{\text{support}(X)}$$

Consider the association rule “If insured A And police officer X
⇒ auto repair shop 1.” The antecedent item set {insured A, police
officer X} occurs in transactions 1, 2, 6, and 9. Out of these four transactions, three also include the consequent item set {auto repair shop
1}, which results into a confidence of three-fourths, or 75 percent.
Again, the data scientist has to specify a minimum confidence in order
for an association rule to be considered interesting.
Once all association rules have been found, they can be more closely inspected and validated. In our example, the association "If insured A And police officer X ⇒ auto repair shop 1" does not necessarily imply a fraud ring, but it is at least worth the effort to further inspect the relationship between these parties.
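The support and confidence calculations are easy to verify directly on the transactions of Table 3.2; a minimal sketch:

```python
# Transactions database of Table 3.2 (parties involved per claim)
transactions = {
    1: {"insured A", "police officer X", "claim adjuster 1", "auto repair shop 1"},
    2: {"insured A", "claim adjuster 2", "police officer X"},
    3: {"insured A", "police officer Y", "auto repair shop 1"},
    4: {"insured A", "claim adjuster 1", "police officer Y"},
    5: {"insured B", "claim adjuster 2", "auto repair shop 2", "police officer Z"},
    6: {"insured A", "auto repair shop 2", "auto repair shop 1", "police officer X"},
    7: {"insured C", "police officer X", "auto repair shop 1"},
    8: {"insured A", "auto repair shop 1", "police officer Z"},
    9: {"insured A", "auto repair shop 1", "police officer X", "claim adjuster 1"},
    10: {"insured B", "claim adjuster 3", "auto repair shop 1"},
}

def support(itemset):
    hits = sum(itemset <= t for t in transactions.values())  # subset test
    return hits / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

X = {"insured A", "police officer X"}
Y = {"auto repair shop 1"}
print(support(X | Y))      # 0.3  (claims 1, 6, and 9)
print(confidence(X, Y))    # 0.75 (3 of the 4 claims containing X also contain Y)
```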

CLUSTERING
Introduction
The aim of clustering is to split up a set of observations into segments
such that the homogeneity within a segment is maximized (cohesive), and the heterogeneity between segments is maximized (separated)
(Everitt, Landau et al. 2010). Examples of applications in fraud
detection include:
◾ Clustering transactions in a credit card setting

◾ Clustering claims in an insurance setting

◾ Clustering tax statements in a tax-inspection setting

◾ Clustering cash transfers in an anti-money laundering setting

Various types of clustering data can be used, such as customer characteristics (e.g., sociodemographic, behavioral, lifestyle, …), account
characteristics, transaction characteristics, etc. A very popular set of
transaction characteristics used for clustering in fraud detection are
the recency, frequency, and monetary (RFM) variables, as introduced
in Chapter 2.
Note that besides structured information, unstructured information such as emails, call records, and social media information might also be considered. As always in analytics, it is important to carefully select
the data for clustering. The more data the better, although care should
be taken to avoid excessive amounts of correlated data by applying
unsupervised feature selection methods. One very simple approach
here is to simply calculate the Pearson correlation between each pair
of data characteristics and only retain one characteristic in case of a
significant correlation.
When used for fraud detection, a possible aim of clustering may
be to group anomalies into small, sparse clusters. These can then be
further analyzed and inspected in terms of their characteristics and
potentially fraudulent behavior (see Figure 3.6).
Different types of clustering techniques can be applied for fraud
detection. At a high level, they can be categorized as either hierarchical
or nonhierarchical (see Figure 3.7).

Figure 3.6 Cluster Analysis for Fraud Detection (anomalies form small, sparse clusters in recency–frequency space)

Figure 3.7 Hierarchical Versus Nonhierarchical Clustering Techniques (hierarchical clustering is agglomerative or divisive; nonhierarchical methods include k-means and SOMs)

Distance Metrics

As said, the aim of clustering is to group observations based on similarity. Hence, a distance metric is needed to quantify similarity. Various distance metrics have been introduced in the literature for both continuous and categorical data.

For continuous data, the Minkowski distance or Lp norm between
two observations xi and xj can be defined as follows:
$$D(x_i, x_j) = \left( \sum_{k=1}^{n} \left| x_{ik} - x_{jk} \right|^p \right)^{1/p},$$

where n represents the number of variables. When p equals 1, the
Minkowski distance is also referred to as the Manhattan or city block
distance. When p equals 2, the Minkowski distance becomes the
well-known Euclidean distance. Both are illustrated in Figure 3.8. For
the example depicted, the distance measures become:

$$\text{Euclidean:} \quad \sqrt{(1500 - 1000)^2 + (10 - 5)^2} \approx 500$$
$$\text{Manhattan:} \quad |1500 - 1000| + |10 - 5| = 505$$


Figure 3.8 Euclidean Versus Manhattan Distance (monetary versus recency)

From this example, it is clear that the amount variable dominates the distance metric, since it is measured on a larger scale than
the recency variable. Hence, to appropriately take this into account, it
is recommended to scale both variables to a similar range using any of
the standardization procedures we discussed in Chapter 2. It is obvious
that the Euclidean distance will always be shorter than the Manhattan distance. The Euclidean metric is the most popular metric used for
quantifying the distance between continuous variables. Other less frequently used distance measures are based on the Pearson correlation
or cosine measure.
Besides continuous variables, categorical variables can also be used
for clustering. Let’s first discuss the case of binary variables. These are
often used in insurance fraud detection methods, which are typically
based on a series of red-flag indicators to label a claim as suspicious
or not. Assume we have the following data set with binary red-flag
indicators.
           Poor Driving   Premium Paid   Car Purchase    Coverage    Car Was Never
           Record         in Cash        Information     Increased   Inspected or Seen
                                         Available
Claim 1    Yes            No             Yes             Yes         No
Claim 2    Yes            Yes            No              No          No
…

A first way to calculate the distance or similarity between claim 1
and 2 is to use the simple matching coefficient (SMC), which simply


calculates the number of identical matches between the variable values
as follows:
SMC(Claim 1, Claim 2) = 2/5.
A tacit assumption behind the SMC is that both states of the variable (Yes versus No) are equally important and should thus both be
considered. Another option is to use the Jaccard index whereby the
No-No match is left out of the computation as follows:
Jaccard(Claim 1, Claim 2) = 1/4.
The Jaccard index measures the similarity between both claims
across those red flags that were raised at least once. It is especially useful in those situations where many red-flag indicators are available and
typically only a few are raised. Consider, for example, a fraud-detection
system with 100 red-flag indicators, of which on average 5 are raised.
If you would use the simple matching coefficient, then typically all
claims would be very similar since the 0–0 matches would dominate
the count, hereby creating no meaningful clustering solution. By using
the Jaccard index a better idea of the claim similarity can be obtained.
The Jaccard index has actually been very popular in fraud detection.
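A sketch of both similarity measures on binary red-flag vectors (1 = flag raised), using the two claims from the table above:

```python
def smc(a, b):
    """Simple matching coefficient: fraction of positions that agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def jaccard(a, b):
    """Jaccard index: 1-1 matches over positions where either flag is raised."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either

claim1 = [1, 0, 1, 1, 0]   # Yes, No, Yes, Yes, No
claim2 = [1, 1, 0, 0, 0]   # Yes, Yes, No, No, No
print(smc(claim1, claim2))      # 2/5 = 0.4
print(jaccard(claim1, claim2))  # 1/4 = 0.25
```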
Let’s now consider the case of categorical variables with more
than two values. Assume we have the following data in a medical
insurance setting.
           Treatment    Distance between       Type of            Risk    Claim
           Day          Clinic and Subject's   Diagnosis          Class   Submitted by
                        Home
Claim 1    Sunday       Medium                 Severe             C       Phone
Claim 2    Wednesday    Medium                 Life threatening   C       Email
…

A first option here is to code the categorical variables as 0/1 dummies and apply the Manhattan or Euclidean distance metrics discussed
earlier. However, this may be cumbersome in case of categorical variables with lots of values. Coarse classification might be considered to
reduce the number of dummy variables, but, remember, since we don’t


have a target variable in this unsupervised setting, it should be based on
expert knowledge. Another option would be to use the simple matching coefficient (SMC) and count the number of identical matches. In
our case, the SMC would become 2/5.
Many data sets will contain both continuous and categorical variables, which complicates the distance calculation. One option here is
to code the categorical variables as 0/1 dummies and use a continuous
distance measure. Another option is to use a (weighted) combination
of distance measures, although this is less straightforward and thus less
frequently used.

Hierarchical Clustering
Once the distance measures have been chosen, the clustering process
can start. A first popular set of techniques are hierarchical clustering
methods. Depending on the starting point of the analysis, divisive or
agglomerative hierarchical clustering methods can be used. Divisive
hierarchical clustering starts from the whole data set in one cluster,
and then breaks this up into ever smaller clusters until one observation per cluster remains (right to left in Figure 3.9). Agglomerative clustering works the other way around: it starts from each observation in its own cluster, and then continues to merge the ones that are most similar until all observations make up one big cluster (left to right in Figure 3.9). The optimal clustering solution then lies somewhere in between these two extremes.

Figure 3.9 Divisive Versus Agglomerative Hierarchical Clustering
Although we have earlier discussed various distance metrics to
quantify the distance between individual observations, we haven’t
talked about how to measure distances between clusters. Also here,
various options are available, as depicted in Figure 3.10. The single
linkage method defines the distance between two clusters as the
smallest possible distance, or the distance between the two most similar objects. The complete linkage method defines the distance between
two clusters as the biggest distance, or the distance between the two
most dissimilar objects. The average linkage method calculates the
average of all possible distances. The centroid method calculates the
distance between the centroids of both clusters. Finally, Ward’s distance
between two clusters Ci and Cj is calculated as the difference between
the total within cluster sum of squares for the two clusters separately,
and the total within cluster sum of squares obtained from merging the
clusters Ci and Cj into one cluster Cij . It is calculated as follows:
$$D_{\text{Ward}}(C_i, C_j) = \sum_{x \in C_i} (x - c_i)^2 + \sum_{x \in C_j} (x - c_j)^2 - \sum_{x \in C_{ij}} (x - c_{ij})^2,$$

where $c_i$, $c_j$, and $c_{ij}$ are the centroids of the clusters $C_i$, $C_j$, and $C_{ij}$, respectively.

Figure 3.10 Calculating Distances between Clusters (single linkage, complete linkage, average linkage, and the centroid method)


In order to decide on the optimal number of clusters, one could use a dendrogram or scree plot. A dendrogram is a tree-like diagram that records the sequences of merges. The vertical (or horizontal) scale then gives the distance between the two clusters amalgamated. One can then cut the dendrogram at the desired level to find the optimal clustering. This is illustrated in Figure 3.11 and Figure 3.12 for a birds clustering example. A scree plot is a plot of the distance at which clusters are merged. The elbow point then indicates the optimal clustering. This is illustrated in Figure 3.13.

Figure 3.11 Example for Clustering Birds. The Numbers Indicate the Clustering Steps

Figure 3.12 Dendrogram for Birds Example. The Thick Black Line Indicates the Optimal Clustering

Figure 3.13 Scree Plot for Clustering (distance versus number of clusters)
A key advantage of hierarchical clustering is that the number of
clusters does not need to be specified prior to the analysis. A disadvantage is that the methods do not scale very well to large data sets. Also,
the interpretation of the clusters is often subjective and depends on the
business expert and/or data scientist.

Example of Hierarchical Clustering Procedures
To illustrate the various hierarchical clustering procedures discussed,
suppose we have a data set of seven observations, as depicted in
Table 3.3. Figure 3.14 displays the corresponding scatter plot.
Table 3.3 Data Set for Hierarchical Clustering

     X    Y
A    4    4
B    5    4
C    7    5
D    8    5
E   11    5
F    2    7
G    1    3

Figure 3.14 Scatter Plot of Hierarchical Clustering Data

Figure 3.15 Output of Hierarchical Clustering Procedures (dendrogram and resulting clusters for single linkage, complete linkage, average linkage, the centroid method, and Ward's method)

The output of the various hierarchical clustering procedures is depicted in Figure 3.15. As can be observed, single linkage results
in thin, long, and elongated clusters, since dissimilar objects are not accounted for. Complete linkage will make the clusters tighter, more balanced, and spherical, which is often more desirable. Average linkage prefers to merge clusters with small variances, which often results in clusters with similar variance. Although the centroid method seems similar at first sight, the resulting clustering solution is different, as depicted in the figure. Ward's method prefers to merge clusters with a small number of observations and often results in balanced clusters.
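The exercise is easy to reproduce with SciPy on the Table 3.3 data; a sketch, where the linkage method names map directly onto the variants discussed above:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Data set of Table 3.3
points = np.array([[4, 4], [5, 4], [7, 5], [8, 5], [11, 5], [2, 7], [1, 3]])
labels = list("ABCDEFG")

for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(points, method=method)                 # merge sequence
    clusters = fcluster(Z, t=3, criterion="maxclust")  # cut at 3 clusters
    print(method, dict(zip(labels, clusters)))

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree as in Figure 3.15
```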

k-Means Clustering
k-means clustering is a nonhierarchical procedure that works along the
following steps (see Jain 2010; MacQueen 1967):
1. Select k observations as initial cluster centroids (seeds).
2. Assign each observation to the cluster that has the closest
centroid (for example, in the Euclidean sense).
3. When all observations have been assigned, recalculate the positions of the k centroids.
4. Repeat until the cluster centroids no longer change or a fixed
number of iterations is reached.
A key requirement here is that, as opposed to hierarchical clustering, the number of clusters, k, needs to be specified before the start
of the analysis. This decision can be made using expert based input
or based on the result of another (e.g., hierarchical) clustering procedure. Typically, multiple values of k are tried out and the resulting clusters evaluated in terms of their statistical characteristics and
interpretation. It is also advised to try out different seeds to verify the
stability of the clustering solution. Note that the mean is sensitive to
outliers, which are especially relevant in a fraud detection setting. A
more robust alternative is to use the median instead (k-medoid clustering). In case of categorical variables, the mode can be used (k-mode
clustering). As mentioned, k-means is most often used in combination
with a Euclidean distance metric, which typically results in spherical
or ball-shaped clusters. See Figures 3.16 to 3.22.
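A sketch of k-means with scikit-learn on illustrative data, standardizing first since k-means is distance based:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))              # illustrative RFM-like data

X_std = StandardScaler().fit_transform(X)  # scale before distance-based methods
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_std)

print(km.labels_[:10])    # cluster assignment per observation
print(km.inertia_)        # within-cluster SSE, usable to compare values of k
```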

Figure 3.16 k-Means Clustering: Start from Original Data

Figure 3.17 k-Means Clustering Iteration 1: Randomly Select Initial Cluster Centroids

Figure 3.18 k-Means Clustering Iteration 1: Assign Remaining Observations

Figure 3.19 k-Means Clustering Iteration 2: Recalculate Cluster Centroids

Figure 3.20 k-Means Clustering Iteration 2: Reassign Observations

Figure 3.21 k-Means Clustering Iteration 3: Recalculate Cluster Centroids

Figure 3.22 k-Means Clustering Iteration 3: Reassign Observations

Self-Organizing Maps
A self-organizing map (SOM) is an unsupervised learning algorithm
that allows users to visualize and cluster high-dimensional data on
a low-dimensional grid of neurons (Kohonen 2000; Huysmans et al.
2006; Seret et al. 2012). A SOM is a feedforward neural network with
two layers: an input and an output layer. The neurons from the output
layer are usually ordered in a two-dimensional rectangular or hexagonal grid (see Figure 3.23). For the former, every neuron has at most
eight neighbors, whereas for the latter, every neuron has at most six
neighbors.
Each input is connected to all neurons in the output layer with
weights w = [w1 , … , wN ], with N the number of variables. All weights
are randomly initialized. When a training vector x is presented, the
weight vector wc of each neuron c is compared with x, using for
example, the Euclidean distance metric (beware to standardize the
data first!):
$$d(x, w_c) = \sqrt{\sum_{i=1}^{N} (x_i - w_{ci})^2}.$$

The neuron that is most similar to x in the Euclidean sense is called
the best matching unit (BMU). The weight vector of the BMU
and its neighbors in the grid are then adapted using the following
learning rule:
$$w_i(t + 1) = w_i(t) + h_{ci}(t)\,[x(t) - w_i(t)],$$

where t represents the time index during training and $h_{ci}(t)$ defines the neighborhood of the BMU c, specifying the region of influence.
Figure 3.23 Rectangular Versus Hexagonal SOM Grid


The neighborhood function $h_{ci}(t)$ should be a nonincreasing function of time and of the distance from the BMU. Some popular choices are:

$$h_{ci}(t) = \alpha(t) \exp\left( -\frac{\|r_c - r_i\|^2}{2\sigma^2(t)} \right),$$
$$h_{ci}(t) = \alpha(t) \;\text{ if }\; \|r_c - r_i\|^2 \le \text{threshold}, \; 0 \text{ otherwise},$$

where $r_c$ and $r_i$ represent the locations of the BMU and neuron i on the map, $\sigma^2(t)$ represents the decreasing radius, and $0 \le \alpha(t) \le 1$ the
learning rate (e.g., 𝛼(t) = A∕(t + B), 𝛼(t) = exp(–At)). The decreasing
learning rate and radius will give a stable map after a certain amount of
training. The neurons will then move more and more toward the input
observations and interesting segments will emerge. Training is stopped
when the BMUs remain stable, or after a fixed number of iterations
(e.g., 500 times the number of SOM neurons).
SOMs can be visualized by means of a U-matrix or component plane:

◾ A U (unified distance) matrix essentially superimposes a height Z dimension on top of each neuron, visualizing the average distance between the neuron and its neighbors, whereby typically dark colors indicate a large distance and can be interpreted as cluster boundaries.

◾ A component plane visualizes the weights between each specific input variable and its output neurons, and as such provides a visual overview of the relative contribution of each input attribute to the output neurons.
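The training loop is compact enough to sketch in plain NumPy; the grid size, schedules, and data below are illustrative assumptions, and a Gaussian neighborhood is used:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))        # standardized inputs (illustrative data)
rows, cols = 5, 5                    # output grid (illustrative size)

W = rng.normal(size=(rows, cols, X.shape[1]))          # random initial weights
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                            indexing="ij"), axis=-1)   # neuron coordinates

n_iter = 500 * rows * cols           # e.g., 500 times the number of neurons
for t in range(n_iter):
    x = X[rng.integers(len(X))]
    # Best matching unit: neuron whose weight vector is closest to x
    bmu = np.unravel_index(np.argmin(np.linalg.norm(W - x, axis=-1)),
                           (rows, cols))
    # Decreasing learning rate and neighborhood radius
    alpha = 0.5 * (1.0 - t / n_iter)
    sigma2 = (1.0 + (max(rows, cols) / 2) * (1.0 - t / n_iter)) ** 2
    # Gaussian neighborhood h_ci(t) around the BMU on the grid
    h = alpha * np.exp(-((grid - np.array(bmu)) ** 2).sum(-1) / (2 * sigma2))
    W += h[..., None] * (x - W)      # move the BMU and its neighbors toward x

print(W.shape)                       # (5, 5, 4): one prototype per neuron
```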

Figure 3.24 provides a SOM example for clustering countries based
on a Corruption Perception Index (CPI). This is a score between 0
(highly corrupt) and 10 (highly clean), assigned to each country in
the world. The CPI is combined with demographic and macroeconomic
information for the years 1996, 2000, and 2004. Uppercase countries
(e.g., BEL) denote the situation in 2004, lowercase (e.g., bel) in 2000,
and sentence case (e.g., Bel) in 1996. It can be seen that many of the
European countries are situated in the upper-right corner of the map.
Figure 3.25 provides the component plane for literacy whereby
darker regions score worse on literacy. Figure 3.26 provides the component plane for political rights, whereby darker regions correspond to better political rights. It can be seen that many of the European countries score well on both literacy and political rights.

Figure 3.24 Clustering Countries Using SOMs

Figure 3.25 Component Plane for Literacy

Figure 3.26 Component Plane for Political Rights
SOMs are a very handy tool for clustering high-dimensional data
sets because of the visualization facilities. However, since there is no
real objective function to minimize, it is harder to compare various
SOM solutions against each other. Also, experimental evaluation and
expert interpretation are needed to decide on the optimal size of the
SOM. Unlike k-means clustering, a SOM does not force the number of
clusters to be equal to the number of output neurons.

Clustering with Constraints
In many fraud application domains, the expert(s) will have prior
knowledge about existing fraud patterns and/or anomalous behavior.
This knowledge can originate from both experience and the existing literature. It would be handy if this background knowledge could be incorporated to guide the clustering. This is the idea of semi-supervised
clustering or clustering with constraints (Basu, Davidson et al. 2012).
The idea here is to bias the clustering with expert knowledge such
that the clusters can be found more quickly and with the desired properties.
Various types of constraints can be thought of. A first set of constraints is observation-level constraints. As the name suggests, these
constraints are set for individual observations. A must-link constraint
enforces that two observations should be assigned to the same cluster,
whereas a cannot-link constraint will put them into different clusters
(see Figure 3.27). This could be handy in a fraud-detection setting
if the fraud behavior of only a few observations is known, as these
can then be forced into the same cluster. Cluster-level constraints
are defined at the level of the cluster. A minimum separation or 𝛿
constraint specifies that the distance between any pair of observations in two different clusters must be at least 𝛿 (see Figure 3.28). This will
allow data scientists to create well-separated clusters. An 𝜀-constraint
specifies that each observation in a cluster with more than one
observation must have another observation within a distance of at
most 𝜀 (see Figure 3.29). Another example of a constraint includes the
requirement to have balanced clusters, whereby each cluster contains the same number of observations. Negative background information can also be provided, whereby the aim is to find a clustering that is different from a given clustering.

Figure 3.27 Must-Link and Cannot-Link Constraints in Semi-Supervised Clustering

Figure 3.28 δ-Constraints in Semi-Supervised Clustering

Figure 3.29 ε-Constraints in Semi-Supervised Clustering
The constraints can be enforced during the clustering process.
For example, in a k-means clustering setup, the cluster seeds will
be chosen such that the constraints are respected. Each time an
observation is (re-)assigned, the constraints will be verified and the
(re-)assignment halted in case violations occur. In a hierarchical
clustering procedure, a must-link constraint can be enforced by setting
the distance between two observations to 0, whereas a cannot-link
constraint can be enforced by setting the distance to a very high value.
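A sketch of this distance-matrix trick for hierarchical clustering with SciPy; the constraint pairs and data are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))               # illustrative data
D = squareform(pdist(X))                   # pairwise Euclidean distances

must_link = [(0, 1)]                       # e.g., two known joint fraud cases
cannot_link = [(2, 3)]

for i, j in must_link:
    D[i, j] = D[j, i] = 0.0                # distance 0: merged very early
for i, j in cannot_link:
    D[i, j] = D[j, i] = 1e6                # huge distance: effectively never merged

Z = linkage(squareform(D), method="average")   # condensed distances as input
print(fcluster(Z, t=3, criterion="maxclust"))  # observations 0 and 1 co-cluster
```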

Evaluating and Interpreting Clustering Solutions
Evaluating a clustering solution is by no means a trivial exercise since
there exists no universal criterion. From a statistical perspective, the
sum of squared errors (SSE) can be computed as follows:
$$\text{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \text{dist}^2(x, m_i),$$

where K represents the number of clusters and $m_i$ the centroid (e.g.,
mean) of cluster i. When comparing two clustering solutions, the one
with the lowest SSE can then be chosen. Besides a statistical evaluation,


a clustering solution will also be evaluated in terms of its interpretation.
To facilitate the interpretation of a clustering solution, various options
are available. A first one is to compare cluster distributions with population distributions across all variables on a cluster-by-cluster basis.
This is illustrated in Figure 3.30 whereby the distribution of a cluster
C1 is contrasted with the overall population distribution for the RFM variables. It can be clearly seen that cluster C1 has observations with low recency values and high monetary values, whereas the frequency is relatively similar to the original population.

Figure 3.30 Cluster Profiling Using Histograms (recency, monetary, and frequency distributions of cluster C1 versus the population)

Table 3.4 Output from a k-Means Clustering Exercise (k = 4)

Claim     Recency   Frequency   Monetary   …   ClusterID
Claim1    …         …           …          …   Cluster2
Claim2    …         …           …          …   Cluster4
Claim3    …         …           …          …   Cluster3
Claim4    …         …           …          …   Cluster2
Claim5    …         …           …          …   Cluster1
Claim6    …         …           …          …   Cluster4
…
Another way to explain a given clustering solution is by building a
decision tree with the ClusterID as the target variable. We will discuss
how to build decision trees in the next chapter, but for the moment
it suffices to understand how they should be interpreted. Assume we
have the following output from a k-means clustering exercise with k
equal to 4 (see Table 3.4).
We can now build a decision tree with the ClusterID as the target
variable as follows.
The decision tree in Figure 3.31 gives us a clear insight into the
distinguishing characteristics of the various clusters. For example,
cluster 2 is characterized by observations having recency < 1 day and
monetary > 1,000. Hence, using decision trees, we can easily assign
new observations to the existing clusters. This is an example of how
supervised or predictive techniques can be used to explain the solution
from a descriptive analytics exercise.
Figure 3.31 Using Decision Trees for Clustering Interpretation (root split on recency < 1 day: if yes, monetary > 1,000 separates Cluster2 from Cluster4; if no, frequency > 5 separates Cluster1 from Cluster3)
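A sketch of this explanation step with scikit-learn, fitting a shallow decision tree on the RFM variables with the ClusterID as target; the file and column names are hypothetical:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical file holding Table 3.4-style output of a k-means exercise
df = pd.read_csv("claims_with_clusters.csv")
X = df[["Recency", "Frequency", "Monetary"]]
y = df["ClusterID"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)  # shallow = interpretable
print(export_text(tree, feature_names=list(X.columns)))

# New observations can now be assigned to the existing clusters:
# tree.predict(new_observations)
```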


ONE-CLASS SVMS
One-class SVMs try to maximize the distance between a hyperplane
and the origin (Schölkopf et al. 2001). The idea is to separate the
majority of the observations from the origin. The observations that
lie on the other side of the hyperplane, closest to the origin, are then
considered as outliers. This is illustrated in Figure 3.32.
Let’s define the hyperplane as follows:
$$w^T \varphi(x) - \rho = 0.$$

Normal observations lie above the hyperplane and outliers below it; in other words, normal observations (outliers) will return a positive (negative) value for:

$$f(x) = \text{sign}\left(w^T \varphi(x) - \rho\right).$$
One-class SVMs then aim at solving the following optimization problem:

$$\min \;\; \frac{1}{2} \sum_{i=1}^{N} w_i^2 - \rho + \frac{1}{\upsilon n} \sum_{i=1}^{n} e_i$$

$$\text{subject to} \;\; w^T \varphi(x_k) \ge \rho - e_k, \; k = 1, \dots, n, \;\; e_k \ge 0.$$
The error variables $e_i$ are introduced to allow observations to lie on the side of the hyperplane closest to the origin. The parameter $\upsilon$ is a regularization term.

Figure 3.32 One-Class Support Vector Machines

Mathematically, it can be shown that the distance between the hyperplane and the origin equals $\rho / \|w\|$ (see Figure 3.32). This distance is now maximized by minimizing $\frac{1}{2} \sum_{i=1}^{N} w_i^2 - \rho$, which is
the first part in the objective function. The second part of the objective
function then accounts for errors, or thus outliers. The constraints force
the majority of observations to lie above the hyperplane. The parameter 𝜐 ranges between 0 and 1, and sets an upper bound on the fraction
of outliers. A lower (higher) value of the regularization parameter 𝜐
will increase (decrease) the weight assigned to errors and thus decrease
(increase) the number of outliers. Given the importance of this parameter, one-class SVMs are sometimes also referred to as 𝜐-SVMs.
As with SVMs for supervised learning (see Chapter 4), the optimization problem can be solved by formulating its dual variant, which
also here yields a quadratic programming (QP) problem, and applying
the kernel trick. By again using Lagrangian optimization, the following
decision function is obtained:

$$f(x) = \text{sign}\left( w^T \varphi(x) - \rho \right) = \text{sign}\left( \sum_{i=1}^{n} \alpha_i K(x, x_i) - \rho \right),$$

where $\alpha_i$ represent the Lagrange multipliers, and $K(x, x_i)$ the kernel
function. See Schölkopf et al. (2001) for more details.
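This formulation is available in scikit-learn as OneClassSVM, with the nu parameter playing the role of 𝜐; a minimal sketch on synthetic data with a few planted anomalies:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))       # mostly "normal" behavior (illustrative)
X[:5] += 6.0                        # a few planted anomalies

ocsvm = OneClassSVM(kernel="rbf", nu=0.05).fit(X)  # nu bounds the outlier fraction
labels = ocsvm.predict(X)           # +1 = normal, -1 = outlier
print(np.where(labels == -1)[0])    # indices flagged for further inspection
```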

REFERENCES
Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining Association Rules
between Sets of Items in Massive Databases, Proceedings of the ACM SIGMOD
International Conference on Management of Data. Washington, D.C.
Basu, S., Davidson, I., & Wagstaff, K. L. (2012). Constrained Clustering: Advances
in Algorithms, Theory, and Applications, Boca Raton, FL: Chapman & Hall/
CRC.
Bolton, R. J., & Hand, D. J. (2002). Statistical Fraud Detection: A Review,
Statistical Science 17 (3): 235–255.
Cullinan, G. J. (1977). Picking Them by Their Batting Averages’ Recency–Frequency–
Monetary Method of Controlling Circulation, Manual Release 2103. New York:
Direct Mail/Marketing Association.
Everitt, B. S., Landau, S., Leese, M., & Stahl, D. (2010). Cluster Analysis, 5th ed.
Hoboken, NJ: John Wiley & Sons.

DESCRIPTIVE ANALYTICS FOR FRAUD DETECTION

119

Grubbs, F. E. (1950). Sample Criteria for Testing Outlying Observations, The
Annals of Mathematical Statistics 21(1): 27–58.
Grubbs, F. E. (1969). Procedures for Detecting Outlying Observations in
Samples, Technometrics 11(1): 1–21.
Huysmans, J., Baesens, B., Van Gestel, T., & Vanthienen, J. (2006). Failure Prediction with Self Organizing Maps, Expert Systems with Applications, Special
Issue on Intelligent Information Systems for Financial Engineering 30(3):
479–487.
Jain, A. K. (2010). Data Clustering: 50 Years Beyond K-Means, Pattern Recognition Letters 31(8): 651–666.
Kohonen, T. (2000). Self-Organizing Maps. New York: Springer.
MacQueen, J. (1967). Some Methods for classification and Analysis of
Multivariate Observations, Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley: University of California Press,
pp. 281–297.
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A., & Williamson, R. C.
(2001). Estimating the support of a high-dimensional distribution, Neural
Computation 13(7): 1443–1471.
Seret, A., Verbraken, T., Versailles, S., & Baesens, B. (2012). A New SOM-Based
Method for Profile Generation: Theory and an Application in Direct Marketing, European Journal of Operational Research 220 (1): 199–209.
Weston, D. J., Hand, D. J., Adams, N. M., Whitrow, C., & Juszczak, P. (2008).
Plastic card fraud detection using peer group analysis, Advances in Data
Analysis and Classification 2(1): 45–62.

CHAPTER 4

Predictive Analytics for Fraud Detection

INTRODUCTION
In predictive analytics, the aim is to build an analytical model predicting a target measure of interest (Baesens 2014; Duda et al. 2001;
Flach 2012; Han and Kamber 2001; Hastie et al. 2001; Tan et al.
2006). The target is then typically used to steer the learning process
during an optimization procedure. Two types of predictive analytics
can be distinguished depending on the measurement level of the
target: regression and classification. In regression, the target variable is
continuous and varies along a predefined interval. This interval can
be limited (e.g., between 0 and 1) or unlimited (e.g., between 0 and
infinity). A typical example in a fraud detection setting is predicting
the amount of fraud. In classification, the target is categorical which
means that it can only take on a limited set of predefined values.
In binary classification, only two classes are considered (e.g., fraud
versus no-fraud) whereas in multiclass classification, the target can
belong to more than two classes (e.g., severe fraud, medium fraud,
no fraud).
In fraud detection, both classification and regression models can
be used simultaneously. Consider, for example, an insurance fraud setting. The expected loss due to fraud can be calculated as follows
Expected fraud loss (EFL) = PF × LGF + (1 − PF) × 0 = PF × LGF
where PF represents the probability of fraud and LGF the loss given
fraud. The latter can be expressed as an amount or as a percentage
of a maximum amount (e.g., the maximum insured amount). PF can
then be estimated using a classification technique whereas for LGF, a
regression model will be estimated.
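As a small worked example, with purely hypothetical values for PF and LGF, the expected fraud loss can be computed as follows:

```python
pf = 0.02            # probability of fraud, from a classification model
lgf = 15_000         # loss given fraud, from a regression model
efl = pf * lgf       # expected fraud loss: 0.02 * 15,000 = 300
print(efl)
```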
Many different types of predictive analytics techniques have been developed in the literature, originating from a variety of disciplines such as statistics, machine learning, artificial intelligence, pattern recognition, and data mining. The distinction between these disciplines is getting more and more blurred and is actually not that relevant in practice. In what follows, we will discuss a selection of techniques with a particular focus on the fraud practitioner's perspective.

TARGET DEFINITION
Since the target variable plays an important role in the learning process, it is of key importance that it is appropriately defined. In fraud
detection, the target fraud indicator is usually hard to determine since
one can never be fully sure that a certain transaction (e.g., credit card
fraud), claim (e.g., insurance fraud), or company (e.g., tax evasion
fraud) is fraudulent.
Let's take the example of insurance fraud (Viaene et al. 2002). If an applicant files a claim, the insurance company will perform various checks to flag the claim as suspicious or nonsuspicious. When the claim is considered suspicious, the insurance firm will first decide whether it is worth the effort to pursue an investigation. Obviously, this will also depend on the amount of the claim, such that small claims are most likely not considered further, even if they are fraudulent. When the claim is considered worthwhile to investigate, the firm might start a legal procedure resulting in a court judgment and/or legal settlement flagging the claim as fraudulent or not. It is clear that this procedure is not 100 percent error-proof either, and thus nonfraudulent claims might end up being flagged as fraudulent, or vice versa.
Another example is tax evasion fraud. An often-used fraud mechanism in this setting is a spider construction, as depicted in Figure 4.1
(Van Vlasselaer et al. 2013 and 2015).
The key company in the middle represents the firm that is the key
perpetrator of the fraud. It starts up a side company (Side Company
1), which makes revenue but deliberately does not pay its taxes and
hence intentionally goes bankrupt. On bankruptcy, its resources (e.g.,
employees, machinery, equipment, buyers, suppliers, physical address,
and other assets) are shifted toward a new side company (e.g., Side
Company 2), which repeats the fraud behavior and thus again goes
bankrupt. As such, a web of side companies evolves around the key
company. It is clear that in this setting it becomes hard to distinguish
a regular bankruptcy due to insolvency from a fraudulent bankruptcy
due to malicious intent. In other words, all fraudulent companies
go bankrupt, but not all bankrupt companies are fraudulent. This is
depicted in Figure 4.2.

Figure 4.1 A Spider Construction in Tax Evasion Fraud

Figure 4.2 Regular Versus Fraudulent Bankruptcy

Although suspension is seen as a normal way of stopping a company's activities (i.e., all debts redeemed), bankruptcy indicates that the company did not succeed in paying back all its creditors. Distinguishing between regular and fraudulent bankruptcies is subtle and hard to establish. Hence, it can be expected that some regular bankruptcies are in fact undetected fraudulent bankruptcies.


Also, the intensity of the fraud, when measured as an amount, might be hard to determine, since one has to take into account direct costs, indirect costs, reputation damage, and the time value of money (e.g., by discounting).

To summarize, in supervised fraud detection the target labels are typically not noise-free, thereby complicating the analytical modeling exercise. It is thus important for analytical techniques to be able to cope with this.

LINEAR REGRESSION
Linear regression is undoubtedly the most commonly used technique
to model a continuous target variable. For example, in a car insurance
fraud detection context, a linear regression model can be defined to
model the amount of fraud in terms of the age of the claimant, claimed
amount, severity of the accident, and so on.
Amount of fraud = 𝛽0 + 𝛽1 Age + 𝛽2 ClaimedAmount + 𝛽3 Severity + …
The general formulation of the linear regression model then
becomes:
Y = 𝛽0 + 𝛽1 X1 + … + 𝛽N XN ,
where Y represents the target variable, and X1 , … , XN the explanatory
variables. The 𝛽 parameters measure the impact on the target variable
Y of each of the individual explanatory variables.
Let’s now assume we start with a data set with n observations and
N explanatory variables structured as depicted in Table 4.1.

Table 4.1 Data Set for Linear Regression

Observation   X1     X2     ...   XN     Y
1             X11    X21    ...   XN1    Y1
2             X12    X22    ...   XN2    Y2
...
n             X1n    X2n    ...   XNn    Yn


The 𝛽 parameters of the linear regression model can then be estimated by minimizing the following squared error function:

$$\frac{1}{2}\sum_{i=1}^{n} e_i^2 = \frac{1}{2}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 = \frac{1}{2}\sum_{i=1}^{n}\left(Y_i - (\beta_0 + \beta_1 X_{1i} + \dots + \beta_N X_{Ni})\right)^2,$$

where $Y_i$ represents the target value for observation i and $\hat{Y}_i$ the prediction made by the linear regression model for observation i. Graphically, this idea corresponds to minimizing the sum of all squared errors, as represented in Figure 4.3.
Straightforward mathematical calculus then yields the following closed-form formula for the weight parameter vector $\hat{\beta}$:

$$\hat{\beta} = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_N \end{bmatrix} = (X^T X)^{-1} X^T Y,$$

where X represents the matrix with the explanatory variable values, augmented with an additional column of ones to account for the intercept term $\beta_0$, and Y represents the target value vector (see Table 4.1).
This model and corresponding parameter optimization procedure are
often referred to as ordinary least squares (OLS) regression.
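The closed-form estimator is straightforward to verify numerically. The following minimal sketch, using NumPy on simulated data (all values hypothetical), recovers the 𝛽 parameters via $(X^T X)^{-1} X^T Y$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                  # explanatory variables
y = 1.5 + X @ np.array([0.8, -0.4, 2.0]) + rng.normal(scale=0.1, size=200)

X_aug = np.column_stack([np.ones(len(X)), X])  # column of ones for beta_0
beta_hat = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y
# In practice np.linalg.lstsq is numerically safer than the explicit inverse.
print(beta_hat)                                # ~[1.5, 0.8, -0.4, 2.0]
```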
A key advantage of OLS regression is that it is simple and thus easy
to understand. Once the parameters have been estimated, the model
can be evaluated in a straightforward way, hereby contributing to its
operational efficiency.

Figure 4.3 OLS Regression


Note that more sophisticated variants have been suggested in the literature, such as ridge regression, lasso regression, time series models (ARIMA, VAR, GARCH), multivariate adaptive regression splines (MARS), and so on. Most of these relax the linearity assumption by introducing additional transformations, albeit at the cost of increased complexity.

LOGISTIC REGRESSION
Basic Concepts
Consider a classification data set in a tax-evasion setting as depicted in
Table 4.2.
When modeling the binary fraud target using linear regression,
one gets:
Y = 𝛽0 + 𝛽1 Revenue + 𝛽2 Employees + 𝛽3 VATCompliant
When estimating this using OLS, two key problems arise:
1. The errors/target are not normally distributed but follow a
Bernoulli distribution with only two values;
2. There is no guarantee that the target is between 0 and 1, which
would be handy since it can then be interpreted as a probability.
Consider now the following bounding function:

$$f(z) = \frac{1}{1 + e^{-z}},$$

which is shown in Figure 4.4.
Table 4.2 Example Classification Data Set

Company   Revenue   Employees   VATCompliant   ...   Fraud   Y
ABC       3,000k    400         Y              ...   No      0
BCD       200k      800         N              ...   No      0
CDE       4,200k    2,200       N              ...   Yes     1
...
XYZ       34k       50          N              ...   Yes     1


Figure 4.4 Bounding Function for Logistic Regression

For every possible value of z, the outcome is always between 0
and 1. Hence, by combining the linear regression with the bounding
function, we get the following logistic regression model:
$$P(\text{fraud} = \text{yes} \mid \text{Revenue}, \text{Employees}, \text{VATCompliant}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \text{Revenue} + \beta_2 \text{Employees} + \beta_3 \text{VATCompliant})}}$$

The outcome of the above model is always bounded between 0 and 1, no matter which values of revenue, employees, and VAT compliant are being used, and can as such be interpreted as a probability.

The general formulation of the logistic regression model then becomes (Allison 2001):

$$P(Y = 1 \mid X_1, \ldots, X_N) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_N X_N)}},$$

or alternatively,

$$P(Y = 0 \mid X_1, \ldots, X_N) = 1 - P(Y = 1 \mid X_1, \ldots, X_N) = 1 - \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_N X_N)}} = \frac{1}{1 + e^{(\beta_0 + \beta_1 X_1 + \dots + \beta_N X_N)}}$$


Hence, both $P(Y = 1 \mid X_1, \ldots, X_N)$ and $P(Y = 0 \mid X_1, \ldots, X_N)$ are bounded between 0 and 1.

Reformulating in terms of the odds, the model becomes:

$$\frac{P(Y = 1 \mid X_1, \ldots, X_N)}{P(Y = 0 \mid X_1, \ldots, X_N)} = e^{(\beta_0 + \beta_1 X_1 + \dots + \beta_N X_N)},$$

or in terms of the log odds (logit),

$$\ln\left(\frac{P(Y = 1 \mid X_1, \ldots, X_N)}{P(Y = 0 \mid X_1, \ldots, X_N)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_N X_N$$

The 𝛽i parameters of a logistic regression model are then estimated using the idea of maximum likelihood. Maximum likelihood optimization chooses the parameters in such a way as to maximize the probability of getting the sample at hand. First, the likelihood function is constructed. For observation i, the probability of observing either class equals:

$$P(Y = 1 \mid X_{1i}, \ldots, X_{Ni})^{Y_i}\left(1 - P(Y = 1 \mid X_{1i}, \ldots, X_{Ni})\right)^{1 - Y_i},$$

where $Y_i$ represents the target value (either 0 or 1) for observation i. The likelihood function across all n observations then becomes:

$$\prod_{i=1}^{n} P(Y = 1 \mid X_{1i}, \ldots, X_{Ni})^{Y_i}\left(1 - P(Y = 1 \mid X_{1i}, \ldots, X_{Ni})\right)^{1 - Y_i}.$$

To simplify the optimization, the logarithmic transformation of the likelihood function is taken, and the corresponding log-likelihood can then be optimized using, for instance, the Newton-Raphson method.
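To make the estimation procedure concrete, the following minimal sketch implements Newton-Raphson updates for the logistic regression log-likelihood; the function name and its arguments are illustrative assumptions rather than a reference implementation:

```python
import numpy as np

def logit_newton_raphson(X, y, n_iter=25):
    # X: (n, N+1) design matrix incl. intercept column; y: 0/1 targets
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # current probabilities
        W = p * (1.0 - p)                     # diagonal of the weight matrix
        grad = X.T @ (y - p)                  # gradient of the log-likelihood
        hess = X.T @ (X * W[:, None])         # X'WX (negative Hessian)
        beta += np.linalg.solve(hess, grad)   # Newton-Raphson update
    return beta
```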

Logistic Regression Properties
Since logistic regression is linear in the log odds (logit), it basically
estimates a linear decision boundary to separate both classes. This is
illustrated in Figure 4.5 whereby F represents fraudulent firms and L
indicates legitimate or thus nonfraudulent firms.


Figure 4.5 Linear Decision Boundary of Logistic Regression

To interpret a logistic regression model, one can calculate the odds ratio. Suppose variable $X_i$ increases with one unit, with all other variables kept constant (ceteris paribus); then the new logit becomes the old logit with 𝛽i added. Likewise, the new odds become the old odds multiplied by $e^{\beta_i}$. The latter represents the odds ratio, that is, the multiplicative increase in the odds when $X_i$ increases by 1 (ceteris paribus). Hence,

◾ 𝛽i > 0 implies $e^{\beta_i} > 1$, and the odds and probability increase with $X_i$;

◾ 𝛽i < 0 implies $e^{\beta_i} < 1$, and the odds and probability decrease with $X_i$.

Another way of interpreting a logistic regression model is by calculating the doubling amount. This represents the amount of change required for doubling the primary outcome odds. It can be easily seen that for a particular variable $X_i$, the doubling amount equals $\ln(2)/\beta_i$.
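As a quick numeric illustration, with a purely hypothetical coefficient:

```python
import numpy as np

beta_i = 0.35                        # hypothetical coefficient for X_i
odds_ratio = np.exp(beta_i)          # ~1.42: odds multiply by 1.42 per unit
doubling = np.log(2) / beta_i        # ~1.98 units of X_i double the odds
print(odds_ratio, doubling)
```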
Note that next to the f(z) transformation, other transformations have been suggested in the literature. Popular examples are the probit and cloglog transformations:

$$f(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{t^2}{2}}\, dt,$$

$$f(z) = 1 - e^{-e^{z}}.$$

These transformations are visualized in Figure 4.6.


Figure 4.6 Other Transformations (linear, logit, probit, cloglog)

Note, however, that empirical evidence suggests that all three
transformations typically perform equally well.

Building a Logistic Regression Scorecard
Logistic regression is a very popular supervised fraud-detection technique due to its simplicity and good performance. Just as with linear regression, once the parameters have been estimated, it can be evaluated in a straightforward way, hereby contributing to its operational efficiency. From an interpretability viewpoint, it can be easily transformed into an interpretable, user-friendly, points-based fraud scorecard. Let's assume we start from the following logistic regression model, whereby the explanatory variables have been coded using weight of evidence (WOE) coding:

$$P(\text{fraud} = \text{yes} \mid \text{Revenue}, \text{Employees}, \text{VATCompliant}, \ldots) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \text{WOE}_{\text{Revenue}} + \beta_2 \text{WOE}_{\text{Employees}} + \beta_3 \text{WOE}_{\text{VATCompliant}} + \dots)}}.$$

As discussed earlier, this model can be easily reexpressed in a linear way, in terms of the log odds, as follows:

$$\log\left(\frac{P(\text{fraud} = \text{yes} \mid \text{Revenue}, \text{Employees}, \text{VATCompliant}, \ldots)}{P(\text{fraud} = \text{no} \mid \text{Revenue}, \text{Employees}, \text{VATCompliant}, \ldots)}\right) = \beta_0 + \beta_1 \text{WOE}_{\text{Revenue}} + \beta_2 \text{WOE}_{\text{Employees}} + \beta_3 \text{WOE}_{\text{VATCompliant}} + \dots
$$


A scaling can then be introduced by calculating a fraud score, which
is linearly related to the log odds as follows:
Fraud score = Offset + Factor × log(odds).
Assume that we want a fraud score of 100 for odds of 50:1, and a
fraud score of 120 for odds of 100:1. This gives the following:
100 = Offset + Factor × ln(50)
120 = Offset + Factor × ln(100)
The offset and factor then become:
Factor = 20∕ ln(2) = 28.85
Offset = 100 − Factor × ln(50) = –12.87
Once these values are known, the fraud score becomes:

$$\text{Fraud score} = \left(\sum_{i=1}^{N}\left(\text{WOE}_i \times \beta_i\right) + \beta_0\right) \times \text{Factor} + \text{Offset}$$

$$\text{Fraud score} = \left(\sum_{i=1}^{N}\left(\text{WOE}_i \times \beta_i + \frac{\beta_0}{N}\right)\right) \times \text{Factor} + \text{Offset}$$

$$\text{Fraud score} = \sum_{i=1}^{N}\left(\left(\text{WOE}_i \times \beta_i + \frac{\beta_0}{N}\right) \times \text{Factor} + \frac{\text{Offset}}{N}\right)$$

Hence, the points for each attribute are calculated by multiplying
the weight of evidence of the attribute with the regression coefficient
of the characteristic, then adding a fraction of the regression intercept,
multiplying the result by the factor, and finally adding a fraction of
the offset. The corresponding fraud scorecard can then be visualized as
depicted in Figure 4.7.
The fraud scorecard is very easy to work with. Suppose a new firm
with the following characteristics needs to be scored:
Revenue = 750.000, Employees = 420, VAT Compliant = No, …

Characteristic Name   Attribute             Points
Revenue 1             Up to 100.000         80
Revenue 2             100.000–500.000       120
Revenue 3             500.000–1.000.000     160
Revenue 4             1.000.000+            240
Employees 1           Up to 50              5
Employees 2           50–500                20
Employees 3           500+                  80
VAT Compliant         Yes                   100
VAT Compliant         No                    140
...

Figure 4.7 Fraud Detection Scorecard

The score for this firm can then be calculated as follows: 160 + 20 + 140 + … This score can then be compared against a critical cut-off to help decide whether the firm is fraudulent. A key advantage of the fraud scorecard is its interpretability: one can clearly see which categories are the most risky and how they contribute to the overall fraud score. Hence, this is a very useful technique in fraud detection settings where interpretability is a key concern.
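To make the scaling concrete, the following sketch computes the factor, the offset, and the points per characteristic; the 𝛽 coefficients and WOE values are hypothetical:

```python
import numpy as np

factor = 20 / np.log(2)                 # 28.85: 20 points double the odds
offset = 100 - factor * np.log(50)      # -12.87: score 100 at odds 50:1

# Hypothetical logistic regression on WOE-coded characteristics
beta0, betas = -2.1, np.array([0.9, 0.4, 1.3])
woes = np.array([0.55, -0.10, 0.80])    # WOEs of one firm's attributes
N = len(betas)

points = (woes * betas + beta0 / N) * factor + offset / N
print(points.round(0), points.sum().round(0))   # per characteristic and total
```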

VARIABLE SELECTION FOR LINEAR AND LOGISTIC
REGRESSION
Variable selection aims at reducing the number of variables in a model. It makes the model more concise and faster to evaluate, which is especially relevant in a fraud detection setting. Both linear and logistic regression have built-in procedures to perform variable selection. These are based on statistical hypothesis tests that verify whether the coefficient of a variable i is significantly different from zero:

$$H_0: \beta_i = 0$$
$$H_A: \beta_i \neq 0$$


In linear regression, the test statistic becomes:

$$t = \frac{\hat{\beta}_i}{s.e.(\hat{\beta}_i)},$$

and follows a Student's t-distribution with n − 2 degrees of freedom, whereas in logistic regression, the test statistic is:

$$\chi^2 = \left(\frac{\hat{\beta}_i}{s.e.(\hat{\beta}_i)}\right)^2,$$

and follows a Chi-squared distribution with 1 degree of freedom. Note that both test statistics are intuitive in the sense that they will reject the null hypothesis $H_0$ if the estimated coefficient $\hat{\beta}_i$ is high in absolute value compared to its standard error $s.e.(\hat{\beta}_i)$. The latter can be easily obtained as a byproduct of the optimization procedure. Based on the value of the test statistic, one calculates the p-value, which is the probability of getting a more extreme value than the one observed. This is visualized in Figure 4.8, assuming a value of 3 for the test statistic.
Note that since the hypothesis test is two-sided, the p-value adds the
areas to the right of 3 and to the left of –3.

Figure 4.8 Calculating the p-Value with a Student's t-Distribution
In other words, a low (high) p-value represents a significant (insignificant) variable. From a practical viewpoint, the p-value can be compared against a significance level. Table 4.3 presents some commonly used values to decide on the degree of variable significance.

Table 4.3 Reference Values for Variable Significance

p-value < 0.01            Highly significant
0.01 < p-value < 0.05     Significant
0.05 < p-value < 0.10     Weakly significant
p-value > 0.10            Not significant
Various variable selection procedures can now be used based on the p-value. Suppose one has four variables V1, V2, V3, and V4 (e.g., amount of transaction, currency, transaction type, and merchant category). The number of possible variable subsets then equals $2^4 - 1$, or 15, as displayed in Figure 4.9.

Figure 4.9 Variable Subsets for Four Variables V1, V2, V3, and V4
When the number of variables is small, an exhaustive search
amongst all variable subsets can be performed. However, as the
number of variables increases, the search space grows exponentially
and heuristic search procedures are needed. Using the p-values, the
variable space can be navigated in three possible ways. Forward
regression starts from the empty model and always adds variables
based on low p-values. Backward regression starts from the full





model and always removes variables based on high p-values. Stepwise
regression is a mix between both. It starts off like forward regression,
but once the second variable has been added, it will always check
the other variables in the model and remove them if they turn out
to be insignificant according to their p-value. Obviously, all three
procedures assume preset significance levels, which should be set by
the user before the variable selection procedure starts.
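A minimal sketch of forward regression based on p-values, assuming the statsmodels package for the logistic regression fits; the function name, the 0.05 significance level, and the data layout are illustrative assumptions:

```python
import numpy as np
import statsmodels.api as sm

def forward_selection(X, y, alpha=0.05):
    # X: numpy array of candidate variables, y: 0/1 target
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        pvals = {}
        for j in remaining:
            design = sm.add_constant(X[:, selected + [j]])
            fit = sm.Logit(y, design).fit(disp=0)
            pvals[j] = fit.pvalues[-1]     # p-value of the newly added variable
        best = min(pvals, key=pvals.get)   # add the variable with the lowest p
        if pvals[best] >= alpha:           # no remaining variable is significant
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```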
In fraud detection, it is very important to be aware that statistical significance is only one evaluation criterion for variable selection. As mentioned before, interpretability is also an important criterion. In both
linear and logistic regression, this can be easily evaluated by inspecting
the sign of the regression coefficient. It is hereby highly preferable that
a coefficient has the same sign as anticipated by the business expert,
otherwise he/she will be reluctant to use the model.
Coefficients can have unexpected signs due to multicollinearity
issues, noise or small sample effects. Sign restrictions can be easily
enforced in a forward regression setup by preventing variables with
the wrong sign from entering the model. Another criterion for variable
selection is operational efficiency. This refers to the amount of resources
that are needed for the collection and preprocessing of a variable.
For example, although trend variables are typically very predictive,
they require a lot of effort to be calculated and may thus not be
suitable to be used in an online, real-time fraud scoring environment
such as credit card fraud detection. The same applies to external
data, where the latency might hamper a timely decision. In both
cases, it might be worthwhile to look for a variable that is correlated
and less predictive but easier to collect and calculate. Finally, also
legal issues need to be properly taken into account. Some variables
cannot be used in fraud-detection applications because of privacy or
discrimination concerns.

DECISION TREES
Basic Concepts
Decision trees are recursive-partitioning algorithms (RPAs) that come
up with a tree-like structure representing patterns in an underlying



data set (Duda et al. 2001). Figure 4.10 provides an example of a decision tree in a fraud-detection setting.

Transaction amount > $100,000?
  No  → Previous fraud? (Yes → Fraud; No → No Fraud)
  Yes → Unemployed? (Yes → Fraud; No → No Fraud)

Figure 4.10 Example Decision Tree
The top node is the root node specifying a testing condition, of
which the outcome corresponds to a branch leading up to an internal
node. The terminal nodes of the tree assign the classifications (in our case, fraud labels) and are also referred to as the leaf nodes. Many
algorithms have been suggested in the literature to construct decision
trees. Among the most popular are: C4.5 (See5) (Quinlan 1993), CART
(Breiman et al. 1984), and CHAID (Hartigan 1975). These algorithms
differ in their way of answering the key decisions to build a tree:
◾ Splitting decision: Which variable to split at what value (e.g., Transaction amount > $100,000 or not; Previous fraud is yes or no; Unemployed is yes or no)?

◾ Stopping decision: When to stop adding nodes to the tree?

◾ Assignment decision: What class (e.g., fraud or no fraud) to assign to a leaf node?

Usually, the assignment decision is the most straightforward to make, since one typically looks at the majority class within the leaf node. This idea is also referred to as winner-take-all learning. The other two decisions are less straightforward and are elaborated on in what follows.

Splitting Decision
In order to answer the splitting decision, one must define the concept of impurity or chaos. Consider, for example, the three data
sets of Figure 4.11 each containing good (unfilled circles) and bad



(filled circles) customers.

Figure 4.11 Example Data Sets for Calculating Impurity (minimal, maximal, and minimal impurity)

Quite obviously, the good customers are nonfraudulent, whereas the bad customers are fraudulent. Minimal impurity occurs when all customers are either good or bad. Maximal impurity occurs when one has the same number of good and bad customers (i.e., the data set in the middle).
Decision trees will now aim at minimizing the impurity in the data.
In order to do so appropriately, one needs a measure to quantify impurity. Various measures have been introduced in the literature and the
most popular are:
◾ Entropy: $E(S) = -p_G \log_2(p_G) - p_B \log_2(p_B)$ (C4.5/See5)

◾ Gini: $\text{Gini}(S) = 2 p_G p_B$ (CART)

◾ Chi-squared analysis (CHAID)

with $p_G$ and $p_B$ being the proportions of good and bad, respectively. Both measures are depicted in Figure 4.12, where it can be clearly seen that the entropy (Gini) is minimal when all customers are either good or bad, and maximal when there are equal numbers of good and bad customers.
In order to answer the splitting decision, various candidate splits
will now be evaluated in terms of their decrease in impurity. Consider
a split on age, as depicted in Figure 4.13.
The original data set had maximum entropy, since the numbers of goods and bads were the same. The entropy calculations now become:

◾ Entropy top node = $-\frac{1}{2}\log_2(\frac{1}{2}) - \frac{1}{2}\log_2(\frac{1}{2}) = 1$

◾ Entropy left node = $-\frac{1}{3}\log_2(\frac{1}{3}) - \frac{2}{3}\log_2(\frac{2}{3}) = 0.91$

◾ Entropy right node = $-1 \times \log_2(1) - 0 \times \log_2(0) = 0$


Figure 4.12 Entropy Versus Gini

Root node: 400 goods, 400 bads
Age < 30: 200 goods, 400 bads        Age ≥ 30: 200 goods, 0 bads

Figure 4.13 Calculating the Entropy for Age Split

The weighted decrease in entropy, also known as the gain, can then
be calculated as follows:
Gain = 1 − (600∕800) × 0.91 − (200∕800) × 0 = 0.32
The gain measures the weighted decrease in entropy thanks to the
split. It speaks for itself that a higher gain is to be preferred. The decision


tree algorithm will now consider different candidate splits for its root
node and adopt a greedy strategy by picking the one with the biggest
gain. Once the root node has been decided on, the procedure continues
in a recursive way, each time adding splits with the biggest gain. In
fact, this can be perfectly parallelized and both sides of the tree can
grow in parallel, hereby increasing the efficiency of the tree construction
algorithm.
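The entropy and gain calculations above are easy to verify numerically. A minimal sketch follows; note that the exact gain is 0.31, the 0.32 above resulting from first rounding the left-node entropy to 0.91:

```python
import numpy as np

def entropy(n_good, n_bad):
    p = np.array([n_good, n_bad]) / (n_good + n_bad)
    p = p[p > 0]                       # convention: 0 * log2(0) = 0
    return -(p * np.log2(p)).sum()

top = entropy(400, 400)                # 1.0
left = entropy(200, 400)               # ~0.918 (Age < 30)
right = entropy(200, 0)                # 0.0   (Age >= 30)
gain = top - 600/800 * left - 200/800 * right
print(round(gain, 2))                  # ~0.31
```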

Stopping Decision


The third decision relates to the stopping criterion. Obviously, if the tree
continues to split, it will become very detailed with leaf nodes containing only a few observations. In the most extreme case, the tree will
have one leaf node per observation and as such perfectly fit the data.
However, by doing so, the tree will start to fit the specificities or noise
in the data, which is also referred to as overfitting. In other words, the
tree has become too complex and fails to correctly model the noise free
pattern or trend in the data. As such, it will generalize poorly to new
unseen data. In order to avoid this from happening, the data will be
split into a training sample and a validation sample. The training sample will be used to make the splitting decision. The validation sample is
an independent sample, set aside to monitor the misclassification error
(or any other performance metric such as a profit-based measure) as
the tree is grown. A commonly used split up is a 70 percent training
sample and 30 percent validation sample. One then typically observes
a pattern as depicted in Figure 4.14.

Figure 4.14 Using a Validation Set to Stop Growing a Decision Tree


The error on the training sample keeps on decreasing as the splits
become more and more specific and tailored towards it. On the validation sample, the error will initially decrease, which indicates that
the tree splits generalize well. However, at some point the error will
increase since the splits become too specific for the training sample as
the tree starts to memorize it. Where the validation set curve reaches its
minimum, the procedure should be stopped, as otherwise overfitting
will occur. Note that, as already mentioned, besides classification error,
one might also use accuracy or profit based measures on the Y-axis
to make the stopping decision. Also note that sometimes, simplicity is
preferred above accuracy, and one can select a tree that does not necessarily have minimum validation set error, but a lower number of nodes.

Decision Tree Properties
In the example of Figure 4.10, every node had only two branches.
The advantage of this is that the testing condition can be implemented
as a simple yes/no question. Multiway splits allow for more than
two branches and can provide trees that are wider but less deep. In a
read-once decision tree, a particular attribute can be used only once
in a certain tree path. Every tree can also be represented as a rule set, since every path from the root node to a leaf node makes up a simple if-then rule. For the tree depicted in Figure 4.10, the corresponding rules are:
If Transaction amount > $100,000 And Unemployed = No Then no fraud
If Transaction amount > $100,000 And Unemployed = Yes Then fraud
If Transaction amount ≤ $100,000 And Previous fraud = Yes Then fraud
If Transaction amount ≤ $100,000 And Previous fraud = No Then no fraud
These rules can then be easily implemented in all kinds of software
packages (e.g., Microsoft Excel).
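For instance, the four rules above translate directly into code; the following function is a hypothetical illustration:

```python
def predict_fraud(amount, previous_fraud, unemployed):
    # Direct translation of the rule set extracted from Figure 4.10
    if amount > 100_000:
        return "fraud" if unemployed else "no fraud"
    return "fraud" if previous_fraud else "no fraud"

print(predict_fraud(150_000, previous_fraud=False, unemployed=True))  # fraud
```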
Decision trees essentially model decision boundaries orthogonal to
the axes. This is illustrated in Figure 4.15 for an example decision tree.


Figure 4.15 Decision Boundary of a Decision Tree

Regression Trees
Decision trees can also be used to predict continuous targets. Consider
the example of Figure 4.16 where a regression tree is used to predict
the fraud percentage (FP). The latter can be expressed as the percentage
of a predefined limit based on, for example, the maximum transaction amount.
Other criteria need to be used to make the splitting decision, since the impurity must now be measured in another way. One way to measure the impurity in a node is by calculating the mean squared error (MSE) as follows:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2,$$

where n represents the number of observations in a leaf node, $Y_i$ the value of observation i, and $\bar{Y}$ the average of all values in the leaf node. Obviously, it is desirable to have a low MSE in a leaf node, since this indicates that the node is more homogeneous.
Figure 4.16 Example Regression Tree for Predicting the Fraud Percentage


Another way to make the splitting decision is by conducting a simple analysis of variance (ANOVA) test and calculating an F-statistic as follows:

$$F = \frac{SS_{between}/(B - 1)}{SS_{within}/(n - B)} \sim F_{B-1,\, n-B},$$

whereby

$$SS_{between} = \sum_{b=1}^{B} n_b \left(\bar{Y}_b - \bar{Y}\right)^2$$

$$SS_{within} = \sum_{b=1}^{B} \sum_{i=1}^{n_b} \left(Y_{bi} - \bar{Y}_b\right)^2,$$

with B the number of branches of the split, $n_b$ the number of observations in branch b, $\bar{Y}_b$ the average in branch b, $Y_{bi}$ the value of observation i in branch b, and $\bar{Y}$ the overall average. Good splits favor homogeneity within a node (low $SS_{within}$) and heterogeneity between nodes (high $SS_{between}$). In other words, good splits should have a high F-value, or a low corresponding p-value.
The stopping decision can be made in a similar way as for classification trees, but using a regression-based performance measure (e.g., mean squared error, mean absolute deviation, R-squared) on the Y-axis. The assignment decision can be made by assigning the mean (or median) to each leaf node. Note that standard deviations, and thus confidence intervals, may also be computed for each of the leaf nodes.
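The F-statistic computation can be illustrated as follows, using hypothetical fraud percentages in the two branches of a candidate split:

```python
import numpy as np
from scipy.stats import f as f_dist

# Hypothetical fraud percentages observed in two branches of a split
branches = [np.array([0.22, 0.30, 0.26]), np.array([0.70, 0.82, 0.64, 0.76])]
n = sum(len(b) for b in branches)
B = len(branches)
grand = np.concatenate(branches).mean()

ss_between = sum(len(b) * (b.mean() - grand) ** 2 for b in branches)
ss_within = sum(((b - b.mean()) ** 2).sum() for b in branches)
F = (ss_between / (B - 1)) / (ss_within / (n - B))
p = f_dist.sf(F, B - 1, n - B)         # low p-value indicates a good split
print(F, p)
```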

Using Decision Trees in Fraud Analytics
Decision trees can be used for various purposes in fraud analytics. First,
they can be used for variable selection as variables that occur at the top
of the tree are more predictive of the target. One could also simply calculate the gain of a characteristic to gauge its predictive power. As an
alternative, remember that we already discussed the information value
in Chapter 2 to measure the predictive strength of a variable. Typically,
both the gain and information value consider similar attributes to be
predictive, so one can just choose the measure that is readily available
in the analytics software. Decision trees can also be used for high-level
segmentation. One then typically builds a tree two or three levels deep


as the segmentation scheme and then uses second-stage logistic regression models for further refinement. Finally, decision trees can also be
used as the final analytical fraud model to be used directly into the business environment. A key advantage here is that the decision tree gives
a white-box model with a clear explanation behind how it reaches its
classifications.
Many software tools also allow users to grow trees interactively by providing, at each level of the tree, the top two (or more) candidate splits among which the fraud modeler can choose. This allows users to choose splits not only based on impurity reduction, but also on the interpretability and/or computational complexity of the split criterion. Hence, the modeler may favor a split on a less predictive variable that is easier to collect and/or interpret.
Decision trees are very powerful techniques and allow for more
complex decision boundaries than a logistic regression. As discussed,
they are also interpretable and operationally efficient. They are also
nonparametric in the sense that no normality or independence assumptions were needed to build a decision tree. Their most important disadvantage is that they are highly dependent on the sample that was used
for tree construction. A small variation in the underlying sample might
yield a totally different tree. In a later section, we will discuss how this
shortcoming can be addressed using the idea of ensemble learning.

NEURAL NETWORKS
Basic Concepts
A first perspective on the origin of neural networks states that they
are mathematical representations inspired by the functioning of the
human brain. Although this may sound appealing, another more
realistic perspective sees neural networks as generalizations of existing
statistical models (Bishop 1995; Zurada 1992). Let’s take logistic
regression as an example:
$$P(Y = 1 \mid X_1, \ldots, X_N) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_N X_N)}}.$$

We could visualize this model as shown in Figure 4.17.

Figure 4.17 Neural Network Representation of Logistic Regression

The processing element or neuron in the middle basically performs two operations: it takes the inputs and multiplies them with
the weights (including the intercept term 𝛽0 , which is called the
bias term in neural networks) and then puts this into a nonlinear
transformation function similar to the one we discussed in the section
on logistic regression. So logistic regression is a neural network with
one neuron. Similarly, we could visualize linear regression as a one
neuron neural network with the identity transformation f (z) = z.
We can now generalize the above picture to a multilayer perceptron
(MLP) neural network by adding more layers and neurons, as shown
in Figure 4.18 (Bishop 1995; Zurada 1992).

Figure 4.18 A Multilayer Perceptron (MLP) Neural Network, with hidden neurons $h_j = f\left(\sum_i x_i w_{ij} + b_j\right)$ and output $y = \sum_j v_j h_j + b_4$


The example in Figure 4.18 is an MLP with one input layer, one hidden layer, and one output layer. The hidden layer essentially works like
a feature extractor by combining the inputs into features that are then
subsequently offered to the output layer to make the optimal prediction.
The hidden layer has a nonlinear transformation function f() and the
output layer a linear transformation function. The most popular transformation functions (also called squashing or activation functions) are:

◾ Logistic, $f(z) = \frac{1}{1 + e^{-z}}$, ranging between 0 and 1

◾ Hyperbolic tangent, $f(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$, ranging between −1 and +1

◾ Linear, $f(z) = z$, ranging between −∞ and +∞

Although theoretically the activation functions may differ per neuron, they are typically fixed for each layer. For classification (e.g., fraud
detection), it is common practice to adopt a logistic transformation in
the output layer, since the outputs can then be interpreted as probabilities (Baesens et al. 2002). For regression targets (e.g., amount of
fraud), one could use any of the transformation functions listed above.
Typically, one will use the hyperbolic tangent activation function in the
hidden layer.
In terms of hidden layers, theoretical works have shown that neural networks with one hidden layer are universal approximators, capable of approximating any function to any desired degree of accuracy on
a compact interval (Hornik et al. 1989). Only for discontinuous functions (e.g., a saw tooth pattern) or in a deep learning context, it could
make sense to try out more hidden layers. Note, however, that these
complex patterns rarely occur in practice. In a fraud setting, it is recommended to continue the analysis with one hidden layer.
In terms of data preprocessing, it is advised to standardize the
continuous variables using, for example, the z-scores. For categorical
variables, categorization can be used to reduce the number of categories, which can then be coded using, for example, dummy variables
or weight of evidence coding. Note that it is important to only consider
categorization for the categorical variables, and not for the continuous
variables. The latter can be categorized to model nonlinear effects
into linear models (e.g., linear or logistic regression), but since neural
networks are capable of modeling nonlinear relationships, it is not
needed here.


Weight Learning
As discussed earlier, for simple statistical models such as linear regression, there exists a closed-form mathematical formula for the optimal
parameter values. However, for neural networks, the optimization is a
lot more complex and the weights sitting on the various connections
need to be estimated using an iterative algorithm. The algorithm then
optimizes a cost-function. Similarly to linear regression, when the target variable is continuous, a mean squared error (MSE) cost function
will be optimized as follows:
$$\frac{1}{2}\sum_{i=1}^{n} e_i^2 = \frac{1}{2}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2,$$

where $\hat{Y}_i$ now represents the neural network prediction for observation i. In case of a binary target variable, a maximum likelihood cost function can be optimized as follows:

$$\prod_{i=1}^{n} P(Y = 1 \mid X_{1i}, \ldots, X_{Ni})^{Y_i}\left(1 - P(Y = 1 \mid X_{1i}, \ldots, X_{Ni})\right)^{1 - Y_i},$$

where $P(Y = 1 \mid X_{1i}, \ldots, X_{Ni})$ represents the probability prediction from the neural network.
The optimization procedure typically starts from a set of random
weights (e.g., drawn from a standard normal distribution), which are
then iteratively adjusted to the patterns in the data using an optimization algorithm. Popular optimization algorithms here are backpropagation learning, conjugate gradient, and Levenberg-Marquardt; see Bishop (1995) for more details. A key issue to note here is the curvature
of the objective function, which is not convex and may be multimodal
as illustrated in Figure 4.19. The error function can thus have multiple local minima but typically only one global minimum. Hence, if the
starting weights are chosen in a suboptimal way, one may get stuck in a
local minimum, which is clearly undesirable. One way to deal with this
is to try out different starting weights, start the optimization procedure
for a few steps, and then continue with the best intermediate solution.
This approach is sometimes referred to as preliminary training. The
optimization procedure then continues until the error function shows
no further progress, the weights stop changing substantially, or after a
fixed number of optimization steps (also called epochs).


Figure 4.19 Local Versus Global Minima

Although multiple output neurons could be used (e.g., predicting fraud and fraud amount simultaneously), it is highly advised to
use only one to make sure that the optimization task is well focused.
The hidden neurons, however, should be carefully tuned and depend
on the nonlinearity in the data. More complex, nonlinear patterns
will require more hidden neurons. Although various procedures (e.g.,
cascade correlation, genetic algorithms, Bayesian methods) have been
suggested in the scientific literature to do this, the most straightforward, yet efficient procedure is as follows (Moody and Utans 1994):
1. Split the data into a training, validation, and test set.
2. Vary the number of hidden neurons from 1 to 10 in steps of one
or more.
3. Train a neural network on the training set and measure the performance on the validation set (may be train multiple neural
networks to deal with the local minimum issue).
4. Choose the number of hidden neurons with optimal validation
set performance.
5. Measure the performance on the independent test set.
Note that for fraud detection, the number of hidden neurons
typically varies between 6 and 12.


Neural networks can model very complex patterns and decision
boundaries in the data and are as such very powerful. Just as with
decision trees, they are so powerful that they can even model the
noise in the training data, which is something that definitely should be
avoided. One way to avoid this overfitting is by using a validation set
in a similar way as with decision trees. This is illustrated in Figure 4.20.
The training set is used here to estimate the weights and the validation set is again an independent data set used to decide when to
stop training.
Another scheme to prevent a neural network from overfitting is
weight regularization, whereby the idea is to keep the weights small in
absolute sense since otherwise they may be fitting the noise in the data.
This is then implemented by adding a weight size term (e.g., Euclidean
norm) to the objective function of the neural network (Bartlett 1997).
In case of a continuous output (and thus a mean squared error cost function), the objective function then becomes:

$$\frac{1}{2}\sum_{i=1}^{n} e_i^2 + \lambda \sum_{j=1}^{k} w_j^2,$$

where k represents the number of weights in the network and 𝜆 a weight decay (also referred to as weight regularization) parameter to weigh the importance of error versus weight minimization. Setting 𝜆 too low will cause overfitting, whereas setting it too high will cause underfitting. A practical approach to determining 𝜆 is to try out different values on an independent validation set and select the one with the best performance.
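A minimal sketch of this validation-based tuning, assuming scikit-learn's MLPClassifier (whose alpha argument plays the role of the weight decay parameter 𝜆) and AUC as the validation measure; the grids and function signature are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

def tune_mlp(X_train, y_train, X_val, y_val):
    # Vary the number of hidden neurons and the weight decay (alpha),
    # keeping the network with the best validation AUC.
    best, best_auc = None, -np.inf
    for n_hidden in range(1, 11):
        for alpha in (1e-4, 1e-2, 1.0):
            net = MLPClassifier(hidden_layer_sizes=(n_hidden,),
                                activation="tanh", alpha=alpha,
                                max_iter=2000, random_state=0)
            net.fit(X_train, y_train)
            auc = roc_auc_score(y_val, net.predict_proba(X_val)[:, 1])
            if auc > best_auc:
                best, best_auc = net, auc
    return best, best_auc
```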

Figure 4.20 Using a Validation Set for Stopping Neural Network Training


Opening the Neural Network Black Box
Although neural networks have their merits in terms of modeling
power, they are commonly described as black-box techniques since
they relate the inputs to the outputs in a mathematically complex,
nontransparent, and opaque way. They have been successfully applied
as high-performance analytical tools in settings where interpretability
is not a key concern (e.g., credit card fraud detection).
However, in application areas where insight into the fraud behavior
is important, one needs to be careful with neural networks (Baesens,
Martens et al. 2011). In what follows, we will discuss the following
three ways of opening the neural network black box:
1. Variable selection
2. Rule extraction
3. Two-stage models
A first way to get more insight into the functioning of a neural
network is by doing variable selection. As before, the aim here
is to select those variables that actively contribute to the neural network output. In linear and logistic regression, the variable importance
was evaluated by inspecting the p-values. Unfortunately, in neural networks this is not that easy, as no p-values are readily available. One easy
and attractive way to do it is by visualizing the weights in a Hinton
diagram. A Hinton diagram visualizes the weights between the inputs
and the hidden neurons as squares, whereby the size of the square is
proportional to the size of the weight and the color of the square represents the sign of the weight (e.g., black colors represent a negative
weight and white colors a positive weight). Clearly, when all weights
connecting a variable to the hidden neurons are close to zero, it does
not contribute very actively to the neural network’s computations, and
one may consider leaving it out. Figure 4.21 shows an example of a
Hinton diagram for a neural network with four hidden neurons and
five variables. It can be clearly seen that the income variable has a
small negative and positive weight when compared to the other variables and can thus be considered for removal from the network. A very
straightforward variable selection procedure is:
1. Inspect the Hinton diagram and remove the variable whose
weights are closest to zero.


Figure 4.21 Example Hinton Diagram (hidden neurons 1–4 versus the variables Age, Income, Claim amount, Time since previous claim, and Accident severity)

2. Reestimate the neural network with the variable removed. To
speed up the convergence, it could be beneficial to start from
the previous weights.
3. Continue with step 1 until a stopping criterion is met. The stopping criterion could be a decrease of predictive performance or
a fixed number of steps.
Another way to do variable selection is by using the following backward variable selection procedure:
1. Build a neural network with all N variables.
2. Remove each variable in turn and reestimate the network. This
will give N networks each having N – 1 variables.
3. Remove the variable whose absence gives the best performing
network (e.g., in terms of misclassification error, mean squared
error).
4. Repeat this procedure until the performance decreases significantly.
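A minimal sketch of this backward procedure, assuming a scikit-learn style classifier and AUC as the performance measure; the stopping threshold min_drop is an illustrative assumption:

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import roc_auc_score

def backward_selection(model, X_tr, y_tr, X_val, y_val, min_drop=0.005):
    # Columns of X_* are the candidate variables; model is any classifier
    # with fit/predict_proba (e.g., an MLPClassifier).
    def auc(cols):
        net = clone(model).fit(X_tr[:, cols], y_tr)
        return roc_auc_score(y_val, net.predict_proba(X_val[:, cols])[:, 1])

    cols = list(range(X_tr.shape[1]))
    base = auc(cols)
    while len(cols) > 1:
        # Step 2: remove each variable in turn and reestimate the network
        scores = {c: auc([k for k in cols if k != c]) for c in cols}
        drop = max(scores, key=scores.get)   # step 3: best network w/o one var
        if base - scores[drop] > min_drop:   # step 4: stop when AUC suffers
            break
        cols.remove(drop)
        base = scores[drop]
    return cols
```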


Figure 4.22 Backward Variable Selection

When plotting the performance against the number of variables, a
pattern as depicted in Figure 4.22 will likely be obtained. Initially, the
performance will stagnate, or may even increase somewhat. When
important variables are being removed, the performance will start
decreasing. The optimal number of variables can then be situated
around the elbow region of the plot and can be decided in combination with a business expert. Sampling can be used to make the
procedure less resource intensive and more efficient. Note that this
performance-driven way of variable selection can easily be adopted
with other analytical techniques such as linear or logistic regression
or support vector machines (see next section).
Although variable selection allows users to see which variables are
important to the neural network and which ones are not, it does not
offer a clear insight into its internal workings. The relationship between
the inputs and the output remains nonlinear and complex. A first way
to get more transparency is by performing rule extraction, as will be
discussed next.
The purpose of rule extraction is to extract if-then classification
rules, mimicking the behavior of the neural network (Baesens 2003;
Baesens et al. 2003; Setiono et al. 2009). Two important approaches
here are decompositional and pedagogical techniques. Decompositional rule extraction approaches decompose the network’s internal
workings by inspecting weights and/or activation values. A typical
approach here could be (Lu et al. 1995; Setiono et al. 2011):
1. Train a neural network and do variable selection to make it as
concise as possible.
2. Categorize the hidden unit activation values by using clustering.


3. Extract rules that describe the network output in terms of the
categorized hidden unit activation values.
4. Extract rules that describe the categorized hidden unit activation
values in terms of the network inputs.
5. Merge the rules obtained in steps 3 and 4 to directly relate the inputs to the outputs.
This is illustrated in Figure 4.23.

Figure 4.23 Decompositional Approach for Neural Network Rule Extraction
Pedagogical rule extraction techniques consider the neural network as a black box and use the neural network predictions as input
to a white-box analytical technique such as decision trees (Craven and
Shavlik 1996). This is illustrated in Figure 4.24.

Figure 4.24 Pedagogical Approach for Rule Extraction
In this approach, the learning data set can be further augmented
with artificial data, which is then labeled (e.g., classified or predicted)
by the neural network, so as to further increase the number of observations to make the splitting decisions when building the decision tree.
Note that since the pedagogical approach does not make use of the
parameters or internal model representation, it can essentially be used




with any underlying algorithm, such as regression techniques, or SVMs
(see later).
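A minimal sketch of the pedagogical approach, assuming scikit-learn and simulated data; the model sizes, depths, and data set are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# The neural network is the black box whose behavior we want to mimic
net = MLPClassifier(hidden_layer_sizes=(6,), max_iter=2000,
                    random_state=0).fit(X, y)
nn_labels = net.predict(X)              # network predictions become the target

# A white-box decision tree is trained on the network's predictions
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, nn_labels)

fidelity = (tree.predict(X) == nn_labels).mean()  # agreement with the network
print(export_text(tree), fidelity)
```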
When using either decompositional or pedagogical rule extraction
approaches, the rule sets should be evaluated in terms of their accuracy,
conciseness (e.g., number of rules, number of conditions per rule), and
fidelity. The latter measures to what extent the extracted rule set succeeds in mimicking the neural network and is calculated as follows:
                          Neural Network Classification
Rule Set Classification   No Fraud      Fraud
No Fraud                  a             b
Fraud                     c             d

$$\text{Fidelity} = (a + d)/(a + b + c + d).$$
It is also important to always benchmark the extracted rules/trees
with a tree built directly on the original data to see the benefit of going
through the neural network.
Another approach to make neural networks more interpretable is
by using a two-stage model setup (Van Gestel et al. 2005, 2006).

Figure 4.25 Two-Stage Models

The

idea here is to estimate an easy-to-understand model first (e.g., linear
regression, logistic regression). This will give us the interpretability part.
In a second stage, a neural network is used to predict the errors made
by the simple model using the same set of predictors. This will give us
the additional performance benefit of using a nonlinear model. Both
models are then combined in an additive way, for example as follows:
◾ Target = Linear regression (X1, X2, … XN) + Neural network (X1, X2, … XN)

◾ Score = Logistic regression (X1, X2, … XN) + Neural network (X1, X2, … XN)

This setup provides an ideal balance between model interpretability (which comes from the first part) and model performance (which
comes from the second part). This is illustrated in Figure 4.25.
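A minimal sketch of such a two-stage setup, assuming scikit-learn estimators; the hidden layer size and other settings are illustrative assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPRegressor

def two_stage_fit(X, y):
    # Stage 1: interpretable logistic regression
    lr = LogisticRegression(max_iter=1000).fit(X, y)
    # Stage 2: a neural network predicts the stage-1 errors
    resid = y - lr.predict_proba(X)[:, 1]
    nn = MLPRegressor(hidden_layer_sizes=(6,), max_iter=2000,
                      random_state=0).fit(X, resid)
    return lr, nn

def two_stage_score(lr, nn, X_new):
    # Final score = interpretable score + nonlinear correction
    return lr.predict_proba(X_new)[:, 1] + nn.predict(X_new)
```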

SUPPORT VECTOR MACHINES
Linear Programming
Two key shortcomings of neural networks are the fact that the
objective function is nonconvex (and hence may have multiple


local minima) and the effort that is needed to tune the number of
hidden neurons. Support vector machines (SVMs) deal with both of
these issues (Cristianini and Taylor 2000; Schölkopf and Smola 2001;
Vapnik 1995).
The origins of classification SVMs date back to the early days of linear programming (Mangasarian 1965). Consider, for example, the following linear program (LP) for classification in a fraud setting:

$$\min\; e_1 + e_2 + \dots + e_{n_{nf} + n_f}$$

subject to

$$w_1 x_{i1} + w_2 x_{i2} + \dots + w_N x_{iN} \geq c - e_i, \quad 1 \leq i \leq n_{nf},$$
$$w_1 x_{i1} + w_2 x_{i2} + \dots + w_N x_{iN} \leq c + e_i, \quad n_{nf} + 1 \leq i \leq n_{nf} + n_f,$$
$$e_i \geq 0,$$
with $x_{ij}$ the value of variable j for observation i, and $n_{nf}$ and $n_f$ the number of no-frauds and frauds, respectively. The LP assigns the no-frauds a score above the cut-off value c, and the frauds a score below c. The error variables $e_i$ are needed to be able to solve the program, since perfect separation will typically not be possible. Linear programming has been very popular in the early days of credit scoring. One of its key benefits is that it is easy to include domain or business knowledge by adding extra constraints to the model. Suppose prior business experience indicates that age (variable 1) is more important than income (variable 2). This can be easily enforced by adding the constraint $w_1 \geq w_2$ to the linear program.
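A minimal sketch of this linear program, using scipy.optimize.linprog on simulated data. Note that, to avoid the trivial all-zero solution, the sketch imposes a unit gap around the cut-off (scores ≥ c + 1 for no-frauds and ≤ c − 1 for frauds), which is an assumption not spelled out in the formulation above:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
X_nf = rng.normal(loc=+1.0, size=(40, 2))    # simulated no-fraud cases
X_f = rng.normal(loc=-1.0, size=(20, 2))     # simulated fraud cases
n_nf, n_f, N = len(X_nf), len(X_f), X_nf.shape[1]
n = n_nf + n_f

# Decision variables: [w_1..w_N, c, e_1..e_n]; minimize the sum of errors
obj = np.concatenate([np.zeros(N + 1), np.ones(n)])

A = np.zeros((n, N + 1 + n))
b = -np.ones(n)                              # the unit gap avoids w = 0
A[:n_nf, :N] = -X_nf                         # -w'x_i + c - e_i <= -1
A[:n_nf, N] = 1.0
A[n_nf:, :N] = X_f                           #  w'x_i - c - e_i <= -1
A[n_nf:, N] = -1.0
A[np.arange(n), N + 1 + np.arange(n)] = -1.0

bounds = [(None, None)] * (N + 1) + [(0, None)] * n   # e_i >= 0; w, c free
res = linprog(obj, A_ub=A, b_ub=b, bounds=bounds)
w, c = res.x[:N], res.x[N]
print("weights:", w, "cut-off:", c)
# A domain constraint such as w_1 >= w_2 would be one extra row: w_2 - w_1 <= 0
```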
A key problem with linear programming is that it can estimate
multiple optimal decision boundaries as illustrated in Figure 4.26 for a
perfectly linearly separable case, where class 1 represents the fraudsters
and class 2 the non-fraudsters.

The Linear Separable Case
SVMs add an extra objective to the analysis. Consider the situation
depicted in Figure 4.27, with two variables $x_1$ (e.g., age) and $x_2$ (e.g., income).

Figure 4.26 Multiple Separating Hyperplanes

Figure 4.27 SVM Classifier for the Perfectly Linearly Separable Case ($H_1: w^T x + b = +1$; $H_0: w^T x + b = 0$; $H_2: w^T x + b = -1$)

It has two hyperplanes sitting at the edges of both classes, and a hyperplane in between that will serve as the classification boundary. The perpendicular distance from the first hyperplane $H_1$ to the origin equals $|b - 1|/\|w\|$, whereby $\|w\|$ represents the Euclidean norm of w, calculated as $\|w\| = \sqrt{w_1^2 + w_2^2}$. Likewise, the perpendicular distance from $H_2$ to the origin equals $|b + 1|/\|w\|$.


Hence, the margin between both hyperplanes equals 2∕||w||. SVMs
will now aim at maximizing this margin to pull both classes as far
apart as possible. Maximizing the margin is similar to minimizing
∑
2
||w||, or minimizing 12 N
i=1 wi . In case of perfect linear separation, the
SVM classifier then becomes as follows.
Consider a training set $\{x_k, y_k\}_{k=1}^{n}$ with $x_k \in \mathbb{R}^N$ and $y_k \in \{-1, +1\}$.
The goods (e.g., class +1) should be above hyperplane H1, and the
bads (e.g., class −1) below hyperplane H2, which gives:

$$w^T x_k + b \geq 1, \text{ if } y_k = +1$$
$$w^T x_k + b \leq -1, \text{ if } y_k = -1$$

Both can be combined as follows:

$$y_k (w^T x_k + b) \geq 1.$$

The optimization problem then becomes:

$$\text{Minimize } \frac{1}{2} \sum_{i=1}^{N} w_i^2$$

subject to $y_k (w^T x_k + b) \geq 1, \; k = 1 \dots n$.
This quadratic programming (QP) problem can now be solved
using Lagrangian optimization (Cristianini and Taylor 2000; Schölkopf
and Smola 2001; Vapnik 1995). Note that the optimization problem
has a quadratic cost function with linear constraints, giving a
convex optimization problem with no local minima and only one
global minimum. Training points that lie on one of the hyperplanes
H1 or H2 are called support vectors and are essential to the classification. The classification hyperplane itself is H0. For new
observations, one checks whether they are situated above H0 (prediction +1) or below it (prediction −1). This can be easily accomplished using the sign operator: $y(x) = \operatorname{sign}(w^T x + b)$. Remember, sign(x) is +1 if x ≥ 0, and −1
otherwise.
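For readers who want to experiment, the following is a minimal Python sketch of this linearly separable case using scikit-learn's SVC; the tiny two-variable data set is an illustrative assumption, and a very large C is used to approximate the hard-margin classifier.

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],    # class +1
              [0.0, 0.0], [1.0, 0.5], [0.5, -0.5]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

svm = SVC(kernel="linear", C=1e6).fit(X, y)          # near hard-margin SVM
w, b = svm.coef_[0], svm.intercept_[0]
print("margin 2/||w||:", 2 / np.linalg.norm(w))
print("support vectors:", svm.support_vectors_)      # points on H1 or H2
print("predictions:", np.sign(X @ w + b))            # y(x) = sign(w'x + b)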


The Linear Nonseparable Case
The SVM classifier discussed thus far assumes that perfect separation is possible, which will, of course, rarely be the case for real-life data sets. In
case of overlapping class distributions (as illustrated in Figure 4.28),
the SVM classifier can be extended with error terms ei as follows:
$$\text{Minimize } \frac{1}{2} \sum_{i=1}^{N} w_i^2 + C \sum_{i=1}^{n} e_i$$

subject to $y_k (w^T x_k + b) \geq 1 - e_k, \; k = 1 \dots n$,
$$e_k \geq 0.$$
The error variables ek are needed to allow for misclassifications. The
C hyperparameter in the objective function balances the importance of
maximizing the margin versus minimizing the error on the data. A high
(low) value of C implies a higher (lower) risk of overfitting. Note the
similarity with the idea of weight regularization discussed in the section
on neural networks: there, too, the objective function consisted of
an error term and the sum of the squared weights. We will discuss
procedures to determine the optimal value of C later in this section.
Just as before, the problem is a quadratic programming (QP) problem,
which can be solved using Lagrangian optimization.
Figure 4.28 SVM Classifier in Case of Overlapping Distributions


The Nonlinear SVM Classifier
Finally, the nonlinear SVM classifier will first map the input data to
a higher dimensional feature space using some mapping 𝜑(x). This is
illustrated in Figure 4.29.
The SVM problem formulation now becomes:
$$\text{Minimize } \frac{1}{2} \sum_{i=1}^{N} w_i^2 + C \sum_{i=1}^{n} e_i$$

subject to $y_k (w^T \varphi(x_k) + b) \geq 1 - e_k, \; k = 1 \dots n$,
$$e_k \geq 0.$$
When working out the Lagrangian optimization (Cristianini and
Taylor 2000; Schölkopf and Smola 2001; Vapnik 1995), it turns
out that the mapping $\varphi(x)$ is never explicitly needed, but only implicitly by means of the kernel function K defined as follows: $K(x_k, x_l) = \varphi(x_k)^T \varphi(x_l)$. Hence, the feature space does not need to be explicitly
specified. The nonlinear SVM classifier then becomes:
$$y(x) = \operatorname{sign}\left[\sum_{k=1}^{n} \alpha_k y_k K(x, x_k) + b\right],$$

where $\alpha_k$ are the Lagrangian multipliers stemming from the optimization. Support vectors will have nonzero $\alpha_k$ since they are needed to
construct the classification hyperplane. All other observations have
zero $\alpha_k$, which is often referred to as the sparseness property of SVMs.

Figure 4.29 The Feature Space Mapping ($K(x_1, x_2) = \varphi(x_1)^T \varphi(x_2)$; separating hyperplane $w^T \varphi(x_i) + b = 0$ in the feature space)
Different types of kernel functions can be used. The most popular are:
◾ Linear kernel: $K(x, x_k) = x_k^T x$
◾ Polynomial kernel: $K(x, x_k) = (1 + x_k^T x)^d$
◾ Radial basis function (RBF) kernel: $K(x, x_k) = \exp\{-\|x - x_k\|^2 / \sigma^2\}$

Empirical evidence has shown that the RBF kernel usually performs best, but note that it includes an extra parameter 𝜎 to be tuned
(Van Gestel et al. 2004).
A key question to answer when building SVM classifiers is the tuning of the hyperparameters. For example, suppose one has an RBF
SVM, which has two hyperparameters C and 𝜎. Both can be tuned
using the following procedure (Van Gestel et al. 2004):
1. Partition the data into 40%∕30%∕30% training, validation and
test data.
2. Build an RBF SVM classifier for each (𝜎, C) combination from
the sets 𝜎 ∈ {0.5, 5, 10, 15, 25, 50, 100, 250, 500} and C ∈ {0.01,
0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500}.
3. Choose the (𝜎, C) combination with the best validation set performance.
4. Build an RBF SVM classifier with the optimal (𝜎, C) combination
on the combined training + validation data set.
5. Calculate the performance of the estimated RBF SVM classifier
on the test set.
In case of linear or polynomial kernels, a similar procedure can be
adopted.
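A minimal Python sketch of this tuning procedure, assuming scikit-learn, is given below. The synthetic data set is an illustrative assumption; note that scikit-learn parameterizes the RBF kernel as exp(−γ‖x − xk‖²), so γ = 1∕σ².

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95],
                           random_state=1)

# Step 1: 40%/30%/30% training / validation / test split.
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, train_size=0.4,
                                              stratify=y, random_state=1)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5,
                                            stratify=y_rest, random_state=1)

sigmas = [0.5, 5, 10, 15, 25, 50, 100, 250, 500]
Cs = [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500]

# Steps 2-3: fit on the training set, pick (sigma, C) on the validation set.
best, best_auc = None, -np.inf
for sigma in sigmas:
    for C in Cs:
        clf = SVC(kernel="rbf", gamma=1.0 / sigma**2, C=C).fit(X_tr, y_tr)
        auc = roc_auc_score(y_val, clf.decision_function(X_val))
        if auc > best_auc:
            best, best_auc = (sigma, C), auc

# Steps 4-5: refit on training + validation, report test set performance.
sigma, C = best
clf = SVC(kernel="rbf", gamma=1.0 / sigma**2, C=C).fit(
    np.vstack([X_tr, X_val]), np.concatenate([y_tr, y_val]))
print("test AUC:", roc_auc_score(y_te, clf.decision_function(X_te)))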

SVMs for Regression
SVMs can also be used for regression applications with a continuous
target. The idea here is to find a function f(x) that has at most 𝜀 deviation from the actual targets yi for all the training data, and is at the same
time as flat as possible. Hence, the loss function tolerates errors smaller
than 𝜀 and penalizes errors larger than 𝜀. This is visualized in Figure 4.30.

Figure 4.30 SVMs for Regression

Consider a training set $\{x_k, y_k\}_{k=1}^{n}$ with $x_k \in \mathbb{R}^N$ and $y_k \in \mathbb{R}$.
The SVM formulation then becomes:

$$\text{Minimize } \frac{1}{2} \sum_{i=1}^{N} w_i^2 + C \sum_{k=1}^{n} (\varepsilon_k + \varepsilon_k^*)$$

subject to
$$y_k - w^T \varphi(x_k) - b \leq \varepsilon + \varepsilon_k$$
$$w^T \varphi(x_k) + b - y_k \leq \varepsilon + \varepsilon_k^*$$
$$\varepsilon, \varepsilon_k, \varepsilon_k^* \geq 0.$$
The hyperparameter C determines the trade-off between the flatness of f and the amount to which deviations larger than 𝜀 are tolerated.
Note the feature space mapping 𝜑(x), which is also used here. Using
Lagrangian optimization, the resulting nonlinear regression function
becomes:

$$f(x) = \sum_{k=1}^{n} (\alpha_k - \alpha_k^*) K(x_k, x) + b,$$

where 𝛼k and 𝛼k∗ represent the Lagrangian multipliers. The hyperparameters C and 𝜀 can be tuned using a procedure similar to the one
outlined for classification SVMs.
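As a small illustration, the following Python sketch fits an SVM regression with scikit-learn's SVR; the noisy sine data are an illustrative assumption.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

# C trades off flatness against deviations larger than epsilon.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("fitted values:", svr.predict(X[:5]))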


Opening the SVM Black Box
Similar to neural networks, SVMs have a universal approximation
property. As an extra benefit, they do not require tuning of the number of hidden neurons and are characterized by convex optimization.
However, they are also complex, black-box models, which is problematic in settings where
interpretability is important. Just as with neural networks, procedures
can be used to provide more transparency by opening up the SVM
black box.
Variable selection can be performed using the backward variable
selection procedure discussed in the section on neural networks. This
will essentially reduce the number of variables, but not provide any additional
insight into the workings of the SVM. Rule extraction approaches can
then be used in a next step. In order to apply decompositional rule
extraction approaches, the SVM can be represented as a neural network as depicted in Figure 4.31.
The hidden layer uses kernel activation functions, whereas the output layer uses a linear activation function. Note that the number of
hidden neurons now corresponds to the number of support vectors
and follows automatically from the optimization. This is in strong contrast to neural networks where the number of hidden neurons needs
to be tuned manually. The decompositional approach can then proceed by first extracting If-Then rules relating the output to the hidden

unit activation values. In a next step, rules are extracted relating the
hidden unit activation values to the inputs, followed by the merger of
both rule sets.

Figure 4.31 Representing an SVM Classifier as a Neural Network
Since a pedagogical approach considers the underlying model as a
black box, it can be easily combined with SVMs. Just as in the neural
network case, the SVM is first used to construct a data set with SVM
predictions for each of the observations. This data set is then given
to a decision tree algorithm to build a decision tree. Also here, additional training set observations can be generated to facilitate the tree
construction process.
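A minimal Python sketch of this pedagogical approach, under the assumption that scikit-learn is available and with illustrative synthetic data, could look as follows.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=5, random_state=1)
svm = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)    # the black-box model

# Optionally generate additional artificial observations near the data.
X_extra = X + np.random.default_rng(1).normal(scale=0.1, size=X.shape)
X_all = np.vstack([X, X_extra])

# Relabel all observations with the SVM predictions and mimic the SVM
# with an interpretable decision tree.
y_svm = svm.predict(X_all)
tree = DecisionTreeClassifier(max_depth=3).fit(X_all, y_svm)
print(export_text(tree))
print("fidelity to the SVM:", tree.score(X, svm.predict(X)))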
Finally, two-stage models can also be used to provide more comprehensibility. Remember, in this approach a simple model (e.g., linear
or logistic regression) is estimated first, followed by an SVM to correct
the errors of the former.

ENSEMBLE METHODS
Ensemble methods aim at estimating multiple analytical models
instead of using only one. The idea here is that multiple models can
cover different parts of the data input space and as such complement
each other’s deficiencies. In order to successfully accomplish this, the
analytical technique needs to be sensitive to changes in the underlying
data. This is especially the case for decision trees, which is why they
are commonly used in ensemble methods. In what follows, we will
discuss bagging, boosting, and random forests.

Bagging
Bagging (Bootstrap aggregating) starts by taking B bootstraps from the
underlying sample (Breiman 1996). Note that a bootstrap is a sample
with replacement (see section on evaluating predictive models). The
idea is then to build a classifier (e.g., decision tree) for every bootstrap.
For classification, a new observation will be classified by letting all B
classifiers vote, using, for example, a majority voting scheme whereby
ties are resolved arbitrarily. For regression, the prediction is the average
of the outcomes of the B models (e.g., regression trees). Note that here
a standard error, and thus a confidence interval, can also be calculated.


The number of bootstraps B can either be fixed (e.g., 30) or tuned via
an independent validation data set.
The key element for bagging to be successful is the instability of
the analytical technique. If perturbing the data set by means of the
bootstrapping procedure can alter the model constructed, then bagging
will improve the accuracy (Breiman 1996). However, for models that
are robust with respect to the underlying data set, it will not give much
added value.
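A minimal Python sketch of bagging with scikit-learn follows; the data set and the choice of B = 30 bootstraps are illustrative assumptions (BaggingClassifier uses a decision tree as its default base classifier).

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# B = 30 bootstraps, one (default) decision tree per bootstrap; new
# observations are classified by majority vote over the 30 trees.
bag = BaggingClassifier(n_estimators=30, random_state=1).fit(X_tr, y_tr)
print("test accuracy:", bag.score(X_te, y_te))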

Boosting
Boosting works by estimating multiple models using a weighted sample
of the data (Freund and Schapire 1997, 1999). Starting from uniform
weights, boosting will iteratively reweight the data according to the
classification error whereby misclassified cases get higher weights. The
idea here is that difficult observations should get more attention. Either
the analytical technique can directly work with weighted observations,
or if not, we can just sample a new data set according to the weight
distribution. The final ensemble model is then a weighted combination
of all the individual models. A popular implementation of this is the
Adaptive boosting/Adaboost procedure, which works as follows:
1. Given the following observations: (x1, y1), … , (xn, yn) where xi is the attribute vector of observation i and yi ∈ {1, −1}
2. Initialize the weights as follows: W1(i) = 1∕n, i = 1, … , n
3. For t = 1 … T:
   a. Train a weak classifier (e.g., decision tree) using the weights Wt
   b. Get weak classifier Ct with classification error 𝜀t
   c. Choose $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right)$
   d. Update the weights as follows:
      i. $W_{t+1}(i) = \frac{W_t(i)}{Z_t} e^{-\alpha_t}$ if $C_t(x_i) = y_i$
      ii. $W_{t+1}(i) = \frac{W_t(i)}{Z_t} e^{\alpha_t}$ if $C_t(x_i) \neq y_i$
4. Output the final ensemble model: $E(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t C_t(x)\right)$


Note that in this procedure, T represents the number of boosting
runs, 𝛼t measures the importance that is assigned to classifier Ct and
increases as 𝜀t gets smaller, Zt is a normalization factor needed to make
sure that the weights in step t make up a distribution and as such
sum to 1, and Ct (x) represents the classification of the classifier built
in step t for observation x. Multiple loss functions may be used to calculate the error 𝜀t although the misclassification rate is undoubtedly
the most popular. In substep i of step d, it can be seen that correctly
classified observations get lower weights, whereas substep ii assigns
higher weights to the incorrectly classified cases. Again, the number
of boosting runs T can be fixed or tuned using an independent validation set. Note that various variants of this Adaboost procedure exist,
such as Adaboost.M1, Adaboost.M2 (both for multiclass classification),
and Adaboost.R1, Adaboost.R2 (both for regression). See Freund and
Schapire (1997, 1999) for more details. A key advantage of boosting
is that it is easy to implement. A potential drawback is that there
may be a risk of overfitting to the hard (potentially noisy) examples in
the data, which will get higher weights as the algorithm proceeds. This
is especially relevant in a fraud detection setting because, as mentioned
earlier, the target labels in a fraud setting are typically quite noisy.
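The steps above can be coded directly. Below is a minimal NumPy sketch using decision stumps from scikit-learn as weak classifiers; the data set, T = 25 boosting runs, and the small numerical guard on 𝜀t are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, random_state=1)
y = np.where(y01 == 1, 1, -1)              # labels in {+1, -1}
n, T = len(y), 25
W = np.full(n, 1.0 / n)                    # step 2: uniform starting weights
classifiers, alphas = [], []

for t in range(T):                         # step 3
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=W)
    pred = stump.predict(X)
    eps = np.clip(W[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error
    alpha = 0.5 * np.log((1 - eps) / eps)  # step c
    W = W * np.exp(-alpha * y * pred)      # step d: misclassified cases up
    W = W / W.sum()                        # Z_t: renormalize to a distribution
    classifiers.append(stump)
    alphas.append(alpha)

def ensemble_predict(X_new):               # step 4: weighted vote
    scores = sum(a * c.predict(X_new) for a, c in zip(alphas, classifiers))
    return np.sign(scores)

print("training accuracy:", (ensemble_predict(X) == y).mean())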

Random Forests
The technique of random forests was first introduced by Breiman
(2001). It creates a forest of decision trees as follows:
1. Given a data set with n observations and N inputs.
2. Let m be a constant chosen beforehand.
3. For t = 1, … , T
a. Take a bootstrap sample with n observations.
b. Build a decision tree whereby for each node of the tree,
randomly choose m variables on which to base the splitting
decision.
c. Split on the best of this subset.
d. Fully grow each tree without pruning.


Common choices for m are 1, 2, or floor(log2(N) + 1), with the latter being recommended. Random forests can be used with both classification trees
and regression trees. Key in this approach is the dissimilarity amongst
the base classifiers (i.e., decision trees), which is obtained by adopting
a bootstrapping procedure to select the training samples of the individual base classifiers, the selection of a random subset of attributes at
each node, and the strength of the individual base models. As such the
diversity of the base classifiers creates an ensemble that is superior in
performance compared to the single models.
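A minimal Python sketch with scikit-learn follows; the data set is an illustrative assumption, and max_features="log2" approximates the recommended m = floor(log2(N) + 1).

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95],
                           random_state=1)

# T = 500 fully grown trees, each built on a bootstrap sample, with a
# random subset of candidate variables considered at every split.
rf = RandomForestClassifier(n_estimators=500, max_features="log2",
                            random_state=1).fit(X, y)
print(rf.predict_proba(X[:5]))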
More recently, an alternative to random forests was proposed: rotation forests. This ensemble technique takes the idea of random forests
one step further. It combines the idea of pooling a large number of
decision trees built on a subset of the attributes and data, with the
application of principal component analysis prior to decision tree building, explaining its name. Rotating the axes prior to model building was
found to enhance base classifier accuracy at the expense of losing the
ability of ranking individual attributes by their importance (Rodriguez
et al. 2006).

Evaluating Ensemble Methods
Various benchmarking studies have shown that random forests can
achieve excellent predictive performance. Actually, they generally
rank amongst the best performing models across a wide variety of
prediction tasks (Dejaeger et al. 2012). They are also perfectly capable
of dealing with data sets having only a few observations, but with lots
of variables. They are highly recommended when high performing
analytical methods are needed for fraud detection. However, the price
that is paid for this is that they are essentially black-box models. Due
to the multitude of decision trees that make up the ensemble, it is
very hard to see how the final classification is made. One way to shed
some light on the internal workings of an ensemble is by calculating
the variable importance. A popular procedure to do so is as follows:
1. Permute the values of the variable under consideration (e.g., Xj )
on the validation or test set.


2. For each tree, calculate the difference between the error on the data with Xj permuted and the error on the original, unpermuted data, as follows:

$$VI(X_j) = \frac{1}{ntree} \sum_t \left( error_t(\tilde{D}_j) - error_t(D) \right),$$

whereby ntree represents the number of trees in the ensemble,
D the original data, and D̃ j the data with variable Xj permuted.
In a regression setting, the error can be the mean squared error
(MSE), whereas in a classification setting, the error can be the
misclassification rate.
3. Order all variables according to their VI value. The variable with
the highest VI value is the most important.
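A minimal Python sketch of this variable importance procedure follows; for simplicity it permutes on the test set and computes the error at the level of the whole ensemble rather than averaging tree by tree, and the data set is an illustrative assumption.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)

base_error = 1 - rf.score(X_te, y_te)          # misclassification rate
rng = np.random.default_rng(1)
vi = []
for j in range(X_te.shape[1]):
    Xp = X_te.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])       # permute variable X_j only
    vi.append((1 - rf.score(Xp, y_te)) - base_error)

print("most important variable:", int(np.argmax(vi)))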

MULTICLASS CLASSIFICATION TECHNIQUES
In the introduction of this chapter, we already discussed the difficulty of
appropriately determining the target label in a fraud detection setting.
One way to deal with this is by creating more than two target values, e.g., as follows: clear fraud, doubt case, no fraud. These values are
nominal, implying that there is no meaningful order between them.
As an alternative, the target values can also be ordinal: severe fraud,
medium fraud, light fraud, no fraud. All of the classification techniques
discussed earlier in this chapter can be easily extended to a multiclass
setting whereby more than two target values or classes are present.

Multiclass Logistic Regression
When estimating a multiclass logistic regression model, one first needs
to know whether the target variable is nominal or ordinal. For nominal
target variables, one of the target classes (say class K) will be chosen as
the base class as follows (Allison 2001):
$$\frac{P(Y = 1|X_1, \dots, X_N)}{P(Y = K|X_1, \dots, X_N)} = e^{\left(\beta_0^1 + \beta_1^1 X_1 + \beta_2^1 X_2 + \dots + \beta_N^1 X_N\right)}$$

$$\frac{P(Y = 2|X_1, \dots, X_N)}{P(Y = K|X_1, \dots, X_N)} = e^{\left(\beta_0^2 + \beta_1^2 X_1 + \beta_2^2 X_2 + \dots + \beta_N^2 X_N\right)}$$

…

$$\frac{P(Y = K-1|X_1, \dots, X_N)}{P(Y = K|X_1, \dots, X_N)} = e^{\left(\beta_0^{K-1} + \beta_1^{K-1} X_1 + \beta_2^{K-1} X_2 + \dots + \beta_N^{K-1} X_N\right)}$$
Using the fact that all probabilities must sum to one, one can then
obtain the following:
$$P(Y = 1|X_1, \dots, X_N) = \frac{e^{\left(\beta_0^1 + \beta_1^1 X_1 + \dots + \beta_N^1 X_N\right)}}{1 + \sum_{k=1}^{K-1} e^{\left(\beta_0^k + \beta_1^k X_1 + \dots + \beta_N^k X_N\right)}}$$

$$P(Y = 2|X_1, \dots, X_N) = \frac{e^{\left(\beta_0^2 + \beta_1^2 X_1 + \dots + \beta_N^2 X_N\right)}}{1 + \sum_{k=1}^{K-1} e^{\left(\beta_0^k + \beta_1^k X_1 + \dots + \beta_N^k X_N\right)}}$$

…

$$P(Y = K|X_1, \dots, X_N) = \frac{1}{1 + \sum_{k=1}^{K-1} e^{\left(\beta_0^k + \beta_1^k X_1 + \dots + \beta_N^k X_N\right)}}$$

The 𝛽 parameters are then usually estimated using maximum a posteriori estimation, which is an extension of maximum likelihood estimation. As with binary logistic regression, the procedure comes with
standard errors, confidence intervals, and p-values.
In case of ordinal targets, one could estimate a cumulative logistic
regression as follows (Allison 2001):
$$P(Y \leq 1) = \frac{1}{1 + e^{-\theta_1 + \beta_1 X_1 + \dots + \beta_N X_N}}$$

$$P(Y \leq 2) = \frac{1}{1 + e^{-\theta_2 + \beta_1 X_1 + \dots + \beta_N X_N}}$$

…

$$P(Y \leq K-1) = \frac{1}{1 + e^{-\theta_{K-1} + \beta_1 X_1 + \dots + \beta_N X_N}}$$

or,
$$\frac{P(Y \leq 1)}{1 - P(Y \leq 1)} = e^{\theta_1 - \beta_1 X_1 - \dots - \beta_N X_N}$$

$$\frac{P(Y \leq 2)}{1 - P(Y \leq 2)} = e^{\theta_2 - \beta_1 X_1 - \dots - \beta_N X_N}$$

…

$$\frac{P(Y \leq K-1)}{1 - P(Y \leq K-1)} = e^{\theta_{K-1} - \beta_1 X_1 - \dots - \beta_N X_N}.$$
Note that since P(Y ≤ K) = 1, 𝜃K = +∞.
The individual probabilities can then be obtained as follows:
P(Y = 1) = P(Y ≤ 1)
P(Y = 2) = P(Y ≤ 2) − P(Y ≤ 1)
…
P(Y = K) = 1 − P(Y ≤ K − 1).
Also for this model, the 𝛽 parameters can be estimated using a maximum likelihood procedure.
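A minimal Python sketch of a multiclass (multinomial) logistic regression, assuming scikit-learn and an illustrative three-class data set, is given below.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# y: 0 = clear fraud, 1 = doubt case, 2 = no fraud (a nominal target)
X, y = make_classification(n_samples=1500, n_classes=3, n_informative=6,
                           random_state=1)

mlr = LogisticRegression(max_iter=1000).fit(X, y)  # multinomial fit
print(mlr.predict_proba(X[:3]))    # the K class probabilities sum to one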

Multiclass Decision Trees
Decision trees can be easily extended to a multiclass setting. For the
splitting decision, assuming K classes, the impurity criteria become:
$$\text{Entropy}(S) = -\sum_{k=1}^{K} p_k \log_2(p_k)$$

$$\text{Gini}(S) = \sum_{k=1}^{K} p_k (1 - p_k).$$

The stopping decision can be made in a similar way as for binary
target decision trees by using a training set for making the splitting
decision, and an independent validation data set on which the misclassification error rate is monitored. The assignment decision then looks
for the most prevalent class in each of the leaf nodes.

Multiclass Neural Networks
A straightforward option for training a multiclass neural network for K
classes, is to create K output neurons, one for each class. An observation is then assigned to the output neuron with the highest activation
value (winner-take-all learning). Another option is to use a softmax
activation function (Bishop 1995).


Multiclass Support Vector Machines
A common practice to estimate a multiclass support vector machine is
to map the multiclass classification problem to a set of binary classification problems. Two well-known schemes here are One-versus-One
and One-versus-All coding (Van Gestel et al. 2015).
For K classes, One-versus-One coding estimates K(K − 1)∕2 binary
SVM classifiers contrasting every possible pair of classes. Every classifier as such can cast a vote on the target class and the final classification
is then the result of a (weighted) voting procedure. Ties are resolved
arbitrarily. This is illustrated in Figure 4.32 whereby the aim is to classify the white triangle.
For K classes, One-versus-All coding estimates K binary SVM classifiers each time contrasting one particular class against all the other
ones. A classification decision can then be made by assigning a particular observation to the class for which one of the binary classifiers
assigns the highest posterior probability. Ties are less likely to occur
with this scheme. This is illustrated in Figure 4.33, whereby the aim
is to classify the white triangle. Note that for more than three classes,
One-versus-All coding estimates fewer classifiers than One-versus-One
coding. However, in One-versus-One coding the binary classifiers are
estimated on a reduced subset of the data (i.e., each time contrasting
only two classes), whereas in One-versus-All coding, all binary classifiers are estimated on the entire data set (i.e., each time contrasting a
class against all the rest).

Figure 4.32 One-Versus-One Coding for Multiclass Problems


Figure 4.33 One-Versus-All Coding for Multiclass Problems (three binary classifiers with posterior probabilities 0.92, 0.18, and 0.30; the class with the highest posterior wins)

Both One-versus-One and One-versus-All coding are meta
schemes that can be used with other base classifiers (e.g., neural
networks) as well.
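A minimal Python sketch of both coding schemes, assuming scikit-learn and an illustrative four-class data set, is given below.

from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1500, n_classes=4, n_informative=8,
                           random_state=1)

ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)    # K(K-1)/2 = 6 binary SVMs
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)   # K = 4 binary SVMs
print(len(ovo.estimators_), len(ovr.estimators_))  # 6, 4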

EVALUATING PREDICTIVE MODELS
Splitting Up the Data Set
When evaluating predictive models, two key decisions need to be
made. A first decision concerns the data set split up, which specifies
on what part of the data the performance will be measured. A second
decision concerns the performance metric. In what follows, we will
elaborate on both.
The decision how to split up the data set for performance measurement depends on its size. In case of large data sets (say, more
than 1,000 observations), the data can be split up into a training and
a test sample. The training sample (also called development or estimation sample) will be used to build the model whereas the test sample
(also called the hold out sample) will be used to calculate its performance (see Figure 4.34). A commonly applied split up is a 70 percent
training sample and a 30 percent test sample. There should be a strict
separation between training and test sample. No observation that was
used for model development can be used for independent testing.
Note that in case of decision trees or neural networks, the validation
sample is a separate sample, since it is actively being used during model
development (i.e., to make the stopping decision).

Figure 4.34 Training Versus Test Sample Set Up for Performance Estimation

A typical split-up in
this case is a 40 percent training sample, 30 percent validation sample, and 30 percent test sample. A stratified split-up ensures that
fraudsters and nonfraudsters occur in the same proportions in each of the various
samples.
In case of small data sets (say, less than 1,000 observations), special
schemes need to be adopted. A very popular scheme is cross-validation.
In cross-validation, the data is split into K folds (e.g., 5 or 10). An
analytical model is then trained on K – 1 training folds and tested
on the remaining validation fold. This is repeated for all possible validation folds resulting in K performance estimates, which can then
be averaged. Note that also a standard deviation and/or confidence
interval can be calculated if desired. Common choices for K are 5 and
10. In its most extreme case, cross-validation becomes leave-one-out
cross-validation, whereby every observation is left out in turn and a
model is estimated on the remaining n − 1 observations (with n the number of observations). This gives
n analytical models in total. In stratified cross-validation, special care is
taken to make sure the no fraud/fraud odds are the same in each fold
(see Figure 4.35).
A key question to answer when doing cross-validation is what
should be the final model that is being outputted from the procedure.
Since cross-validation gives multiple models, this is not an obvious
question. Of course, one could let all models collaborate in an ensemble setup by using a (weighted) voting procedure. A more pragmatic

Figure 4.35 Cross-Validation for Performance Measurement


answer would be to do leave-one-out cross-validation and pick one of
the models at random. Since the models differ up to one observation
only, they will be quite similar anyway. Alternatively, one may also
choose to build one final model on all observations but report the
performance coming out of the cross-validation procedure as the best
independent estimate.
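A minimal Python sketch of stratified five-fold cross-validation, assuming scikit-learn and an illustrative data set, is given below.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=600, weights=[0.9], random_state=1)

aucs = []
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, val_idx in kfold.split(X, y):   # same fraud odds in each fold
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    aucs.append(roc_auc_score(y[val_idx],
                              model.predict_proba(X[val_idx])[:, 1]))

# Average the K performance estimates; the final model is fit on all data.
print("CV AUC: %.3f +/- %.3f" % (np.mean(aucs), np.std(aucs)))
final_model = LogisticRegression(max_iter=1000).fit(X, y)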
For small samples, one may also adopt bootstrapping procedures
(Efron 1979). In bootstrapping, one takes samples with replacement
from a data set D (see Figure 4.36).
The probability that a customer is sampled equals 1/n, with n the
number of observations in the data set. Hence, the probability that a
customer is not sampled equals 1 – 1/n. Assuming a bootstrap with n
samples, the fraction of customers that is not sampled equals:
$$\left(1 - \frac{1}{n}\right)^n.$$
We then have:
$$\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = e^{-1} = 0.368,$$
where the approximation already works well for small values of n. So,
0.368 is the probability that a customer does not appear in the sample
and 0.632 the probability that a customer does appear. If we then take
the bootstrap sample as the training set, and the test set as all samples in D but not in the bootstrap, we can calculate the performance
as follows:
Error estimate = 0.368 Error (Training) + 0.632 Error (Test),
whereby obviously a higher weight is being put on the test set
performance.

Figure 4.36 Bootstrapping (e.g., from data set {C1, … , C5}: bootstrap 1 = {C3, C2, C5, C3, C2}; bootstrap 2 = {C1, C4, C2, C1, C2})
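A minimal Python sketch of a single bootstrap and the resulting error estimate follows; the data set and classifier are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)
n = len(y)
rng = np.random.default_rng(1)

idx = rng.integers(0, n, size=n)          # bootstrap: sample with replacement
oob = np.setdiff1d(np.arange(n), idx)     # observations not in the bootstrap

model = DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx])
err_train = 1 - model.score(X[idx], y[idx])
err_test = 1 - model.score(X[oob], y[oob])
print("bootstrap error estimate:", 0.368 * err_train + 0.632 * err_test)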


Performance Measures for Classification Models
Consider the following fraud detection example for a five-customer
data set. The first column in Table 4.4 depicts the fraud status, whereas
the second column depicts the fraud score as it comes from a logistic regression,
decision tree, neural network, and so on.
One can now map the scores to a predicted classification label by
assuming a default cut-off of 0.5, as shown in Figure 4.37. A confusion
matrix can then be calculated as shown in Table 4.5.

Table 4.4 Example Data Set for Performance Calculation

Customer   Fraud   Fraud Score
John       Yes     0.72
Sophie     No      0.56
David      Yes     0.44
Emma       No      0.18
Bob        No      0.36

Figure 4.37 Calculating Predictions Using a Cut-Off (cut-off = 0.50: John and Sophie are predicted as fraud; David, Emma, and Bob as no fraud)

Table 4.5 Confusion Matrix

                                 Actual Positive (Fraud)    Actual Negative (No Fraud)
Predicted Positive (Fraud)       True Positive (John)       False Positive (Sophie)
Predicted Negative (No Fraud)    False Negative (David)     True Negative (Emma, Bob)

Based on this matrix, one can now calculate the following performance measures:

◾ Classification accuracy = (TP + TN)∕(TP + FP + FN + TN) = 3∕5
◾ Classification error = (FP + FN)∕(TP + FP + FN + TN) = 2∕5
◾ Sensitivity = Recall = Hit rate = TP∕(TP + FN) = 1∕2
◾ Specificity = TN∕(FP + TN) = 2∕3
◾ Precision = TP∕(TP + FP) = 1∕2
◾ F-measure = 2 × (Precision × Recall)∕(Precision + Recall) = 1∕2

The classification accuracy is the percentage of correctly classified
observations. The classification error is the complement thereof and
also referred to as the misclassification rate. The sensitivity, recall or
hit rate measures how many of the fraudsters are correctly labeled
by the model as a fraudster. The specificity looks at how many of
the nonfraudsters are correctly labeled by the model as nonfraudster.
The precision indicates how many of the predicted fraudsters are actually
fraudsters.
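These measures can be verified with a few lines of Python; the sketch below uses the Table 4.4 data and scikit-learn for convenience.

from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 0, 1, 0, 0]                   # John, Sophie, David, Emma, Bob
scores = [0.72, 0.56, 0.44, 0.18, 0.36]
y_pred = [int(s >= 0.5) for s in scores]   # default cut-off of 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy:   ", (tp + tn) / (tp + fp + fn + tn))  # 3/5
print("sensitivity:", tp / (tp + fn))                   # 1/2
print("specificity:", tn / (fp + tn))                   # 2/3
print("precision:  ", tp / (tp + fp))                   # 1/2
print("F-measure:  ", f1_score(y_true, y_pred))         # 1/2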
Note that all these classification measures depend on the cut-off.
For example, for a cut-off of 0 (1), the classification accuracy becomes
40 percent (60 percent), the error 60 percent (40 percent), the sensitivity 100 percent (0), the specificity 0 (100 percent), the precision
40 percent (0) and the F-measure 0.57 (0). Given this dependence, it
would be nice to have a performance measure that is independent from
the cut-off. One could construct a table with the sensitivity, specificity,
and 1-specificity for various cut-offs as shown in Table 4.6.
The receiver operating characteristic (ROC) curve then plots the
sensitivity versus 1-specificity as illustrated in Figure 4.38 (Fawcett
2003).

Table 4.6 Table for ROC Analysis

Cut-off   Sensitivity   Specificity   1 − Specificity
0         1             0             1
0.01      …             …             …
0.02      …             …             …
…         …             …             …
0.99      …             …             …
1         0             1             0

Figure 4.38 The Receiver Operating Characteristic Curve (sensitivity versus 1 − specificity for Model A, Model B, and a random model)

Note that a perfect model detects all the fraudsters and nonfraudsters at the same time, which results in a sensitivity of 1 and a specificity of 1, and is thus represented by the upper-left corner. The closer the
curve approaches this point, the better the performance. In Figure 4.38,
model A has a better performance than model B. A problem, however,
arises if the curves intersect. In this case, one can calculate the area
under the ROC curve (AUC) as a performance metric. The AUC provides
a simple figure-of-merit for the performance of the constructed classifier. The higher the AUC, the better the performance. The AUC is always
bounded between 0 and 1 and can be interpreted as a probability. In
fact, it represents the probability that a randomly chosen fraudster gets a
higher score than a randomly chosen nonfraudster (DeLong et al. 1988;
Hanley and McNeil 1982). Note that the diagonal represents a random
scorecard whereby sensitivity equals 1-specificity for all cut-off points.
Hence, a good classifier should have an ROC above the diagonal and
AUC bigger than 50 percent.
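A minimal Python sketch computing the ROC curve and AUC for the Table 4.4 example, assuming scikit-learn, is given below.

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [1, 0, 1, 0, 0]                   # labels from Table 4.4
scores = [0.72, 0.56, 0.44, 0.18, 0.36]

fpr, tpr, cutoffs = roc_curve(y_true, scores)  # (1 - specificity, sensitivity)
print("AUC:", roc_auc_score(y_true, scores))   # 5 of the 6 fraud/no-fraud
                                               # pairs ranked correctly: 0.833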
A lift curve is another important performance metric. It starts by
sorting the population from high score to low score. Suppose now that
in the top 10 percent highest scores, there are 60 percent fraudsters
whereas the total population has 10 percent fraudsters. The lift value
in the top decile then becomes 60 percent/10 percent, or 6. In other
words, the lift value represents the cumulative percentage of fraudsters
per decile, divided by the overall population percentage of fraudsters. Using no model, or a random sorting, the fraudsters would be
equally spread across the entire range and the lift value would always

equal 1. Obviously, the lift curve always decreases as one considers
bigger deciles, until it reaches 1. This is illustrated in Figure 4.39.

Figure 4.39 Lift Curve
Note that a lift curve can also be expressed in a noncumulative way,
and is also often summarized as the top decile lift.
The cumulative accuracy profile (CAP), Lorenz or Power curve is
very closely related to the lift curve (see Figure 4.40). It also starts by
sorting the population from high score to low score and then measures
the cumulative percentage of fraudsters for each decile on the Y-axis.
The perfect model gives a linearly increasing curve up to the sample fraud rate and then flattens out. The diagonal again represents the
random model.
The CAP curve can be summarized in an accuracy ratio (AR) as
depicted in Figure 4.41.
The accuracy ratio is then defined as follows (see Figure 4.41):
(Area below power curve for current model – Area below power
curve for random model) ∕ (Area below power curve for
perfect model − Area below power curve for random model)
A perfect model will thus have an AR of 1 and a random model
an AR of 0. Note that the accuracy ratio is also often referred to as the
Gini coefficient. There is also a linear relation between the AR and the
AUC as follows: AR = 2 × AUC − 1.

Figure 4.40 Cumulative Accuracy Profile (percentage of fraudsters versus percentage of sorted population, for the scorecard, a random model, and the perfect model)

Figure 4.41 Calculating the Accuracy Ratio (AR = B∕(A + B))
The Kolmogorov-Smirnov distance is a separation measure
calculating the maximum distance between the cumulative score
distributions of the nonfraudsters P(s|NF) and fraudsters P(s|F) defined
as follows,
$$P(s|F) = \sum_{x \leq s} p(x|F)$$

$$P(s|NF) = \sum_{x \leq s} p(x|NF).$$

Note that, by definition, P(s|F) equals 1 − Sensitivity, and P(s|NF)
equals the Specificity. Hence, it can easily be verified that the KS
distance can also be measured on an ROC graph. In fact, it is equal
to the maximum vertical distance between the ROC curve and the
diagonal.

Figure 4.42 The Kolmogorov-Smirnov Statistic

The KS statistic ranges between 0 and 1. If there exists a
cut-off s such that all fraudsters have a score higher than s and all
nonfraudsters a score lower than s, then perfect separation is achieved
and the KS-statistic will equal 1.
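A minimal Python sketch of the KS distance, assuming SciPy and illustrative score distributions, is given below.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
scores_fraud = rng.normal(0.7, 0.15, size=200)     # fraudsters score higher
scores_nofraud = rng.normal(0.4, 0.15, size=1800)

# Maximum distance between the cumulative distributions P(s|F) and P(s|NF).
print("KS distance:", ks_2samp(scores_fraud, scores_nofraud).statistic)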
Another performance measure is the Mahalanobis distance M
between the score distributions defined as follows:
$$M = \frac{|\mu_F - \mu_{NF}|}{\sigma},$$

where 𝜇NF (𝜇F ) represents the mean score of the nonfraudsters (fraudsters) and 𝜎 the pooled standard deviation. Obviously, a high Mahalanobis distance is preferred since it means both score distributions are
well separated. Closely related is the divergence metric D calculated as
follows:
$$D = \frac{(\mu_{NF} - \mu_F)^2}{\frac{1}{2}\left(\sigma_{NF}^2 + \sigma_F^2\right)}.$$
For both M and D the minimum value is zero and there is no theoretical upper bound.
The Brier score (BS) measures the quality of the fraud probability
estimates as follows:
$$BS = \frac{1}{n} \sum_{i=1}^{n} (PF_i - \theta_i)^2,$$

where $PF_i$ is the probability of fraud for observation i, and $\theta_i$ a binary
indicator ($\theta_i$ equals 1 if fraud; 0 otherwise). The Brier score is always
bounded between 0 and 1, and lower values indicate better quality of the fraud probability estimates.
In case of multiclass targets, other performance measures need to
be adopted. Assume we have developed an analytical fraud detection
model with four classes: A, B, C, and D. A first performance measure is
the multiclass confusion matrix, which contrasts the predicted classes
versus the actual classes as depicted in Table 4.7.
Table 4.7 Multiclass Confusion Matrix

                    Actual Class
Predicted Class     A      B      C      D
A                  50      2      1      1
B                   3     20      2      1
C                   1      2     10      0
D                   1      0      2      4

The on-diagonal elements correspond to the correct classifications.
Off-diagonal elements represent errors. For Table 4.7, the classification accuracy becomes (50 + 20 + 10 + 4)∕100, or 84 percent, and thus

the classification error equals 16 percent. Note that the sensitivity and
specificity are no longer uniquely defined and need to be considered
for each class separately. For example, for class A, 50 out of the 55
observations are correctly classified or 91 percent. Also the precision
needs to be considered for each class individually. For example, for the
26 class B predictions, 20 are correct, or thus 77 percent.
Assume now that the ratings are ordinal, A = severe fraud, B =
medium fraud, C = light fraud, D = no fraud. In this case, not all errors
have equal impact. Given the ordinal nature of the target variable, the
further away from the diagonal, the bigger the impact of the error. For
example, when target class A is predicted as B, this is a less severe error
than when target class A is predicted as D. One could summarize this
in a notch difference graph, which is a bar chart depicting the cumulative accuracy for increasing notch differences. For our example, at the
0 notch difference level the cumulative accuracy equals 84 percent, at
the one-notch difference level 95 percent, at the two-notch difference
level 98 percent, and at the three-notch difference level 100 percent.
Figure 4.43 gives the corresponding notch difference graph. Obviously,
the cumulative accuracy always increases as the notch level increases
and will become 100 percent eventually.

Figure 4.43 A Cumulative Notch Difference Graph

Although ROC analysis was originally introduced in the context
of binary classification, several multiclass extensions have been
developed in the literature, in line with the One-versus-All or One-versus-One setups of casting a multiclass problem to several binary
ones, as discussed earlier (Hand 2001). A first approach is to generate
an ROC curve using each class in turn as the positive class and merging
all other classes into one negative class (similar to One-versus-All

coding). The multiclass AUC, AUCm , can then be computed as the
sum of the binary AUCs weighted by the class distribution as follows:
$$AUC_m = \sum_{i=1}^{m} AUC(c_i) \, p(c_i),$$

where m is the number of classes, AUC(ci ) the AUC obtained from considering class ci as the reference class, and p(ci ) the prior probability
of class ci . Another approach is based on the pairwise discriminability of classes and computes the multiclass AUC as follows (similar to
One-versus-One coding):

$$AUC_m = \frac{2}{m(m-1)} \sum_{i<j} AUC(c_i, c_j),$$

where the sum runs over all pairs of classes $(c_i, c_j)$.

Cost-Sensitive Learning
In a fraud detection setting, misclassifying a fraudster as a nonfraudster is typically costlier than the reverse, that is, C(−, +) > C(+, −). The
costs are typically also determined on an aggregated basis, rather than
on an observation-by-observation basis.
A first straightforward way to make a classifier cost-sensitive is by
adopting a cost-sensitive cut-off to map the posterior class probabilities
to class labels. In other words, an observation x will be assigned to the
class that minimizes the expected misclassification cost:
$$\operatorname{argmin}_i \left( \sum_{j \in \{-,+\}} P(j|x) \, C(i,j) \right),$$

where P(j|x) is the posterior probability of observation x to belong to
class j. As an example, consider a fraud detection setting whereby class
1 are the fraudsters and class 2 the nonfraudsters. An observation x will
be classified as a fraudster (class 1) if
$$P(1|x)\,C(1,1) + P(2|x)\,C(1,2) < P(1|x)\,C(2,1) + P(2|x)\,C(2,2)$$

or, assuming zero cost for the correct classifications (C(1,1) = C(2,2) = 0):

$$P(1|x)\,C(2,1) > P(2|x)\,C(1,2)$$
$$P(1|x)\,C(2,1) > (1 - P(1|x))\,C(1,2)$$
$$P(1|x) > \frac{C(1,2)}{C(1,2) + C(2,1)}$$
$$P(1|x) > \frac{1}{1 + \frac{C(2,1)}{C(1,2)}}$$

So, the cut-off only depends on the ratio of the misclassification
costs, which may be easier to determine than the individual misclassification costs themselves.
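A minimal Python sketch of such a cost-sensitive cut-off follows; the cost values are illustrative assumptions.

cost_fn = 50.0   # C(2,1): cost of predicting no fraud for an actual fraudster
cost_fp = 1.0    # C(1,2): cost of predicting fraud for an actual nonfraudster

cutoff = cost_fp / (cost_fp + cost_fn)   # = 1 / (1 + C(2,1)/C(1,2))
print("classify as fraud when P(1|x) >", cutoff)

p_fraud = 0.05                           # posterior from any classifier
print("fraud" if p_fraud > cutoff else "no fraud")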
Another approach to cost-sensitive learning works by directly
minimizing the misclassification cost during classifier learning.
Again assuming there is no cost for correct classifications, the total
misclassification cost is then as follows:
Total cost = C(–, +) × FN + C(+, –) × FP,


whereby FN and FP represent the number of false negatives and
positives, respectively. Various cost-sensitive versions of existing classification techniques have been introduced in the literature. Ting (2002)
introduced a cost-sensitive version of the C4.5 decision tree algorithm
where the splitting and stopping decisions are based on the misclassification cost. Veropolous et al. (1999) developed a cost-sensitive version
of SVMs whereby the misclassification costs are taken into account
in the objective function of the SVM. Domingos (1999) introduced
MetaCost, which is a meta-algorithm capable of turning any classifier
into a cost-sensitive classifier by first relabeling observations with their
estimated minimal-cost classes and then estimating a new classifier
on the relabeled data set. Fan et al. (1999) developed AdaCost, a
cost-sensitive variant of AdaBoost, which uses the misclassification
costs to update the weights in successive boosting runs.
To summarize, cost-sensitive learning approaches are usually
more complex to work with than the sampling approaches discussed
earlier. López et al. (2012) conducted a comparison of sampling versus
cost-sensitive learning approaches for imbalanced data sets and found
that both perform comparably well. Hence, from a pragmatic
viewpoint, it is recommended to use the sampling approaches in a
fraud-detection setting.

FRAUD PERFORMANCE BENCHMARKS
To conclude this chapter, Table 4.15 provides references to scientific papers discussing fraud detection across a diversity of settings. The
type of fraud, size of data set used, class distribution, and performance
are reported. To facilitate the comparison, only papers that report the
area under the ROC curve (AUC) are included. The following conclusions can be drawn:
◾ Credit card, financial statement, and telecommunications fraud case studies report the highest AUC.
◾ Insurance and social security fraud case studies report the lowest AUC.
◾ All case studies, except for financial statement fraud, start from highly skewed data sets.

Table 4.15 Performance Benchmarks for Fraud Detection

Reference                               Type of Fraud                Size of Data Set Used                   Class Distribution   Performance
Ortega, Figueroa et al. (2006)          Medical insurance fraud      8,819                                   5% fraud             AUC: 74%
Šubelj, Furlan et al. (2011)            Automobile insurance fraud   3,451                                   1.3% fraud           AUC: 71%–92%
Bhattacharyya, Jha et al. (2011)        Credit card fraud            50 million transactions on about        0.005% fraud         AUC: 90.8%–95.3%
                                                                     1 million credit cards from a
                                                                     single country
Whitrow, Hand et al. (2009)             Credit card fraud            33,000–36,000 activity records          0.1% fraud           Gini: 85% (≈ AUC = 92.5%)
Van Vlasselaer, Bravo et al. (2015)     Credit card fraud            3.3 million transactions                <1% fraud            AUC: 98.6%
Dongshan and Girolami (2007)            Telecommunications fraud     809,395 calls from 1,087 accounts       0.024% fraud         AUC: 99.5%
Van Vlasselaer, Meskens et al. (2013)   Social security fraud        2,000 observations                      1% fraud             AUC: 80%–85%
Ravisankar, Ravi et al. (2011)          Financial statement fraud    202 companies                           50% fraud            AUC: 98.09%

REFERENCES
Allison, P. D. (2001). Logistic Regression Using the SAS® System: Theory and
Application. Hoboken, NJ: John Wiley-SAS.
Baesens, B. (2014). Analytics in a Big Data World. Hoboken, NJ: John Wiley
& Sons.
Baesens, B., Martens, D., Setiono, R., & Zurada, J. (2011). White Box Nonlinear Prediction Models, editorial special issue, IEEE Transactions on Neural
Networks 22 (12): 2406–2408.


Baesens, B., Mues, C., Martens, D., & Vanthienen, J. (2009). 50 Years of Data
Mining and OR: Upcoming Trends and Challenges. Journal of the Operational Research Society 60: 16–23.
Baesens, B., Setiono, R., Mues, C., & Vanthienen, J. (March 2003). Using
Neural Network Rule Extraction and Decision Tables for Credit-Risk
Evaluation. Management Science 49 (3): 312–329.
Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., & Vanthienen, J. (2003). Benchmarking State of the Art Classification Algorithms for Credit Scoring. Journal of the Operational Research Society 54 (6):
627–635.
Baesens, B., Viaene, S., Van den Poel, D., Vanthienen, J., & Dedene, G. (2002).
Bayesian Neural Network Learning for Repeat Purchase Modelling
in Direct Marketing. European Journal of Operational Research 138 (1):
191–211.
Bartlett P. L. (1997). For Valid Generalization, the Size of the Weights Is More
Important than the Size of the Network. In M. C. Mozer, M. I. Jordan,
and T. Petsche (Eds.), Advances in Neural Information Processing Systems 9.
Cambridge, MA: MIT Press, pp. 134–140.
Benjamin, N., Cathcart, A., & Ryan, K. (2006). Low Default Portfolios: A Proposal for Conservative Estimation of Default Probabilities; Discussion
paper, Financial Services Authority.
Bhattacharyya, S., Jha, S. K., Tharakunnel, K. K., & Westland, J. C. (2011).
Data Mining for Credit Card Fraud: A Comparative Study, Decision Support
Systems 50 (3): 602–613.
Bi, J., & Bennett, K. P. (2003). Regression Error Characteristic Curves, Proceedings of the 20th International Conference on Machine Learning (ICML),
pp. 43–50.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford: Oxford
University Press.
Breiman, L. (1996). Bagging Predictors. Machine Learning 24 (2): 123–140.
Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J. (1984). Classification
and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced
Books & Software.
Breiman, L., (2001). Random Forests. Machine Learning 45 (1): 5–32.
Chawla, N. V., Bowyer, K.W., Hall, L. O., Kegelmeyer, W. P. (2002). SMOTE:
Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16: 321–357.
Craven, M., & Shavlik, J. (1996). Extracting Tree-Structured Representations
of Trained Networks. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo


(Eds.), Advances in Neural Information Processing Systems 8. Cambridge, MA:
MIT Press, pp. 24–30.
Cristianini, N., Taylor J. S. (2000). An Introduction to Support Vector Machines
and Other Kernel-Based Learning Methods. Cambridge: Cambridge University
Press.
Dejaeger, K., Verbeke, W., Martens, D., & Baesens, B. (2012). Data Mining
Techniques for Software Effort Estimation: A Comparative Study, IEEE
Transactions on Software Engineering 38 (2): 375–397.
Domingos, P. (1999). MetaCost: A General Method for Making Classifiers CostSensitive, Proceedings of the Fifth International Conference on Knowledge
Discovery and Data Mining. New York: ACM Press, pp. 155–164.
Dongshan X., & Girolami M. (2007). Employing Latent Dirichlet Allocation
for Fraud Detection in Telecommunications. Pattern Recognition Letters 28:
1727–1734.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern Classification. New York:
John Wiley & Sons.
Efron B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals
of Statistics 7 (1): 1–26.
Fan, W., Stolfo, S. J., Zhang J., & Chan P. K. (1999). AdaCost: Misclassification Cost-sensitive Boosting. Proceedings of the Sixteenth International
Conference on Machine Learning (ICML). Waltham, MA: Morgan Kaufmann,
pp. 97–105.
Fawcett, T. (2003). ROC Graphs: Notes and Practical Considerations for
Researchers, HP Labs Tech Report HPL-2003–4.
Flach P. (2012). Machine Learning: The Art and Science of Algorithms that Make Sense
of Data. Cambridge: Cambridge University Press.
Freund, Y., & Schapire, R. E. (August 1997). A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting. Journal of
Computer and System Sciences 55 (1):119–139.
Freund, Y., & Schapire, R. E. (September 1999). A Short Introduction to Boosting. Journal of Japanese Society for Artificial Intelligence 14 (5): 771–780.
Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques, 2nd ed.
Waltham, MA: Morgan Kaufmann.
Hand, D., Till, R. J. (2001). A Simple Generalization of the Area under the
ROC Curve to Multiple Class Classification Problems. Machine Learning 45
(2):171–186.
Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining.
Cambridge, MA: MIT Press.
Hanley, J. A., & McNeil, B. J. (1982). The Meaning and Use of Area under the
ROC Curve. Radiology 143: 29–36.


Hartigan, J. A. (1975). Clustering Algorithms. New York: John Wiley & Sons.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). Elements of Statistical Learning:
Data Mining, Inference and Prediction. New York: Springer-Verlag.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer Feedforward Networks Are Universal Approximators. Neural Networks 2 (5):
359–366.
Huysmans, J., Baesens, B., Van Gestel, T., & Vanthienen, J. (April 2006). Using
Self-Organizing Maps for Credit Scoring. Expert Systems with Applications,
Special Issue on Intelligent Information Systems for Financial Engineering 30 (3):
479–487.
Kass, G. V. (1980). An Exploratory Technique for Investigating Large Quantities
of Categorical Data. Applied Statistics 29 (2): 119–127.
López, V., Fernández, A., Moreno-Torres, J., & Herrera, F. (2012). Analysis of
Preprocessing vs. Cost-Sensitive Learning for Imbalanced Classification.
Open Problems on Intrinsic Data Characteristics. Expert Systems with Applications 39: 6585–6608.
Lu, H., Setiono, R., & Liu, H. (September 1995). NeuroRule: a Connectionist
Approach to Data Mining. In Proceedings of 21st International Conference on
Very Large Data Bases, Zurich, Switzerland, pp. 478–489.
Mangasarian, O. L. (May–June 1965). Linear and Nonlinear Separation of
Patterns by Linear Programming, Operations Research 13: 444–452.
Martens, D., Baesens, B., & Van Gestel, T. (2009). Decompositional Rule
Extraction from Support Vector Machines by Active Learning. IEEE
Transactions on Knowledge and Data Engineering 21 (1): 178–191.
Martens, D., Baesens, B., Van Gestel, T., & Vanthienen, J. (2007). Comprehensible Credit Scoring Models Using Rule Extraction from Support Vector
Machines. European Journal of Operational Research 183: 1466–1476.
Martens, D., Vanthienen, J., Verbeke, W., & Baesens, B. (2011). Performance
of Classification Models from a User Perspective. Decision Support Systems,
Special Issue on Recent Advances in Data, Text, and Media Mining & Information
Issues in Supply Chain and in Service System Design 51 (4): 782–793.
Moody, J., & Utans, J. (1994). Architecture Selection Strategies for
Neural Networks: Application to Corporate Bond Rating Prediction.
In Apostolos-Paul Refenes (Ed.), Neural Networks in the Capital Markets.
New York: John Wiley & Sons.
Ortega, P. A., Figueroa, C. J., & Ruz, G. A. (2006). Medical Claim Fraud/Abuse
Detection System based on Data Mining: A Case Study in Chile, Proceedings
of the 2006 International Conference on Data Mining, DMIN.


Pluto, K., & Tasche, D. (2005). Estimating Probabilities of Default for Low
Default Portfolios. In B. Engelmann and R. Rauhmeier (Eds.), The Basel
II Risk Parameters. New York: Springer.
Quinlan, J. R. (1993). C4.5 Programs for Machine Learning. Waltham, MA:
Morgan Kauffman Publishers.
Ravisankar, P., Ravi, V., Rao, G. R., & Bose, I. (2011). Detection of Financial
Statement Fraud and Feature Selection Using Data Mining Techniques,
Decision Support Systems 50: 491–500.
Saerens M., Latinne P., & Decaestecker C. (2002). Adjusting the Outputs of a
Classifier to New a Priori Probabilities: A Simple Procedure. Neural Computation 14 (1): 21–41.
Schölkopf, B., & Smola, A. (2002). Learning with Kernels. Cambridge, MA: MIT Press.
Setiono R. Baesens B., & Mues C. (2009). A Note on Knowledge Discovery
Using Neural Networks and Its Application to Credit Card Screening. European Journal of Operational Research 192 (1): 326–332.
Setiono R., Baesens B. & Mues C. (2011). Rule Extraction from Minimal Neural
Network for Credit Card Screening. International Journal of Neural Systems
21 (4): 265–276.
Šubelj, L., Furlan, S., & Bajec, M. (2011). An Expert System for Detecting Automobile
Insurance Fraud Using Social Network Analysis. Expert Systems with Applications 38 (1): 1039–1052.
Tan P. N., Steinbach M., & Kumar V. (2006). Introduction to Data Mining. New
York: Pearson.
Ting K. M. (2002). An Instance-Weighted Method to Induce Cost-Sensitive
Trees, IEEE Transactions on Knowledge and Data Engineering 14: 659–665.
Van Gestel, T., & Baesens B. (2009). Credit Risk Management: Basic Concepts: Financial Risk Components, Rating Analysis, Models, Economic and Regulatory Capital.
Oxford: Oxford University Press.
Van Gestel, T., Baesens B., & Martens D. (2015). Predictive Analytics: Techniques
and Applications in Credit Risk Modeling. Oxford: Oxford University Press,
forthcoming.
Van Gestel, T., Baesens, B., Van Dijcke, P., Suykens, J., Garcia, J., & Alderweireld, T. (2005). Linear and Nonlinear Credit Scoring by Combining
Logistic Regression and Support Vector Machines. Journal of Credit Risk 1 (4).
Van Gestel, T., Suykens J., Baesens B., Viaene S., Vanthienen J., Dedene G., De
Moor B., & Vandewalle J. (January 2004). Benchmarking Least Squares
Support Vector Machine Classifiers. Machine Learning 54 (1): 5–32.
Van Vlasselaer, V., Akoglu L., Eliassi-Rad T., Snoeck M., & Baesens B.
(2015). Guilt-by-Constellation: Fraud Detection by Suspicious Clique


Memberships, Proceedings of the 48th Annual Hawaii International Conference on
System Sciences, HICSS-48, Kauai (Hawaii), January 5–8.
Van Vlasselaer, V., Bravo C., Caelen O., Eliassi-Rad T., Akoglu L., Snoeck M.,
& Baesens B. (2015). APATE: A novel approach for automated credit card
transaction fraud detection using network-based extensions. Decision Support Systems 75: 38–48.
Van Vlasselaer, V., Meskens J., Van Dromme D., & Baesens B. (2013). Using
Social Network Knowledge for Detecting Spider Constructions in Social
Security Fraud. Proceedings of the 2013 IEEE/ACM International Conference on
Advances in Social Network Analysis and Mining, Niagara Falls.
Vapnik, V. (1995). The Nature of Statistical Learning Theory, New York:
Springer-Verlag.
Veropoulos, K., Campbell, C., Cristianini N. (1999). Controlling the Sensitivity
of Support Vector Machines. Proceedings of the International Joint Conference
on AI, pp. 55–60.
Viaene, S., Derrig, R., Baesens, B., Dedene, G. (2002). A Comparison of
State-of-the-Art Classification Techniques for Expert Automobile Insurance Fraud Detection. Journal of Risk and Insurance, Special issue on Fraud
Detection 69 (3): 433–443.
Whitrow, C., Hand, D. J., Juszczak, P., Weston, D., Adams, N.M. (2009). Transaction Aggregation as a Strategy for Credit Card Fraud Detection. Data
Mining and Knowledge Discovery 18: 30–55.
Zurada, J.M. (1992). Introduction to Artificial Neural Systems. Boston: PWS
Publishing.

CHAPTER 5

Social Network Analysis for Fraud Detection

In the last decade, the use of social media websites in everybody’s
daily life has boomed. People can continue their conversations on
online social network sites like Facebook, Twitter, LinkedIn, Google+,
Instagram, and so on and share their experiences with their acquaintances, friends, family, and others. It only takes one click to update
your whereabouts to the rest of the world. Plenty of options exist to
broadcast your current activities: by picture, video, geo-location, links,
or just plain text. You are on the top of the world—and everybody’s
watching. And this is where it becomes interesting.
Users of online social network sites explicitly reveal their relationships with other people. As a consequence, social network sites are
an (almost) perfect mapping of the relationships that exist in the real
world. We know who you are, what your hobbies and interests are, to
whom you are married, how many children you have, your buddies
with whom you run every week, your friends at the wine club, etc. This
whole interconnected network of people knowing each other, somehow, is an extremely interesting source of information and knowledge.
Marketing managers no longer have to guess who might influence
whom to create the appropriate campaign. It is all there—and that is
exactly the problem. Social network sites acknowledge the richness of
the data sources they have, and are not willing to share them as such,
free of cost. Moreover, those data are often privatized and regulated, and well hidden from commercial use. On the other hand, social
network sites offer many good built-in facilities to managers and other
interested parties to launch and manage their marketing campaigns by
exploiting the social network, without publishing the exact network
representation.
However, companies often forget that they can reconstruct (a
part of) the social network using in-house data. Telecommunication
providers, for example, have a massive transactional database where
they record call behavior of their customers. Under the assumption
that good friends call each other more often, we can recreate the
network and indicate the tie strength between people based on the
frequency and/or duration of calls. Internet infrastructure providers
might map the relationships between people using their customers’
IP-addresses. IP-addresses that frequently communicate are represented by a stronger relationship. In the end, the IP-network will

reflect the relational structure between people from another point of view, yet to a certain extent as it is observed in reality. Many more examples can be found in the banking, retail, and online gaming industries.
The fraud detection domain might also benefit from the analysis of social networks. In this chapter, we underline the social character of fraud. This means that we assume that the probability of someone committing fraud depends on the people he or she is connected to: the so-called guilt-by-association (Koutra et al. 2011). If we know that five friends of Bob are fraudsters, what would we say about Bob? Is he also likely to be a fraudster? If these friends are Bob's only friends, is it more likely that Bob will be influenced to commit fraud? And if Bob has 200 other friends, will the influence of these five fraudsters be the same?
In this chapter, we briefly introduce the reader to networks and their applications in a fraud detection setting. One of the main questions answered in this chapter is how unstructured network information can be translated into useful and meaningful characteristics of a subject. We will analyze and extract features from the direct neighborhood (i.e., the direct associates of a certain person or subject) as well as from the network as a whole (i.e., collective inferencing). Those network-based features can serve as an enrichment of traditional data analysis techniques.

NETWORKS: FORM, COMPONENTS, CHARACTERISTICS,
AND THEIR APPLICATIONS
Networks are everywhere. Making a telephone call requires setting up a communication over a wired network of all possible respondents by sending voice packets between the caller and the callee. The supply of water, gas, and electricity for home usage runs over a complex distribution network that consists of many source, intermediary, and destination points, where the sources need to produce enough output to meet the demand of the destination points. Delivery services need to find the optimal route to make sure that all packages are delivered to their final destination as efficiently as possible. Even a simple trip to


the store involves the processing of many networks. What is the best route to drive from home to the store given the current traffic? Given a shopping list, how can I efficiently move through the store such that I find every product on my list?

One of humans' talents is exactly the processing of such networks. Almost subconsciously, people have a very good sense for finding an efficient way through a network. Consider your home-to-work connection: depending on the time and the day, you might change your route from home to work without explicitly drawing the network and running an optimization algorithm. Reaching other people, even without today's telecommunication media such as the telephone and the Internet, is often an easy task: there is always a friend of a friend who knows the person you are looking for.
The mathematical study of optimizing network-related problems was introduced long ago by Euler (1736), who formulated the problem of the Königsberg bridges. Königsberg (now Kaliningrad, Russia) was a Prussian city divided into four parts by the river Pregel. Seven bridges connected the four banks of the city (see Figure 5.1a and Figure 5.1b). The problem is as follows: "Does there exist a walking route that crosses all seven bridges exactly once?" A path that traverses all edges (here: bridges) of a network exactly once is an Eulerian path. Euler proved that such a path cannot exist for the Königsberg bridge problem. More specifically, an Eulerian path only exists when every node (here: bank) is reached by an even number of edges, except for the start and end nodes of the path, which must have an odd number of incident edges.

Figure 5.1a Königsberg Bridges

Figure 5.1b Schematic Representation of the Königsberg Bridges
Analogously, a Hamiltonian path in the network is a path that visits each node exactly once. The Traveling Salesman Problem (TSP), for example, searches for a shortest Hamiltonian path (or cycle) in the network: given a set of cities, a salesman has to visit each city (i.e., node) exactly once to deliver his packages. As this is an NP-hard problem, research mainly focuses on finding good heuristics to solve the TSP.

Social Networks
Although the networks in the previous examples are built and developed by humans, they are not social. A key question here is, "What makes a network social?" In general, we might say that a network is social whenever the actors are people or groups of people. A connection between actors is based on some form of social interaction between them, such as a friendship. As in the real world, social networks are also able to reflect the intensity of a relationship between people. How well do you know your contacts? The relationship between two best friends completely differs from the relationship between two distant acquaintances. Those relationships and their intensity are an important source of information exchange.

In 1967, the psychologist Stanley Milgram measured how social the world is. He conducted a small-world experiment whereby he distributed 100 letters to randomly selected people. The task at hand was to deliver the letter to a specified destination, which was


one of Milgram's friends. Rather than sending the letter back by mail, people could only pass the letter to someone they knew. This person, in turn, had to forward the letter to one of his or her contacts, and so on, until the letter reached its final destination. Milgram showed that, on average, each letter reached its destination within six hops; that is, fewer than six intermediate people are needed to connect two random people in the network. This is the average path length of the network. The result of the experiment is widely known as the six degrees of separation theory. Milgram also found that many letters reached their target destination within three steps. This is the so-called funneling effect: some people know, and are known by, many other people, often from highly diverse contact groups (e.g., work, friends, hobbies). Those people are sociometric superstars, connecting different parts of the network to each other. Many paths in the network pass through these people, giving them a high betweenness score (see the section on Centrality Metrics).

While the six degrees of separation theory is based on real-life results, many studies have shown that an average path length of six is an overestimate for online social networks. Those studies report an average path length of approximately four hops between any two random people in an online social network (Kwak et al. 2010). Online social networks are thus denser than real-life networks. However, the intensity of the relationships may differ strongly.
Social networks are an important element in the analysis of fraud. Fraud is often committed through illegal set-ups with many accomplices. When traditional analytical techniques fail to detect fraud due to a lack of evidence, social network analysis might give new insights by investigating how people influence each other. These are the so-called guilt-by-associations, where we assume that fraudulent influences run through the network. For example, insurance companies often have to deal with groups of fraudsters trying to swindle by resubmitting the same claim using different people. Suspicious claims often involve the same claimers, claimees, vehicles, witnesses, and so on. By creating and analyzing an appropriate network, inspectors might gain new insights into the suspiciousness of a claim and can prevent the claim from being pursued further.

In social security fraud, employers try to avoid paying their tax contributions to the government by intentionally going bankrupt. Bankrupt employers are not capable of redeeming their tax debts to


the government, and are discharged from their obligations. However, social network analysis can reveal that the employer is refounded with almost the same structure a couple of weeks later. As such, experts can declare the foundation of the new employer unlawful and still recover the outstanding debts. Opinion fraud occurs when people untruthfully praise or criticize a product in a review. Online reviews in particular lack controls to establish the genuineness of a review. Matching people to their reviews and comparing the reviews with others using a network representation enables review websites to detect illicit reviews.
Identity theft is a special form of social fraud, as introduced in Chapter 1, where an illicit person adopts another person's profile. Examples of identity theft can be found in telecommunications fraud, where fraudsters "share" an account with a legitimate customer. This is depicted in Figure 5.2. As fraudsters cannot resist calling their family, friends, acquaintances, and so on, the network clearly links the account to the fraudster's previous (or current) account. Once the fraudster takes over a new customer's account, the contact list of the customer is extended with contacts who were never called before. The frequent contact list of the fraudster is a strong indicator of fraud.

Figure 5.2 Identity Theft. The Frequent Contact List of a Person is Suddenly Extended
with Other Contacts (Light Gray Nodes). This Might Indicate that a Fraudster (Dark Gray
Node) Took Over that Customer’s Account and “shares” his/her Contacts


While networks are a powerful visualization tool, they mainly serve to support the findings of automated detection techniques. We will focus on how to extend the detection process by extracting useful and meaningful features from the network. The network representation can be used afterward to verify the obtained results.

Network Components
This section will introduce the reader to graph theory, the mathematical foundation for the analysis and representation of networks.
Complex network analysis (CNA) studies the structure, characteristics, and dynamics of networks that are irregular, complex, and
dynamically evolving in time (Boccaletti et al. 2006). Those networks
often consist of millions of closely interconnected units. Most real-life
networks are complex. CNA uses graph theory to extract useful statistics from the network. Boccaletti et al. (2006) define graph theory as
the natural framework for the exact mathematical treatment of complex networks, and, they state that formally, a complex network is
represented as a graph.
A graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ consists of a set $\mathcal{V}$ of vertices or nodes (the points) and a set $\mathcal{E}$ of edges or links (the lines connecting the points). This is illustrated in Figure 5.3. A node $v \in \mathcal{V}$ represents a real-world object such as a person, a computer, or an activity. An edge $e \in \mathcal{E}$ connects two nodes in the network:

$$e = (v_1, v_2), \quad e \in \mathcal{E}, \; v_1, v_2 \in \mathcal{V}.$$

An edge represents a relationship between the nodes it connects, such as a friendship (between people), a physical connection (between computers), or attendance (of a person at an event).

Figure 5.3 Network Representation
A graph where the edges impose an order or direction between
the nodes in the network is a directed graph. If there is no order in
the network, we say that the graph is undirected. This is shown in
Figure 5.4. The social network website Twitter can be represented as
a directed graph. Users follow other users, without necessarily being
refollowed. This is expressed by the follower–followee relationships,
and is illustrated in Figure 5.5. User 1 follows Users 2, 3, and 5 (follower relationships), and is followed by Users 4 and 5 (followee relationships). There is a mutual relationship between Users 1 and 5.

Figure 5.4 Example of an (Un)Directed Graph

Figure 5.5 Follower–Followee Relationships in a Twitter Network

Figure 5.6 Edge Representation
In general, edges connect two nodes to each other. However, some special variants are sometimes required to accurately map reality (see Figure 5.6):
◾ Self-edge: A self-edge is a connection between a node and itself. For example, a person who transfers money from his or her account to another account he or she owns.

◾ Multi-edge: A multi-edge exists when two nodes are connected by more than one edge. For example, in credit card transaction fraud, a credit card holder is linked to a merchant by a multi-edge if multiple credit card transactions occurred between them.

◾ Hyper-edge: A hyper-edge is an edge that connects more than two nodes in the network. For example, three people who went to the same event.

A graph where the edges express the intensity of the relationships is a weighted graph $\mathcal{G}_w = (\mathcal{V}, \mathcal{E})$.

◾ Binary weight: This is the standard network representation. Here, the edge weight is either 0 or 1, and reflects whether or not a link exists between two nodes. An extension of the binary weighted graphs are the signed graphs, where the edge weight is negative (-1), neutral (0), or positive (+1). Negative weights are used to represent animosity, positive weights friendship, and neutral weights an "I don't know you" relationship.


◾ Numeric weight: A numeric edge weight expresses the affinity of a person to the other persons he or she is connected to. High values indicate a closer affiliation. As people do not assign a weight to each of their contacts themselves, many approaches have been proposed to define an edge weight between nodes. A popular one is the common neighbor approach: the edge weight equals the total number of common activities or events both people attended. An activity or event should be interpreted in a broad sense: the total number of messages sent between them, common friends, likes on Facebook, and so on.

◾ Normalized weight: The normalized weight is a variant of the numeric weight where all the outgoing edge weights of a node sum to 1. The normalized weight is often used in influence propagation over a network.

◾ Jaccard weight: The edge weight depends on how "social" both nodes are (Gupte and Eliassi-Rad 2012):

$$w(v_1, v_2) = \frac{|\Gamma(v_1) \cap \Gamma(v_2)|}{|\Gamma(v_1) \cup \Gamma(v_2)|}$$

with $\Gamma(v_i)$ the set of events node $v_i$ attended. For example, assume that person A attended 10 events and person B attended 5 events, 3 of which they attended together. Then, according to the Jaccard index, their edge weight equals 3/12 = 1/4.
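As a minimal Python sketch (the event sets below are hypothetical toy data), the Jaccard weight of the example above can be computed as follows:

    # Jaccard edge weight between two people, given the sets of events
    # each of them attended (hypothetical toy data).
    def jaccard_weight(events_a, events_b):
        union = events_a | events_b
        if not union:
            return 0.0
        return len(events_a & events_b) / len(union)

    events_a = {f"event_{i}" for i in range(10)}               # person A: 10 events
    events_b = {"event_0", "event_1", "event_2", "x1", "x2"}   # person B: 5 events, 3 shared
    print(jaccard_weight(events_a, events_b))                  # 3 / 12 = 0.25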
Edge weights represent the connectivity within a network, and are in some way a measure of the sociality between the nodes in the network. Nodes, on the other hand, use labels to express local characteristics. Those characteristics are mostly proper to the node and may include, for example, demographics, preferences, interests, beliefs, and so on. When analyzing fraud networks, we integrate the fraud label of the nodes into the network. A node can be fraudulent or legitimate, depending on the condition of the object it represents. For example, Figure 5.7 shows a fraud network where the legitimate and fraudulent people are represented by white and black nodes, respectively. Given this graph, we know that nodes A and B committed fraud beforehand. Node C is a friend of node A and is influenced by the actions of node A. Node D, on the other hand, is influenced by both nodes A and B. A simple conclusion would be that node D has the highest probability of perpetrating fraud, followed by node C.


Figure 5.7 Example of a Fraudulent Network

While real-life networks often contain millions of nodes and billions of links, sometimes the direct neighborhood of a node provides enough information to base decisions on. An ego-centered network or egonet represents the one-hop neighborhood of the node of interest. In other words, an egonet consists of a particular node and its immediate neighbors. The center of the egonet is the ego, and the surrounding nodes are the alters. An example of an egonet is illustrated in Figure 5.8. Such networks are also called the first-order neighborhood of a node. Analogously, the n-order neighborhood of a node encompasses all the nodes that can be reached within n hops from the node of interest.

Figure 5.8 An Egonet. The Ego is Surrounded by Six Alters, of Whom Two are Legitimate (White Nodes) and Four are Fraudulent (Gray Nodes)


Network Representation

Transactional data sources often contain information about how entities relate to each other (e.g., call record data, bank transfer data). An example transactional data source for credit card fraud is given in Table 5.1. Each line in the transactional data source represents a money transfer between two actors: a credit card holder and a merchant. Despite the structured representation of the data, the relationships between credit card holders and merchants are hard to capture. Real-life data sources contain billions of transactions, making it impossible to extract correlations and useful insights by mere inspection. Network visualization tools offer a powerful solution to make information hidden in networks easy to interpret and understand. Inspecting the visual representation of a network can be part of the preprocessing phase, as it familiarizes the user with the data and can often quickly result in some first findings and insights. In the post-processing phase, the network is a useful representation to verify the obtained results and understand the rationale. In general, a network can be represented in two ways:

◾ Graphically

◾ Mathematically

Table 5.1 Example of Credit Card Transaction Data

Credit Card       Merchant  Merchant Category  Country  Amount  Date                 Accept  Fraud
8202092217124626  207005    056                USA      112.99  2013-11-06 00:28:38  TRUE    FALSE
1887940000202544  105930    234                IRL        3.58  2013-11-06 00:28:40  TRUE    FALSE
2070050002009251  79768     612                BEL      149.50  2013-11-06 00:28:47  TRUE    FALSE
1809340000672044  11525     056                BEL      118.59  2013-11-06 00:28:49  FALSE   FALSE
4520563752703209  323158    056                USA       22.27  2013-11-06 00:28:50  TRUE    TRUE
5542610001561826  68080     735                FRA       50.00  2013-11-06 00:28:51  TRUE    FALSE
…


The graphical representation of a network, or sociogram, is the most intuitive and straightforward visualization of a network. A toy example of a credit card fraud network is shown in Figure 5.9. Credit card holders are modeled by rectangles, merchants by circles. The thin (thick) edges represent legitimate (fraudulent) transactions. Based on the figure, we expect that the credit card of user Y has been stolen and that merchant 1 acts suspiciously. The sociogram can be used to present results at different levels in an organization: the operational, tactical, and strategic management all benefit from interpreting the network representation, by evaluating how to detect and monitor suspicious business processes (operational), how to act on them (tactical), and how to deal with fraud in the future and take prevention measures (strategic).

Figure 5.9 Toy Example of Credit Card Fraud

While graphical network representations are mainly appropriate for visualization purposes, they are an unstructured form of data and cannot be used to compute useful statistics and extract meaningful


characteristics. As a consequence, there is a need to represent the network in a mathematically workable way. The adjacency matrix and the adjacency list are two network representations that fulfill this requirement. The adjacency or connectivity matrix $A_{n \times n}$ is a matrix of size $n \times n$, with $n$ the number of nodes in the network, where $a_{i,j} = 1$ if a link exists between nodes $i$ and $j$, and $a_{i,j} = 0$ otherwise. Figure 5.10a shows an example of a small network; the corresponding adjacency matrix is depicted in Figure 5.10b. Note that the adjacency matrix is a sparse matrix, containing many zero values. This is often the case in real-life situations. Social networks have millions of members, but people are only connected to a small number of friends. For example, Twitter has 500 million users, and each user follows approximately 200 other users.¹ The adjacency matrix of a network records which nodes are connected to each other, irrespective of the edge weight. The weight matrix $W_{n \times n}$ expresses the edge weight between the nodes of a network, where $w_{i,j} \in \mathbb{R}$ if a link exists between nodes $i$ and $j$, and $w_{i,j} = 0$ otherwise. The weight matrix of the sample network is given in Figure 5.10c. The adjacency list is a compact representation of the adjacency matrix, and provides a list of all the connections present in the network. A relationship between node $v_i$ and node $v_j$ is denoted as $(v_i, v_j)$. This is illustrated in Figure 5.10d. The weight list extends the adjacency list by specifying the weights of the relationships, and has the format $(v_i, v_j, w_{i,j})$, with $w_{i,j}$ the weight between node $v_i$ and node $v_j$ (see Figure 5.10e).

Figure 5.10 Mathematical Representation of (a) a Sample Network: (b) the Adjacency or Connectivity Matrix; (c) the Weight Matrix; (d) the Adjacency List; and (e) the Weight List

¹ http://news.yahoo.com/twitter-statistics-by-the-numbers-153151584.html, retrieved December 2014.
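The following minimal Python sketch (using a small hypothetical edge list) builds all four mathematical representations from the same data:

    import numpy as np

    # Hypothetical weighted edge list (node_i, node_j, weight) of an
    # undirected network with nodes 0..3.
    edges = [(0, 1, 2.0), (0, 2, 1.0), (1, 2, 3.0), (2, 3, 1.0)]
    n = 4

    A = np.zeros((n, n), dtype=int)   # adjacency (connectivity) matrix
    W = np.zeros((n, n))              # weight matrix
    for i, j, w in edges:
        A[i, j] = A[j, i] = 1         # symmetric: the network is undirected
        W[i, j] = W[j, i] = w

    adjacency_list = [(i, j) for i, j, _ in edges]   # pairs (v_i, v_j)
    weight_list = list(edges)                        # triples (v_i, v_j, w_ij)
    print(A.sum() // 2 == len(edges))                # True: same information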

IS FRAUD A SOCIAL PHENOMENON? AN INTRODUCTION TO HOMOPHILY

One of the essential questions before analyzing the network with regard to fraud is deciding whether the detection models might benefit from complex network analysis (CNA). In other words, do the relationships between people play an important role in fraud, and is fraud a contagious effect in the network? Are fraudsters randomly spread over the network, or are there observable effects indicating that fraud is a social phenomenon, that is, that fraud tends to cluster together? We look for evidence that fraudsters are possibly exchanging knowledge about how to commit fraud using the social structure. Fraudsters can be linked together because they attend the same events or activities, are involved in the same crimes, use the same set of resources, or are sometimes even one and the same person (see also identity theft).

Homophily is a concept borrowed from sociology that boils down to the expression "Birds of a feather flock together." People have a strong tendency to associate with others whom they perceive as being similar to themselves in some way (Newman 2010). Friendships are mostly built on similar interests, the same origin, high school, neighborhood, or hobbies, or even a shared tendency to commit fraud. Relationships determine which people are influenced by whom and the extent to which information is exchanged.


A network is homophilic if nodes with label x (e.g., fraud) are to a larger extent connected to other nodes with label x. In marketing, the concept of homophily is frequently exploited to assess how individuals influence each other, and to determine which people are likely responders and should be targeted with a marketing incentive. For example, if all of John's friends are connected to telecom provider Beta, John is likely to sign a contract with provider Beta as well. A network that is not homophilic is heterophilic.

The same reasoning holds in fraud. We define a homophilic network as a network where fraudsters are more likely to be connected to other fraudsters, and legitimate people are more likely to be connected to other legitimate people. A homophilic network is shown in Figure 5.11. The gray (white) nodes represent fraudsters (legitimate people). Visually, gray nodes are grouped together in so-called webs of fraud. Legitimate nodes are also clustered together. Remark that the network is not perfectly homophilic: fraudulent nodes are not uniquely connected to other fraudulent nodes, but connect to legitimate nodes as well. Figure 5.11 also illustrates that small webs of fraud exist in the network, for example, subgraphs of only one or two fraudulent nodes.

Figure 5.11 A Real-Life Example of a Homophilic Network
Advanced network techniques take the time dimension into account. A few fraudulent nodes popping up together in the network might indicate a newly originating web of fraud, while subgraphs characterized by many fraudulent nodes are far-evolved structures. Preventing the growth of new webs and the expansion of existing webs are important challenges that both need to be addressed in fraud detection models.
We already showed that a graphical representation of the fraud network can give a first indication of the homophilic character of the network, and thus of whether network analysis might make sense for the fraud detection task at hand. Mathematically, a network is homophilic if fraudulent nodes are significantly more connected to other fraudulent nodes and, as a consequence, legitimate nodes connect significantly more to other legitimate nodes. More concretely, let $l$ be the fraction of legitimate nodes in the network and $f$ the fraction of fraudulent nodes; then $2lf$ is the expected probability that an edge connects two dissimilarly labeled nodes. These edges are called cross-labeled edges. A network is homophilic if the observed fraction of cross-labeled edges $\hat{r}$ is significantly less than the expected probability $2lf$, that is, if the null hypothesis

$$H_0: \hat{r} \geq 2lf$$

can be rejected. Consider Figure 5.12. The black (white) nodes are the fraudsters (legitimate people). The network consists of 12 nodes in total: 8 legitimate and 4 fraudulent. The fractions $l$ and $f$ thus equal $\frac{8}{12}$ and $\frac{4}{12}$, respectively. In a random network, we would expect a fraction $2lf = 2 \cdot \frac{8}{12} \cdot \frac{4}{12} = \frac{8}{18}$ of the edges to be cross-labeled. The network in Figure 5.12 has 5 cross-labeled edges, 3 fraudulent same-labeled edges, and 10 legitimate same-labeled edges. The observed fraction of cross-labeled edges is thus $\hat{r} = \frac{5}{18}$: we expect to see 8 cross-labeled edges in the network, instead of the 5 edges we observe. The null hypothesis $H_0$ is rejected at a significance level of $\alpha = 0.05$ (p-value of 0.02967) using a one-tailed proportion test with a normal approximation. The network is homophilic.

Figure 5.12 A Homophilic Network
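To make the test concrete, the following minimal Python sketch (assuming scipy is available) runs a one-tailed proportion z-test on the Figure 5.12 numbers. Note that the exact p-value depends on the test variant used (e.g., whether a continuity correction or exact binomial test is applied), so it may differ from the value quoted above:

    from math import sqrt
    from scipy.stats import norm

    n_legit, n_fraud = 8, 4
    n_edges = 18                          # 5 cross-labeled + 3 fraud + 10 legit edges
    l = n_legit / (n_legit + n_fraud)
    f = n_fraud / (n_legit + n_fraud)
    p0 = 2 * l * f                        # expected fraction of cross-labeled edges (8/18)
    r_hat = 5 / n_edges                   # observed fraction of cross-labeled edges

    # One-tailed proportion test with a normal approximation:
    # H0: r >= 2lf against H1: r < 2lf
    z = (r_hat - p0) / sqrt(p0 * (1 - p0) / n_edges)
    p_value = norm.cdf(z)
    print(z, p_value)                     # evidence of homophily if p_value < 0.05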
Other measures to assess whether there are significant patterns of homophily present in the network include dyadicity and heterophilicity (Park and Barabási 2007). In many systems, the number of links between nodes sharing a common property is larger than if the characteristics were distributed randomly in the network. This is the dyadic effect. For a network where the labels can only take two values, 1 (Fraud) and 0 (Legitimate), let $n_1$ ($n_0$) be the number of fraudulent (legitimate) nodes and $N = n_0 + n_1$. We can then define three types of dyads, (1-1), (1-0), and (0-0), indicating the labels (Fraud-Fraud), (Fraud-Legitimate), and (Legitimate-Legitimate) of the two end points connected by a link. The total numbers of dyads of each kind are denoted $m_{11}$, $m_{10}$, and $m_{00}$, respectively, with $M = m_{11} + m_{10} + m_{00}$. If nodes are randomly connected to other nodes regardless of their labels, then the expected values $\bar{m}_{11}$ and $\bar{m}_{10}$ equal:

$$\bar{m}_{11} = \binom{n_1}{2} p = \frac{n_1(n_1 - 1)p}{2}$$

$$\bar{m}_{10} = \binom{n_1}{1} \binom{n_0}{1} p = n_1(N - n_1)p$$

with $p = 2M/(N(N-1))$ the connectance, representing the probability that two nodes are connected. If $p = 1$, all nodes in the network are connected to each other. Dyadicity and heterophilicity can then be defined as:

$$D = \frac{m_{11}}{\bar{m}_{11}}, \qquad H = \frac{m_{10}}{\bar{m}_{10}}$$

A network is dyadic if $D > 1$, indicating that fraudulent nodes tend to connect more densely among themselves than expected for a random configuration. A network is heterophobic (the opposite of heterophilic) if $H < 1$, meaning that fraudulent nodes have fewer connections to legitimate nodes than expected randomly (Park and Barabási 2007). The network represented in Figure 5.12 is dyadic and heterophobic. If a network is dyadic and heterophobic, it exhibits homophily; a network that is anti-dyadic and heterophilic is inverse-homophilic.
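The following Python sketch reproduces the dyadicity and heterophilicity computation for the Figure 5.12 network:

    # Dyadicity D and heterophilicity H for the network of Figure 5.12:
    # 12 nodes (4 fraudulent, 8 legitimate) and 18 edges, of which
    # m11 = 3 are fraud-fraud and m10 = 5 are cross-labeled.
    n1, n0 = 4, 8
    N = n1 + n0
    m11, m10, m00 = 3, 5, 10
    M = m11 + m10 + m00

    p = 2 * M / (N * (N - 1))            # connectance
    m11_exp = n1 * (n1 - 1) * p / 2      # expected number of fraud-fraud edges
    m10_exp = n1 * (N - n1) * p          # expected number of cross-labeled edges

    D = m11 / m11_exp                    # ~1.83 > 1: the network is dyadic
    H = m10 / m10_exp                    # ~0.57 < 1: the network is heterophobic
    print(round(D, 2), round(H, 2))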
A network that exhibits evidence of homophily is worthwhile to investigate more thoroughly. For each instance of interest, we extract features that characterize the instance based on its relational structure.

IMPACT OF THE NEIGHBORHOOD: METRICS

In this section, we discuss the main metrics to measure the impact of the social environment on the nodes of interest. In general, we distinguish between three types of analysis techniques:

◾ Neighborhood metrics

◾ Centrality metrics

◾ Collective inference algorithms

Neighborhood metrics characterize the target of interest based on its direct associates. The n-order neighborhood around a node consists of the nodes that are n hops away from that node. Due to scalability issues, many detection models integrate features derived from the egonet or first-order neighborhood (see the section on Network Components), that is, the node and its immediate contacts. Neighborhood metrics include degree, triangles, density, relational neighbor, and probabilistic relational neighbor.

Centrality metrics quantify the importance of an individual in a social network (Boccaletti et al. 2006). Centrality metrics are typically extracted from the whole network structure, or from a subgraph. We discuss geodesic paths, betweenness, closeness, and the graph theoretic center.

Given a network with known fraudulent nodes, how can we use this knowledge to infer a primary fraud probability for all the unlabeled nodes (i.e., the currently legitimate nodes)? As opposed to neighborhood and centrality metrics, collective inference algorithms compute the probability that a node is exposed to fraud, and thus the probability


that fraud influences a certain node. Collective inference algorithms
are designed such that the whole network is simultaneously updated.
As a result, long-distance propagation is possible (Hill, Provost, and
Volinsky 2007). We consider PageRank, and briefly explain Gibbs
sampling, iterative classification, relaxation labeling, and loopy belief
propagation.

Neighborhood Metrics

Neighborhood metrics are derived from a node's n-hop neighborhood. We discuss degree, triangles, density, relational neighbor, and probabilistic relational neighbor. For a graph of N nodes and M edges, Table 5.2 summarizes each of these metrics. Figure 5.13 shows a toy example, which is used to illustrate how each metric is derived from the network. In this section, we use the egonet, that is, the one-hop neighborhood of the target node, to extract the features.
Table 5.2 Overview of Neighborhood Metrics

Degree: Number of connections of a node (in- versus out-degree if the connections are directed).

Triangles: Number of fully connected subgraphs consisting of three nodes.

Density: The extent to which the nodes are connected to each other: $d = \frac{2M}{N(N-1)}$

Relational neighbor: Relative number of neighbors that belong to class $c$ (e.g., class fraud): $P(c|n) = \frac{1}{Z} \sum_{n_j \in \text{Neighborhood}_n \mid \text{class}(n_j) = c} w(n, n_j)$

Probabilistic relational neighbor: Probability of belonging to class $c$ given the posterior class probabilities of the neighbors: $P(c|n) = \frac{1}{Z} \sum_{n_j \in \text{Neighborhood}_n} w(n, n_j) P(c|n_j)$

Figure 5.13 Sample Network

Degree

The degree of a node summarizes how many neighbors the node has. In fraud, it is often useful to distinguish between the number of fraudulent and legitimate neighbors. This is the fraudulent and

legitimate degree. In the toy example of Figure 5.13, node A has a degree of 6, which is the highest degree in the network. Recall that we derive the degree from the egonet of each node. Node G has a fraudulent degree of 3 and is thus highly influenced by fraud. The degree of each node is given in Table 5.3.

In the case of a directed network, we can distinguish between the in-degree and the out-degree. The in-degree specifies how many nodes point toward the target of interest. The out-degree describes the number of nodes that can be reached from the target of interest.
Table 5.3 Summary of the Total, Fraudulent, and Legitimate Degree

NODE          A  B  C  D  E  F  G  H  I  J
Total Degree  6  3  1  4  2  3  4  2  4  1
Fraud Degree  1  1  0  2  2  2  3  0  0  0
Legit Degree  5  2  1  2  0  1  1  2  4  1

NODE          K  L  M  N  O  P  Q  R  S  T
Total Degree  3  2  2  3  3  1  3  1  1  1
Fraud Degree  1  0  0  1  0  0  0  0  0  0
Legit Degree  2  2  2  2  3  1  3  1  1  1
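As an illustration, here is a minimal Python sketch (with a hypothetical edge list and label map) that derives the total, fraudulent, and legitimate degree of each node:

    from collections import Counter

    # Total, fraudulent, and legitimate degree from an edge list and a
    # label map (1 = fraud, 0 = legitimate); hypothetical toy data.
    edges = [("A", "B"), ("A", "G"), ("B", "E"), ("D", "E")]
    labels = {"A": 0, "B": 0, "D": 0, "E": 1, "G": 1}

    total, fraud = Counter(), Counter()
    for u, v in edges:
        for node, neighbor in ((u, v), (v, u)):
            total[node] += 1
            fraud[node] += labels[neighbor]

    legit = {node: total[node] - fraud[node] for node in total}
    print(total["A"], fraud["A"], legit["A"])   # 2 1 1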


The degree distribution of a network describes the probability distribution of the degree in the network. The degree distribution of real-life networks generally follows a power law: many nodes are connected to only a few other nodes, while only a few nodes in the network link to many other nodes. Figure 5.14a illustrates the degree distribution of the network in Figure 5.13. Figure 5.14b gives an example of the degree distribution (log-log scale) of a real-life fraud network of a social security institution (Van Vlasselaer et al. 2015a).

Networks with a constant degree distribution are k-regular graphs: each node in the network has the same degree. An example of a 4-regular graph is given in Figure 5.15.

Figure 5.14a Degree Distribution

Figure 5.14b Illustration of the Degree Distribution for a Real-Life Network of Social Security Fraud. The Degree Distribution Follows a Power Law (log-log axes)


Figure 5.15 A 4-regular Graph

Triangles

A triangle in a network is a subgraph that consists of three nodes that are all connected to each other. Triangles capture the influential effect of closely connected groups of individuals: nodes that are part of such a group are to a larger extent affected by the beliefs, interests, opinions, and so on of that group. In an egonet, a triangle includes the ego and two alters. If the two alters are both fraudulent (legitimate), we say that the triangle is fraudulent (legitimate). If only one of the two alters is fraudulent, the triangle is semi-fraudulent. In the network of Figure 5.13, there are four triangles: A-G-I, L-M-N, D-F-G, and D-E-F. The egonet analysis with regard to triangles is given in Table 5.4. The analysis of triangles can easily be extended to quadrangles, pentagons, hexagons, and so on.


Table 5.4 Summary of the Number of Legitimate, Fraudulent, and Semi-Fraudulent Triangles

NODE        A  B  C  D  E  F  G  H  I  J
Total       1  0  0  2  1  2  2  0  1  0
Fraud       0  0  0  1  1  1  1  0  0  0
Legit       0  0  0  0  0  0  0  0  1  0
Semi-Fraud  1  0  0  1  0  1  1  0  0  0

NODE        K  L  M  N  O  P  Q  R  S  T
Total       0  1  1  1  0  0  0  0  0  0
Fraud       0  0  0  0  0  0  0  0  0  0
Legit       0  1  1  1  0  0  0  0  0  0
Semi-Fraud  0  0  0  0  0  0  0  0  0  0
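A minimal Python sketch (hypothetical toy data) that classifies the triangles in one egonet along these lines:

    from itertools import combinations

    # Count fraudulent, legitimate, and semi-fraudulent triangles in the
    # egonet of ego "A" (hypothetical toy data; 1 = fraud, 0 = legitimate).
    labels = {"G": 1, "I": 0, "D": 0}
    alters = {"G", "I", "D"}
    alter_edges = {frozenset(("G", "I"))}      # edges among the alters

    counts = {"fraud": 0, "legit": 0, "semi-fraud": 0}
    for u, v in combinations(alters, 2):
        if frozenset((u, v)) in alter_edges:   # ego + two connected alters
            kind = {2: "fraud", 1: "semi-fraud", 0: "legit"}[labels[u] + labels[v]]
            counts[kind] += 1
    print(counts)   # {'fraud': 0, 'legit': 0, 'semi-fraud': 1}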

Density

Another neighborhood metric is the network's density. The density measures the extent to which the nodes in a network are connected to each other. A fully connected network of $N$ nodes has $\binom{N}{2} = \frac{N(N-1)}{2}$ edges; that is, each node is connected to every other node in the network. The density relates the number of observed edges to the maximum possible number of edges in the graph:

$$d = \frac{M}{\binom{N}{2}} = \frac{2M}{N(N-1)}$$

with $M$ and $N$ the number of edges and nodes in the network, respectively. Remark that a triangle is a subgraph with density 1; an egonet with degree 1 also has a density of 1. The density captures how closely connected the group is. A high density might correspond to an intensive information flow between the instances, which might indicate that the nodes extensively influence each other. This may be important in the analysis of fraud: the combination of a high fraudulent degree and a high density can drastically increase a node's probability to commit fraud in the future. The density for each node in the toy example (see Figure 5.13) is given in Table 5.5.


Table 5.5 Summary of the Density

NODE     A     B    C  D    E  F     G    H     I    J
Density  0.33  0.5  1  0.6  1  0.83  0.6  0.66  0.5  1

NODE     K    L  M  N     O    P  Q    R  S  T
Density  0.5  1  1  0.66  0.5  1  0.5  1  1  1
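For egonets, the density follows directly from the number of alters and the number of edges among them, as in this small Python sketch:

    # Egonet density: an ego with k alters gives N = k + 1 nodes, and M
    # counts the k ego-alter edges plus the edges among the alters.
    def egonet_density(k_alters, edges_among_alters):
        N = k_alters + 1
        M = k_alters + edges_among_alters
        return 2 * M / (N * (N - 1))

    # Node A of Figure 5.13: 6 alters, 1 edge among them (triangle A-G-I)
    print(round(egonet_density(6, 1), 2))   # 0.33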

Relational Neighbor Classifier

The relational neighbor classifier makes use of the homophily assumption, which states that connected nodes have a propensity to belong to the same class. This idea is also referred to as guilt-by-association: if two nodes are associated, they tend to exhibit similar behavior. The posterior class probability for node $n$ to belong to class $c$ is calculated as follows:

$$P(c|n) = \frac{1}{Z} \sum_{n_j \in \text{Neighborhood}_n \mid \text{class}(n_j) = c} w(n, n_j)$$

whereby $\text{Neighborhood}_n$ represents the neighborhood of node $n$, $w(n, n_j)$ the weight of the connection between $n$ and $n_j$, and $Z$ a normalization factor that makes sure all probabilities sum to one. Consider, for example, the sample network depicted in Figure 5.16, whereby F and NF represent fraudulent and non-fraudulent nodes, respectively.

Figure 5.16 Example Social Network for a Relational Neighbor Classifier


The calculations then become:

$$P(F|?) = \frac{1}{Z}(1 + 1)$$

$$P(NF|?) = \frac{1}{Z}(1 + 1 + 1)$$

Since both probabilities have to sum to 1, $Z$ equals 5, so the probabilities become:

$$P(F|?) = \frac{2}{5}, \qquad P(NF|?) = \frac{3}{5}$$

The relational neighbor probabilities for the network in Figure 5.13 are depicted in Table 5.6. Remark that these probabilities can be added as an additional feature to local models, that is, detection models that do not use network variables. Another possibility is to use the relational neighbor as a classifier. The final probabilities are then used to distinguish between fraud and non-fraud: a high value indicates a high likelihood that the node will be influenced to commit fraud as well, and a cut-off value separates legitimate and fraudulent nodes.
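A minimal Python sketch of the relational neighbor computation (unit edge weights, matching the Figure 5.16 example):

    # Relational neighbor classifier for a single node: class scores are
    # the summed edge weights per neighbor class, normalized by Z.
    def relational_neighbor(neighbor_labels, weights=None):
        weights = weights or [1.0] * len(neighbor_labels)
        z = sum(weights)
        p_f = sum(w for lab, w in zip(neighbor_labels, weights) if lab == "F") / z
        return {"F": p_f, "NF": 1 - p_f}

    # The unknown node of Figure 5.16: two fraudulent and three
    # legitimate neighbors, all with unit edge weights.
    print(relational_neighbor(["F", "F", "NF", "NF", "NF"]))   # {'F': 0.4, 'NF': 0.6}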
Probabilistic Relational Neighbor Classifier

The probabilistic relational neighbor classifier is a straightforward extension of the relational neighbor classifier, whereby the posterior class probability for node $n$ to belong to class $c$ is calculated as follows:

$$P(c|n) = \frac{1}{Z} \sum_{n_j \in \text{Neighborhood}_n} w(n, n_j)\, P(c|n_j)$$

Table 5.6 Summary of Relational Neighbor Probabilities

NODE     A     B     C  D     E  F     G     H  I  J
P(NF|?)  0.80  0.66  1  0.50  0  0.33  0.25  1  1  1
P(F|?)   0.20  0.33  0  0.50  1  0.66  0.75  0  0  0

NODE     K     L  M  N     O  P  Q  R  S  T
P(NF|?)  0.66  1  1  0.66  1  1  1  1  1  1
P(F|?)   0.33  0  0  0.33  0  0  0  0  0  0


Figure 5.17 Example Social Network for a Probabilistic Relational Neighbor Classifier

Note that the summation now ranges over the entire neighborhood of nodes. The probabilities $P(c|n_j)$ can be the result of a local model, or of a previously applied network model. Consider the network of Figure 5.17. The calculations then become:

$$P(F|?) = \frac{1}{Z}(0.25 + 0.20 + 0.80 + 0.10 + 0.90)$$

$$P(NF|?) = \frac{1}{Z}(0.75 + 0.80 + 0.20 + 0.90 + 0.10)$$

Since both probabilities have to sum to 1, $Z$ equals 5, so the probabilities become:

$$P(F|?) = \frac{2.25}{5} = 0.45, \qquad P(NF|?) = \frac{2.75}{5} = 0.55$$

Analogously to the relational neighbor, the probabilities returned
by the probabilistic relational neighbor can be used as an additional
variable in local models, or as a classifier.
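A minimal Python sketch of the probabilistic variant, reproducing the Figure 5.17 numbers with unit edge weights:

    # Probabilistic relational neighbor: aggregate the neighbors'
    # posterior fraud probabilities (Figure 5.17, unit edge weights).
    def prob_relational_neighbor(p_fraud_neighbors, weights=None):
        weights = weights or [1.0] * len(p_fraud_neighbors)
        z = sum(weights)
        p_f = sum(w * p for w, p in zip(weights, p_fraud_neighbors)) / z
        return {"F": p_f, "NF": 1 - p_f}

    print(prob_relational_neighbor([0.25, 0.20, 0.80, 0.10, 0.90]))
    # F ≈ 0.45, NF ≈ 0.55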


Relational Logistic Regression Classifier

Relational logistic regression was introduced by Lu and Getoor (2003). It basically starts from a data set with local, node-specific characteristics. These node-specific characteristics are the so-called intrinsic variables. Examples of intrinsic features in a social security fraud context include a company's age, sector, legal form, and so on. The data set is enriched with network characteristics, as follows:

◾ Most frequently occurring class of neighbor (mode-link)

◾ Frequency of the classes of the neighbors (count-link)

◾ Binary indicators indicating class presence (binary-link)

The network characteristics are illustrated in Figure 5.18. Remark that the count equals the degree. A logistic regression model is then estimated using the data set, which contains both intrinsic and network features. Note that there is some correlation between the network characteristics added, which should be filtered out during an input selection procedure (e.g., using stepwise logistic regression). This idea is also referred to as featurization, since the network characteristics are basically added as special features to the data set. These features can measure the behavior of the neighbors in terms of the target variable (e.g., fraud or not) or in terms of the intrinsic characteristics (e.g., age, sector, profit).
Figure 5.18 Example of Social Network Features for a Relational Logistic Regression Classifier

Figure 5.19 Example of Featurization with Features Describing Intrinsic Behavior and Behavior of the Neighborhood

Table 5.7 Summary of Relational Features by Lu and Getoor (2003)

NODE              A   B   C   D     E  F  G  H   I   J
Mode link         NF  NF  NF  NF/F  F  F  F  NF  NF  NF
Count fraud       1   1   0   2     2  2  3  0   0   0
Count non-fraud   5   2   1   2     0  1  1  2   4   1
Binary fraud      1   1   0   1     1  1  1  0   0   0
Binary non-fraud  1   1   1   1     0  1  1  1   1   1

NODE              K   L   M   N   O   P   Q   R   S   T
Mode link         NF  NF  NF  NF  NF  NF  NF  NF  NF  NF
Count fraud       1   0   0   1   0   0   0   0   0   0
Count non-fraud   2   2   2   2   3   1   3   1   1   1
Binary fraud      1   0   0   1   0   0   0   0   0   0
Binary non-fraud  1   1   1   1   1   1   1   1   1   1

Figure 5.19 provides an example of social security fraud, aiming to detect fraudulent companies. Both intrinsic and network features are added to describe target behavior (i.e., fraud). Table 5.7 summarizes the features extracted from the network using the approach suggested by Lu and Getoor (2003) for the toy example of Figure 5.13.
The set of network features suggested by the relational logistic regression classifier can be extended in a straightforward manner with the other metrics discussed in this and the next sections (Neighborhood Metrics, Centrality Metrics, and Collective Inference Algorithms).
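A minimal Python sketch (a hypothetical helper, not the authors' implementation) that derives the mode-link, count-link, and binary-link features from a node's neighbor labels:

    # Lu and Getoor (2003)-style link features (mode-link, count-link,
    # binary-link) from a node's neighbor labels.
    def link_features(neighbor_labels):
        n_f = sum(1 for lab in neighbor_labels if lab == "F")
        n_nf = len(neighbor_labels) - n_f
        mode = "NF/F" if n_f == n_nf else ("F" if n_f > n_nf else "NF")
        return {"mode_link": mode,
                "count_fraud": n_f, "count_non_fraud": n_nf,
                "binary_fraud": int(n_f > 0), "binary_non_fraud": int(n_nf > 0)}

    # Node D of Figure 5.13: two fraudulent and two legitimate neighbors
    print(link_features(["F", "F", "NF", "NF"]))
    # {'mode_link': 'NF/F', 'count_fraud': 2, 'count_non_fraud': 2,
    #  'binary_fraud': 1, 'binary_non_fraud': 1}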

Centrality Metrics

Centrality metrics are useful in fraud detection to prevent the expansion of future fraudulent activities. They aim to find the central nodes, that is, nodes that might impact many other nodes. These metrics include the geodesic path, degree (see the section on Neighborhood Metrics), closeness, betweenness, and the graph theoretic center. Assume a network with $n$ nodes $v_i$, $i = 1, \ldots, n$. A geodesic represents the shortest path between two nodes. $g_{jk}$ represents the number of geodesics from node $j$ to node $k$, whereas $g_{jk}(v_i)$ represents the number of geodesics from node $j$ to node $k$ passing through node $v_i$. The centrality measures are summarized in Table 5.8; the formulas each calculate the metric for node $v_i$. A toy example is given in Figure 5.13.


Table 5.8 Centrality Metrics

Geodesic path: Shortest path between two nodes in the network: $d(v_i, v_j)$.

Degree: Number of connections of a node (in- versus out-degree if the connections are directed). See the previous section.

Closeness: The average distance of a node to all other nodes in the network (reciprocal of farness): $\left[ \frac{\sum_{j=1, j \neq i}^{n} d(v_i, v_j)}{n-1} \right]^{-1}$

Betweenness: Counts the number of times a node or connection lies on the shortest path between any two nodes in the network: $\sum_{j<k} \frac{g_{jk}(v_i)}{g_{jk}}$

Graph theoretic center: The node with the smallest maximum distance to all other nodes in the network.
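As a sketch of how these metrics can be computed in practice, the networkx library provides built-in functions for each of them (the toy edges below are hypothetical):

    import networkx as nx

    # Hypothetical toy graph
    G = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("B", "D")])

    geodesics = nx.shortest_path_length(G, "A")    # geodesic distances from A
    closeness = nx.closeness_centrality(G)         # reciprocal of average farness
    betweenness = nx.betweenness_centrality(G)     # fraction of geodesics through a node
    center = nx.center(G)                          # node(s) minimizing the max distance
    print(dict(geodesics), center)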
