Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection (Wiley & SAS Business Series), by Bart Baesens, Véronique Van Vlasselaer, and Wouter Verbeke

Cover
Title Page
Copyright
Contents
List of Figures
Foreword
Preface
Acknowledgments
Chapter 1 Fraud: Detection, Prevention, and Analytics!
Chapter 2 Data Collection, Sampling, and Preprocessing
Chapter 3 Descriptive Analytics for Fraud Detection
Chapter 4 Predictive Analytics for Fraud Detection
Chapter 5 Social Network Analysis for Fraud Detection
Chapter 6 Fraud Analytics: Post-Processing
Chapter 7 Fraud Analytics: A Broader Perspective
About the Authors
Index
EULA

Fraud Analytics Using

Descriptive, Predictive,

and Social Network

Techniques

Wiley & SAS Business

Series

The Wiley & SAS Business Series presents books that help senior-level

managers with their critical management decisions.

Titles in the Wiley & SAS Business Series include:

Analytics in a Big Data World: The Essential Guide to Data Science and Its

Applications by Bart Baesens

Bank Fraud: Using Technology to Combat Losses by Revathi

Subramanian

Big Data Analytics: Turning Big Data into Big Money by Frank Ohlhorst

Big Data, Big Innovation: Enabling Competitive Differentiation through

Business Analytics by Evan Stubbs

Business Analytics for Customer Intelligence by Gert Laursen

Business Intelligence Applied: Implementing an Effective Information and

Communications Technology Infrastructure by Michael Gendron

Business Intelligence and the Cloud: Strategic Implementation Guide by

Michael S. Gendron

Business Transformation: A Roadmap for Maximizing Organizational

Insights by Aiman Zeid

Connecting Organizational Silos: Taking Knowledge Flow Management to

the Next Level with Social Media by Frank Leistner

Data-Driven Healthcare: How Analytics and BI Are Transforming the

Industry by Laura Madsen

Delivering Business Analytics: Practical Guidelines for Best Practice by

Evan Stubbs

Demand-Driven Forecasting: A Structured Approach to Forecasting,

second edition by Charles Chase

Demand-Driven Inventory Optimization and Replenishment: Creating a

More Efﬁcient Supply Chain by Robert A. Davis

Developing Human Capital: Using Analytics to Plan and Optimize Your

Learning and Development Investments by Gene Pease, Barbara

Beresford, and Lew Walker

The Executive’s Guide to Enterprise Social Media Strategy: How Social

Networks Are Radically Transforming Your Business by David Thomas and

Mike Barlow

Economic and Business Forecasting: Analyzing and Interpreting

Econometric Results by John Silvia, Azhar Iqbal, Kaylyn Swankoski, Sarah

Watt, and Sam Bullard

Financial Institution Advantage and The Optimization of Information

Processing by Sean C. Keenan

Foreign Currency Financial Reporting from Euros to Yen to Yuan: A Guide

to Fundamental Concepts and Practical Applications by Robert Rowan

Harness Oil and Gas Big Data with Analytics: Optimize Exploration and

Production with Data Driven Models by Keith Holdaway

Health Analytics: Gaining the Insights to Transform Health Care by Jason

Burke

Heuristics in Analytics: A Practical Perspective of What Influences Our

Analytical World by Carlos Andre Reis Pinheiro and Fiona McNeill

Human Capital Analytics: How to Harness the Potential of Your

Organization’s Greatest Asset by Gene Pease, Boyce Byerly, and Jac Fitz-enz

Implement, Improve and Expand Your Statewide Longitudinal Data

System: Creating a Culture of Data in Education by Jamie McQuiggan and

Armistead Sapp

Killer Analytics: Top 20 Metrics Missing from Your Balance Sheet by Mark

Brown

Predictive Analytics for Human Resources by Jac Fitz-enz and John

Mattox II

Predictive Business Analytics: Forward-Looking Capabilities to Improve

Business Performance by Lawrence Maisel and Gary Cokins

Retail Analytics: The Secret Weapon by Emmett Cox

Social Network Analysis in Telecommunications by Carlos Andre Reis

Pinheiro

Statistical Thinking: Improving Business Performance, second edition by

Roger W. Hoerl and Ronald D. Snee

Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data

Streams with Advanced Analytics by Bill Franks

Too Big to Ignore: The Business Case for Big Data by Phil Simon

The Value of Business Analytics: Identifying the Path to Proﬁtability by

Evan Stubbs

The Visual Organization: Data Visualization, Big Data, and the Quest for

Better Decisions by Phil Simon

Understanding the Predictive Analytics Lifecycle by Al Cordoba

Unleashing Your Inner Leader: An Executive Coach Tells All by Vickie

Bevenour

Using Big Data Analytics: Turning Big Data into Big Money by Jared

Dean

Win with Advanced Business Analytics: Creating Business Value from Your

Data by Jean Paul Isson and Jesse Harriott

For more information on these and other titles in the series, please

visit www.wiley.com.

Fraud Analytics

Using Descriptive,

Predictive, and

Social Network

Techniques

A Guide to Data Science

for Fraud Detection

Bart Baesens

Véronique Van Vlasselaer

Wouter Verbeke

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or

transmitted in any form or by any means, electronic, mechanical, photocopying,

recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the

1976 United States Copyright Act, without either the prior written permission of the

Publisher, or authorization through payment of the appropriate per-copy fee to the

Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978)

750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the

Publisher for permission should be addressed to the Permissions Department, John

Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201)

748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used

their best efforts in preparing this book, they make no representations or warranties

with respect to the accuracy or completeness of the contents of this book and

speciﬁcally disclaim any implied warranties of merchantability or ﬁtness for a particular

purpose. No warranty may be created or extended by sales representatives or written

sales materials. The advice and strategies contained herein may not be suitable for your

situation. You should consult with a professional where appropriate. Neither the

publisher nor author shall be liable for any loss of proﬁt or any other commercial

damages, including but not limited to special, incidental, consequential, or other

damages.

For general information on our other products and services or for technical support,

please contact our Customer Care Department within the United States at (800)

762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand.

Some material included with standard print versions of this book may not be included

in e-books or in print-on-demand. If this book refers to media such as a CD or DVD

that is not included in the version you purchased, you may download this material at

http://booksupport.wiley.com. For more information about Wiley products, visit

www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Baesens, Bart.

Fraud analytics using descriptive, predictive, and social network techniques : a guide

to data science for fraud detection / Bart Baesens, Veronique Van Vlasselaer, Wouter

Verbeke.

pages cm. — (Wiley & SAS business series)

Includes bibliographical references and index.

ISBN 978-1-119-13312-4 (cloth) — ISBN 978-1-119-14682-7 (epdf) —

ISBN 978-1-119-14683-4 (epub)

1. Fraud—Statistical methods. 2. Fraud—Prevention. 3. Commercial

crimes—Prevention. I. Title.

HV6691.B34 2015

364.16′3015195—dc23

2015017861

Cover Design: Wiley

Cover Image: ©iStock.com/aleksandarvelasevic

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

To my wonderful wife, Katrien, and kids, Ann-Sophie, Victor,

and Hannelore.

To my parents and parents-in-law.

To my husband and soul mate, Niels, for his never-ending

support.

To my parents, parents-in-law, and siblings-in-law.

To Luit and Titus.

Contents

List of Figures xv

Foreword xxiii

Preface xxv

Acknowledgments xxix

Chapter 1 Fraud: Detection, Prevention, and Analytics! 1

Introduction 2

Fraud! 2

Fraud Detection and Prevention 10

Big Data for Fraud Detection 15

Data-Driven Fraud Detection 17

Fraud-Detection Techniques 19

Fraud Cycle 22

The Fraud Analytics Process Model 26

Fraud Data Scientists 30

A Fraud Data Scientist Should Have Solid Quantitative

Skills 30

A Fraud Data Scientist Should Be a Good Programmer 31

A Fraud Data Scientist Should Excel in

Communication and Visualization Skills 31

A Fraud Data Scientist Should Have a Solid Business

Understanding 32

A Fraud Data Scientist Should Be Creative 32

A Scientiﬁc Perspective on Fraud 33

References 35

Chapter 2 Data Collection, Sampling, and Preprocessing 37

Introduction 38

Types of Data Sources 38

Merging Data Sources 43

Sampling 45

Types of Data Elements 46

Visual Data Exploration and Exploratory Statistical

Analysis 47

Benford’s Law 48

Descriptive Statistics 51

Missing Values 52

Outlier Detection and Treatment 53

Red Flags 57

Standardizing Data 59

Categorization 60

Weights of Evidence Coding 63

Variable Selection 65

Principal Components Analysis 68

RIDITs 72

PRIDIT Analysis 73

Segmentation 74

References 75

Chapter 3 Descriptive Analytics for Fraud Detection 77

Introduction 78

Graphical Outlier Detection Procedures 79

Statistical Outlier Detection Procedures 83

Break-Point Analysis 84

Peer-Group Analysis 85

Association Rule Analysis 87

Clustering 89

Introduction 89

Distance Metrics 90

Hierarchical Clustering 94

Example of Hierarchical Clustering Procedures 97

k-Means Clustering 104

Self-Organizing Maps 109

Clustering with Constraints 111

Evaluating and Interpreting Clustering Solutions 114

One-Class SVMs 117

References 118

Chapter 4 Predictive Analytics for Fraud Detection 121

Introduction 122

Target Deﬁnition 123

Linear Regression 125

Logistic Regression 127

Basic Concepts 127

Logistic Regression Properties 129

Building a Logistic Regression Scorecard 131

Variable Selection for Linear and Logistic Regression 133

Decision Trees 136

Basic Concepts 136

Splitting Decision 137

Stopping Decision 140

Decision Tree Properties 141

Regression Trees 142

Using Decision Trees in Fraud Analytics 143

Neural Networks 144

Basic Concepts 144

Weight Learning 147

Opening the Neural Network Black Box 150

Support Vector Machines 155

Linear Programming 155

The Linear Separable Case 156

The Linear Nonseparable Case 159

The Nonlinear SVM Classiﬁer 160

SVMs for Regression 161

Opening the SVM Black Box 163

Ensemble Methods 164

Bagging 164

Boosting 165

Random Forests 166

Evaluating Ensemble Methods 167

Multiclass Classiﬁcation Techniques 168

Multiclass Logistic Regression 168

Multiclass Decision Trees 170

Multiclass Neural Networks 170

Multiclass Support Vector Machines 171

Evaluating Predictive Models 172

Splitting Up the Data Set 172

Performance Measures for Classiﬁcation Models 176

Performance Measures for Regression Models 185

Other Performance Measures for Predictive Analytical

Models 188

Developing Predictive Models for Skewed Data Sets 189

Varying the Sample Window 190

Undersampling and Oversampling 190

Synthetic Minority Oversampling Technique (SMOTE) 192

Likelihood Approach 194

Adjusting Posterior Probabilities 197

Cost-sensitive Learning 198

Fraud Performance Benchmarks 200

References 201

Chapter 5 Social Network Analysis for Fraud Detection 207

Networks: Form, Components, Characteristics, and Their

Applications 209

Social Networks 211

Network Components 214

Network Representation 219

Is Fraud a Social Phenomenon? An Introduction to

Homophily 222

Impact of the Neighborhood: Metrics 227

Neighborhood Metrics 228

Centrality Metrics 238

Collective Inference Algorithms 246

Featurization: Summary Overview 254

Community Mining: Finding Groups of Fraudsters 254

Extending the Graph: Toward a Bipartite Representation 266

Multipartite Graphs 269

Case Study: Gotcha! 270

References 277

Chapter 6 Fraud Analytics: Post-Processing 279

Introduction 280

The Analytical Fraud Model Life Cycle 280

Model Representation 281

Trafﬁc Light Indicator Approach 282

Decision Tables 283

Selecting the Sample to Investigate 286

Fraud Alert and Case Management 290

Visual Analytics 296

Backtesting Analytical Fraud Models 302

Introduction 302

Backtesting Data Stability 302

Backtesting Model Stability 305

Backtesting Model Calibration 308

Model Design and Documentation 311

References 312

Chapter 7 Fraud Analytics: A Broader Perspective 313

Introduction 314

Data Quality 314

Data-Quality Issues 314

Data-Quality Programs and Management 315

Privacy 317

The RACI Matrix 318

Accessing Internal Data 319

Label-Based Access Control (LBAC) 324

Accessing External Data 325

Capital Calculation for Fraud Loss 326

Expected and Unexpected Losses 327

Aggregate Loss Distribution 329

Capital Calculation for Fraud Loss Using Monte Carlo

Simulation 331

An Economic Perspective on Fraud Analytics 334

Total Cost of Ownership 334

Return on Investment 335

In Versus Outsourcing 337

Modeling Extensions 338

Forecasting 338

Text Analytics 340

The Internet of Things 342

Corporate Fraud Governance 344

References 346

About the Authors 347

Index 349

List of Figures

Figure 1.1 Fraud Triangle 7

Figure 1.2 Fire Incident Claim-Handling Process 13

Figure 1.3 The Fraud Cycle 23

Figure 1.4 Outlier Detection at the Data Item Level 25

Figure 1.5 Outlier Detection at the Data Set Level 25

Figure 1.6 The Fraud Analytics Process Model 26

Figure 1.7 Proﬁle of a Fraud Data Scientist 33

Figure 1.8 Screenshot of Web of Science Statistics for

Scientiﬁc Publications on Fraud between 1996

and 2014 34

Figure 2.1 Aggregating Normalized Data Tables into a

Non-Normalized Data Table 44

Figure 2.2 Pie Charts for Exploratory Data Analysis 49

Figure 2.3 Benford’s Law Describing the Frequency

Distribution of the First Digit 50

Figure 2.4 Multivariate Outliers 54

Figure 2.5 Histogram for Outlier Detection 54

Figure 2.6 Box Plots for Outlier Detection 55

Figure 2.7 Using the z-Scores for Truncation 57

Figure 2.8 Default Risk Versus Age 60

Figure 2.9 Illustration of Principal Component Analysis in a

Two-Dimensional Data Set 68

Figure 3.1 3D Scatter Plot for Detecting Outliers 80

Figure 3.2 OLAP Cube for Fraud Detection 80

Figure 3.3 Example Pivot Table for Credit Card Fraud

Detection 82

Figure 3.4 Break-Point Analysis 84

Figure 3.5 Peer-Group Analysis 86

Figure 3.6 Cluster Analysis for Fraud Detection 91

Figure 3.7 Hierarchical Versus Nonhierarchical Clustering

Techniques 91

Figure 3.8 Euclidean Versus Manhattan Distance 92

Figure 3.9 Divisive Versus Agglomerative Hierarchical

Clustering 94

Figure 3.10 Calculating Distances between Clusters 95

Figure 3.11 Example for Clustering Birds. The Numbers

Indicate the Clustering Steps 96

Figure 3.12 Dendrogram for Birds Example. The Thick Black

Line Indicates the Optimal Clustering 96

Figure 3.13 Scree Plot for Clustering 97

Figure 3.14 Scatter Plot of Hierarchical Clustering Data 98

Figure 3.15 Output of Hierarchical Clustering Procedures 98

Figure 3.16 k-Means Clustering: Start from Original Data 105

Figure 3.17 k-Means Clustering Iteration 1: Randomly

Select Initial Cluster Centroids 105

Figure 3.18 k-Means Clustering Iteration 1: Assign Remaining

Observations 106

Figure 3.19 k-Means Iteration Step 2: Recalculate Cluster

Centroids 107

Figure 3.20 k-Means Clustering Iteration 2: Reassign

Observations 107

Figure 3.21 k-Means Clustering Iteration 3: Recalculate

Cluster Centroids 108

Figure 3.22 k-Means Clustering Iteration 3: Reassign

Observations 108

Figure 3.23 Rectangular Versus Hexagonal SOM Grid 109

Figure 3.24 Clustering Countries Using SOMs 111

Figure 3.25 Component Plane for Literacy 112

Figure 3.26 Component Plane for Political Rights 113

Figure 3.27 Must-Link and Cannot-Link Constraints in

Semi-Supervised Clustering 113

Figure 3.28 𝛿-Constraints in Semi-Supervised Clustering 114

Figure 3.29 𝜀-Constraints in Semi-Supervised Clustering 114

Figure 3.30 Cluster Proﬁling Using Histograms 115

Figure 3.31 Using Decision Trees for Clustering

Interpretation 116

Figure 3.32 One-Class Support Vector Machines 117

Figure 4.1 A Spider Construction in Tax Evasion Fraud 124

Figure 4.2 Regular Versus Fraudulent Bankruptcy 124

Figure 4.3 OLS Regression 126

Figure 4.4 Bounding Function for Logistic Regression 128

Figure 4.5 Linear Decision Boundary of Logistic Regression 130

Figure 4.6 Other Transformations 131

Figure 4.7 Fraud Detection Scorecard 133

Figure 4.8 Calculating the p-Value with a Student’s

t-Distribution 135

Figure 4.9 Variable Subsets for Four Variables V1, V2, V3,

and V4 135

Figure 4.10 Example Decision Tree 137

Figure 4.11 Example Data Sets for Calculating Impurity 138

Figure 4.12 Entropy Versus Gini 139

Figure 4.13 Calculating the Entropy for Age Split 139

Figure 4.14 Using a Validation Set to Stop Growing a

Decision Tree 140

Figure 4.15 Decision Boundary of a Decision Tree 142

Figure 4.16 Example Regression Tree for Predicting the

Fraud Percentage 142

Figure 4.17 Neural Network Representation of Logistic

Regression 145

Figure 4.18 A Multilayer Perceptron (MLP) Neural Network 145

Figure 4.19 Local Versus Global Minima 148

Figure 4.20 Using a Validation Set for Stopping Neural

Network Training 149

Figure 4.21 Example Hinton Diagram 151

Figure 4.22 Backward Variable Selection 152

Figure 4.23 Decompositional Approach for Neural Network

Rule Extraction 153

Figure 4.24 Pedagogical Approach for Rule Extraction 154

Figure 4.25 Two-Stage Models 155

Figure 4.26 Multiple Separating Hyperplanes 157

Figure 4.27 SVM Classiﬁer for the Perfectly Linearly

Separable Case 157

Figure 4.28 SVM Classiﬁer in Case of Overlapping

Distributions 159

Figure 4.29 The Feature Space Mapping 160

Figure 4.30 SVMs for Regression 162

Figure 4.31 Representing an SVM Classiﬁer as a Neural

Network 163

Figure 4.32 One-Versus-One Coding for Multiclass Problems 171

Figure 4.33 One-Versus-All Coding for Multiclass Problems 172

Figure 4.34 Training Versus Test Sample Set Up for

Performance Estimation 173

Figure 4.35 Cross-Validation for Performance Measurement 174

Figure 4.36 Bootstrapping 175

Figure 4.37 Calculating Predictions Using a Cut-Off 176

Figure 4.38 The Receiver Operating Characteristic Curve 178

Figure 4.39 Lift Curve 179

Figure 4.40 Cumulative Accuracy Proﬁle 180

Figure 4.41 Calculating the Accuracy Ratio 181

Figure 4.42 The Kolmogorov-Smirnov Statistic 181

Figure 4.43 A Cumulative Notch Difference Graph 184

Figure 4.44 Scatter Plot: Predicted Fraud Versus Actual

Fraud 185

Figure 4.45 CAP Curve for Continuous Targets 187

Figure 4.46 Regression Error Characteristic (REC) Curve 188

Figure 4.47 Varying the Time Window to Deal with Skewed

Data Sets 190

Figure 4.48 Oversampling the Fraudsters 191

Figure 4.49 Undersampling the Nonfraudsters 191

Figure 4.50 Synthetic Minority Oversampling Technique

(SMOTE) 193

Figure 5.1a Königsberg Bridges 210

Figure 5.1b Schematic Representation of the Königsberg

Bridges 211

Figure 5.2 Identity Theft. The Frequent Contact List of a

Person is Suddenly Extended with Other Contacts

(Light Gray Nodes). This Might Indicate that a

Fraudster (Dark Gray Node) Took Over that

Customer’s Account and “shares” his/her

Contacts 213

Figure 5.3 Network Representation 214

Figure 5.4 Example of a (Un)Directed Graph 215

Figure 5.5 Follower–Followee Relationships in a Twitter

Network 215

Figure 5.6 Edge Representation 216

Figure 5.7 Example of a Fraudulent Network 218

Figure 5.8 An Egonet. The Ego is Surrounded by Six Alters,

of Whom Two are Legitimate (White Nodes) and

Four are Fraudulent (Gray Nodes) 218

Figure 5.9 Toy Example of Credit Card Fraud 220

Figure 5.10 Mathematical Representation of (a) a Sample

Network: (b) the Adjacency or Connectivity Matrix;

(e) the Weight List 221

Figure 5.11 A Real-Life Example of a Homophilic Network 224

Figure 5.12 A Homophilic Network 225

Figure 5.13 Sample Network 229

Figure 5.14a Degree Distribution 230

Figure 5.14b Illustration of the Degree Distribution for a

Real-Life Network of Social Security Fraud.

The Degree Distribution Follows a Power Law

(log-log axes) 230

Figure 5.15 A 4-regular Graph 231

Figure 5.16 Example Social Network for a Relational Neighbor

Classiﬁer 233

Figure 5.17 Example Social Network for a Probabilistic

Relational Neighbor Classiﬁer 235

Figure 5.18 Example of Social Network Features for a

Relational Logistic Regression Classiﬁer 236

Figure 5.19 Example of Featurization with Features

Describing Intrinsic Behavior and Behavior of

the Neighborhood 237

Figure 5.20 Illustration of Dijkstra’s Algorithm 241

Figure 5.21 Illustration of the Number of Connecting Paths

Between Two Nodes 242

Figure 5.22 Illustration of Betweenness Between Communities

of Nodes 245

Figure 5.23 PageRank Algorithm 247

Figure 5.24 Illustration of Iterative Process of the PageRank

Algorithm 249

Figure 5.25 Sample Network 254

Figure 5.26 Community Detection for Credit Card Fraud 259

Figure 5.27 Iterative Bisection 261

Figure 5.28 Dendrogram of the Clustering of Figure 5.27 by

the Girvan-Newman Algorithm. The Modularity

Q Is Maximized When Splitting the Network into

Two Communities ABC – DEFG 262

Figure 5.29 Complete (a) and Partial (b) Communities 264

Figure 5.30 Overlapping Communities 265

Figure 5.31 Unipartite Graph 266

Figure 5.32 Bipartite Graph 267

Figure 5.33 Connectivity Matrix of a Bipartite Graph 268

Figure 5.34 A Multipartite Graph 269

Figure 5.35 Sample Network of Gotcha! 270

Figure 5.36 Exposure Score of the Resources Derived by a

Propagation Algorithm. The Results are Based

on a Real-life Data Set in Social Security Fraud 273

Figure 5.37 Egonet in Social Security Fraud. A Company Is

Associated with its Resources 274

Figure 5.38 ROC Curve of the Gotcha! Model, which

Combines both Intrinsic and Relational Features 275

Figure 6.1 The Analytical Model Life Cycle 280

Figure 6.2 Trafﬁc Light Indicator Approach 282

Figure 6.3 SAS Social Network Analysis Dashboard 293

Figure 6.4 SAS Social Network Analysis Claim Detail

Investigation 294

Figure 6.5 SAS Social Network Analysis Link Detection 295

Figure 6.6 Distribution of Claim Amounts and Average

Claim Value 297

Figure 6.7 Geographical Distribution of Claims 298

Figure 6.8 Zooming into the Geographical Distribution

of Claims 299

Figure 6.9 Measuring the Efﬁciency of the Fraud-Detection

Process 300

Figure 6.10 Evaluating the Efﬁciency of Fraud Investigators 301

Figure 7.1 RACI Matrix 318

Figure 7.2 Anonymizing a Database 321

Figure 7.3 Different SQL Views Deﬁned for a Database 323

Figure 7.4 Aggregate Loss Distribution with Indication

of Expected Loss, Value at Risk (VaR) at

99.9 Percent Conﬁdence Level and Unexpected

Loss 331

Figure 7.5 Snapshot of a Credit Card Fraud Time Series

Data Set and Associated Histogram of the Fraud

Amounts 332

Figure 7.6 Aggregate Loss Distribution Resulting from a

Monte Carlo Simulation with Poisson

Distributed Monthly Fraud Frequency and

Associated Pareto Distributed Fraud Loss 334

Foreword

Fraud will always be with us. It is linked both to organized crime and to terrorism, and it inflicts substantial economic damage. The perpetrators of fraud play a dynamic cat-and-mouse game with those trying to stop them. Preventing a particular kind of fraud does not mean the fraudsters give up, but merely that they change their tactics: they are constantly on the lookout for new avenues for fraud, for new weaknesses in the system. And given that our social and financial systems are forever developing, there are always new opportunities to be exploited.

This book is a clear and comprehensive outline of the current state of the art in fraud-detection and prevention methodology. It describes the data necessary to detect fraud, and then takes the reader from the basics of fraud-detection data analytics, through advanced pattern recognition methodology, to cutting-edge social network analysis and fraud ring detection.

If we cannot stop fraud altogether, an awareness of the contents of this book will at least enable readers to reduce the extent of fraud, and make it harder for criminals to take advantage of the honest. The readers’ organizations, be they public or private, will be better protected if they implement the strategies described in this book. In short, this book is a valuable contribution to the well-being of society and of the people within it.

Professor David J. Hand

Imperial College, London

Preface

It is estimated that a typical organization loses about 5 percent of its revenues to fraud each year. In this book, we discuss how state-of-the-art descriptive, predictive, and social network analytics can be used to fight fraud by learning fraud patterns from historical data.

The focus of this book is not on the mathematics or theory, but on the practical applications. Formulas and equations are included only when absolutely needed from a practitioner’s perspective. Nor is it our aim to provide exhaustive coverage of all analytical techniques previously developed; rather, we cover those that provide real added value in a practical fraud detection setting.

Because it is targeted primarily at the business professional, the book is written in a condensed, focused way. Prerequisite knowledge consists of some basic exposure to descriptive statistics (e.g., mean, standard deviation, correlation, confidence intervals, hypothesis testing), data handling (using, for example, Microsoft Excel or SQL), and data visualization (e.g., bar plots, pie charts, histograms, scatter plots). Throughout the discussion, many examples of real-life fraud applications are included, drawn from, for example, insurance fraud, tax evasion fraud, and credit card fraud. The authors also integrate both their research and consulting experience throughout the various chapters. The book is aimed at (senior) data analysts, (aspiring) data scientists, consultants, analytics practitioners, and researchers (e.g., PhD candidates) starting to explore the field.

Chapter 1 sets the stage for fraud detection, prevention, and analytics. It starts by defining fraud and then zooms in on fraud detection and prevention. The impact of big data on fraud detection and the fraud analytics process model are reviewed next. The chapter concludes by summarizing the key skills of a fraud data scientist.

Chapter 2 provides an extensive discussion of the basic ingredient of any fraud analytical model: data! It introduces various types of data sources and discusses how to merge and sample them. The next sections discuss the different types of data elements, visual exploration, Benford’s law, and descriptive statistics. These are all essential tools to start understanding the characteristics and limitations of the data available. Data preprocessing activities are also extensively covered: handling missing values, detecting and treating outliers, defining red flags, standardizing data, categorizing variables, weights of evidence coding, and variable selection. Principal component analysis is outlined as a technique to reduce the dimensionality of the input data. This is then further illustrated with RIDIT and PRIDIT analysis. The chapter ends by reviewing segmentation and the risks thereof.

Chapter 3 continues by exploring the use of descriptive analytics for fraud detection. The idea here is to look for unusual patterns or outliers in a fraud data set. Both graphical and statistical outlier detection procedures are reviewed first. This is followed by an overview of break-point analysis, peer-group analysis, association rules, clustering, and one-class SVMs.

Chapter 4 zooms in on predictive analytics for fraud detection. We start from a labeled data set of transactions whereby each transaction has a target of interest that can either be binary (e.g., fraudulent or not) or continuous (e.g., amount of fraud). We then discuss various analytical techniques to build predictive models: linear regression, logistic regression, decision trees, neural networks, support vector machines, ensemble methods, and multiclass classification techniques. The next section reviews how to measure the performance of a predictive analytical model by first deciding on the data set split-up and then on the performance metric. The class imbalance problem is also discussed extensively. The chapter concludes by giving some performance benchmarks.
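As a rough illustration of the labeled data set described for Chapter 4, the following Python sketch represents transactions that carry both a binary target and a continuous target, and computes the share of fraudulent cases to make the class imbalance visible. The `Transaction` class, its field names, and the sample values are hypothetical and chosen only for this example; they do not come from the book.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    """One labeled transaction (hypothetical structure for illustration)."""
    transaction_id: str
    amount: float
    is_fraud: bool       # binary target: fraudulent or not
    fraud_amount: float  # continuous target: amount of fraud (0.0 if legitimate)


def fraud_rate(transactions):
    """Return the fraction of transactions labeled as fraudulent."""
    if not transactions:
        return 0.0
    return sum(t.is_fraud for t in transactions) / len(transactions)


# Made-up sample data: one fraudulent transaction out of four.
sample = [
    Transaction("t1", 120.0, False, 0.0),
    Transaction("t2", 980.0, True, 980.0),
    Transaction("t3", 45.5, False, 0.0),
    Transaction("t4", 15.0, False, 0.0),
]

print(f"fraud rate: {fraud_rate(sample):.2%}")
```

In real fraud data the fraud rate is typically far lower than in this toy sample, which is why the class imbalance problem receives separate attention.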

Chapter 5 introduces the reader to social network analysis and its use for fraud detection. Starting from the observation that the propensity to commit fraud is often influenced by the social neighborhood, we describe the main components of a network and illustrate how transactional data sources can be transformed into networks. In the next section, we elaborate on featurization, the process of extracting a set of meaningful features from the network. We distinguish between three main types of features: neighborhood metrics, centrality metrics, and collective inference algorithms. We then zoom in on community mining, where we aim at finding groups of fraudsters closely connected in the network. By introducing multipartite graphs, we address the fact that fraud often depends on a multitude of different factors and that the inclusion of all these factors in a network representation contributes to a better understanding and analysis of the detection problem at hand. The chapter concludes with a real-life example of social security fraud.
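To make the idea of transforming transactional data into a network more concrete, here is a minimal Python sketch under stated assumptions: each transaction is reduced to a (card, merchant) pair, the pairs are turned into a bipartite adjacency structure, and the degree of a node is computed as one very simple neighborhood metric. The function names and the sample pairs are hypothetical, and this is not the featurization procedure used in the book.

```python
from collections import defaultdict


def build_bipartite_network(transactions):
    """Map each node to the set of nodes it shares at least one transaction with.

    `transactions` is an iterable of (card, merchant) pairs. Nodes are tagged
    with their type so that a card and a merchant can never be confused.
    """
    neighbors = defaultdict(set)
    for card, merchant in transactions:
        neighbors[("card", card)].add(("merchant", merchant))
        neighbors[("merchant", merchant)].add(("card", card))
    return neighbors


def degree(neighbors, node):
    """Number of distinct neighbors of `node` (0 if the node is unknown)."""
    return len(neighbors.get(node, ()))


# Made-up sample data; the repeated ("c2", "m1") pair adds no new edge.
transactions = [
    ("c1", "m1"),
    ("c2", "m1"),
    ("c2", "m2"),
    ("c3", "m2"),
    ("c2", "m1"),
]

network = build_bipartite_network(transactions)
print(degree(network, ("merchant", "m1")))
print(degree(network, ("card", "c2")))
```

Storing neighbors as sets keeps repeated transactions from inflating the degree; a weighted variant would count repetitions instead.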

Chapter 6 deals with the postprocessing of fraud analytical models. It starts by giving an overview of the analytical fraud model life cycle. It then discusses the traffic light indicator approach and decision tables as two popular model representations. This is followed by a set of guidelines for appropriately selecting the fraud sample to investigate. Fraud alert and case management are covered next. We also illustrate how visual analytics can contribute to the postprocessing activities. We describe how to backtest analytical fraud models by considering data stability, model stability, and model calibration. The chapter concludes by giving some guidelines about model design and documentation.

Chapter 7 provides a broader perspective on fraud analytics. We provide some guidelines for setting up and managing data quality programs. We zoom in on privacy and discuss various ways to ensure appropriate access to both internal and external data. We discuss how analytical fraud estimates can be used to calculate both expected and unexpected losses, which can then help to determine provisioning and capital buffers. A discussion of total cost of ownership and return on investment provides an economic perspective on fraud analytics. This is followed by a discussion of in- versus outsourcing of analytical model development. We briefly zoom in on some interesting modeling extensions, such as forecasting and text analytics. The potential and danger of the Internet of Things for fraud analytics are also covered. The chapter concludes by giving some recommendations for corporate fraud governance.

Acknowledgments

It is a great pleasure to acknowledge the contributions and assistance of

various colleagues, friends, and fellow analytics lovers to the writing of

this book. This book is the result of many years of research and teaching

in analytics, risk management, and fraud. We ﬁrst would like to thank

our publisher, John Wiley & Sons, for accepting our book proposal less

than one year ago.

We are grateful to the active and lively analytics and fraud detec-

tion community for providing various user fora, blogs, online lectures,

and tutorials, which proved very helpful.

We would also like to acknowledge the direct and indirect contribu-

tions of the many colleagues, fellow professors, students, researchers,

and friends with whom we collaborated during the past years.

Last but not least, we are grateful to our partners, parents, and

families for their love, support, and encouragement.

We have tried to make this book as complete, accurate, and

enjoyable as possible. Of course, what really matters is what you, the

reader, think of it. Please let us know your views by getting in touch.

The authors welcome all feedback and comments—so do not hesitate

to let us know your thoughts!

Bart Baesens

Véronique Van Vlasselaer

Wouter Verbeke

August 2015

CHAPTER 1

Fraud: Detection,

Prevention, and

Analytics!

INTRODUCTION

In this ﬁrst chapter, we set the scene for what’s ahead by introducing

fraud analytics using descriptive, predictive, and social network tech-

niques. We start off by deﬁning and characterizing fraud and discuss

different types of fraud. Next, fraud detection and prevention are

discussed as a means to address and limit the amount and overall impact of

fraud. Big data and analytics provide powerful tools that may improve

an organization’s fraud detection system. We discuss in detail how and

why these tools complement traditional expert-based fraud-detection

approaches. Subsequently, the fraud analytics process model is intro-

duced, providing a high-level overview of the steps that are followed

in developing and implementing a data-driven fraud-detection sys-

tem. The chapter concludes by discussing the characteristics and skills

of a good fraud data scientist, followed by a scientiﬁc perspective on

the topic.

FRAUD!

Since a thorough discussion or investigation requires clear and precise

deﬁnitions of the subject of interest, this ﬁrst section starts by deﬁning

fraud and by highlighting a number of essential characteristics. Sub-

sequently, an explanatory conceptual model will be introduced that

provides deeper insight into the underlying drivers of fraudsters, the

individuals committing fraud. Insight into the field of application, or

in other words expert knowledge, is crucial for analytics to be

successfully applied in any setting, and matters eventually as much as

technical skill. Expert knowledge or insight into the problem at hand

helps an analyst in gathering and processing the right information in

the right manner, and to customize data allowing analytical techniques

to perform as well as possible in detecting fraud.

The Oxford Dictionary deﬁnes fraud as follows:

Wrongful or criminal deception intended to result in

ﬁnancial or personal gain.

On the one hand, this deﬁnition captures the essence of fraud

and covers the many different forms and types of fraud that will be

discussed in this book. On the other hand, it does not very precisely

describe the nature and characteristics of fraud, and as such, does not

provide much direction for discussing the requirements of a fraud

detection system. A more useful deﬁnition will be provided below.

Fraud is deﬁnitely not a recent phenomenon unique to modern

society, nor is it even unique to mankind. Animal species also engage

in what could be called fraudulent activities, although maybe we should

classify the behavior as displayed by, for instance, chameleons, stick

insects, apes, and others rather as manipulative behavior instead of fraud-

ulent activities, since wrongful or criminal are human categories or con-

cepts that do not straightforwardly apply to animals. Indeed, whether

activities are wrongful or criminal depends on the applicable rules or

legislation, which deﬁnes explicitly and formally these categories that

are required in order to be able to classify behavior as being fraudulent.

A more thorough and detailed characterization of the multifaceted

phenomenon of fraud is provided by Van Vlasselaer et al. (2015):

Fraud is an uncommon, well-considered,

imperceptibly concealed, time-evolving and often

carefully organized crime which appears in many

types of forms.

This deﬁnition highlights ﬁve characteristics that are associated

with particular challenges related to developing a fraud-detection

system, which is the main topic of this book. The ﬁrst emphasized

characteristic and associated challenge concerns the fact that fraud

is uncommon. Independent of the exact setting or application, only

a minority of the involved population of cases typically concerns

fraud, of which furthermore only a limited number will be known to

concern fraud. This makes it difﬁcult to both detect fraud, since the

fraudulent cases are covered by the nonfraudulent ones, as well as to

learn from historical cases to build a powerful fraud-detection system

since only few examples are available.

In fact, fraudsters deliberately try to blend in and not to behave

differently from others in order not to get noticed and to remain covered

by nonfraudsters. This effectively makes fraud imperceptibly concealed, since

fraudsters do succeed in hiding by carefully considering and planning how

to precisely commit fraud. Their behavior is deﬁnitely not impulsive

and unplanned, since if it were, detection would be far easier.

They also adapt and reﬁne their methods, which they need to do

in order to remain undetected. Fraud-detection systems improve and

learn by example. Therefore, the techniques and tricks fraudsters adopt

evolve over time along with, or better, ahead of fraud-detection mechanisms.

This cat-and-mouse play between fraudsters and fraud fighters

may seem to be an endless game, yet there is no alternative solution

so far. By adopting and developing advanced fraud-detection and pre-

vention mechanisms, organizations do manage to reduce losses due to

fraud because fraudsters, like other criminals, tend to look for the easy

way and will look for other, easier opportunities. Therefore, ﬁghting

fraud by building advanced and powerful detection systems is deﬁ-

nitely not a pointless effort, but admittedly, it is very likely an effort

without end.

Fraud is often also a carefully organized crime, meaning that

fraudsters often do not operate independently, have allies, and may

induce copycats. Moreover, several fraud types such as money laundering

and carousel fraud involve complex structures that are set up in

order to commit fraud in an organized manner. Fraud is therefore not

an isolated event, and in order to detect it the context

(e.g., the social network of fraudsters) should be taken into account.

Research shows that fraudulent companies indeed are more connected

to other fraudulent companies than to nonfraudulent companies, as

shown in a company tax-evasion case study by Van Vlasselaer et al.

(2015). Social network analytics for fraud detection, as discussed

in Chapter 5, appears to be a powerful tool for unmasking fraud by

making clever use of contextual information describing the network or

environment of an entity.
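
The observation that fraudulent companies tend to be linked to other fraudulent companies can be turned into a simple network feature. The following is a minimal sketch only, not the method of the cited case study: the toy network, the labels, and the function name `fraud_neighbor_share` are hypothetical and serve purely as an illustration.

```python
from collections import defaultdict


def fraud_neighbor_share(edges, fraud_labels):
    """For each node, return the fraction of its direct neighbors labeled fraudulent."""
    neighbors = defaultdict(set)
    for a, b in edges:
        # The network is treated as undirected: a link counts in both directions.
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {
        node: sum(fraud_labels[n] for n in nbrs) / len(nbrs)
        for node, nbrs in neighbors.items()
    }


# Hypothetical toy network: companies A, B, and C are known fraudsters.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E"), ("E", "F")]
fraud_labels = {"A": True, "B": True, "C": True,
                "D": False, "E": False, "F": False}

share = fraud_neighbor_share(edges, fraud_labels)
for company in sorted(share):
    print(company, round(share[company], 2))
```

In this toy example the known fraudsters have a higher share of fraudulent neighbors than the companies at the other end of the chain, which is the kind of contextual signal a detection model could use as an input feature.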

A ﬁnal element in the description of fraud provided by Van

Vlasselaer et al. indicates the many different types of forms in which

fraud occurs. This both refers to the wide set of techniques and

approaches used by fraudsters as well as to the many different settings

in which fraud occurs or economic activities that are susceptible to

fraud. Table 1.1 provides a nonexhaustive overview and description of

a number of important fraud types—important being deﬁned in terms of

frequency of occurrence as well as the total monetary value involved.

Table 1.1 Nonexhaustive List of Fraud Categories and Types

Credit card fraud: In credit card fraud there is an unauthorized taking of
another’s credit. Some common credit card fraud subtypes are counterfeiting
credit cards (for the definition of counterfeit, see below), using lost or
stolen cards, or fraudulently acquiring credit through mail (definition
adopted from definitions.uslegal.com). Two subtypes can be identified, as
described by Bolton and Hand (2002): (1) Application fraud, involving
individuals obtaining new credit cards from issuing companies by using false
personal information, and then spending as much as possible in a short space
of time; (2) Behavioral fraud, where details of legitimate cards are obtained
fraudulently and sales are made on a “Cardholder Not Present” basis. This
does not necessarily require stealing the physical card, only stealing the
card credentials. Behavioral fraud concerns most of the credit card fraud.
Debit card fraud also occurs, although less frequently. Credit card fraud is
a form of identity theft, as will be defined below.

Insurance fraud: Broad category spanning fraud related to any type of
insurance, from the side of either the buyer or the seller of an insurance
contract. Insurance fraud from the issuer (seller) includes selling policies
from nonexistent companies, failing to submit premiums, and churning policies
to create more commissions. Buyer fraud includes exaggerated claims (property
insurance: obtaining payment that is worth more than the value of the
property destroyed), falsified medical history (healthcare insurance: fake
injuries), postdated policies, faked death, kidnapping or murder (life
insurance fraud), and faked damage (automobile insurance: staged collision)
(definition adopted from www.investopedia.com).

Corruption: Corruption is the misuse of entrusted power (by heritage,
education, marriage, election, appointment, or whatever else) for private
gain. This definition is similar to the definition of fraud provided by the
Oxford Dictionary discussed before, in that the objective is personal gain.
It is different in that it focuses on misuse of entrusted power. The
definition as such covers a broad range of different subtypes of corruption,
so it does not only cover corruption by a politician or a public servant, but
also, for example, by the CEO or CFO of a company, the notary public, the
team leader at a workplace, the administrator or admissions officer to a
private school or hospital, the coach of a soccer team, and so on (definition
adopted from www.corruptie.org).

Counterfeit: An imitation intended to be passed off fraudulently or
deceptively as genuine. Counterfeit typically concerns valuable objects,
credit cards, identity cards, popular products, money, etc. (definition
adopted from www.dictionary.com).

Product warranty fraud: A product warranty is a type of guarantee that a
manufacturer or similar party makes regarding the condition of its product,
and also refers to the terms and situations in which repairs or exchanges
will be made in the event that the product does not function as originally
described or intended (definition adopted from www.investopedia.com). When a
product fails to offer the described functionalities or displays deviating
characteristics or behavior that are a consequence of the production process
and not a consequence of misuse by the customer, compensation or remuneration
by the manufacturer or provider can be claimed. When the conditions of the
product have been altered due to the customer’s use of the product, then the
warranty does not apply. Intentionally wrongly claiming compensation or
remuneration based on a product warranty is called product warranty fraud.

Healthcare fraud: Healthcare fraud involves the filing of dishonest
healthcare claims in order to make profit. Practitioner schemes include:
individuals obtaining subsidized or fully covered prescription pills that are
actually unneeded and then selling them on the black market for a profit;
billing by practitioners for care that they never rendered; filing duplicate
claims for the same service rendered; billing for a noncovered service as a
covered service; modifying medical records, and so on. Members can commit
healthcare fraud by providing false information when applying for programs or
services, forging or selling prescription drugs, loaning or using another’s
insurance card, and so on (definition adopted from www.law.cornell.edu).

Telecommunications fraud: Telecommunication fraud is the theft of
telecommunication services (telephones, cell phones, computers, etc.) or the
use of telecommunication services to commit other forms of fraud (definition
adopted from itlaw.wikia.com). An important example concerns cloning fraud
(i.e., the cloning of a phone number and the related call credit by a
fraudster), which is an instance of superimposition fraud in which fraudulent
usage is superimposed on (added to) the legitimate usage of an account
(Fawcett and Provost 1997).

Money laundering: The process of taking the proceeds of criminal activity and
making them appear legal. Laundering allows criminals to transform illegally
obtained gain into seemingly legitimate funds. It is a worldwide problem,
with an estimated $300 billion going through the process annually in the
United States (definition adopted from
legal-dictionary.thefreedictionary.com).

Click fraud: Click fraud is an illegal practice that occurs when individuals
click on a website’s click-through advertisements (either banner ads or paid
text links) to increase the payable number of clicks to the advertiser. The
illegal clicks could either be performed by having a person manually click
the advertising hyperlinks or by using automated software or online bots that
are programmed to click these banner ads and pay-per-click text ad links
(definition adopted from www.webopedia.com).

Identity theft: The crime of obtaining the personal or financial information
of another person for the purpose of assuming that person’s name or identity
in order to make transactions or purchases. Some identity thieves sift
through trash bins looking for bank account and credit card statements; other
more high-tech methods involve accessing corporate databases to steal lists
of customer information (definition adopted from www.investopedia.com).

Tax evasion: Tax evasion is the illegal act or practice of failing to pay
taxes that are owed. In businesses, tax evasion can occur in connection with
income taxes, employment taxes, sales and excise taxes, and other federal,
state, and local taxes. Examples of practices that are considered tax evasion
include knowingly not reporting income or underreporting income (i.e.,
claiming less income than you actually received from a specific source)
(definition adopted from biztaxlaw.about.com).

Plagiarism: Plagiarizing is defined by Merriam Webster’s online dictionary as
to steal and pass off (the ideas or words of another) as one’s own, to use
(another’s production) without crediting the source, to commit literary
theft, to present as new and original an idea or product derived from an
existing source. It involves both stealing someone else’s work and lying
about it afterward (definition adopted from www.plagiarism.org).

In the end, fraudulent activities are intended to result in gains or

beneﬁts for the fraudster, as emphasized by the deﬁnition of fraud pro-

vided by the Oxford Dictionary. The potential, usually monetary, gain or

beneﬁt forms in the large majority of cases the basic driver for commit-

ting fraud.

Figure 1.1 Fraud Triangle (corners: Pressure, Opportunity, Rationalization)

The so-called fraud triangle as depicted in Figure 1.1 provides a

more elaborate explanation for the underlying motives or drivers for

committing occupational fraud. The fraud triangle originates from a

hypothesis formulated by Donald R. Cressey in his 1953 book Other

People’s Money: A Study of the Social Psychology of Embezzlement:

Trusted persons become trust violators when they conceive

of themselves as having a ﬁnancial problem which is

non-shareable, are aware this problem can be secretly

resolved by violation of the position of ﬁnancial trust, and

are able to apply to their own conduct in that situation

verbalizations which enable them to adjust their

conceptions of themselves as trusted persons with their

conceptions of themselves as users of the entrusted funds

or property.

This basic conceptual model explains the factors that together cause

or explain the drivers for an individual to commit occupational fraud,

yet provides a useful insight into the fraud phenomenon from a broader

point of view as well. The model has three legs that together institute

fraudulent behavior:

1. Pressure is the ﬁrst leg and concerns the main motivation for

committing fraud. An individual will commit fraud because a

pressure or a problem is experienced of ﬁnancial, social, or any

other nature, and it cannot be resolved or relieved in an autho-

rized manner.

2. Opportunity is the second leg of the model, and concerns

the precondition for an individual to be able to commit

fraud. Fraudulent activities can only be committed when

the opportunity exists for the individual to resolve or relieve

the experienced pressure or problem in an unauthorized but

concealed or hidden manner.

3. Rationalization is the psychological mechanism that explains

why fraudsters do not refrain from committing fraud and think

of their conduct as acceptable.

An essay by Dufﬁeld and Grabosky (2001) further explores

the motivational basis of fraud from a psychological perspective.

It concludes that a number of psychological factors may be present

in those persons who commit fraud, but that these factors are also

associated with entirely legitimate forms of human endeavor. And so

fraudsters cannot be distinguished from nonfraudsters purely based

on psychological characteristics or patterns.

Fraud is a social phenomenon in the sense that the potential bene-

ﬁts for the fraudsters come at the expense of the victims. These victims

are individuals, enterprises, or the government, and as such society as

a whole. Some recent numbers give an indication of the estimated size

and the ﬁnancial impact of fraud:

◾A typical organization loses 5 percent of its revenues to fraud

each year (www.acfe.com).

◾The total cost of insurance fraud (non–health insurance) in the

United States is estimated to be more than $40 billion per year

(www.fbi.gov).

◾Fraud is costing the United Kingdom £73 billion a year (National

Fraud Authority).

◾Credit card companies “lose approximately seven cents per

every hundred dollars of transactions due to fraud” (Andrew

Schrage, Money Crashers Personal Finance, 2012).

◾The average size of the informal economy, as a percent of ofﬁ-

cial GNI in the year 2000, in developing countries is 41 percent,

in transition countries 38 percent, and in OECD countries 18

percent (Schneider 2002).

Even though these numbers are rough estimates rather than exact

measurements, they are based on evidence and do indicate the impor-

tance and impact of the phenomenon, and therefore as well the need

for organizations and governments to actively ﬁght and prevent fraud

with all means they have at their disposal. As will be further elaborated

in the ﬁnal chapter, these numbers also indicate that it is likely worth-

while to invest in fraud-detection and fraud-prevention systems, since

a signiﬁcant ﬁnancial return on investment can be made.
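
To make the return-on-investment argument concrete, the arithmetic can be sketched as follows. Only the 5 percent loss rate comes from the figures quoted above; the revenue, the share of fraud losses that a detection system avoids, and the system cost are hypothetical values chosen purely for illustration, and the function name `fraud_roi` is likewise an assumption.

```python
def fraud_roi(revenue, fraud_rate, reduction, system_cost):
    """Return (avoided_loss, roi) for a fraud-detection investment.

    avoided_loss = revenue * fraud_rate * reduction
    roi          = (avoided_loss - system_cost) / system_cost
    """
    expected_loss = revenue * fraud_rate        # yearly loss without the system
    avoided_loss = expected_loss * reduction    # part of that loss the system avoids
    roi = (avoided_loss - system_cost) / system_cost
    return avoided_loss, roi


# Hypothetical organization: 100 million in yearly revenue, 5 percent lost to
# fraud (the ACFE estimate quoted above), a system that is assumed to avoid
# 10 percent of those losses, and an assumed yearly system cost of 250,000.
avoided, roi = fraud_roi(revenue=100_000_000, fraud_rate=0.05,
                         reduction=0.10, system_cost=250_000)
print(f"avoided loss: {avoided:,.0f}  ROI: {roi:.0%}")
```

Under these assumed inputs, even a system that avoids only a tenth of the fraud losses pays for itself; with other inputs the outcome may of course differ.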

The importance and need for effective fraud-detection and fraud-

prevention systems is furthermore highlighted by the many different

forms or types of fraud of which a number have been summarized in

Table 1.1, which is not exhaustive but, rather, indicative, and which

illustrates the widespread occurrence across different industries and

product and service segments. The broad fraud categories enlisted and

brieﬂy deﬁned in Table 1.1 can be further subdivided into more speciﬁc

subtypes, which, although interesting, would lead us too far into the

particularities of each of these forms of fraud. One may refer to the fur-

ther reading sections at the end of each chapter of this book, providing

selected references to specialized literature on different forms of fraud.

A number of particular fraud types will also be further elaborated in

real-life case studies throughout the book.

FRAUD DETECTION AND PREVENTION

Two components that are essential parts of almost any effective strategy

to fight fraud concern fraud detection and fraud prevention. Fraud

detection refers to the ability to recognize or discover fraudulent activ-

ities, whereas fraud prevention refers to measures that can be taken

to avoid or reduce fraud. The difference between both is clear-cut; the

former is an ex post approach whereas the latter is an ex ante approach.

Both tools may and likely should be used in a complementary manner

to pursue the shared objective, fraud reduction.

However, as will be discussed in more detail further on, preventive

actions will change fraud strategies and consequently impact detection

power. Installing a detection system will cause fraudsters to adapt and

change their behavior, and so the detection system itself will eventually

impair its own detection power. So although complementary, fraud

detection and prevention are not independent and therefore should be

aligned and considered a whole.

The classic approach to fraud detection is an expert-based approach,

meaning that it builds on the experience, intuition, and business

or domain knowledge of the fraud analyst. Such an expert-based

approach typically involves a manual investigation of a suspicious

case, which may have been signaled, for instance, by a customer

complaining of being charged for transactions he did not make. Such

a disputed transaction may indicate a new fraud mechanism to have

been discovered or developed by fraudsters, and therefore requires

a detailed investigation for the organization to understand and sub-

sequently address the new mechanism.

Comprehension of the fraud mechanism or pattern allows extend-

ing the fraud detection and prevention mechanism that is often imple-

mented as a rule base or engine, meaning in the form of a set of If-Then

rules, by adding rules that describe the newly detected fraud mecha-

nism. These rules, together with rules describing previously detected

fraud patterns, are applied to future cases or transactions and trigger

an alert or signal when fraud is or may be committed by use of this

mechanism. A simple, yet possibly very effective, example of a fraud

detection rule in an insurance claim fraud setting goes as follows:

IF:

◾Amount of claim is above threshold OR

◾Severe accident, but no police report OR

◾Severe injury, but no doctor report OR

◾Claimant has multiple versions of the accident OR

◾Multiple receipts submitted

THEN:

◾Flag claim as suspicious AND

◾Alert fraud investigation ofﬁcer.
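
The If-Then rule above could be expressed in code roughly as follows. This is a minimal sketch rather than a production rule engine: the `Claim` fields, the threshold value, and the reading of "multiple" as "more than one" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    # Hypothetical claim record; field names are illustrative only.
    amount: float
    severe_accident: bool
    has_police_report: bool
    severe_injury: bool
    has_doctor_report: bool
    accident_versions: int
    receipts_submitted: int


AMOUNT_THRESHOLD = 10_000.0  # assumed value; set by the fraud experts in practice


def triggered_conditions(claim, threshold=AMOUNT_THRESHOLD):
    """Return the names of the IF-conditions of the rule that hold for this claim."""
    checks = {
        "amount above threshold": claim.amount > threshold,
        "severe accident, no police report":
            claim.severe_accident and not claim.has_police_report,
        "severe injury, no doctor report":
            claim.severe_injury and not claim.has_doctor_report,
        "multiple versions of the accident": claim.accident_versions > 1,
        "multiple receipts submitted": claim.receipts_submitted > 1,
    }
    return [name for name, hit in checks.items() if hit]


def evaluate(claim):
    """Apply the rule: the conditions are OR-ed, and both THEN-actions fire together."""
    reasons = triggered_conditions(claim)
    suspicious = bool(reasons)
    return {"suspicious": suspicious,
            "alert_investigator": suspicious,
            "reasons": reasons}


claim = Claim(amount=2_500.0, severe_accident=True, has_police_report=False,
              severe_injury=False, has_doctor_report=False,
              accident_versions=1, receipts_submitted=1)
result = evaluate(claim)
print(result)
```

Returning the triggered conditions alongside the flag gives the fraud investigation officer a reason for each alert, which helps when deciding which rules to keep, update, or remove.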

Such an expert approach suffers from a number of disadvantages.

Rule bases or engines are typically expensive to build, since they

require advanced manual input by the fraud experts, and often turn

out to be difﬁcult to maintain and manage. Rules have to be kept

up to date and only or mostly trigger real fraudulent cases, since

every signaled case requires human follow-up and investigation.

Therefore, the main challenge concerns keeping the rule base lean

and effective—in other words, deciding when and which rules to add,

remove, update, or merge.

It is important to realize that fraudsters can, for instance by trial

and error, learn the business rules that block or expose them and will

devise inventive workarounds. Since the rules in the rule-based detec-

tion system are based on past experience, new emerging fraud pat-

terns are not automatically ﬂagged or signaled. Fraud is a dynamic

phenomenon, as will be discussed below in more detail, and therefore

needs to be traced continuously. Consequently, a fraud detection and

prevention system also needs to be continuously monitored, improved,

and updated to remain effective.

An expert-based fraud-detection system relies on human expert

input, evaluation, and monitoring, and as such involves a great deal

of labor-intensive human interventions. An automated approach to

build and maintain a fraud-detection system, requiring less human

involvement, could lead to a more efﬁcient and effective system

for detecting fraud. The next section in this chapter will introduce

several alternative approaches to expert systems that leverage the

massive amounts of data that nowadays can be gathered and pro-

cessed at very low cost, in order to develop, monitor, and update a

high-performing fraud-detection system in a more automated and

efﬁcient manner. These alternative approaches still require and build

on expert knowledge and input, which remains crucial in order to

build an effective system.

EXAMPLE CASE: EXPERT-BASED APPROACH TO

INTERNAL FRAUD DETECTION IN AN INSURANCE

CLAIM-HANDLING PROCESS

An example expert-based detection and prevention system to signal potential

fraud committed by claim handling ofﬁcers concerns the business process

depicted in Figure 1.2, illustrating the handling of ﬁre incident claims without

any form of bodily injury (including death) (Caron et al. 2013). The process

involves the following types of activities:

◾Administrative activities

◾Evaluation-related activities

◾In-depth assessment by internal and external experts

◾Approval activities

◾Leniency-related activities

◾Fraud investigation activities

Figure 1.2 Fire Incident Claim-Handling Process (process map not reproduced; legible labels: additional evaluation activities, leniency-related activities, approval cycle, update provision, discard provisions, compensate, agreement, send claim acceptance letter, consult policy)

A number of harmful process deviations and related risks can be

identiﬁed regarding these activities:

◾Forgetting to discard provisions

◾Multiple partial compensations (exceeding limit)

◾Collusion between administrator and experts

◾Lack of approval cycle

◾Suboptimal task allocation

◾Fraud investigation activities

◾Processing rejected claim

◾Forced claim acceptance, absence of a timely primary evaluation

Deviations marked in bold may relate to and therefore indicate fraud. By

adopting business policies as a governance instrument and prescribing

procedures and guidelines, the insurer may reduce the risks involved in

processing the insurance claims. For instance:

Business policy excerpt 1 (customer relationship

management related): If the insured requires immediate assistance (e.g.,

to prevent the development of additional damage), arrangements will be made

for a single partial advanced compensation (maximum x% of expected

covered loss).

◾Potential risk: The expected (covered) loss could be exceeded

through partial advanced compensations.

Business policy excerpt 2 (avoid ﬁnancial loss): Settlements

need to be approved.

◾Potential risk: Collusion between the drafter of the settlement and

the insured

Business policy excerpt 3 (avoid ﬁnancial loss): The proposal of

a settlement and its approval must be performed by different actors.

◾Potential risk: A person might hold both the team-leader and the

expert role in the information system.

Business policy excerpt 4 (avoid ﬁnancial loss): After approval of

the decision (settlement or claim rejection) no changes may occur.

◾Potential risk: The modiﬁer and the insured might collude.

◾Potential risk: A rejected claim could undergo further processing.

Since detecting fraud based on speciﬁed business rules requires prior

knowledge of the fraud scheme, the existence of fraud issues will be the direct

result of either:

◾An inadequate internal control system (controls fail to prevent

fraud); or

◾Risks accepted by the management (no preventive or corrective

controls are in place).

Some examples of fraud-detection rules that can be derived from these

business policy excerpts and process deviations, and that may be added to the

fraud-detection rule engine are as follows:

Business policy excerpt 1:

◾IF multiple advanced payments for one claim, THEN suspicious

case.

Business policy excerpt 2:

◾IF settlement was not approved before it was paid, THEN suspi-

cious case.

Business policy excerpt 3:

◾IF settlement is proposed AND approved by the same person, THEN

suspicious case.

Business policy excerpt 4:

◾IF settlement is approved AND changed afterward, THEN suspi-

cious case.

◾IF claim is rejected AND processed afterward (e.g., look for a set-

tlement proposal, payment,…activity), THEN suspicious case.
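
The five detection rules above can be sketched as checks over the event log of a single claim. The activity names (`advance_payment`, `propose_settlement`, `approve_settlement`, `change_settlement`, `pay_settlement`, `reject_claim`), the log layout, and the function name `check_claim` are hypothetical; a real claim-handling system will use its own event vocabulary.

```python
def check_claim(events):
    """Apply the five rules to one claim.

    events: chronologically ordered list of (activity, actor) tuples.
    Returns the list of triggered alerts (an empty list means not suspicious).
    """
    alerts = []
    activities = [activity for activity, _ in events]

    def first_index(activity):
        return activities.index(activity) if activity in activities else None

    # Excerpt 1: multiple advanced payments for one claim.
    if activities.count("advance_payment") > 1:
        alerts.append("multiple advanced payments")

    # Excerpt 2: settlement paid without a prior approval.
    paid = first_index("pay_settlement")
    approved = first_index("approve_settlement")
    if paid is not None and (approved is None or approved > paid):
        alerts.append("settlement paid before approval")

    # Excerpt 3: settlement proposed and approved by the same person.
    proposers = {actor for a, actor in events if a == "propose_settlement"}
    approvers = {actor for a, actor in events if a == "approve_settlement"}
    if proposers & approvers:
        alerts.append("settlement proposed and approved by same person")

    # Excerpt 4: settlement changed after approval.
    if approved is not None and "change_settlement" in activities[approved + 1:]:
        alerts.append("settlement changed after approval")

    # Excerpt 4: rejected claim processed afterward.
    rejected = first_index("reject_claim")
    follow_up = {"propose_settlement", "pay_settlement", "advance_payment"}
    if rejected is not None and follow_up & set(activities[rejected + 1:]):
        alerts.append("rejected claim processed afterward")

    return alerts


# Hypothetical event log for one claim.
log = [
    ("register_claim", "alice"),
    ("advance_payment", "alice"),
    ("propose_settlement", "bob"),
    ("approve_settlement", "bob"),
    ("change_settlement", "bob"),
    ("pay_settlement", "carol"),
]
alerts = check_claim(log)
print(alerts)
```

Each rule only inspects the order and the actors of logged activities, so the same checks can be run retrospectively over historical claims once a new fraud mechanism has been understood.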

BIG DATA FOR FRAUD DETECTION

When fraudulent activities have been detected and conﬁrmed to effec-

tively concern fraud, two types of measures are typically taken:

1. Corrective measures, that aim to resolve the fraud and correct the

wrongful consequences—for instance by means of pursuing

restitution or compensation for the incurred losses. These cor-

rective measures might also include actions to retrospectively

detect and subsequently address similar fraud cases that made

use of the same mechanism or loopholes in the fraud detection

and prevention system the organization has in place.

2. Preventive measures, which may include both actions that aim at

preventing future fraud by the caught fraudster (e.g., by terminating

a contractual agreement with a customer) and actions that aim at

preventing fraud of the same type by other individuals. When an

expert-based approach is adopted, an example preventive measure is

to extend the rule engine by incorporating additional rules that

allow detecting the uncovered fraud mechanism and preventing it from

being applied in the future. A fraud case must be investigated

thoroughly so the underlying mechanism can be unraveled, extending

the available expert knowledge. Adjusting the detection and

prevention system accordingly makes the organization more robust

and less vulnerable to fraud, and prevents the fraud mechanism from

being used again in the future.

Typically, the sooner corrective measures are taken and therefore

the sooner fraud is detected, the more effective such measures may

be and the more losses can be avoided or recompensed. On the other

hand, fraud becomes easier to detect the more time has passed, for a

number of particular reasons.

When a fraud mechanism or path exists—meaning a loophole in the

detection and prevention system of an organization—the number of

times this path will be followed (i.e., the fraud mechanism used) grows

over time, and with it the number of occurrences of this particular

type of fraud. The more often a fraud path is taken, the more apparent

it becomes and typically, in fact statistically, the easier it is to detect.

The number of occurrences of a particular type of fraud can be expected to grow

since many fraudsters appear to be repeat offenders. As the expression

goes, “Once a thief, always a thief.” Moreover, a fraud mechanism may

well be discovered by several individuals or the knowledge shared

between fraudsters. As will be shown in Chapter 5 on social network

analytics for fraud detection, certainly some types of fraud tend to

spread virally and display what are called social network effects,

indicating that fraudsters share their knowledge on how to commit

fraud. This effect, too, leads to a growing number of occurrences and,

therefore, a higher risk or chance, depending on one’s perspective, of

detection.

Once a case of a particular type of fraud has been revealed, this will

lead to the exposure of similar fraud cases that were committed in the

past and made use of the same mechanism. Typically, a retrospective

screening is performed to assess the size or impact of the newly detected

type of fraud, as well as to resolve (by means of corrective measures,

cf. supra) as many fraud cases as possible. As such, fraud becomes

easier to detect the more time has passed, since more similar fraud

cases will occur in time, increasing the probability that the particular

fraud type will be uncovered, as well as because fraudsters committing

repeated fraud will increase their individual risk of being exposed. The

individual risk will increase the more fraud a fraudster commits for the

same basic reason: The chances of getting noticed get larger.
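
The claim that repeated fraud raises the risk of exposure can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that every fraudulent act is detected independently with the same probability `p_single`; this independence assumption, the function name, and the example value of 2 percent are not taken from the text.

```python
def prob_detected_at_least_once(p_single: float, n_acts: int) -> float:
    """Probability that at least one of n_acts fraudulent acts is detected.

    Hypothetical model: acts are independent and each is detected with
    probability p_single, so P = 1 - (1 - p_single) ** n_acts.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie in [0, 1]")
    if n_acts < 0:
        raise ValueError("n_acts must be non-negative")
    return 1.0 - (1.0 - p_single) ** n_acts


# Illustrative values only: a 2 percent chance of detection per act.
for n in (1, 10, 50, 100):
    print(n, round(prob_detected_at_least_once(0.02, n), 3))
```

Under these assumptions the probability of exposure rises steadily with the number of acts, which is the intuition behind the statement that the chances of getting noticed get larger.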

A final reason why fraud becomes easier to detect the more time

has passed is that better detection techniques are being developed,

are becoming readily available, and are being implemented and applied by

a growing number of organizations. An important driver for improvements

with respect to detection techniques is growing data availability.

The informatization and digitalization of almost every aspect of society

and daily life leads to an abundance of available data. This so-called big

data can be explored and exploited for a range of purposes including

fraud detection (Baesens 2014), at a very low cost.

DATA-DRIVEN FRAUD DETECTION

Although classic, expert-based fraud-detection approaches as discussed

before are still in widespread use and deﬁnitely represent a good start-

ing point and complementary tool for an organization to develop an

effective fraud-detection and prevention system, a shift is taking place

toward data-driven or statistically based fraud-detection methodolo-

gies for three apparent reasons:

1. Precision. Statistically based fraud-detection methodologies offer

an increased detection power compared to classic approaches.

By processing massive volumes of information, fraud patterns

may be uncovered that are not sufﬁciently apparent to the

human eye. It is important to notice that the improved power

of data-driven approaches over human processing can be

observed in similar applications such as credit scoring or cus-

tomer churn prediction. Most organizations only have a limited

capacity to have cases checked by an inspector to conﬁrm

whether or not the case effectively concerns fraud. The goal

of a fraud-detection system may be to make optimal

use of the limited available inspection capacity, or in other

words to maximize the fraction of fraudulent cases among

the inspected cases (and possibly in addition, the detected

amount of fraud). A system with higher precision, as delivered

by data-driven methodologies, directly translates into a higher

fraction of fraudulent inspected cases.

2. Operational efficiency. In certain settings, there is an increasing

number of cases to be analyzed, requiring an automated process

as offered by data-driven fraud-detection methodologies.

Moreover, in several applications, operational requirements

exist, imposing time constraints on the processing of a case.

For instance, when evaluating a transaction with a credit

card, an almost immediate decision is required on whether to

approve the transaction or block it because of suspicion of fraud.

Another example concerns fraud detection for customs in a

harbor, where a decision has to be made within a conﬁned

time window whether to let a container pass and be shipped

inland, or whether to further inspect it, possibly causing delays.

Automated data-driven approaches offer such functionality and

are able to comply with stringent operational requirements.

3. Cost efﬁciency. As already mentioned in the previous section,

developing and maintaining an effective and lean expert-based

fraud-detection system is both challenging and labor intensive.

A more automated and, as such, more efﬁcient approach to

develop and maintain a fraud-detection system, as offered

by data-driven methodologies, is preferred. Chapters 6 and

7 discuss the cost efﬁciency and return on investment of

data-driven fraud-detection models.
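
The precision argument in the first item of the list above can be sketched in a few lines: with a limited inspection capacity, only the highest-scored cases are inspected, and precision is the fraction of those that turn out to be fraudulent. The function name, the scores, and the labels below are hypothetical and serve only as an illustration.

```python
def precision_at_capacity(scores, is_fraud, capacity):
    """Fraction of fraudulent cases among the `capacity` highest-scored cases."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    ranked = sorted(zip(scores, is_fraud), key=lambda pair: pair[0], reverse=True)
    inspected = ranked[:capacity]
    if not inspected:
        raise ValueError("no cases to inspect")
    return sum(1 for _, fraud in inspected if fraud) / len(inspected)


# Hypothetical fraud scores and (later confirmed) labels for eight cases.
scores = [0.91, 0.15, 0.78, 0.40, 0.05, 0.66, 0.30, 0.85]
labels = [True, False, True, False, False, False, False, True]

base_rate = sum(labels) / len(labels)  # precision of inspecting at random
print("random inspection:", base_rate)
print("top 3 by score:   ", precision_at_capacity(scores, labels, 3))
print("top 4 by score:   ", precision_at_capacity(scores, labels, 4))
```

In this toy data set, inspecting at random yields 3 fraud cases in 8, whereas inspecting the three highest-scored cases yields only fraud cases; this is the sense in which higher precision makes better use of a fixed inspection capacity.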

An additional driver for the development of improved fraud-

detection technologies concerns the growing amount of interest that

fraud detection is attracting from the general public, the media, gov-

ernments, and enterprises. This increasing awareness and attention

for fraud is likely due to its large negative social as well as ﬁnancial

impact, and leads to growing investments and research into the

matter from academia, industry, and government alike.

Although fraud-detection approaches have gained signiﬁcant

power over the past years by adopting potent statistically based

methodologies and by analyzing massive amounts of data in order

to discover fraud patterns and mechanisms, fraud still remains hard

to detect. It appears that the Pareto principle, or the principle of

decreasing returns, holds with respect to the required effort and

difficulty of detecting fraud. In order to explain the hardness and complexity of

the problem, it is important to acknowledge the fact that fraud is a

dynamic phenomenon, meaning that its nature changes in time. Not

only do fraud-detection mechanisms evolve, but fraudsters also adapt

their approaches and are inventive in ﬁnding more reﬁned and less

apparent ways to commit fraud without being exposed. Fraudsters

probe fraud-detection and prevention systems to understand their

functioning and to discover their weaknesses, allowing them to adapt

their methods and strategies.

FRAUD-DETECTION TECHNIQUES

Indeed, fraudsters develop advanced strategies to cleverly cover their

tracks in order to avoid being uncovered. Fraudsters tend to try and

blend into their surroundings as much as possible. Such an approach

is reminiscent of camouflage techniques as used by the military or by animals

such as chameleons and stick insects. This is clearly no fraud by

opportunity, but rather, is carefully planned, leading to a need for new

techniques that are able to detect and address patterns that initially

seem to comply with normal behavior, but in reality instigate fraudu-

lent activities.

Detection mechanisms based on unsupervised learning techniques

or descriptive analytics, as discussed in Chapter 3, typically aim at

ﬁnding behavior that deviates from normal behavior, or in other words

at detecting anomalies. These techniques learn from historical obser-

vations, and are called unsupervised since they do not require these

observations to be labeled as either a fraudulent or a nonfraudulent

example case. An example of behavior that does not comply with

normal behavior in a telecommunications subscription fraud setting

is provided by the transaction data set with call detail records of a

particular subscriber shown in Table 1.2 (Fawcett and Provost 1997).

Note that the calls found to be fraudulent (last column in the table

indicating bandit) are not suspicious by themselves; however, they

deviate from normal behavior for this particular subscriber.

Outlier-detection techniques have great value and allow detecting

a signiﬁcant fraction of fraudulent cases. In particular, they might

allow detecting fraud that is different in nature from historical fraud,

or in other words fraud that makes use of new, unknown mechanisms

resulting in a novel fraud pattern. These new patterns are not discovered

by expert systems, and as such descriptive analytics may be a first

complementary tool to be adopted by an organization in order to

improve its expert rule–based fraud-detection system.

Table 1.2 Call Detail Records of a Customer with Outliers Indicating Suspicious Activity (deviating behavior starting at a certain moment in time) at the Customer Subscription (Fawcett and Provost 1997)

| Date (m/d) | Time | Day | Duration | Origin | Destination | Fraud |
|---|---|---|---|---|---|---|
| 1/01 | 10:05:01 | Mon | 13 mins | Brooklyn, NY | Stamford, CT | |
| 1/05 | 14:53:27 | Fri | 5 mins | Brooklyn, NY | Greenwich, CT | |
| 1/08 | 09:42:01 | Mon | 3 mins | Bronx, NY | White Plains, NY | |
| 1/08 | 15:01:24 | Mon | 9 mins | Brooklyn, NY | Brooklyn, NY | |
| 1/09 | 15:06:09 | Tue | 5 mins | Manhattan, NY | Stamford, CT | |
| 1/09 | 16:28:50 | Tue | 53 sec | Brooklyn, NY | Brooklyn, NY | |
| 1/10 | 01:45:36 | Wed | 35 sec | Boston, MA | Chelsea, MA | Bandit |
| 1/10 | 01:46:29 | Wed | 34 sec | Boston, MA | Yonkers, MA | Bandit |
| 1/10 | 01:50:54 | Wed | 39 sec | Boston, MA | Chelsea, MA | Bandit |
| 1/10 | 11:23:28 | Wed | 24 sec | White Plains, NY | Congers, NY | |
| 1/11 | 22:00:28 | Thu | 37 sec | Boston, MA | East Boston, MA | Bandit |
| 1/11 | 22:04:01 | Thu | 37 sec | Boston, MA | East Boston, MA | Bandit |
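
The idea of flagging calls that deviate from a subscriber's own normal behavior can be sketched with the records of Table 1.2. The rule used here (flag a call whose origin state does not occur in the subscriber's first few calls) is a deliberately naive assumption chosen for illustration; it is not the method of Fawcett and Provost, and the function names are hypothetical. The fraud labels are carried along only to check the result and are not used by the detector.

```python
# Records from Table 1.2: (date, time, origin, labeled as bandit in the table).
calls = [
    ("1/01", "10:05:01", "Brooklyn, NY", False),
    ("1/05", "14:53:27", "Brooklyn, NY", False),
    ("1/08", "09:42:01", "Bronx, NY", False),
    ("1/08", "15:01:24", "Brooklyn, NY", False),
    ("1/09", "15:06:09", "Manhattan, NY", False),
    ("1/09", "16:28:50", "Brooklyn, NY", False),
    ("1/10", "01:45:36", "Boston, MA", True),
    ("1/10", "01:46:29", "Boston, MA", True),
    ("1/10", "01:50:54", "Boston, MA", True),
    ("1/10", "11:23:28", "White Plains, NY", False),
    ("1/11", "22:00:28", "Boston, MA", True),
    ("1/11", "22:04:01", "Boston, MA", True),
]


def origin_state(origin):
    """Extract the state abbreviation from a 'City, ST' string."""
    return origin.rsplit(",", 1)[1].strip()


def flag_deviating_calls(records, n_profile):
    """Profile the first n_profile calls; flag later calls from unseen states."""
    known_states = {origin_state(rec[2]) for rec in records[:n_profile]}
    return [rec for rec in records[n_profile:]
            if origin_state(rec[2]) not in known_states]


flagged = flag_deviating_calls(calls, n_profile=6)
for date, time, origin, _ in flagged:
    print(date, time, origin)
```

On this small example the rule flags exactly the five calls originating in Boston. A realistic profile would combine many more characteristics (time of day, duration, destination), and a rule this simple would be easy for a fraudster to evade.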

Descriptive techniques, however, prove to be prone to deception,

precisely by the camouflage-like fraud strategies already discussed.

Therefore, the detection system can be further improved by complementing

it with a tool that is able to unmask fraudsters adopting a

camouflage-like technique.

To that end, in Chapter 4, a second type of technique is introduced.

Supervised learning techniques or predictive analytics aim to learn

from historical information or observations in order to retrieve

patterns that allow differentiating between normal and fraudulent

behavior. These techniques aim precisely at finding silent alarms, the

parts of their tracks that fraudsters cannot cover up. Supervised

learners can be applied to predict or detect fraud as well as to estimate

the amount of fraud.

Predictive analytics has limitations as well, probably the most

important one being that it needs historical examples to learn from

(i.e., a labeled data set of historically observed fraud behavior). This

reduces its detection power with respect to drastically different

fraud types making use of new mechanisms or methods, and which

have not been detected thus far and are therefore not included in the

historical database of fraud cases from which the predictive model

was learned. As already discussed, descriptive analytics may perform

better with respect to detecting such new fraud mechanisms, at

least if a new fraud mechanism leads to detectable deviations from

normality. This illustrates the complementarity of supervised and

unsupervised methods and motivates the use of both types of methods

as complementary tools in developing a powerful fraud-detection and

prevention system.

A third type of complementary tool concerns social network

analysis, which further extends the abilities of the fraud-detection

system by learning and detecting characteristics of fraudulent behavior

in a network of linked entities. Social network analytics is the newest

tool in our toolbox to fight fraud, and proves to be a very powerful

means as will appear from the discussion and presented case study

in Chapter 5. Social network analytics allows including an extra

source of information in the analysis, being the relationships between

entities, and as such may contribute to uncovering particular patterns

indicating fraud.
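
How relationships between entities can serve as an extra source of information may be sketched as follows. The example computes, for every entity, the fraction of its direct neighbors that are already known fraudsters; this particular feature, the toy network, and all names are assumptions made for illustration and are not taken from Chapter 5.

```python
def fraudulent_neighbor_fraction(graph, known_fraud):
    """For each node, the fraction of its direct neighbors that are known fraudsters."""
    fractions = {}
    for node, neighbors in graph.items():
        if neighbors:
            hits = sum(1 for other in neighbors if other in known_fraud)
            fractions[node] = hits / len(neighbors)
        else:
            fractions[node] = 0.0  # isolated node: no relational evidence
    return fractions


# Hypothetical undirected network stored as adjacency sets.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}
known_fraud = {"B"}

features = fraudulent_neighbor_fraction(graph, known_fraud)
for node in sorted(features):
    print(node, features[node])
```

Such a network-based value could then be added as one more variable next to the usual entity-level characteristics when a descriptive or predictive model is built.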

It is important to stress that these three different types of tech-

niques may complement each other since they focus on different

aspects of fraud and are not to be considered as exclusive alternatives.

An effective fraud-detection and prevention system will make use of

and combine these different tools, which have different possibilities

and limitations and therefore reinforce each other when applied in

a combined setup. When developing a fraud-detection system, an

organization will likely follow the order in which the different tools

have been introduced; as a ﬁrst step an expert-based rule engine

may be developed, which in a second step may be complemented

by descriptive analytics, and subsequently by predictive and social

network analytics. Developing a fraud-detection system in this order

allows the organization to gain expertise and insight in a stepwise

manner, thereby facilitating each subsequent step. However, the exact order

of adopting the different techniques may depend on the characteristics

of the type of fraud an organization is faced with.

FRAUD CYCLE

Figure 1.3 introduces the fraud cycle, and depicts four essential activities:

◾ Fraud detection: Applying detection models on new, unseen

observations and assigning a fraud risk to every observation.

◾ Fraud investigation: A human expert is often required to

investigate suspicious, flagged cases given the involved subtlety

and complexity.

◾ Fraud confirmation: Determining the true fraud label, possibly

involving field research.

◾ Fraud prevention: Preventing fraud from being committed in the

future. This might even result in detecting fraud before

the fraudster knows s/he will commit fraud, which is exactly the

premise of the 1956 science fiction short story The Minority Report by

Philip K. Dick.

[Figure 1.3 The Fraud Cycle: fraud detection, fraud investigation, fraud confirmation, and fraud prevention; the legend distinguishes the automated detection algorithm from the current process.]

Note the feedback loop in Figure 1.3 from the fraud confirmation

activity toward the fraud-detection activity. Newly detected cases

should be added (as soon as possible!) to the database of historical

fraud cases, which is used to learn or induce the detection model.

The fraud-detection model may not be retrained every time a new

case is confirmed; however, a regular update of the model is advisable

given the dynamic nature of fraud and the importance of

detecting fraud as soon as possible. The required frequency of retrain-

ing or updating the detection model depends on several factors:

◾ The volatility of the fraud behavior

◾ The detection power of the current model, which is related to

the volatility of the fraud behavior

◾ The number of (similar) confirmed cases already available in the

database

◾ The rate at which new cases are being confirmed

◾ The required effort to retrain the model

Depending on the emerging need for retraining as determined

by these factors, as well as possible additional factors, an automated

approach such as reinforcement learning may be considered which

continuously updates the detection model by learning from the

newest observations.
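
What it means for a detection model to be updated continuously from the newest observations can be sketched with a model that is adjusted after every confirmed case. Note the substitution: the sketch uses plain online logistic regression trained by stochastic gradient descent, which is an incremental learning technique and not reinforcement learning proper; it was chosen only because it is short. The class name, learning rate, and two-variable toy stream are assumptions.

```python
import math


class OnlineLogisticModel:
    """Logistic regression updated one confirmed case at a time (SGD)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        """Estimated probability that the case described by x is fraudulent."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, is_fraud):
        """One gradient step on the log loss for a newly confirmed case."""
        error = self.predict_proba(x) - (1.0 if is_fraud else 0.0)
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error


# Hypothetical stream: the first variable signals fraud, the second does not.
model = OnlineLogisticModel(n_features=2)
for _ in range(200):
    model.update([1.0, 0.0], is_fraud=True)
    model.update([0.0, 1.0], is_fraud=False)

print(model.predict_proba([1.0, 0.0]), model.predict_proba([0.0, 1.0]))
```

Whether such per-case updating is preferable to periodic retraining depends on the factors listed above, in particular the volatility of the fraud behavior and the rate at which cases are confirmed.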

EXAMPLE CASE: SUPERVISED AND UNSUPERVISED LEARNING FOR DETECTING CREDIT CARD FRAUD

In order to ﬁght fraud and given the abundant data availability, credit card

companies have been among the early adopters of big data approaches to

develop effective fraud-detection and prevention systems. A typical credit card

transaction is registered in the systems of the credit card company by logging

up to a hundred or more characteristics describing the details of a transaction.

Table 1.3 provides for illustrative purposes a number of such characteristics

or variables that are being captured (Hand 2007).

Table 1.3 Example Credit Card Transaction Data Fields

| Transaction ID | Transaction type | Date of transaction |
| Time of transaction | Amount | Currency |
| Local currency amount | Merchant ID | Merchant category |
| Card issuer ID | ATM ID | Cheque account prefix |

By logging this information over a period of time, a dataset is being

created that allows applying descriptive analytics. This includes outlier

detection techniques, which allow detecting abnormal or anomalous behavior

and/or characteristics in a data set. So-called outliers may indicate suspicious

activities, and may occur at the data item level or the data set level.

Figure 1.4 provides an illustration of outliers at the data item level, in this

example transactions that deviate from the normal behavior by a customer.

The scatter plot clearly shows three clusters of regular, frequently occurring

types as characterized by the time and place dimension of transactions for one

particular customer, as well as two deviating transactions marked in black.

These outliers are suspicious and possibly concern fraudulent transactions,

and therefore may be ﬂagged for further human investigation.

An outlier at the data set level means that the behavior of a person or

instance does not comply with the overall behavior. Figure 1.5 plots the age

and income characteristics of customers as provided when applying for a

credit card. The two outliers marked in black in the plot may indicate so-called

subscription fraud (cf. Table 1.1, deﬁnition of credit card fraud), since these

combinations of age and income strongly deviate from the normal behavior.

[Figure 1.4 Outlier Detection at the Data Item Level: scatter plot titled "Time versus Place"; axes: time (0 to 160) and place.]

[Figure 1.5 Outlier Detection at the Data Set Level: scatter plot titled "Income versus Age"; axes: age (0 to 70) and income (500 to 4500).]
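
Outlier detection at the data set level, as described for the age and income characteristics, can be sketched with a simple per-variable z-score rule. This rule, the threshold of 2, and the applicant data are assumptions for illustration; the text does not prescribe a specific technique here, and a z-score per variable only catches values that are extreme on at least one variable, not unusual combinations of otherwise ordinary values.

```python
from statistics import mean, pstdev


def zscore_outliers(rows, threshold=2.0):
    """Return the rows in which at least one variable has |z-score| > threshold."""
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    outliers = []
    for row in rows:
        z_values = [abs(value - m) / s if s > 0 else 0.0
                    for value, (m, s) in zip(row, stats)]
        if max(z_values) > threshold:
            outliers.append(row)
    return outliers


# Hypothetical (age, income) pairs as provided on credit card applications.
applicants = [
    (25, 1500), (30, 1800), (35, 2200), (40, 2500), (45, 2700),
    (50, 3000), (28, 1700), (33, 2000), (19, 4400),
]

print(zscore_outliers(applicants))
```

In this toy data set only the 19-year-old applicant reporting an income of 4400 is flagged. Such a case would be passed on for human investigation rather than treated as confirmed fraud.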

The addition of a ﬁeld in the available transaction data set that indicates

whether a transaction was fraudulent allows predictive analytics to be applied

to yield a model predicting or classifying an instance as being fraudulent or

not. As will be discussed in the next section as well as in Chapter 4, such

models may be interpreted to understand the underlying credit card fraud

behavior patterns that lead the model to predict whether a transaction might be

fraudulent. Such patterns may be:

◾ Small purchase followed by a big one

◾ Large number of online purchases in a short period

◾ Spending as much as possible as quickly as possible

◾ Spending smaller amounts, spread across time

Such patterns may be harder to detect and concern advanced methods

adopted by fraudsters and developed precisely to avoid detection.
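
Two of the patterns listed above can be turned into variables that a predictive model could use. The sketch below is one possible encoding; the function names, the amount thresholds, and the one-hour window are hypothetical choices, not values from the text.

```python
from datetime import datetime, timedelta


def small_then_big(amounts, small_max=5.0, big_min=500.0):
    """True if a purchase <= small_max is directly followed by one >= big_min."""
    return any(a <= small_max and b >= big_min
               for a, b in zip(amounts, amounts[1:]))


def max_purchases_in_window(timestamps, window=timedelta(hours=1)):
    """Largest number of purchases within any time span of length `window`."""
    times = sorted(timestamps)
    best = 0
    start = 0
    for end, current in enumerate(times):
        # Slide the left edge until the span fits inside the window again.
        while current - times[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


# Hypothetical purchases on one card.
amounts = [1.0, 900.0, 20.0]
times = [
    datetime(2015, 1, 10, 10, 0),
    datetime(2015, 1, 10, 10, 10),
    datetime(2015, 1, 10, 10, 20),
    datetime(2015, 1, 10, 12, 0),
]

print(small_then_big(amounts), max_purchases_in_window(times))
```

Variables of this kind only become useful for supervised learning once the transactions also carry the fraud indicator field mentioned above.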

THE FRAUD ANALYTICS PROCESS MODEL

Figure 1.6 provides a high-level overview of the analytics process

model (Han and Kamber 2011; Hand, Mannila, and Smyth 2001; Tan,

Steinbach, and Kumar 2005). As a first step, a thorough definition of

the business problem to be solved with analytics is needed. Next, all

source data must be identiﬁed that could be of potential interest. This

is a very important step, as data are the key ingredient to any analytical

exercise and the selection of data will have a decisive impact on

the analytical models that will be built in a subsequent step. All data

will then be gathered in a staging area that could be a data mart or data

warehouse. Some basic exploratory analysis can be considered here

using for instance OLAP (online analytical processing, see Chapter 3)

facilities for multidimensional data analysis (e.g., roll-up, drill down,

slicing and dicing). This will be followed by a data-cleaning step to

get rid of all inconsistencies, such as missing values and duplicate

data. Additional transformations may also be considered, such as

binning, alphanumeric to numeric coding, geographical aggregation,

and so on. In the analytics step, an analytical model will be estimated

on the preprocessed and transformed data. In this stage, the actual

fraud-detection model is built. Finally, once the model has been built,

it will be interpreted and evaluated by the fraud experts.

[Figure 1.6 The Fraud Analytics Process Model: Identify Business Problem, Identify Data Sources, Select the Data, Clean the Data, Transform the Data, Analyze the Data, and Interpret, Evaluate, and Deploy the Model, grouped into the phases Preprocessing, Analytics, and Post-processing.]
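
The cleaning and transformation steps named above (duplicates, missing values, binning, alphanumeric to numeric coding) can be sketched on a handful of records. The record layout, the median imputation, and the bin edges are assumptions for illustration; Chapter 2 of the book discusses preprocessing in detail, and this sketch does not claim to reproduce it.

```python
from statistics import median


def clean_and_transform(transactions, bin_edges=(50.0, 500.0)):
    """Deduplicate, impute missing amounts, bin amounts, and code the channel."""
    # 1. Remove duplicate records (same transaction ID), keeping the first.
    seen, unique = set(), []
    for tx in transactions:
        if tx["id"] not in seen:
            seen.add(tx["id"])
            unique.append(dict(tx))  # copy, so the input is left untouched

    # 2. Impute missing amounts with the median of the observed amounts.
    observed = [tx["amount"] for tx in unique if tx["amount"] is not None]
    if not observed:
        raise ValueError("no observed amounts to impute from")
    fill = median(observed)

    codes = {}
    for tx in unique:
        if tx["amount"] is None:
            tx["amount"] = fill
        # 3. Binning: number of bin edges that the amount exceeds.
        tx["amount_bin"] = sum(tx["amount"] > edge for edge in bin_edges)
        # 4. Alphanumeric to numeric coding, in order of first appearance.
        tx["channel_code"] = codes.setdefault(tx["channel"], len(codes))
    return unique


# Hypothetical raw records with one duplicate and one missing amount.
raw = [
    {"id": 1, "amount": 20.0, "channel": "online"},
    {"id": 2, "amount": None, "channel": "store"},
    {"id": 2, "amount": None, "channel": "store"},
    {"id": 3, "amount": 800.0, "channel": "online"},
    {"id": 4, "amount": 120.0, "channel": "atm"},
]

for row in clean_and_transform(raw):
    print(row)
```

The output of such a step is the preprocessed and transformed data set on which the analytical model is then estimated.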

Trivial patterns that may be detected by the model, for instance

similar to expert rules, are interesting as they provide some valida-

tion of the model. But of course, the key issue is to ﬁnd the unknown

yet interesting and actionable patterns (sometimes also referred to as

knowledge diamonds) that can provide added insight and detection

power. Once the analytical model has been appropriately validated and

approved, it can be put into production as an analytics application (e.g.,

decision support system, scoring engine). Important to consider here

is how to represent the model output in a user-friendly way, how to

integrate it with other applications (e.g., detection and prevention sys-

tem, risk engines), and how to make sure the analytical model can

be appropriately monitored and backtested on an ongoing basis. These

post-processing steps will be discussed in detail in Chapter 6.

It is important to note that the process model outlined in Figure 1.6

is iterative in nature in the sense that one may have to go back to

previous steps during the exercise. For example, during the analyt-

ics step, the need for additional data may be identiﬁed, which may

necessitate additional cleaning, transformation, and so on. The most

time-consuming step typically is the data selection and preprocessing

step, which usually takes around 80 percent of the total effort needed

to build an analytical model.

A fraud-detection model must be thoroughly evaluated before

being adopted. Depending on the exact setting and usage of the model,

different aspects may need to be assessed during evaluation in order

to ensure that the model is acceptable for implementation. Table 1.4

reviews several key characteristics of successful fraud analytics models

that may or may not apply, depending on the exact application.
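
Among the statistical evaluation criteria that Table 1.4 names, the AUC can be computed directly from its probabilistic interpretation: the probability that a randomly chosen fraudulent case receives a higher score than a randomly chosen nonfraudulent case. The sketch below uses this pairwise definition, which is simple but quadratic in the number of cases; the scores and labels are hypothetical.

```python
def auc(scores, is_fraud):
    """AUC as the fraction of (fraud, nonfraud) pairs ranked correctly.

    Ties between a fraud and a nonfraud score count as half a correct pair.
    """
    fraud_scores = [s for s, y in zip(scores, is_fraud) if y]
    legit_scores = [s for s, y in zip(scores, is_fraud) if not y]
    if not fraud_scores or not legit_scores:
        raise ValueError("need at least one fraud and one nonfraud case")
    wins = sum(1.0 if f > g else 0.5 if f == g else 0.0
               for f in fraud_scores for g in legit_scores)
    return wins / (len(fraud_scores) * len(legit_scores))


# Hypothetical model scores and confirmed labels for eight cases.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, False, True, False, False, False, False]

print(round(auc(scores, labels), 4))
```

A value of 0.5 corresponds to random scoring and 1.0 to a perfect ranking; in this example 14 of the 15 fraud and nonfraud pairs are ordered correctly.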

Table 1.4 Key Characteristics of Successful Fraud Analytics Models

| Characteristic | Description |
|---|---|
| Statistical accuracy | Refers to the detection power and the correctness of the statistical model in flagging cases as being suspicious. Several statistical evaluation criteria exist and may be applied to evaluate this aspect, such as the hit rate, lift curves, AUC, etc. A number of suitable measures will be discussed in detail in Chapter 4. Statistical accuracy may also refer to statistical significance, meaning that the patterns that have been found in the data have to be real and not the consequence of coincidence. In other words, we need to make sure that the model generalizes well and is not overfitted to the historical data set. |
| Interpretability | When a deeper understanding of the detected fraud patterns is required, for instance to validate the model before it is adopted for use, a fraud-detection model may have to be interpretable. This aspect involves a certain degree of subjectivism, since interpretability may depend on the user’s knowledge. The interpretability of a model depends on its format, which, in turn, is determined by the adopted analytical technique. Models that allow the user to understand the underlying reasons why the model signals a case to be suspicious are called white-box models, whereas complex incomprehensible mathematical models are often referred to as black-box models. It may well be in a fraud-detection setting that black-box models are acceptable, although in most settings some level of understanding and in fact validation which is facilitated by interpretability is required for the management to have confidence and allow the effective operationalization of the model. |
| Operational efficiency | Operational efficiency refers to the time that is required to evaluate the model, or in other words, the time required to evaluate whether a case is suspicious or not. When cases need to be evaluated in real time, for instance to signal possible credit card fraud, operational efficiency is crucial and is a main concern during model performance assessment. Operational efficiency also entails the efforts needed to collect and preprocess the data, evaluate the model, monitor and backtest the model, and reestimate it when necessary. |
| Economical cost | Developing and implementing a fraud-detection model involves a significant cost to an organization. The total cost includes the costs to gather, preprocess, and analyze the data, and the costs to put the resulting analytical models into production. In addition, the software costs, as well as human and computing resources, should be taken into account. Possibly also external data has to be bought to enrich the available in-house data. Clearly, it is important to perform a thorough cost-benefit analysis at the start of the project, and to gain insight in the constituent factors of the returns on investment of building an advanced fraud-detection system. |
| Regulatory compliance | Depending on the context there may be internal or organization-specific and external regulation and legislation that applies to the development and application of a model. Clearly, a fraud-detection model should be in line and comply with all applicable regulation and legislation, for instance with respect to privacy, the use of cookies in a web-browser, etc. |

A number of particular challenges may present themselves when

developing and implementing a fraud-detection model, possibly

leading to difficulties in meeting the objectives as expressed by the

characteristics discussed in Table 1.4. A first key challenge concerns

the dynamic nature of fraud. Fraudsters constantly try to beat

detection and prevention systems by developing new strategies and

methods. Therefore, adaptive analytical models and detection and

prevention systems are required, in order to detect and resolve fraud

as soon as possible. Detecting fraud as early as possible is crucial, as

discussed before.

Clearly, it is also crucial to detect fraud as accurately as possible,

and not to miss out on too many fraud cases, especially on fraud cases

involving a large amount or ﬁnancial impact. The cost of missing a

fraudulent case or a fraud mechanism may be signiﬁcant. Related to

having good detection power is the requirement of having at the same

time a low false alarm rate, since we also want to avoid harassing

good customers and prevent accounts or transactions from being blocked

unnecessarily.

In developing analytical models with good detection power and

low false alarm rate, an additional difficulty concerns the skewness

of the data, meaning that we typically have plenty of historical

examples of nonfraudulent cases, but only a limited number of

fraudulent cases. For instance, in a credit card fraud setting, typically

less than 0.5 percent of transactions are fraudulent. Such a problem is

commonly referred to as a needle-in-a-haystack problem and might

cause an analytical technique to experience difﬁculties in learning an

accurate model. A number of approaches to address the skewness

of the data will be discussed in Chapter 4.
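
Why such a skewed class distribution makes it hard to learn and to evaluate a model can be shown with a worked example. The sketch assumes 10,000 transactions of which exactly 0.5 percent are fraudulent (the upper bound quoted above, used here as an illustrative figure) and evaluates a trivial model that never flags anything.

```python
def accuracy_and_recall(predicted, actual):
    """Overall accuracy and the fraction of fraud cases that were flagged."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    true_positives = sum(p and a for p, a in zip(predicted, actual))
    n_fraud = sum(actual)
    recall = true_positives / n_fraud if n_fraud else float("nan")
    return correct / len(actual), recall


# Hypothetical data set: 50 fraudulent and 9,950 legitimate transactions.
actual = [True] * 50 + [False] * 9950
always_legit = [False] * len(actual)  # a "model" that never raises an alarm

accuracy, recall = accuracy_and_recall(always_legit, actual)
print(accuracy, recall)
```

The trivial model is correct on 99.5 percent of the transactions while detecting no fraud at all, which is why plain accuracy is a misleading yardstick in this setting.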

Depending on the exact application, operational efficiency

may also be a key requirement, meaning that the fraud-detection system

might only have a limited amount of time available to reach a decision

and let a transaction pass or not. As an example, in a credit card

fraud-detection setting, the decision time typically has to be less than

eight seconds. Such a requirement clearly impacts the design of the

operational IT systems, but also the design of the analytical model.

The analytical model should not take too long to be evaluated, and

the information or the variables that are used by the model should not

take too long to be gathered or calculated. Calculating trend variables

in real time, for instance, may not be feasible from an operational

perspective, since this takes too much valuable time. This also

relates to the ﬁnal challenge of dealing with the massive volumes of

data that are available and need to be processed.
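
One common way to keep the variables used by a model cheap to obtain at decision time is to maintain aggregates incrementally instead of recomputing them from the full history. The sketch below keeps a running mean amount per card that is updated in constant time per transaction; the class, its method names, and the ratio feature are hypothetical and are offered only as an illustration of the operational concern, not as a design from the book.

```python
class RunningCardProfile:
    """Per-card transaction count and mean amount, updated in O(1) time."""

    def __init__(self):
        self._count = {}
        self._mean = {}

    def ratio_to_mean(self, card_id, amount):
        """Amount relative to the card's historical mean; None without history."""
        mean_amount = self._mean.get(card_id)
        if not mean_amount:
            return None
        return amount / mean_amount

    def update(self, card_id, amount):
        """Fold a processed transaction into the running mean."""
        n = self._count.get(card_id, 0) + 1
        old_mean = self._mean.get(card_id, 0.0)
        self._count[card_id] = n
        self._mean[card_id] = old_mean + (amount - old_mean) / n


profile = RunningCardProfile()
profile.update("card-1", 20.0)
profile.update("card-1", 40.0)

# At decision time the feature is a dictionary lookup and one division.
print(profile.ratio_to_mean("card-1", 300.0))
```

The lookup at decision time does not depend on how many transactions the card has made, at the price of storing one small record per card.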

FRAUD DATA SCIENTISTS

Whereas in the previous section we discussed the characteristics of a

good fraud-detection model, in this section we will elaborate on the

key characteristics of a good fraud data scientist from the perspective

of the hiring manager. It is based on our consulting and research expe-

rience, having collaborated with many companies worldwide on the

topic of big data, analytics, and fraud detection.

A Fraud Data Scientist Should Have Solid Quantitative Skills

Obviously, a fraud data scientist should have a thorough background

in statistics, machine learning and/or data mining. The distinction

between these various disciplines is getting more and more blurred

and is actually not that relevant. They all provide a set of quantita-

tive techniques to analyze data and ﬁnd business relevant patterns

within a particular context such as fraud detection. A data scientist

should be aware of which technique can be applied when and how.

He/she should not focus too much on the underlying mathematical

(e.g., optimization) details but, rather, have a good understanding

of what analytical problem a technique solves, and how its results

should be interpreted. In this context, the education of engineers

in computer science and/or business/industrial engineering should

aim at an integrated, multidisciplinary view, with graduates trained

in both the use of the techniques and the business acumen

necessary to bring new endeavors to fruition. It is also important to

spend enough time validating the analytical results obtained so as

to avoid situations often referred to as data massage and/or data

torture whereby data is (intentionally) misrepresented and/or too

much time is spent discussing spurious correlations. When selecting

the optimal quantitative technique, the fraud data scientist should

take into account the speciﬁcities of the context and the problem

or fraud-detection application at hand. Typical requirements for

fraud-detection models have been discussed in the previous section

and the fraud data scientist should have a basic understanding and

feeling for all of those. Based on a combination of these requirements,

the data scientist should be capable of selecting the best analytical

technique to solve the particular business problem.

A Fraud Data Scientist Should Be a Good Programmer

By definition, data scientists work with data. This involves plenty

of activities such as sampling and preprocessing of data, model estima-

tion and post-processing (e.g., sensitivity analysis, model deployment,

backtesting, model validation). Although many user-friendly software

tools are on the market nowadays to automate and support these tasks,

every analytical exercise requires tailored steps to tackle the speciﬁci-

ties of a particular business problem and setting. In order to successfully

perform these steps, programming needs to be done. Hence, a good

data scientist should possess sound programming skills (e.g., SAS, R,

Python, etc.). The programming language itself is not that important

as such, as long as he/she is familiar with the basic concepts of pro-

gramming and knows how to use these to automate repetitive tasks or

perform speciﬁc routines.

A Fraud Data Scientist Should Excel in Communication and Visualization Skills

Like it or not, analytics is a technical exercise. At this moment, there

is a huge gap between the analytical models and the business users.

To bridge this gap, communication and visualization facilities are

key. Hence, data scientists should know how to represent analytical

models and their accompanying statistics and reports in user-friendly

ways using trafﬁc-light approaches, OLAP (online analytical pro-

cessing) facilities, If-then business rules, and so on. They should be

capable of communicating the right amount of information without

getting lost in complex (e.g., statistical) details, which will inhibit a

model’s successful deployment. By doing so, business users will better

understand the characteristics and behavior in their (big) data which

will improve their attitude toward and acceptance of the resulting

analytical models. Educational institutions must learn to strike this

balance, since many academic degrees produce graduates who are skewed

toward either too much analytical or too much practical knowledge.

A Fraud Data Scientist Should Have a Solid Business

Understanding

While this might be obvious, we have witnessed (too) many data sci-

ence projects that failed because the respective analyst did not under-

stand the business problem at hand. By “business” we refer to the

respective application area. Several examples of such application areas

of fraud-detection techniques were summarized in Table 1.1. Each of

those ﬁelds has its own particularities that are important for a fraud

data scientist to know and understand in order to be able to design and

implement a customized fraud-detection system. The more aligned the

detection system with the environment, the better its performance will

be, as evaluated on each of the dimensions already discussed.

A Fraud Data Scientist Should Be Creative

A data scientist needs creativity on at least two levels. First, on a techni-

cal level, it is important to be creative with regard to feature selection,

data transformation, and cleaning. These steps of the standard ana-

lytics process have to be adapted to each particular application, and

often the “right guess” could make a big difference. Second, big data

and analytics is a fast-evolving ﬁeld. New problems, technologies, and

corresponding challenges pop up on an ongoing basis. Moreover,

fraudsters are also very creative and adapt their tactics and methods on an

ongoing basis. Therefore, it is crucial that a fraud data scientist keeps up

with these new evolutions and technologies and has enough creativity

to see how they can create new opportunities.

Figure 1.7 summarizes the key characteristics and strengths con-

stituting the ideal fraud data scientist proﬁle.

[Figure: chart with the axes Programmer, Modeling, Communication, Business, and Creative, comparing a "Close to ideal" profile with a "Too specialized" profile.]

Figure 1.7 Profile of a Fraud Data Scientist

A SCIENTIFIC PERSPECTIVE ON FRAUD

To conclude this chapter, let’s provide a scientiﬁc perspective about

the research on fraud. Figure 1.8 shows a screenshot of the Web of

Science statistics when querying all scientiﬁc publications between

1996 and 2014 for the key term fraud. It shows the total number of

papers published each year, the number of citations and the top ﬁve

most-cited papers.

A couple of conclusions can be drawn as follows:

◾6,174 scientiﬁc papers have been published on the topic of fraud

during the period reported.

◾The h-index is 44, implying that there are at least 44 papers with

at least 44 citations each on the topic of fraud.

◾The number of publications is steadily increasing, which shows

a growing interest from the academic community and research

on the topic.

◾The citations are exponentially growing, which is associated

with the increasing number of publications.

◾Two of the ﬁve papers mentioned study the use of analytics for

fraud detection, clearly illustrating the growing attention in the

ﬁeld for data-driven approaches.

Figure 1.8 Screenshot of Web of Science Statistics for Scientiﬁc Publications on Fraud between 1996 and 2014

REFERENCES

Armstrong, J. S. (2001). Selecting Forecasting Methods. In J.S. Armstrong, ed.

Principles of Forecasting: A Handbook for Researchers and Practitioners. New

York: Springer Science+Business Media, pp. 365–386.

Baesens, B. (2014). Analytics in a Big Data World: The Essential Guide to Data Science

and Its Applications. Hoboken, NJ: John Wiley & Sons.

Bolton, R. J., & Hand, D. J. (2002). Statistical Fraud Detection: A Review.

Statistical Science, 17 (3): 235–249.

Caron, F., Vanden Broucke, S., Vanthienen, J., & Baesens, B. (2013). Advanced

Rule-Based Process Analytics: Applications for Risk Response Decisions

and Management Control Activities. Expert Systems with Applications,

Submitted.

Chakraborty, G., Murali, P., & Satish, G. (2013). Text Mining and Analysis: Prac-

tical Methods, Examples, and Case Studies Using SAS. Cary, NC: SAS Institute.

Cressey, D. R. (1953). Other People’s Money; A Study of the Social Psychology of

Embezzlement. New York: Free Press.

Dufﬁeld, G., & Grabosky, P. (2001). The Psychology of Fraud. In Trends and

Issues in Crime and Criminal Justice, Australian Institute of Criminology (199).

Elder IV, J., & Thomas, H. (2012). Practical Text Mining and Statistical Analysis for

Non-Structured Text Data Applications. New York: Academic Press.

Fawcett, T., & Provost, F. (1997). Adaptive Fraud Detection. Data Mining and

Knowledge Discovery, 1 (3): 291–316.

Grabosky, P., & Dufﬁeld, G. (2001). Red Flags of Fraud. Trends and Issues in Crime

and Criminal Justice, Australian Institute of Criminology (200).

Han, J., & Kamber, M. (2011). Data Mining: Concepts and Techniques, Third

Edition: Morgan Kaufmann.

Hand, D. (2007, September). Statistical Techniques for Fraud Detection, Prevention,

and Evaluation. Paper presented at the NATO ASI: Mining Massive Data

sets for Security, London, England.

Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining.

Cambridge, MA: Bradford.

Jamain, A. (2001). Benford’s Law. London: Imperial College.

Junqué de Fortuny, E., Martens, D., & Provost, F. (2013). Predictive Modeling

with Big Data: Is Bigger Really Better? Big Data 1 (4): 215–226.

Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data

(2nd ed.). Hoboken, NJ: Wiley.

Maydanchik, A. (2007). Data Quality Assessment. Bradley Beach, NJ: Technics

Publications.

Navarette, E. (2006). Practical Calculation of Expected and Unexpected Losses

in Operational Risk by Simulation Methods (Banca & Finanzas: Documen-

tos de Trabajo, 1(1): pp. 1–12).

Petropoulos, F., Makridakis, S., Assimakopoulos, V., & Nikolopoulos, K. (2014).

“Horses for courses” in demand forecasting. European Journal of Operational

Research, 237 (1): 152–163.

Schneider, F. (2002). Size and Measurement of the Informal Economy in 110

Countries around the World. In Workshop of Australian National Tax Centre,

ANU, Canberra, Australia.

Tan, P.-N. N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining.

Boston: Addison Wesley.

Van Gestel, T., & Baesens, B. (2009). Credit Risk Management: Basic Concepts:

Financial Risk Components, Rating Analysis, Models, Economic and Regulatory

Capital. Oxford: Oxford University Press.

Van Vlasselaer, V., Eliassi-Rad, T., Akoglu, L., Snoeck, M., & Baesens, B. (2015).

Gotcha! Network-based Fraud Detection for Social Security Fraud. Man-

agement Science, Submitted.

Verbeke, W., Dejaeger, K., Martens, D., Hur, J., & Baesens, B. (2012). New

Insights into Churn Prediction in the Telecommunication Sector: A Proﬁt

Driven Data Mining Approach. European Journal of Operational Research

218: 211–229.

CHAPTER 2

Data Collection,

Sampling, and

Preprocessing

INTRODUCTION

Data is a key ingredient for any analytical exercise. Hence, it is of

key importance to thoroughly consider and list all data sources that

are potentially of interest and relevant before starting the analysis.

Large experiments as well as a broad experience in different ﬁelds

indicate that when it comes to data, bigger is better (see de Fortuny,

Martens, & Provost, 2013). However, real-life data can be (typically is)

dirty because of inconsistencies, incompleteness, duplication, merging,

and many other problems. Hence, throughout the analytical modeling

steps, various data-ﬁltering mechanisms will be applied to clean up

and reduce the data to a manageable and relevant size. Worth men-

tioning here is the garbage in, garbage out (GIGO) principle, which

essentially states that messy data will yield messy analytical models.

Hence, it is of utmost importance that every data preprocessing step is

carefully justiﬁed, carried out, validated, and documented before pro-

ceeding with further analysis. Even the slightest mistake can make the

data totally unusable for further analysis and the results invalid and

of no use whatsoever. In what follows, we will elaborate on the most

important data preprocessing steps that should be considered during

an analytical modeling exercise to build a fraud detection model. But

ﬁrst, let us have a closer look at what data to gather.

TYPES OF DATA SOURCES

Data can originate from a variety of different sources and provide

different types of information that might be useful for the purpose of

fraud detection, as will be further discussed in this section. Note

that the categorization of data sources and types of information given

below is broad, non-exhaustive, and not mutually exclusive: the most

prominent data sources and types of information available in a typical

organization are listed, but clearly not all possible ones are discussed,

and some overlap may exist between the listed categories.

Transactional data is a ﬁrst important source of data. It consists

of structured and detailed information capturing the key characteristics

of a customer transaction (e.g., a purchase, claim, cash transfer, credit

card payment). It is usually stored in massive OLTP (online tran-

saction processing) relational databases. This data can also be

summarized over longer time horizons by aggregating it into averages,

(absolute or relative) trends, maximum or minimum values, etc.

An important type of such aggregated transactional information in

a fraud detection setting concerns RFM variables, which stand for

recency (R), frequency (F), and monetary (M) variables. RFM vari-

ables are often used for clustering in fraud detection, as will be

discussed in Chapter 3, but are useful as well in a supervised learning

setting as will be discussed in Chapter 4. RFM variables were

originally introduced in a marketing setting (Cullinan, 1977), but they

clearly may also come in handy for fraud detection.

RFM variables can be operationalized in various ways. Let us take

the example of credit card fraud detection. The recency can be mea-

sured in a continuous way such as how much time elapsed since the

most recent transaction, or in a binary way such as was there a transac-

tion made during the past day, week, month, and so on. The frequency

can be quantiﬁed as the number of transactions per day, week, month,

and so on. Similarly, the monetary variable can be quantiﬁed as the

minimum, maximum, average, median, or most recent value of a

transaction. Although these variables can be meaningful when inter-

preted individually (e.g., fraudsters make more frequent transactions

than nonfraudsters), their interaction can also be very useful for fraud

detection. For example, credit card fraudsters often try out a stolen

credit card for a low amount to see whether it works, before making a

big purchase, resulting in a recent and low monetary value transaction

followed by a recent and high monetary value transaction. RFM

variables can also prove their worth in an anti-money laundering

context by considering recency, frequency, and amount of cash

transfers between accounts, which possibly allows uncovering charac-

teristic money laundering patterns. A ﬁnal illustration concerns RFM

variables which may be operationalized for insurance claim fraud

detection by constructing variables such as time since previous claim,

number of claims submitted in the previous twelve months, and total monetary

amount of claims since subscription of insurance contract.
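The operationalizations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transactions, the reference date, the 30-day window, and the function name `rfm_features` are all hypothetical and not taken from the book.

```python
from datetime import date
from statistics import mean

# Hypothetical transactions for one credit card: (date, amount).
transactions = [
    (date(2015, 3, 1), 120.0),
    (date(2015, 3, 20), 35.5),
    (date(2015, 3, 30), 2.0),    # small "test" purchase
    (date(2015, 3, 31), 950.0),  # large purchase shortly afterward
]

def rfm_features(transactions, as_of, window_days=30):
    """Return recency, frequency, and monetary features as of a reference date."""
    past = [(d, a) for d, a in transactions if d <= as_of]
    if not past:
        return None
    amounts = [a for _, a in past]
    last_date, last_amount = max(past, key=lambda t: t[0])
    return {
        # Recency, continuous: days elapsed since the most recent transaction.
        "recency_days": (as_of - last_date).days,
        # Recency, binary: was there a transaction during the past week?
        "active_last_week": any((as_of - d).days < 7 for d, _ in past),
        # Frequency: number of transactions inside the window.
        "frequency_window": sum((as_of - d).days < window_days for d, _ in past),
        # Monetary: several summaries of the transaction amounts.
        "monetary_mean": mean(amounts),
        "monetary_max": max(amounts),
        "monetary_last": last_amount,
    }

features = rfm_features(transactions, as_of=date(2015, 4, 1))
print(features)
```

A low-value transaction followed closely by a high-value one, as in the hypothetical data, shows up as a short recency combined with a large gap between the most recent amount and the earlier amounts.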

Contractual, subscription, or account data may complement

transactional data if a contractual relation exists, which is often the

case for utilities such as gas, electricity, telecommunication, and so on.

Examples of subscription data are the start date of the relation, infor-

mation on subscription renewals, characteristics of a subscription such

as type of services or products delivered, levels of service, cost of ser-

vice, product guarantees and insurances, and so on. The moment when

customers subscribe to a service offers a unique opportunity for organi-

zations to get to know their customers. Unique in the sense that it may

be the only time when a direct contact exists between an employee

and the customer, either in person, over the phone, or online, and as

such offers the opportunity for the organization to gather additional

information that is nonessential to the contract but may be useful

for purposes such as marketing as well as fraud detection. Such

information is typically stored in an account management or customer

relationship management (CRM) database.

Subscription data may also be a source of sociodemographic

information, since typically subscription or registration requires iden-

tiﬁcation. Examples of socioeconomic characteristics of a population

consisting of individuals are age, gender, marital status, income level,

education level, occupation, religion, and so on. Although not very advanced

or complex measures, sociodemographic information may signiﬁ-

cantly relate to fraudulent behavior. For instance, it appears that both

gender and age are very often related to an individual’s likelihood

to commit fraud: female and older individuals are less likely to commit

fraud than male and younger individuals. Similar characteristics can

also be deﬁned when the basic entities for which fraud is to be detected

do not concern individuals but instead companies or organizations.

In such a setting one rather speaks of slow-moving data dimensions,

factual data or static characteristics. Examples include the address,

year of foundation, industrial sector, activity type, and so on. These do

not change over time at all, or not as often as other characteristics such

as turnover, solvency, number of employees, etc. These latter variables

are examples of what we will call below behavioral information.

Several data sources may be consulted for retrieving sociodemo-

graphic or factual data, including subscription data sources as

discussed above, and data poolers, survey data, and publicly available

data sources as discussed below.

Nowadays, data poolers are getting more and more important

in the industry. Examples are Experian, Equifax, CIFAS, Dun &

Bradstreet, Thomson Reuters, etc. The core business of these compa-

nies is to gather data (e.g., sociodemographic information) in particular

settings or for particular purposes (e.g., fraud detection, credit risk

assessment, and marketing) and sell it to interested customers looking

to enrich or extend their data sources. In addition to selling data,

these data poolers typically also build predictive models themselves

and sell the output of these models as risk scores. This is a common

practice in credit risk. For instance, in the United States the FICO score

is a credit score ranging between 300 and 850 provided by the three

most important credit data poolers or credit bureaus: Experian, Equifax,

and TransUnion. Many financial institutions as well as commercial

vendors that give credit to customers use these FICO scores either as

their ﬁnal internal model to assess creditworthiness, or to benchmark

it against an internally developed credit scorecard to better understand

the weaknesses of the latter. The use of generic, pre-deﬁned fraud

risk scores is not yet common practice, but may become possible in

the near future.

Surveys are another source of data—that is, survey data. Such data

are gathered by questioning the target population by means of an offline

(via mail, letter, etc.) or online (via phone call or the Internet, which

offers different contact channels such as the organization’s website, the

online helpdesk, social media proﬁles such as Facebook, LinkedIn, or

Twitter) survey. Surveys may aim at gathering sociodemographic data,

but also behavioral information.

Behavioral information concerns any information describing

the behavior of an individual or an entity in the particular context

under research. Such data are also called fast-moving data or dynamic

characteristics. Examples of behavioral variables include information

with regard to preferences of customers, usage information,

frequencies of events, trend variables, and so on. When dealing with

organizations, examples of behavioral characteristics or dynamic

characteristics are turnover, solvency, number of employees, and so on.

Marketing data results from monitoring the impact of marketing

actions on the target population, and concerns a particular type of

behavioral information.

Also, unstructured data embedded in text documents (e.g.,

emails, Web pages, claim forms) or multimedia content can be inter-

esting to analyze. However, these sources typically require extensive

preprocessing before they can be successfully included in an analytical

exercise. Analyzing textual data is the goal of a particular branch of

analytics, called text analytics. Given the high level of specialization

involved, this book does not provide an extensive discussion of text

mining techniques, although a brief introduction is provided in the

ﬁnal chapter of the book. For more information on this topic, one

may refer to academic textbooks on the subject (Chakraborty, Murali,

& Satish 2013).

A second type of unstructured information is contextual or

network information, meaning the context of a particular entity.

An example of such contextual information concerns relations of

a particular type that exist between an entity and other entities of

the same type or of another type. How to gather and represent such

information, as well as how to make use of it will be described in

detail in Chapter 5, which will focus on social network analytics for

fraud detection.

Another important source of data is qualitative, expert-based

data. An expert is a person with a substantial amount of subject

matter expertise within a particular setting (e.g., credit portfolio man-

ager, brand manager). The expertise stems from both common sense

and business experience and it is important to elicit this knowledge

as much as possible before the analytical model building exercise is

started. It will allow steering the modeling in the right direction and

interpreting the analytical results from the right perspective. A popular

example of applying expert-based validation is checking the univariate

signs of a regression model. For instance, an example already discussed

before concerns the observation that a higher age often results in

a lower likelihood of being fraudulent. Consequently, a negative

sign is expected when including age in a fraud prediction model

yielding the probability of an individual committing fraud. If this turns

out not to be the case due to whatever reason (e.g., bad data

quality, multicollinearity), the expert or business user will not be

tempted to use the analytical model at all, since it contradicts prior

expectations.
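The sign check described above can be automated as a simple comparison between estimated coefficients and expert expectations. The sketch below is illustrative only: the variable names, coefficient values, and expected signs are hypothetical and do not come from a fitted model in the book.

```python
# Hypothetical fitted coefficients of a fraud prediction model.
coefficients = {
    "age": 0.012,
    "n_claims_last_year": 0.40,
    "years_as_customer": -0.08,
}

# Signs the business experts expect:
# -1 means the variable should lower the fraud likelihood, +1 should raise it.
expected_signs = {
    "age": -1,
    "n_claims_last_year": +1,
    "years_as_customer": -1,
}

def sign(x):
    """Return -1, 0, or +1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def unexpected_signs(coefficients, expected_signs):
    """List variables whose estimated sign contradicts the expert expectation."""
    return [
        name
        for name, expected in expected_signs.items()
        if name in coefficients and sign(coefficients[name]) != expected
    ]

flagged = unexpected_signs(coefficients, expected_signs)
print(flagged)
```

A flagged variable is a prompt for further investigation (for example, of data quality or multicollinearity), not proof that the model is wrong.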

A ﬁnal source of information concerns publicly available data

sources that can provide, for instance, external information. This is

contextual information that is not related to a particular entity, such

as macroeconomic data (GDP, inﬂation, unemployment, etc.), and

weather observations. By enriching the data set with such information

one may see for example how the model and the model outputs vary

as a function of the state of the economy. Possibly fraud rates and

total amounts of fraud increase during economic downturn periods,

although no scientiﬁc evidence (or counter evidence, for that matter)

of such an effect is, to our knowledge, available (yet) in the scientiﬁc

literature. Publicly available social media data from, for example, Facebook,

Twitter, LinkedIn, and so on can also be an important source of

information. However, one needs to be careful in both

gathering and using such data and make sure that local and interna-

tional privacy regulations are respected at all times. Privacy concerns

in a data analytics context will be discussed in the ﬁnal chapter.

MERGING DATA SOURCES

The application of both descriptive and predictive analytics typically

requires or presumes the data to be presented in a single table contain-

ing and representing all the data in a structured manner. A structured

data table allows straightforward processing and analysis. (Learning

from multiple tables that are related—that is, learning directly from

relational databases without merging the normalized tables—is a par-

ticular branch within the ﬁeld of data mining called relational learning,

and shares techniques and approaches with social network analytics as

discussed in Chapter 5.)

Typically, the rows of a data table represent the basic entities to

which the analysis applies (e.g., customers, transactions, enterprises,

claims, cases). The rows are referred to as instances, observations, or

lines. The columns in the data table contain information about the basic

entities. Plenty of synonyms are used to denote the columns of the data

table, such as (explanatory) variables, ﬁelds, characteristics, attributes,

indicators, features, and so on.

In order to construct the aggregated, non-normalized data table to

facilitate further analysis, often several normalized source data tables

have to be merged. Merging tables involves selecting information

from different tables related to an individual entity, and copying it

to the aggregated data table. The individual entity can be recognized

and selected in the different tables by making use of keys, which

are attributes included in the table precisely to allow

identifying and relating observations from different source tables per-

taining to the same entity. Figure 2.1 illustrates the process of merging

two tables—that is, transactions data and customer data, into a single

non-normalized data table by making use of the key attribute ID, which

allows connecting observations in the transactions table with observa-

tions in the customer table. The same approach can be taken to merge

as many tables as required, but clearly the more tables are merged,

the more duplicate data might be included in the resulting table.

When merging data tables, it is crucial that no errors happen, so

some checks should be applied to control the resulting table and to

make sure that all information is correctly integrated.

Transactions

ID    Date         Amount
XWV   2/01/2015    52 €
XWV   6/02/2015    21 €
XWV   3/03/2015    13 €
BBC   17/02/2015   45 €
BBC   1/03/2015    75 €
VVQ   2/03/2015    56 €

Customer data

ID    Age   Start date
XWV   31    1/01/2015
BBC   49    10/02/2015
VVQ   21    15/02/2015

Non-normalized data table

ID    Date         Amount   Age   Start date
XWV   2/01/2015    52 €     31    1/01/2015
XWV   6/02/2015    21 €     31    1/01/2015
XWV   3/03/2015    13 €     31    1/01/2015
BBC   17/02/2015   45 €     49    10/02/2015
BBC   1/03/2015    75 €     49    10/02/2015
VVQ   2/03/2015    56 €     21    15/02/2015

Figure 2.1 Aggregating Normalized Data Tables into a Non-Normalized Data Table
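The merge of Figure 2.1, together with the kind of checks recommended above, might be sketched as follows. The function name `merge_on_key` and the specific checks (unique keys in the customer table, no transaction silently dropped) are assumptions chosen for illustration; in practice, a database join or a data-frame library would typically do this work.

```python
# Rows of the two normalized tables from Figure 2.1.
transactions = [
    {"ID": "XWV", "Date": "2/01/2015", "Amount": 52},
    {"ID": "XWV", "Date": "6/02/2015", "Amount": 21},
    {"ID": "XWV", "Date": "3/03/2015", "Amount": 13},
    {"ID": "BBC", "Date": "17/02/2015", "Amount": 45},
    {"ID": "BBC", "Date": "1/03/2015", "Amount": 75},
    {"ID": "VVQ", "Date": "2/03/2015", "Amount": 56},
]
customers = [
    {"ID": "XWV", "Age": 31, "Start date": "1/01/2015"},
    {"ID": "BBC", "Age": 49, "Start date": "10/02/2015"},
    {"ID": "VVQ", "Age": 21, "Start date": "15/02/2015"},
]

def merge_on_key(left, right, key):
    """Attach the matching `right` row to every `left` row, using `key`.

    Returns the merged rows and the `left` rows that found no match.
    """
    lookup = {}
    for row in right:
        # Check 1: the key must identify exactly one row in the lookup table.
        if row[key] in lookup:
            raise ValueError(f"duplicate key in lookup table: {row[key]!r}")
        lookup[row[key]] = row
    merged, unmatched = [], []
    for row in left:
        match = lookup.get(row[key])
        if match is None:
            unmatched.append(row)  # Check 2: keep track of rows without a match.
        else:
            merged.append({**row, **match})
    return merged, unmatched

merged, unmatched = merge_on_key(transactions, customers, "ID")

# Check 3: no transaction was lost or duplicated by the merge.
assert len(merged) + len(unmatched) == len(transactions)
print(len(merged), "merged rows,", len(unmatched), "unmatched rows")
```

Note that the customer attributes (Age, Start date) are repeated on every transaction of the same customer, which is the duplication the text warns about.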

SAMPLING

The aim of sampling is to take a subset of historical data (e.g., past

transactions) and use that to build an analytical model. A ﬁrst obvi-

ous question that comes to mind concerns the need for sampling.

Obviously, with the availability of high-performance computing facili-

ties (e.g., grid and cloud computing), one could also try to directly

analyze the full data set. However, a key requirement for a good sam-

ple is that it should be representative for the future entities on which

the analytical model will be run. Hence, the timing aspect becomes

important since for instance transactions of today are more similar

to transactions of tomorrow than are transactions of yesterday. Choosing

the optimal time window of the sample involves a trade-off between

lots of data (and hence a more robust analytical model) and recent

data (which may be more representative). The sample should also be

taken from an average business period to get as accurate as possible a

picture of the target population.

It goes without saying that sampling bias should be avoided as much as

possible. However, this is not always that straightforward. For instance,

in a credit card context one may take a month’s data as input, which

will already result in a substantial amount of data to be processed.

But which month is representative for future months? Clearly,

customers may use their credit card differently during the month of

December when buying gifts for the holiday period. When looking at

this example more closely, we discover in fact two sources of bias or

deviations from normal business periods. Credit card customers may

spend more during this period, both in total as well as on individual

products. Additionally, different types of products may be bought in

stores other than those usually frequented by the customer.

Let us consider two concrete solutions to address such a season-

ality effect or bias, although other solutions may exist. Since every

month may in fact deviate from normal, if normal is deﬁned as average,

it could make sense to build separate models for different months, or

for homogeneous time frames. This is a rather complex and demanding

solution from an operational perspective, since multiple models have

to be developed, run, maintained, and monitored.

Alternatively, a sample may be gathered by sampling observations

over a period covering a full business cycle. Then only a single model

has to be developed, run, maintained, and monitored, which may

come at the cost of reduced fraud detection power since it is less tailored

to a particular time frame, yet clearly will be less complex and costly

to operate.

The sample to be gathered as such depends on the choice that is

made between these two alternative solutions in addressing poten-

tial sampling bias. This example illustrates the importance and direct

impact of sampling, not in the least on the performance of the model

that is built based on the gathered sample (i.e., the fraud detection

power).

In stratiﬁed sampling, a sample is taken according to predeﬁned

strata. In a fraud detection context data sets are typically very skewed

(e.g., 99 percent nonfraudulent and 1 percent fraudulent transactions).

When stratifying according to the target fraud indicator, the sample

will contain exactly the same percentages of (non-) fraudulent trans-

actions as in the original data. Additional stratiﬁcation can be applied

on predictor variables as well—for instance, in order for the number

of observations across different product categories to closely resemble

the real product transaction distribution. However, as long as no large

deviations exist with respect to the sample and observed distribution

of predictor variables, it will usually be sufﬁcient to limit stratiﬁcation

to the target variable.
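Stratifying on the target variable can be sketched as drawing the same fraction from each class. The data set below (1 percent fraud), the 10 percent sampling fraction, and the function name `stratified_sample` are hypothetical; with stratum sizes that do not divide evenly, rounding makes the proportions approximately rather than exactly equal.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, fraction, seed=0):
    """Draw the same fraction from every stratum so class proportions are kept."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    strata = defaultdict(list)
    for row in rows:
        strata[label_of(row)].append(row)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)  # rounding may shift proportions slightly
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 100 fraudulent and 9,900 nonfraudulent transactions.
data = [{"id": i, "fraud": i < 100} for i in range(10_000)]

sample = stratified_sample(data, lambda r: r["fraud"], fraction=0.10)
fraud_share = sum(r["fraud"] for r in sample) / len(sample)
print(len(sample), fraud_share)
```

A simple random sample of the same size would only match the 1 percent fraud rate on average, and a small sample could by chance contain very few fraud cases.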

TYPES OF DATA ELEMENTS

It is important to appropriately consider the different types of data

elements at the start of the analysis. The following types of data

elements can be considered:

◾Continuous data

◾These are data elements that are deﬁned on an interval,

which can be both limited and unlimited.

◾A distinction is sometimes made between continuous data

with and without a natural zero value, and which are respec-

tively referred to as ratio (e.g., amounts) and interval data

(e.g., temperature in degrees Celsius or Fahrenheit). In the

latter case, you cannot make a statement like, “It is double or

twice as hot as last month”; since the value zero has no mean-

ing you cannot take the ratio of two values, hence explaining

the name ratio data for data measured on a scale with a nat-

ural zero value. Most continuous data in a fraud detection

setting concerns ratio data, since often we are dealing with

amounts.

◾Examples: amount of transaction; balance on savings

account; (dis-) similarity index

◾Categorical data

◾Nominal

◾These are data elements that can only take on a limited set

of values with no meaningful ordering in between.

◾Examples: marital status; payment type; country of origin

◾Ordinal

◾These are data elements that can only take on a limited set

of values with a meaningful ordering in between.

◾Examples: age coded as young, middle-aged, and old

◾Binary

◾These are data elements that can only take on two values.

◾Examples: daytime transaction (yes/no); online transaction

(yes/no)

Appropriately distinguishing between these different data elements

is of key importance at the start of the analysis, when importing the data

into an analytics tool. For example, if marital status were incorrectly

specified as a continuous data element, then the software would

calculate its mean, standard deviation, and so on, which is obviously

meaningless and may perturb the analysis.
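One way to guard against this mistake is to declare the type of every variable explicitly and let the summary depend on it. The sketch below assumes a hypothetical type declaration, data set, and numeric coding of marital status; none of these come from the book.

```python
from collections import Counter
from statistics import mean

# Hypothetical declaration of the data element types.
variable_types = {
    "amount": "continuous",
    "marital_status": "nominal",  # coded 1 = single, 2 = married, 3 = divorced
    "online": "binary",
}

rows = [
    {"amount": 52.0, "marital_status": 1, "online": 1},
    {"amount": 21.0, "marital_status": 2, "online": 0},
    {"amount": 13.0, "marital_status": 3, "online": 1},
    {"amount": 45.0, "marital_status": 2, "online": 0},
]

def summarize(rows, variable_types):
    """Use mean/min/max for continuous variables and frequency counts otherwise."""
    summary = {}
    for name, kind in variable_types.items():
        values = [row[name] for row in rows]
        if kind == "continuous":
            summary[name] = {"mean": mean(values), "min": min(values), "max": max(values)}
        else:
            # A mean of category codes would be meaningless, so count instead.
            summary[name] = dict(Counter(values))
    return summary

summary = summarize(rows, variable_types)
print(summary)
```

Treating `marital_status` as continuous here would yield a mean of 2.0, a number that says nothing about the customers.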

VISUAL DATA EXPLORATION AND EXPLORATORY

STATISTICAL ANALYSIS

Visual data exploration is a very important part of getting to know

your data in an “informal” way. It allows gaining some initial insights

into the data, which can then be usefully adopted throughout the

modeling stage. Different plots/graphs can be useful here. Pie charts

are a popular example. A pie chart represents a variable’s distribution

as a pie, whereby each section represents the portion of the total

percent taken by each value of the variable. Figure 2.2 represents a

pie chart for a variable payment type with possible values credit card,

debit card, or cheque. A separate pie chart analysis for the fraudsters and

nonfraudsters indicates that for payment type cheque relatively more

fraud occurs, which can be a very useful starting insight. Bar charts

represent the frequency of each of the values (either absolute or

relative) as bars. Other handy visual tools are histograms and scatter

plots. A histogram provides an easy way to visualize the central

tendency and to determine the variability or spread of the data.

It also allows contrasting the observed data with standard known

distributions (e.g., normal distribution). Scatter plots allow users to

visualize one variable against another to see whether there are any

correlation patterns in the data. Also, OLAP-based multidimensional

data analysis can be usefully adopted to explore patterns in the data

(see Chapter 3).

A next step after visual analysis could be inspecting some basic

statistical measurements such as averages, standard deviations, min-

imum, maximum, percentiles, conﬁdence intervals, and so on. One

could calculate these measures separately for each of the target classes

(e.g., fraudsters versus nonfraudsters) to see whether there are any

interesting patterns present (e.g., do fraudsters usually have a lower

average age than nonfraudsters?).

BENFORD’S LAW

A data exploration technique, both visual and numerical, that is

particularly interesting in a fraud detection setting relates to what is

commonly known as Benford’s law. This law describes the frequency

distribution of the ﬁrst digit in many real-life data sets and is shown

in Figure 2.3. When comparing the expected distribution following

Benford’s law with the observed distribution in a data set, strong

deviations from the expected frequencies may indicate the data to be

suspicious and possibly manipulated. For instance, government aid or support eligibility typically depends on whether the applicant meets certain requirements, such as an income below a certain threshold. Therefore, data may be tampered with in order for the application to comply with these requirements. It is exactly such types of fraud that are prone to detection using Benford’s law, since the manipulated or made-up numbers will not comply with the expected frequency of the first digit as expressed by Benford’s law.

[Figure 2.2 Pie Charts for Exploratory Data Analysis. Three pie charts (total population, fraudsters, nonfraudsters), each split by payment type: credit card, debit card, check.]

[Figure 2.3 Benford’s Law Describing the Frequency Distribution of the First Digit. First digit distribution: first digits 1 to 9 on the horizontal axis, frequency from 0.05 to 0.35 on the vertical axis.]

The mathematical formula describing this law expresses the probability P(d) of the leading digit d to occur to be equal to:

$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right)$$

Benford’s law can be used as a screening tool for fraud detection

(Jamain 2001). It is a partially negative rule, like many other rules,

meaning that if Benford’s law is not satisﬁed, then it is probable that

the involved data were manipulated and further investigation or

testing is required. Conversely, if a data set complies with Benford’s

law, it can still be fraudulent. Note that Benford’s law applies to a

data set, meaning that a sufﬁcient amount of numbers related to an

individual case need to be gathered in order for Benford’s law to be

meaningful, which is typically the case when dealing with ﬁnancial

statements, for instance.

A deviation from Benford’s law does not necessarily mean that

the data have been tampered with. It may also attract the analyst’s

attention toward data quality issues resulting from merging several

data sets in a single structured data table, toward duplicate data,

and so on. Hence, it may deﬁnitely be worthwhile for several pur-

poses to control compliance with Benford’s law during the data-

preprocessing phase.
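As an illustration of the screening idea described above, the following minimal sketch computes the expected first-digit frequencies from P(d) = log10(1 + 1/d) and compares them with the observed frequencies in a list of numbers. The function names and the use of a Chi-squared distance as the deviation measure are assumptions made for this example; they are not prescribed by the text.

```python
import math
from collections import Counter


def benford_expected():
    """Expected first-digit probabilities P(d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value):
    """Return the leading nonzero digit of a number, or None if the value is zero."""
    # In scientific notation the leading digit is the first character.
    d = int(f"{abs(value):.15e}"[0])
    return d if d != 0 else None


def benford_deviation(values):
    """Compare observed first-digit frequencies with Benford's law.

    Returns (observed, chi2): the observed relative frequency per digit and
    the Chi-squared distance between observed and expected counts.
    """
    digits = [d for d in map(first_digit, values) if d is not None]
    if not digits:
        raise ValueError("no nonzero values to analyze")
    n = len(digits)
    counts = Counter(digits)
    observed = {d: counts.get(d, 0) / n for d in range(1, 10)}
    chi2 = sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_expected().items()
    )
    return observed, chi2
```

A large distance only flags the data set for further investigation; as noted above, it may equally point to data quality issues rather than manipulation.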

DESCRIPTIVE STATISTICS

Similar to Benford’s law and in addition to the preparatory visual

data exploration, several descriptive statistics might be calculated

that provide basic insight or feeling for the data. Plenty of descriptive

statistics exist to summarize or provide information with respect

to a particular characteristic of the data, and therefore descriptive

statistics should be assessed together—in support and completion of

each other.

Basic descriptive statistics are the mean and median value of continuous variables. The median is less sensitive to extreme values, but it also provides less information about the full distribution. Complementary to the mean value, the variance or the standard deviation provides insight into how much the data are spread around the mean value. Likewise, percentile values such as the 10th, 25th, 75th, and 90th percentiles provide further information with respect to the distribution and complement the median value.

Speciﬁc descriptive statistics exist to express the symmetry or

asymmetry of a distribution, such as the skewness measure, as well

as the peakedness or ﬂatness of a distribution—for example, the

kurtosis measure. However, the exact values of these measures are

likely a bit harder to interpret than for instance the value of the mean

and standard deviation. This limits their practical use. Instead, one

could more easily assess these aspects by inspecting visual plots of the

distributions of the involved variables.

When dealing with categorical variables, instead of the median and

the mean value one may calculate the mode, which is the most fre-

quently occurring value. In other words, the mode is the most typical

value for the variable at hand. The mode is not necessarily unique,

since multiple values can result in the same maximum frequency.

MISSING VALUES

Missing values can occur because of various reasons. The information

can be nonapplicable. For example, when modeling amount of fraud

for users, then this information is only available for the fraudulent

accounts and not for the nonfraudulent accounts since it is not appli-

cable there. The information can also be undisclosed. For example, a

customer decided not to disclose his or her income because of privacy.

Missing data can also originate because of an error during merging

(e.g., typos in name or ID).

Some analytical techniques (e.g., decision trees) can deal directly

with missing values. Other techniques need some additional prepro-

cessing. The following are the most popular schemes to deal with

missing values (Little and Rubin 2002).

◾ Replace (Impute)

This implies replacing the missing value with a known

value. For example, consider the example in Table 2.1. One

could impute the missing credit bureau scores with the average

or median of the known values. For marital status, the mode can

then be used. One could also apply regression-based imputation

whereby a regression model is estimated to model a target vari-

able (e.g., credit bureau score) based on the other information

available (e.g., age, income). The latter is more sophisti-

cated, although the added value from an empirical viewpoint

(e.g., in terms of model performance) is questionable.

◾ Delete

This is the most straightforward option and consists of delet-

ing observations or variables with lots of missing values. This, of

course, assumes that information is missing at random and has

no meaningful interpretation and/or relationship to the target.

◾ Keep

Missing values can be meaningful. For example, a customer

did not disclose his/her income because he/she is currently

unemployed. This fact may have a relation with fraud and

needs to be considered as a separate category.

Table 2.1 Dealing with Missing Values

ID   Age  Income  Marital Status  Credit Bureau Score  Fraud
1    34   1,800   ?               620                  Yes
2    28   1,200   Single          ?                    No
3    22   1,000   Single          ?                    No
4    60   2,200   Widowed         700                  Yes
5    58   2,000   Married         ?                    No
6    44   ?       ?               ?                    No
7    22   1,200   Single          ?                    No
8    26   1,500   Married         350                  No
9    34   ?       Single          ?                    Yes
10   50   2,100   Divorced        ?                    No

As a practical way of working, one can ﬁrst start with statistically

testing whether missing information is related to the target variable or

not (using, e.g., a Chi-squared test, see the section on categorization).

If yes, then we can adopt the keep strategy and make a special category for it. If not, one can, depending on the number of observations available, decide to either delete or impute.
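The replace (impute) scheme can be sketched as follows, using the income and marital status columns of Table 2.1. This is only an illustration: the helper name `impute`, the use of `None` to encode a missing value, and the extra `_missing` indicator column (which keeps track of which values were originally missing) are assumptions made for this example.

```python
from statistics import median, mode

# Income and marital status as listed in Table 2.1; None encodes a missing value.
rows = [
    {"id": 1, "income": 1800, "marital_status": None},
    {"id": 2, "income": 1200, "marital_status": "Single"},
    {"id": 3, "income": 1000, "marital_status": "Single"},
    {"id": 4, "income": 2200, "marital_status": "Widowed"},
    {"id": 5, "income": 2000, "marital_status": "Married"},
    {"id": 6, "income": None, "marital_status": None},
    {"id": 7, "income": 1200, "marital_status": "Single"},
    {"id": 8, "income": 1500, "marital_status": "Married"},
    {"id": 9, "income": None, "marital_status": "Single"},
    {"id": 10, "income": 2100, "marital_status": "Divorced"},
]


def impute(rows, column, strategy):
    """Return copies of the rows with missing values in `column` replaced.

    `strategy` is a function of the known values (e.g., median or mode).
    A boolean `<column>_missing` indicator records where imputation happened.
    """
    known = [r[column] for r in rows if r[column] is not None]
    fill = strategy(known)
    result = []
    for r in rows:
        new = dict(r)
        new[column + "_missing"] = r[column] is None
        if r[column] is None:
            new[column] = fill
        result.append(new)
    return result


imputed = impute(impute(rows, "income", median), "marital_status", mode)
```

The indicator column also supports the keep strategy: it can be tested against the target (for example with a Chi-squared test) before deciding whether missingness deserves its own category.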

OUTLIER DETECTION AND TREATMENT

Outliers are extreme observations that are very dissimilar to the rest of

the population. Actually, two types of outliers can be considered:

◾ Valid observations, e.g., salary of boss is US$1,000,000

◾ Invalid observations, e.g., age is 300 years

Both are univariate outliers in the sense that they are outlying on

one dimension. However, outliers can be hidden in unidimensional

views of the data. Multivariate outliers are observations that are outly-

ing in multiple dimensions. For example, Figure 2.4 gives an example

of two outlying observations considering both the dimensions of

income and age.

[Figure 2.4 Multivariate Outliers. Scatter plot of income (500 to 4,500) versus age (0 to 70).]

Two important steps in dealing with outliers are detection and

treatment. A ﬁrst obvious check for outliers is to calculate the min-

imum and maximum values for each of the data elements. Various

graphical tools can be used to detect outliers. Histograms are a ﬁrst

example. Figure 2.5 presents an example of a distribution for age

whereby the circled areas clearly represent outliers.

[Figure 2.5 Histogram for Outlier Detection. Frequency distribution of age: age bins (0–5, 20–25, ..., 65–70, 150–200) against frequency (500 to 3,500).]

Another useful visual mechanism is a box plot. A box plot represents three key quartiles of the data: the first quartile (25 percent of the observations have a lower value), the median (50 percent of the observations have a lower value), and the third quartile (75 percent of the observations have a lower value). All three quartiles are represented as a box. The minimum and maximum values are then also added unless they are too far away from the edges of the box. Too far away is then quantified as more than 1.5 × Interquartile Range (IQR = Q3 − Q1). Figure 2.6 gives an example of a box plot where three outliers can be seen.

Another way is to calculate z-scores, measuring how many standard deviations an observation lies away from the mean, as follows:

$$z_i = \frac{x_i - \mu}{\sigma},$$

where μ represents the average of the variable and σ its standard deviation. An example is given in Table 2.2. Note that, by definition, the z-scores will have a mean of zero and unit standard deviation.

[Figure 2.6 Box Plots for Outlier Detection. Box plot annotated with Min, Q1, Q3, the 1.5 × IQR range, and outliers.]

Table 2.2 z-Scores for Outlier Detection

ID   Age      z-Score
1    30       (30 − 40)/10 = −1
2    50       (50 − 40)/10 = +1
3    10       (10 − 40)/10 = −3
4    40       (40 − 40)/10 = 0
5    60       (60 − 40)/10 = +2
6    80       (80 − 40)/10 = +4
…
     μ = 40   μ = 0
     σ = 10   σ = 1

A practical rule of thumb, then, deﬁnes outliers when the absolute

value of the z-score |z| is bigger than three. Note that the z-score relies

on the normal distribution.
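The z-score rule of thumb can be sketched as follows, using the ages of Table 2.2 with the stated μ = 40 and σ = 10. The function names are illustrative, and falling back to the population standard deviation when μ and σ are not supplied is an assumption of this sketch; the text does not specify which estimator to use.

```python
from statistics import fmean, pstdev


def z_scores(values, mu=None, sigma=None):
    """Return z_i = (x_i - mu) / sigma for each value.

    If mu or sigma is not given, it is estimated from the values themselves
    (mean and population standard deviation).
    """
    mu = fmean(values) if mu is None else mu
    sigma = pstdev(values) if sigma is None else sigma
    return [(x - mu) / sigma for x in values]


def flag_outliers(values, threshold=3.0, mu=None, sigma=None):
    """Flag observations whose absolute z-score is bigger than the threshold."""
    return [abs(z) > threshold for z in z_scores(values, mu, sigma)]


ages = [30, 50, 10, 40, 60, 80]
print(z_scores(ages, mu=40, sigma=10))       # z-scores as in Table 2.2
print(flag_outliers(ages, mu=40, sigma=10))  # only age 80 exceeds |z| > 3
```

Note that age 10 has a z-score of exactly −3 and is therefore not flagged by a strict "bigger than three" rule.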

These methods all focus on univariate outliers. Multivariate

outliers can be detected by ﬁtting regression lines and inspecting the

observations with large errors (using, e.g., a residual plot). Alternative

methods are clustering or calculating the Mahalanobis distance. Note,

however, that although potentially useful, multivariate outlier detec-

tion is typically not considered in many modeling exercises due to the

typical marginal impact on model performance.

Some analytical techniques (e.g., decision trees, neural networks,

SVMs) are fairly robust with respect to outliers. Others (e.g., linear/

logistic regression) are more sensitive to them. Various schemes exist

to deal with outliers. It highly depends on whether the outlier repre-

sents a valid or invalid observation. For invalid observations (e.g., age

is 300 years), one could treat the outlier as a missing value using any of

the schemes discussed in the previous section. For valid observations

(e.g., income is US$1,000,000), other schemes are needed. A popular

scheme is truncation/capping/winsorizing. One hereby imposes both

a lower and upper limit on a variable and any values below/above

are brought back to these limits. The limits can be calculated using

the z-scores (see Figure 2.7), or the IQR (which is more robust than the

z-scores) as follows:

$$\text{Upper/Lower limit} = M \pm 3s,$$

with M = median and s = IQR/(2 × 0.6745) (Van Gestel and Baesens 2009). A sigmoid transformation ranging between 0 and 1 can also be used for capping as follows:

$$f(x) = \frac{1}{1 + e^{-x}}$$

In addition, expert-based limits based on business knowledge

and/or experience can be imposed.
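The IQR-based limits and the capping step can be sketched as follows. The function names are illustrative, and the quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is an assumption; other quartile definitions give slightly different limits on small samples.

```python
import math
from statistics import median, quantiles


def iqr_limits(values, width=3.0):
    """Return (lower, upper) = M -/+ width * s, with s = IQR / (2 * 0.6745)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    s = (q3 - q1) / (2 * 0.6745)
    m = median(values)
    return m - width * s, m + width * s


def winsorize(values, lower, upper):
    """Bring values below/above the limits back to these limits."""
    return [min(max(x, lower), upper) for x in values]


def sigmoid(x):
    """Sigmoid transformation f(x) = 1 / (1 + exp(-x)), ranging between 0 and 1.

    Intended for values on a standardized scale; very large negative inputs
    would overflow math.exp.
    """
    return 1.0 / (1.0 + math.exp(-x))
```

For example, for incomes of 1,000, 1,200, 1,300, 1,400, 1,800, 2,000, and 1,000,000, `iqr_limits` yields an upper limit below 3,000, so `winsorize` pulls the extreme income back to that limit while leaving the other values unchanged.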

[Figure 2.7 Using the z-Scores for Truncation. Axis marked with μ − 3σ, μ, and μ + 3σ.]

An important remark concerning outliers is that not all invalid values are outlying and, as such, they may go unnoticed if not explicitly looked into. For instance, a clear issue exists when observing customers with values gender = male and pregnant = yes. Which value is invalid, either the value for gender or pregnant, cannot be determined, but it needs to be noted that neither value is outlying, and therefore such a conflict will not be noticed by the analyst unless some explicit precautions are taken. In order to detect particular invalid combinations, one may construct a set of rules that are formulated based on expert knowledge and experience (similar, in fact, to a fraud detection rule engine), which is applied to the data to check and alert for issues. In this particular context, a network representation of the variables may be of use to construct the rule set and reason upon relations that exist between the different variables, with links representing constraints that apply to the combination of variable values and resulting in rules added to the rule set.

RED FLAGS

An important remark with respect to outlier treatment is to be made,

which particularly holds in a fraud detection setting. As discussed

in the introductory chapter, fraudsters may be detected by the very

fact that their behavior is different or deviant from nonfraudsters,

although most likely only slightly or in a complex (multivariate)

manner since they will cover their tracks to remain undetected.

These deviations from normality are called red ﬂags of fraud and are

probably the most successful and widespread tool that is being used

to detect fraud. As discussed in Grabosky and Dufﬁeld (2001) in the

broadest terms, the fundamental red ﬂag of fraud is the anomaly,

that is, a variation from predictable patterns of behavior or, simply,

something that seems out of place. Some examples of red ﬂags follow:

Tax evasion fraud red flags:

◾ An identical financial statement, since fraudulent companies copy financial statements of nonfraudulent companies to look less suspicious

◾ Name of an accountant is unique, since this might concern a nonexisting accountant

Credit card fraud red flags:

◾ A small payment followed by a large payment immediately after, since a fraudster might first check whether the card is still active before placing a bet

◾ Regular rather small payments, which is a technique to avoid getting noticed

Telecommunications-related fraud may be reflected in the following red-flag activities (Grabosky and Duffield 2001):

◾ Long-distance access followed by reverse call charges accepted from overseas

◾ High-volume usage over short periods before disconnection

◾ A large volume of calls where one call begins shortly after the termination of another

◾ The nonpayment of bills

Such red-flag activities are typically translated into expert rules and

included in a rule engine as discussed in Chapter 1. When red ﬂags

and expert rules are deﬁned, the reasoning behind the red ﬂag or

rule should be documented in order to inform the inspectors about

the underlying reasons or causes of suspicion underlying the red ﬂag

that is raised, so they can focus their investigations on these causes

for suspicion.
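To make the translation of a red flag into an expert rule concrete, the sketch below encodes the credit card red flag "a small payment followed by a large payment immediately after" and attaches the documented reason to every alert. Everything specific here is hypothetical: the `Payment` record, the function name, and the thresholds (5 and 500 currency units, a 10-minute window) are illustrative assumptions, not values given in the text.

```python
from dataclasses import dataclass


@dataclass
class Payment:
    card_id: str
    timestamp: float  # seconds since an arbitrary reference point
    amount: float


# Documented reasoning behind the rule, passed on to the inspectors.
REASON = ("Small payment followed by a large payment immediately after: "
          "the small payment may be a check whether the card is still active.")


def small_then_large(payments, small=5.0, large=500.0, window=600.0):
    """Return alerts (card_id, first, second, reason) for suspicious pairs.

    A pair is flagged when two consecutive payments on the same card are at
    most `window` seconds apart, the first is at most `small`, and the second
    is at least `large`. All thresholds are hypothetical.
    """
    alerts = []
    last_by_card = {}
    for p in sorted(payments, key=lambda p: p.timestamp):
        prev = last_by_card.get(p.card_id)
        if (prev is not None
                and prev.amount <= small
                and p.amount >= large
                and p.timestamp - prev.timestamp <= window):
            alerts.append((p.card_id, prev, p, REASON))
        last_by_card[p.card_id] = p
    return alerts
```

Carrying the reason along with each alert is one simple way to give inspectors the underlying cause of suspicion together with the flagged case.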

Descriptive or unsupervised learning techniques will be discussed in Chapter 3, including several approaches that aim at detecting such slight or complex deviations from regular behavior associated with fraud. When handling valid outliers in the data set using the treatment techniques discussed before, we may in fact impair the ability of descriptive analytics to find anomalous fraud patterns. Therefore, one should be extremely careful in treating valid outliers when applying unsupervised learning techniques to build a fraud detection model. This is true even when handling univariate outliers, since these might be related to fraud not by themselves but in combination with other variables, in a multivariate manner, and as such be used to uncover fraud. Invalid outliers, on the contrary, can straightforwardly be treated as missing values, as discussed above, preferably by including an indicator that the value was missing or, even more precisely, an invalid outlier. This allows users to test whether there is any relation at all between an invalid value and fraud.

STANDARDIZING DATA

Standardizing data is a data preprocessing activity targeted at scaling

variables to a similar range. Consider, for example, two variables gen-

der (coded as 0/1) and income (ranging between 0 and US$1,000,000).

When building logistic regression models using both information ele-

ments, the coefﬁcient for income might become very small. Hence,

it could make sense to bring them back to a similar scale. The following

standardization procedures could be adopted:

◾ Min/Max standardization

$$X_{new} = \frac{X_{old} - \min(X_{old})}{\max(X_{old}) - \min(X_{old})}\,(newmax - newmin) + newmin,$$

whereby newmax and newmin are the newly imposed maximum and minimum (e.g., 1 and 0).

◾ z-score standardization

  ◾ Calculate the z-scores (see the section on outlier detection and treatment).

◾ Decimal scaling

  ◾ Divide by a power of 10 as follows: $X_{new} = X_{old}/10^n$, with n the number of digits of the maximum absolute value.

Again, note that standardization is especially useful for regression-

based approaches but is not needed for decision trees, for example.
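The min/max and decimal scaling procedures can be sketched as follows (the z-score variant follows the z-score formula from the outlier section). The function names are illustrative, and counting the digits of the integer part of the maximum absolute value is an assumption about how n is determined.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min/max standardization to the range [new_min, new_max].

    Assumes the values are not all identical (otherwise max - min is zero).
    """
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in values]


def decimal_scaling(values):
    """Divide by 10**n, with n the number of digits of the maximum absolute value."""
    n = len(str(int(max(abs(x) for x in values))))
    return [x / 10 ** n for x in values]


incomes = [0, 250_000, 1_000_000]
print(min_max(incomes))          # scaled to the range 0..1
print(decimal_scaling(incomes))  # 1,000,000 has 7 digits, so divide by 10**7
```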

CATEGORIZATION

Categorization (also known as coarse-classiﬁcation, classing, grouping,

or binning) can be done for various reasons. For categorical vari-

ables, it is needed to reduce the number of categories. Consider, for

example, the variable country of origin having 50 different values. When

this variable would be put into a regression model, one would need

49 dummy variables (50 – 1 because of the collinearity), which would

necessitate the estimation of 49 parameters for only one variable. With categorization, one would create categories of values such that fewer parameters have to be estimated and a more robust model is obtained.

For continuous variables, categorization may also be very beneficial. Consider, for example, the age variable and the observed amount of fraudulent cases as depicted in Figure 2.8. Clearly, there is a nonmonotonic relation between risk of fraud and age. If a nonlinear model (e.g., neural network, support vector machine) were used, then the nonlinearity could be modeled directly. However, if a regression model were used (which is typically more common because of its interpretability), then since it can only fit a line, it would miss out on the nonmonotonicity. By categorizing the variable into ranges, part of the nonmonotonicity can be taken into account in the regression. Hence, categorization of continuous variables can be useful to model nonlinear effects in linear models.

[Figure 2.8 Default Risk Versus Age. Fraud risk (0.005 to 0.035) versus age (15 to 75).]

Various methods can be used to do categorization. Two very basic

methods are equal interval binning and equal frequency binning.

Consider, for example, the income values 1,000, 1,200, 1,300, 2,000, 1,800, 1,400. Equal interval binning would create two bins with the same range, Bin 1: from 1,000 to 1,500 and Bin 2: from 1,500 to 2,000, whereas equal frequency binning would create two bins with the same number of observations as follows, Bin 1: 1,000, 1,200, 1,300, and Bin 2: 1,400, 1,800, 2,000. However, both methods are quite basic and do not take into

account a target variable (e.g., churn, fraud, credit risk).
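The two basic binning methods can be sketched as follows on the income values from the example. The function names are illustrative, and assigning a value that falls exactly on a bin boundary to the upper bin (with the maximum kept in the last bin) is an assumption of this sketch.

```python
def equal_interval_bins(values, n_bins):
    """Split the range [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for x in sorted(values):
        # The maximum value would index one past the end, so clamp it.
        idx = min(int((x - lo) / width), n_bins - 1)
        bins[idx].append(x)
    return bins


def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


incomes = [1000, 1200, 1300, 2000, 1800, 1400]
print(equal_interval_bins(incomes, 2))   # four values in 1,000-1,500, two in 1,500-2,000
print(equal_frequency_bins(incomes, 2))  # three values per bin
```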

Many analytics software tools have built-in facilities to do categori-

zation using Chi-squared analysis. A very handy and simple approach

(available in Microsoft Excel) is to use pivot tables. Consider the

examples in Tables 2.3 and 2.4.

Table 2.3 Coarse Classifying the Product Type Variable

Customer ID  Age  Product Type  …  Fraud
C1           44   clothes          No
C2           20   books            No
C3           58   music            Yes
C4           26   clothes          No
C5           30   electro          Yes
C6           32   games            No
C7           48   books            Yes
C8           60   clothes          No
…

One can then construct a pivot table and calculate the odds as follows:

Table 2.4 Pivot Table for Coarse Classifying the Product Type Variable

       Clothes  Books  Music  Electro  Games  …
Good   1,000    2,000  3,000  100      5,000
Bad    500      100    200    80       800
Odds   2        20     15     1.25     6.25

We can then categorize the values based on similar odds. For exam-

ple, category 1 (clothes, electro), category 2 (games), and category 3

(books and music).

Chi-squared analysis is a more sophisticated way to do coarse clas-

siﬁcation. Consider, for example, Table 2.5 for coarse classifying a vari-

able product type.

Suppose we want three categories and consider the following

options:

◾ Option 1: clothes; books and music; others

◾ Option 2: clothes; electro; others

Both options can now be investigated using Chi-squared analy-

sis. The purpose hereby is to compare the empirically observed with

the independence frequencies. For option 1, the empirically observed

frequencies are depicted in Table 2.6.

The independence frequencies can be calculated as follows. For example, the number of nonfraud observations for clothes, given that the odds are the same as in the whole population, is 6,300/10,000 × 9,000/10,000 × 10,000 = 5,670. One then obtains Table 2.7.

Table 2.5 Coarse Classifying the Product Type Variable

Attribute             Clothes  Books  Music  Electro  Games  Movies  Total
No-fraud              6,000    1,600  350    950      90     10      9,000
Fraud                 300      400    140    100      50     10      1,000
No-fraud:Fraud odds   20:1     4:1    2.5:1  9.5:1    1.8:1  1:1     9:1

Table 2.6 Empirical Frequencies Option 1 for Coarse Classifying Product Type

Attribute  Clothes  Books & Music  Others  Total
No-fraud   6,000    1,950          1,050   9,000
Fraud      300      540            160     1,000
Total      6,300    2,490          1,210   10,000

Table 2.7 Independence Frequencies Option 1 for Coarse Classifying Product Type

Attribute  Clothes  Books & Music  Others  Total
No-fraud   5,670    2,241          1,089   9,000
Fraud      630      249            121     1,000
Total      6,300    2,490          1,210   10,000

The more the numbers in both tables differ, the weaker the independence, hence the stronger the dependence and the better the coarse classification. Formally, one can calculate the Chi-squared distance as follows:

$$\chi^2 = \frac{(6000-5670)^2}{5670} + \frac{(300-630)^2}{630} + \frac{(1950-2241)^2}{2241} + \frac{(540-249)^2}{249} + \frac{(1050-1089)^2}{1089} + \frac{(160-121)^2}{121} = 583$$

Likewise, for option 2, the calculation becomes:

$$\chi^2 = \frac{(6000-5670)^2}{5670} + \frac{(300-630)^2}{630} + \frac{(950-945)^2}{945} + \frac{(100-105)^2}{105} + \frac{(2050-2385)^2}{2385} + \frac{(600-265)^2}{265} = 662$$

So, based on the Chi-squared values, option 2 is the better cate-

gorization. Note that formally, one needs to compare the value with

a Chi-squared distribution with k − 1 degrees of freedom, with k the

number of values of the characteristic.
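The comparison of both options can be reproduced with a short sketch that derives the independence frequencies from the row and column totals and sums the squared deviations. The function name and the list-of-rows layout are illustrative choices; the counts are those of Table 2.5 grouped according to the two options.

```python
def chi_squared(observed):
    """Chi-squared distance between observed and independence frequencies.

    `observed` is a list of rows, e.g., [no_fraud_counts, fraud_counts],
    with one column per category.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Frequency expected if category and target were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    return chi2


# Columns: clothes; books and music; others
option1 = [[6000, 1950, 1050], [300, 540, 160]]
# Columns: clothes; electro; others
option2 = [[6000, 950, 2050], [300, 100, 600]]

print(chi_squared(option1))  # approximately 583.9
print(chi_squared(option2))  # approximately 662.9
```

The unrounded results are slightly above the 583 and 662 reported in the text, which appear to be truncated values; the ranking of the two options is unaffected.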

WEIGHTS OF EVIDENCE CODING

Categorization reduces the number of categories for categorical vari-

ables. For continuous variables, categorization will introduce new

variables. Consider, for example, a regression model with age (four

categories, so three parameters) and product type (ﬁve categories, so

four parameters) characteristics. The model then looks as follows:

$$Y = \beta_0 + \beta_1 Age_1 + \beta_2 Age_2 + \beta_3 Age_3 + \beta_4 Prod_1 + \beta_5 Prod_2 + \beta_6 Prod_3 + \beta_7 Prod_4$$

Despite having only two characteristics, the model still needs

eight parameters to be estimated. It would be handy to have a mono-

tonic transformation f(.) such that our model could be rewritten

as follows:

$$Y = \beta_0 + \beta_1 f(Age_1, Age_2, Age_3) + \beta_2 f(Prod_1, Prod_2, Prod_3, Prod_4)$$

The transformation should have a monotonically increasing or

decreasing relationship with Y. Weights-of-evidence coding is one

example of a transformation that can be used for this purpose. This is

illustrated in Table 2.8.

The WOE is calculated as: ln(Dist No-fraud/Dist Fraud). Because of

the logarithmic transformation, a positive (negative) WOE means Dist

No-fraud >(<) Dist Fraud. The WOE transformation thus implements

a transformation monotonically related to the target variable.

The model can then be reformulated as follows:

$$Y = \beta_0 + \beta_1 WOE_{age} + \beta_2 WOE_{product\text{-}type}$$

This gives a more concise model than the model that we started

this section with. However, note that the interpretability of the model

becomes somewhat less straightforward when WOE variables are

being used.
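The WOE values of Table 2.8 can be reproduced with the following sketch, which applies WOE = ln(Dist No-fraud/Dist Fraud) per category. The function name is illustrative, and the per-category counts are read from Table 2.8 (they sum to the totals 1,806 and 194 given there). A category with zero fraud or zero no-fraud observations would need special handling, which is not covered here.

```python
import math


def weights_of_evidence(no_fraud_counts, fraud_counts):
    """Return WOE = ln(Dist No-fraud / Dist Fraud) for each category."""
    total_no_fraud = sum(no_fraud_counts)
    total_fraud = sum(fraud_counts)
    woe = []
    for nf, f in zip(no_fraud_counts, fraud_counts):
        dist_no_fraud = nf / total_no_fraud  # share of all no-fraud cases
        dist_fraud = f / total_fraud         # share of all fraud cases
        woe.append(math.log(dist_no_fraud / dist_fraud))
    return woe


# Age categories: Missing, 18-22, 23-26, 27-29, 30-35, 35-44, 44+
no_fraud = [42, 152, 246, 405, 475, 339, 147]
fraud = [8, 48, 54, 45, 25, 11, 3]

for w in weights_of_evidence(no_fraud, fraud):
    print(f"{100 * w:.2f}%")
```

The output is expressed in percent to match the presentation of the WOE column in Table 2.8.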

Table 2.8 Calculating Weights of Evidence (WOE)

Age      Count  Distr. Count  No-fraud  Distr. No-fraud  Fraud  Distr. Fraud  WOE
Missing  50     2.50%         42        2.33%            8      4.12%         −57.28%
18–22    200    10.00%        152       8.42%            48     24.74%        −107.83%
23–26    300    15.00%        246       13.62%           54     27.84%        −71.47%
27–29    450    22.50%        405       22.43%           45     23.20%        −3.38%
30–35    500    25.00%        475       26.30%           25     12.89%        71.34%
35–44    350    17.50%        339       18.77%           11     5.67%         119.71%
44+      150    7.50%         147       8.14%            3      1.55%         166.08%
         2,000                1,806                      194

VARIABLE SELECTION

Many analytical modeling exercises start with tons of variables, of

which typically only a few actually contribute to the prediction of

the target variable. For example, the average fraud model in fraud

detection has somewhere between 10 and 15 variables. The key ques-

tion is how to ﬁnd these variables. Filters are a very handy variable

selection mechanism. They work by measuring univariate correlations

between each variable and the target. As such, they allow for a quick

screening of which variables should be retained for further analysis.

Various ﬁlter measures have been suggested in the literature. One can

categorize them as depicted in Table 2.9.

The Pearson correlation $\rho_P$ is calculated as follows:

$$\rho_P = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2\,\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}$$

It measures a linear dependency between two variables and always varies between −1 and +1. To apply it as a filter, one could select all variables for which the Pearson correlation is significantly different from 0 (according to the p-value), or, for example, the ones where $|\rho_P| > 0.50$.

Table 2.9 Filters for Variable Selection

                      Continuous Target      Categorical Target
                      (e.g., CLV, LGD)       (e.g., churn, fraud, credit risk)
Continuous variable   Pearson correlation    Fisher score
Categorical variable  Fisher score/ANOVA     Information value
                                             Cramer’s V
                                             Gain/entropy

The Fisher score can be calculated as follows:

$$\frac{|\bar{X}_G - \bar{X}_B|}{\sqrt{s_G^2 + s_B^2}},$$

where $\bar{X}_G$ ($\bar{X}_B$) represents the average value of the variable for the nonfraudsters (fraudsters) and $s_G^2$ ($s_B^2$) the corresponding variances. High values of the Fisher score indicate a predictive variable. To apply it as a filter, one could, for example, keep the top 10 percent. Note that the Fisher score may generalize to a well-known analysis of variance (ANOVA) in case a variable has multiple categories.
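Both filter measures for continuous variables can be sketched in a few lines. The function names are illustrative. For the Fisher score, this sketch assumes the form |mean of nonfraudsters − mean of fraudsters| divided by the square root of the sum of the two variances, and it uses sample variances; both points are assumptions where the text leaves room for interpretation.

```python
import math
from statistics import fmean, variance


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = fmean(xs), fmean(ys)
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )
    return numerator / denominator


def fisher_score(nonfraud_values, fraud_values):
    """Fisher score of one continuous variable for a binary target.

    Assumed form: |mean_G - mean_B| / sqrt(s_G^2 + s_B^2), with sample variances.
    """
    diff = abs(fmean(nonfraud_values) - fmean(fraud_values))
    return diff / math.sqrt(variance(nonfraud_values) + variance(fraud_values))
```

As a filter, one would compute the relevant measure for every candidate variable and retain, for example, the variables with the highest scores.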

The information value (IV) filter is based on weights of evidence and is calculated as follows:

$$IV = \sum_{i=1}^{k} (\text{Dist Good}_i - \text{Dist Bad}_i) \times WOE_i,$$

whereby k represents the number of categories of the variable, Good denotes the no-fraud class, and Bad denotes the fraud class. For the example discussed in Table 2.10, the calculation becomes as shown.

The following rules of thumb apply for the information value:

◾ < 0.02: unpredictive

◾ 0.02 – 0.1: weakly predictive

◾ 0.1 – 0.3: moderately predictive

◾ > 0.3: strongly predictive

Note that the information value assumes that the variable has been categorized. It can actually also be used to adjust/steer the categorization so as to optimize the IV. Many software tools will provide interactive support to do this, whereby the modeler can adjust the categories and gauge the impact on the IV. To apply it as a filter, one can calculate the information value of all (categorical) variables and only keep those for which the IV > 0.1 or, for example, the top 10 percent.

Table 2.10 Calculating the Information Value Filter Measure

Age      Count  Distr. Count  No-fraud  Distr. No-fraud  Fraud  Distr. Fraud  WOE        IV
Missing  50     2.50%         42        2.33%            8      4.12%         −57.28%    0.0103
18–22    200    10.00%        152       8.42%            48     24.74%        −107.83%   0.1760
23–26    300    15.00%        246       13.62%           54     27.84%        −71.47%    0.1016
27–29    450    22.50%        405       22.43%           45     23.20%        −3.38%     0.0003
30–35    500    25.00%        475       26.30%           25     12.89%        71.34%     0.0957
35–44    350    17.50%        339       18.77%           11     5.67%         119.71%    0.1568
44+      150    7.50%         147       8.14%            3      1.55%         166.08%    0.1095
Information Value                                                                        0.6502
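The information value of the age variable in Table 2.10 can be reproduced with the sketch below. The function names are illustrative; `iv_strength` simply encodes the rules of thumb listed above, and the way boundary values (exactly 0.02, 0.1, or 0.3) are assigned is an assumption.

```python
import math


def information_value(good_counts, bad_counts):
    """IV = sum over categories of (Dist Good - Dist Bad) * WOE."""
    total_good, total_bad = sum(good_counts), sum(bad_counts)
    iv = 0.0
    for g, b in zip(good_counts, bad_counts):
        dist_good, dist_bad = g / total_good, b / total_bad
        woe = math.log(dist_good / dist_bad)
        iv += (dist_good - dist_bad) * woe
    return iv


def iv_strength(iv):
    """Map an information value to the rule-of-thumb label."""
    if iv < 0.02:
        return "unpredictive"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    return "strong"


# No-fraud (good) and fraud (bad) counts per age category, as in Table 2.10
no_fraud = [42, 152, 246, 405, 475, 339, 147]
fraud = [8, 48, 54, 45, 25, 11, 3]

iv = information_value(no_fraud, fraud)
print(round(iv, 4), iv_strength(iv))
```

With these counts, the result agrees with the total of 0.6502 in Table 2.10, which falls in the strongly predictive range.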

Another ﬁlter measure based on Chi-squared analysis is Cramer’s V.

Consider, for example, the contingency table depicted in Table 2.11 for

online/ofﬂine transaction versus nonfraud/fraud.

Similar to the example discussed in the section on categorization, the Chi-squared value for independence can then be calculated as follows:

$$\chi^2 = \frac{(500-480)^2}{480} + \frac{(100-120)^2}{120} + \frac{(300-320)^2}{320} + \frac{(100-80)^2}{80} = 10.41$$

This follows a Chi-squared distribution with k − 1 degrees of freedom, with k being the number of classes of the characteristic. The Cramer’s V measure can then be calculated as follows:

$$\text{Cramer's V} = \sqrt{\frac{\chi^2}{n}} = 0.10,$$

with n the number of observations in the data set. Cramer’s V is always

bounded between 0 and 1 and higher values indicate better predictive

power. As a rule of thumb, a cut-off of 0.1 is commonly adopted. One

can then again select all variables where Cramer’s V is bigger than 0.1,

or consider, for example, the top 10 percent. Note that the Informa-

tion Value and Cramer’s V typically consider the same characteristics

as most important.
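The Cramer’s V calculation for Table 2.11 can be sketched as follows. The function name is illustrative. The sketch uses the general definition, which divides the Chi-squared value by n × min(rows − 1, columns − 1); for a two-class target such as nonfraud/fraud this reduces to the square root of the Chi-squared value divided by n, as used in the text.

```python
import math


def cramers_v(table):
    """Return (V, chi2) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    k = min(len(table), len(col_totals)) - 1  # equals 1 for a binary target
    return math.sqrt(chi2 / (n * k)), chi2


# Rows: offline, online; columns: nonfraud, fraud (Table 2.11)
v, chi2 = cramers_v([[500, 100], [300, 100]])
print(round(chi2, 2), round(v, 2))
```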

Table 2.11 Contingency Table for Marital Status versus Good/Bad Customer

          Nonfraud  Fraud  Total
Offline   500       100    600
Online    300       100    400
Total     800       200    1,000

Filters are very handy, as they allow reduction in the number

of dimensions of the data set early in the analysis in a quick way.

Their main drawback is that they work univariately and therefore do not consider, for example, correlations between the variables. Hence, a follow-up input selection step during the modeling phase will be necessary to further refine the characteristics.

Also worth mentioning here is that other criteria may play a role in

selecting variables, such as regulatory compliance and privacy issues.

Note that different regulations may apply in different geographical

regions and hence should be checked. Also operational issues could be

considered. For example, trend variables could be very predictive but

might require too much time to be computed in a real-time, online

fraud detection environment.

PRINCIPAL COMPONENTS ANALYSIS

An alternative method for input or variable selection is principal component analysis, which is a technique to reduce the dimensionality of data by forming new variables that are linear composites of the original variables. These new variables describe the main components or dimensions that are present in the original data set, hence its name. The main dimensions may be, and often are, different from the imposed measurement dimensions, and as such are obtained as a linear combination of those. Figure 2.9 provides a two-dimensional illustration of this. The two measurement dimensions represented by the X and Y axes do not adequately capture the actual dimensions or components present in the data. These are clearly situated at a 45-degree angle compared to the X and Y dimensions and are described or captured by the two principal components PC1 and PC2.

[Figure 2.9 Illustration of Principal Component Analysis in a Two-Dimensional Data Set. Data in the X-Y plane with the principal components PC1 and PC2 drawn at an angle to the axes.]

The maximum number of new variables that can be formed or

derived from the original data (i.e., the number of principal compo-

nents) is equal to the number of original variables. However, when the

aim is data reduction, then typically a reduced set of principal compo-

nents is sufﬁcient to replace the original larger set of variables, since

most of the variance in the original set of variables will be explained

by a limited number of principal components. In other words, most

of the information that is contained in the large set of original vari-

ables typically can be summarized by a small number of new variables.

To explain all the variance in the original data set, the full set of prin-

cipal components is needed, but some of these will only account for a

very small fraction of variance and therefore can be left out, leading to

a reduced dimensionality or number of variables in the data set.

Example: A data set contains 80 ﬁnancial ratio variables describ-

ing the ﬁnancial situation or health of a ﬁrm, which may be indica-

tive or relevant for detecting fraud. However, many of these ﬁnancial

ratios are typically strongly correlated. In other words, many of these

ratios overlap, meaning that they basically express the same informa-

tion. This may be explained by the fact that these ratios are derived

from and summarize the same basic information. Therefore, one could

prefer to combine or summarize the original large set of ratios by a

reduced number of ﬁnancial indices, as can be done by performing a

principal component analysis.

Moreover, the new limited set of financial indices should preferably

be uncorrelated, such that they can be included in a fraud-detection

model without causing the ﬁnal model to become unstable. Correlation

among the explanatory or predictor variables, which is called multi-

collinearity, may result in unstable models. The stability or robustness

of a model refers to the stability of the exact values of the param-

eters of the model that are being estimated based on the sample of

observations. If the values of these parameters heavily depend on the

exact sample of observations used to induce the model, then the model

is called unstable. The values of the parameters, in fact, express the

relation between the explanatory or predictor variables and the depen-

dent or target variable. When the exact relation differs strongly for

different samples of observations, then questions arise with respect to

the exact nature and reliability of this presumed relation. When the

explanatory variables included in a model are correlated, typically, the

resulting model is unstable. Therefore, an input selection procedure is

often performed—for example, using the ﬁlter approach discussed in

the previous paragraph, or alternatively a new set of factors may be

derived using principal component analysis to address this problem,

since the resulting new variables (i.e., the principal components) will

be uncorrelated among themselves.

Principal components are calculated by making use of the

eigenvector decomposition (which will not be explained within

the scope of this book; interested readers may refer to specialized

literature on principal component analysis and eigenvector decomposition).

Let $X_1, X_2, \ldots, X_p$ be the mean-corrected, standardized original

variables, and $\Sigma = \mathrm{cov}(X)$ the corresponding covariance matrix. Let

$\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p \geq 0$ be the $p$ eigenvalues of $\Sigma$ and $e_1, e_2, \ldots, e_p$ the

corresponding eigenvectors. The principal components $PC_j$, $j = 1, \ldots, p$,

corresponding with and in fact replacing the variables $X_1, X_2, \ldots, X_p$

are then given by:

$$PC_j = e_j' X = e_{j1} X_1 + e_{j2} X_2 + \cdots + e_{jp} X_p = X' e_j$$

The eigenvectors express the importance of each of the original

variables in the construction of the new variables. The eigenvectors

determine how the original variables are combined into new variables,

and in what proportions.

The observed variance in data set $X$ that is explained by principal

component $PC_j$ is equal to the corresponding eigenvalue $\lambda_j$. The total

variance or information in the data does not change, since the sum of

the variances of the principal components (that is, the new variables)

is equal to the sum of the variances of the original

variables. Also note that the covariance or correlation between two

principal components is equal to zero: $\mathrm{cov}(PC_i, PC_j) = 0$ for $i \neq j$.
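
As a brief worked illustration of the two statements above, the variance bookkeeping can be written out as follows. This is a sketch under the usual assumption that the eigenvectors are scaled to unit length; the numerical eigenvalues are hypothetical and chosen only for illustration.

$$\sum_{j=1}^{p} \mathrm{Var}(PC_j) = \sum_{j=1}^{p} \lambda_j = \mathrm{tr}(\Sigma) = \sum_{j=1}^{p} \mathrm{Var}(X_j)$$

$$\text{share of variance explained by } PC_j = \frac{\lambda_j}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}$$

For instance, with $p = 2$ standardized variables the total variance equals 2. If, hypothetically, $\lambda_1 = 1.8$ and $\lambda_2 = 0.2$, then $PC_1$ accounts for $1.8 / 2 = 90$ percent of the variance and $PC_2$ for the remaining 10 percent.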

For each observation in the data set, the values for the new vari-

ables can be calculated. These values are called the PC scores and can

be calculated by making use of the eigenvectors as weights on the

mean-corrected data.

Example: A data set consists of two original variables $X_1$ and $X_2$,

which we will replace by two principal components. The first step

consists of calculating the two eigenvectors $e_1$ and $e_2$ of the covariance

matrix $\Sigma$, which can be done straightforwardly using the available

observations in the data set. As such, we get $e_1 = (e_{11}, e_{12}) = (0.562, 0.345)$

and $e_2 = (e_{21}, e_{22}) = (-0.345, 0.562)$. Subsequently, in

a second step the PC scores for the two principal components (i.e.,

the two new variables) are calculated as a function of the original,

mean-corrected, values $(x_1, x_2)$:

$$PC_1 = e_{11} x_1 + e_{12} x_2 = 0.562 x_1 + 0.345 x_2$$

$$PC_2 = e_{21} x_1 + e_{22} x_2 = -0.345 x_1 + 0.562 x_2$$

Filling in the values $x_1$ and $x_2$ for the two original variables for

each observation in the data set then gives the new observations with

the values for the new variables.
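
The calculation of the PC scores in this example can be sketched in a few lines of Python. The eigenvectors are the ones given in the example; the observations and the function name `pc_scores` are hypothetical and serve only to illustrate the mechanics.

```python
from statistics import mean

# Eigenvectors taken from the example in the text.
E1 = (0.562, 0.345)
E2 = (-0.345, 0.562)


def pc_scores(observations):
    """Return a list of (PC1, PC2) scores for (x1, x2) observations."""
    m1 = mean(x1 for x1, _ in observations)
    m2 = mean(x2 for _, x2 in observations)
    scores = []
    for x1, x2 in observations:
        c1, c2 = x1 - m1, x2 - m2  # mean-corrected values
        scores.append((E1[0] * c1 + E1[1] * c2,
                       E2[0] * c1 + E2[1] * c2))
    return scores


# Hypothetical observations, for illustration only.
data = [(2.0, 1.0), (4.0, 3.0), (6.0, 5.0)]
scores = pc_scores(data)
```

Each tuple in `scores` holds the values of the two new variables for one observation.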

How does the transformation of the two original variables in the

above example lead to reducing the data set and to selecting inputs?

By looking at the eigenvalues, one may decide to leave out new

variables. If in the above example the eigenvalue corresponding to $PC_1$

is significantly larger than the eigenvalue corresponding to $PC_2$ (i.e., if

$\lambda_1 \gg \lambda_2$), then it can be decided to drop $PC_2$ from the further analysis.
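
One possible way to make this decision operational is to keep the smallest number of leading components whose eigenvalues together reach a chosen share of the total variance. The sketch below assumes such a cumulative-share rule; the 90 percent threshold, the function name, and the eigenvalues are hypothetical and not prescribed by the text.

```python
def components_to_keep(eigenvalues, min_share=0.90):
    """Number of leading components needed to explain min_share of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= min_share:
            return count
    return len(ordered)


# Hypothetical eigenvalues with lambda_1 much larger than lambda_2:
# the first component alone explains 1.9 / 2.0 = 95 percent.
kept = components_to_keep([1.9, 0.1])
```

With these values only the first component is retained, which corresponds to dropping PC2 in the example.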

Note from this simple example that replacing the original variables

with a (reduced) set of uncorrelated principal components comes at

a price—reduced interpretability. The principal component variables

derived from the original set of variables cannot easily be interpreted,

since they are calculated as a weighted linear combination of the origi-

nal variables. In the previous example, only two original variables were

combined into principal components, still allowing some interpretation

of the resulting principal components. But when the analysis spans

tens or even hundreds or thousands of variables, then any interpre-

tation of the resulting components is clearly prohibited. As discussed

in Chapter 1, in certain settings this might be unacceptable, since the

analysts using the resulting model can no longer interpret it. However,

when interpretability is no concern, then principal component analysis

is a powerful data reduction tool that will yield a better model in terms

of stability as well as predictive performance.

RIDITS

As an alternative to weights of evidence values, one may adopt another

approach to assign numerical values to categorical ordinal variables,

called RIDIT scoring, introduced by Bross (1958), who coined the term

RIDIT in analogy to logit and probit. The following discussion of RIDITs

and PRIDITs is based on a study by Brockett et al. (2002), who adopted

and adapted RIDITs for fraud detection in an unsupervised setting.

The RIDIT scoring mechanism incorporates the ranked nature of

responses, that is, categories of an ordinal categorical variable. Assume

the different response categories are ordered in decreasing likelihood of

fraud suspicion so that a higher categorical response indicates a lesser

suspicion of fraud. In ranking the categories from high- to low-fraud

risk, one may use expert input or historical observed fraud rates. The

RIDIT score for a categorical response value $i$ to variable $t$, with

$\hat{p}_{tj}$ indicating the proportion of the population having value $j$ for variable $t$, is

then calculated as follows:

$$B_{ti} = \sum_{j<i} \hat{p}_{tj} - \sum_{j>i} \hat{p}_{tj}, \quad i = 1, 2, \ldots, k_t$$

The above formula transforms a set of categorical responses into

a set of meaningful numerical values in the interval [–1,1], reﬂect-

ing the relative abnormality of a particular response. Intuitively, the

RIDIT score can be interpreted to be an adjusted or transformed percen-

tile score.

Example: A binary response fraud indicator variable with value yes

occurring for 10 percent of the cases and considered by experts more

indicative of fraud than a value no, occurring for the other 90 percent of

the cases, results in RIDIT scores $B_{t1}(\text{“yes”}) = -0.9$ and $B_{t2}(\text{“no”}) = 0.1$.

For a similar binary fraud indicator with 50 percent of the cases having

a value “yes” and 50 percent having a value “no,” the resulting RIDIT

scores are $B_{t1}(\text{“yes”}) = -0.5$ and $B_{t2}(\text{“no”}) = 0.5$. This clearly indicates

that a response “yes” on the ﬁrst indicator variable is more abnormal

or indicative of fraud than a response “yes” on the second indicator,

and as such the transformation yields RIDIT scores that can be easily

included in a quantitative model and make sense from an operational

or expert perspective.
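
The RIDIT formula and the two binary examples above can be reproduced with a short Python sketch. The function name `ridit_scores` is hypothetical; the input is assumed to be the list of category proportions, ordered from the most to the least fraud-suspicious category.

```python
def ridit_scores(proportions):
    """RIDIT score per category of one ordinal variable.

    proportions[j] is the share of cases in category j, with categories
    ordered in decreasing likelihood of fraud suspicion.
    """
    scores = []
    for i in range(len(proportions)):
        below = sum(proportions[:i])      # categories j < i
        above = sum(proportions[i + 1:])  # categories j > i
        scores.append(below - above)
    return scores


print(ridit_scores([0.10, 0.90]))  # "yes", "no" -> [-0.9, 0.1]
print(ridit_scores([0.50, 0.50]))  # "yes", "no" -> [-0.5, 0.5]
```

The same function applies unchanged to ordinal variables with more than two categories.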

Also for ordinal categorical variables with more than two categor-

ical values RIDIT scores can be calculated using the above formula.

RIDIT scores may be used to replace the categorical fraud indicator

values, and as such allow these categorical variables to be directly

integrated in any numerical analysis for fraud detection.

Note that for calculating RIDIT scores the actual target values do

not have to be known, unlike for the weights of evidence calculation.

Therefore, RIDIT scores can be used in an unsupervised learning setting

and when no labeled historical observations are available.

PRIDIT ANALYSIS

PRIDIT analysis combines the two techniques described in the two

previous paragraphs and results in overall fraud suspicion scores calcu-

lated from a set of ordinal categorical fraud indicators. As such, PRIDIT

analysis may be used to assemble these indicators into a single variable

that can be included in any further analysis. Alternatively, PRIDIT anal-

ysis can be used as a ﬁlter approach to reduce the number of indicator

variables included in the further analysis, as well as the ﬁnal outputted

fraud suspicion score. Given the first two uses, we include PRIDIT

analysis in this chapter although it could be considered an unsuper-

vised learning technique for fraud detection as discussed in Chapter 3.

The reader may refer to Brockett et al. (2002) for an extensive discussion

regarding the mathematical derivation and interpretation of

PRIDIT scores.

Assume that only a set of ordinal categorical fraud indicators is

available, transformed into RIDIT scores as discussed in the above

section. Let $F = (f_{it})$ denote the matrix of individual RIDIT variable

scores for each of the variables $t = 1, 2, \ldots, m$, for each of the cases

$i = 1, 2, \ldots, n$ to be analyzed and scored for fraud. A straightforward

overall fraud suspicion score aggregating these individual RIDIT scores

for the available fraud indicator variables can simply be calculated

by summing all the individual RIDITs. We then get the PRIDIT score

vector by multiplying the matrix $F$ with a unity weight vector $W = (1, 1, \ldots, 1)'$,

with the prime indicating the transposed vector, that is:

$$S = FW$$

Note that these simple aggregated suspicion scores are equally

dependent on each of the indicator variables, since the weights in

vector W, which determine the impact of an indicator on the resulting

score are all set equal to one. However, clearly not every indicator is

equally related to fraud, and thus not every indicator is an equally

strong predictor or warning signal of fraud. Hence, an effective overall

suspicion score should not assign equal importance to each indicator.

On the contrary, a smarter aggregation of the individual indicators assesses the relative

importance and weighs the indicators accordingly when aggregating

them into a single overall suspicion score.

The intuition underlying the calculation of PRIDIT scores is to

adapt the weights according to the correlation or consistency between

the individual RIDIT scores and the resulting overall score. Basically,

PRIDIT scores assign higher weights to an individual fraud indicator

variable if the RIDIT scores of this variable over all cases included in

the analysis are in line with the resulting overall suspicion score. On

the other hand, when a variable is less consistent with the overall

score, then it receives a lower weight. As elaborated in Brockett et al.

(2002), a meaningful set of weights can be obtained by calculating the

ﬁrst principal component, as discussed in a previous section, of the

matrix F′F. The ﬁrst principal component is the weight vector that is

used in calculating the PRIDIT scores, assigning a relative importance

to each indicator according to the intuitive consistency principle

discussed in this paragraph.

The PRIDIT approach can as such be used for variable selection,

since indicators receiving weights that are not signiﬁcantly different

from zero may be removed from the data set. Alternatively, similar to using

principal component analysis for variable reduction, the set of ordinal

indicators aggregated into the PRIDIT score may be replaced by this

score, depending on the purpose and setup of the analysis.
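
The weighting idea can be illustrated with a simplified Python sketch that takes the dominant eigenvector of $F'F$ as weight vector and then computes $S = FW$. This is only an approximation of the intuition described above: the eigenvector is found by plain power iteration (a technique chosen here for brevity), the matrix of RIDIT scores is hypothetical, and the additional normalization steps of the cited study are not reproduced.

```python
import math


def first_principal_component(F, iterations=200):
    """Dominant eigenvector of F'F (unit length), found by power iteration."""
    m = len(F[0])
    # M = F'F, an m x m matrix.
    M = [[sum(row[a] * row[b] for row in F) for b in range(m)]
         for a in range(m)]
    w = [1.0] * m  # start from the unit weights of the simple sum score
    for _ in range(iterations):
        w = [sum(M[a][b] * w[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w


def pridit_scores(F):
    """Return the weight vector W and the score vector S = FW."""
    w = first_principal_component(F)
    return w, [sum(f * wt for f, wt in zip(row, w)) for row in F]


# Hypothetical RIDIT scores: 4 cases (rows) x 3 indicators (columns).
F = [[-0.9, -0.5, 0.5],
     [0.1, 0.5, -0.5],
     [0.1, 0.5, 0.5],
     [0.1, -0.5, -0.5]]
weights, scores = pridit_scores(F)
```

In this toy data set the third indicator moves against the other two and therefore receives a negative weight, while the first case obtains the lowest (most suspicious) score.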

SEGMENTATION

Sometimes the data are segmented before the analytical modeling

starts. A ﬁrst reason for this could be strategic. For example, banks

might want to adopt special strategies to speciﬁc segments of cus-

tomers. It could also be motivated from an operational viewpoint.

For example, new customers must have separate models because the

characteristics in the standard model do not make sense operationally

for them. Segmentation could also be needed to take into account

signiﬁcant variable interactions. For example, if one variable strongly

interacts with a number of others, it might be sensible to segment

according to this variable.

The segmentation can be conducted using the experience and

knowledge from a business expert, or it could be based on statisti-

cal analysis using, for example, decision trees (cf. infra), k-means

clustering or self-organising maps (cf. infra).

Segmentation is a very useful preprocessing activity since one

can now estimate different analytical models each tailored to a

speciﬁc segment. However, one needs to be careful with it since, by

segmenting, the number of analytical models to estimate will increase,

which will obviously also increase the production, monitoring, and

maintenance costs.

REFERENCES

Armstrong, J. S. (2001). Selecting Forecasting Methods. In J.S. Armstrong,

ed. Principles of Forecasting: A Handbook for Researchers and Practitioners.

New York: Springer Science+Business Media, pp. 365–386.

Baesens, B. (2014). Analytics in a Big Data World: The Essential Guide to Data Science

and Its Applications. Hoboken, NJ: John Wiley & Sons.

Bolton, R. J., & Hand, D. J. (2002). Statistical Fraud Detection: A Review.

Statistical Science, 17(3): 235–249.

Caron, F., Vanden Broucke, S., Vanthienen, J., & Baesens, B. (2013). Advanced

Rule-Based Process Analytics: Applications for Risk Response Decisions

and Management Control Activities. Expert Systems with Applications,

Submitted.

Chakraborty, G., Murali, P., & Satish, G. (2013). Text Mining and Analysis: Prac-

tical Methods, Examples, and Case Studies Using SAS. Cary, N.C.: SAS Institute.

Cressey, D. R. (1953). Other People’s Money; A Study of the Social Psychology of

Embezzlement. New York: Free Press.

Cullinan, G. J. (1977). Picking Them by Their Batting Averages’ Recency–Frequency–

Monetary Method of Controlling Circulation, Manual Release 2103. New York:

Direct Mail/Marketing Association.

Dufﬁeld, G., & Grabosky, P. (2001). The Psychology of Fraud. In Trends and

Issues in Crime and Criminal Justice, Australian Institute of Criminology (199).

Elder IV, J., & Thomas, H. (2012). Practical Text Mining and Statistical Analysis for

Non-Structured Text Data Applications. New York: Academic Press.

Fawcett, T., & Provost, F. (1997). Adaptive Fraud Detection. Data Mining and

Knowledge Discovery 1–3(3): 291–316.

Grabosky, P., & Dufﬁeld, G. (2001). Red Flags of Fraud. Trends and Issues in Crime

and Criminal Justice, Australian Institute of Criminology (200).

Han, J., & Kamber, M. (2007). Data Mining: Concepts and Techniques, Third

Edition: Morgan Kaufmann.

Hand, D. (2007, September). Statistical Techniques for Fraud Detection,

Prevention, and Evaluation. Paper presented at the NATO ASI: Mining

Massive Data sets for Security, London, England.

Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining.

Cambridge, MA: Bradford.

Jamain, A. (2001). Benford’s Law. London: Imperial College.

Junqué de Fortuny, E., Martens, D., & Provost, F. (2013). Predictive Modeling

with Big Data: Is Bigger Really Better? Big Data 1(4): 215–226.

Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data, Second

Edition, New York: John Wiley & Sons, p. 408.

Maydanchik, A. (2007). Data Quality Assessment. Bradley Beach, NC: Technics

Publications.

Navarette, E. (2006). Practical Calculation of Expected and Unexpected Losses

in Operational Risk by Simulation Methods (Banca & Finanzas: Documen-

tos de Trabajo, 1(1): pp. 1–12).

Petropoulos, F., Makridakis, S., Assimakopoulos, V., & Nikolopoulos, K. (2014).

“Horses for Courses” in Demand Forecasting. European Journal of Opera-

tional Research.

Schneider, F. (2002). Size and measurement of the informal economy in 110

countries around the world. In Workshop of Australian National Tax Centre,

ANU, Canberra, Australia.

Tan, P.-N. N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining.

Boston: Addison Wesley.

Van Gestel, T., & Baesens, B. (2009). Credit Risk Management: Basic Concepts:

Financial Risk Components, Rating Analysis, Models, Economic and Regulatory

Capital. Oxford: Oxford University Press.

Van Vlasselaer, V., Eliassi-Rad, T., Akoglu, L., Snoeck, M., & Baesens, B.

(2015). Gotcha! Network-based Fraud Detection for Social Security

Fraud. Management Science, Submitted.

Verbeke, W., Dejaeger, K., Martens, D., Hur, J., & Baesens, B. (2012). New

Insights into Churn Prediction in the Telecommunication Sector: A Proﬁt

Driven Data Mining Approach. European Journal of Operational Research

218: 211–229.

CHAPTER 3

Descriptive Analytics for Fraud Detection

INTRODUCTION

Descriptive analytics or unsupervised learning aims at ﬁnding unusual

anomalous behavior deviating from the average behavior or norm

(Bolton and Hand 2002). This norm can be deﬁned in various ways.

It can be deﬁned as the behavior of the average customer at a

snapshot in time, or as the average behavior of a given customer

across a particular time period, or as a combination of both. Predictive

analytics or supervised learning, as will be discussed in the following

chapter, assumes the availability of a historical data set with known

fraudulent transactions. The analytical models built can thus only

detect fraud patterns as they occurred in the past. Consequently, it

will be impossible to detect previously unknown fraud. Predictive

analytics can however also be useful to help explain the anomalies

found by descriptive analytics, as we will discuss later.

When used for fraud detection, unsupervised learning is often

referred to as anomaly detection, since it aims at ﬁnding anomalous

and thus suspicious observations. In the literature, anomalies are com-

monly described as outliers or exceptions. One of the ﬁrst deﬁnitions

of an outlier was provided by Grubbs (1969), as follows:

“An outlying observation, or outlier, is one that appears to

deviate markedly from other members of the sample in

which it occurs.”

A ﬁrst challenge when using unsupervised learning is to deﬁne the

average behavior or norm. Typically, this will highly depend on the

application ﬁeld considered. Also the boundary between the norm and

the outliers is typically not clear-cut. As said earlier, fraudsters will try

to blend into the average or norm as well as possible, thereby substantially

complicating their detection and the corresponding definition of

the norm. Furthermore, the norm may change over time, so the ana-

lytical models built need to be continuously monitored and updated,

possibly in real-time. Finally, anomalies do not necessarily represent

fraudulent observations. Hence, the usage of unsupervised learning

for fraud detection requires extensive follow-up and validation of the

identiﬁed, suspicious observations.

Unsupervised learning can be useful for organizations that start

doing fraud detection and thus have no labeled historical data set

available. It can also complement existing fraud models by uncovering

new fraud mechanisms. This is especially relevant in environments

where fraudsters are continuously adapting their strategies to beat

the detection methods. A ﬁrst example of this is credit card fraud

whereby fraudsters continuously try out new ways of committing

fraud. Another example is intrusion detection in a cyber-fraud setting.

Supervised methods are based on known intrusion patterns, whereas

unsupervised methods or anomaly detection can identify emerging

cyber threats.

In this chapter, we will explore various unsupervised techniques

to detect fraud.

GRAPHICAL OUTLIER DETECTION PROCEDURES

To detect one-dimensional outliers, a histogram or box plot can be used

(see Chapter 2). Two-dimensional outliers can be detected using a scat-

ter plot. The latter can also be extended to a three-dimensional setting,

whereby spinning facilities can be handy to rotate the graph so as to

facilitate ﬁnding the outliers. This is illustrated in Figure 3.1. The plot

clearly shows three outliers marked by asterisks (*) representing claims

with an unusually high amount, a high number of cars of the claimant,

and a small number of days since the previous claim. Clearly, these

claims are suspicious and should be further investigated.

Ideally, graphical methods should be complemented with multidi-

mensional data analysis and online analytical processing (OLAP) facil-

ities. Figure 3.2 shows an example of an OLAP cube representing the

distribution or count of claims based on amount of claim, number of

cars, and recency of previous claim. Once the cube has been popu-

lated from a data warehouse or transactional data source, the following

OLAP operations can be performed:

◾ Roll-up: The idea here is to aggregate across one or more dimensions.

An example of this is the distribution of amount of claim

and recency aggregated across all number of cars (roll up of

number of cars dimension). Another example is the distribution

of amount of claim aggregated across all number of cars and all

number of days since previous claim (roll up of number of cars

and number of days since previous claim dimensions).

◾ Drill-down: This is the opposite of roll-up whereby more detail

is asked for by adding another dimension to the analysis.

◾ Slicing: The idea here is to pick a slice along one of the dimensions.

An example is to show the distribution of amount and

recency for all claims where the claimant has more than or equal

to four cars (slice along the number of cars dimension).

◾ Dicing: The idea here is to fix values for all the dimensions

and create a sub-cube. An example is to show the distribution

of amount, number of cars, and recency for all claims where

amount is between 1,000 and 10,000, number of cars is two or

three, and recency is between one and six months.

Figure 3.1 3D Scatter Plot for Detecting Outliers (axes: amount of claim, number of cars, number of days since previous claim)

Figure 3.2 OLAP Cube for Fraud Detection (cells hold the count of claims; dimensions: amount of claim with bands <1,000, between 1,000 and 5,000, between 5,000 and 10,000, and >10,000; number of cars with values 1, 2, 3, and ≥4; most recent claim with bands <1 month, between 1 and 3 months, between 3 and 6 months, and >6 months)
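
The OLAP operations listed above can be mimicked on a small scale in Python. The sketch below is not how an OLAP tool is implemented; it merely counts a handful of hypothetical claims per cell of a three-dimensional cube and then applies roll-up, slicing, and dicing to those counts. All band labels and variable names are assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical claims: (amount band, number of cars, recency band).
claims = [
    ("<1,000", 1, "<1 month"),
    ("1,000-5,000", 2, "1-3 months"),
    ("1,000-5,000", 4, "<1 month"),
    (">10,000", 4, "<1 month"),
    ("5,000-10,000", 3, "3-6 months"),
    ("<1,000", 1, ">6 months"),
]

# The cube: count of claims per (amount, cars, recency) cell.
cube = Counter(claims)

# Roll-up: aggregate across the number-of-cars dimension.
# (Drill-down is the reverse step, from `rollup` back to `cube`.)
rollup = Counter()
for (amount, cars, recency), count in cube.items():
    rollup[(amount, recency)] += count

# Slicing: keep only the cells where the claimant has four or more cars.
slice_ = {cell: n for cell, n in cube.items() if cell[1] >= 4}

# Dicing: fix values on every dimension to obtain a sub-cube.
dice = {
    cell: n for cell, n in cube.items()
    if cell[0] in {"1,000-5,000", "5,000-10,000"}
    and cell[1] in {2, 3}
    and cell[2] in {"1-3 months", "3-6 months"}
}
```

In practice these operations would typically be executed by the OLAP engine or a data frame library rather than by hand-written loops.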

Many software tools are available to support OLAP analysis. They

excel in offering powerful visualizations, sometimes even augmented

with virtual reality technology for better detecting relationships in

the data. OLAP tools will also typically implement pivot tables for

multidimensional data analysis. Pivot tables allow analysts to summa-

rize tabular data by cross-tabulating user speciﬁed dimensions using

a drag and drop interface. This is illustrated in Figure 3.3 for a credit

card fraud detection data set analyzed in Microsoft Excel. You can see

that the data set has 2,397 observations with 2,364 nonfrauds and

33 frauds. The columns depict the average for the recency, frequency,

and monetary (RFM) variables. Note that the data depicted has

been ﬁltered to non-EU transactions as depicted in the upper-left

corner cell B2. From the results, it can be seen that when looking at

non-EU transactions, fraudulent transactions have a lower average

recency, higher average frequency, and higher average monetary

value when compared to nonfraudulent transactions. These are very

interesting starting insights to further explore fraud patterns in the

data. The pivot table can be easily manipulated using the panel to the

right. Other ﬁlters can be deﬁned, columns and rows can be added,

and descriptive statistics or summarization measures can be varied

(e.g., count, minimum, maximum).

Figure 3.3 Example Pivot Table for Credit Card Fraud Detection

Graphical and OLAP methods are handy and easy to work with.

They are ideal tools to explore the data and get preliminary insights.

However, they are less formal and only limited to a few dimensions.

They require active involvement of the end-user in detecting the

anomalies. In other words, the user should select the dimensions and

decide on the OLAP routines to be performed. For a high-dimensional

data set, this may be a cumbersome exercise. Besides being used dur-

ing preprocessing, OLAP facilities are getting more and more popular

for model post-processing and monitoring, as we will discuss later.

STATISTICAL OUTLIER DETECTION PROCEDURES

A ﬁrst well-known statistical outlier detection method is calculating

the z-scores, as previously discussed in Chapter 2. Remember, obser-

vations for which the absolute value of the z-score is bigger than 3 can

be considered as outliers. A more formal test is the Grubbs test, which

is formulated as follows (see Grubbs 1950):

H0: There are no outliers in the data set.

HA: There is at least one outlier in the data set.

It starts by calculating the z-score for every observation. Let’s say

that the maximum absolute value of the observed z-scores equals G.

The corresponding observation is then considered an outlier at significance

level $\alpha$ if:

$$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$$

where $N$ represents the number of observations, and $t_{\alpha/(2N),\,N-2}$ is the

critical value of a Student’s t-distribution with $N - 2$ degrees of freedom

and significance level equal to $\alpha/(2N)$. If an outlier is detected, it

is removed from the data set and the test can be run again. The test

can also be run for multivariate outliers whereby the z-score can be

replaced by the Mahalanobis distance deﬁned as follows:

$$\sqrt{(x - \bar{x})^t S^{-1} (x - \bar{x})},$$

where $x$ represents the observation, $\bar{x}$ the mean vector, and $S$ the

covariance matrix. A key weakness of this test is that it assumes an

underlying normal distribution, which is not always the case.
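
A single round of the univariate Grubbs test can be sketched as follows. The function names and the claim amounts are hypothetical. The critical t value is passed in as an argument so that the sketch needs only the standard library; it would normally be read from a t-table or obtained from a statistics package (for example, `scipy.stats.t.ppf(1 - alpha / (2 * N), N - 2)`, if SciPy is available).

```python
import math
from statistics import mean, stdev


def grubbs_statistic(values):
    """Return G, the maximum absolute z-score, and the index where it occurs."""
    mu, sigma = mean(values), stdev(values)
    z = [abs(v - mu) / sigma for v in values]
    g = max(z)
    return g, z.index(g)


def grubbs_critical_value(n, t_crit):
    """Right-hand side of the test, with t_crit = t_{alpha/(2N), N-2}."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))


# Hypothetical claim amounts; the last one looks suspicious.
amounts = [102, 98, 105, 97, 101, 99, 103, 100, 96, 150]
g, idx = grubbs_statistic(amounts)

# 3.833 is the tabulated t value for alpha = 0.05 and N = 10
# (upper-tail probability 0.0025, 8 degrees of freedom).
is_outlier = g > grubbs_critical_value(len(amounts), 3.833)
```

If `is_outlier` is true, the observation at position `idx` would be removed and the test repeated on the remaining data, as described above.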

Other statistical procedures ﬁt a distribution, or mixture of distri-

butions (using, e.g., maximum likelihood or expectation-maximization

procedures) and label the observations with small values for the prob-

ability density function as outliers.

Break-Point Analysis

Break-point analysis is an intra-account fraud detection method

(Bolton and Hand, 2001). A break point indicates a sudden change in

account behavior, which merits further inspection. The method starts

from deﬁning a ﬁxed time window. This time window is then split into

an old and new part. The old part represents the local model or proﬁle

against which the new observations will be compared. For example, in

Bolton and Hand (2001), the time window was set to 24 transactions

whereby 20 transactions made up the local model, and 4 transactions

were used for testing. A Student’s t-test can be used to compare the

averages of the new and old parts. Observations can then be ranked

according to their value of the t-statistic. This is illustrated in Figure 3.4.

Figure 3.4 Break-Point Analysis (amount spent plotted over time, with the local model window followed by the break point)
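
The comparison of the old and new parts of the window can be sketched as follows, using the 20/4 split mentioned above. The text only states that a Student’s t-test can be used; the unequal-variance (Welch) form of the t-statistic is an assumption made here, as are the function name and the transaction amounts.

```python
import math
from statistics import mean, variance


def break_point_t(transactions, n_old=20, n_new=4):
    """t-statistic comparing the newest n_new amounts with the n_old before them."""
    window = transactions[-(n_old + n_new):]
    old, new = window[:n_old], window[n_old:]
    # Welch's form of the two-sample t-statistic (unequal variances).
    se = math.sqrt(variance(old) / len(old) + variance(new) / len(new))
    return (mean(new) - mean(old)) / se


# Hypothetical account: stable spending followed by a sudden jump.
stable = [50, 52, 48, 51, 49] * 4
account = stable + [120, 130, 125, 135]
t_stat = break_point_t(account)
```

Computing this statistic for every account and ranking the accounts by its value yields the list of accounts that merit further inspection.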

Peer-Group Analysis

Peer-group analysis was also introduced by Bolton and Hand (2001).

A peer group is a group of accounts that behave similarly to the target

account. When the behavior of the latter starts to deviate substantially

from its peers, an anomaly can be signaled. Peer-group analysis pro-

ceeds in two steps. In step 1, the peer group of a particular account

needs to be identiﬁed. This can be accomplished either by using prior

business knowledge or in a statistical way.

For example, in an employee fraud context, people sharing similar

jobs can be grouped as peers. Another example is healthcare fraud

detection, whereby the aim is to detect fraudulent claim behavior of

doctors across geographical regions. Suppose the target observation is

a cardiologist in San Francisco; then its peers are deﬁned as all other

cardiologists in San Francisco. Statistical similarity metrics can also be

used to deﬁne peers, but these are typically application speciﬁc. Popular

examples here are Euclidean-based metrics (see, e.g., Bolton and Hand

2001; Weston et al. 2008). The number of peers should be carefully

selected. It cannot be too small, or the method becomes too local and

thus sensitive to noise, and also not too large, or the method becomes

too global and thus insensitive to local important irregularities.

In step 2, the behavior of the target account is contrasted with its

peers using a statistical test such as a Student’s t-test, or a distance metric

such as the Mahalanobis distance, which works similarly.

Let’s work out an example in a credit card context. Assume our

target account has the following time series:

$$y_1, y_2, \ldots, y_{n-1}, y_n$$

where $y_i$ represents the amount spent at time (e.g., day or week) $i$.

The aim is now to verify whether the amount spent at time $n$, $y_n$, is

anomalous. We start by identifying the $k$ peers of the target account.

These are depicted in gray in Table 3.1 below, whereby all accounts

have been sorted according to their similarity to the target.

Table 3.1 Transaction Data Set for Peer-Group Analysis

Account            Time 1     Time 2     …    Time n-1     Time n
Account m          x_{m,1}    x_{m,2}    …    x_{m,n-1}    x_{m,n}
…
Peer k             x_{k,1}    x_{k,2}    …    x_{k,n-1}    x_{k,n}
…
Peer 2             x_{2,1}    x_{2,2}    …    x_{2,n-1}    x_{2,n}
Peer 1             x_{1,1}    x_{1,2}    …    x_{1,n-1}    x_{1,n}
Target account     y_1        y_2        …    y_{n-1}      y_n

Note: the rows Peer 1 to Peer k (shaded gray in the original table) form the peer group.

To see whether $y_n$ is an outlier, a t-score can be calculated as follows:

$$\frac{y_n - \bar{x}_{1:k,n}}{s}$$

where $\bar{x}_{1:k,n}$ represents the average of $x_{1,n}, \ldots, x_{k,n}$ and $s$ the corresponding

standard deviation. Although a Student’s t-distribution can

be used for statistical interpretation, it is recommended to simply order

the observations in terms of their t-score and further inspect the ones

with the highest scores. This is illustrated in Figure 3.5.

Figure 3.5 Peer-Group Analysis (amount spent over time for the target account and its peer group, with the outlier marked)
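
Step 2 of peer-group analysis can be sketched as follows, assuming that the peers of each target account have already been identified in step 1. The account names, amounts, and function name are hypothetical, and ranking by the absolute value of the t-score is an assumption made here so that unusually low amounts are surfaced as well.

```python
from statistics import mean, stdev


def peer_group_t_score(y_n, peer_values):
    """(y_n minus the peer average at time n) divided by the peer standard deviation."""
    return (y_n - mean(peer_values)) / stdev(peer_values)


# Hypothetical data: for each target account, the amount spent at time n
# and the amounts spent at time n by its k = 5 peers.
accounts = {
    "A": (480.0, [100.0, 110.0, 95.0, 105.0, 90.0]),
    "B": (102.0, [100.0, 98.0, 104.0, 101.0, 97.0]),
}

scores = {name: peer_group_t_score(y, peers)
          for name, (y, peers) in accounts.items()}

# Order the accounts by the size of their t-score, largest first.
ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
```

Here account A deviates strongly from its peers and ends up at the top of the ranking, so it would be inspected first.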

A key advantage of peer-group analysis when compared to break-

point analysis is that it tracks anomalies by considering inter-account

instead of intra-account behavior. For example, if one was to compare

transaction amounts for a particular account with previous amounts

on that same account (intra-account), then the spending behavior dur-

ing Christmas will deﬁnitely be ﬂagged as anomalous. By considering

peers instead (inter-account), this problem is avoided. It is important

to note that both break-point and peer-group analysis will detect local

anomalies rather than global anomalies. In other words, patterns or

sequences that are not unusual in the global population might still be

ﬂagged as suspicious when they appear to be unusual compared to

their local proﬁle or peer behavior.

Association Rule Analysis

Association rules detect frequently occurring relationships between

items (Agrawal, Imielinski et al. 1993). They were originally intro-

duced in a market basket analysis context to detect which items are

frequently purchased together. The key input is a transactions database $D$ consisting of a transaction identifier and a set of items $\{i_1, i_2, \ldots, i_n\}$ selected from all possible items $I$. An association rule is then an implication of the form $X \Rightarrow Y$, whereby $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$. $X$ is referred to as the rule antecedent whereas $Y$ is referred to as the rule consequent. Examples of association rules could be:

◾ If a customer has a car loan and car insurance, then the customer has a checking account in 80 percent of the cases.

◾ If a customer buys spaghetti, then the customer buys red wine in 70 percent of the cases.

◾ If a customer visits web page A, then the customer will visit web page B in 90 percent of the cases.

It is hereby important to note that association rules are stochastic

in nature. This means that they should not be interpreted as a univer-

sal truth, and are characterized by statistical measures quantifying the

strength of the association. Furthermore, the rules measure correla-

tional associations and should not be interpreted in a causal way.

In a fraud setting, association rules can be used to detect fraud rings in insurance. The transaction identifier then corresponds to a claim identifier and the items to the various parties involved such as the insured, claim adjuster, police officer and claim service provider (e.g., auto repair shop, medical provider, home repair contractor, etc.). Let's consider an example of a transactions database as depicted in Table 3.2.

Table 3.2 Transactions Database for Insurance Fraud Detection

| Claim Identifier | Parties Involved |
|---|---|
| 1 | insured A, police officer X, claim adjuster 1, auto repair shop 1 |
| 2 | insured A, claim adjuster 2, police officer X |
| 3 | insured A, police officer Y, auto repair shop 1 |
| 4 | insured A, claim adjuster 1, claim adjuster 1, police officer Y |
| 5 | insured B, claim adjuster 2, auto repair shop 2, police officer Z |
| 6 | insured A, auto repair shop 2, auto repair shop 1, police officer X |
| 7 | insured C, police officer X, auto repair shop 1 |
| 8 | insured A, auto repair shop 1, police officer Z |
| 9 | insured A, auto repair shop 1, police officer X, claim adjuster 1 |
| 10 | insured B, claim adjuster 3, auto repair shop 1 |

The goal is now to ﬁnd frequently occurring relationships or asso-

ciation rules between the various parties involved in the handling of

the claim. This will be solved using a two-step procedure. In step 1,

the frequent item sets will be identiﬁed. The frequency of an item set

is measured by means of its support, which is the percentage of total

transactions in the database that contain the item set. Hence, the item set $X$ has support $s$ if $100 \times s$ percent of the transactions in $D$ contain $X$. It can be formally defined as follows:

$$\text{support}(X) = \frac{\text{number of transactions supporting } X}{\text{total number of transactions}}$$

Consider the item set {insured A, police ofﬁcer X, auto repair

shop 1}. This item set occurs in transactions 1, 6, and 9, thereby giving a

support of 3/10 or 30 percent. A frequent item set can now be deﬁned

as an item set of which the support is higher than a minimum value

as speciﬁed by the data scientist (e.g., 10 percent). Computationally

efﬁcient procedures have been developed to identify the frequent item

sets (Agrawal, Imielinski et al. 1993).
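Step 1 can be sketched as follows for the claims of Table 3.2. This is a brute-force illustration that simply tries every candidate item set of a given size; it is not one of the computationally efficient procedures referred to above. The function names and the minimum support of 25 percent are our own assumptions.

```python
from itertools import combinations

# Table 3.2: one set of parties per claim (a party listed twice counts once)
claims = {
    1: {"insured A", "police officer X", "claim adjuster 1", "auto repair shop 1"},
    2: {"insured A", "claim adjuster 2", "police officer X"},
    3: {"insured A", "police officer Y", "auto repair shop 1"},
    4: {"insured A", "claim adjuster 1", "police officer Y"},
    5: {"insured B", "claim adjuster 2", "auto repair shop 2", "police officer Z"},
    6: {"insured A", "auto repair shop 2", "auto repair shop 1", "police officer X"},
    7: {"insured C", "police officer X", "auto repair shop 1"},
    8: {"insured A", "auto repair shop 1", "police officer Z"},
    9: {"insured A", "auto repair shop 1", "police officer X", "claim adjuster 1"},
    10: {"insured B", "claim adjuster 3", "auto repair shop 1"},
}

def support(item_set, transactions):
    """Fraction of transactions that contain every item of item_set."""
    hits = sum(1 for parties in transactions.values() if item_set <= parties)
    return hits / len(transactions)

def frequent_item_sets(transactions, min_support, size):
    """All item sets of the given size whose support is higher than min_support."""
    items = sorted(set().union(*transactions.values()))
    frequent = {}
    for candidate in combinations(items, size):
        s = support(set(candidate), transactions)
        if s > min_support:
            frequent[candidate] = s
    return frequent

ring = {"insured A", "police officer X", "auto repair shop 1"}
ring_support = support(ring, claims)   # 3 of the 10 claims
frequent_triples = frequent_item_sets(claims, min_support=0.25, size=3)
print(ring_support, frequent_triples)
```

With this threshold, the only item set of size three that qualifies is the one discussed above.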


Once the frequent item sets have been found, the association rules

can be derived in step 2. Multiple association rules can be deﬁned based

on the same item set. Consider the item set {insured A, police ofﬁcer

X, auto repair shop 1}. Example association rules could be:

If insured A And police officer X ⇒ auto repair shop 1

If insured A And auto repair shop 1 ⇒ police officer X

If insured A ⇒ auto repair shop 1 And police officer X

The strength of an association rule can be quantiﬁed by means of

its conﬁdence. The conﬁdence measures the strength of the association

and is deﬁned as the conditional probability of the rule consequent,

given the rule antecedent. The rule $X \Rightarrow Y$ has confidence $c$ if $100 \times c$ percent of the transactions in $D$ that contain $X$ also contain $Y$. It can be formally defined as follows:

$$\text{confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{\text{support}(X \cup Y)}{\text{support}(X)}$$

Consider the association rule “If insured A And police officer X ⇒ auto repair shop 1.” The antecedent item set {insured A, police officer X} occurs in transactions 1, 2, 6, and 9. Out of these four transactions, three also include the consequent item set {auto repair shop 1}, which results in a confidence of three-fourths, or 75 percent. Again, the data scientist has to specify a minimum confidence in order for an association rule to be considered interesting.
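The confidence of the three example rules can be checked with a short sketch. The party labels are abbreviated versions of those in Table 3.2, and the helper names are our own.

```python
# Table 3.2 with abbreviated party labels; each claim is one set of parties
claims = [
    {"insured A", "officer X", "adjuster 1", "shop 1"},
    {"insured A", "adjuster 2", "officer X"},
    {"insured A", "officer Y", "shop 1"},
    {"insured A", "adjuster 1", "officer Y"},
    {"insured B", "adjuster 2", "shop 2", "officer Z"},
    {"insured A", "shop 2", "shop 1", "officer X"},
    {"insured C", "officer X", "shop 1"},
    {"insured A", "shop 1", "officer Z"},
    {"insured A", "shop 1", "officer X", "adjuster 1"},
    {"insured B", "adjuster 3", "shop 1"},
]

def support(item_set, transactions):
    """Fraction of transactions that contain every item of item_set."""
    return sum(item_set <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(X union Y) / support(X) for the rule X => Y."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

rules = [
    ({"insured A", "officer X"}, {"shop 1"}),
    ({"insured A", "shop 1"}, {"officer X"}),
    ({"insured A"}, {"shop 1", "officer X"}),
]
confidences = [confidence(x, y, claims) for x, y in rules]
for (x, y), c in zip(rules, confidences):
    print(sorted(x), "=>", sorted(y), round(c, 2))
```

All three rules share the same item set and hence the same support, yet their confidences differ, which is why both measures are specified separately.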

Once all association rules have been found, they can be inspected more closely and validated. In our example, the association “If insured A And police officer X ⇒ auto repair shop 1” does not necessarily imply a fraud ring, but it is at least worth the effort to further inspect the relationship between these parties.

CLUSTERING

Introduction

The aim of clustering is to split up a set of observations into segments

such that the homogeneity within a segment is maximized (cohesive),


and the heterogeneity between segments is maximized (separated)

(Everitt, Landau et al. 2010). Examples of applications in fraud

detection include:

◾ Clustering transactions in a credit card setting

◾ Clustering claims in an insurance setting

◾ Clustering tax statements in a tax-inspection setting

◾ Clustering cash transfers in an anti-money laundering setting

Various types of clustering data can be used, such as customer char-

acteristics (e.g., sociodemographic, behavioral, lifestyle, …), account

characteristics, transaction characteristics, etc. A very popular set of transaction characteristics used for clustering in fraud detection consists of the recency, frequency, and monetary (RFM) variables, as introduced in Chapter 2.

Note that besides structured information, unstructured information such as emails, call records, and social media information might also be considered. As always in analytics, it is important to carefully select the data for clustering. The more data the better, although care should be taken to avoid excessive amounts of correlated data by applying unsupervised feature selection methods. One very simple approach here is to calculate the Pearson correlation between each pair of data characteristics and to retain only one characteristic of each pair that shows a significant correlation.
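A minimal sketch of this correlation filter is given below. The data, the variable names, and the cutoff of 0.9 on the absolute correlation are hypothetical; in practice, the cutoff (or a significance test) would have to be chosen by the data scientist.

```python
from math import sqrt
from statistics import mean

def pearson(a, b):
    """Pearson correlation between two equally long lists of values."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def drop_correlated(columns, threshold=0.9):
    """Keep a characteristic only if it is not strongly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical characteristics for five accounts
data = {
    "monetary": [100.0, 250.0, 80.0, 400.0, 150.0],
    "monetary_eur": [90.0, 225.0, 72.0, 360.0, 135.0],   # rescaled copy of monetary
    "recency": [3.0, 1.0, 7.0, 2.0, 10.0],
}
kept = drop_correlated(data)
print(kept)
```

Which member of a correlated pair is retained depends here on the column order only; expert knowledge could be used to make a more deliberate choice.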

When used for fraud detection, a possible aim of clustering may

be to group anomalies into small, sparse clusters. These can then be

further analyzed and inspected in terms of their characteristics and

potentially fraudulent behavior (see Figure 3.6).

Different types of clustering techniques can be applied for fraud

detection. At a high level, they can be categorized as either hierarchical

or nonhierarchical (see Figure 3.7).

Distance Metrics

As said, the aim of clustering is to group observations based on simi-

larity. Hence, a distance metric is needed to quantify similarity. Various

distance metrics have been introduced in the literature for both con-

tinuous and categorical data.

Figure 3.6 Cluster Analysis for Fraud Detection (axes: recency versus frequency; the anomalies are marked)

Figure 3.7 Hierarchical Versus Nonhierarchical Clustering Techniques (hierarchical: agglomerative, divisive; nonhierarchical: k-means, SOM)

For continuous data, the Minkowski distance or $L_p$ norm between two observations $x_i$ and $x_j$ can be defined as follows:

$$D(x_i, x_j) = \left( \sum_{k=1}^{n} |x_{ik} - x_{jk}|^p \right)^{1/p},$$

where $n$ represents the number of variables. When $p$ equals 1, the Minkowski distance is also referred to as the Manhattan or city block distance. When $p$ equals 2, the Minkowski distance becomes the well-known Euclidean distance. Both are illustrated in Figure 3.8. For the example depicted, the distance measures become:

$$\text{Euclidean}: \sqrt{(1500 - 1000)^2 + (10 - 5)^2} \approx 500$$

$$\text{Manhattan}: |1500 - 1000| + |10 - 5| = 505$$

Figure 3.8 Euclidean Versus Manhattan Distance (axes: recency versus monetary)

From this example, it is clear that the amount variable dominates the distance metric since it is measured on a larger scale than the recency variable. Hence, to appropriately take this into account, it is recommended to scale both variables to a similar range using any of the standardization procedures we discussed in Chapter 2. Note that the Euclidean distance is never longer than the Manhattan distance; the two coincide only when the observations differ in at most one variable. The Euclidean metric is the most popular metric used for quantifying the distance between continuous variables. Other less frequently used distance measures are based on the Pearson correlation or cosine measure.
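The Minkowski distance can be written as a one-line function, shown here for the two observations of the example. The ordering of the coordinates as (monetary, recency) is an assumption made for the illustration.

```python
def minkowski(x_i, x_j, p):
    """Minkowski (L_p) distance between two observations of equal length."""
    return sum(abs(a - b) ** p for a, b in zip(x_i, x_j)) ** (1 / p)

# The two observations from the example, as (monetary, recency)
observation_1 = (1500, 10)
observation_2 = (1000, 5)

manhattan = minkowski(observation_1, observation_2, p=1)
euclidean = minkowski(observation_1, observation_2, p=2)
print(manhattan, round(euclidean, 3))
```

Both values are driven almost entirely by the monetary difference of 500, which illustrates why the variables should be standardized before distances are computed.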

Besides continuous variables, also categorical variables can be used

for clustering. Let’s ﬁrst discuss the case of binary variables. These are

often used in insurance fraud detection methods, which are typically

based on a series of red-ﬂag indicators to label a claim as suspicious

or not. Assume we have the following data set with binary red-ﬂag

indicators.

| | Poor Driving Record | Premium Paid in Cash | Car Purchase Information Available | Coverage Increased | Car Was Never Inspected or Seen |
|---|---|---|---|---|---|
| Claim 1 | Yes | No | Yes | Yes | No |
| Claim 2 | Yes | Yes | No | No | No |
| … | | | | | |

A first way to calculate the similarity between claims 1 and 2 is to use the simple matching coefficient (SMC), which calculates the proportion of variables on which both claims have identical values:

$$\text{SMC}(\text{Claim 1}, \text{Claim 2}) = 2/5.$$

A tacit assumption behind the SMC is that both states of the variable (Yes versus No) are equally important and should thus both be considered. Another option is to use the Jaccard index, whereby the No-No matches are left out of the computation:

$$\text{Jaccard}(\text{Claim 1}, \text{Claim 2}) = 1/4.$$

The Jaccard index measures the similarity between both claims across those red flags that were raised at least once. It is especially useful in those situations where many red-flag indicators are available and typically only a few are raised. Consider, for example, a fraud-detection system with 100 red-flag indicators, of which on average 5 are raised. If you used the simple matching coefficient, then typically all claims would be very similar since the No-No matches would dominate the count, thereby creating no meaningful clustering solution. By using the Jaccard index, a better idea of the claim similarity can be obtained. The Jaccard index has actually been very popular in fraud detection.
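Both coefficients are easy to verify for the two claims above. In this sketch, Yes is coded as 1 and No as 0; returning 0 when neither claim raises any red flag is our own convention, since the Jaccard index is undefined in that case.

```python
# Red-flag indicators of the two claims above, coded as Yes = 1 and No = 0
claim_1 = [1, 0, 1, 1, 0]
claim_2 = [1, 1, 0, 0, 0]

def simple_matching(a, b):
    """Proportion of variables on which both observations have the same value."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def jaccard(a, b):
    """Yes-Yes matches divided by the variables where at least one value is Yes."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either if either else 0.0   # convention when no flag is raised

smc_value = simple_matching(claim_1, claim_2)
jaccard_value = jaccard(claim_1, claim_2)
print(smc_value, jaccard_value)
```

Because `simple_matching` only tests for equality, the same function can also be applied to categorical variables with more than two values.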

Let’s now consider the case of categorical variables with more

than two values. Assume we have the following data in a medical

insurance setting.

| | Treatment Day | Distance between Clinic and Subject's Home | Type of Diagnosis | Risk Class | Claim Submitted by |
|---|---|---|---|---|---|
| Claim 1 | Sunday | Medium | Severe | C | Phone |
| Claim 2 | Wednesday | Medium | Life threatening | C | Email |
| … | | | | | |

A ﬁrst option here is to code the categorical variables as 0/1 dum-

mies and apply the Manhattan or Euclidean distance metrics discussed

earlier. However, this may be cumbersome in case of categorical vari-

ables with lots of values. Coarse classiﬁcation might be considered to

reduce the number of dummy variables, but, remember, since we don’t


have a target variable in this unsupervised setting, it should be based on

expert knowledge. Another option would be to use the simple match-

ing coefﬁcient (SMC) and count the number of identical matches. In

our case, the SMC would become 2/5.

Many data sets will contain both continuous and categorical vari-

ables, which complicates the distance calculation. One option here is

to code the categorical variables as 0/1 dummies and use a continuous

distance measure. Another option is to use a (weighted) combination

of distance measures, although this is less straightforward and thus less

frequently used.

Hierarchical Clustering

Once the distance measures have been chosen, the clustering process can start. A first popular set of techniques are hierarchical clustering methods. Depending on the starting point of the analysis, divisive or agglomerative hierarchical clustering methods can be used. Divisive hierarchical clustering starts from the whole data set in one cluster, and then breaks this up into progressively smaller clusters until one observation per cluster remains (right to left in Figure 3.9). Agglomerative clustering works the other way around: it starts from each observation in its own cluster, and then continues to merge the ones that are most similar until all observations make up one big cluster (left to right in Figure 3.9). The optimal clustering solution then lies somewhere in between the extremes to the left and right, respectively.

Figure 3.9 Divisive Versus Agglomerative Hierarchical Clustering (agglomerative: Step 0 to Step 4 from left to right; divisive: Step 0 to Step 4 from right to left)

Although we have earlier discussed various distance metrics to

quantify the distance between individual observations, we haven’t

talked about how to measure distances between clusters. Also here,

various options are available, as depicted in Figure 3.10. The single

linkage method deﬁnes the distance between two clusters as the

smallest possible distance, or the distance between the two most simi-

lar objects. The complete linkage method deﬁnes the distance between

two clusters as the biggest distance, or the distance between the two

most dissimilar objects. The average linkage method calculates the

average of all possible distances. The centroid method calculates the

distance between the centroids of both clusters. Finally, Ward's distance between two clusters $C_i$ and $C_j$ is calculated as the difference between the total within cluster sum of squares for the two clusters separately, and the total within cluster sum of squares obtained from merging the clusters $C_i$ and $C_j$ into one cluster $C_{ij}$. It is calculated as follows:

$$D_{\text{Ward}}(C_i, C_j) = \sum_{x \in C_i} (x - c_i)^2 + \sum_{x \in C_j} (x - c_j)^2 - \sum_{x \in C_{ij}} (x - c_{ij})^2,$$

where $c_i$, $c_j$, $c_{ij}$ is the centroid of cluster $C_i$, $C_j$, and $C_{ij}$, respectively.

Figure 3.10 Calculating Distances between Clusters (panels: single linkage, complete linkage, average linkage, centroid method)
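The five ways of measuring the distance between two clusters can be sketched as follows, using Euclidean distances and two small hypothetical clusters. For Ward's method, the sketch reports the magnitude of the difference defined above, i.e., the increase in the within cluster sum of squares caused by the merge, which is never negative.

```python
from itertools import product
from math import dist

def centroid(cluster):
    """Coordinate-wise mean of the observations in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def within_ss(cluster):
    """Within cluster sum of squared distances to the centroid."""
    c = centroid(cluster)
    return sum(dist(x, c) ** 2 for x in cluster)

def pairwise(c_i, c_j):
    """All distances between an observation of c_i and one of c_j."""
    return [dist(p, q) for p, q in product(c_i, c_j)]

def single_linkage(c_i, c_j):
    return min(pairwise(c_i, c_j))        # two most similar observations

def complete_linkage(c_i, c_j):
    return max(pairwise(c_i, c_j))        # two most dissimilar observations

def average_linkage(c_i, c_j):
    distances = pairwise(c_i, c_j)
    return sum(distances) / len(distances)

def centroid_distance(c_i, c_j):
    return dist(centroid(c_i), centroid(c_j))

def ward_increase(c_i, c_j):
    # Merged sum of squares minus the two separate sums of squares
    return within_ss(c_i + c_j) - within_ss(c_i) - within_ss(c_j)

# Two hypothetical clusters of two observations each
c_i = [(4, 4), (5, 4)]
c_j = [(7, 5), (8, 5)]

results = {
    "single": single_linkage(c_i, c_j),
    "complete": complete_linkage(c_i, c_j),
    "average": average_linkage(c_i, c_j),
    "centroid": centroid_distance(c_i, c_j),
    "ward": ward_increase(c_i, c_j),
}
print({name: round(value, 3) for name, value in results.items()})
```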

In order to decide on the optimal number of clusters, one could use a dendrogram or scree plot. A dendrogram is a tree-like diagram that records the sequence of merges. The vertical (or horizontal) scale then gives the distance between the two clusters that are merged. One can then cut the dendrogram at the desired level to find the optimal clustering. This is illustrated in Figure 3.11 and Figure 3.12 for a birds clustering example. A scree plot is a plot of the distance at which clusters are merged. The elbow point then indicates the optimal clustering. This is illustrated in Figure 3.13.

Figure 3.11 Example for Clustering Birds. The Numbers Indicate the Clustering Steps (birds shown: chicken, duck, pigeon, parrot, owl, eagle, canary)

Figure 3.12 Dendrogram for Birds Example. The Thick Black Line Indicates the Optimal Clustering

Figure 3.13 Scree Plot for Clustering (axes: number of clusters versus distance)

A key advantage of hierarchical clustering is that the number of

clusters does not need to be speciﬁed prior to the analysis. A disadvan-

tage is that the methods do not scale very well to large data sets. Also,

the interpretation of the clusters is often subjective and depends on the

business expert and/or data scientist.

Example of Hierarchical Clustering Procedures

To illustrate the various hierarchical clustering procedures discussed,

suppose we have a data set of seven observations, as depicted in

Table 3.3. Figure 3.14 displays the corresponding scatter plot.

Table 3.3 Data Set for Hierarchical Clustering

| | X | Y |
|---|---|---|
| A | 4 | 4 |
| B | 5 | 4 |
| C | 7 | 5 |
| D | 8 | 5 |
| E | 11 | 5 |
| F | 2 | 7 |
| G | 1 | 3 |

Figure 3.14 Scatter Plot of Hierarchical Clustering Data

The output of the various hierarchical clustering procedures is depicted in Figure 3.15. As can be observed, single linkage results in thin, long, and elongated clusters since dissimilar objects are not accounted for. Complete linkage will make the clusters tighter, more balanced and spherical, which is often more desirable. Average linkage prefers to merge clusters with small variances, which often results in clusters with similar variance. Although the centroid method seems similar at first sight, the resulting clustering solution is different as depicted in the figure. Ward's method prefers to merge clusters with a small number of observations and often results in balanced clusters.

Figure 3.15 Output of Hierarchical Clustering Procedures (for each of single linkage, complete linkage, average linkage, the centroid method, and Ward's method: a dendrogram with the merge height on the vertical axis and a plot of the resulting clusters)
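The agglomerative procedure can be traced by hand or with a naive implementation such as the one below, applied to the seven observations of Table 3.3 with Euclidean distances. It is a sketch for small data sets only, restricted to single and complete linkage, and not the routine that produced Figure 3.15; when two merges are tied, it simply takes the first pair it encounters.

```python
from itertools import combinations, product
from math import dist

# Table 3.3
points = {"A": (4, 4), "B": (5, 4), "C": (7, 5), "D": (8, 5),
          "E": (11, 5), "F": (2, 7), "G": (1, 3)}

def cluster_distance(c_i, c_j, linkage):
    """Single linkage uses the smallest pairwise distance, complete the largest."""
    distances = [dist(points[p], points[q]) for p, q in product(c_i, c_j)]
    return min(distances) if linkage == "single" else max(distances)

def agglomerate(linkage):
    """Merge the two closest clusters until one cluster remains; record each merge."""
    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        c_i, c_j = min(combinations(clusters, 2),
                       key=lambda pair: cluster_distance(*pair, linkage))
        height = cluster_distance(c_i, c_j, linkage)
        clusters = [c for c in clusters if c not in (c_i, c_j)] + [c_i + c_j]
        merges.append((c_i + c_j, round(height, 2)))
    return merges

single_heights = [height for _, height in agglomerate("single")]
complete_heights = [height for _, height in agglomerate("complete")]
print(single_heights)
print(complete_heights)
```

With single linkage, the observations A, B, C, D, and E are chained together one after the other before F and G join, which is the elongated behavior described above. With complete linkage, E is first merged with C and D only, and the last merge happens at a much larger height.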

k-Means Clustering

k-means clustering is a nonhierarchical procedure that works along the

following steps (see Jain 2010; MacQueen 1967):

1. Select $k$ observations as initial cluster centroids (seeds).

2. Assign each observation to the cluster that has the closest centroid (for example, in Euclidean sense).

3. When all observations have been assigned, recalculate the positions of the $k$ centroids.

4. Repeat until the cluster centroids no longer change or a fixed number of iterations is reached.

A key requirement here is that, as opposed to hierarchical clustering, the number of clusters, $k$, needs to be specified before the start of the analysis. This decision can be made using expert-based input or based on the result of another (e.g., hierarchical) clustering procedure. Typically, multiple values of $k$ are tried out and the resulting clusters evaluated in terms of their statistical characteristics and interpretation. It is also advised to try out different seeds to verify the stability of the clustering solution. Note that the mean is sensitive to outliers, which are especially relevant in a fraud detection setting. A more robust alternative is to use the median (k-medians clustering) or a representative observation (k-medoid clustering) as cluster center instead. In case of categorical variables, the mode can be used (k-mode clustering). As mentioned, k-means is most often used in combination with a Euclidean distance metric, which typically results in spherical or ball-shaped clusters. See Figures 3.16 to 3.22.
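The four steps listed above can be sketched in plain Python as follows. The observations and seeds are hypothetical, the Euclidean distance is used, and a cluster that becomes empty simply keeps its previous centroid, which is our own convention.

```python
from math import dist

def k_means(observations, seeds, max_iterations=100):
    """Plain k-means with Euclidean distance, starting from the given seeds."""
    centroids = list(seeds)
    clusters = []
    for _ in range(max_iterations):
        # Step 2: assign each observation to the cluster with the closest centroid
        clusters = [[] for _ in centroids]
        for obs in observations:
            closest = min(range(len(centroids)), key=lambda c: dist(obs, centroids[c]))
            clusters[closest].append(obs)
        # Step 3: recalculate the centroids (an empty cluster keeps its old centroid)
        new_centroids = [
            tuple(sum(values) / len(members) for values in zip(*members))
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        # Step 4: stop when the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical standardized observations with two visible groups
observations = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
                (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Step 1: take two observations as seeds (deliberately from the same group)
centroids, clusters = k_means(observations, seeds=[(1.0, 1.0), (1.5, 2.0)])
print(centroids)
print(clusters)
```

Even though both seeds come from the same group, the reassignment in the second iteration moves the observation (1.5, 2.0) to the first cluster and the centroids settle on the two groups. With less clearly separated data, different seeds can lead to different solutions, which is why trying several seeds is advised.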

Figure 3.16 k-Means Clustering: Start from Original Data

Figure 3.17 k-Means Clustering Iteration 1: Randomly Select Initial Cluster Centroids

Figure 3.18 k-Means Clustering Iteration 1: Assign Remaining Observations

Figure 3.19 k-Means Iteration Step 2: Recalculate Cluster Centroids

Figure 3.20 k-Means Clustering Iteration 2: Reassign Observations

Figure 3.21 k-Means Clustering Iteration 3: Recalculate Cluster Centroids

Figure 3.22 k-Means Clustering Iteration 3: Reassign Observations

Self-Organizing Maps

A self-organizing map (SOM) is an unsupervised learning algorithm

that allows users to visualize and cluster high-dimensional data on

a low-dimensional grid of neurons (Kohonen 2000; Huysmans et al.

2006; Seret et al. 2012). A SOM is a feedforward neural network with

two layers: an input and an output layer. The neurons from the output

layer are usually ordered in a two-dimensional rectangular or hexag-

onal grid (see Figure 3.23). For the former, every neuron has at most

eight neighbors, whereas for the latter, every neuron has at most six

neighbors.

Figure 3.23 Rectangular Versus Hexagonal SOM Grid

Each input is connected to all neurons in the output layer with weights $w = [w_1, \ldots, w_N]$, with $N$ the number of variables. All weights are randomly initialized. When a training vector $x$ is presented, the weight vector $w_c$ of each neuron $c$ is compared with $x$, using, for example, the Euclidean distance metric (be sure to standardize the data first!):

$$d(x, w_c) = \sqrt{\sum_{i=1}^{N} (x_i - w_{ci})^2}.$$

The neuron that is most similar to $x$ in Euclidean sense is called the best matching unit (BMU). The weight vector of the BMU and its neighbors in the grid are then adapted using the following learning rule:

$$w_i(t+1) = w_i(t) + h_{ci}(t)[x(t) - w_i(t)],$$

where $t$ represents the time index during training and $h_{ci}(t)$ defines the neighborhood of the BMU $c$, specifying the region of influence. The neighborhood function $h_{ci}(t)$ should be a nonincreasing function of time and the distance from the BMU. Some popular choices are:

$$h_{ci}(t) = \alpha(t) \exp\left( -\frac{\lVert r_c - r_i \rVert^2}{2\sigma^2(t)} \right),$$

$$h_{ci}(t) = \alpha(t) \text{ if } \lVert r_c - r_i \rVert^2 \leq \text{threshold}, \; 0 \text{ otherwise},$$

where $r_c$ and $r_i$ represent the location of the BMU and neuron $i$ on the map, $\sigma^2(t)$ represents the decreasing radius, and $0 \leq \alpha(t) \leq 1$ the learning rate (e.g., $\alpha(t) = A/(t+B)$, $\alpha(t) = \exp(-At)$). The decreasing learning rate and radius will give a stable map after a certain amount of training. The neurons will then move more and more toward the input observations and interesting segments will emerge. Training is stopped when the BMUs remain stable, or after a fixed number of iterations (e.g., 500 times the number of SOM neurons).
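A single training step, using the Gaussian neighborhood function, can be sketched as follows. The 2 × 2 grid, the fixed initial weights (used instead of random ones so that the result is reproducible), and the values of the learning rate and radius are hypothetical.

```python
from math import dist, exp

def train_step(weights, grid, x, alpha, sigma):
    """Present one training vector x and update all weight vectors in place.

    weights[i] is the weight vector of neuron i, grid[i] its location on the map.
    Returns the index of the best matching unit (BMU).
    """
    bmu = min(range(len(weights)), key=lambda i: dist(x, weights[i]))
    for i, w in enumerate(weights):
        # Gaussian neighborhood: shrinks with the map distance to the BMU
        h = alpha * exp(-dist(grid[bmu], grid[i]) ** 2 / (2 * sigma ** 2))
        weights[i] = [w_k + h * (x_k - w_k) for w_k, x_k in zip(w, x)]
    return bmu

# Hypothetical 2 x 2 rectangular grid with two input variables
grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
weights = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

bmu = train_step(weights, grid, x=[0.9, 0.8], alpha=0.5, sigma=1.0)
print(bmu, weights)
```

The BMU moves halfway toward the training vector, whereas the neuron in the opposite corner of the grid moves by a much smaller fraction. In a full training run, this step would be repeated over all training vectors while `alpha` and `sigma` are decreased over time.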

SOMs can be visualized by means of a U-matrix or component plane:

◾ A U (unified distance)-matrix essentially superimposes a height Z dimension on top of each neuron visualizing the average distance between the neuron and its neighbors, whereby typically dark colors indicate a large distance and can be interpreted as cluster boundaries.

◾ A component plane visualizes the weights between each specific input variable and its output neurons, and as such provides a visual overview of the relative contribution of each input attribute to the output neurons.

Figure 3.24 provides a SOM example for clustering countries based

on a Corruption Perception Index (CPI). This is a score between 0

(highly corrupt) and 10 (highly clean), assigned to each country in

the world. The CPI is combined with demographic and macroeconomic

information for the years 1996, 2000, and 2004. Uppercase countries

(e.g., BEL) denote the situation in 2004, lowercase (e.g., bel) in 2000,

and sentence case (e.g., Bel) in 1996. It can be seen that many of the

European countries are situated in the upper-right corner of the map.

Figure 3.24 Clustering Countries Using SOMs

Figure 3.25 provides the component plane for literacy whereby darker regions score worse on literacy. Figure 3.26 provides the component plane for political rights whereby darker regions correspond to better political rights. It can be seen that many of the European countries score well on both literacy and political rights.

SOMs are a very handy tool for clustering high-dimensional data

sets because of the visualization facilities. However, since there is no

real objective function to minimize, it is harder to compare various

SOM solutions against each other. Also, experimental evaluation and

expert interpretation are needed to decide on the optimal size of the

SOM. Unlike k-means clustering, a SOM does not force the number of

clusters to be equal to the number of output neurons.

Clustering with Constraints

In many fraud application domains, the expert(s) will have prior knowledge about existing fraud patterns and/or anomalous behavior. This knowledge can originate from experience as well as from existing literature. It will be handy if this background knowledge can be incorporated to guide the clustering. This is the idea of semi-supervised clustering or clustering with constraints (Basu, Davidson et al. 2012). The idea here is to bias the clustering with expert knowledge such that the clusters can be found quicker and with the desired properties.

Figure 3.25 Component Plane for Literacy

Various types of constraints can be thought of. A first set of constraints is observation-level constraints. As the name suggests, these constraints are set for individual observations. A must-link constraint enforces that two observations should be assigned to the same cluster, whereas a cannot-link constraint will put them into different clusters (see Figure 3.27). This could be handy in a fraud-detection setting if the fraud behavior of only a few observations is known, as these can then be forced into the same cluster. Cluster-level constraints are defined at the level of the cluster. A minimum separation or $\delta$-constraint specifies that the distance between any pair of observations in two different clusters must be at least $\delta$ (see Figure 3.28). This will allow data scientists to create well-separated clusters. An $\varepsilon$-constraint specifies that each observation in a cluster with more than one observation must have another observation within a distance of at most $\varepsilon$ (see Figure 3.29). Another example of a constraint is the requirement to have balanced clusters, whereby each cluster contains the same number of observations. Negative background information can also be provided, whereby the aim is to find a clustering that is different from a given clustering.

Figure 3.26 Component Plane for Political Rights

Figure 3.27 Must-Link and Cannot-Link Constraints in Semi-Supervised Clustering

Figure 3.28 $\delta$-Constraints in Semi-Supervised Clustering (the distance between observations in different clusters is at least $\delta$)

Figure 3.29 $\varepsilon$-Constraints in Semi-Supervised Clustering

The constraints can be enforced during the clustering process.

For example, in a k-means clustering setup, the cluster seeds will

be chosen such that the constraints are respected. Each time an

observation is (re-)assigned, the constraints will be veriﬁed and the

(re-)assignment halted in case violations occur. In a hierarchical

clustering procedure, a must-link constraint can be enforced by setting

the distance between two observations to 0, whereas a cannot-link

constraint can be enforced by setting the distance to a very high value.
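Both ways of enforcing observation-level constraints can be sketched as follows. The claim labels, the function names, and the value used for a "very high" distance are hypothetical; the first function would be called before each (re-)assignment in a k-means setup, the second when filling the distance matrix of a hierarchical procedure.

```python
def partner_of(obs, pair):
    """Return the other member of a constraint pair, or None if obs is not in it."""
    a, b = pair
    if obs == a:
        return b
    if obs == b:
        return a
    return None

def violates_constraints(obs, cluster_id, assignment, must_link, cannot_link):
    """Check whether putting obs into cluster_id breaks a constraint,
    given the observations that have been assigned so far."""
    for pair in must_link:
        other = partner_of(obs, pair)
        if other in assignment and assignment[other] != cluster_id:
            return True
    for pair in cannot_link:
        other = partner_of(obs, pair)
        if other in assignment and assignment[other] == cluster_id:
            return True
    return False

def constrained_distance(a, b, distance, must_link, cannot_link, far=1e9):
    """Distance to use in hierarchical clustering: 0 for must-link pairs,
    a very high value for cannot-link pairs, the original distance otherwise."""
    if (a, b) in must_link or (b, a) in must_link:
        return 0.0
    if (a, b) in cannot_link or (b, a) in cannot_link:
        return far
    return distance

# Hypothetical constraints provided by a fraud expert
must_link = [("claim 1", "claim 3")]
cannot_link = [("claim 2", "claim 4")]
assignment = {"claim 1": 0, "claim 2": 1}

checks = [
    violates_constraints("claim 3", 1, assignment, must_link, cannot_link),
    violates_constraints("claim 3", 0, assignment, must_link, cannot_link),
    violates_constraints("claim 4", 1, assignment, must_link, cannot_link),
    violates_constraints("claim 4", 0, assignment, must_link, cannot_link),
]
print(checks)
```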

Evaluating and Interpreting Clustering Solutions

Evaluating a clustering solution is by no means a trivial exercise since there exists no universal criterion. From a statistical perspective, the sum of squared errors (SSE) can be computed as follows:

$$\text{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \text{dist}^2(x, m_i),$$

where $K$ represents the number of clusters and $m_i$ the centroid (e.g., mean) of cluster $i$. When comparing two clustering solutions, the one with the lowest SSE can then be chosen. Besides a statistical evaluation, a clustering solution will also be evaluated in terms of its interpretation. To facilitate the interpretation of a clustering solution, various options are available. A first one is to compare cluster distributions with population distributions across all variables on a cluster-by-cluster basis. This is illustrated in Figure 3.30 whereby the distribution of a cluster C1 is contrasted with the overall population distribution for the RFM variables. It can be clearly seen that cluster C1 has observations with low recency values and high monetary values, whereas the frequency is relatively similar to the original population.

Figure 3.30 Cluster Profiling Using Histograms (population versus cluster C1; recency bins: <1, 1–2, 2–3, 3–4, 4+; monetary bins: <1,000, 1,000–5,000, 5,000–10,000, 10,000–100,000, 100,000+; frequency bins: <5, 5–10, 10–20, 20–50, 50+)

Table 3.4 Output from a k-Means Clustering Exercise (k = 4)

| Claim | Recency | Frequency | Monetary | ... | ClusterID |
|---|---|---|---|---|---|
| Claim1 | | | | | Cluster2 |
| Claim2 | | | | | Cluster4 |
| Claim3 | | | | | Cluster3 |
| Claim4 | | | | | Cluster2 |
| Claim5 | | | | | Cluster1 |
| Claim6 | | | | | Cluster4 |
| … | | | | | |
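As a small illustration of the statistical evaluation, the SSE of two alternative clustering solutions of the same four hypothetical observations can be compared as follows, using the Euclidean distance and the mean as centroid.

```python
from math import dist

def centroid(cluster):
    """Coordinate-wise mean of the observations in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def sse(clusters):
    """Sum over all clusters of the squared distances to the cluster centroid."""
    total = 0.0
    for cluster in clusters:
        m_i = centroid(cluster)
        total += sum(dist(x, m_i) ** 2 for x in cluster)
    return total

# Two hypothetical solutions with K = 2 for the same four observations
solution_a = [[(1, 1), (1, 2)], [(8, 8), (9, 8)]]
solution_b = [[(1, 1), (8, 8)], [(1, 2), (9, 8)]]

sse_a = sse(solution_a)
sse_b = sse(solution_b)
print(sse_a, sse_b)
```

Solution A has by far the lower SSE and would be preferred. Note that such a comparison is most meaningful for solutions with the same number of clusters, since the SSE tends to decrease as more clusters are added.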

Another way to explain a given clustering solution is by building a

decision tree with the ClusterID as the target variable. We will discuss

how to build decision trees in the next chapter, but for the moment

it sufﬁces to understand how they should be interpreted. Assume we

have the following output from a k-means clustering exercise with k

equal to 4. (See Table 3.4)

We can now build a decision tree with the ClusterID as the target

variable as follows.

The decision tree in Figure 3.31 gives us a clear insight into the

distinguishing characteristics of the various clusters. For example,

cluster 2 is characterized by observations having recency <1 day and

monetary >1,000. Hence, using decision trees, we can easily assign

new observations to the existing clusters. This is an example of how

supervised or predictive techniques can be used to explain the solution

from a descriptive analytics exercise.

Figure 3.31 Using Decision Trees for Clustering Interpretation (root split: Recency < 1 day; second-level splits: Frequency > 5 and Monetary > 1000; leaves: Cluster1, Cluster2, Cluster3, Cluster4)

ONE-CLASS SVMS

One-class SVMs try to maximize the distance between a hyperplane

and the origin (Schölkopf et al. 2001). The idea is to separate the

majority of the observations from the origin. The observations that

lie on the other side of the hyperplane, closest to the origin, are then

considered as outliers. This is illustrated in Figure 3.32.

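Let's define the hyperplane as follows:

$$w^T \varphi(x) - \rho = 0.$$

Normal observations lie above the hyperplane and outliers below it, or in other words normal observations (outliers) will return a positive (negative) value for:

$$f(x) = \text{sign}(w^T \varphi(x) - \rho).$$

One-class SVMs then aim at solving the following optimization problem:

$$
\begin{aligned}
\text{Minimize} \quad & \frac{1}{2} \sum_{i=1}^{N} w_i^2 - \rho + \frac{1}{\nu n} \sum_{i=1}^{n} e_i \\
\text{subject to} \quad & w^T \varphi(x_k) \geq \rho - e_k, \quad k = 1 \ldots n \\
& e_k \geq 0.
\end{aligned}
$$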

The error variables $e_i$ are introduced to allow observations to lie on the side of the hyperplane closest to the origin. The parameter $\nu$ is a regularization term. Mathematically, it can be shown that the distance between the hyperplane and the origin equals $\rho / \lVert w \rVert$ (see Figure 3.32). This distance is now maximized by minimizing $\frac{1}{2} \sum_{i=1}^{N} w_i^2 - \rho$, which is the first part in the objective function. The second part of the objective function then accounts for errors, or thus outliers. The constraints force the majority of observations to lie above the hyperplane. The parameter $\nu$ ranges between 0 and 1, and sets an upper bound on the fraction of outliers. A lower (higher) value of the regularization parameter $\nu$ will increase (decrease) the weight assigned to errors and thus decrease (increase) the number of outliers. Given the importance of this parameter, one-class SVMs are sometimes also referred to as $\nu$-SVMs.

Figure 3.32 One-Class Support Vector Machines (the hyperplane $w^T \varphi(x) - \rho = 0$ separates the majority of the observations from the origin; the outliers lie on the side closest to the origin)

As with SVMs for supervised learning (see Chapter 4), the opti-

mization problem can be solved by formulating its dual variant, which

also here yields a quadratic programming (QP) problem, and applying

the kernel trick. By again using Lagrangian optimization, the following

decision function is obtained:

$$f(x) = \operatorname{sign}(w^T \varphi(x) - \rho) = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i K(x, x_i) - \rho\right),$$

where $\alpha_i$ represent the Lagrange multipliers, and $K(x, x_i)$ the kernel function. See Schölkopf et al. (2001) for more details.
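To make the decision function concrete, the following minimal sketch evaluates sign(Σ α_i K(x, x_i) − ρ) for a new observation. The RBF kernel, the `gamma` value, the support vectors, the multipliers, and `rho` are all illustrative assumptions; in practice they would come out of solving the QP problem, which is not shown here.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def one_class_decision(x, support_vectors, alphas, rho, kernel=rbf_kernel):
    """Return +1 (normal) or -1 (outlier) from sign(sum_i alpha_i K(x, x_i) - rho)."""
    score = sum(a * kernel(x, sv) for a, sv in zip(alphas, support_vectors)) - rho
    return 1 if score >= 0 else -1

# Hypothetical fitted values, for illustration only.
support_vectors = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)]
alphas = [0.4, 0.3, 0.3]
rho = 0.5

label_near = one_class_decision((1.0, 1.0), support_vectors, alphas, rho)
label_far = one_class_decision((5.0, 5.0), support_vectors, alphas, rho)
```

An observation close to the support vectors gets a positive score and is treated as normal, whereas one far away has kernel values near zero, so its score drops below zero and it is flagged as an outlier.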

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining Association Rules

between Sets of Items in Massive Databases, Proceedings of the ACM SIGMOD

International Conference on Management of Data. Washington, D.C.

Basu, S., Davidson, I., & Wagstaff, K. L. (2012). Constrained Clustering: Advances

in Algorithms, Theory, and Applications, Boca Raton, FL: Chapman & Hall/

CRC.

Bolton, R. J., & Hand, D. J. (2002). Statistical Fraud Detection: A Review,

Statistical Science 17 (3): 235–255.

Cullinan, G. J. (1977). Picking Them by Their Batting Averages’ Recency–Frequency–

Monetary Method of Controlling Circulation, Manual Release 2103. New York:

Direct Mail/Marketing Association.

Everitt, B. S., Landau, S., Leese, M., & Stahl, D. (2010). Cluster Analysis, 5th ed.

Hoboken, NJ: John Wiley & Sons.

Grubbs, F. E. (1950). Sample Criteria for Testing Outlying Observations, The

Annals of Mathematical Statistics 21(1): 27–58.

Grubbs, F. E. (1969). Procedures for Detecting Outlying Observations in

Samples, Technometrics 11(1): 1–21.

Huysmans, J., Baesens, B., Van Gestel, T., & Vanthienen, J. (2006). Failure Pre-

diction with Self Organizing Maps, Expert Systems with Applications, Special

Issue on Intelligent Information Systems for Financial Engineering 30(3):

479–487.

Jain, A. K. (2010). Data Clustering: 50 Years Beyond K-Means, Pattern Recog-

nition Letters 31(8): 651–666.

Kohonen, T. (2000). Self-Organizing Maps. New York: Springer.

MacQueen, J. (1967). Some Methods for Classification and Analysis of

Multivariate Observations, Proceedings of 5th Berkeley Symposium on Math-

ematical Statistics and Probability, Berkeley: University of California Press,

pp. 281–297.

Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A., & Williamson, R. C.

(2001). Estimating the support of a high-dimensional distribution, Neural

Computation 13(7): 1443–1471.

Seret, A., Verbraken, T., Versailles, S., & Baesens, B. (2012). A New SOM-Based

Method for Proﬁle Generation: Theory and an Application in Direct Mar-

keting, European Journal of Operational Research 220 (1): 199–209.

Weston, D. J., Hand, D. J., Adams, N. M., Whitrow, C., & Juszczak, P. (2008).

Plastic card fraud detection using peer group analysis, Advances in Data

Analysis and Classiﬁcation 2(1): 45–62.

CHAPTER 4

Predictive Analytics for Fraud Detection

INTRODUCTION

In predictive analytics, the aim is to build an analytical model pre-

dicting a target measure of interest (Baesens 2014; Duda et al. 2001;

Flach 2012; Han and Kamber 2001; Hastie et al. 2001; Tan et al.

2006). The target is then typically used to steer the learning process

during an optimization procedure. Two types of predictive analytics

can be distinguished depending on the measurement level of the

target: regression and classiﬁcation. In regression, the target variable is

continuous and varies along a predeﬁned interval. This interval can

be limited (e.g., between 0 and 1) or unlimited (e.g., between 0 and

inﬁnity). A typical example in a fraud detection setting is predicting

the amount of fraud. In classiﬁcation, the target is categorical which

means that it can only take on a limited set of predeﬁned values.

In binary classiﬁcation, only two classes are considered (e.g., fraud

versus no-fraud) whereas in multiclass classiﬁcation, the target can

belong to more than two classes (e.g., severe fraud, medium fraud,

no fraud).

In fraud detection, both classiﬁcation and regression models can

be used simultaneously. Consider, for example, an insurance fraud setting. The expected loss due to fraud can be calculated as follows:

$$\text{Expected fraud loss (EFL)} = PF \times LGF + (1 - PF) \times 0 = PF \times LGF,$$

where PF represents the probability of fraud and LGF the loss given fraud. The latter can be expressed as an amount or as a percentage of a maximum amount (e.g., the maximum insured amount). PF can then be estimated using a classification technique, whereas for LGF, a regression model will be estimated.
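As a small illustration of the expected fraud loss formula, the sketch below multiplies an estimated PF by an estimated LGF. The function name and the example figures are hypothetical; in practice PF would come from a classification model and LGF from a regression model.

```python
def expected_fraud_loss(pf, lgf):
    """EFL = PF * LGF, with PF a probability and LGF a loss amount."""
    if not 0.0 <= pf <= 1.0:
        raise ValueError("PF must be a probability between 0 and 1")
    return pf * lgf

# Hypothetical claim: 8% estimated fraud probability, 25,000 loss given fraud.
efl = expected_fraud_loss(0.08, 25_000)
```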

Different types of predictive analytics techniques have been

developed in the literature originating from a variety of different

disciplines such as statistics, machine learning, artiﬁcial intelligence,

pattern recognition, and data mining. The distinction between

those disciplines is getting more and more blurred and is actu-

ally not that relevant. In what follows, we will discuss a selection

of techniques with a particular focus on the fraud practitioner’s

perspective.

TARGET DEFINITION

Since the target variable plays an important role in the learning pro-

cess, it is of key importance that it is appropriately deﬁned. In fraud

detection, the target fraud indicator is usually hard to determine since

one can never be fully sure that a certain transaction (e.g., credit card

fraud), claim (e.g., insurance fraud), or company (e.g., tax evasion

fraud) is fraudulent.

Let’s take the example of insurance fraud (Viaene et al. 2002). If

an applicant ﬁles a claim, the insurance company will perform various

checks to ﬂag the claim as suspicious or nonsuspicious. When the claim

is considered suspicious, the insurance firm will first decide whether it is worth the effort to pursue the investigation. Obviously, this will also depend on the amount of the claim, such that small claims are most likely not considered further, even if they are fraudulent. When the claim is considered worth investigating, the firm might start a legal procedure resulting in a court judgment and/or legal settlement flagging the claim as fraudulent or not. Clearly, this procedure is also not 100 percent error-proof, and thus nonfraudulent claims might end up being flagged as fraudulent, or vice versa.

Another example is tax evasion fraud. An often-used fraud mech-

anism in this setting is a spider construction, as depicted in Figure 4.1

(Van Vlasselaer et al. 2013 and 2015).

The key company in the middle represents the ﬁrm who is the key

perpetrator of the fraud. It starts up a side company (Side Company

1), which makes revenue but deliberately does not pay its taxes and

hence intentionally goes bankrupt. On bankruptcy, its resources (e.g.,

employees, machinery, equipment, buyers, suppliers, physical address,

and other assets) are shifted toward a new side company (e.g., Side

Company 2), which repeats the fraud behavior and thus again goes

bankrupt. As such, a web of side companies evolves around the key

company. It is clear that in this setting it becomes hard to distinguish

a regular bankruptcy due to insolvency from a fraudulent bankruptcy

due to malicious intent. In other words, all fraudulent companies

go bankrupt, but not all bankrupt companies are fraudulent. This is

depicted in Figure 4.2. Although suspension is seen as a normal way of stopping a company’s activities (i.e., all debts redeemed), bankruptcy indicates that the company did not succeed in paying back all its creditors. Distinguishing between regular and fraudulent bankruptcies is subtle and hard to establish. Hence, it can be expected that some regular bankruptcies are in fact undetected fraudulent bankruptcies.

Figure 4.1 A Spider Construction in Tax Evasion Fraud

Figure 4.2 Regular Versus Fraudulent Bankruptcy

Also, the intensity of the fraud when measured as an amount might

be hard to determine since one has to take into account direct costs,

indirect costs, reputation damage, and the time value of the money

(e.g., by discounting).

To summarize, in supervised fraud detection the target labels are typically not noise-free, which complicates the analytical modeling exercise. It is thus important for analytical techniques to be able to cope with this.

LINEAR REGRESSION

Linear regression is undoubtedly the most commonly used technique

to model a continuous target variable. For example, in a car insurance

fraud detection context, a linear regression model can be deﬁned to

model the amount of fraud in terms of the age of the claimant, claimed

amount, severity of the accident, and so on.

$$\text{Amount of fraud} = \beta_0 + \beta_1 \text{Age} + \beta_2 \text{ClaimedAmount} + \beta_3 \text{Severity} + \ldots$$

The general formulation of the linear regression model then becomes:

$$Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_N X_N,$$

where $Y$ represents the target variable, and $X_1, \ldots, X_N$ the explanatory variables. The $\beta$ parameters measure the impact on the target variable $Y$ of each of the individual explanatory variables.

Let’s now assume we start with a data set with $n$ observations and $N$ explanatory variables structured as depicted in Table 4.1.

Table 4.1 Data Set for Linear Regression

Observation   X_1    X_2    ...   X_N    Y
1             X_11   X_21   ...   X_N1   Y_1
2             X_12   X_22   ...   X_N2   Y_2
...
n             X_1n   X_2n   ...   X_Nn   Y_n

The $\beta$ parameters of the linear regression model can then be estimated by minimizing the following squared error function:

$$\frac{1}{2}\sum_{i=1}^{n} e_i^2 = \frac{1}{2}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \frac{1}{2}\sum_{i=1}^{n} \left(Y_i - (\beta_0 + \beta_1 X_{1i} + \ldots + \beta_N X_{Ni})\right)^2,$$

where $Y_i$ represents the target value for observation $i$ and $\hat{Y}_i$ the prediction made by the linear regression model for observation $i$. Graphically, this idea corresponds to minimizing the sum of all error squares as represented in Figure 4.3.

Straightforward mathematical calculus then yields the following closed-form formula for the weight parameter vector $\hat{\beta}$:

$$\hat{\beta} = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_N \end{bmatrix} = (X^T X)^{-1} X^T Y,$$

where $X$ represents the matrix with the explanatory variable values augmented with an additional column of ones to account for the intercept term $\beta_0$, and $Y$ represents the target value vector (see Table 4.1). This model and corresponding parameter optimization procedure are often referred to as ordinary least squares (OLS) regression.
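The closed-form OLS solution can be sketched in plain Python as follows. Rather than inverting X^T X explicitly, this sketch solves the equivalent normal equations (X^T X) β = X^T Y with Gaussian elimination; the helper names and the toy data are assumptions made for illustration, and a numerical library would normally be used instead.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], so row operations apply to both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (collinear variables?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols_fit(rows, targets):
    """Return [beta_0, beta_1, ..., beta_N] from the normal equations."""
    x = [[1.0] + list(r) for r in rows]  # column of ones for the intercept
    p = len(x[0])
    xtx = [[sum(xi[j] * xi[k] for xi in x) for k in range(p)] for j in range(p)]
    xty = [sum(xi[j] * yi for xi, yi in zip(x, targets)) for j in range(p)]
    return solve_linear_system(xtx, xty)

# Toy data generated without noise from Y = 1 + 2*X1 + 3*X2.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
targets = [1 + 2 * a + 3 * b for a, b in rows]
beta = ols_fit(rows, targets)
```

Because the toy targets contain no noise, the estimated coefficients should recover the intercept 1 and the slopes 2 and 3 up to rounding error.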

A key advantage of OLS regression is that it is simple and thus easy

to understand. Once the parameters have been estimated, the model

can be evaluated in a straightforward way, hereby contributing to its

operational efﬁciency.

[Figure: fitted regression line $Y = \beta_0 + \beta_1 X$ with intercept $\beta_0$]

Figure 4.3 OLS Regression

Note that more sophisticated variants have been suggested in

the literature, such as ridge regression, lasso regression, time series

models (ARIMA, VAR, GARCH), multivariate adaptive regression

splines (MARS), and so on. Most of these relax the linearity assumption

by introducing additional transformations, however, at the cost of

increased complexity.

LOGISTIC REGRESSION

Basic Concepts

Consider a classiﬁcation data set in a tax-evasion setting as depicted in

Table 4.2.

When modeling the binary fraud target using linear regression,

one gets:

$$Y = \beta_0 + \beta_1 \text{Revenue} + \beta_2 \text{Employees} + \beta_3 \text{VATCompliant}$$

When estimating this using OLS, two key problems arise:

1. The errors/target are not normally distributed but follow a

Bernoulli distribution with only two values;

2. There is no guarantee that the target is between 0 and 1, which

would be handy since it can then be interpreted as a probability.

Consider now the following bounding function:

$$f(z) = \frac{1}{1 + e^{-z}},$$

which is shown in Figure 4.4.

Table 4.2 Example Classification Data Set

Company   Revenue    Employees   VATCompliant   ...   Fraud   Y
ABC       3,000k     400         Y                    No      0
BCD       200k       800         N                    No      0
CDE       4,2000k    2,200       N                    Yes     1
...
XYZ       34k        50          N                    Yes     1

[Figure: S-shaped curve of $f(z)$, rising from 0 to 1 for $z$ between –7 and 7]

Figure 4.4 Bounding Function for Logistic Regression

For every possible value of z, the outcome is always between 0

and 1. Hence, by combining the linear regression with the bounding

function, we get the following logistic regression model:

P(fraud =yesRevenue,Employees,VATCompliant)

1+e−(𝛽0+𝛽1Revenue+𝛽2Employees+𝛽3VATcompliant)

The outcome of the above model is always bounded between 0 and

1, no matter which values of revenue, employees, and VAT compliant

are being used, and can as such be interpreted as a probability.
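The following minimal sketch shows how the bounding function turns a linear combination into a probability. The coefficient values and the feature scaling (revenue in millions, employees in hundreds, VAT compliance as 0/1) are hypothetical and chosen only for illustration.

```python
import math

def logistic(z):
    """Bounding function f(z) = 1 / (1 + e^(-z)), evaluated in a stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fraud_probability(features, betas, intercept):
    """P(fraud = yes | features) for a fitted logistic regression model."""
    z = intercept + sum(b * x for b, x in zip(betas, features))
    return logistic(z)

# Hypothetical coefficients; features are (revenue in millions,
# employees in hundreds, VAT compliant as 0/1).
p = fraud_probability([0.75, 4.2, 0.0], betas=[-0.4, 0.1, -1.5], intercept=-1.0)
```

Whatever values are plugged in, the result stays between 0 and 1, which is what allows it to be read as a probability.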

The general formulation of the logistic regression model then

becomes (Allison 2001):

P(Y=1X1,…,Xn)= 1

1+e−(𝛽0+𝛽1X1+…+𝛽NXN),

or alternatively,

P(Y=0X1,…,XN)=1−P(Y=1X1,…,XN)

=1−1

1+e−(𝛽0+𝛽1X1+…+𝛽NXN)=1

1+e(𝛽0+𝛽1X1+…+𝛽NXN)

PREDICTIVE ANALYTICS FOR FRAUD DETECTION 129

Hence, both P(Y=1X1,…,XN)and P(Y=0X1,…,XN)are

bounded between 0 and 1.

Reformulating in terms of the odds, the model becomes:

P(Y=1X1,…,XN)

P(Y=0X1,…,XN)=e(𝛽0+𝛽1X1+…+𝛽NXN)

or in terms of the log odds (logit),

ln PY=1X1,…,XN

P(Y=0X1,…,XN)=𝛽0+𝛽1X1+…+𝛽NXN

The $\beta_i$ parameters of a logistic regression model are then estimated using the idea of maximum likelihood. Maximum likelihood optimization chooses the parameters in such a way as to maximize the probability of getting the sample at hand. First, the likelihood function is constructed. For observation $i$, the probability of observing either class equals:

$$P(Y = 1 \mid X_{1i}, \ldots, X_{Ni})^{Y_i} \, \left(1 - P(Y = 1 \mid X_{1i}, \ldots, X_{Ni})\right)^{1 - Y_i},$$

where $Y_i$ represents the target value (either 0 or 1) for observation $i$. The likelihood function across all $n$ observations then becomes:

$$\prod_{i=1}^{n} P(Y = 1 \mid X_{1i}, \ldots, X_{Ni})^{Y_i} \, \left(1 - P(Y = 1 \mid X_{1i}, \ldots, X_{Ni})\right)^{1 - Y_i}.$$

To simplify the optimization, the logarithmic transformation of the likelihood function is taken, and the corresponding log-likelihood can then be optimized using, for instance, the Newton-Raphson method.
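The log-likelihood described above can be sketched as follows. Only its evaluation for given parameters is shown, not the Newton-Raphson optimization itself; the toy data and the two parameter settings are made up for illustration.

```python
import math

def log_likelihood(rows, targets, betas, intercept):
    """Sum over observations of Y_i*ln(p_i) + (1 - Y_i)*ln(1 - p_i)."""
    total = 0.0
    for x, y in zip(rows, targets):
        z = intercept + sum(b * v for b, v in zip(betas, x))
        # With p = 1 / (1 + e^(-z)): ln(p) = -ln(1 + e^(-z)), ln(1 - p) = -ln(1 + e^z).
        log_p = -math.log1p(math.exp(-z))
        log_1mp = -math.log1p(math.exp(z))
        total += y * log_p + (1 - y) * log_1mp
    return total

# Toy data (made up): one explanatory variable, fraud more likely at higher values.
rows = [(0.5,), (1.0,), (1.5,), (3.0,), (3.5,), (4.0,)]
targets = [0, 0, 1, 0, 1, 1]

ll_flat = log_likelihood(rows, targets, betas=[0.0], intercept=0.0)
ll_better = log_likelihood(rows, targets, betas=[1.0], intercept=-2.25)
```

A parameter setting that assigns higher probabilities to the observed labels yields a higher (less negative) log-likelihood, which is the quantity an optimizer would push upward.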

Logistic Regression Properties

Since logistic regression is linear in the log odds (logit), it basically estimates a linear decision boundary to separate both classes. This is illustrated in Figure 4.5, whereby F represents fraudulent firms and L indicates legitimate or thus nonfraudulent firms.

[Figure: firms plotted against Revenue and Employees, labeled L or F, with a straight line separating the two groups]

Figure 4.5 Linear Decision Boundary of Logistic Regression

To interpret a logistic regression model, one can calculate the odds ratio. Suppose variable $X_i$ increases by one unit with all other variables being kept constant (ceteris paribus); then the new logit becomes the old logit with $\beta_i$ added. Likewise, the new odds become the old odds multiplied by $e^{\beta_i}$. The latter represents the odds ratio, that is, the multiplicative increase in the odds when $X_i$ increases by 1 (ceteris paribus). Hence,

◾ $\beta_i > 0$ implies $e^{\beta_i} > 1$ and the odds and probability increase with $X_i$

◾ $\beta_i < 0$ implies $e^{\beta_i} < 1$ and the odds and probability decrease with $X_i$

Another way of interpreting a logistic regression model is by calculating the doubling amount. This represents the amount of change required for doubling the primary outcome odds. It can be easily seen that for a particular variable $X_i$, the doubling amount equals $\log(2) / \beta_i$.
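Both interpretation aids follow directly from a coefficient, as the short sketch below illustrates. The coefficient value is hypothetical, and the logarithm is taken to be the natural logarithm, consistent with the logit definition.

```python
import math

def odds_ratio(beta):
    """Multiplicative change in the odds when X_i increases by one unit."""
    return math.exp(beta)

def doubling_amount(beta):
    """Change in X_i needed to double the odds: ln(2) / beta."""
    if beta == 0:
        raise ValueError("a zero coefficient never changes the odds")
    return math.log(2) / beta

beta_i = 0.35  # hypothetical coefficient
ratio = odds_ratio(beta_i)
delta = doubling_amount(beta_i)
```

Increasing the variable by `delta` multiplies the odds by exp(beta_i × delta), which equals 2 by construction.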

Note that next to the $f(z)$ transformation, other transformations have been suggested in the literature. Popular examples are the probit and cloglog transformations, as follows:

$$f(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{t^2}{2}} \, dt,$$

$$f(z) = 1 - e^{-e^{z}}.$$

These transformations are visualized in Figure 4.6.

[Figure: linear, logit, probit, and cloglog transformations plotted against $z$]

Figure 4.6 Other Transformations

Note, however, that empirical evidence suggests that all three

transformations typically perform equally well.

Building a Logistic Regression Scorecard

Logistic regression is a very popular supervised fraud-detection technique due to its simplicity and good performance. Just as with linear regression, once the parameters have been estimated, it can be evaluated in a straightforward way, thereby contributing to its operational efficiency. From an interpretability viewpoint, it can be easily transformed into an interpretable, user-friendly, points-based fraud scorecard. Let’s assume we start from the following logistic regression model whereby the explanatory variables have been coded using weight of evidence coding:

$$P(\text{fraud} = \text{yes} \mid \text{Revenue}, \text{Employees}, \text{VATCompliant}, \ldots) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 WOE_{\text{Revenue}} + \beta_2 WOE_{\text{Employees}} + \beta_3 WOE_{\text{VATCompliant}} + \ldots)}}.$$

As discussed earlier, this model can be easily reexpressed in a linear way, in terms of the log odds as follows:

$$\ln\left(\frac{P(\text{fraud} = \text{yes} \mid \text{Revenue}, \text{Employees}, \text{VATCompliant}, \ldots)}{P(\text{fraud} = \text{no} \mid \text{Revenue}, \text{Employees}, \text{VATCompliant}, \ldots)}\right) = \beta_0 + \beta_1 WOE_{\text{Revenue}} + \beta_2 WOE_{\text{Employees}} + \beta_3 WOE_{\text{VATCompliant}} + \ldots$$

A scaling can then be introduced by calculating a fraud score, which

is linearly related to the log odds as follows:

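$$\text{Fraud score} = \text{Offset} + \text{Factor} \times \ln(\text{odds}).$$

Assume that we want a fraud score of 100 for odds of 50:1, and a fraud score of 120 for odds of 100:1. This gives the following:

$$100 = \text{Offset} + \text{Factor} \times \ln(50)$$

$$120 = \text{Offset} + \text{Factor} \times \ln(100)$$

Subtracting the first equation from the second gives $20 = \text{Factor} \times \ln(2)$, so the factor and offset become:

$$\text{Factor} = 20 / \ln(2) = 28.85$$

$$\text{Offset} = 100 - \text{Factor} \times \ln(50) = -12.88$$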
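The scaling above can be reproduced numerically. The helper below is a hypothetical sketch that derives the factor and offset from any two (score, odds) anchor points, using natural logarithms.

```python
import math

def scorecard_scaling(score_1, odds_1, score_2, odds_2):
    """Solve score = offset + factor * ln(odds) for two anchor points."""
    factor = (score_2 - score_1) / (math.log(odds_2) - math.log(odds_1))
    offset = score_1 - factor * math.log(odds_1)
    return factor, offset

# Anchor points from the text: score 100 at odds 50:1, score 120 at odds 100:1.
factor, offset = scorecard_scaling(100, 50, 120, 100)
```

Computed at full precision, the factor is about 28.85 and the offset about −12.88; plugging either anchor's odds back into the scaling formula returns its target score.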

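Once these values are known, the fraud score becomes:

$$\text{Fraud score} = \left(\sum_{i=1}^{N} (WOE_i \times \beta_i) + \beta_0\right) \times \text{Factor} + \text{Offset}$$

$$\text{Fraud score} = \sum_{i=1}^{N} \left(\left(WOE_i \times \beta_i + \frac{\beta_0}{N}\right) \times \text{Factor}\right) + \text{Offset}$$

$$\text{Fraud score} = \sum_{i=1}^{N} \left(\left(WOE_i \times \beta_i + \frac{\beta_0}{N}\right) \times \text{Factor} + \frac{\text{Offset}}{N}\right)$$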

Hence, the points for each attribute are calculated by multiplying

the weight of evidence of the attribute with the regression coefﬁcient

of the characteristic, then adding a fraction of the regression intercept,

multiplying the result by the factor, and ﬁnally adding a fraction of

the offset. The corresponding fraud scorecard can then be visualized as

depicted in Figure 4.7.

Characteristic Name   Attribute             Points
Revenue 1             Up to 100,000         80
Revenue 2             100,000–500,000       120
Revenue 3             500,000–1,000,000     160
Revenue 4             1,000,000+            240
Employees 1           Up to 50              5
Employees 2           50–500                20
Employees 3           500+                  80
VAT Compliant         Yes                   100
VAT Compliant         No                    140
…

Figure 4.7 Fraud Detection Scorecard

The fraud scorecard is very easy to work with. Suppose a new firm with the following characteristics needs to be scored:

Revenue = 750,000, Employees = 420, VAT Compliant = No, …

The score for this firm can then be calculated as follows: 160 + 20 + 140 + … This score can then be compared against a critical cut-off to help decide whether the firm is fraudulent. A key advantage of the fraud scorecard is its interpretability. One can clearly see which are the most risky categories and how they contribute to the overall fraud score. Hence, this is a very useful technique in fraud detection settings where interpretability is a key concern.
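Scoring a firm with the scorecard amounts to a few table lookups, as in the sketch below. The points are taken from Figure 4.7; treating each bin as including its lower bound and excluding its upper bound is an assumption, since the figure does not say how boundary values are handled. Only the three characteristics shown in the figure are scored.

```python
# (upper bound, points) per bin; a value falls in the first bin whose
# upper bound it is strictly below (assumed boundary handling).
REVENUE_BINS = [(100_000, 80), (500_000, 120), (1_000_000, 160), (float("inf"), 240)]
EMPLOYEE_BINS = [(50, 5), (500, 20), (float("inf"), 80)]
VAT_POINTS = {"Yes": 100, "No": 140}

def bin_points(value, bins):
    """Return the points of the first bin whose upper bound exceeds value."""
    for upper, points in bins:
        if value < upper:
            return points
    raise ValueError("value not covered by any bin")

def partial_fraud_score(revenue, employees, vat_compliant):
    """Sum the points of the three characteristics listed in the scorecard."""
    return (bin_points(revenue, REVENUE_BINS)
            + bin_points(employees, EMPLOYEE_BINS)
            + VAT_POINTS[vat_compliant])

score = partial_fraud_score(750_000, 420, "No")  # 160 + 20 + 140
```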

VARIABLE SELECTION FOR LINEAR AND LOGISTIC REGRESSION

Variable selection aims at reducing the number of variables in a

model. It will make the model more concise and faster to evalu-

ate, which is especially relevant in a fraud detection setting. Both

linear and logistic regressions have built-in procedures to perform

variable selection. These are based on statistical hypothesis tests to verify whether the coefficient of a variable $i$ is significantly different from zero:

$$H_0: \beta_i = 0$$

$$H_A: \beta_i \neq 0$$

In linear regression, the test statistic becomes:

$$t = \frac{\hat{\beta}_i}{\text{s.e.}(\hat{\beta}_i)},$$

and follows a Student’s t-distribution with $n - 2$ degrees of freedom (for a model with a single explanatory variable; with $N$ explanatory variables and an intercept, this becomes $n - N - 1$), whereas in logistic regression, the test statistic is:

$$\chi^2 = \left(\frac{\hat{\beta}_i}{\text{s.e.}(\hat{\beta}_i)}\right)^2$$

and follows a Chi-squared distribution with 1 degree of freedom. Note that both test statistics are intuitive in the sense that they will reject the null hypothesis $H_0$ if the estimated coefficient $\hat{\beta}_i$ is high in absolute value compared to its standard error $\text{s.e.}(\hat{\beta}_i)$. The latter can be easily obtained as a byproduct of the optimization procedure. Based on the value of the test statistic, one calculates the p-value, which is the probability, under the null hypothesis, of getting a value at least as extreme as the one observed. This is visualized in Figure 4.8 assuming a value of 3 for the test statistic. Note that since the hypothesis test is two-sided, the p-value adds the areas to the right of 3 and to the left of –3.
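For the logistic regression case, the p-value can be computed with the standard library alone, because a Chi-squared variable with 1 degree of freedom is the square of a standard normal variable. The sketch below relies on that identity; the coefficient and standard error are hypothetical. The t-distribution case is not covered here, since it would need a statistics library.

```python
import math

def wald_chi2(beta_hat, std_error):
    """Test statistic (beta_hat / s.e.(beta_hat))^2."""
    return (beta_hat / std_error) ** 2

def chi2_1df_p_value(chi2):
    """Upper-tail probability of a Chi-squared variable with 1 degree of freedom."""
    # P(chi2_1 > c) = P(|Z| > sqrt(c)) = erfc(sqrt(c / 2)) for standard normal Z.
    return math.erfc(math.sqrt(chi2 / 2.0))

chi2 = wald_chi2(beta_hat=0.6, std_error=0.2)  # ratio of 3, so chi2 is about 9
p_value = chi2_1df_p_value(chi2)
```

With a ratio of 3 between the coefficient and its standard error, the two-sided p-value is roughly 0.003, so the variable would be considered highly significant at the usual levels.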

In other words, a low (high) p-value represents an (in)signiﬁcant

variable. From a practical viewpoint, the p-value can be compared

against a signiﬁcance level. Table 4.3 presents some commonly used

values to decide on the degree of variable signiﬁcance.

Various variable selection procedures can now be used based on the p-value. Suppose one has four variables $V_1$, $V_2$, $V_3$, and $V_4$ (e.g., amount of transaction, currency, transaction type, and merchant category). The number of possible nonempty variable subsets equals $2^4 - 1$, or 15, as displayed in Figure 4.9.
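The count of candidate subsets can be checked by simple enumeration. This sketch lists every nonempty subset of four placeholder variable names; it only illustrates the size of the search space, not a selection procedure.

```python
from itertools import combinations

def nonempty_subsets(variables):
    """Yield every nonempty subset of the given variables, smallest first."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)

variables = ["V1", "V2", "V3", "V4"]
subsets = list(nonempty_subsets(variables))
```

Each additional variable doubles the number of subsets, which is why exhaustive search quickly becomes impractical.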

When the number of variables is small, an exhaustive search amongst all variable subsets can be performed. However, as the number of variables increases, the search space grows exponentially and heuristic search procedures are needed. Using the p-values, the variable space can be navigated in three possible ways. Forward regression starts from the empty model and always adds variables based on low p-values. Backward regression starts from the full model and always removes variables based on high p-values. Stepwise regression is a mix between both. It starts off like forward regression, but once the second variable has been added, it will always check the other variables in the model and remove them if they turn out to be insignificant according to their p-value. Obviously, all three procedures assume preset significance levels, which should be set by the user before the variable selection procedure starts.

[Figure: Student’s t-distribution with the two tail areas beyond –3 and 3 marked as the p-value]

Figure 4.8 Calculating the p-Value with a Student’s t-Distribution

Table 4.3 Reference Values for Variable Significance

p-value < 0.01            Highly significant
0.01 < p-value < 0.05     Significant
0.05 < p-value < 0.10     Weakly significant
p-value > 0.10            Not significant

[Figure: lattice of all subsets of {V1, V2, V3, V4}, from the empty set {} up to the full set {V1, V2, V3, V4}]

Figure 4.9 Variable Subsets for Four Variables V1, V2, V3, and V4

In fraud detection, it is very important to be aware that statistical

signiﬁcance is only one evaluation criterion to do variable selection. As

mentioned before, interpretability is also an important criterion. In both

linear and logistic regression, this can be easily evaluated by inspecting

the sign of the regression coefﬁcient. It is hereby highly preferable that

a coefﬁcient has the same sign as anticipated by the business expert,

otherwise he/she will be reluctant to use the model.

Coefﬁcients can have unexpected signs due to multicollinearity

issues, noise or small sample effects. Sign restrictions can be easily

enforced in a forward regression setup by preventing variables with

the wrong sign from entering the model. Another criterion for variable

selection is operational efﬁciency. This refers to the amount of resources

that are needed for the collection and preprocessing of a variable.

For example, although trend variables are typically very predictive,

they require a lot of effort to be calculated and may thus not be

suitable to be used in an online, real-time fraud scoring environment

such as credit card fraud detection. The same applies to external

data, where the latency might hamper a timely decision. In both

cases, it might be worthwhile to look for a variable that is correlated

and less predictive but easier to collect and calculate. Finally, also

legal issues need to be properly taken into account. Some variables

cannot be used in fraud-detection applications because of privacy or

discrimination concerns.

DECISION TREES

Basic Concepts

Decision trees are recursive-partitioning algorithms (RPAs) that come up with a tree-like structure representing patterns in an underlying data set (Duda et al. 2001). Figure 4.10 provides an example of a decision tree in a fraud-detection setting.

[Figure: decision tree with root node "Transaction amount > $100,000" and child nodes "Previous fraud" and "Unemployed", each with Yes/No branches leading to Fraud or No Fraud leaves]

Figure 4.10 Example Decision Tree

The top node is the root node specifying a testing condition, of which the outcome corresponds to a branch leading up to an internal node. The terminal nodes of the tree assign the classifications (in our case fraud labels) and are also referred to as the leaf nodes. Many algorithms have been suggested in the literature to construct decision trees. Among the most popular are: C4.5 (See5) (Quinlan 1993), CART (Breiman et al. 1984), and CHAID (Hartigan 1975). These algorithms differ in their way of answering the key decisions to build a tree:

◾ Splitting decision: Which variable to split at what value (e.g., Transaction amount is > $100,000 or not, Previous fraud is yes or no, Unemployed is yes or no)?

◾ Stopping decision: When to stop adding nodes to the tree?

◾ Assignment decision: What class (e.g., fraud or no fraud) to assign to a leaf node?

Usually, the assignment decision is the most straightforward to make since one typically looks at the majority class within the leaf node to make the decision. This idea is also referred to as winner-take-all learning. The other two decisions are less straightforward and are elaborated on in what follows.

Splitting Decision

In order to answer the splitting decision, one must define the concept of impurity or chaos. Consider, for example, the three data sets of Figure 4.11, each containing good (unfilled circles) and bad (filled circles) customers. Here, the good customers are nonfraudulent, whereas the bad customers are fraudulent. Minimal impurity occurs when all customers are either good or bad. Maximal impurity occurs when one has the same number of good and bad customers (i.e., the data set in the middle).

[Figure: three data sets labeled, from left to right, Minimal Impurity, Maximal Impurity, Minimal Impurity]

Figure 4.11 Example Data Sets for Calculating Impurity

Decision trees will now aim at minimizing the impurity in the data.

In order to do so appropriately, one needs a measure to quantify impu-

rity. Various measures have been introduced in the literature and the

most popular are:

◾ Entropy: $E(S) = -p_G \log_2(p_G) - p_B \log_2(p_B)$ (C4.5/See5)

◾ Gini: $\text{Gini}(S) = 2 p_G p_B$ (CART)

◾ Chi-squared analysis (CHAID)

with $p_G$ and $p_B$ being the proportions of good and bad, respectively. The entropy and Gini measures are depicted in Figure 4.12, where it can be clearly seen that the entropy (Gini) is minimal when all customers are either good or bad, and maximal in case of the same number of good and bad customers.
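The two impurity measures can be written down directly from their definitions. This minimal sketch takes the proportion of good customers as input and uses the usual convention that 0 × log2(0) equals 0.

```python
import math

def entropy(p_good):
    """E(S) = -pG*log2(pG) - pB*log2(pB), with 0*log2(0) taken as 0."""
    total = 0.0
    for p in (p_good, 1.0 - p_good):
        if p > 0:
            total -= p * math.log2(p)
    return total

def gini(p_good):
    """Gini(S) = 2 * pG * pB."""
    return 2.0 * p_good * (1.0 - p_good)
```

Both functions return 0 for a pure node and reach their maximum (1 for entropy, 0.5 for Gini) at an even split.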

In order to answer the splitting decision, various candidate splits

will now be evaluated in terms of their decrease in impurity. Consider

a split on age, as depicted in Figure 4.13.

[Figure: entropy and Gini plotted against the class proportion, from 0 to 1]

Figure 4.12 Entropy Versus Gini

[Figure: top node with class counts 400/400, split into Age < 30 with counts 200/400 and Age ≥ 30 with counts 200/0]

Figure 4.13 Calculating the Entropy for Age Split

The original data set had maximum entropy since the numbers of goods and bads were the same. The entropy calculations now become:

◾ Entropy top node $= -1/2 \times \log_2(1/2) - 1/2 \times \log_2(1/2) = 1$

◾ Entropy left node $= -1/3 \times \log_2(1/3) - 2/3 \times \log_2(2/3) = 0.92$

◾ Entropy right node $= -1 \times \log_2(1) - 0 \times \log_2(0) = 0$ (using the convention $0 \times \log_2(0) = 0$)

The weighted decrease in entropy, also known as the gain, can then be calculated as follows:

$$\text{Gain} = 1 - (600/800) \times 0.92 - (200/800) \times 0 = 0.31$$

The gain measures the weighted decrease in entropy thanks to the split. It speaks for itself that a higher gain is to be preferred. The decision tree algorithm will now consider different candidate splits for its root node and adopt a greedy strategy by picking the one with the biggest gain. Once the root node has been decided on, the procedure continues in a recursive way, each time adding splits with the biggest gain. In fact, this can be perfectly parallelized and both sides of the tree can grow in parallel, hereby increasing the efficiency of the tree construction algorithm.
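The gain of a candidate split can be computed from class counts alone. The sketch below applies this to the age split discussed above (400/400 in the top node, 200/400 and 200/0 in the child nodes); the function names are illustrative. Computed at full precision, the gain for this split is about 0.31.

```python
import math

def entropy_from_counts(counts):
    """Entropy of a node given its class counts (0*log2(0) taken as 0)."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy_from_counts(child)
                   for child in child_counts)
    return entropy_from_counts(parent_counts) - weighted

gain = information_gain((400, 400), [(200, 400), (200, 0)])
```

A greedy tree builder would evaluate this quantity for every candidate split and keep the one with the largest value.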

Stopping Decision

The third decision relates to the stopping criterion. Obviously, if the tree

continues to split, it will become very detailed with leaf nodes contain-

ing only a few observations. In the most extreme case, the tree will

have one leaf node per observation and as such perfectly ﬁt the data.

However, by doing so, the tree will start to ﬁt the speciﬁcities or noise

in the data, which is also referred to as overﬁtting. In other words, the

tree has become too complex and fails to correctly model the noise-free pattern or trend in the data. As such, it will generalize poorly to new unseen data. To prevent this, the data will be

split into a training sample and a validation sample. The training sam-

ple will be used to make the splitting decision. The validation sample is

an independent sample, set aside to monitor the misclassiﬁcation error

(or any other performance metric such as a proﬁt-based measure) as

the tree is grown. A commonly used split up is a 70 percent training

sample and 30 percent validation sample. One then typically observes

a pattern as depicted in Figure 4.14.

[Figure: misclassification error versus number of tree nodes for the training set and the validation set; tree growing stops at the minimum of the validation set curve]

Figure 4.14 Using a Validation Set to Stop Growing a Decision Tree

The error on the training sample keeps on decreasing as the splits

become more and more speciﬁc and tailored towards it. On the val-

idation sample, the error will initially decrease, which indicates that

the tree splits generalize well. However, at some point the error will

increase since the splits become too speciﬁc for the training sample as

the tree starts to memorize it. Where the validation set curve reaches its

minimum, the procedure should be stopped, as otherwise overﬁtting

will occur. Note that, as already mentioned, besides classiﬁcation error,

one might also use accuracy or proﬁt based measures on the Y-axis

to make the stopping decision. Also note that sometimes, simplicity is

preferred above accuracy, and one can select a tree that does not neces-

sarily have minimum validation set error, but a lower number of nodes.
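The stopping rule can be sketched as a lookup over validation errors recorded while the tree grows. The error values below are hypothetical, and the `tolerance` parameter is an assumed way of expressing the preference for a simpler tree that is nearly as accurate.

```python
def best_tree_size(validation_errors):
    """Node count with the lowest validation error (ties go to the smaller tree)."""
    return min(validation_errors, key=lambda n: (validation_errors[n], n))

def simplest_tree_within(validation_errors, tolerance):
    """Smallest tree whose validation error is within tolerance of the minimum."""
    min_error = min(validation_errors.values())
    return min(n for n, err in validation_errors.items()
               if err <= min_error + tolerance)

# Hypothetical validation errors by number of tree nodes.
validation_errors = {1: 0.30, 3: 0.22, 5: 0.18, 7: 0.17, 9: 0.19, 11: 0.23}
stop_at = best_tree_size(validation_errors)
simpler = simplest_tree_within(validation_errors, tolerance=0.015)
```

Here the minimum validation error is reached at 7 nodes, while accepting a slightly higher error leads to the smaller 5-node tree.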

Decision Tree Properties

In the example of Figure 4.10, every node had only two branches.

The advantage of this is that the testing condition can be implemented

as a simple yes/no question. Multiway splits allow for more than

two branches and can provide trees that are wider but less deep. In a

read-once decision tree, a particular attribute can be used only once

in a certain tree path. Every tree can also be represented as a rule set since every path from the root node to a leaf node makes up a simple if-then rule. For the tree depicted in Figure 4.10, the corresponding rules are:

If Transaction amount > $100,000 And Unemployed = No Then no fraud

If Transaction amount > $100,000 And Unemployed = Yes Then fraud

If Transaction amount ≤ $100,000 And Previous fraud = Yes Then fraud

If Transaction amount ≤ $100,000 And Previous fraud = No Then no fraud

These rules can then be easily implemented in all kinds of software

packages (e.g., Microsoft Excel).
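As an illustration of how directly such a rule set translates into code, the four rules above can be written as a small Python function. The function and argument names are assumptions chosen for this sketch.

```python
def classify_transaction(amount, unemployed, previous_fraud):
    """Apply the four if-then rules derived from the example tree.

    amount: transaction amount in dollars
    unemployed, previous_fraud: booleans
    """
    if amount > 100_000:
        # Large transactions are split on employment status.
        return "fraud" if unemployed else "no fraud"
    # Transactions of at most $100,000 are split on earlier fraud.
    return "fraud" if previous_fraud else "no fraud"
```

Every path through the function corresponds to exactly one rule, which is what makes the tree easy to audit.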

Decision trees essentially model decision boundaries orthogonal to

the axes. This is illustrated in Figure 4.15 for an example decision tree.

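Figure 4.15 Decision Boundary of a Decision Tree

[Figure: a decision tree that splits on Amount (≤ 1200 versus > 1200) and Recency (≤ 30 versus > 30), with leaf nodes labeled F and NF, shown next to the corresponding axis-parallel partition of the Recency-Amount plane.]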

Regression Trees

Decision trees can also be used to predict continuous targets. Consider

the example of Figure 4.16 where a regression tree is used to predict

the fraud percentage (FP). The latter can be expressed as the percentage

of a predeﬁned limit based on, for example, the maximum transac-

tion amount.

Other criteria now need to be used to make the splitting decision since the impurity will need to be measured in another way. One way to measure impurity in a node is by calculating the mean squared error (MSE) as follows:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(Y_i-\bar{Y})^2,$$

where $n$ represents the number of observations in a leaf node, $Y_i$ the value of observation $i$, and $\bar{Y}$ the average of all values in the leaf node. Obviously, it is desirable to have a low MSE in a leaf node since this indicates that the node is more homogeneous.

Figure 4.16 Example Regression Tree for Predicting the Fraud Percentage

[Figure: a regression tree whose root splits on Credit class (Low, Medium, High), with further Yes/No splits on Merchant known and Previous fraud; the nodes carry the fraud percentages FP = 64%, FP = 38%, FP = 22%, FP = 6%, and FP = 82%.]

Another way to make the splitting decision is by conducting a simple analysis of variance (ANOVA) test and calculating an F-statistic as follows:

$$F = \frac{SS_{between}/(B-1)}{SS_{within}/(n-B)} \sim F_{B-1,\,n-B},$$

whereby

$$SS_{between} = \sum_{b=1}^{B} n_b(\bar{Y}_b-\bar{Y})^2$$

$$SS_{within} = \sum_{b=1}^{B}\sum_{i=1}^{n_b}(Y_{bi}-\bar{Y}_b)^2$$

with $B$ the number of branches of the split, $n$ the total number of observations in the node being split, $n_b$ the number of observations in branch $b$, $\bar{Y}_b$ the average in branch $b$, $Y_{bi}$ the value of observation $i$ in branch $b$, and $\bar{Y}$ the overall average. Good splits favor homogeneity within a node (low $SS_{within}$) and heterogeneity between nodes (high $SS_{between}$). In other words, good splits should have a high F-value, or low corresponding p-value.
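Both impurity measures are easy to compute by hand. The following Python sketch, with function names chosen for illustration, evaluates the MSE of a single node and the F-statistic of a candidate split given the target values that end up in each branch. It assumes at least two branches and a nonzero within-branch sum of squares.

```python
from statistics import mean


def node_mse(values):
    """Mean squared error of the target values in one node."""
    y_bar = mean(values)
    return sum((y - y_bar) ** 2 for y in values) / len(values)


def split_f_statistic(branches):
    """F-statistic of a split; `branches` is a list of lists of targets."""
    n_branches = len(branches)
    n_total = sum(len(branch) for branch in branches)
    overall_mean = sum(sum(branch) for branch in branches) / n_total
    ss_between = sum(
        len(branch) * (mean(branch) - overall_mean) ** 2 for branch in branches
    )
    ss_within = sum(
        (y - mean(branch)) ** 2 for branch in branches for y in branch
    )
    return (ss_between / (n_branches - 1)) / (ss_within / (n_total - n_branches))
```

For the toy split `[[1, 2, 3], [7, 8, 9]]`, the between-branch sum of squares is 54 and the within-branch sum of squares is 4, so the F-statistic equals (54/1)/(4/4) = 54.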

The stopping decision can be made in a similar way as for clas-

siﬁcation trees but using a regression based performance measure

(e.g., mean squared error, mean absolute deviation, R-squared) on the

Y-axis. The assignment decision can be made by assigning the mean

(or median) to each leaf node. Note that standard deviations and thus

conﬁdence intervals may also be computed for each of the leaf nodes.

Using Decision Trees in Fraud Analytics

Decision trees can be used for various purposes in fraud analytics. First,

they can be used for variable selection as variables that occur at the top

of the tree are more predictive of the target. One could also simply cal-

culate the gain of a characteristic to gauge its predictive power. As an

alternative, remember that we already discussed the information value

in Chapter 2 to measure the predictive strength of a variable. Typically,

both the gain and information value consider similar attributes to be

predictive, so one can just choose the measure that is readily available

in the analytics software. Decision trees can also be used for high-level

segmentation. One then typically builds a tree two or three levels deep


as the segmentation scheme and then uses second-stage logistic regres-

sion models for further reﬁnement. Finally, decision trees can also be

used as the final analytical fraud model, deployed directly in the business

environment. A key advantage here is that the decision tree gives

a white-box model with a clear explanation behind how it reaches its

classiﬁcations.

Many software tools also allow trees to be grown interactively by offering, at each level of the tree, the top two (or more) splits among which the fraud modeler can choose. This allows users to choose splits

not only based on impurity reduction, but also on the interpretabil-

ity and/or computational complexity of the split criterion. Hence, the

modeler may favor a split on a less predictive variable, but which is

easier to collect and/or interpret.

Decision trees are very powerful techniques and allow for more

complex decision boundaries than a logistic regression. As discussed,

they are also interpretable and operationally efﬁcient. They are also

nonparametric in the sense that no normality or independence assumptions are needed to build a decision tree. Their most important disadvantage is that they are highly dependent on the sample that was used

for tree construction. A small variation in the underlying sample might

yield a totally different tree. In a later section, we will discuss how this

shortcoming can be addressed using the idea of ensemble learning.

NEURAL NETWORKS

Basic Concepts

A ﬁrst perspective on the origin of neural networks states that they

are mathematical representations inspired by the functioning of the

human brain. Although this may sound appealing, another more

realistic perspective sees neural networks as generalizations of existing

statistical models (Bishop 1995; Zurada 1992). Let’s take logistic

regression as an example:

P(Y=1X1,…,XN)= 1

1+e−(𝛽0+𝛽1X1+…+𝛽NXN),

We could visualize this model as shown in Figure 4.17.

Figure 4.17 Neural Network Representation of Logistic Regression

[Figure: the inputs X1, ..., XN are connected through the weights β1, ..., βN to a single neuron with bias β0, which outputs P(Y | X1, ..., XN) = 1/(1 + e^–(β0 + β1X1 + ... + βNXN)).]

The processing element or neuron in the middle basically performs two operations: it multiplies the inputs with the weights and adds up the results (including the intercept term $\beta_0$, which is called the bias term in neural networks), and then puts this sum into a nonlinear transformation function similar to the one we discussed in the section

on logistic regression. So logistic regression is a neural network with

one neuron. Similarly, we could visualize linear regression as a one

neuron neural network with the identity transformation f(z)=z.
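The one-neuron view can be made concrete with a short Python sketch. The function names are assumptions for this example; the same `neuron` function yields logistic regression with the logistic transformation and linear regression with the identity transformation.

```python
import math


def logistic(z):
    """Logistic transformation, ranging between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    """Identity transformation f(z) = z."""
    return z


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus bias, passed through `activation`."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(z)
```

Calling `neuron(x, betas, beta0, logistic)` reproduces the logistic regression formula, and `neuron(x, betas, beta0, identity)` gives a linear regression prediction.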

We can now generalize the above picture to a multilayer perceptron

(MLP) neural network by adding more layers and neurons, as shown

in Figure 4.18 (Bishop 1995; Zurada 1992).

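Figure 4.18 A Multilayer Perceptron (MLP) Neural Network

[Figure: the inputs x_i are connected through weights w_ij (e.g., W11, W23) to hidden neurons with biases b_j, which compute h_j = f(Σ_i x_i w_ij + b_j); the output neuron computes y = Σ_j v_j h_j + b4.]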

The example in Figure 4.18 is an MLP with one input layer, one hid-

den layer, and one output layer. The hidden layer essentially works like

a feature extractor by combining the inputs into features that are then

subsequently offered to the output layer to make the optimal prediction.

The hidden layer has a nonlinear transformation function f() and the

output layer a linear transformation function. The most popular transformation functions (also called squashing or activation functions) are:

◾ Logistic, $f(z) = \frac{1}{1+e^{-z}}$, ranging between 0 and 1

◾ Hyperbolic tangent, $f(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$, ranging between –1 and +1

◾ Linear, $f(z) = z$, ranging between −∞ and +∞

Although theoretically the activation functions may differ per neu-

ron, they are typically ﬁxed for each layer. For classiﬁcation (e.g., fraud

detection), it is common practice to adopt a logistic transformation in

the output layer, since the outputs can then be interpreted as prob-

abilities (Baesens et al. 2002). For regression targets (e.g., amount of

fraud), one could use any of the transformation functions listed above.

Typically, one will use the hyperbolic tangent activation function in the

hidden layer.
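A forward pass through such a network can be sketched as follows. The sketch assumes one hidden layer with the hyperbolic tangent transformation and a single logistic output neuron, as suggested above for classification; the function name and the weight layout (`W[i][j]` connects input `i` to hidden neuron `j`) are choices made for this example.

```python
import math


def mlp_forward(x, W, b_hidden, v, b_out):
    """Forward pass of an MLP with one tanh hidden layer and a logistic output.

    x: list of inputs
    W: W[i][j] is the weight from input i to hidden neuron j
    b_hidden: biases of the hidden neurons
    v: weights from the hidden neurons to the output neuron
    b_out: bias of the output neuron
    """
    hidden = [
        math.tanh(sum(x[i] * W[i][j] for i in range(len(x))) + b_hidden[j])
        for j in range(len(b_hidden))
    ]
    z = sum(v[j] * hidden[j] for j in range(len(hidden))) + b_out
    return 1.0 / (1.0 + math.exp(-z))
```

Because the output passes through the logistic transformation, the returned value always lies between 0 and 1 and can be read as a fraud probability.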

In terms of hidden layers, theoretical works have shown that neu-

ral networks with one hidden layer are universal approximators, capa-

ble of approximating any function to any desired degree of accuracy on

a compact interval (Hornik et al. 1989). Only for discontinuous func-

tions (e.g., a saw tooth pattern) or in a deep learning context, it could

make sense to try out more hidden layers. Note, however, that these

complex patterns rarely occur in practice. In a fraud setting, it is rec-

ommended to continue the analysis with one hidden layer.

In terms of data preprocessing, it is advised to standardize the

continuous variables using, for example, the z-scores. For categorical

variables, categorization can be used to reduce the number of cate-

gories, which can then be coded using, for example, dummy variables

or weight of evidence coding. Note that it is important to only consider

categorization for the categorical variables, and not for the continuous

variables. The latter can be categorized to model nonlinear effects

into linear models (e.g., linear or logistic regression), but since neural

networks are capable of modeling nonlinear relationships, it is not

needed here.
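The two preprocessing steps can be illustrated with a small Python sketch. The function names are assumptions, the z-scores use the population standard deviation (the sample version would work equally well), and the dummy coding drops the alphabetically first category as the reference level, which is one common convention rather than a requirement.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a continuous variable to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def dummy_code(values):
    """Dummy-code a categorical variable, using the first category
    (in sorted order) as the reference level."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories[1:]] for v in values]
```

For example, `dummy_code(["a", "b", "c", "a"])` gives one column for "b" and one for "c", with "a" represented by two zeros.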


Weight Learning

As discussed earlier, for simple statistical models such as linear regres-

sion, there exists a closed-form mathematical formula for the optimal

parameter values. However, for neural networks, the optimization is a

lot more complex and the weights sitting on the various connections

need to be estimated using an iterative algorithm. The algorithm then

optimizes a cost-function. Similarly to linear regression, when the tar-

get variable is continuous, a mean squared error (MSE) cost function

will be optimized as follows:

$$\frac{1}{2}\sum_{i=1}^{n} e_i^2 = \frac{1}{2}\sum_{i=1}^{n}(Y_i-\hat{Y}_i)^2,$$

where $\hat{Y}_i$ now represents the neural network prediction for observation $i$. In case of a binary target variable, a maximum likelihood cost function can be optimized as follows:

$$\prod_{i=1}^{n} P(Y=1 \mid X_{1i},\ldots,X_{Ni})^{Y_i}\,\bigl(1-P(Y=1 \mid X_{1i},\ldots,X_{Ni})\bigr)^{1-Y_i},$$

where $P(Y=1 \mid X_{1i},\ldots,X_{Ni})$ represents the probability prediction from the neural network.
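Both cost functions can be written down directly. The Python sketch below uses illustrative function names and works with the logarithm of the likelihood, a standard choice for numerical stability that does not change the location of the optimum; it assumes all predicted probabilities lie strictly between 0 and 1.

```python
import math


def half_sse(targets, predictions):
    """Squared error cost: one half of the sum of squared errors."""
    return 0.5 * sum((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions))


def log_likelihood(targets, probabilities):
    """Log of the likelihood for binary targets (0/1) and predicted
    probabilities P(Y=1 | x)."""
    return sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(targets, probabilities)
    )
```

The squared error cost is minimized during training, whereas the (log-)likelihood is maximized.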

The optimization procedure typically starts from a set of random

weights (e.g., drawn from a standard normal distribution), which are

then iteratively adjusted to the patterns in the data using an optimization algorithm. Popular optimization algorithms here are backpropagation learning, conjugate gradient, and Levenberg-Marquardt (see Bishop 1995 for more details). A key issue to note here is the curvature

of the objective function, which is not convex and may be multimodal

as illustrated in Figure 4.19. The error function can thus have multi-

ple local minima but typically only one global minimum. Hence, if the

starting weights are chosen in a suboptimal way, one may get stuck in a

local minimum, which is clearly undesirable. One way to deal with this

is to try out different starting weights, start the optimization procedure

for a few steps, and then continue with the best intermediate solution.

This approach is sometimes referred to as preliminary training. The

optimization procedure then continues until the error function shows

no further progress, the weights stop changing substantially, or after a

ﬁxed number of optimization steps (also called epochs).
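The idea of preliminary training can be illustrated on a toy problem. The sketch below replaces the network by a single weight and uses a hypothetical nonconvex error curve with one local and one global minimum; the curve, the learning rate, and all function names are assumptions for illustration only. Several random starting weights are trained for a few steps, and only the best intermediate solution is trained to convergence with plain gradient descent.

```python
import random


def error(w):
    """Hypothetical nonconvex error curve: local minimum near w = 1.13,
    global minimum near w = -1.30."""
    return w ** 4 - 3 * w ** 2 + w


def gradient(w):
    """Derivative of the error curve."""
    return 4 * w ** 3 - 6 * w + 1


def descend(w, steps, learning_rate=0.01):
    """Plain gradient descent for a fixed number of steps."""
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


def preliminary_training(n_starts=10, warmup_steps=20, full_steps=2000, seed=0):
    """Try several random starts briefly, then continue with the best one."""
    rng = random.Random(seed)
    starts = [rng.uniform(-2.0, 2.0) for _ in range(n_starts)]
    warmed_up = [descend(w, warmup_steps) for w in starts]
    best = min(warmed_up, key=error)
    return descend(best, full_steps)
```

A single start to the right of the hump between the two minima would end in the local minimum; trying several starts makes it far more likely that at least one of them lands in the basin of the global minimum.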

Figure 4.19 Local Versus Global Minima

[Plot: an error curve with a local minimum and a lower global minimum.]

Although multiple output neurons could be used (e.g., predict-

ing fraud and fraud amount simultaneously), it is highly advised to

use only one to make sure that the optimization task is well focused.

The hidden neurons, however, should be carefully tuned and depend

on the nonlinearity in the data. More complex, nonlinear patterns

will require more hidden neurons. Although various procedures (e.g.,

cascade correlation, genetic algorithms, Bayesian methods) have been

suggested in the scientiﬁc literature to do this, the most straightfor-

ward, yet efﬁcient procedure is as follows (Moody and Utans 1994):

1. Split the data into a training, validation, and test set.

2. Vary the number of hidden neurons from 1 to 10 in steps of one

or more.

3. Train a neural network on the training set and measure the performance on the validation set (maybe train multiple neural networks to deal with the local minimum issue).

4. Choose the number of hidden neurons with optimal validation

set performance.

5. Measure the performance on the independent test set.

Note that for fraud detection, the number of hidden neurons

typically varies between 6 and 12.
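Steps 2 to 4 of the procedure above amount to a simple search loop. In the Python sketch below, `train_and_validate` is a hypothetical user-supplied function that trains a network with a given number of hidden neurons and random seed on the training set and returns its validation set performance (higher is better); it stands in for whatever neural network software is used.

```python
def select_hidden_neurons(train_and_validate, candidates=range(1, 11), restarts=3):
    """Return (best number of hidden neurons, best validation performance).

    For each candidate size, several networks are trained from different
    random starts to reduce the impact of local minima, and the best
    validation performance is kept.
    """
    best_size, best_score = None, float("-inf")
    for n_hidden in candidates:
        score = max(train_and_validate(n_hidden, seed) for seed in range(restarts))
        if score > best_score:
            best_size, best_score = n_hidden, score
    return best_size, best_score
```

The performance on the independent test set (step 5) should only be measured once, for the selected network, so that it remains an unbiased estimate.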


Neural networks can model very complex patterns and decision

boundaries in the data and are as such very powerful. Just as with

decision trees, they are so powerful that they can even model the

noise in the training data, which is something that deﬁnitely should be

avoided. One way to avoid this overﬁtting is by using a validation set

in a similar way as with decision trees. This is illustrated in Figure 4.20.

The training set is used here to estimate the weights and the vali-

dation set is again an independent data set used to decide when to

stop training.

Another scheme to prevent a neural network from overﬁtting is

weight regularization, whereby the idea is to keep the weights small in

absolute sense since otherwise they may be ﬁtting the noise in the data.

This is then implemented by adding a weight size term (e.g., Euclidean

norm) to the objective function of the neural network (Bartlett 1997).

In case of a continuous output (and thus mean squared error), the

objective function then becomes:

$$\frac{1}{2}\sum_{i=1}^{n} e_i^2 + \lambda\sum_{j=1}^{k} w_j^2,$$

where $k$ represents the number of weights in the network and $\lambda$ a weight decay (also referred to as weight regularization) parameter to weigh the importance of error versus weight minimization. Setting $\lambda$ too low will cause overfitting, whereas setting it too high will cause underfitting. A practical approach to determining $\lambda$ is to try out different values on an independent validation set and select the one with the best performance.
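The regularized objective is straightforward to compute. In this Python sketch the function name and the argument `lam` (for the weight decay parameter) are illustrative choices.

```python
def regularized_objective(errors, weights, lam):
    """Half the sum of squared errors plus lam times the sum of squared weights."""
    error_term = 0.5 * sum(e ** 2 for e in errors)
    weight_term = lam * sum(w ** 2 for w in weights)
    return error_term + weight_term
```

With `lam = 0` the objective reduces to the plain squared error cost; the larger `lam`, the more strongly large weights are penalized.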

Figure 4.20 Using a Validation Set for Stopping Neural Network Training

[Plot: error (Y-axis) against training steps (X-axis) for the training set and the validation set; training stops at the minimum of the validation set curve.]

Opening the Neural Network Black Box

Although neural networks have their merits in terms of modeling

power, they are commonly described as black-box techniques since

they relate the inputs to the outputs in a mathematically complex,

nontransparent, and opaque way. They have been successfully applied

as high-performance analytical tools in settings where interpretability

is not a key concern (e.g., credit card fraud detection).

However, in application areas where insight into the fraud behavior

is important, one needs to be careful with neural networks (Baesens,

Martens et al. 2011). In what follows, we will discuss the following

three ways of opening the neural network black box:

1. Variable selection

2. Rule extraction

3. Two-stage models

A ﬁrst way to get more insight into the functioning of a neural

network is by doing variable selection. As previously, the aim here

is to select those variables that actively contribute to the neural net-

work output. In linear and logistic regression, the variable importance

was evaluated by inspecting the p-values. Unfortunately, in neural net-

works this is not that easy, as no p-values are readily available. One easy

and attractive way to do it is by visualizing the weights in a Hinton

diagram. A Hinton diagram visualizes the weights between the inputs

and the hidden neurons as squares, whereby the size of the square is

proportional to the size of the weight and the color of the square rep-

resents the sign of the weight (e.g., black colors represent a negative

weight and white colors a positive weight). Clearly, when all weights

connecting a variable to the hidden neurons are close to zero, it does

not contribute very actively to the neural network’s computations, and

one may consider leaving it out. Figure 4.21 shows an example of a

Hinton diagram for a neural network with four hidden neurons and

five variables. It can be clearly seen that the income variable has only small negative and positive weights when compared to the other variables and can thus be considered for removal from the network. A very

straightforward variable selection procedure is:

1. Inspect the Hinton diagram and remove the variable whose weights are closest to zero.

2. Reestimate the neural network with the variable removed. To speed up the convergence, it could be beneficial to start from the previous weights.

3. Continue with step 1 until a stopping criterion is met. The stopping criterion could be a decrease of predictive performance or a fixed number of steps.

Figure 4.21 Example Hinton Diagram

[Figure: squares showing the weights between the five inputs (Age, Income, Claim amount, Time since claim, Accident severity) and the four hidden neurons.]

Another way to do variable selection is by using the following back-

ward variable selection procedure:

1. Build a neural network with all Nvariables.

2. Remove each variable in turn and reestimate the network. This

will give Nnetworks each having N– 1 variables.

3. Remove the variable whose absence gives the best performing

network (e.g., in terms of misclassiﬁcation error, mean squared

error).

4. Repeat this procedure until the performance decreases signiﬁ-

cantly.
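The backward procedure can be sketched as a generic loop. In the Python sketch below, `evaluate` is a hypothetical user-supplied function that reestimates the model on a given list of variables and returns its performance (higher is better); the `tolerance` parameter is an assumed way of deciding when a decrease counts as significant.

```python
def backward_selection(variables, evaluate, tolerance=0.01):
    """Repeatedly remove the variable whose absence hurts performance least.

    Stops when removing any remaining variable would lower the performance
    by more than `tolerance`. Returns the retained variables and the history
    of (variables, performance) pairs.
    """
    current = list(variables)
    current_score = evaluate(current)
    history = [(tuple(current), current_score)]
    while len(current) > 1:
        candidates = [[v for v in current if v != dropped] for dropped in current]
        scores = [evaluate(candidate) for candidate in candidates]
        best_index = max(range(len(candidates)), key=scores.__getitem__)
        if scores[best_index] < current_score - tolerance:
            break  # every removal now costs too much performance
        current, current_score = candidates[best_index], scores[best_index]
        history.append((tuple(current), current_score))
    return current, history
```

Each pass reestimates one model per remaining variable, which is why sampling is useful to keep the procedure affordable. Plotting the recorded history gives the performance curve against the number of variables.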

Figure 4.22 Backward Variable Selection

[Plot: performance (Y-axis) against the number of variables (X-axis).]

When plotting the performance against the number of variables, a

pattern as depicted in Figure 4.22 will likely be obtained. Initially, the

performance will stagnate, or may even increase somewhat. When

important variables are being removed, the performance will start

decreasing. The optimal number of variables can then be situated

around the elbow region of the plot and can be decided in combi-

nation with a business expert. Sampling can be used to make the

procedure less resource intensive and more efﬁcient. Note that this

performance-driven way of variable selection can easily be adopted

with other analytical techniques such as linear or logistic regression

or support vector machines (see next section).

Although variable selection allows users to see which variables are

important to the neural network and which ones are not, it does not

offer a clear insight into its internal workings. The relationship between

the inputs and the output remains nonlinear and complex. A ﬁrst way

to get more transparency is by performing rule extraction, as will be

discussed next.

The purpose of rule extraction is to extract if-then classiﬁcation

rules, mimicking the behavior of the neural network (Baesens 2003;

Baesens et al. 2003; Setiono et al. 2009). Two important approaches

here are decompositional and pedagogical techniques. Decomposi-

tional rule extraction approaches decompose the network’s internal

workings by inspecting weights and/or activation values. A typical

approach here could be (Lu et al. 1995; Setiono et al. 2011):

1. Train a neural network and do variable selection to make it as

concise as possible.

2. Categorize the hidden unit activation values by using clustering.


3. Extract rules that describe the network output in terms of the

categorized hidden unit activation values.

4. Extract rules that describe the categorized hidden unit activation

values in terms of the network inputs.

5. Merge the rules obtained in steps 3 and 4 to directly relate the

inputs to the outputs.

This is illustrated in Figure 4.23.

Pedagogical rule extraction techniques consider the neural net-

work as a black box and use the neural network predictions as input

to a white-box analytical technique such as decision trees (Craven and

Shavlik 1996). This is illustrated in Figure 4.24.

In this approach, the learning data set can be further augmented

with artiﬁcial data, which is then labeled (e.g., classiﬁed or predicted)

by the neural network, so as to further increase the number of observa-

tions to make the splitting decisions when building the decision tree.

Note that since the pedagogical approach does not make use of the parameters or internal model representation, it can essentially be used with any underlying algorithm, such as regression techniques, or SVMs (see later).

Figure 4.23 Decompositional Approach for Neural Network Rule Extraction

[Figure: Step 1: start from the original data (Customer, Age, Income, Known Customer, ..., Fraud). Step 2: build a neural network (e.g., 3 hidden neurons). Step 3: categorize the hidden unit activations h1, h2, h3. Step 4: extract rules relating network outputs to categorized hidden units (If h1 = 1 and h2 = 3 Then Fraud = No; If h2 = 2 Then Fraud = Yes). Step 5: extract rules relating categorized hidden units to inputs (If Age < 28 and Income < 1000 Then h1 = 1; If Known Customer = Y Then h2 = 3; If Age > 34 and Income > 1500 Then h2 = 2). Step 6: merge both rule sets (If Age < 28 and Income < 1000 and Known Customer = Y Then Fraud = No; If Age > 34 and Income > 1500 Then Fraud = Yes).]

Figure 4.24 Pedagogical Approach for Rule Extraction

[Figure: Step 1: start from the original data (Customer, Age, Income, Known Customer, ..., Fraud). Step 2: build a neural network. Step 3: get the network predictions and add them to the data set. Step 4: extract rules relating network predictions to original inputs, generating additional data where necessary; the example tree splits on Age > 40, Known Customer, and Income > 2000, and its leaves give the neural network prediction (Fraud or No Fraud).]

When using either decompositional or pedagogical rule extraction

approaches, the rule sets should be evaluated in terms of their accuracy,

conciseness (e.g., number of rules, number of conditions per rule), and

ﬁdelity. The latter measures to what extent the extracted rule set suc-

ceeds in mimicking the neural network and is calculated as follows:

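                          Neural Network Classification
Rule Set Classification   No Fraud   Fraud
No Fraud                  a          b
Fraud                     c          d

Fidelity = (a + d)/(a + b + c + d),

that is, the fraction of observations on which the rule set and the neural network give the same classification.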
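Reading fidelity as the share of observations on which the rule set and the neural network agree, it can be computed from the four cells of the table as sketched below. The function name and the counts in the usage note are made up for illustration.

```python
def fidelity(a, b, c, d):
    """Fraction of observations on which the rule set and the neural network
    agree; a and d are the agreeing cells, b and c the disagreeing cells."""
    return (a + d) / (a + b + c + d)
```

For instance, with a = 40, b = 5, c = 5, and d = 50, both classifications agree on 90 of 100 observations, giving a fidelity of 0.9.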

It is also important to always benchmark the extracted rules/trees

with a tree built directly on the original data to see the beneﬁt of going

through the neural network.

Another approach to make neural networks more interpretable is

by using a two-stage model setup (Van Gestel et al. 2005, 2006). The idea here is to estimate an easy-to-understand model first (e.g., linear regression, logistic regression). This will give us the interpretability part. In a second stage, a neural network is used to predict the errors made by the simple model using the same set of predictors. This will give us the additional performance benefit of using a nonlinear model. Both models are then combined in an additive way, for example as follows:

◾ Target = Linear regression (X1, X2, …, XN) + Neural network (X1, X2, …, XN)

◾ Score = Logistic regression (X1, X2, …, XN) + Neural network (X1, X2, …, XN)

This setup provides an ideal balance between model interpretability (which comes from the first part) and model performance (which comes from the second part). This is illustrated in Figure 4.25.

Figure 4.25 Two-Stage Models

[Figure: Step 1: start from the original data. Step 2: build a logistic regression model. Step 3: calculate the errors from the logistic regression model (e.g., Fraud = No (= 0) with logistic regression output 0.44 gives error –0.44; Fraud = Yes (= 1) with output 0.76 gives error 0.24). Step 4: build a neural network predicting the errors from the logistic regression model. Step 5: score new observations by adding up the logistic regression and neural network scores (e.g., 0.68 + 0.14 = 0.82).]
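The arithmetic behind the two-stage setup is simple, as the following Python sketch shows using the numbers from Figure 4.25. The function names are assumptions; the first-stage scores and the predicted error would come from the fitted logistic regression and neural network, which are not shown here.

```python
def first_stage_errors(targets, first_stage_scores):
    """Errors of the simple model: the target for the second-stage network."""
    return [y - score for y, score in zip(targets, first_stage_scores)]


def two_stage_score(first_stage_score, predicted_error):
    """Final score: simple model output plus the network's predicted error."""
    return first_stage_score + predicted_error
```

With the targets 0, 1, 0, 1 and logistic regression outputs 0.44, 0.76, 0.18, 0.88, the errors are –0.44, 0.24, –0.18, and 0.12; a new observation with logistic regression output 0.68 and predicted error 0.14 receives the final score 0.82.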

SUPPORT VECTOR MACHINES

Linear Programming

Two key shortcomings of neural networks are the fact that the

objective function is nonconvex (and hence may have multiple


local minima) and the effort that is needed to tune the number of

hidden neurons. Support vector machines (SVMs) deal with both of

these issues (Cristianini and Taylor 2000; Schölkopf and Smola 2001;

Vapnik 1995).

The origins of classification SVMs date back to the early days of linear programming (Mangasarian 1965). Consider, for example,

the following linear program (LP) for classiﬁcation in a fraud

setting:

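$$\min\; e_1 + e_2 + \ldots + e_{n_{nf}} + \ldots + e_{n_{nf}+n_f}$$

subject to

$$w_1 x_{i1} + w_2 x_{i2} + \ldots + w_N x_{iN} \ge c - e_i,\quad 1 \le i \le n_{nf},$$

$$w_1 x_{i1} + w_2 x_{i2} + \ldots + w_N x_{iN} \le c + e_i,\quad n_{nf}+1 \le i \le n_{nf}+n_f,$$

$$e_i \ge 0,$$

with $x_{ij}$ the value of variable $j$ for observation $i$, and $n_{nf}$ and $n_f$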

the number of no frauds and frauds, respectively. The LP assigns

the no frauds a score above the cut-off value c, and the frauds

a score below c. The error variables eiare needed to be able to

solve the program since perfect separation will typically not be

possible. Linear programming has been very popular in the early

days of credit scoring. One of its key beneﬁts is that it is easy to

include domain or business knowledge by adding extra constraints

to the model. Suppose prior business experience indicates that age

(variable 1) is more important than income (variable 2). This can

be easily enforced by adding the constraint $w_1 \ge w_2$ to the linear

program.
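Solving the linear program requires an LP solver, but its objective is easy to evaluate for any given set of weights. The Python sketch below, with illustrative names, computes the smallest error variables that satisfy the constraints for fixed weights `w` and cut-off `c`, and returns their sum; an LP solver would search for the `w` and `c` that minimize this value.

```python
def lp_objective(w, c, no_frauds, frauds):
    """Sum of the minimal error variables e_i for fixed weights w and cut-off c.

    no_frauds, frauds: lists of observations (each a list of variable values).
    A no fraud scoring below c, or a fraud scoring above c, contributes the
    size of its violation.
    """
    def score(x):
        return sum(w_j * x_j for w_j, x_j in zip(w, x))

    errors_no_fraud = [max(0.0, c - score(x)) for x in no_frauds]
    errors_fraud = [max(0.0, score(x) - c) for x in frauds]
    return sum(errors_no_fraud) + sum(errors_fraud)
```

Observations on the correct side of the cut-off contribute zero, so a perfectly separating combination of `w` and `c` yields an objective value of zero.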

A key problem with linear programming is that it can estimate

multiple optimal decision boundaries as illustrated in Figure 4.26 for a

perfectly linearly separable case, where class 1 represents the fraudsters

and class 2 the non-fraudsters.

The Linear Separable Case

SVMs add an extra objective to the analysis. Consider the situation depicted in Figure 4.27 with two variables $x_1$ (e.g., age) and $x_2$ (e.g., income). It has two hyperplanes sitting at the edges of both classes, and a hyperplane in between which will serve as the classification boundary. The perpendicular distance from the first hyperplane $H_1$ to the origin equals $|b-1|/\lVert w\rVert$, whereby $\lVert w\rVert$ represents the Euclidean norm of $w$ calculated as $\lVert w\rVert = \sqrt{w_1^2+w_2^2}$. Likewise, the perpendicular distance from $H_2$ to the origin equals $|b+1|/\lVert w\rVert$.

Figure 4.26 Multiple Separating Hyperplanes

[Figure: Class 1 (x) and Class 2 (+) observations with several hyperplanes that all separate the two classes perfectly.]

Figure 4.27 SVM Classifier for the Perfectly Linearly Separable Case

[Figure: Class 1 and Class 2 separated by the hyperplane H0: w^T x + b = 0, with H1: w^T x + b = +1 and H2: w^T x + b = –1 at the edges of the classes and a margin of 2/||w|| between H1 and H2.]

Hence, the margin between both hyperplanes equals $2/\lVert w\rVert$. SVMs will now aim at maximizing this margin to pull both classes as far apart as possible. Maximizing the margin is similar to minimizing $\lVert w\rVert$, or minimizing $\frac{1}{2}\sum_{i=1}^{N} w_i^2$. In case of perfect linear separation, the SVM classifier then becomes as follows.

Consider a training set: $\{x_k, y_k\}_{k=1}^{n}$ with $x_k \in \mathbb{R}^N$ and $y_k \in \{-1; +1\}$.

The goods (e.g., class +1) should be above hyperplane $H_1$, and the bads (e.g., class –1) below hyperplane $H_2$, which gives:

$$w^T x_k + b \ge 1, \quad \text{if } y_k = +1$$

$$w^T x_k + b \le -1, \quad \text{if } y_k = -1$$

Both can be combined as follows:

$$y_k(w^T x_k + b) \ge 1.$$

The optimization problem then becomes:

$$\text{Minimize } \frac{1}{2}\sum_{i=1}^{N} w_i^2$$

$$\text{subject to } y_k(w^T x_k + b) \ge 1, \quad k = 1 \ldots n.$$

This quadratic programming (QP) problem can now be solved

using Lagrangian optimization (Cristianini and Taylor 2000; Schölkopf

and Smola 2001; and Vapnik 1995). Important to note is that

the optimization problem has a quadratic cost function, giving a

convex optimization problem with no local minima and only one

global minimum. Training points that lie on one of the hyperplanes $H_1$ or $H_2$ are called support vectors and are essential to the classification. The classification hyperplane itself is $H_0$ and for new observations, it needs to be checked whether they are situated above $H_0$, in which case the prediction is +1, or below (prediction −1). This can be easily accomplished using the sign operator as follows: $y(x) = \text{sign}(w^T x + b)$. Remember, $\text{sign}(x)$ is +1 if $x \ge 0$, and –1 otherwise.
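Once the weights are known, classifying a new observation and computing the margin take only a few lines. The Python sketch below uses illustrative function names and assumes that `w` and `b` have already been obtained from the optimization.

```python
import math


def svm_predict(w, b, x):
    """Linear SVM prediction: +1 if w^T x + b >= 0, and -1 otherwise."""
    score = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if score >= 0 else -1


def margin(w):
    """Distance 2/||w|| between the hyperplanes H1 and H2."""
    return 2.0 / math.sqrt(sum(w_i ** 2 for w_i in w))
```

For example, the weight vector (3, 4) has Euclidean norm 5 and therefore a margin of 0.4; shrinking the weights widens the margin.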


The Linear Nonseparable Case

The SVM classiﬁer discussed thus far assumed perfect separation is pos-

sible, which will, of course, be rarely the case for real-life data sets. In

case of overlapping class distributions (as illustrated in Figure 4.28),

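the SVM classifier can be extended with error terms $e_i$ as follows:

$$\text{Minimize } \frac{1}{2}\sum_{i=1}^{N} w_i^2 + C\sum_{i=1}^{n} e_i$$

$$\text{subject to } y_k(w^T x_k + b) \ge 1 - e_k, \quad k = 1 \ldots n$$

$$e_k \ge 0.$$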

The error variables $e_k$ are needed to allow for misclassifications. The $C$ hyperparameter in the objective function balances the importance of maximizing the margin versus minimizing the error on the data. A high (low) value of $C$ implies a higher (lower) risk of overfitting. Note the similarity with the idea of weight regularization discussed in the section on neural networks. There too, the objective function consisted of an error term and the sum of the squared weights. We will discuss procedures to determine the optimal value of $C$ later in this section.

Just as before, the problem is a quadratic programming (QP) problem,

which can be solved using Lagrangian optimization.

Figure 4.28 SVM Classifier in Case of Overlapping Distributions

[Figure: overlapping Class 1 and Class 2 observations with the hyperplanes H1: w^T x + b = +1, H0: w^T x + b = 0, and H2: w^T x + b = –1, and a margin of 2/||w||.]

The Nonlinear SVM Classiﬁer

Finally, the nonlinear SVM classifier will first map the input data to a higher dimensional feature space using some mapping $\varphi(x)$. This is illustrated in Figure 4.29.

The SVM problem formulation now becomes:

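$$\text{Minimize } \frac{1}{2}\sum_{i=1}^{N} w_i^2 + C\sum_{i=1}^{n} e_i$$

$$\text{subject to } y_k(w^T\varphi(x_k) + b) \ge 1 - e_k, \quad k = 1 \ldots n$$

$$e_k \ge 0.$$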

When working out the Lagrangian optimization (Cristianini and Taylor 2000; Schölkopf and Smola 2001; and Vapnik 1995), it turns out that the mapping $\varphi(x)$ is never explicitly needed, but only implicitly by means of the kernel function $K$ defined as follows: $K(x_k, x_l) = \varphi(x_k)^T\varphi(x_l)$. Hence, the feature space does not need to be explicitly specified. The nonlinear SVM classifier then becomes:

$$y(x) = \text{sign}\left[\sum_{k=1}^{n}\alpha_k y_k K(x, x_k) + b\right],$$

where $\alpha_k$ are the Lagrangian multipliers stemming from the optimization. Support vectors will have nonzero $\alpha_k$ since they are needed to construct the classification hyperplane. All other observations have zero $\alpha_k$, which is often referred to as the sparseness property of SVMs. Different types of kernel functions can be used. The most popular are:

◾ Linear kernel: $K(x, x_k) = x_k^T x$

◾ Polynomial kernel: $K(x, x_k) = (1 + x_k^T x)^d$

◾ Radial basis function (RBF) kernel: $K(x, x_k) = \exp\{-\lVert x - x_k\rVert^2/\sigma^2\}$

Figure 4.29 The Feature Space Mapping

[Figure: observations (X and O) that are not linearly separable in the input space are mapped by x → φ(x) to the feature space, where the hyperplane w^T φ(x_i) + b = 0 separates them; the kernel is K(x1, x2) = φ(x1)^T φ(x2).]

Empirical evidence has shown that the RBF kernel usually performs best, but note that it includes an extra parameter $\sigma$ to be tuned (Van Gestel et al. 2004).
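The three kernels and the resulting classifier can be sketched in plain Python. The function names are illustrative, and the multipliers, labels, support vectors, and bias passed to `svm_decision` are assumed to come from an already solved optimization problem.

```python
import math


def linear_kernel(x, x_k):
    """K(x, x_k) = x_k^T x"""
    return sum(a * b for a, b in zip(x, x_k))


def polynomial_kernel(x, x_k, d=2):
    """K(x, x_k) = (1 + x_k^T x)^d"""
    return (1 + linear_kernel(x, x_k)) ** d


def rbf_kernel(x, x_k, sigma=1.0):
    """K(x, x_k) = exp(-||x - x_k||^2 / sigma^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_k))
    return math.exp(-squared_distance / sigma ** 2)


def svm_decision(x, support_vectors, alphas, labels, b, kernel):
    """Nonlinear SVM prediction: sign of the kernel expansion plus bias."""
    total = b + sum(
        alpha * y * kernel(x, x_k)
        for alpha, y, x_k in zip(alphas, labels, support_vectors)
    )
    return 1 if total >= 0 else -1
```

Only the support vectors need to be passed in, since all other observations have zero multipliers and drop out of the sum.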

A key question to answer when building SVM classiﬁers is the tun-

ing of the hyperparameters. For example, suppose one has an RBF

SVM, which has two hyperparameters $C$ and $\sigma$. Both can be tuned using the following procedure (Van Gestel et al. 2004):

1. Partition the data into 40%/30%/30% training, validation, and test data.

2. Build an RBF SVM classifier for each ($\sigma$, $C$) combination from the sets $\sigma \in$ {0.5, 5, 10, 15, 25, 50, 100, 250, 500} and $C \in$ {0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500}.

3. Choose the ($\sigma$, $C$) combination with the best validation set performance.

4. Build an RBF SVM classifier with the optimal ($\sigma$, $C$) combination on the combined training + validation data set.

5. Calculate the performance of the estimated RBF SVM classiﬁer

on the test set.

In case of linear or polynomial kernels, a similar procedure can be

adopted.
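Steps 2 and 3 of the tuning procedure amount to an exhaustive grid search. The sketch below only shows the search itself: `validation_score` is a hypothetical callable that is assumed to train an RBF SVM on the training data and return its validation set performance, and `toy_score` is a stand-in used purely to make the example runnable (no SVM is trained here).

```python
from itertools import product

SIGMAS = [0.5, 5, 10, 15, 25, 50, 100, 250, 500]
CS = [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500]

def tune(validation_score, sigmas=SIGMAS, cs=CS):
    """Return the (sigma, C) pair with the highest validation score."""
    return max(product(sigmas, cs), key=lambda pair: validation_score(*pair))

# Stand-in scoring function for illustration only.
def toy_score(sigma, C):
    return -abs(sigma - 10) - abs(C - 1)

best_sigma, best_C = tune(toy_score)
print(best_sigma, best_C)
```

With the two grids from step 2, the search evaluates 9 × 10 = 90 candidate classifiers before the final model of step 4 is built.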

SVMs for Regression

SVMs can also be used for regression applications with a continuous target. The idea here is to find a function $f(x)$ that has at most $\varepsilon$ deviation from the actual targets $y_i$ for all the training data, and is at the same time as flat as possible. Hence, the loss function tolerates errors smaller than $\varepsilon$ and penalizes errors larger than $\varepsilon$. This is visualized in Figure 4.30.

Figure 4.30 SVMs for Regression (loss function with a tolerance band between $-\varepsilon$ and $+\varepsilon$)

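Consider a training set $\{x_k, y_k\}_{k=1}^{n}$ with $x_k \in \mathbb{R}^N$ and $y_k \in \mathbb{R}$. The SVM formulation then becomes:

\[ \text{Minimize} \quad \frac{1}{2}\sum_{i=1}^{N} w_i^2 + C\sum_{k=1}^{n} (\varepsilon_k + \varepsilon_k^*) \]

subject to

\[ y_k - w^T\varphi(x_k) - b \le \varepsilon + \varepsilon_k \]

\[ w^T\varphi(x_k) + b - y_k \le \varepsilon + \varepsilon_k^* \]

\[ \varepsilon, \varepsilon_k, \varepsilon_k^* \ge 0. \]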

The hyperparameter $C$ determines the trade-off between the flatness of $f$ and the amount to which deviations larger than $\varepsilon$ are tolerated. Note the feature space mapping $\varphi(x)$, which is also used here. Using Lagrangian optimization, the resulting nonlinear regression function becomes:

\[ f(x) = \sum_{k=1}^{n} (\alpha_k - \alpha_k^*) K(x_k, x) + b, \]

where $\alpha_k$ and $\alpha_k^*$ represent the Lagrangian multipliers. The hyperparameters $C$ and $\varepsilon$ can be tuned using a procedure similar to the one outlined for classification SVMs.
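To make the role of $\varepsilon$ concrete, the sketch below implements the $\varepsilon$-insensitive loss and the evaluation of the regression function. It is an illustration under assumptions: the multipliers and bias are taken as given (no optimization is performed), and the function and argument names are hypothetical.

```python
def eps_insensitive_loss(y, f_x, eps):
    """Zero inside the +/- eps band around the target, linear outside it."""
    return max(0.0, abs(y - f_x) - eps)

def svr_predict(x, support, b, kernel):
    """f(x) = sum_k (alpha_k - alpha_k*) K(x_k, x) + b.

    support is a list of (x_k, alpha_k, alpha_k_star) triples.
    """
    return sum((a - a_star) * kernel(xk, x) for xk, a, a_star in support) + b

print(eps_insensitive_loss(1.0, 1.05, eps=0.1))  # deviation within the band
print(eps_insensitive_loss(1.0, 1.50, eps=0.1))  # deviation beyond the band
```

Only observations whose two multipliers differ contribute to the sum, which mirrors the sparseness property mentioned for classification SVMs.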


Opening the SVM Black Box

Similar to neural networks, SVMs have a universal approximation property. As an extra benefit, they do not require tuning of the number of hidden neurons and are characterized by convex optimization. However, they are also too complex to be used in settings where interpretability is important. Just as with neural networks, procedures can be used to provide more transparency by opening up the SVM black box.

Variable selection can be performed using the backward variable

selection procedure discussed in the section on neural networks. This

will essentially reduce the variables but not provide any additional

insight into the workings of the SVM. Rule extraction approaches can

then be used in a next step. In order to apply decompositional rule

extraction approaches, the SVM can be represented as a neural net-

work as depicted in Figure 4.31.

The hidden layer uses kernel activation functions, whereas the output layer uses a linear activation function. Note that the number of hidden neurons now corresponds to the number of support vectors and follows automatically from the optimization. This is in strong contrast to neural networks where the number of hidden neurons needs to be tuned manually. The decompositional approach can then proceed by first extracting If-Then rules relating the output to the hidden unit activation values. In a next step, rules are extracted relating the hidden unit activation values to the inputs, followed by the merger of both rule sets.

Figure 4.31 Representing an SVM Classifier as a Neural Network (hidden units $K(x, x_1), K(x, x_2), \ldots, K(x, x_{ns})$ with weights $\alpha_1, \alpha_2, \ldots, \alpha_{ns}$ and bias $b$)

Since a pedagogical approach considers the underlying model as a

black box, it can be easily combined with SVMs. Just as in the neural

network case, the SVM is ﬁrst used to construct a data set with SVM

predictions for each of the observations. This data set is then given

to a decision tree algorithm to build a decision tree. Also here, addi-

tional training set observations can be generated to facilitate the tree

construction process.

Finally, two-stage models can also be used to provide more comprehensibility. Remember, in this approach a simple model (e.g., linear or logistic regression) is estimated first, followed by an SVM to correct the errors of the simple model.

ENSEMBLE METHODS

Ensemble methods aim at estimating multiple analytical models

instead of using only one. The idea here is that multiple models can

cover different parts of the data input space and as such complement

each other’s deﬁciencies. In order to successfully accomplish this, the

analytical technique needs to be sensitive to changes in the underlying

data. This is especially the case for decision trees and that’s why they

are commonly used in ensemble methods. In what follows, we will

discuss bagging, boosting, and random forests.

Bagging

Bagging (Bootstrap aggregating) starts by taking B bootstraps from the

underlying sample (Breiman 1996). Note that a bootstrap is a sample

with replacement (see section on evaluating predictive models). The

idea is then to build a classiﬁer (e.g., decision tree) for every bootstrap.

For classiﬁcation, a new observation will be classiﬁed by letting all B

classiﬁers vote, using, for example, a majority voting scheme whereby

ties are resolved arbitrarily. For regression, the prediction is the average

of the outcome of the B models (e.g., regression trees). Note that here

also a standard error and thus conﬁdence interval can be calculated.


The number of bootstraps B can either be ﬁxed (e.g., 30) or tuned via

an independent validation data set.

The key element for bagging to be successful is the instability of

the analytical technique. If perturbing the data set by means of the

bootstrapping procedure can alter the model constructed, then bagging

will improve the accuracy (Breiman 1996). However, for models that

are robust with respect to the underlying data set, it will not give much

added value.
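The bagging procedure can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `fit` callable that takes a data set and returns a prediction function; any unstable base learner (e.g., a decision tree) could be plugged in.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Sample len(data) observations with replacement."""
    return [rng.choice(data) for _ in data]

def bagging_fit(data, fit, B=30, seed=0):
    """Fit one base model per bootstrap sample."""
    rng = random.Random(seed)
    return [fit(bootstrap(data, rng)) for _ in range(B)]

def bagging_classify(models, x):
    """Majority vote over the B classifiers (ties are resolved arbitrarily)."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

def bagging_regress(models, x):
    """Average of the B model outputs."""
    return sum(model(x) for model in models) / len(models)
```

Because each model only sees its own bootstrap sample, the B models differ from one another only to the extent that the base learner is sensitive to changes in the data.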

Boosting

Boosting works by estimating multiple models using a weighted sample

of the data (Freund and Schapire 1997, 1999). Starting from uniform

weights, boosting will iteratively reweight the data according to the

classiﬁcation error whereby misclassiﬁed cases get higher weights. The

idea here is that difﬁcult observations should get more attention. Either

the analytical technique can directly work with weighted observations,

or if not, we can just sample a new data set according to the weight

distribution. The ﬁnal ensemble model is then a weighted combination

of all the individual models. A popular implementation of this is the

Adaptive boosting/Adaboost procedure, which works as follows:

1. Given the following observations: $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i$ is the attribute vector of observation $i$ and $y_i \in \{1, -1\}$

2. Initialize the weights as follows: $W_1(i) = 1/n$, $i = 1, \ldots, n$

3. For $t = 1 \ldots T$

a. Train a weak classifier (e.g., decision tree) using the weights $W_t$

b. Get weak classifier $C_t$ with classification error $\varepsilon_t$

c. Choose $\alpha_t = \frac{1}{2}\ln\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right)$

d. Update the weights as follows:

i. $W_{t+1}(i) = \frac{W_t(i)}{Z_t} e^{-\alpha_t}$ if $C_t(x_i) = y_i$

ii. $W_{t+1}(i) = \frac{W_t(i)}{Z_t} e^{\alpha_t}$ if $C_t(x_i) \ne y_i$

4. Output the final ensemble model: $E(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t C_t(x)\right)$

Note that in this procedure, $T$ represents the number of boosting runs, $\alpha_t$ measures the importance that is assigned to classifier $C_t$ and increases as $\varepsilon_t$ gets smaller, $Z_t$ is a normalization factor needed to make sure that the weights in step $t$ make up a distribution and as such sum to 1, and $C_t(x)$ represents the classification of the classifier built in step $t$ for observation $x$. Multiple loss functions may be used to calculate the error $\varepsilon_t$, although the misclassification rate is undoubtedly the most popular. In substep i of step d, it can be seen that correctly classified observations get lower weights, whereas substep ii assigns higher weights to the incorrectly classified cases. Again, the number of boosting runs $T$ can be fixed or tuned using an independent validation set. Note that various variants of this Adaboost procedure exist, such as Adaboost.M1, Adaboost.M2 (both for multiclass classification), and Adaboost.R1, Adaboost.R2 (both for regression). See Freund and Schapire (1997 and 1999) for more details. A key advantage of boosting is that it is really easy to implement. A potential drawback is that there may be a risk of overfitting to the hard (potentially noisy) examples in the data, which will get higher weights as the algorithm proceeds. This is especially relevant in a fraud detection setting because, as mentioned earlier, the target labels in a fraud setting are typically quite noisy.
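The Adaboost steps can be illustrated with a small sketch. It makes several simplifying assumptions that are not in the text: the inputs are one-dimensional, the weak classifier is a decision stump (a single threshold), the error is the weighted misclassification rate, and the error is clamped away from 0 and 1 to keep the logarithm finite.

```python
import math

def fit_stump(X, y, w):
    """Weighted 1-D decision stump: predict s if x > thr, else -s."""
    best = None
    for thr in sorted(set(X)):
        for s in (1, -1):
            pred = [s if x > thr else -s for x in X]
            err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, thr, s)
    return best

def adaboost(X, y, T=10):
    n = len(X)
    w = [1.0 / n] * n                          # step 2: uniform weights
    ensemble = []
    for _ in range(T):
        err, thr, s = fit_stump(X, y, w)       # steps 3a and 3b
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # step 3c
        ensemble.append((alpha, thr, s))
        pred = [s if x > thr else -s for x in X]
        # step 3d: lower weights for correct cases, higher for errors
        w = [wi * math.exp(-alpha if p == yi else alpha)
             for wi, p, yi in zip(w, pred, y)]
        z = sum(w)                             # normalization factor Z_t
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all weak classifiers."""
    score = sum(a * (s if x > thr else -s) for a, thr, s in ensemble)
    return 1 if score >= 0 else -1
```

Dividing by the sum of the updated weights plays the role of $Z_t$: after every round the weights again sum to 1.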

Random Forests

The technique of random forests was ﬁrst introduced by Breiman

(2001). It creates a forest of decision trees as follows:

1. Given a data set with $n$ observations and $N$ inputs.

2. $m$ = constant chosen beforehand.

3. For $t = 1, \ldots, T$

a. Take a bootstrap sample with $n$ observations.

b. Build a decision tree whereby for each node of the tree, randomly choose $m$ variables on which to base the splitting decision.

c. Split on the best of this subset.

d. Fully grow each tree without pruning.

Common choices for $m$ are 1, 2, or floor($\log_2(N) + 1$), which is recommended. Random forests can be used with both classification trees and regression trees. Key in this approach is the dissimilarity amongst the base classifiers (i.e., decision trees), which is obtained by adopting a bootstrapping procedure to select the training samples of the individual base classifiers, the selection of a random subset of attributes at each node, and the strength of the individual base models. As such the diversity of the base classifiers creates an ensemble that is superior in performance compared to the single models.
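The per-node randomization is the part that distinguishes random forests from bagging. The sketch below shows only that ingredient, assuming inputs are identified by their index 0 to N − 1; tree growing itself is left out, and the function names are illustrative.

```python
import math
import random

def default_m(N):
    """floor(log2(N) + 1), the recommended number of candidate variables."""
    return math.floor(math.log2(N) + 1)

def candidate_variables(N, m, rng):
    """Randomly choose m of the N input indices for one node split."""
    return rng.sample(range(N), m)

rng = random.Random(42)
N = 20
m = default_m(N)                         # floor(4.32 + 1) = 5
subset = candidate_variables(N, m, rng)  # drawn afresh at every node
print(m, subset)
```

A new subset is drawn at every node, so two trees grown on the same bootstrap sample would still typically differ.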

More recently, an alternative to random forests was proposed: rota-

tion forests. This ensemble technique takes the idea of random forests

one step further. It combines the idea of pooling a large number of

decision trees built on a subset of the attributes and data, with the

application of principal component analysis prior to decision tree build-

ing, explaining its name. Rotating the axes prior to model building was

found to enhance base classiﬁer accuracy at the expense of losing the

ability of ranking individual attributes by their importance (Rodriguez

et al. 2006).

Evaluating Ensemble Methods

Various benchmarking studies have shown that random forests can

achieve excellent predictive performance. Actually, they generally

rank amongst the best performing models across a wide variety of

prediction tasks (Dejaeger et al. 2012). They are also perfectly capable

of dealing with data sets having only a few observations, but with lots

of variables. They are highly recommended when high performing

analytical methods are needed for fraud detection. However, the price

that is paid for this, is that they are essentially black-box models. Due

to the multitude of decision trees that make up the ensemble, it is

very hard to see how the ﬁnal classiﬁcation is made. One way to shed

some light on the internal workings of an ensemble is by calculating

the variable importance. A popular procedure to do so is as follows:

1. Permute the values of the variable under consideration (e.g., $X_j$) on the validation or test set.

2. For each tree, calculate the difference between the error on the data with $X_j$ permuted and the error on the original, unpermuted data, and average this difference over all trees:

\[ VI(X_j) = \frac{1}{ntree}\sum_{t}\left(error_t(\tilde{D}_j) - error_t(D)\right), \]

whereby $ntree$ represents the number of trees in the ensemble, $D$ the original data, and $\tilde{D}_j$ the data with variable $X_j$ permuted. In a regression setting, the error can be the mean squared error (MSE), whereas in a classification setting, the error can be the misclassification rate.

3. Order all variables according to their VI value. The variable with the highest VI value is the most important.
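A permutation-based importance can be sketched as follows. The sign convention used here is error after permutation minus error before permutation, so that larger values indicate more important variables. Each tree is assumed to be a callable that maps one observation (a list of values) to a class label; these names are illustrative.

```python
import random

def misclassification_rate(model, X, y):
    return sum(model(row) != yi for row, yi in zip(X, y)) / len(y)

def permutation_importance(trees, X, y, j, rng):
    """Average, over trees, of error(X_j permuted) minus error(original)."""
    column = [row[j] for row in X]
    rng.shuffle(column)
    X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
    diffs = [misclassification_rate(t, X_perm, y) - misclassification_rate(t, X, y)
             for t in trees]
    return sum(diffs) / len(diffs)

# Toy ensemble of one "tree" that only looks at the first variable.
X = [[0, 5], [1, 5], [0, 7], [1, 7]]
y = [0, 1, 0, 1]
trees = [lambda row: row[0]]
print(permutation_importance(trees, X, y, 1, random.Random(0)))  # 0.0
```

Permuting a variable the model never uses leaves every prediction unchanged, so its importance is exactly zero.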

MULTICLASS CLASSIFICATION TECHNIQUES

In the introduction of this chapter, we already discussed the difﬁculty of

appropriately determining the target label in a fraud detection setting.

One way to deal with this is by creating more than two target val-

ues, e.g., as follows: clear fraud, doubt case, no fraud. These values are

nominal, implying that there is no meaningful order between them.

As an alternative, the target values can also be ordinal: severe fraud,

medium fraud, light fraud, no fraud. All of the classiﬁcation techniques

discussed earlier in this chapter can be easily extended to a multiclass

setting whereby more than two target values or classes are present.

Multiclass Logistic Regression

When estimating a multiclass logistic regression model, one ﬁrst needs

to know whether the target variable is nominal or ordinal. For nominal

target variables, one of the target classes (say class K) will be chosen as

the base class as follows (Allison 2001):

P(Y=1X1,…,XN)

P(Y=KX1,…,XN)=e(𝛽1

0+𝛽1

1X1+𝛽1

2X2+…𝛽1

NXN)

P(Y=2X1,…,XN)

P(Y=KX1,…,XN)=e(𝛽2

0+𝛽2

1X1+𝛽2

2X2+…𝛽2

NXN)

…

PREDICTIVE ANALYTICS FOR FRAUD DETECTION 169

P(Y=K−1X1,…,XN)

P(Y=KX1,…,XN)=e(𝛽K−1

0+𝛽K−1

1X1+𝛽K−1

2X2+…𝛽K−1

NXN)

Using the fact that all probabilities must sum to one, one can then

obtain the following:

P(Y=1X1,…,XN)= e(𝛽1

0+𝛽1

1X1+𝛽1

2X2+…𝛽1

NXN)

1+K−1

k=1e(𝛽k

0+𝛽k

1X1+𝛽k

2X2+…𝛽k

NXN)

P(Y=2X1,…,XN)= e(𝛽2

0+𝛽2

1X1+𝛽2

2X2+…𝛽2

NXN)

1+K−1

k=1e(𝛽k

0+𝛽k

1X1+𝛽k

2X2+…𝛽k

NXN)

P(Y=KX1,…,XN)= 1

1+K−1

k=1e(𝛽k

0+𝛽k

1X1+𝛽k

2X2+…𝛽k

NXN)

The 𝛽parameters are then usually estimated using maximum apos-

teriori estimation, which is an extension of maximum likelihood esti-

mation. As with binary logistic regression, the procedure comes with

standard errors, conﬁdence intervals, and p-values.
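Given estimated coefficients, the class probabilities of the nominal model follow directly from the formulas above. The sketch below assumes the coefficients are already available (estimation is not shown) and stores them as one list per non-base class; this layout is an illustrative choice.

```python
import math

def multinomial_probs(x, betas):
    """Class probabilities with class K as the base class.

    betas holds K-1 coefficient lists [b0, b1, ..., bN], one per non-base class.
    Returns [P(Y=1), ..., P(Y=K-1), P(Y=K)].
    """
    scores = [math.exp(b[0] + sum(bj * xj for bj, xj in zip(b[1:], x)))
              for b in betas]
    denom = 1.0 + sum(scores)
    return [s / denom for s in scores] + [1.0 / denom]

# With all coefficients equal to zero, every class is equally likely.
print(multinomial_probs([1.0, 2.0], [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]))
```

The constant 1 in the denominator is the contribution of the base class, whose linear score is fixed at zero.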

In case of ordinal targets, one could estimate a cumulative logistic

regression as follows (Allison 2001):

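\[ P(Y \le 1) = \frac{1}{1 + e^{-\theta_1 + \beta_1 X_1 + \ldots + \beta_N X_N}} \]

\[ P(Y \le 2) = \frac{1}{1 + e^{-\theta_2 + \beta_1 X_1 + \ldots + \beta_N X_N}} \]

\[ \ldots \]

\[ P(Y \le K-1) = \frac{1}{1 + e^{-\theta_{K-1} + \beta_1 X_1 + \ldots + \beta_N X_N}} \]

or, equivalently, in terms of the odds:

\[ \frac{P(Y \le 1)}{1 - P(Y \le 1)} = e^{\theta_1 - \beta_1 X_1 - \ldots - \beta_N X_N} \]

\[ \frac{P(Y \le 2)}{1 - P(Y \le 2)} = e^{\theta_2 - \beta_1 X_1 - \ldots - \beta_N X_N} \]

\[ \ldots \]

\[ \frac{P(Y \le K-1)}{1 - P(Y \le K-1)} = e^{\theta_{K-1} - \beta_1 X_1 - \ldots - \beta_N X_N}. \]

Note that since $P(Y \le K) = 1$, $\theta_K = +\infty$.

The individual probabilities can then be obtained as follows:

\[ P(Y=1) = P(Y \le 1) \]

\[ P(Y=2) = P(Y \le 2) - P(Y \le 1) \]

\[ \ldots \]

\[ P(Y=K) = 1 - P(Y \le K-1). \]

Also for this model, the $\beta$ parameters can be estimated using a maximum likelihood procedure.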
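The step from cumulative to individual class probabilities can be sketched as follows, using $P(Y \le k) = 1/(1 + e^{-\theta_k + \beta_1 X_1 + \ldots + \beta_N X_N})$. The thresholds are assumed to be increasing and all parameter values are made up for illustration.

```python
import math

def cumulative_logit_probs(x, thetas, betas):
    """Class probabilities P(Y=1), ..., P(Y=K) from K-1 increasing thresholds."""
    lin = sum(b * xj for b, xj in zip(betas, x))
    # P(Y <= k) for k = 1..K-1, followed by P(Y <= K) = 1
    cum = [1.0 / (1.0 + math.exp(-t + lin)) for t in thetas] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

probs = cumulative_logit_probs([1.0], thetas=[-1.0, 0.0, 2.0], betas=[0.5])
print(probs, sum(probs))
```

Because all classes share the same $\beta$ coefficients and differ only in their threshold, increasing thresholds guarantee that every individual probability is positive.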

Multiclass Decision Trees

Decision trees can be easily extended to a multiclass setting. For the splitting decision, assuming $K$ classes, the impurity criteria become:

\[ \text{Entropy}(S) = -\sum_{k=1}^{K} p_k \log_2(p_k) \]

\[ \text{Gini}(S) = \sum_{k=1}^{K} p_k (1 - p_k). \]

The stopping decision can be made in a similar way as for binary target decision trees by using a training set for making the splitting decision, and an independent validation data set on which the misclassification error rate is monitored. The assignment decision then looks for the most prevalent class in each of the leaf nodes.
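Both multiclass impurity criteria are straightforward to compute from the class proportions in a node. A minimal sketch, assuming the proportions are passed as a list that sums to 1 (classes with zero proportion contribute nothing to the entropy):

```python
import math

def entropy(p):
    """-sum_k p_k log2(p_k), skipping empty classes."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

def gini(p):
    """sum_k p_k (1 - p_k)."""
    return sum(pk * (1 - pk) for pk in p)

print(entropy([0.25] * 4), gini([0.25] * 4))    # 2.0 0.75 (maximally impure, K = 4)
print(gini([1.0, 0.0, 0.0, 0.0]))               # 0.0 (pure node)
```

For $K$ equally likely classes the entropy reaches its maximum of $\log_2(K)$, so impurity values are only comparable between nodes of the same problem.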

Multiclass Neural Networks

A straightforward option for training a multiclass neural network for $K$ classes is to create $K$ output neurons, one for each class. An observation is then assigned to the output neuron with the highest activation value (winner-take-all learning). Another option is to use a softmax activation function (Bishop 1995).

Multiclass Support Vector Machines

A common practice to estimate a multiclass support vector machine is

to map the multiclass classiﬁcation problem to a set of binary classiﬁ-

cation problems. Two well-known schemes here are One-versus-One

and One-versus-All coding (Van Gestel et al. 2015).

For $K$ classes, One-versus-One coding estimates $K(K-1)/2$ binary SVM classifiers contrasting every possible pair of classes. Every classifier as such can cast a vote on the target class and the final classification is then the result of a (weighted) voting procedure. Ties are resolved arbitrarily. This is illustrated in Figure 4.32, whereby the aim is to classify the white triangle.

For $K$ classes, One-versus-All coding estimates $K$ binary SVM classifiers, each time contrasting one particular class against all the other ones. A classification decision can then be made by assigning a particular observation to the class for which one of the binary classifiers assigns the highest posterior probability. Ties are less likely to occur with this scheme. This is illustrated in Figure 4.33, whereby the aim is to classify the white triangle. Note that for more than three classes, One-versus-All coding estimates fewer classifiers than One-versus-One coding. However, in One-versus-One coding the binary classifiers are estimated on a reduced subset of the data (i.e., each time contrasting only two classes), whereas in One-versus-All coding, all binary classifiers are estimated on the entire data set (i.e., each time contrasting a class against all the rest).

Figure 4.32 One-Versus-One Coding for Multiclass Problems (three pairwise classifiers, a to c, each vote for one of two classes; the class with the most votes is assigned)

Figure 4.33 One-Versus-All Coding for Multiclass Problems (three class-versus-other classifiers with posterior probabilities a) 0.92, b) 0.18, and c) 0.30; the class with the highest probability is assigned)

Both One-versus-One and One-versus-All coding are meta schemes that can be used with other base classifiers (e.g., neural networks) as well.
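The two coding schemes differ only in how the binary outputs are combined. The sketch below assumes the binary classifiers already exist: `pairwise_winner` is a hypothetical callable returning the class preferred by one pairwise classifier, and the class names are placeholders.

```python
from collections import Counter
from itertools import combinations

def n_one_vs_one(K):
    """Number of binary classifiers needed by One-versus-One coding."""
    return K * (K - 1) // 2

def one_vs_one_vote(pairwise_winner, classes):
    """Unweighted vote over all K(K-1)/2 pairwise classifiers."""
    votes = Counter(pairwise_winner(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

def one_vs_all(probabilities):
    """Pick the class whose class-versus-rest model gives the highest posterior."""
    return max(probabilities, key=probabilities.get)

# Posterior probabilities as in Figure 4.33; class names are placeholders.
print(one_vs_all({"class_a": 0.92, "class_b": 0.18, "class_c": 0.30}))  # class_a
print(n_one_vs_one(3), n_one_vs_one(4))  # 3 6
```

For three classes both schemes need three classifiers; from four classes onward One-versus-One needs more (6 versus 4).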

EVALUATING PREDICTIVE MODELS

Splitting Up the Data Set

When evaluating predictive models, two key decisions need to be

made. A ﬁrst decision concerns the data set split up, which speciﬁes

on what part of the data the performance will be measured. A second

decision concerns the performance metric. In what follows, we will

elaborate on both.

The decision how to split up the data set for performance mea-

surement depends on its size. In case of large data sets (say, more

than 1,000 observations), the data can be split up into a training and

a test sample. The training sample (also called development or estima-

tion sample) will be used to build the model whereas the test sample

(also called the hold out sample) will be used to calculate its perfor-

mance (see Figure 4.34). A commonly applied split up is a 70 percent

training sample and a 30 percent test sample. There should be a strict

separation between training and test sample. No observation that was

used for model development, can be used for independent testing.

Note that in case of decision trees or neural networks, the validation sample is a separate sample since it is actively being used during model development (i.e., to make the stopping decision). A typical split-up in this case is a 40 percent training sample, 30 percent validation sample, and 30 percent test sample. A stratified split-up ensures that the fraudsters/nonfraudsters are equally distributed amongst the various samples.

Figure 4.34 Training Versus Test Sample Set Up for Performance Estimation (a model for P(Fraud | age, income, ...) is built on the train data and then applied to the test data to obtain fraud scores)

In case of small data sets (say, less than 1,000 observations), special schemes need to be adopted. A very popular scheme is cross-validation. In cross-validation, the data is split into $K$ folds (e.g., 5 or 10). An analytical model is then trained on $K - 1$ training folds and tested on the remaining validation fold. This is repeated for all possible validation folds resulting in $K$ performance estimates, which can then be averaged. Note that also a standard deviation and/or confidence interval can be calculated if desired. Common choices for $K$ are 5 and 10. In its most extreme case, $K$ equals the number of observations $n$ and cross-validation becomes leave-one-out cross-validation, whereby every observation is left out in turn and a model is estimated on the remaining $n - 1$ observations. This gives $n$ analytical models in total. In stratified cross-validation, special care is taken to make sure the no fraud/fraud odds are the same in each fold (see Figure 4.35).
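Stratified cross-validation can be sketched with index lists only. In this illustration, `evaluate` is a hypothetical callable that would train a model on the training indices and return its performance on the validation indices; the round-robin assignment per class is one simple way to keep the class mix similar across folds.

```python
import random

def stratified_folds(labels, K, seed=0):
    """Assign observation indices to K folds, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(K)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % K].append(i)
    return folds

def cross_validate(labels, K, evaluate):
    """Return the mean and the list of the K validation-fold scores."""
    folds = stratified_folds(labels, K)
    scores = []
    for k in range(K):
        train = [i for f in folds[:k] + folds[k + 1:] for i in f]
        scores.append(evaluate(train, folds[k]))
    return sum(scores) / K, scores

labels = [1] * 5 + [0] * 15          # 5 fraudsters, 15 nonfraudsters
folds = stratified_folds(labels, 5)
print([sum(labels[i] for i in f) for f in folds])  # one fraudster per fold
```

Without stratification, a small data set with few fraud cases could easily produce folds that contain no fraudsters at all.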

A key question to answer when doing cross-validation is what should be the final model that is being outputted from the procedure. Since cross-validation gives multiple models, this is not an obvious question. Of course, one could let all models collaborate in an ensemble setup by using a (weighted) voting procedure. A more pragmatic answer would be to do leave-one-out cross-validation and pick one of the models at random. Since the models differ up to one observation only, they will be quite similar anyway. Alternatively, one may also choose to build one final model on all observations but report the performance coming out of the cross-validation procedure as the best independent estimate.

Figure 4.35 Cross-Validation for Performance Measurement (in each run, one validation fold and the remaining training folds)

For small samples, one may also adopt bootstrapping procedures (Efron 1979). In bootstrapping, one takes samples with replacement from a data set $D$ (see Figure 4.36).

The probability that a customer is sampled in a single draw equals $1/n$, with $n$ the number of observations in the data set. Hence, the probability that a customer is not sampled equals $1 - 1/n$. Assuming a bootstrap with $n$ samples, the fraction of customers that is not sampled equals:

\[ \left(1 - \frac{1}{n}\right)^n. \]

We then have:

\[ \lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = e^{-1} = 0.368, \]

where the approximation already works well for small values of $n$. So, 0.368 is the probability that a customer does not appear in the sample and 0.632 the probability that a customer does appear. If we then take the bootstrap sample as the training set, and the test set as all samples in $D$ but not in the bootstrap, we can calculate the performance as follows:

\[ \text{Error estimate} = 0.368 \times \text{Error(Training)} + 0.632 \times \text{Error(Test)}, \]

whereby obviously a higher weight is being put on the test set performance.
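The convergence of $(1 - 1/n)^n$ to $e^{-1}$ and the resulting weighted error estimate are easy to check numerically. A small sketch with illustrative function names and made-up error values:

```python
import math

def prob_not_sampled(n):
    """Probability that a given customer is absent from a bootstrap of size n."""
    return (1 - 1 / n) ** n

def error_632(train_error, test_error):
    """Weighted combination of training and test error from the text."""
    return 0.368 * train_error + 0.632 * test_error

for n in (5, 50, 500):
    print(n, round(prob_not_sampled(n), 4), round(math.exp(-1), 4))

print(error_632(train_error=0.10, test_error=0.20))
```

Even for a handful of observations the probability is already in the neighborhood of 0.368, which is what the text means by the approximation working well for small $n$.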

Figure 4.36 Bootstrapping (two bootstrap samples, Bootstrap 1 and Bootstrap 2, drawn with replacement from customers C1 to C5)

Performance Measures for Classiﬁcation Models

Consider the following fraud detection example for a ﬁve-customer

data set. The ﬁrst column in Table 4.4 depicts the fraud status, whereas

the second column the fraud score as it comes from a logistic regression,

decision tree, neural network, and so on.

One can now map the scores to a predicted classiﬁcation label by

assuming a default cut-off of 0.5 as shown in Figure 4.37.

A confusion matrix can now be calculated as shown in Table 4.5.

Based on this matrix, one can now calculate the following perfor-

mance measures:

◾ Classification accuracy = (TP + TN)/(TP + FP + FN + TN) = 3/5

◾ Classification error = (FP + FN)/(TP + FP + FN + TN) = 2/5

◾ Sensitivity = Recall = Hit rate = TP/(TP + FN) = 1/2

◾ Specificity = TN/(FP + TN) = 2/3

◾ Precision = TP/(TP + FP) = 1/2

◾ F-measure = 2 × (Precision × Recall)/(Precision + Recall) = 1/2

Table 4.4 Example Data Set for Performance Calculation

Customer   Fraud   Fraud Score
John       Yes     0.72
Sophie     No      0.56
David      Yes     0.44
Emma       No      0.18
Bob        No      0.36

Figure 4.37 Calculating Predictions Using a Cut-Off (cut-off = 0.50)

Customer   Fraud   Fraud Score   Predicted
John       Yes     0.72          Yes
Sophie     No      0.56          Yes
David      Yes     0.44          No
Emma       No      0.18          No
Bob        No      0.36          No

Table 4.5 Confusion Matrix

                                 Actual: Positive (Fraud)   Actual: Negative (No Fraud)
Predicted: Positive (Fraud)      True Positive (John)       False Positive (Sophie)
Predicted: Negative (No Fraud)   False Negative (David)     True Negative (Emma, Bob)

The classification accuracy is the percentage of correctly classified observations. The classification error is the complement thereof and also referred to as the misclassification rate. The sensitivity, recall, or hit rate measures how many of the fraudsters are correctly labeled by the model as a fraudster. The specificity looks at how many of the nonfraudsters are correctly labeled by the model as nonfraudster. The precision indicates how many of the predicted fraudsters are actually fraudsters.
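The measures can be reproduced from the five-customer example in Table 4.4. The sketch below assumes that a score greater than or equal to the cut-off is labeled as fraud, which is an assumption about the boundary case that does not affect this example.

```python
# Table 4.4: (customer, actual fraud status, fraud score)
data = [("John", 1, 0.72), ("Sophie", 0, 0.56), ("David", 1, 0.44),
        ("Emma", 0, 0.18), ("Bob", 0, 0.36)]

def confusion(data, cutoff=0.5):
    """Return (TP, FP, FN, TN) for the given cut-off."""
    tp = sum(1 for _, y, s in data if y == 1 and s >= cutoff)
    fp = sum(1 for _, y, s in data if y == 0 and s >= cutoff)
    fn = sum(1 for _, y, s in data if y == 1 and s < cutoff)
    tn = sum(1 for _, y, s in data if y == 0 and s < cutoff)
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion(data)
accuracy = (tp + tn) / (tp + fp + fn + tn)
sensitivity = tp / (tp + fn)
specificity = tn / (fp + tn)
precision = tp / (tp + fp)
f_measure = 2 * precision * sensitivity / (precision + sensitivity)
print(accuracy, sensitivity, specificity, precision, f_measure)
```

Calling `confusion(data, cutoff=0.0)` labels everyone as fraud and gives (2, 3, 0, 0), which shows how strongly these measures depend on the chosen cut-off.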

Note that all these classiﬁcation measures depend on the cut-off.

For example, for a cut-off of 0 (1), the classiﬁcation accuracy becomes

40 percent (60 percent), the error 60 percent (40 percent), the sen-

sitivity 100 percent (0), the speciﬁcity 0 (100 percent), the precision

40 percent (0) and the F-measure 0.57 (0). Given this dependence, it

would be nice to have a performance measure that is independent from

the cut-off. One could construct a table with the sensitivity, speciﬁcity,

and 1-speciﬁcity for various cut-offs as shown in Table 4.6.

The receiver operating characteristic (ROC) curve then plots the

sensitivity versus 1-speciﬁcity as illustrated in Figure 4.38 (Fawcett

2003).

Table 4.6 Table for ROC Analysis

Cut-off   Sensitivity   Specificity   1-Specificity
0         1             0             1
0.01
0.02
....
0.99
1         0             1             0

Figure 4.38 The Receiver Operating Characteristic Curve (sensitivity versus 1 − specificity for Model A, Model B, and a random model)

Note that a perfect model detects all the fraudsters and nonfraudsters at the same time, which results in a sensitivity of 1 and a specificity of 1, and is thus represented by the upper-left corner. The closer the

curve approaches this point, the better the performance. In Figure 4.38,

model A has a better performance than model B. A problem, however,

arises if the curves intersect. In this case, one can calculate the area

under the ROC curve (AUC) as a performance metric. The AUC provides

a simple ﬁgure-of-merit for the performance of the constructed classi-

ﬁer. The higher the AUC, the better the performance. The AUC is always

bounded between 0 and 1 and can be interpreted as a probability. In

fact, it represents the probability that a randomly chosen fraudster gets a

higher score than a randomly chosen nonfraudster (DeLong et al. 1988;

Hanley and McNeil 1982). Note that the diagonal represents a random

scorecard whereby sensitivity equals 1-speciﬁcity for all cut-off points.

Hence, a good classiﬁer should have an ROC above the diagonal and

AUC bigger than 50 percent.
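The probabilistic interpretation of the AUC suggests a direct way to compute it: compare every fraudster with every nonfraudster and count how often the fraudster has the higher score. The sketch below does exactly that for the scores of Table 4.4; counting ties as one half is a common convention and an assumption here.

```python
def auc(fraud_scores, nonfraud_scores):
    """Fraction of (fraudster, nonfraudster) pairs ranked correctly."""
    wins = 0.0
    for f in fraud_scores:
        for nf in nonfraud_scores:
            wins += 1.0 if f > nf else 0.5 if f == nf else 0.0
    return wins / (len(fraud_scores) * len(nonfraud_scores))

# Scores from Table 4.4: John and David are fraudsters.
print(auc([0.72, 0.44], [0.56, 0.18, 0.36]))  # 5 of the 6 pairs are ordered correctly
```

Only the pair (David, Sophie) is ranked the wrong way round, so the AUC equals 5/6, well above the 0.5 of a random scorecard.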

A lift curve is another important performance metric. It starts by sorting the population from high score to low score. Suppose now that in the top 10 percent highest scores, there are 60 percent fraudsters whereas the total population has 10 percent fraudsters. The lift value in the top decile then becomes 60 percent/10 percent, or 6. In other words, the lift value represents the cumulative percentage of fraudsters per decile, divided by the overall population percentage of fraudsters. Using no model, or a random sorting, the fraudsters would be equally spread across the entire range and the lift value would always equal 1. For a model that ranks well, the lift curve decreases as one considers bigger deciles, until it reaches 1. This is illustrated in Figure 4.39. Note that a lift curve can also be expressed in a noncumulative way, and is also often summarized as the top decile lift.

Figure 4.39 Lift Curve (lift of the model and of the baseline versus the percentage of sorted population)
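The cumulative lift at a given depth can be computed as sketched below. The synthetic data are constructed, as an assumption for illustration, to match the example in the text: 100 customers, 10 fraudsters overall, 6 of them among the 10 highest scores.

```python
def cumulative_lift(scores, labels, fraction):
    """Fraud rate in the top `fraction` of scores divided by the overall fraud rate."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: -t[0])]
    top = ranked[:max(1, round(fraction * len(ranked)))]
    return (sum(top) / len(top)) / (sum(ranked) / len(ranked))

scores = [100 - i for i in range(100)]            # already sorted high to low
labels = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 86   # 1 = fraud
print(cumulative_lift(scores, labels, 0.10))      # top decile lift, about 6
print(cumulative_lift(scores, labels, 1.00))      # whole population: 1.0
```

At full depth the selected group is the whole population, so the lift necessarily ends at 1.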

The cumulative accuracy profile (CAP), Lorenz, or Power curve is very closely related to the lift curve (see Figure 4.40). It also starts by sorting the population from high score to low score and then measures the cumulative percentage of fraudsters for each decile on the Y-axis. The perfect model gives a linearly increasing curve up to the sample fraud rate and then flattens out. The diagonal again represents the random model.

Figure 4.40 Cumulative Accuracy Profile (percentage of fraudsters versus percentage of sorted population)

Sorted population   0    10%    20%    30%    40%    50%    60%    70%    80%    90%    100%
Scorecard           0%   30%    50%    65%    78%    85%    90%    95%    97%    99%    100%
Random model        0%   10%    20%    30%    40%    50%    60%    70%    80%    90%    100%
Perfect model       0%   100%   100%   100%   100%   100%   100%   100%   100%   100%   100%

The CAP curve can be summarized in an accuracy ratio (AR) as depicted in Figure 4.41. The accuracy ratio is then defined as follows:

\[ AR = \frac{\text{Area below power curve for current model} - \text{Area below power curve for random model}}{\text{Area below power curve for perfect model} - \text{Area below power curve for random model}} \]

Figure 4.41 Calculating the Accuracy Ratio (curves for the perfect model and the current model; AR = B/(A + B))

A perfect model will thus have an AR of 1 and a random model an AR of 0. Note that the accuracy ratio is also often referred to as the Gini coefficient. There is also a linear relation between the AR and the AUC as follows: AR = 2 × AUC − 1.
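The accuracy ratio can be approximated from the decile values tabulated in Figure 4.40. The sketch below uses the trapezoidal rule on the ten-percent grid, which is an assumption: it is only as precise as the grid, in particular for the steep first segment of the perfect model.

```python
def area(curve, step=0.1):
    """Trapezoidal area under a curve sampled at equally spaced population fractions."""
    return sum(step * (a + b) / 2 for a, b in zip(curve, curve[1:]))

# Cumulative fraction of fraudsters at 0%, 10%, ..., 100% of the sorted population
scorecard = [0, .30, .50, .65, .78, .85, .90, .95, .97, .99, 1.0]
random_model = [i / 10 for i in range(11)]
perfect = [0] + [1.0] * 10

ar = (area(scorecard) - area(random_model)) / (area(perfect) - area(random_model))
print(round(ar, 3))
```

The numerator is the area between the scorecard and the diagonal, and the denominator the area between the perfect model and the diagonal, matching the definition of the accuracy ratio.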

The Kolmogorov-Smirnov distance is a separation measure calculating the maximum distance between the cumulative score distributions of the nonfraudsters $P(s \mid NF)$ and fraudsters $P(s \mid F)$, defined as follows:

\[ P(s \mid F) = \sum_{x \le s} p(x \mid F) \]

\[ P(s \mid NF) = \sum_{x \le s} p(x \mid NF). \]

Note that by definition $P(s \mid F)$ equals 1 − Sensitivity, and $P(s \mid NF)$ equals the specificity. Hence, it can easily be verified that the KS distance can also be measured on an ROC graph. In fact, it is equal to the maximum vertical distance between the ROC curve and the diagonal. The KS statistic ranges between 0 and 1. If there exists a cut-off $s$ such that all fraudsters have a score higher than $s$ and all nonfraudsters a score lower than $s$, then perfect separation is achieved and the KS-statistic will equal 1.

Figure 4.42 The Kolmogorov-Smirnov Statistic (cumulative distributions $P(s \mid NF)$ and $P(s \mid F)$ plotted against the score; the KS distance is the maximum gap between them)
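The KS distance can be computed by evaluating both empirical cumulative distributions at every observed score. A minimal sketch, using the scores of Table 4.4 as example input:

```python
def ks_distance(fraud_scores, nonfraud_scores):
    """Maximum gap between the cumulative score distributions P(s|NF) and P(s|F)."""
    best = 0.0
    for s in sorted(set(fraud_scores) | set(nonfraud_scores)):
        cdf_f = sum(x <= s for x in fraud_scores) / len(fraud_scores)
        cdf_nf = sum(x <= s for x in nonfraud_scores) / len(nonfraud_scores)
        best = max(best, abs(cdf_nf - cdf_f))
    return best

print(ks_distance([0.72, 0.44], [0.56, 0.18, 0.36]))  # gap of 2/3 at s = 0.36
print(ks_distance([0.8, 0.9], [0.1, 0.2]))            # perfect separation: 1.0
```

In the first example, two of the three nonfraudsters but none of the fraudsters score at or below 0.36, which is where the gap is widest.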

Another performance measure is the Mahalanobis distance $M$ between the score distributions, defined as follows:

\[ M = \frac{|\mu_F - \mu_{NF}|}{\sigma}, \]

where $\mu_{NF}$ ($\mu_F$) represents the mean score of the nonfraudsters (fraudsters) and $\sigma$ the pooled standard deviation. Obviously, a high Mahalanobis distance is preferred since it means both score distributions are well separated. Closely related is the divergence metric $D$, calculated as follows:

\[ D = \frac{(\mu_{NF} - \mu_F)^2}{2(\sigma_{NF}^2 + \sigma_F^2)}. \]

For both $M$ and $D$ the minimum value is zero and there is no theoretical upper bound.

The Brier score (BS) measures the quality of the fraud probability estimates as follows:

\[ BS = \frac{1}{n}\sum_{i=1}^{n} (P_{F_i} - \theta_i)^2, \]

where $P_{F_i}$ is the probability of fraud for observation $i$, and $\theta_i$ a binary indicator ($\theta_i$ equals 1 if fraud; 0 otherwise). The Brier score is always bounded between 0 and 1 and lower values indicate better discrimination ability.
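The Brier score is the mean squared difference between the predicted fraud probabilities and the observed 0/1 outcomes; averaging over the $n$ observations is what keeps it between 0 and 1. A short sketch, using the scores of Table 4.4 as if they were probability estimates (an assumption made for illustration):

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    n = len(outcomes)
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / n

# John, Sophie, David, Emma, Bob
print(brier_score([0.72, 0.56, 0.44, 0.18, 0.36], [1, 0, 1, 0, 0]))
```

Sophie (a nonfraudster scored 0.56) and David (a fraudster scored 0.44) contribute most to the total, since their estimates are furthest from the observed outcome.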

In case of multiclass targets, other performance measures need to be adopted. Assume we have developed an analytical fraud detection model with four classes: A, B, C, and D. A first performance measure is the multiclass confusion matrix, which contrasts the predicted classes versus the actual classes as depicted in Table 4.7.

Table 4.7 Multiclass Confusion Matrix

                     Actual class
                     A     B     C     D
Predicted class  A   50    2     1     1
                 B   3     20    2     1
                 C   1     2     10    0
                 D   1     0     2     4

The on-diagonal elements correspond to the correct classifications. Off-diagonal elements represent errors. For Table 4.7, the classification accuracy becomes (50 + 20 + 10 + 4)/100, or 84 percent, and thus the classification error equals 16 percent. Note that the sensitivity and specificity are no longer uniquely defined and need to be considered for each class separately. For example, for class A, 50 out of the 55 observations are correctly classified, or 91 percent. Also the precision needs to be considered for each class individually. For example, for the 26 class B predictions, 20 are correct, or thus 77 percent.
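The figures quoted for Table 4.7 can be reproduced with a few list operations. In this sketch the matrix is stored with predicted classes as rows and actual classes as columns, as in the table; the variable names are illustrative.

```python
# Table 4.7: rows are predicted classes, columns are actual classes (A, B, C, D)
M = [[50, 2, 1, 1],
     [3, 20, 2, 1],
     [1, 2, 10, 0],
     [1, 0, 2, 4]]

total = sum(map(sum, M))
accuracy = sum(M[i][i] for i in range(4)) / total
# Per actual class: correct predictions divided by the column total
recall = [M[k][k] / sum(M[i][k] for i in range(4)) for k in range(4)]
# Per predicted class: correct predictions divided by the row total
precision = [M[k][k] / sum(M[k]) for k in range(4)]

print(accuracy)                 # 0.84
print(round(recall[0], 2))      # class A: 50 / 55
print(round(precision[1], 2))   # class B: 20 / 26
```

Column totals give the number of actual observations per class, while row totals give the number of predictions per class, which is why the two measures use different denominators.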

Assume now that the ratings are ordinal, A =severe fraud, B =

medium fraud, C =light fraud, D =no fraud. In this case, not all errors

have equal impact. Given the ordinal nature of the target variable, the

further away from the diagonal, the bigger the impact of the error. For

example, when target class A is predicted as B, this is a less severe error

than when target class A is predicted as D. One could summarize this

in a notch difference graph, which is a bar chart depicting the cumula-

tive accuracy for increasing notch differences. For our example, at the

0 notch difference level the cumulative accuracy equals 84 percent, at

the one-notch difference level 95 percent, at the two-notch difference

level 98 percent, and at the three-notch difference level 100 percent.

Figure 4.43 gives the corresponding notch difference graph. Obviously, the cumulative accuracy never decreases as the notch level increases and eventually reaches 100 percent.
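The cumulative accuracies per notch level can be derived directly from the confusion matrix by grouping its cells by their distance from the diagonal. The sketch below does this for the Table 4.7 matrix; the function name `cumulative_notch_accuracy` is hypothetical.

```python
# Table 4.7 again: rows are predicted classes, columns are actual classes,
# both in the ordinal order A (severe) to D (no fraud).
confusion = [
    [50, 2, 1, 1],
    [3, 20, 2, 1],
    [1, 2, 10, 0],
    [1, 0, 2, 4],
]


def cumulative_notch_accuracy(matrix):
    """Cumulative share of observations within 0, 1, 2, ... notches of the diagonal."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    per_notch = [0] * n
    for i in range(n):
        for j in range(n):
            per_notch[abs(i - j)] += matrix[i][j]
    cumulative, running = [], 0
    for count in per_notch:
        running += count
        cumulative.append(running / total)
    return cumulative


notch_accuracy = cumulative_notch_accuracy(confusion)
for notch, acc in enumerate(notch_accuracy):
    print(f"{notch}-notch level: {acc:.0%}")
```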

Figure 4.43 A Cumulative Notch Difference Graph (bar chart; x-axis: cumulative notch difference, 0 to 3 notches; y-axis from 75% to 100%)

Although ROC analysis was originally introduced in the context of binary classification, several multiclass extensions have been developed in the literature, in line with the One-versus-All or One-versus-One setups of casting a multiclass problem to several binary ones, as discussed earlier (Hand and Till 2001). A first approach is to generate an ROC curve using each class in turn as the positive class and merging all other classes into one negative class (similar to One-versus-All coding). The multiclass AUC, $AUC_m$, can then be computed as the sum of the binary AUCs weighted by the class distribution as follows:

$$AUC_m = \sum_{i=1}^{m} AUC(c_i)\, p(c_i),$$

where $m$ is the number of classes, $AUC(c_i)$ the AUC obtained from considering class $c_i$ as the reference class, and $p(c_i)$ the prior probability of class $c_i$. Another approach is based on the pairwise discriminability of classes and computes the multiclass AUC as follows (similar to One-versus-One coding):

$$AUC_m = \frac{2}{m(m-1)} \sum_{i<j} AUC(c_i, c_j),$$

where $AUC(c_i, c_j)$ is the area under the two-class ROC curve involving classes $c_i$ and $c_j$. The summation is averaged over all $m(m-1)/2$ possible pairs of classes.
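Both multiclass AUC variants can be sketched in plain Python. The binary AUC is computed here as the share of (positive, negative) pairs that are ranked correctly, which equals the area under the ROC curve. Two points are assumptions rather than statements from the text: each observation is assumed to carry one score per class, and the pairwise $AUC(c_i, c_j)$ is taken as the average of the two directions, one common choice following Hand and Till. The six scored observations are hypothetical.

```python
from itertools import combinations


def binary_auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def auc_one_vs_all(labels, scores, classes):
    """Sum of per-class AUCs weighted by the class priors p(c_i)."""
    total = 0.0
    for c in classes:
        pos = [s[c] for s, y in zip(scores, labels) if y == c]
        neg = [s[c] for s, y in zip(scores, labels) if y != c]
        total += binary_auc(pos, neg) * len(pos) / len(labels)
    return total


def auc_one_vs_one(labels, scores, classes):
    """Average pairwise AUC over all m(m-1)/2 class pairs."""
    pair_aucs = []
    for ci, cj in combinations(classes, 2):
        in_i = [s for s, y in zip(scores, labels) if y == ci]
        in_j = [s for s, y in zip(scores, labels) if y == cj]
        # Assumption: AUC(ci, cj) averages the two directions.
        a_ij = binary_auc([s[ci] for s in in_i], [s[ci] for s in in_j])
        a_ji = binary_auc([s[cj] for s in in_j], [s[cj] for s in in_i])
        pair_aucs.append((a_ij + a_ji) / 2)
    return sum(pair_aucs) / len(pair_aucs)


# Hypothetical data: true class and per-class scores for six observations.
classes = ["A", "B", "C"]
labels = ["A", "A", "B", "B", "C", "C"]
scores = [
    {"A": 0.8, "B": 0.1, "C": 0.1},
    {"A": 0.7, "B": 0.2, "C": 0.1},
    {"A": 0.2, "B": 0.7, "C": 0.1},
    {"A": 0.1, "B": 0.3, "C": 0.6},  # a class B case that scores high on C
    {"A": 0.1, "B": 0.1, "C": 0.8},
    {"A": 0.3, "B": 0.2, "C": 0.5},
]
ova = auc_one_vs_all(labels, scores, classes)
ovo = auc_one_vs_one(labels, scores, classes)
print(f"One-versus-All AUC: {ova:.4f}, One-versus-One AUC: {ovo:.4f}")
```

The pairwise loop is quadratic in the number of observations, which is acceptable for a sketch but not for large data sets.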

Note that in case of ordinal targets, also rank order statistics such

as Spearman’s rank order correlation, Kendall’s tau, and Goodman-

Kruskal’s gamma can be computed, as we will discuss later.


Performance Measures for Regression Models

A first way to evaluate the predictive performance of a regression model is by visualizing the predicted target against the actual target using a scatter plot (see Figure 4.44). The more the plot approaches a straight line through the origin, the better the performance of the regression model. It can be summarized by calculating the Pearson correlation coefficient as follows:

$$corr(\hat{y}, y) = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}},$$

where $\hat{y}_i$ represents the predicted value for observation $i$, $\bar{\hat{y}}$ the average of the predicted values, $y_i$ the actual value for observation $i$, and $\bar{y}$ the average of the actual values. The Pearson correlation always varies between –1 and +1. Values closer to +1 indicate better agreement and thus better fit between the predicted and actual values of the target variable.

Figure 4.44 Scatter Plot: Predicted Fraud Versus Actual Fraud (axes labeled Actual Fraud and Predicted Fraud)
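A minimal sketch of the Pearson correlation between predicted and actual values, following the formula above. The function name `pearson` and the small example vectors are hypothetical.

```python
import math


def pearson(predicted, actual):
    """Pearson correlation between predicted and actual target values."""
    n = len(predicted)
    mean_pred = sum(predicted) / n
    mean_act = sum(actual) / n
    cov = sum((p - mean_pred) * (a - mean_act) for p, a in zip(predicted, actual))
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    ss_act = sum((a - mean_act) ** 2 for a in actual)
    if ss_pred == 0 or ss_act == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (math.sqrt(ss_pred) * math.sqrt(ss_act))


# Hypothetical fraud amounts: the predictions are exactly half the actual values.
actual = [2.0, 4.0, 6.0]
predicted = [1.0, 2.0, 3.0]
print(f"correlation = {pearson(predicted, actual):.3f}")
```

In this example the correlation is 1 even though every prediction is off by a factor of two, which is why the correlation is best read together with error-based measures such as the MSE.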


Another key performance metric is the coefficient of determination or $R^2$, defined as follows:

$$R^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}.$$

The $R^2$ typically varies between 0 and 1, and higher values are to be preferred. Basically, this measure tells us how much better we can predict by using the analytical model to compute $\hat{y}_i$ than by using the mean $\bar{y}$ as predictor. To compensate for the number of variables in the model, an adjusted $R^2$, $R^2_{adj}$, has been suggested as follows:

$$R^2_{adj} = 1 - \frac{n-1}{n-p-1}(1 - R^2) = 1 - \frac{n-1}{n-p-1}\,\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2},$$

where $p$ represents the number of variables in the model. Note that when the $R^2$ is measured on a test set, negative values are possible if the average value of the target on the training set differs from the average value of the target on the test set.
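The two measures can be sketched as follows, using the residual-based form $1 - SS_{res}/SS_{tot}$ of the definition. The function names and the five observations are hypothetical; the last call illustrates how badly off-target predictions yield a negative value.

```python
def r_squared(actual, predicted):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(actual, predicted, p):
    """Adjusted R^2 for a model with p variables fitted on n observations."""
    n = len(actual)
    if n - p - 1 <= 0:
        raise ValueError("need more observations than variables plus one")
    return 1 - (n - 1) / (n - p - 1) * (1 - r_squared(actual, predicted))


# Hypothetical actual and predicted fraud amounts.
actual = [10.0, 20.0, 30.0, 40.0, 50.0]
predicted = [12.0, 18.0, 33.0, 37.0, 50.0]
print(f"R2 = {r_squared(actual, predicted):.3f}")
print(f"adjusted R2 (p = 1) = {adjusted_r_squared(actual, predicted, 1):.3f}")

# Predictions far from the actual level give a negative R2.
print(f"R2 of a poor predictor = {r_squared(actual, [100.0] * 5):.1f}")
```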

Two other popular measures are the mean squared error (MSE) and mean absolute deviation (MAD) defined as follows:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad MAD = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|.$$

A perfect model would have an MSE and MAD of 0. Higher values for both MSE and MAD indicate worse performance. Note that the MSE is sometimes also reported as the root mean squared error (RMSE), whereby $RMSE = \sqrt{MSE}$.

Also, CAP curves can be used to visualize and calculate the performance of regression models. Just as in classification, the X-axis represents the percentage of the sorted population, but now sorted based on the outcome of the regression model. For the Y-axis, a binary variable needs to be defined. One option is to create a binary outcome representing whether the observed fraud amount is lower than the average fraud amount. This is illustrated in Figure 4.45. The corresponding CAP plot and accuracy ratio will then indicate how much better the analytical model predicts than the average. Another option would be to create a binary variable indicating whether the observed fraud amount is smaller than the 25th (or larger than the 75th) percentile of the distribution, in which case the CAP curve and corresponding accuracy ratio will indicate how much better the analytical model predicts small (high) fraud amounts.

Figure 4.45 CAP Curve for Continuous Targets (x-axis: percentage of sorted population; y-axis: percentage of fraud amounts lower than average; curves for the model, a random model, and a perfect model)

Another popular visual representation is the regression error characteristic (REC) curve (Bi and Bennett 2003). This is a regression variant of the ROC curve in classification and plots the error tolerance on the X-axis versus the percentage of points predicted within the tolerance on the Y-axis. The resulting curve estimates the cumulative distribution function of the error. The error on the X-axis can be defined as the squared error $(y_i - \hat{y}_i)^2$ or the absolute deviation $|y_i - \hat{y}_i|$. Just as with the ROC curve, the perfect model is situated in the upper-left corner. Hence, the quicker the curve approaches this point, the better the model. The area above the curve then represents an overall error measure, which should preferably be as small as possible. As an example, consider the data represented in Table 4.8.

The corresponding REC curve is depicted in Figure 4.46.

Table 4.8 Data for REC Curve

| Tolerance (X) | Correct Predictions (Cumulative) | Cumulative Accuracy (Y) |
| --- | --- | --- |
| 0 | 1 | 10% |
| 0.05 | 4 | 40% |
| 0.1 | 7 | 70% |
| 0.2 | 9 | 90% |
| 0.5 | 10 | 100% |

Figure 4.46 Regression Error Characteristic (REC) Curve (x-axis: Tolerance (MAD), 0 to 0.5; y-axis: Cumulative Accuracy, 10% to 100%)
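The points of an REC curve can be sketched as the share of observations whose error does not exceed each tolerance. The ten absolute errors below are hypothetical; they were chosen only so that the resulting counts match Table 4.8, since the underlying observations are not given in the text.

```python
def rec_curve(errors, tolerances):
    """Return (tolerance, cumulative accuracy) pairs for the given per-observation errors."""
    n = len(errors)
    return [(tol, sum(1 for e in errors if e <= tol) / n) for tol in tolerances]


# Hypothetical absolute deviations |y_i - y_hat_i| for ten observations.
abs_errors = [0.0, 0.02, 0.03, 0.05, 0.06, 0.08, 0.1, 0.15, 0.2, 0.5]
tolerances = [0, 0.05, 0.1, 0.2, 0.5]

curve = rec_curve(abs_errors, tolerances)
for tol, acc in curve:
    print(f"tolerance {tol}: cumulative accuracy {acc:.0%}")
```

Passing squared errors instead of absolute deviations gives the squared-error variant of the curve.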

OTHER PERFORMANCE MEASURES FOR PREDICTIVE ANALYTICAL MODELS

As already mentioned in Chapter 1, statistical performance is just one

aspect of model performance. Other important criteria are comprehen-

sibility, justiﬁability, and operational efﬁciency.

Although comprehensibility is subjective and depends on the

background and experience of the fraud analyst, linear and logistic

regression, and decision trees are commonly referred to as white-box,

comprehensible techniques. Neural networks, SVMs, and ensemble

methods are essentially opaque models and thus much harder to

understand. However, in fraud settings where statistical performance matters more than interpretability, they are the method of choice.


Remember, justiﬁability goes one step further and veriﬁes to what

extent the relationships modeled are in line with prior business knowl-

edge and/or expectations. In a practical setting, this often boils down

to verifying the univariate impact of a variable on the model’s out-

put. For example, for a linear/logistic regression model, the signs of

the regression coefﬁcients will be veriﬁed.

Finally, the operational efﬁciency can also be an important

evaluation criterion to consider when selecting the optimal analytical

model. Operational efﬁciency represents the ease with which one can

implement, use, and monitor the ﬁnal model. For example, in a (near)

real-time fraud environment, it is important to be able to quickly

evaluate the fraud model. With regards to implementation, rule-based

models excel since implementing rules can be done very easily, even in

spreadsheet software. Linear models are also quite easy to implement

whereas nonlinear models are much more difﬁcult to implement, due

to the complex transformations that are being used by the model.

DEVELOPING PREDICTIVE MODELS FOR SKEWED DATA SETS

Fraud-detection data sets often have a very skewed target class distribution whereby typically only about 1 percent or even less of the

transactions are fraudulent. Obviously, this creates problems for the

analytical techniques discussed earlier since they are being ﬂooded by

all nonfraudulent observations and will thus tend toward classifying

every observation as nonfraudulent. Think about decision trees, for

example. If they start from a data set with 99 percent/1 percent

nonfraudulent/fraudulent observations, then the entropy is already

very low and, hence, it is very likely that the decision tree does not

ﬁnd any useful split and classiﬁes all observations as nonfraudulent,

hereby achieving a classiﬁcation accuracy of 99 percent, but essentially

detecting none of the fraudsters. It is thus recommended to increase

the number of fraudulent observations or their weight, such that

the analytical techniques can pay better attention to them. Various

procedures are possible to do this and will be outlined in what

follows.


Varying the Sample Window

A ﬁrst way to increase the number of fraudsters is by increasing the

time horizon for prediction. For example, instead of predicting fraud

with a six-month forward-looking time horizon, a 12-month time

horizon can be adopted. This is likely to add more fraudsters to the

sample and thus enable the analytical techniques to ﬁnd a meaningful

discrimination. Another approach works by sampling every fraudster

twice (or more) as depicted in Figure 4.47. Let’s assume we predict

fraud with a one-year forward-looking time horizon using information from a one-year backward-looking time horizon. By shifting the

observation point earlier or later, the same fraudulent observation can

be sampled twice. Obviously, the variables collected will be similar

but not perfectly the same, since they are measured on a different

(although overlapping) time frame. This added variability can then

come in handy for the analytical techniques to better discriminate

between the fraudsters and nonfraudsters. Note that depending on the

skewness of the target, multiple observation points can be considered

such that the number of fraudsters is multiplied by 2, 3, 4, and so on. Finding the optimal number is a matter of trial and error.

Figure 4.47 Varying the Time Window to Deal with Skewed Data Sets (two timelines, each marking an observation point, a one-year window, and the fraud event)

Undersampling and Oversampling

Another way to increase the weight of the fraudsters is by either oversampling them or by undersampling the nonfraudsters. Oversampling is illustrated in Figure 4.48. Here, the idea is to replicate the fraudsters two or more times so as to make the distribution less skewed. In our example, observations 1 and 4, both fraudsters, have been replicated so as to create an equally balanced training sample having the same number of fraudsters and nonfraudsters.

Figure 4.48 Oversampling the Fraudsters (original data versus oversampled data, each split into train and test sets; observations 1, 4, and 9 are fraud, the other seven are no fraud)

Undersampling is illustrated in Figure 4.49. Here, observations

number 2 and 5, which are both nonfraudulent, have been left out

so as to create an equally balanced training sample. The undersam-

pling can be done based on business experience whereby obviously

legitimate observations are removed. Also, low-value transactions or

inactive accounts can be considered for removal.

Figure 4.49 Undersampling the Nonfraudsters (original data versus undersampled data, each split into train and test sets; the undersampled data omit observations 2 and 5)

Under- and oversampling can also be combined. In the literature, it

has been shown that undersampling usually results in better classiﬁers

than oversampling (Chawla et al. 2002).

It is very important to note that both oversampling and undersam-

pling should be conducted on the training data and not on the test

data. Remember, the latter should remain untouched during model

development in order to give an unbiased view on model performance.

A practical question concerns the optimal nonfraud/fraud odds, which

should be aimed for by doing under- or oversampling. Although work-

ing toward a balanced sample with the same number of fraudsters

and nonfraudsters seems attractive, it severely biases the probabili-

ties, which will be output by the analytical technique. Hence, it is

recommended to stay as close as possible to the original class distribu-

tion to avoid unnecessary bias. One practical approach to determine

the optimal class distribution works as follows. In the ﬁrst step, an

analytical model is built on the original data set with the skew class

distribution (e.g., 95%/5% nonfraudsters/fraudsters). The AUC of this

model is recorded (possibly on an independent validation data set). In

a next step, over- or undersampling is used to shift the class distribution by 5 percentage points (e.g., 90%/10%). Again, the AUC of the model

is recorded. Subsequent models are built on samples of 85%/15%,

80%/20%, 75%/25%, hereby each time recording their AUC. Once

the AUC starts to stagnate (or drop), the procedure stops and the opti-

mal odds ratio has been found. Although it does depend on the data

characteristics and quality, practical experience has shown that the

ratio 80%/20% is quite commonly used in the industry.
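One step of this procedure can be sketched as a function that undersamples the nonfraudsters in the training set until a target fraud share is reached. Several choices here are assumptions: nonfraudsters are dropped at random (the text also mentions removal based on business experience), rows are represented as hypothetical `(features, label)` pairs with label 1 for fraud, and the function is meant to be applied to the training data only, leaving the test data untouched.

```python
import random


def undersample_nonfraud(train, target_fraud_share, seed=0):
    """Randomly drop nonfraud rows from the training set until fraud makes up
    roughly target_fraud_share of it. Each row is a (features, label) pair."""
    if not 0 < target_fraud_share < 1:
        raise ValueError("target_fraud_share must lie strictly between 0 and 1")
    fraud = [row for row in train if row[1] == 1]
    nonfraud = [row for row in train if row[1] == 0]
    keep = round(len(fraud) * (1 - target_fraud_share) / target_fraud_share)
    keep = min(keep, len(nonfraud))  # never "keep" more rows than exist
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return fraud + rng.sample(nonfraud, keep)


# Hypothetical training set with a 95%/5% nonfraud/fraud distribution.
train = [([i], 1) for i in range(5)] + [([i], 0) for i in range(5, 100)]

for share in (0.10, 0.15, 0.20):
    sample = undersample_nonfraud(train, share, seed=42)
    n_fraud = sum(label for _, label in sample)
    print(f"target {share:.0%}: {len(sample)} rows, {n_fraud} fraud")
```

In the procedure described above, a model would be refitted on each resampled training set and its AUC recorded on untouched validation data, stopping once the AUC stagnates.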

Synthetic Minority Oversampling Technique (SMOTE)

Rather than replicating the minority observations (e.g., fraudsters), Synthetic Minority Oversampling works by creating synthetic observations based on the existing minority observations (Chawla et al. 2002). This is illustrated in Figure 4.50 where the circles represent the majority class and the squares the minority class. For each minority class observation, SMOTE calculates the k nearest neighbors. Let’s assume we consider the crossed square and pick the 5 nearest neighbors represented by the black squares. Depending on the amount of oversampling needed, one or more of the k nearest neighbors are selected to create the synthetic examples.

Figure 4.50 Synthetic Minority Oversampling Technique (SMOTE)

Let’s say our oversampling percentage is set at 200 percent. In this case, two of the five nearest neighbors are selected at random. The next step is then to randomly create two synthetic examples along the line connecting the observation under investigation (crossed square) with the two random nearest neighbors. These two synthetic examples are represented by dashed squares in the figure. As an example, consider an observation with characteristics (e.g., age and income) of 30 and 1,000, and its nearest neighbor with corresponding characteristics 62 and 3,200. We generate a random number between 0 and 1, say 0.75. The synthetic example then has age 30 + 0.75 × (62 − 30) = 54 and income 1,000 + 0.75 × (3,200 − 1,000) = 2,650. SMOTE then combines the synthetic oversampling of the minority class with undersampling the majority class. Note that in their original paper, Chawla et al. (2002) developed an extension of SMOTE to work with categorical variables. Empirical evidence has shown that SMOTE usually works better than either under- or oversampling. Also, for fraud detection it has proven to be very valuable (Van Vlasselaer et al. 2013, 2015).
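The interpolation step of SMOTE can be sketched as follows. This is not a full SMOTE implementation: the nearest neighbors are assumed to have been computed already, only continuous variables are handled, and the function names and the neighbor list are hypothetical. The first call reproduces the age and income example from the text.

```python
import random


def synthetic_example(observation, neighbor, gap):
    """Point on the line from observation to neighbor; gap in [0, 1] sets the position."""
    return [x + gap * (nb - x) for x, nb in zip(observation, neighbor)]


def smote_for_observation(observation, neighbors, n_synthetic, rng):
    """Create n_synthetic examples from randomly chosen nearest neighbors.
    The neighbors are assumed to be precomputed minority-class observations."""
    chosen = rng.sample(neighbors, n_synthetic)
    return [synthetic_example(observation, nb, rng.random()) for nb in chosen]


# Worked example from the text: (age, income) = (30, 1000), neighbor (62, 3200), gap 0.75.
print(synthetic_example([30, 1000], [62, 3200], 0.75))

# 200 percent oversampling: two synthetic examples from five hypothetical neighbors.
neighbors = [[62, 3200], [35, 1200], [28, 900], [41, 1500], [33, 1100]]
new_points = smote_for_observation([30, 1000], neighbors, 2, random.Random(7))
print(new_points)
```

In practice the variables would usually be put on comparable scales before the nearest neighbors are determined, since income would otherwise dominate the distance.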


Likelihood Approach

Another interesting approach to work with a low number of fraud-

sters is the likelihood approach developed by Pluto and Tasche (2005).

Although it was originally developed in a credit-risk modeling setting

to tackle the issue of low-default portfolios, it can be easily transferred

to a fraud-detection setting.

Let’s start with the most extreme example of a skewed data set,

which is a data set with no fraudsters at all. Obviously, none of the

sampling approaches discussed so far will work for this. Assume

now that we have an expert-based fraud detection system that can

discriminate the observations into fraud risk classes A, B, and C using

a set of predeﬁned business rules. Although these three classes allow

analysts to discriminate the observations in terms of their fraud risk,

it would also be handy to accompany each of these classes with fraud

probability estimates. These probabilities can be used to calculate

the expected fraud loss. Remember, Expected fraud loss (EFL) = Probability of fraud (PF) × Loss given fraud (LGF). The EFL can then

be used for provisioning purposes whereby a ﬁrm anticipates future

losses by setting aside provisions.

In a first step, we will try to calculate the probability of fraud (PF) for class A, $PF_A$. A key assumption we will make is that fraud occurs independently. Although this assumption might seem naïve at first sight, it allows us to derive probability estimates in a fairly straightforward way given this complex setting with no data about fraudsters. More specifically, we will first assume that the ranking of the observations across the three fraud risk classes is correct, or in other words: $PF_A \le PF_B \le PF_C$. The most prudent estimate (sometimes also referred to as the most conservative estimate) is then obtained under the temporary assumption that $PF_A = PF_B = PF_C$. Hence, the probability of being fraudulent equals $PF_A$ for every observation. Given that we have $n_A$ observations in class A, $n_B$ observations in class B, $n_C$ observations in class C, and that fraud occurs independently, the likelihood of not observing any fraudster in the total data set equals:

$$(1 - PF_A)^{n_A + n_B + n_C}.$$

We can now specify a confidence region for $PF_A$, which is the region of all values of $PF_A$ such that the probability of not observing any fraudster is higher than $1 - \alpha$, with $\alpha$ the confidence level, or in other words:

$$1 - \alpha \le (1 - PF_A)^{n_A + n_B + n_C},$$

$$PF_A \le 1 - (1 - \alpha)^{1/(n_A + n_B + n_C)}.$$

Assume we have 100 observations in class A, 200 in class B, and 50 in class C. Table 4.9 illustrates the values obtained for $PF_A$ by varying the confidence level from 50 percent to 99.9 percent. As can be observed, $PF_A$ increases as the confidence level increases.

Table 4.9 Values for PF_A for a Data Set with No Fraudsters

| α | 50% | 75% | 90% | 95% | 99% | 99.9% |
| --- | --- | --- | --- | --- | --- | --- |
| PF_A | 0.20% | 0.39% | 0.65% | 0.85% | 1.31% | 1.95% |

We can now continue this same procedure to compute $PF_B$. We have $n_B + n_C$ observations left. The most prudent estimate of $PF_B$ is obtained by again assuming $PF_B = PF_C$. Hence, we have:

$$1 - \alpha \le (1 - PF_B)^{n_B + n_C},$$

$$PF_B \le 1 - (1 - \alpha)^{1/(n_B + n_C)}.$$

For our data set, this gives the values reported in Table 4.10.

Table 4.10 Values for PF_B for a Data Set with No Fraudsters

| α | 50% | 75% | 90% | 95% | 99% | 99.9% |
| --- | --- | --- | --- | --- | --- | --- |
| PF_B | 0.28% | 0.55% | 0.92% | 1.19% | 1.82% | 2.72% |

Finally, we can calculate $PF_C$ as follows:

$$1 - \alpha \le (1 - PF_C)^{n_C},$$

$$PF_C \le 1 - (1 - \alpha)^{1/n_C}.$$

This gives the values reported in Table 4.11.

Table 4.11 Values for PF_C for a Data Set with No Fraudsters

| α | 50% | 75% | 90% | 95% | 99% | 99.9% |
| --- | --- | --- | --- | --- | --- | --- |
| PF_C | 1.38% | 2.73% | 4.50% | 5.81% | 8.80% | 12.90% |
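The closed-form upper bound for a data set without fraudsters is easy to evaluate. The sketch below uses the class sizes from the example (100, 200, and 50 observations); the function name `most_prudent_pf_no_fraud` is hypothetical. The results should agree with Tables 4.9 to 4.11 up to small rounding differences in the last digit.

```python
def most_prudent_pf_no_fraud(n_obs, confidence):
    """Upper bound 1 - (1 - confidence)^(1 / n_obs) on the fraud probability
    when no fraudster is observed among n_obs independent observations."""
    return 1 - (1 - confidence) ** (1 / n_obs)


n_a, n_b, n_c = 100, 200, 50
levels = [0.50, 0.75, 0.90, 0.95, 0.99, 0.999]

# Class A pools all observations, class B pools B and C, class C stands alone.
for name, n_obs in (("PF_A", n_a + n_b + n_c), ("PF_B", n_b + n_c), ("PF_C", n_c)):
    row = ", ".join(f"{100 * most_prudent_pf_no_fraud(n_obs, a):.2f}%" for a in levels)
    print(f"{name}: {row}")
```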

Note that despite having no fraudsters in the data, $PF_C$ at the 99.9 percent confidence level equals 12.90 percent, which is quite high. Also observe that for a given confidence level, $PF_A \le PF_B \le PF_C$ as required at the outset. An obvious question is what confidence level to adopt. Before answering this question, we will extend the procedure by assuming a few fraudsters occur in the data.

Let’s now assume we have one fraudster in class A, 2 in class B, and 4 in class C. We first determine $PF_A$ using again the most prudent estimate principle: $PF_A = PF_B = PF_C$. By using the binomial distribution to calculate the probability of observing no more than seven fraudsters, $PF_A$ can be found as follows:

$$1 - \alpha \le \sum_{i=0}^{7} \binom{n_A + n_B + n_C}{i} PF_A^{\,i} (1 - PF_A)^{n_A + n_B + n_C - i}.$$

Likewise, $PF_B$ and $PF_C$ can be found as follows:

$$1 - \alpha \le \sum_{i=0}^{6} \binom{n_B + n_C}{i} PF_B^{\,i} (1 - PF_B)^{n_B + n_C - i},$$

$$1 - \alpha \le \sum_{i=0}^{4} \binom{n_C}{i} PF_C^{\,i} (1 - PF_C)^{n_C - i}.$$

Table 4.12 displays the values obtained depending on the confidence levels. Again, note that the probabilities increase for increasing confidence levels. Just as in the previous examples, also observe that $PF_A \le PF_B \le PF_C$ as required at the outset.

Table 4.12 Values for PF_A, PF_B, and PF_C for a Data Set with Fraudsters

| α | 50% | 75% | 90% | 95% | 99% | 99.9% |
| --- | --- | --- | --- | --- | --- | --- |
| PF_A | 2.19% | 2.76% | 3.34% | 3.72% | 4.51% | 5.51% |
| PF_B | 2.66% | 3.41% | 4.17% | 4.68% | 5.73% | 7.05% |
| PF_C | 9.28% | 12.26% | 15.35% | 17.38% | 21.50% | 26.56% |
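With fraudsters in the data there is no closed form, but the largest probability satisfying the binomial inequality can be found numerically. The sketch below uses bisection, which is one possible choice and not necessarily how the table was produced; the function names are hypothetical. It relies on the binomial probability of observing at most k fraudsters decreasing as the fraud probability increases.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def most_prudent_pf(n_obs, n_fraud, confidence, tol=1e-10):
    """Largest p with P(at most n_fraud fraudsters among n_obs) >= 1 - confidence."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_cdf(n_fraud, n_obs, mid) >= 1 - confidence:
            lo = mid  # inequality still holds, so p can be larger
        else:
            hi = mid
    return lo


# Example from the text: 100/200/50 observations with 1/2/4 fraudsters in A/B/C.
for name, n_obs, n_fraud in (("PF_A", 350, 7), ("PF_B", 250, 6), ("PF_C", 50, 4)):
    print(f"{name} at 50% confidence: {100 * most_prudent_pf(n_obs, n_fraud, 0.5):.2f}%")
```

With zero fraudsters the function reduces to the closed-form bound of the previous case.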

As already mentioned, a key question to answer when adopting

this approach is the setting of the conﬁdence level. Obviously, this

depends on how conservative the estimates should be. As said, a higher confidence level results in a higher probability estimate. In their original paper, Pluto and Tasche suggest not exceeding 95 percent. In a credit-risk setting, Benjamin et al. (2006) suggested adopting confidence levels between 50 and 75 percent.

Adjusting Posterior Probabilities

The key idea of undersampling, oversampling and SMOTE is to adjust

the class priors to enable the analytical technique to come up with a

meaningful model discriminating the fraudsters from the nonfraud-

sters. By doing so, the class posteriors will also become biased. This is

no problem in case the fraud analyst is only interested in ranking the

observations in terms of their fraud risk. However, if well-calibrated

fraud probabilities are needed (e.g., to accurately calculate expected

fraud losses), then the posterior probabilities need to be adjusted.

One straightforward way to do this is by using the following formula (Saerens et al. 2002):

$$p(C_i \mid x) = \frac{\dfrac{p(C_i)}{p_r(C_i)}\, p_r(C_i \mid x)}{\sum_{j=1}^{2} \dfrac{p(C_j)}{p_r(C_j)}\, p_r(C_j \mid x)},$$

whereby $C_i$ represents class $i$ (e.g., class 1 for the fraudsters and 2 for the nonfraudsters), $p(C_i)$ the prior probability (e.g., $p(C_1) = 1\%$ and $p(C_2) = 99\%$), $p_r(C_i)$ the resampled prior probability due to oversampling, undersampling, or other resampling procedures (e.g., $p_r(C_1) = 20\%$ and $p_r(C_2) = 80\%$), and $p_r(C_i \mid x)$ the posterior probability for observation $x$ as calculated by the analytical technique using the resampled data. Note that the formula can be easily extended to more than two classes.

Table 4.13 shows an example of adjusting the posterior probability, whereby $p(C_1) = 0.01$, $p(C_2) = 0.99$, $p_r(C_1) = 0.20$, and $p_r(C_2) = 0.80$. It can be easily verified that the rank ordering of the customers in terms of their fraud risk remains preserved after the adjustment.

Table 4.13 Adjusting the Posterior Probability

| | Resampled data: P(Fraud) | Resampled data: P(No Fraud) | Re-calibrated to original data: P(Fraud) | Re-calibrated to original data: P(No Fraud) |
| --- | --- | --- | --- | --- |
| Customer 1 | 0.1 | 0.9 | 0.004 | 0.996 |
| Customer 2 | 0.3 | 0.7 | 0.017 | 0.983 |
| Customer 3 | 0.5 | 0.5 | 0.039 | 0.961 |
| Customer 4 | 0.6 | 0.4 | 0.057 | 0.943 |
| Customer 5 | 0.85 | 0.15 | 0.186 | 0.814 |
| Customer 6 | 0.9 | 0.1 | 0.267 | 0.733 |
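The adjustment can be sketched in a few lines and checked against Table 4.13. The function name `adjust_posterior` is hypothetical; it accepts any number of classes, in line with the remark that the formula extends beyond two.

```python
def adjust_posterior(resampled_posteriors, original_priors, resampled_priors):
    """Re-calibrate posteriors from a model trained on resampled data
    back to the original class priors."""
    weighted = [
        prior / prior_r * post
        for prior, prior_r, post in zip(original_priors, resampled_priors, resampled_posteriors)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


original_priors = [0.01, 0.99]   # p(C1), p(C2): fraud, no fraud
resampled_priors = [0.20, 0.80]  # pr(C1), pr(C2)

# P(fraud) on the resampled data for the six customers of Table 4.13.
resampled_fraud_probs = [0.1, 0.3, 0.5, 0.6, 0.85, 0.9]
adjusted = [
    adjust_posterior([p, 1 - p], original_priors, resampled_priors)[0]
    for p in resampled_fraud_probs
]
for i, (before, after) in enumerate(zip(resampled_fraud_probs, adjusted), start=1):
    print(f"Customer {i}: {before} -> {after:.3f}")
```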

Cost-sensitive Learning

Cost-sensitive learning is another alternative to deal with highly

skewed data sets. The idea is to assign higher misclassiﬁcation costs to

the minority class, which is, in our case, the fraudsters. These costs

are then taken into account during classiﬁer estimation or evaluation.

Table 4.14 gives the overview of the costs in a binary classification setting whereby C(i,j) represents the cost of misclassifying an example from class j into class i.

Table 4.14 Misclassification Costs

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | C(+,+) | C(–,+) |
| Actual negative | C(+,–) | C(–,–) |

Note that usually C(+,+) = C(–,–) = 0, and C(–,+) > C(+,–). The costs are typically also determined on an aggregated basis, rather than on an observation-by-observation basis.

A first straightforward way to make a classifier cost-sensitive is by adopting a cost-sensitive cut-off to map the posterior class probabilities to class labels. In other words, an observation $x$ will be assigned to the class that minimizes the expected misclassification cost:

$$\arg\min_{i} \sum_{j \in \{-,+\}} P(j \mid x)\, C(i,j),$$

where $P(j \mid x)$ is the posterior probability of observation $x$ to belong to class $j$. As an example, consider a fraud detection setting whereby class 1 are the fraudsters and class 2 the nonfraudsters. An observation $x$ will be classified as a fraudster (class 1) if

$$P(1 \mid x)\, C(1,1) + P(2 \mid x)\, C(1,2) < P(1 \mid x)\, C(2,1) + P(2 \mid x)\, C(2,2),$$

which, with $C(1,1) = C(2,2) = 0$, simplifies to

$$\begin{aligned}
P(1 \mid x)\, C(2,1) &> P(2 \mid x)\, C(1,2) \\
P(1 \mid x)\, C(2,1) &> (1 - P(1 \mid x))\, C(1,2) \\
P(1 \mid x) &> \frac{C(1,2)}{C(1,2) + C(2,1)} \\
P(1 \mid x) &> \frac{1}{1 + \dfrac{C(2,1)}{C(1,2)}}.
\end{aligned}$$

So, the cut-off only depends on the ratio of the misclassification costs, which may be easier to determine than the individual misclassification costs themselves.
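The cost-sensitive cut-off can be sketched as below, assuming zero cost for correct classifications. Here `cost_fn` stands for C(2,1), the cost of labeling a fraudster as a nonfraudster, and `cost_fp` for C(1,2), the cost of labeling a nonfraudster as a fraudster. The function names and the 9:1 cost ratio are hypothetical.

```python
def fraud_cutoff(cost_fn, cost_fp):
    """Cut-off on P(fraud | x): C(1,2) / (C(1,2) + C(2,1))."""
    if cost_fn <= 0 or cost_fp <= 0:
        raise ValueError("misclassification costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def classify(p_fraud, cost_fn, cost_fp):
    """Assign the class with the lower expected misclassification cost."""
    return "fraud" if p_fraud > fraud_cutoff(cost_fn, cost_fp) else "no fraud"


# Hypothetical costs: missing a fraudster is nine times as costly as a false alarm.
cutoff = fraud_cutoff(cost_fn=9.0, cost_fp=1.0)
print(f"cut-off = {cutoff:.2f}")
print(classify(0.15, cost_fn=9.0, cost_fp=1.0))
print(classify(0.05, cost_fn=9.0, cost_fp=1.0))
```

Only the ratio of the two costs matters: scaling both by the same factor leaves the cut-off unchanged.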

Another approach to cost-sensitive learning works by directly minimizing the misclassification cost during classifier learning. Again assuming there is no cost for correct classifications, the total misclassification cost is then as follows:

$$\text{Total cost} = C(-,+) \times FN + C(+,-) \times FP,$$

whereby FN and FP represent the number of false negatives and positives, respectively. Various cost-sensitive versions of existing classification techniques have been introduced in the literature. Ting (2002) introduced a cost-sensitive version of the C4.5 decision tree algorithm where the splitting and stopping decisions are based on the misclassification cost. Veropoulos et al. (1999) developed a cost-sensitive version of SVMs whereby the misclassification costs are taken into account in the objective function of the SVM. Domingos (1999) introduced

MetaCost, which is a meta-algorithm capable of turning any classiﬁer

into a cost-sensitive classiﬁer by ﬁrst relabeling observations with their

estimated minimal-cost classes and then estimating a new classiﬁer

on the relabeled data set. Fan et al. (1999) developed AdaCost, a

cost-sensitive variant of AdaBoost, which uses the misclassiﬁcation

costs to update the weights in successive boosting runs.

To summarize, cost-sensitive learning approaches are usually

more complex to work with than the sampling approaches discussed

earlier. López et al. (2012) compared sampling with cost-sensitive learning approaches for imbalanced data sets and found that both perform well and comparably. Hence, from a pragmatic

viewpoint, it is recommended to use the sampling approaches in a

fraud-detection setting.

FRAUD PERFORMANCE BENCHMARKS

To conclude this chapter, Table 4.15 provides some references of scien-

tiﬁc papers discussing fraud detection across a diversity of settings. The

type of fraud, size of data set used, class distribution, and performance

are reported. To facilitate the comparison, only papers that report the

area under the ROC curve (AUC) are included. The following conclu-

sions can be drawn:

◾ Credit card, financial statement, and telecommunications fraud case studies report the highest AUC.

◾ Insurance and social security fraud report the lowest AUC.

◾ All case studies, except for financial statement fraud, start from highly skewed data sets.

Table 4.15 Performance Benchmarks for Fraud Detection

| Reference | Type of Fraud | Size of Data Set Used | Class Distribution | Performance |
| --- | --- | --- | --- | --- |
| Ortega, Figueroa et al. (2006) | Medical insurance | 8,819 | 5% fraud | AUC: 74% |
| Šubelj, Furlan et al. (2011) | Automobile insurance fraud | 3,451 | 1.3% fraud | AUC: 71%–92% |
| Bhattacharyya, Jha et al. (2011) | Credit card fraud | 50 million transactions on about 1 million credit cards from a single country | 0.005% fraud | AUC: 90.8%–95.3% |
| Whitrow, Hand et al. (2009) | Credit card fraud | 33,000–36,000 activity records | 0.1% fraud | Gini: 85% (∼ AUC = 92.5%) |
| Van Vlasselaer, Bravo et al. (2015) | Credit card fraud | 3.3 million transactions | <1% fraud | AUC: 98.6% |
| Dongshan and Girolami (2007) | Telecommunications fraud | 809,395 calls from 1,087 accounts | 0.024% fraud | AUC: 99.5% |
| Van Vlasselaer, Meskens et al. (2013) | Social security fraud | 2,000 observations | 1% fraud | AUC: 80–85% |
| Ravisankar, Ravi et al. (2011) | Financial statement fraud | 202 companies | 50% fraud | AUC: 98.09% |

REFERENCES

Allison, P D. (2001). Logistic Regression Using the SAS® System: Theory and

Application. Hoboken, NJ: John Wiley-SAS.

Baesens, B. (2014). Analytics in a Big Data World. Hoboken, NJ: John Wiley

& Sons.

Baesens, B., Martens, D., Setiono, R., & Zurada, J. (2011). White Box Nonlin-

ear Prediction Models, editorial special issue, IEEE Transactions on Neural

Networks 22 (12): 2406–2408.


Baesens, B., Mues, C., Martens, D., & Vanthienen, J. (2009). 50 Years of Data

Mining and OR: Upcoming Trends and Challenges. Journal of the Opera-

tional Research Society 60: 16–23.

Baesens, B., Setiono, R., Mues, C., & Vanthienen, J. (March 2003). Using

Neural Network Rule Extraction and Decision Tables for Credit-Risk

Evaluation. Management Science 49 (3): 312–329.

Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., & Van-

thienen, J. (2003). Benchmarking State of the Art Classiﬁcation Algo-

rithms for Credit Scoring. Journal of the Operational Research Society 54 (6):

627–635.

Baesens, B., Viaene, S., Van den Poel, D., Vanthienen, J., & Dedene, G. (2002).

Bayesian Neural Network Learning for Repeat Purchase Modelling

in Direct Marketing. European Journal of Operational Research 138 (1):

191–211.

Bartlett P. L. (1997). For Valid Generalization, the Size of the Weights Is More

Important than the Size of the Network. In M. C. Mozer, M. I. Jordan,

and T. Petsche (Eds.), Advances in Neural Information Processing Systems 9.

Cambridge, MA: MIT Press, pp. 134–140.

Benjamin, N., Cathcart, A., & Ryan, K. (2006). Low Default Portfolios: A Pro-

posal for Conservative Estimation of Default Probabilities; Discussion

paper, Financial Services Authority.

Bhattacharyya, S., Jha, S. K., Tharakunnel, K. K., & Westland, J. C. (2011).

Data Mining for Credit Card Fraud: A Comparative Study, Decision Support

Systems 50 (3): 602–613.

Bi, J., & Bennett, K. P. (2003). Regression Error Characteristic Curves, Proceed-

ings of the 20th International Conference on Machine Learning (ICML),

pp. 43–50.

Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford: Oxford

University Press.

Breiman, L. (1996). Bagging Predictors. Machine Learning 24 (2): 123–140.

Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J. (1984). Classiﬁcation

and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced

Books & Software.

Breiman, L., (2001). Random Forests. Machine Learning 45 (1): 5–32.

Chawla, N. V., Bowyer, K.W., Hall, L. O., Kegelmeyer, W. P. (2002). SMOTE:

Synthetic Minority Over-sampling Technique. Journal of Artiﬁcial Intelli-

gence Research 16: 321–357.

Craven, M., & Shavlik, J. (1996). Extracting Tree-Structured Representations of Trained Networks. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8. Cambridge, MA: MIT Press, pp. 24–30.

Cristianini, N., & Shawe-Taylor, J. (2000). An Introduction to Support Vector

Machines and Other Kernel-Based Learning Methods. Cambridge: Cambridge University

Press.

Dejaeger, K., Verbeke, W., Martens, D., & Baesens, B. (2012). Data Mining

Techniques for Software Effort Estimation: A Comparative Study, IEEE

Transactions on Software Engineering 38 (2): 375–397.

Domingos, P. (1999). MetaCost: A General Method for Making Classiﬁers Cost-

Sensitive, Proceedings of the Fifth International Conference on Knowledge

Discovery and Data Mining. New York: ACM Press, pp. 155–164.

Dongshan X., & Girolami M. (2007). Employing Latent Dirichlet Allocation

for Fraud Detection in Telecommunications. Pattern Recognition Letters 28:

1727–1734.

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern Classification. New York:

John Wiley & Sons.

Efron B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals

of Statistics 7 (1): 1–26.

Fan, W., Stolfo, S. J., Zhang J., & Chan P. K. (1999). AdaCost: Misclassiﬁ-

cation Cost-sensitive Boosting. Proceedings of the Sixteenth International

Conference on Machine Learning (ICML). Waltham, MA: Morgan Kaufmann,

pp. 97–105.

Fawcett, T. (2003). ROC Graphs: Notes and Practical Considerations for

Researchers, HP Labs Tech Report HPL-2003–4.

Flach P. (2012). Machine Learning: The Art and Science of Algorithms that Make Sense

of Data. Cambridge: Cambridge University Press.

Freund, Y., & Schapire, R. E. (August 1997). A Decision-Theoretic General-

ization of On-line Learning and an Application to Boosting. Journal of

Computer and System Sciences 55 (1):119–139.

Freund, Y., & Schapire, R. E. (September 1999). A Short Introduction to Boost-

ing. Journal of Japanese Society for Artiﬁcial Intelligence 14 (5): 771–780.

Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques, 2nd ed.

Waltham, MA: Morgan Kaufmann.

Hand, D., Till, R. J. (2001). A Simple Generalization of the Area under the

ROC Curve to Multiple Class Classiﬁcation Problems. Machine Learning 45

(2):171–186.

Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining.

Cambridge, MA: MIT Press.

Hanley, J. A., & McNeil, B. J. (1982). The Meaning and Use of Area under the

ROC Curve. Radiology 143: 29–36.

Hartigan, J. A. (1975). Clustering Algorithms. New York: John Wiley & Sons.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). Elements of Statistical Learning:

Data Mining, Inference and Prediction. New York: Springer-Verlag.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer Feedfor-

ward Networks Are Universal Approximators. Neural Networks 2 (5):

359–366.

Huysmans, J., Baesens, B., Van Gestel, T., & Vanthienen, J. (April 2006). Using

Self-Organizing Maps for Credit Scoring. Expert Systems with Applications,

Special Issue on Intelligent Information Systems for Financial Engineering 30 (3):

479–487.

Kass, G. V. (1980). An Exploratory Technique for Investigating Large Quantities

of Categorical Data. Applied Statistics 29 (2): 119–127.

López, V., Fernández, A., Moreno-Torres, J., & Herrera, F. (2012). Analysis of

Preprocessing vs. Cost-Sensitive Learning for Imbalanced Classiﬁcation.

Open Problems on Intrinsic Data Characteristics. Expert Systems with Appli-

cations 39: 6585–6608.

Lu, H., Setiono, R., & Liu, H. (September 1995). NeuroRule: a Connectionist

Approach to Data Mining. In Proceedings of 21st International Conference on

Very Large Data Bases, Zurich, Switzerland, pp. 478–489.

Mangasarian, O. L. (May–June 1965). Linear and Nonlinear Separation of

Patterns by Linear Programming, Operations Research 13: 444–452.

Martens, D., Baesens, B., & Van Gestel, T. (2009). Decompositional Rule

Extraction from Support Vector Machines by Active Learning. IEEE

Transactions on Knowledge and Data Engineering 21 (1): 178–191.

Martens, D., Baesens, B., Van Gestel, T., & Vanthienen, J. (2007). Comprehen-

sible Credit Scoring Models Using Rule Extraction from Support Vector

Machines. European Journal of Operational Research 183: 1466–1476.

Martens, D., Vanthienen, J., Verbeke, W., & Baesens, B. (2011). Performance

of Classiﬁcation Models from a User Perspective. Decision Support Systems,

Special Issue on Recent Advances in Data, Text, and Media Mining & Information

Issues in Supply Chain and in Service System Design 51 (4): 782–793.

Moody, J., & Utans, J. (1994). Architecture Selection Strategies for

Neural Networks: Application to Corporate Bond Rating Prediction.

In Apostolos-Paul Refenes (Ed.), Neural Networks in the Capital Markets.

New York: John Wiley & Sons.

Ortega, P. A., Figueroa, C. J., & Ruz, G. A. (2006). Medical Claim Fraud/Abuse

Detection System based on Data Mining: A Case Study in Chile, Proceedings

of the 2006 International Conference on Data Mining, DMIN.

Pluto, K., & Tasche, D. (2005). Estimating Probabilities of Default for Low

Default Portfolios. In B. Engelmann and R. Rauhmeier (Eds.), The Basel

II Risk Parameters. New York: Springer.

Quinlan, J. R. (1993). C4.5 Programs for Machine Learning. Waltham, MA:

Morgan Kaufmann Publishers.

Ravisankar, P., Ravi, V., Rao, G. R., & Bose, I. (2011). Detection of Financial

Statement Fraud and Feature Selection Using Data Mining Techniques,

Decision Support Systems 50: 491–500.

Saerens M., Latinne P., & Decaestecker C. (2002). Adjusting the Outputs of a

Classiﬁer to New a Priori Probabilities: A Simple Procedure. Neural Compu-

tation 14 (1): 21–41.

Schölkopf B., & Smola A. (2002). Learning with Kernels. Cambridge: MIT Press.

Setiono R. Baesens B., & Mues C. (2009). A Note on Knowledge Discovery

Using Neural Networks and Its Application to Credit Card Screening. Euro-

pean Journal of Operational Research 192 (1): 326–332.

Setiono R., Baesens B. & Mues C. (2011). Rule Extraction from Minimal Neural

Network for Credit Card Screening. International Journal of Neural Systems

21 (4): 265–276.

Šubelj L., Furlan S., & Bajec M. (2011). An Expert System for Detecting Automobile

Insurance Fraud Using Social Network Analysis. Expert Systems with Appli-

cations 38 (1): 1039–1052.

Tan P. N., Steinbach M., & Kumar V. (2006). Introduction to Data Mining. New

York: Pearson.

Ting K. M. (2002). An Instance-Weighted Method to Induce Cost-Sensitive

Trees, IEEE Transactions on Knowledge and Data Engineering 14: 659–665.

Van Gestel, T., & Baesens B. (2009). Credit Risk Management: Basic Concepts: Finan-

cial Risk Components, Rating Analysis, Models, Economic and Regulatory Capital.

Oxford: Oxford University Press.

Van Gestel, T., Baesens B., & Martens D. (2015). Predictive Analytics: Techniques

and Applications in Credit Risk Modeling. Oxford: Oxford University Press,

forthcoming.

Van Gestel, T., Baesens, B., Van Dijcke, P., Suykens, J., Garcia, J., &

Alderweireld, T. (2005). Linear and Nonlinear Credit Scoring by Combining

Logistic Regression and Support Vector Machines. Journal of Credit Risk 1 (4).

Van Gestel, T., Suykens J., Baesens B., Viaene S., Vanthienen J., Dedene G., De

Moor B., & Vandewalle J. (January 2004). Benchmarking Least Squares

Support Vector Machine Classiﬁers. Machine Learning 54 (1): 5–32.

Van Vlasselaer, V., Akoglu L., Eliassi-Rad T., Snoeck M., & Baesens B.

(2015). Guilt-by-Constellation: Fraud Detection by Suspicious Clique

Memberships, Proceedings of the 48th Annual Hawaii International Conference on

System Sciences, HICSS-48, Kauai (Hawaii), January 5–8.

Van Vlasselaer, V., Bravo C., Caelen O., Eliassi-Rad T., Akoglu L., Snoeck M.,

& Baesens B. (2015). APATE: A novel approach for automated credit card

transaction fraud detection using network-based extensions. Decision Sup-

port Systems 75: 38–48.

Van Vlasselaer, V., Meskens J., Van Dromme D., & Baesens B. (2013). Using

Social Network Knowledge for Detecting Spider Constructions in Social

Security Fraud. Proceedings of the 2013 IEEE/ACM International Conference on

Advances in Social Network Analysis and Mining, Niagara Falls.

Vapnik, V. (1995). The Nature of Statistical Learning Theory, New York:

Springer-Verlag.

Veropoulos, K., Campbell, C., Cristianini N. (1999). Controlling the Sensitivity

of Support Vector Machines. Proceedings of the International Joint Conference

on AI, pp. 55–60.

Viaene, S., Derrig, R., Baesens, B., Dedene, G. (2002). A Comparison of

State-of-the-Art Classiﬁcation Techniques for Expert Automobile Insur-

ance Fraud Detection. Journal of Risk and Insurance, Special issue on Fraud

Detection 69 (3): 433–443.

Whitrow, C., Hand, D. J., Juszczak, P., Weston, D., Adams, N.M. (2009). Trans-

action Aggregation as a Strategy for Credit Card Fraud Detection. Data

Mining and Knowledge Discovery 18: 30–55.

Zurada, J.M. (1992). Introduction to Artiﬁcial Neural Systems. Boston: PWS

Publishing.

CHAPTER 5

Social Network Analysis for Fraud Detection

Over the last decade, the use of social media websites in everyday life has boomed. People continue their conversations on online social network sites such as Facebook, Twitter, LinkedIn, Google+, and Instagram, and share their experiences with acquaintances, friends, family, and others. It takes only one click to update the rest of the world on your whereabouts. Plenty of options exist to broadcast your current activities: by picture, video, geo-location, links, or just plain text. You are on top of the world, and everybody is watching. And this is where it becomes interesting.

Users of online social network sites explicitly reveal their relationships with other people. As a consequence, social network sites are an (almost) perfect mapping of the relationships that exist in the real world. We know who you are, what your hobbies and interests are, to whom you are married, how many children you have, your buddies with whom you run every week, your friends at the wine club, and so on. This whole interconnected network of people who somehow know each other is an extremely interesting source of information and knowledge. Marketing managers no longer have to guess who might influence whom to create the appropriate campaign. It is all there, and that is exactly the problem. Social network sites acknowledge the richness of the data sources they have, and are not willing to share them as such and free of cost. Moreover, those data are often kept private and regulated, and well hidden from commercial use. On the other hand, social network sites offer many good built-in facilities to managers and other interested parties to launch and manage their marketing campaigns by exploiting the social network, without publishing the exact network representation.

However, companies often forget that they can reconstruct (a part of) the social network using in-house data. Telecommunication providers, for example, have a massive transactional database in which they record the call behavior of their customers. Under the assumption that good friends call each other more often, we can recreate the network and indicate the tie strength between people based on the frequency and/or duration of calls. Internet infrastructure providers might map the relationships between people using their customers’ IP addresses. IP addresses that frequently communicate are connected by a stronger relationship. In the end, the IP network shows the relational structure between people from another point of view, but to a certain extent as observed in reality. Many more examples can be found in the banking, retail, and online gaming industries.
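As an illustration of how tie strength could be derived from call records, the following minimal sketch aggregates the number and total duration of calls per pair of customers. The record layout, the customer names, and the function name `tie_strengths` are assumptions made for this example and do not describe any provider's actual data model.

```python
from collections import defaultdict

# Hypothetical call detail records: (caller, callee, duration in seconds).
calls = [
    ("alice", "bob", 120),
    ("bob", "alice", 300),
    ("alice", "carol", 45),
    ("alice", "bob", 60),
]


def tie_strengths(records):
    """Aggregate calls per unordered pair into (number of calls, total duration)."""
    strength = defaultdict(lambda: [0, 0])
    for caller, callee, duration in records:
        pair = tuple(sorted((caller, callee)))  # ignore the direction of the call
        strength[pair][0] += 1
        strength[pair][1] += duration
    return {pair: tuple(values) for pair, values in strength.items()}


ties = tie_strengths(calls)
for pair, (frequency, duration) in sorted(ties.items()):
    print(pair, "calls:", frequency, "seconds:", duration)
```

Either the call count or the total duration (or a combination of both) can then serve as the edge weight between two customers.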

Also, the fraud detection domain might beneﬁt from the analysis

of social networks. In this chapter, we underline the social character

of fraud. This means that we assume that the probability of someone

committing fraud depends on the people (s)he is connected to. These

are the so-called guilt-by-associations (Koutra et al. 2011). If we know

that ﬁve friends of Bob are fraudsters, what would we say about Bob?

Is he also likely to be a fraudster? If these friends are Bob’s only friends,

is it more likely that Bob will be inﬂuenced to commit fraud? What if

Bob has 200 other friends, will the inﬂuence of these ﬁve fraudsters be

the same?
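One simple way to make these questions concrete, assuming we measure exposure as the fraction of a person's friends who are fraudsters (an illustrative choice, not the only possible one), is to compare the two situations for Bob:

$$\frac{5}{5} = 1 \qquad \text{versus} \qquad \frac{5}{5 + 200} = \frac{5}{205} \approx 0.024$$

With only five friends, all of Bob's contacts are fraudulent. With 200 additional friends, the same five fraudsters make up less than 3 percent of his neighborhood.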

In this chapter, we will brieﬂy introduce the reader to networks

and their applications in a fraud detection setting. One of the main

questions answered in this chapter is how unstructured network infor-

mation can be translated into useful and meaningful characteristics of

a subject. We will analyze and extract features from the direct neigh-

borhood (i.e., the direct associates of a certain person or subject) as

well as the network as a whole (i.e., collective inferencing). Those

network-based features can serve as an enrichment of traditional data

analysis techniques.

NETWORKS: FORM, COMPONENTS, CHARACTERISTICS,

AND THEIR APPLICATIONS

Networks are everywhere. Making a telephone call requires setting up a connection over a wired network of all possible respondents and sending voice packets between the caller and the callee. The supply of water, gas, and electricity for home usage relies on a complex distribution network that consists of many source, intermediary, and destination points, where sources need to produce enough output to meet the demand of the destination points. Delivery services need to find the optimal route to make sure that all packages are delivered to their final destination as efficiently as possible. Even a simple trip to the store involves the processing of many networks. What is the best route to drive from home to the store given the current traffic? Given a shopping list, how can I walk through the store efficiently such that I collect every product on my list?

One of humans’ talents is exactly the processing of these networks. Subconsciously, people have a very good sense for finding an efficient way through a network. Consider your home-to-work connection. Depending on the time and the day, you might change your route from home to work without explicitly drawing the network and running some optimization algorithm. Reaching other people, even without today’s telecommunication media such as telephone and internet, is often an easy task for people. There is always a friend of a friend who knows the person you are looking for.

The mathematical study of network-related optimization problems was introduced many years ago by Euler (1736), who formulated the problem of the Königsberg bridges. Königsberg (now Kaliningrad, Russia) was a city in Prussia that was divided into four parts by the river Pregel. Seven bridges connected the four land masses of the city (see Figure 5.1a and Figure 5.1b). The problem is as follows: “Does there exist a walking route that crosses all seven bridges exactly once?” A path that traverses all edges (here: bridges) of a network exactly once is an Eulerian path. Euler proved that such a path cannot exist for the Königsberg bridge problem. More specifically, an Eulerian path in a connected network exists only when all nodes (here: land masses) are reached by an even number of edges, except possibly for the source and sink node of the path, which then both have an odd number of edges pointing to them. In Königsberg, all four land masses are reached by an odd number of bridges. Analogously, a Hamiltonian path in the network is a path that visits each node exactly once. For example, the Traveling Salesman Problem (TSP) searches for the shortest Hamiltonian cycle, that is, a Hamiltonian path that returns to its starting node. Given a set of cities, the idea is that a salesman has to visit each city (i.e., node) exactly once to deliver the packages. As this is an NP-hard problem, research mainly focuses on finding good heuristics to solve the TSP.

Figure 5.1a Königsberg Bridges

Figure 5.1b Schematic Representation of the Königsberg Bridges
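The degree condition for an Eulerian path can be checked mechanically. The sketch below encodes the seven bridges as a multigraph and counts the nodes with an odd number of bridges. The node labels A to D are an assumption made for this example (A for the central island, B and C for the two river banks, D for the eastern land mass), and the test only covers the degree condition, so it presumes a connected network.

```python
from collections import Counter

# The seven bridges as edges between four land masses (labels are illustrative).
bridges = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"),
    ("B", "D"),
    ("C", "D"),
]


def odd_degree_nodes(edges):
    """Return the nodes that are reached by an odd number of edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(node for node, d in degree.items() if d % 2 == 1)


def has_eulerian_path(edges):
    """Degree test only: zero or two odd-degree nodes (graph assumed connected)."""
    return len(odd_degree_nodes(edges)) in (0, 2)


print(odd_degree_nodes(bridges))
print(has_eulerian_path(bridges))
```

All four land masses have an odd degree, so the degree test fails, in line with Euler's result.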

Social Networks

Although the networks in the previous examples are built and developed

by humans, they are not social. A key question here is, “What makes

a network social?” In general, we might say that a network is social

whenever the actors are people or groups of people. A connection

between actors is based on any form of social interaction between

them, such as a friendship. As in the real world, social networks are

also able to reﬂect the intensity of a relationship between people.

How well do you know your contacts? The relationship between two

best friends completely differs from the relationship between two

distant acquaintances. Those relationships and their intensity are an

important source of information exchange.

In 1967, the psychologist Stanley Milgram measured how social the whole world is. He conducted a Small World experiment in which he distributed 100 letters to random people all over the world. The task at hand was to return the letter to a specified destination, which was one of Milgram’s friends. Rather than sending the letter back by mail, people could only pass the letter to someone they knew. This person, in turn, had to forward the letter to one of his/her contacts, and so on, until the letter reached its final destination. Milgram showed that, on average, each letter reached its destination within six hops. That is, fewer than six intermediaries are needed to connect two random people in the network. This is the average path length of the network. The result of the experiment is widely known as the six degrees of separation theory. Milgram also found that many letters reached their target destination within three steps. This is the so-called funneling effect. Some people are known by and know many other people, often from highly diverse contact groups (e.g., work, friends, hobby). Those people are sociometric superstars, connecting different parts of the network to each other. Many paths in the network pass through these people, giving them a high betweenness score (see section on Centrality metrics).

While the six degrees of separation theory is based on results in real life, many studies have since shown that an average path length of six is an overestimation for online social networks. Those studies reported an average path length of approximately four hops between any two random people in an online social network (Kwak et al. 2010). Online social networks are thus denser than real-life networks. However, the intensity of the relationships might strongly differ.
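To make the notion of average path length concrete, the following sketch computes it for a small undirected network using breadth-first search. The five-person network and the function names are hypothetical and only serve as an example. Unreachable pairs are simply skipped, which is one of several possible conventions.

```python
from collections import deque

# Hypothetical undirected friendship network as an adjacency list.
friends = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}


def hops_from(graph, source):
    """Breadth-first search: number of hops from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def average_path_length(graph):
    """Mean shortest-path length over all ordered pairs of connected nodes."""
    total = pairs = 0
    for source in graph:
        for target, hops in hops_from(graph, source).items():
            if target != source:
                total += hops
                pairs += 1
    return total / pairs


print(average_path_length(friends))
```

For networks with millions of nodes, running a search from every node is expensive, so in practice the average is often estimated from a sample of source nodes.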

Social networks are an important element in the analysis of fraud. Fraud is often committed through illegal set-ups with many accomplices. When traditional analytical techniques fail to detect fraud due to a lack of evidence, social network analysis might give new insights by investigating how people influence each other. These are the so-called guilt-by-associations, where we assume that fraudulent influences run through the network. For example, insurance companies often have to deal with groups of fraudsters who try to swindle by resubmitting the same claim using different people. Suspicious claims often involve the same claimers, claimees, vehicles, witnesses, and so on. By creating and analyzing an appropriate network, inspectors might gain new insights into the suspiciousness of a claim and can prevent further pursuit of the claim.

In social security fraud, employers try to avoid paying their tax contributions to the government by intentionally going bankrupt. Bankrupt employers are not capable of redeeming their tax debts to the government, and are discharged from their obligations. However, social network analysis can reveal that the employer is refounded using almost the same structure a couple of weeks later. As such, experts can declare the foundation of the new employer unlawful and still recover the outstanding debts.

Opinion fraud occurs when people untruthfully praise or criticize a product in a review. Online reviews in particular lack controls to establish the genuineness of the review. Matching people to their reviews and comparing the reviews with others using a network representation enables review websites to detect illicit reviews.

Identity theft is a special form of social fraud, as introduced in Chapter 1, where an illicit person adopts another person’s profile. Examples of identity theft can be found in telecommunications fraud where fraudsters “share” an account with a legitimate customer. This is depicted in Figure 5.2. As fraudsters cannot resist calling their family, friends, acquaintances, and so on, the network clearly links the account with the fraudster’s previous (or current) account. Once the fraudster takes over a new customer’s account, the contact list of the customer is extended by other contacts who were never called before. The frequent contact list of the fraudster is a strong indicator for fraud.

Figure 5.2 Identity Theft. The Frequent Contact List of a Person is Suddenly Extended

with Other Contacts (Light Gray Nodes). This Might Indicate that a Fraudster (Dark Gray

Node) Took Over that Customer’s Account and “shares” his/her Contacts
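A sudden extension of the contact list, as described for identity theft above, can be quantified with simple set operations. The sketch below is only an illustration: the contact identifiers, the variable names, and the idea of comparing new contacts with the contact list of a known fraudulent account are assumptions for this example.

```python
def new_contact_share(before, after):
    """Share of the current contacts that never appeared in the earlier period."""
    before, after = set(before), set(after)
    if not after:
        return 0.0
    return len(after - before) / len(after)


# Hypothetical frequent contact lists of one account in two periods.
contacts_before = {"n1", "n2", "n3", "n4"}
contacts_after = {"n1", "n2", "n3", "n4", "x1", "x2", "x3", "x4"}

# Hypothetical frequent contact list of an account already flagged as fraudulent.
known_fraudster_contacts = {"x1", "x2", "x3", "y9"}

share = new_contact_share(contacts_before, contacts_after)
overlap = (contacts_after - contacts_before) & known_fraudster_contacts

print(share)
print(sorted(overlap))
```

A high share of new contacts combined with a large overlap with a flagged account would be a reason to inspect the account more closely, not proof of fraud in itself.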

While networks are a powerful visualization tool, they mainly

serve to support the ﬁndings by automated detection techniques.

We will focus on how to extend the detection process by extracting

useful and meaningful features from the network. The network

representation can be used afterward to verify the obtained results.

Network Components

This section will introduce the reader to graph theory, the mathemat-

ical foundation for the analysis and representation of networks.

Complex network analysis (CNA) studies the structure, characteristics, and dynamics of networks that are irregular, complex, and dynamically evolving in time (Boccaletti et al. 2006). Those networks often consist of millions of closely interconnected units. Most real-life networks are complex. CNA uses graph theory to extract useful statistics from the network. Boccaletti et al. (2006) define graph theory as the natural framework for the exact mathematical treatment of complex networks and state that, formally, a complex network is represented as a graph.

A graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ consists of a set $\mathcal{V}$ of vertices or nodes (the points) and a set $\mathcal{E}$ of edges or links (the lines connecting the points). This is illustrated in Figure 5.3. A node $v \in \mathcal{V}$ represents a real-world object such as a person, a computer, or an activity. An edge $e \in \mathcal{E}$ connects two nodes in the network, and

$$e(v_1, v_2), \quad e \in \mathcal{E}, \; v_1, v_2 \in \mathcal{V}.$$

Figure 5.3 Network Representation

An edge represents a relationship between the nodes it connects,

such as a friendship (between people), a physical connection (between

computers), or attendance (of a person to an event).

A graph where the edges impose an order or direction between the nodes in the network is a directed graph. If there is no order in the network, we say that the graph is undirected. This is shown in Figure 5.4. The social network website Twitter can be represented as a directed graph. Users follow other users, without necessarily being followed back. This is expressed by the follower–followee relationships, and is illustrated in Figure 5.5. User 1 follows Users 2, 3, and 5 (follower relationships), and is followed by Users 4 and 5 (followee relationships). There is a mutual relationship between Users 1 and 5.

Figure 5.4 Example of a (Un)Directed Graph

Figure 5.5 Follower–Followee Relationships in a Twitter Network

Figure 5.6 Edge Representation
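The Twitter example can be written down as a directed edge list. The sketch below only contains the edges that involve User 1 as described in the text. Any other edges shown in the figure are not reproduced, and the helper function names are chosen for this illustration.

```python
# Directed edges (source, target): source follows target.
follows = [(1, 2), (1, 3), (1, 5), (4, 1), (5, 1)]


def followed_by(edges, user):
    """Users that the given user follows (outgoing edges)."""
    return sorted(target for source, target in edges if source == user)


def following(edges, user):
    """Users that follow the given user (incoming edges)."""
    return sorted(source for source, target in edges if target == user)


def mutual(edges, user):
    """Users connected to the given user in both directions."""
    return sorted(set(followed_by(edges, user)) & set(following(edges, user)))


print(followed_by(follows, 1))
print(following(follows, 1))
print(mutual(follows, 1))
```

In an undirected graph the two functions would return the same set, because every edge can be traversed in both directions.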

In general, edges connect two nodes to each other. However, some

special variants are sometimes required to accurately map the reality

(see Figure 5.6):

◾Self-edge: A self-edge is a connection between the node and

itself. For example, a person who transfers money from his/her

account to another account s/he owns.

◾Multi-edge: A multi-edge exists when two nodes are con-

nected by more than one edge. For example, in credit card

transaction fraud, a credit card holder is linked to a merchant

by a multi-edge if multiple credit card transactions occurred

between them.

◾Hyper-edge: A hyper-edge is an edge that connects more than

two nodes in the network. For example, three people who went

to the same event.
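The three edge variants above can be recognized or represented with a few lines of code. The sketch below uses a hypothetical edge list with made-up names. Expanding a hyper-edge into pairwise edges is shown as one possible simplification, not as a method prescribed by the text.

```python
from collections import Counter
from itertools import combinations

# Hypothetical edge list; every name below is made up for illustration.
edges = [
    ("ann", "ann"),        # self-edge: Ann moves money between her own accounts
    ("card_1", "shop_a"),  # two transactions between the same card and merchant
    ("card_1", "shop_a"),
    ("card_1", "shop_b"),
]

# Self-edges connect a node with itself.
self_edges = [edge for edge in edges if edge[0] == edge[1]]

# Multi-edges are node pairs that occur more than once in the edge list.
multi_edges = {edge: count for edge, count in Counter(edges).items() if count > 1}

# A hyper-edge links several nodes at once, here three people at one event.
hyper_edge = frozenset({"ann", "bob", "carl"})

# One possible simplification: expand the hyper-edge into pairwise edges.
pairwise = sorted(combinations(sorted(hyper_edge), 2))

print(self_edges)
print(multi_edges)
print(pairwise)
```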

A graph where the edges express the intensity of the relationships is a weighted graph $\mathcal{G}_w = (\mathcal{V}, \mathcal{E})$. The edge weight can be defined in several ways:

◾Binary weight: This is the standard network representation.

Here, the edge weight is either 0 or 1, and reﬂects whether or

not a link exists between two nodes. An extension of the binary

weighted graphs are the signed graphs where the edge weight

is negative (–1), neutral (0), or positive (1). Negative weights

are used to represent animosity, and positive weights are used

to represent friendships. Neutral weights represent an “I don’t

know you” relationship.

◾Numeric weight: A numeric edge weight expresses the afﬁnity

of a person to other persons s/he is connected to. High values

indicate a closer afﬁliation. As people do not assign a weight

to each of their contacts by themselves, many approaches are

proposed to deﬁne an edge weight between nodes. A popular

way is the Common Neighbor approach. That is, the edge weight

equals the total number of common activities or events both

people attended. An activity/event should be interpreted in a

broad sense: the total number of messages sent between them,

common friends, likes on Facebook, and so on.

◾Normalized weight: The normalized weight is a variant of the

numeric weight where all the outgoing edges of a node sum up

to 1. The normalized weight is often used in inﬂuence propaga-

tion over a network.

◾Jaccard weight: The edge weight depends on how “social” both nodes are (Gupte and Eliassi-Rad 2012), and

$$w(v_1, v_2) = \frac{|\Gamma(v_1) \cap \Gamma(v_2)|}{|\Gamma(v_1) \cup \Gamma(v_2)|}$$

with $\Gamma(v_i)$ the set of events node $v_i$ attended. For example, assume that person A attended 10 events and person B attended 5 events. They both went to 3 common events. Then the union contains 10 + 5 - 3 = 12 distinct events and, according to the Jaccard index, their edge weight equals 3/12 = 1/4.
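The Jaccard weight and the normalized weight from the list above can be sketched as follows. The event identifiers, the small weight dictionary, and the function names are assumptions made for this illustration. The first example reproduces the case of persons A and B with 10 and 5 events, of which 3 are shared.

```python
def jaccard_weight(events_a, events_b):
    """Size of the intersection divided by size of the union of two event sets."""
    a, b = set(events_a), set(events_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def normalize_outgoing(weights):
    """Rescale {node: {neighbor: weight}} so each node's outgoing weights sum to 1."""
    normalized = {}
    for node, outgoing in weights.items():
        total = sum(outgoing.values())
        normalized[node] = (
            {neighbor: w / total for neighbor, w in outgoing.items()} if total else {}
        )
    return normalized


events_a = set(range(1, 11))   # person A attended events 1 to 10
events_b = {8, 9, 10, 11, 12}  # person B attended 5 events, 3 of them shared

print(jaccard_weight(events_a, events_b))
print(normalize_outgoing({"a": {"b": 3, "c": 1}}))
```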

Edge weights represent the connectivity within a network, and

are in some way a measure of the sociality between the nodes in the

network. Nodes, on the other hand, use labels to express the local char-

acteristics. Those characteristics are mostly proper to the node and may

include, for example, demographics, preferences, interests, beliefs, and

so on. When analyzing fraud networks, we integrate the fraud label of

the nodes into the network. A node can be fraudulent or legitimate,

depending on the condition of the object it represents. For example,

Figure 5.7 shows a fraud network where the legitimate and fraudu-

lent people are represented by white- and black-colored nodes, respec-

tively. Given this graph, we know that node A and B committed fraud

beforehand. Node C is a friend of node A and is inﬂuenced by the

actions of node A. On the other hand, node D is inﬂuenced by both

node A and B. A simple conclusion would be that node D has the

highest probability of perpetrating fraud, followed by node C.

Figure 5.7 Example of a Fraudulent Network
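The simple conclusion drawn for nodes C and D can be reproduced by counting the fraudulent neighbors of every legitimate node. The sketch below contains only the relationships mentioned in the text (A with C, and D with both A and B). The full structure of the figure is not reproduced, and counting neighbors is a deliberately naive scoring rule used here for illustration.

```python
# Undirected edges mentioned in the text and the nodes known to be fraudulent.
edges = [("A", "C"), ("A", "D"), ("B", "D")]
fraudulent = {"A", "B"}


def fraudulent_neighbor_counts(edges, fraudulent):
    """Count, for every legitimate node, how many of its neighbors are fraudulent."""
    counts = {}
    for u, v in edges:
        for node, other in ((u, v), (v, u)):
            if node not in fraudulent:
                counts[node] = counts.get(node, 0) + (other in fraudulent)
    return counts


counts = fraudulent_neighbor_counts(edges, fraudulent)
ranking = sorted(counts, key=counts.get, reverse=True)

print(counts)
print(ranking)
```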

While real-life networks often contain billions of nodes and mil-

lions of links, sometimes the direct neighborhood of nodes provides

enough information to base decisions on. An ego-centered network or

egonet represents the one-hop neighborhood of the node of interest.

In other words, an egonet consists of a particular node and its immedi-

ate neighbors. The center of the egonet is the ego, and the surround-

ing nodes are the alters. An example of an egonet is illustrated in

Figure 5.8. Such networks are also called the ﬁrst-order neighborhood

of a node. Analogously, the n-order neighborhood of a node encompasses

all the nodes that can be reached within n hops from the node

of interest.

Figure 5.8 An Egonet. The Ego is Surrounded by Six Alters, of Whom Two are Legitimate

(White Nodes) and Four are Fraudulent (Gray Nodes)
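Extracting an egonet or an n-order neighborhood amounts to a breadth-first search that stops after n hops. The sketch below uses a hypothetical network with an ego, six alters, and one node two hops away. The node names and the fraud labels are made up to mirror the caption above (two legitimate and four fraudulent alters).

```python
from collections import deque


def n_hop_neighborhood(graph, ego, n):
    """Return all nodes reachable from ego within n hops (ego itself excluded)."""
    dist = {ego: 0}
    queue = deque([ego])
    while queue:
        node = queue.popleft()
        if dist[node] == n:
            continue  # do not expand beyond n hops
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return set(dist) - {ego}


# Hypothetical undirected network as an adjacency list.
graph = {
    "ego": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "a1": ["ego", "x"],
    "a2": ["ego"],
    "a3": ["ego"],
    "a4": ["ego"],
    "a5": ["ego"],
    "a6": ["ego"],
    "x": ["a1"],
}
fraudulent = {"a3", "a4", "a5", "a6"}

alters = n_hop_neighborhood(graph, "ego", 1)
second_order = n_hop_neighborhood(graph, "ego", 2)
fraud_share = len(alters & fraudulent) / len(alters)

print(sorted(alters))
print(sorted(second_order - alters))
print(fraud_share)
```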

Network Representation

Transactional data sources often contain information about how

entities relate to each other (e.g., call record data, bank transfer data).

An example transactional data source of credit card fraud is given

in Table 5.1. Each line in the transactional data source represents

a money transfer between two actors: a credit card holder and a

merchant. Despite the structured representation of the data, the

relationships between credit card holders and merchants are hard to

capture. Real-life data sources contain billions of transactions, making

it practically impossible to extract correlations and useful insights from

the raw records alone. Network visualization tools offer a powerful solution to make information hidden

in networks easy to interpret and understand. Inspecting the visual

representation of a network can be part of the preprocessing phase

as it familiarizes the user with the data and can often quickly result

in some ﬁrst ﬁndings and insights. In the post-processing phase, the

network is a useful representation to verify the obtained results and

understand the rationale. In general, a network can be represented in

two ways:

◾Graphically

◾Mathematically

Table 5.1 Example of Credit Card Transaction Data

Credit Card   Merchant   Merchant Category   Country   Amount   Date   Accept   Fraud

8202092217124626 207005 056 USA 112.99 2013-11-06